Choice of OCR

dranakan
Champ on-the-rise
Hello,

I am evaluating several OCR products to integrate with Alfresco. The goal is to extract some fields from a paper document (an invoice, for example). The OCR should generate a PDF and a second file containing the extracted values (name=bob, numberInvoice=23423, …). The software I plan to test:
- Kofax
- eCopy
- Iris capture
- Adobe Capture

I am looking for the cheapest option. Is there another OCR product that you use in the same context?

At the moment I am working with Adobe Capture, but I am not able to export the data to a separate file with the values (name=bob, numberInvoice=23423, …). Can someone explain how to do this?
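
If the OCR tool can at least export the recognized text as a plain text file, the key/value file can be produced in a separate step. Below is a minimal sketch in shell, not an Adobe Capture feature: the function name `extract_fields` and the labels `Name:` and `Invoice No:` are assumptions about how the invoice is laid out, and real invoices will need their own patterns.

```shell
#!/bin/bash
# Hypothetical sketch: read OCR output (plain text) and print key=value lines.
# The labels "Name:" and "Invoice No:" are assumed, not taken from any real invoice.

extract_fields() {
    # $1: path to a plain-text file produced by an OCR engine
    local text_file="$1"
    local name number

    # Text after the first "Name:" label, with surrounding whitespace trimmed.
    name="$(sed -n 's/^[[:space:]]*Name:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' "$text_file" | head -n 1)"

    # Digits after the first "Invoice No:" label.
    number="$(sed -n 's/^[[:space:]]*Invoice No:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$text_file" | head -n 1)"

    printf 'name=%s\n' "$name"
    printf 'numberInvoice=%s\n' "$number"
}
```

For example, `extract_fields invoice.txt > invoice.properties` would write the two lines to a file next to the PDF; both file names are placeholders.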
12 REPLIES 12

jhonabraham
Champ in-the-making
OCR or e-Invoicing—Making the Right Choice for your Organization
By Thayer Stewart, Special Contributor – Supply Chain Management Review, 8/7/2009 8:19:00 AM
The global supply chain demands buyers and suppliers invest in technologies engineered to provide uninterrupted delivery of goods and services. One of the largest impediments to efficient and profitable time to market is the delay associated with invoice receipt and payment. Invoice capture solutions have emerged as vital components of the overall procurement to pay (P2P) process, though not all are created equal.

When selecting an invoice data capture solution with the purpose of streamlining and optimizing the P2P and accounts payable processes, organizations often consider a number of possibilities, including optical character recognition (OCR) and e-Invoicing. Although OCR does offer some benefit to accounts payable departments, e-Invoicing stands as the clear winner when comparing accuracy, cost-effectiveness and overall return on investment.

When making such an important decision, organizations should ask several questions to help determine the solution that best meets their needs. What are the core differences between the technologies? How accurate are the solutions? When will I achieve ROI? Will they decrease paper consumption? How will my suppliers react? Are there any other viable alternatives?

OCR, or optical character recognition, is a technology that’s been around for decades. The basic premise of OCR is that information on paper documents can be extracted and automatically entered into an organization’s A/P workflow or ERP system, eliminating the need for data entry staff. OCR has been successfully applied to many functions that involve standard forms, such as medical claims and mortgage applications. However, it has had limited success with non-standard, variable documents such as invoices. Data errors are common, and exception handling is a significant issue that requires ongoing manual intervention.

E-Invoicing is the electronic transfer of invoice data from the supplier to the buyer usually through a third-party network that facilitates and streamlines the exchange process. Invoice information is taken directly from a supplier’s billing system, validated and enriched via the network platform and then imported directly into their customer’s ERP system. No paper is involved and the manual intervention associated with exception handling in the OCR process is eliminated with e-Invoicing.


thomas_x
Champ on-the-rise
We have just created a transformer from TIFF to searchable PDF, and we want to create a second transformer from PDF to searchable PDF. The second transformer does not work!

Here is the TIFF-to-PDF transformer (working):


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
    <beans>
        <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">

            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>

              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value> -->
                                    <value>/opt/alfresco-4.0.d/ocr</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>

              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                    <value>-l</value>
                                    <value>deu</value> -->
                                    <value>/opt/alfresco-4.0.d/ocr</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>

              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype"><value>image/tiff</value></property>
                        <property name="targetMimetype"><value>text/plain</value></property>
                    </bean>
                 </list>
              </property>
        </bean>

        <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.ocr.tiff" />
            </property>
        </bean>
</beans>


Here is our PDF-to-PDF transformer (not working):


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
    <beans>
        <bean id="transformer.worker.ocr.pdf" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">

            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>

              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value> -->
                                    <value>/opt/alfresco-4.0.d/ocrPDF</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>

              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                    <value>-l</value>
                                    <value>deu</value> -->
                                    <value>/opt/alfresco-4.0.d/ocrPDF</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>

              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype"><value>application/pdf</value></property>
                        <property name="targetMimetype"><value>text/plain</value></property>
                    </bean>
                 </list>
              </property>
        </bean>

        <bean id="transformer.ocr.pdf" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.ocr.pdf" />
            </property>
        </bean>
</beans>

And here is the ocrPDF script:


#!/bin/bash
# Run OCR on a multi-page PDF file and create a new pdf with the
# extracted text in hidden layer. Requires cuneiform, hocr2pdf, gs.
# Usage: ./dwim.sh input.pdf output.pdf

set -e

input="$1"
output="$2"
echo "$(date)" >>/tmp/ocrtransform.log
echo "ocrPDFfrom $input to $output" >>/tmp/ocrtransform.log
tmpdir="$(mktemp -d)"

# extract images of the pages (note: resolution hard-coded)
gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

# OCR each page individually and convert into PDF
for page in "$tmpdir"/page-*.tiff
do
    base="${page%.tiff}"
#    cuneiform -f hocr -o "$base.html" "$page"
    tesseract "$page" "$base" -l deu hocr
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    echo "hocr2pdf $page to $base" >>/tmp/ocrtransform.log
done

# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf

rm -rf -- "$tmpdir"


We have tested the ocrPDF script on its own and it works.
But when we upload a file, the ocrPDF script is not executed!

Does anybody know what the problem is in this transformer definition?

thanks
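
One thing that may be worth checking (a guess, not a confirmed diagnosis): the checkCommand above calls the script with no arguments, while the script assumes it always receives an input path and an output path. A minimal sketch of an argument guard that could sit at the top of ocrPDF follows; the function name `ocr_mode` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical guard: classify how the script was invoked by its argument count.

ocr_mode() {
    case "$#" in
        0) echo "check" ;;  # invoked like the checkCommand: no arguments
        2) echo "run" ;;    # invoked like the transformCommand: source and target
        *) echo "usage" ;;  # anything else is a calling error
    esac
}

# Possible wiring at the top of ocrPDF (commented out so this sketch has no side effects):
# case "$(ocr_mode "$@")" in
#     check) exit 0 ;;
#     usage) echo "Usage: $0 input.pdf output.pdf" >&2; exit 1 ;;
# esac
```

With this in place the no-argument call ends with exit code 0, which is not among the error codes listed for the checkCommand in the configuration above.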

wmay
Champ in-the-making
Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or via JavaScript and Java. It runs on a separate OCR server and supports Abbyy and Google OCR. For more information, see here: https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739