
eCopy and Kofax?

why1525
Champ in-the-making
Dear all,

I am researching scanning and OCR solutions for a company that currently uses Alfresco.

I have one question: can any scanning and OCR software be integrated with Alfresco easily, or are eCopy and Kofax the only solutions that can do this?

I need an answer urgently because the deadline is near.

Thank you.
25 REPLIES 25

ianschwartz
Champ in-the-making
It doesn't look like you ever got your answer. I'm curious too.

alexander
Champ in-the-making
I integrated the open source OCR engines Tesseract and OCRopus from Google.

The code is not production ready, but feel free to message me if interested.

Alexander

benswitzer
Champ in-the-making
Alexander,

I too have played with Tesseract, but with little success.  Is it possible for you to pass on some tips?

Ben

benswitzer
Champ in-the-making
Hey all.

I finally spent some time on this and have succeeded in building and running OCRopus and Tesseract.  As per the FAQ (http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions), OCRopus uses Tesseract as its character recognition plug-in.

Everything works when running it from the command line on Ubuntu.  I have successfully converted several images of various layouts (with varying results, of course!).

I created a Content Transformer through the following file (ocr-transformers-context.xml):


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
   <bean id="transformer.OCR" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="checkCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
               <map>
                  <entry key=".*">
                     <value>ocrocmd --help</value>
                  </entry>
               </map>
            </property>
            <property name="errorCodes">
               <value>1,2,251</value>
            </property>
         </bean>
      </property>
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
               <map>
                  <entry key="Linux">
                     <value>ocrocmd ${source} > ${target}</value>
                  </entry>
                  <entry key="Windows.*">
                     <value>ocrocmd ${source} > ${target}</value>
                  </entry>
               </map>
            </property>
            <property name="errorCodes">
               <value>1,2,251</value>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey" >
               <constructor-arg><value>image/jpeg</value></constructor-arg>
               <constructor-arg><value>text/plain</value></constructor-arg>
            </bean>
         </list>
      </property>
   </bean>
</beans>

The command ocrocmd ${source} > ${target} works well from the command line (ocrocmd Page1.tif > Page1.html) but fails when run by Alfresco.

This is a snippet of the trace:


11:27:04,312 User:xxxxxxxx DEBUG [util.exec.RuntimeExec] Execution result:
   os:         Linux
   command:    ocrocmd /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformer_source_62969.jpg > /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformer_target_62970.txt
   succeeded:  false
   exit code:  251
   out:        <!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html  xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name='ocr-system' content='OCRopus 0.1.1; Tue Jan 29 14:4
   err:        Ocropus Alpha (sauvola, rast, curved, tesseract)
0.1.1; Tue Jan 29 14:46:16 EST 2008; Linux singer 2.6.22-14-server #1 SMP Tue Dec 18 08:31:40 UTC 2007 i686 GNU/Linux

File is not valid: >
ocrocmd: file format not recognized


It seems the redirection operator (>) is being interpreted as a file to be processed by OCRopus.  Placing single quotes around the placeholders, as in ocrocmd '${source}' > '${target}', works on the command line but not through Alfresco.  The trace from that attempt is as follows:


21:39:58,035 User:administrator DEBUG [util.exec.RuntimeExec] Execution result:
   os:         Linux
   command:    ocrocmd '/opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformer_source_13877.jpg' > '/opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformer_target_13878.txt'
   succeeded:  true
   exit code:  0
   out:
   err:        Ocropus Alpha (sauvola, rast, curved, tesseract)
0.1.1; Tue Jan 29 14:46:16 EST 2008; Linux singer 2.6.22-14-server #1 SMP Tue Dec 18 08:31:40 UTC 2007 i686 GNU/Linux

File is not valid: '/opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTra


I didn't truncate the log; Alfresco (log4j) did.  Help!  What am I missing?  Do I need to escape the redirect?  If so, how?  I tried just putting a \ in front of it.  No go.

Thanks a bunch.

Best,
Ben
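
For anyone hitting the same error: both traces above are consistent with the command being launched directly rather than through a shell, so the > and the single quotes reach ocrocmd as literal arguments (hence "File is not valid: >"). The sketch below is not Alfresco code. It uses plain java.lang.ProcessBuilder on a Unix-like system, with echo standing in for ocrocmd, on the assumption that RuntimeExec starts processes in a similar shell-less way. NoShellDemo and run are names made up for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class NoShellDemo {

    /** Runs argv as a child process (no shell) and returns its combined output, trimmed. */
    static String run(List<String> argv) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(argv).redirectErrorStream(true).start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return out.trim();
    }

    public static void main(String[] args) throws Exception {
        // Direct launch: ">" and "out.txt" are ordinary arguments to echo,
        // so nothing is redirected and no file is created.
        System.out.println(run(List.of("echo", "hello", ">", "out.txt")));

        // Direct launch: the single quotes are part of the argument itself.
        System.out.println(run(List.of("echo", "'hello'")));

        // Through a shell: sh strips the quotes before echo ever sees them.
        System.out.println(run(List.of("sh", "-c", "echo 'hello'")));
    }
}
```

If that assumption holds, escaping the > will not help, because there is no shell to interpret it. Two common ways around this are to point the transformer at a small wrapper script that performs the redirect itself, or to use a transformer that captures the command's standard output.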

alexander
Champ in-the-making
As I was asked a few times to post the code to plug Tesseract in, here it is:


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
  
   <bean id="transformer.Ocr.Png2Html" class="com.onepoint.transform.RuntimeExecutableOutContentTransformer" parent="baseContentTransformer">
      <property name="checkCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key=".*">
                        <value>ocropus</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>1,2</value>
            </property>
         </bean>
      </property>
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux*">
                        <value>ocropus ocr ${source}</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>1,2</value>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey" >
                <constructor-arg><value>image/png</value></constructor-arg>
                <constructor-arg><value>text/html</value></constructor-arg>
            </bean>
         </list>
      </property>
   </bean>


   <bean id="transformer.Ocr.Jpeg2Html" class="com.onepoint.transform.RuntimeExecutableOutContentTransformer" parent="baseContentTransformer">
      <property name="checkCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key=".*">
                        <value>ocropus</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>1,2</value>
            </property>
         </bean>
      </property>
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux*">
                        <value>ocropus ocr ${source}</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>1,2</value>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey" >
                <constructor-arg><value>image/jpeg</value></constructor-arg>
                <constructor-arg><value>text/html</value></constructor-arg>
            </bean>
         </list>
      </property>
   </bean>

   <bean id="transformer.Ocr.Tiff2Txt" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="checkCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key=".*">
                        <value>tesseract</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>1,2</value>
            </property>
         </bean>
      </property>
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux*">
                        <value>tesseract ${source} ${target}</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>1,2</value>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey" >
                <constructor-arg><value>image/tiff</value></constructor-arg>
                <constructor-arg><value>text/plain</value></constructor-arg>
            </bean>
         </list>
      </property>
   </bean>

   <bean id="transformer.complex.Jpeg2Text"
         class="org.alfresco.repo.content.transform.ComplexContentTransformer"
         parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.Ocr.Jpeg2Html" />
            <ref bean="transformer.HtmlParser" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>text/html</value>
         </list>
      </property>
   </bean>

   <bean id="transformer.complex.Png2Text"
         class="org.alfresco.repo.content.transform.ComplexContentTransformer"
         parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.Ocr.Png2Html" />
            <ref bean="transformer.HtmlParser" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>text/html</value>
         </list>
      </property>
   </bean>


</beans>
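
A note on the commandMap keys, since the config above only defines "Linux*" for the transform commands. The keys ".*" and "Windows.*" used in this thread suggest that each key is treated as a regular expression matched against the operating system name. The sketch below assumes exactly that (it is not taken from the Alfresco source), and OsKeyMatch and pick are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OsKeyMatch {

    /** Returns the command for the first key that fully matches osName as a regex, or null. */
    static String pick(Map<String, String> commandMap, String osName) {
        for (Map.Entry<String, String> e : commandMap.entrySet()) {
            if (osName.matches(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> commandMap = new LinkedHashMap<>();
        commandMap.put("Linux*", "ocropus ocr ${source}");

        // As a regex, "Linux*" means "Linu" followed by zero or more "x",
        // which still matches the plain name "Linux".
        System.out.println(pick(commandMap, "Linux"));

        // No key matches a Windows name, so no command is selected.
        System.out.println(pick(commandMap, "Windows XP"));

        // A key such as "Windows.*" would be needed to cover Windows.
        System.out.println("Windows XP".matches("Windows.*"));
    }
}
```

Under that assumption, "Linux*" works on Linux only because the trailing "x*" happens to match, and "Linux.*" would state the intent more clearly.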


I remember making a minor change in the Tesseract C source so that it produces text (not HTML) by default, as I did not need HTML and did not want to spend time figuring out how to pass parameters on the command line.

kayseryu
Champ in-the-making
I'm a newbie and I'm just investigating a little about this Alfresco functionality.
On this matter, is there a free OCR engine for Windows that could work in the same manner as what you posted before?
Also, I noticed that the transformer class used for Tesseract/OCRopus is custom made (com.onepoint.transform…). Could you also post the code for this class?
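
The source of that custom class was not posted in this thread. Judging only by its name and by the fact that its transform command (ocropus ocr ${source}) names no target, it presumably writes the command's standard output to the target file. The sketch below shows that one idea with plain java.lang.ProcessBuilder on a Unix-like system. It is not the com.onepoint class and does not use any Alfresco API; StdoutToFile and runToFile are hypothetical names, and echo stands in for an OCR command that prints its result to stdout.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;

class StdoutToFile {

    /**
     * Runs argv without a shell, writes the child's stdout to target,
     * and returns the exit code.
     */
    static int runToFile(List<String> argv, File target) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(argv);
        pb.redirectOutput(target);                          // stdout goes to the target file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // stderr (banners, errors) stays out of it
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        File target = File.createTempFile("ocr_target_", ".html");
        // "echo" is a stand-in for an OCR command that prints HTML to stdout.
        int exit = runToFile(List.of("echo", "<html>recognized text</html>"), target);
        System.out.println(exit + ": " + Files.readString(target.toPath()).trim());
        target.delete();
    }
}
```

Keeping stderr separate matters here because, as the traces earlier in the thread show, OCRopus prints its version banner to the error stream, and that text should not end up in the converted document.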

muhammad_qasim
Champ in-the-making
I want the text in images to be searchable through the search option.
I have already installed OCRopus and Tesseract. I was searching through the posts until I found this one. I have a few questions.
In which folder on Linux should I put ocr-transformers-context.xml, and what else do I have to do to enable OCR search capabilities in Alfresco? The system is Ubuntu 8.x. Alfresco 2.1 is working fine.

krisapong
Champ in-the-making
Hi,


I found the sites http://www.alfresco-plugin.com or http://www.qdoclive.com, which offer scanning modules with OCR for Windows. They have some useful features, such as:

- Zonal OCR
- OCR/barcode to field
- OCR/barcode to file name/folder name
- Template and rules support
- Saving to FTP, WebDAV, and file shares

They are also compatible with Alfresco 2.1 and 2.9.

I think these plugin modules are very helpful for a scanning and archiving solution if you are planning to use Alfresco.


Krisapong S.

muhammad_qasim
Champ in-the-making
My logic is: if you are using open source software, then its solution should also be open source (I mean free).