Tesseract full integration - CE3.4b

normando
Champ in-the-making
Hello all.

I wrote a how-to on the Spanish forums for integrating Tesseract OCR into Alfresco. I am reposting it here because it may be useful to somebody.

Sorry for my poor English. I will do my best.

Before beginning the how-to, it is important to note a few things:

1- Tesseract only supports TIFF files.
2- Tesseract 2.x does not work with files ending in the .tiff extension; it only works with the .tif extension. Version 3.x can handle both.
3- Because Alfresco stores temporary files with an extension based on their MIME type (e.g. if you upload a .tif file, Alfresco creates a node for that file with the .tiff extension), we need a wrapper to deal with this issue (it copies the node to a file with the .tif extension).
4- Tesseract appends the .txt extension to the output file, so if you use:

tesseract input_file.tif output_file.txt
you will get a file named output_file.txt.txt

So we do not add the .txt extension to the output name.
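
To illustrate point 4, here is a minimal sketch of how a script can drop the extension from a target path before handing it to tesseract. The helper name strip_ext is made up for this example; the ${1%.*} expansion itself is standard POSIX shell.

```shell
#!/bin/sh
# strip_ext is a hypothetical helper name, not part of tesseract or Alfresco.
# It removes the last ".something" suffix from a path using POSIX
# parameter expansion.
strip_ext() {
    printf '%s\n' "${1%.*}"
}

strip_ext /tmp/output_file.txt   # prints /tmp/output_file
```

Passing the stripped name to tesseract then yields output_file.txt rather than output_file.txt.txt.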

Before anything else, make sure you have installed Tesseract and the dictionaries for the language you want to OCR. Install the English dictionary as well.

The first step is to test Tesseract from the command line:
tesseract input_file.tif output_file -l eng

The -l parameter tells Tesseract to use the English dictionary. You can change it to spa for Spanish, and so on. If the above command produces a text file with the OCR content, we can continue.
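
If you want to script that check, a minimal sketch could look like the following. The helper name ocr_output_ok is made up for this example, and the usage line assumes input_file.tif exists in the current directory.

```shell
#!/bin/sh
# ocr_output_ok is a hypothetical helper name used only in this sketch.
# $1 is the output base name given to tesseract (without .txt).
# It succeeds only if "$1.txt" exists and is not empty.
ocr_output_ok() {
    [ -s "$1.txt" ]
}

# Example usage (assumes input_file.tif is in the current directory):
# tesseract input_file.tif output_file -l eng && ocr_output_ok output_file && echo "OCR works"
```

An empty or missing output file usually means the language data is not installed or the image could not be read, so it is worth fixing that before touching Alfresco.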

Create an extension file named ocrtiff-transform-context.xml in /tomcat/shared/classes/alfresco/extension with the following content:

    <?xml version='1.0' encoding='UTF-8'?>
    <!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

    <beans>
        <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">

            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>

              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value> -->
                                    <value>/opt/alfresco/ocr</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>

              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                    <value>-l</value>
                                    <value>eng</value> -->
                                    <value>/opt/alfresco/ocr</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>

              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype"><value>image/tiff</value></property>
                        <property name="targetMimetype"><value>text/plain</value></property>
                    </bean>
                 </list>
              </property>
        </bean>

        <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.ocr.tiff" />
            </property>
        </bean>
    </beans>

Then create a wrapper file named "ocr" and put it in the Alfresco root directory. In my case I put it in /opt/alfresco.

    #!/bin/bash
    # save arguments to variables
    SOURCE="$1"
    TARGET="$2"
    TMPDIR=/tmp
    FILENAME=$(basename "$SOURCE")
    OCRFILE="$FILENAME.tif"

    # to see what happens
    #echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

    # copy the source to a file with the .tif extension that tesseract 2.x needs
    cp -f "$SOURCE" "$TMPDIR/$OCRFILE"

    # call tesseract; it appends .txt itself, so strip the extension from $TARGET
    tesseract "$TMPDIR/$OCRFILE" "${TARGET%.*}" -l eng
    rm -f "$TMPDIR/$OCRFILE"

Then make it executable (chmod 755 ocr).
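
A quick way to confirm the permission change took effect is sketched below. The helper name wrapper_is_executable is made up for this example, and /opt/alfresco/ocr is just the location used in this how-to.

```shell
#!/bin/sh
# wrapper_is_executable is a hypothetical helper name used only in this sketch.
# It succeeds only if $1 is a regular file that the current user may execute.
wrapper_is_executable() {
    [ -f "$1" ] && [ -x "$1" ]
}

# Example usage (path taken from this how-to, adjust to your install):
# wrapper_is_executable /opt/alfresco/ocr || echo "ocr is missing or not executable"
```

Run the check as the same user that runs Alfresco, since that is the user that will call the wrapper.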

Restart Alfresco and upload a tif file. Then, if you are in Alfresco Explorer, click the "I" (info) icon to see the strings that Tesseract extracted from the tif file. You can also run a search in Alfresco Explorer or in Share (for the latter, you need to upload the file under your site).

Enjoy

PS: If you can improve this how-to, let me know.
15 REPLIES

fatal
Champ in-the-making
Hi,

I think there is a problem with this step:
Made then as executable file (chown 755 ocr)

I have another problem: I created my extension, but OCR doesn't start when I upload my .tif :s
Would it be possible to post a screenshot? Thanks.

Sorry, my English is very bad (I am on version 3.4.b).

kockiren
Champ in-the-making
I had the same problem. My solution was to correct the path in the XML document to: /opt/alfresco-3.4c/ocr

Regards
Rene
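
For anyone hitting the same path problem, here is a minimal sketch that lists the absolute paths configured in the context file so they can be compared with where the wrapper really lives. The helper name list_wrapper_paths is made up for this example, and the line-based sed match is a rough shortcut rather than real XML parsing; it assumes each value element sits on its own line, as in the file above.

```shell
#!/bin/sh
# list_wrapper_paths is a hypothetical helper name used only in this sketch.
# It prints every absolute path found in a <value>...</value> element of the
# given context file, one per line, without duplicates.
list_wrapper_paths() {
    sed -n 's:.*<value>\(/[^<]*\)</value>.*:\1:p' "$1" | sort -u
}

# Example usage: flag any configured path that is missing or not executable.
# list_wrapper_paths ocrtiff-transform-context.xml | while read -r p; do
#     [ -x "$p" ] || echo "not executable: $p"
# done
```

Values such as ${source} or a bare tesseract are skipped because they do not start with a slash.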

nacira_dahmani
Champ in-the-making
Hello everybody, and sorry for my bad English.
I use Windows Vista, not Linux or another OS, with Alfresco 3.3, and I am looking for anyone who has completed a successful integration of Tesseract with Alfresco under Windows. Please help me, because it is very important for me. Any help is appreciated. Thank you very much.

dark_rider
Champ on-the-rise
Thanks for your post… It is very useful for me.

nacira_dahmani
Champ in-the-making
Good morning everybody. I work with Windows Vista and I am looking for a script to integrate Tesseract OCR into Alfresco. If somebody can help me, I would be very grateful. Thank you for any help, and sorry for my poor use of the language.

nicolasraoul
Star Contributor
Thanks a lot Normando!
After following just those instructions, any TIFF that I upload becomes findable with full text search 🙂
Nicolas Raoul

plepot
Champ in-the-making
Hello,

Here is the Windows version of the Alfresco Tesseract integration. The solution was validated under Community 3.4d.

The bean:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/tiff</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.tiff" />
    </property>
  </bean>
</beans>

The batch wrapper, to be saved in the root directory of Alfresco:

REM to see what happens
echo from %1 to %2 >>C:\tmp\ocrtransform.log


REM copy the source file to the temporary directory
copy /Y %1 "C:\TMP\%~n1%~x1"

REM call tesseract; it appends .txt itself, so pass the target (%2) without its extension
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" "C:\TMP\%~n1%~x1" "%~d2%~p2%~n2" -l fra
del "C:\TMP\%~n1%~x1"

You may need to adjust the directories for your system.

togum
Champ in-the-making
Thank you very much, plepot.
It works for me. But does Tesseract have the ability to convert TIF to PDF?

Thanks in advance.

plepot
Champ in-the-making
Hello,

Sorry for the (very) late reply. Actually, no: Tesseract produces a text file attached to the Alfresco document. I think that if you want to convert TIF to PDF you would have to write some code within Alfresco. I do not know how to do it.

Regards,
Philippe