Tesseract full integration - CE3.4b

normando
Champ in-the-making
Hello all.

I wrote a how-to in the Spanish forums on integrating Tesseract OCR into Alfresco. I am reposting it here because it may be useful to somebody.

Sorry for my poor English. I will do my best.

Before beginning the how-to, it is important to note a few things:

1- Tesseract only supports tif files.
2- Tesseract 2.x does not work with files ending in the .tiff extension; it only works with the .tif extension. Version 3.x can.
3- Because Alfresco stores temporary files with an extension derived from their MIME type (e.g. if you upload a .tif file, Alfresco creates a node for that file with the .tiff extension), we need a wrapper to deal with this issue (it copies the node to a file with the .tif extension).
4- Tesseract appends the .txt extension to the output file, so if you use:

tesseract input_file.tif output_file.txt
you will get a file output_file.txt.txt

So we will not add the .txt extension.
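
The extension handling from points 3 and 4 can be sketched with plain bash parameter expansion. This is only a minimal sketch and not part of the original how-to: the function names `tif_copy_name` and `output_base` and the sample paths are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers; the names and sample paths are illustrative only.

# Name of the temporary copy that ends in .tif (point 3).
tif_copy_name() {
    local source=$1
    printf '%s.tif\n' "$(basename "$source")"
}

# Output base for tesseract: the target path without its last extension,
# because tesseract appends .txt on its own (point 4).
output_base() {
    local target=$1
    printf '%s\n' "${target%.*}"
}

tif_copy_name /some/dir/scan_source_123.tiff   # prints scan_source_123.tiff.tif
output_base /some/dir/scan_target_456.txt      # prints /some/dir/scan_target_456
```

The wrapper script further down does the same two things inline with `basename` and `${TARGET%\.*}`.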

Before anything else, make sure tesseract is installed, along with the dictionaries for the language you want to OCR. Install the English one as well.

The first step is to test tesseract from the command line:
tesseract input_file.tif output_file -l eng

The -l parameter tells tesseract to use the English dictionary. You can change it to spa for Spanish, and so on. If the command above produces a text file with the OCR content, we can continue.
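
If you want to repeat this manual test a few times, it can be wrapped in a small check. This is a minimal sketch, assuming bash; the function names `have_cmd`, `ocr_output_ok` and `ocr_smoke_test` are hypothetical and are not part of tesseract or Alfresco.

```shell
#!/bin/bash
# Hypothetical helpers for the manual test; not part of tesseract or Alfresco.

# Succeeds if the given command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if a non-empty text file exists for the given output base.
ocr_output_ok() {
    [ -s "$1.txt" ]
}

# Runs the same tesseract command as above and reports the result.
# Arguments: input file, output base (without .txt), language (default eng).
ocr_smoke_test() {
    local input=$1 base=$2 lang=${3:-eng}
    have_cmd tesseract || { echo "tesseract not found on PATH" >&2; return 1; }
    tesseract "$input" "$base" -l "$lang" || return 1
    ocr_output_ok "$base" || { echo "no text in $base.txt" >&2; return 1; }
    echo "OCR text written to $base.txt"
}
```

For example, `ocr_smoke_test input_file.tif output_file spa` would run the test with the Spanish dictionary.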

Create an extension file named ocrtiff-transform-context.xml in /tomcat/shared/classes/alfresco/extension with the following content:

    <?xml version='1.0' encoding='UTF-8'?>
    <!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

    <beans>
        <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">

            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>

              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value> -->
                                    <value>/opt/alfresco/ocr</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>

              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
    <!--                            <value>tesseract</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                    <value>-l</value>
                                    <value>eng</value> -->
                                    <value>/opt/alfresco/ocr</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>

              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype"><value>image/tiff</value></property>
                        <property name="targetMimetype"><value>text/plain</value></property>
                    </bean>
                 </list>
              </property>
        </bean>

        <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.ocr.tiff" />
            </property>
        </bean>
    </beans>

Then create a wrapper file named "ocr" and put it in the Alfresco root directory. In my case I put it in /opt/alfresco.

    #!/bin/bash
    # save arguments to variables
    SOURCE=$1
    TARGET=$2
    TMPDIR=/tmp
    FILENAME=$(basename "$SOURCE")
    # tesseract 2.x needs a .tif extension, so work on a renamed copy
    OCRFILE=$FILENAME.tif

    # uncomment to log what happens
    #echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

    cp -f "$SOURCE" "$TMPDIR/$OCRFILE"

    # call tesseract; it appends .txt itself, so strip the extension from $TARGET
    tesseract "$TMPDIR/$OCRFILE" "${TARGET%.*}" -l eng
    rm -f "$TMPDIR/$OCRFILE"

Then make it executable (chmod 755 ocr).

Restart Alfresco and upload a tif file. If you are in Alfresco Explorer, click on the "I" (info) icon to see the strings that tesseract extracted from the tif file. You can also run a search in Alfresco Explorer or in Share (in Share, you need to upload the file under your site).

Enjoy

PS: If you can improve this how-to, let me know.

mlauer
Champ in-the-making
You can define a transformer from TIFF to PDF. The original TIFF is shown in the PDF with a hidden text layer: searchable, selectable, and indexed by Alfresco.
You can do that in a shell script using optimize2bw, tesseract, hocr2pdf and pdftk.

Best Regards
ml

Thank you very much plepot.
It works for me. But does tesseract have the ability to convert TIF to PDF?

Thanks in advance.

togum
Champ in-the-making
Thank you very much mlauer

Sorry for the long delay in replying.
I'll try the shell script you mentioned.




mlauer
Champ in-the-making
Hi

Here is the shell script (Ubuntu 10.04):


#!/bin/bash
#############################################################
# tiff_ocr2pdf.sh
# Convert a TIFF file into a searchable PDF
#############################################################
# 31.10.2011 ml - created
#############################################################
SOURCE=$1
TARGET=$2
TEMP=`mktemp -t tiffocrXXXXXXXX`
TEMP="${TEMP}_"

# split the multi-page TIFF into one file per page
tiffsplit "$SOURCE" "${TEMP}"

for TIFF in ${TEMP}*
do
# segmentation fault with --denoise!
   optimize2bw --dpi 300 -i ${TIFF} -o ${TIFF}opt.tif
   tesseract ${TIFF}opt.tif ${TIFF}tmp hocr
   hocr2pdf -s -i ${TIFF} -o ${TIFF}.pdf < ${TIFF}tmp.html
done
# merge the per-page PDFs
pdftk ${TEMP}*.pdf output "$TARGET"
#############################################################
# clean up
rm ${TEMP}*


Tesseract 3.x is needed!
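
One possible refinement of the cleanup step, as a sketch only: in the script above, `rm ${TEMP}*` removes the files whose names start with the `_`-suffixed prefix, but the empty placeholder file that `mktemp` itself created stays behind. Keeping every intermediate file in a private directory made with `mktemp -d` lets a single `rm -rf` remove everything. The helper name `with_workdir` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Runs the given command with a fresh scratch directory appended as its
# last argument, then removes that directory and returns the command's status.
with_workdir() {
    local workdir status
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/tiffocr.XXXXXXXX") || return 1
    "$@" "$workdir"
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

The per-page loop would then live in a function that takes the source, the target and the scratch directory, and would be called as `with_workdir convert_pages "$SOURCE" "$TARGET"`, where `convert_pages` is again a made-up name.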

wmay
Champ in-the-making
Hi,

We have implemented an OCR server integrated with Alfresco, which can be used as a transformer or via JavaScript and Java. It runs on a separate OCR server and supports Abbyy and Google OCR. For more information see here - https://forums.alfresco.com/en/viewtopic.php?f=33&t=44739

stephane77
Champ in-the-making
Hello,

I just installed Alfresco on a Windows Server 2008 R2 machine and I saw that you have integrated OCR into Alfresco.

I do not know how you integrated the scripts into Alfresco; I am not an engineer.

Can you help me?

Thank you for your help

Happy New Year

kylee
Champ in-the-making
Hi,

I have tried the configuration suggested by Normando on Alfresco CE 4.2f on Ubuntu 14.04 and it doesn't seem to work. The target file (txt) has size 0.

I have done some basic troubleshooting and I think it boils down to this line:


# call tesseract and redirect output to $TARGET
/usr/bin/tesseract $TMPDIR/$OCRFILE ${TARGET%\.*} -l eng


I uploaded a one-page tif file into Share and then echoed the following:
1) echo "$TMPDIR/$OCRFILE" >>/tmp/ocrtransform.log and it gives


/tmp/RuntimeExecutableContentTransformerWorker_source_2064892405511152431.tiff.tif


2) echo "$TARGET" >>/tmp/ocrtransform.log


/opt/alfresco-4.2.f/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_3161704112042319622.txt


It works fine when I run it from the command line. Any suggestions, please?