Hyland Connect

miguelrodriguez · ‎10-11-2017

Purpose
Tesseract
Transformation context file
OCR Script
Tesseract properties
Debugging
Alfresco Transformation Service
Tesseract execution
References

Purpose

The purpose of this blog is to show how to scan images containing text so that the text is indexed and searchable by Alfresco. The following file types are supported: PNG, BMP, JPEG, GIF, TIFF and PDF (containing images).

For this exercise we are going to use a Linux OS, but the same approach should also work on Windows.

To scan images we are going to use Tesseract-ocr (tesseract). This package contains an OCR engine (libtesseract) and a command line program (tesseract).

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".

Tesseract

Since we are using tesseract-ocr, we need to install the tesseract software (version 3 or greater) for our Linux distribution.

Please follow the instructions explained here: Installing Tesseract
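Once tesseract is installed, it is worth confirming that the version meets the "3 or greater" requirement. The following is a minimal sketch, not part of the original setup: the function name tesseract_major_version is hypothetical, and it only assumes that the version banner starts with the word "tesseract" followed by a dotted version number.

```shell
# Print the major version found in tesseract version output read from stdin.
# Handles banners such as "tesseract 3.04.01" and
# "Tesseract Open Source OCR Engine v3.04.01 with Leptonica".
tesseract_major_version() {
    sed -n 's/^[Tt]esseract[^0-9]*\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Usage (assumes tesseract is on the PATH; 2>&1 is used in case the
# version banner is written to stderr rather than stdout):
#   major=$(tesseract -v 2>&1 | tesseract_major_version)
#   [ "${major:-0}" -ge 3 ] && echo "tesseract $major is recent enough"
```

Reading from stdin keeps the parsing separate from the call to tesseract, so the function can be tried out on any machine.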

Transformation context file

Create a file named transformer-context.xml in Alfresco's extension folder, i.e. tomcat/shared/classes/alfresco/extension, with the following content:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
     license agreements. See the NOTICE file distributed with this work for additional 
     information regarding copyright ownership. The ASF licenses this file to 
     You under the Apache License, Version 2.0 (the "License"); you may not use 
     this file except in compliance with the License. You may obtain a copy of 
     the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
     by applicable law or agreed to in writing, software distributed under the 
     License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
     OF ANY KIND, either express or implied. See the License for the specific 
     language governing permissions and limitations under the License. -->
<beans>

     <!-- Transforms from TIFF to plain text using Tesseract
           and a custom script -->
     <bean id="transformer.worker.ocr.tiff"
          class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
          <property name="mimetypeService">
               <ref bean="mimetypeService" />
          </property>
          <property name="checkCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${tesseract.exe}</value>
                                        <value>-v</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
                    <property name="errorCodes">
                         <value>2</value>
                    </property>
               </bean>
          </property>

          <property name="transformCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${ocr.script}</value>
                                        <value>${source}</value>
                                        <value>${target}</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
                    <property name="errorCodes">
                         <value>1,2</value>
                    </property>
                    <property name="waitForCompletion">
                         <value>true</value>
                    </property>
               </bean>
          </property>
          <property name="transformerConfig">
               <ref bean="transformerConfig" />
          </property>
     </bean>

     <bean id="transformer.ocr.tiff"
          class="org.alfresco.repo.content.transform.ProxyContentTransformer"
          parent="baseContentTransformer">
          <property name="worker">
               <ref bean="transformer.worker.ocr.tiff" />
          </property>
     </bean>

     <!-- Transforms from PDF to TIFF using Ghostscript -->
     <bean id="transformer.worker.pdf.tiff"
          class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
          <property name="mimetypeService">
               <ref bean="mimetypeService" />
          </property>
          <property name="checkCommand">
               <bean name="transformer.ImageMagick.CheckCommand" class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${ghostscript.exe}</value>
                                        <value>-v</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
               </bean>
          </property>

          <property name="transformCommand">
               <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                         <map>
                              <entry key=".*">
                                   <list>
                                        <value>${ghostscript.exe}</value>
                                        <value>-o</value>
                                        <value>${target}</value>
                                        <value>-sDEVICE=tiff24nc</value>
                                        <value>-r300</value>
                                        <value>${source}</value>
                                   </list>
                              </entry>
                         </map>
                    </property>
                    <property name="errorCodes">
                         <value>1,2</value>
                    </property>
                    <property name="waitForCompletion">
                         <value>true</value>
                    </property>
               </bean>
          </property>
          <property name="transformerConfig">
               <ref bean="transformerConfig" />
          </property>
     </bean>

     <bean id="transformer.pdf.tiff"
          class="org.alfresco.repo.content.transform.ProxyContentTransformer"
          parent="baseContentTransformer">
          <property name="worker">
               <ref bean="transformer.worker.pdf.tiff" />
          </property>
     </bean>

</beans>

We can see we are using a few variables here:

tesseract.exe: the tesseract binary file, normally installed as /usr/bin/tesseract
ocr.script: the script we call to transform images to text, installed in the Alfresco home folder as ocr.sh
ghostscript.exe: the Ghostscript binary file, usually the gs binary
source: the source file handed to the transformer by Alfresco (an image for the OCR transformer, a PDF for the Ghostscript transformer)
target: the resulting file (a text file for the OCR transformer, a TIFF for the Ghostscript transformer)
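Before restarting Alfresco it can help to check that the executables behind these variables can actually be resolved. This is a small sketch under the assumption that the check is run as the same OS user that runs Alfresco; the function name check_tools is hypothetical.

```shell
# Report every command that cannot be resolved.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the values used in this blog:
#   check_tools tesseract gs /opt/alfresco/ocr.sh
```

A bare name such as gs is looked up on the PATH, while a full path such as /opt/alfresco/ocr.sh is only accepted if the file exists and is executable.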

OCR Script

The next step is to create the ocr.sh script. The location of the script is also referenced in the alfresco-global.properties file by the property ocr.script, as shown later in this blog.

Assuming Alfresco is installed in /opt/alfresco, create a file named /opt/alfresco/ocr.sh with the following content:

#!/bin/bash
# save arguments to variables
SOURCE="$1"
TARGET="$2"
TMPDIR=/tmp/tesseract
FILENAME=$(basename "$SOURCE")
OCRFILE="$FILENAME.tif"
# use the OS libraries rather than the ones bundled with Alfresco
export LD_LIBRARY_PATH=/usr/lib

# Create temp directory if it doesn't exist
mkdir -p "$TMPDIR"

# to see what happens
# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

cp -f "$SOURCE" "$TMPDIR/$OCRFILE"

# call tesseract; it appends .txt to the output base, so strip the extension from $TARGET
/usr/bin/tesseract "$TMPDIR/$OCRFILE" "${TARGET%.*}" -l eng
rm -f "$TMPDIR/$OCRFILE"

A couple of points to consider here:

We are using LD_LIBRARY_PATH to point to the OS library path so that tesseract finds the libraries it requires. If we don't do this, tesseract will use the library path defined by Alfresco, which points to the commons/lib folder, and the versions of the libraries there may not be the ones tesseract requires.
We are defining the location of the tesseract binary file as /usr/bin/tesseract. If it is installed in a different location, adjust the path to tesseract accordingly.

Finally make sure the ocr.sh file has executable permission set. You can set it with the following command: chmod 755 /opt/alfresco/ocr.sh
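One detail of the script that is easy to miss is the ${TARGET%.*} expansion in the tesseract call. Tesseract takes an output base name and appends .txt to it itself, so the script strips the extension from the target path Alfresco passes in. The sketch below isolates that expansion; the function name ocr_output_base is hypothetical and only there for illustration.

```shell
# Print the output base that would be handed to tesseract for a target path:
# the shortest suffix starting with a dot (the file extension) is removed.
ocr_output_base() {
    printf '%s\n' "${1%.*}"
}

# Usage:
#   ocr_output_base /opt/alfresco/tomcat/temp/Alfresco/target_123.txt
#   prints /opt/alfresco/tomcat/temp/Alfresco/target_123
```

Because tesseract adds .txt back, the file it writes ends up at exactly the target path Alfresco expects, provided that target ends in .txt.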

Tesseract properties

The next step is to define a set of properties for tesseract in alfresco-global.properties.

# OCR Script
ocr.script=/opt/alfresco/ocr.sh

#GS executable
ghostscript.exe=gs

#Tesseract executable
tesseract.exe=tesseract

# Define a default priority for this transformer
content.transformer.ocr.tiff.priority=10

# List the transformations that are supported
content.transformer.ocr.tiff.extensions.tiff.txt.supported=true
content.transformer.ocr.tiff.extensions.tiff.txt.priority=10
content.transformer.ocr.tiff.extensions.jpg.txt.supported=true
content.transformer.ocr.tiff.extensions.jpg.txt.priority=10
content.transformer.ocr.tiff.extensions.png.txt.supported=true
content.transformer.ocr.tiff.extensions.png.txt.priority=10
content.transformer.ocr.tiff.extensions.gif.txt.supported=true
content.transformer.ocr.tiff.extensions.gif.txt.priority=10

# Define a default priority for this transformer
content.transformer.pdf.tiff.available=true
content.transformer.pdf.tiff.priority=10
# List the transformations that are supported
content.transformer.pdf.tiff.extensions.pdf.tiff.supported=true
content.transformer.pdf.tiff.extensions.pdf.tiff.priority=10

content.transformer.complex.Pdf2OCR.available=true
# Commented to be compatible with Alfresco 5.x
# content.transformer.complex.Pdf2OCR.failover=ocr.pdf
content.transformer.complex.Pdf2OCR.pipeline=pdf.tiff|tiff|ocr.tiff
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.supported=true
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.priority=10

# Disable the OOTB transformers
content.transformer.double.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.complex.PDF.Image.extensions.pdf.tiff.supported=false
content.transformer.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.PdfBox.extensions.pdf.txt.supported=false
content.transformer.TikaAuto.extensions.pdf.txt.supported=false

The main property to consider is ocr.script, which must point to the location of the ocr.sh file; adjust it to match your installation. All other properties can be left as they are.
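To double-check what a given property resolves to, for example that ocr.script points at an existing file, a small helper can read it back from the properties file. This is only a sketch: the function name get_prop is hypothetical, and it handles simple key=value lines only (no line continuations or escapes).

```shell
# Print the value of a key from a Java-style properties file.
# $1 = properties file, $2 = key. If the key is defined more than once,
# the last definition is printed.
get_prop() {
    # Escape the dots in the key so they match literally in the regex
    key_pattern=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -n "s/^[[:space:]]*${key_pattern}[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Usage (the path to alfresco-global.properties depends on the installation):
#   script=$(get_prop /path/to/alfresco-global.properties ocr.script)
#   [ -x "$script" ] || echo "ocr.script is not executable: $script"
```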

Debugging

There are two areas we can debug:

The Alfresco transformation service
Tesseract execution

Alfresco Transformation Service

To debug the transformation service edit the file tomcat/shared/classes/alfresco/extension/custom-log4j.properties and add the following line at the bottom:

log4j.logger.org.alfresco.repo.content.transform=trace

Alfresco needs restarting to pick up this debug entry.
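If this step is scripted, the line should be appended only once, however many times the script runs. A minimal sketch, assuming the file ends with a newline; the function name ensure_line is hypothetical.

```shell
# Append a line to a file unless an identical line is already present.
# $1 = file, $2 = line to add.
ensure_line() {
    # -q quiet, -x match the whole line, -F treat the text as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage:
#   ensure_line tomcat/shared/classes/alfresco/extension/custom-log4j.properties \
#       'log4j.logger.org.alfresco.repo.content.transform=trace'
```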

Tesseract execution

To get some execution information from tesseract edit the file /opt/alfresco/ocr.sh and uncomment the following entry by removing the '#' from the beginning of the line:

# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

Now when an image file with text is loaded in Alfresco we can see similar entries in alfresco.log file showing the ocr.sh script being called.

2017-10-10 15:20:17,182 DEBUG [content.transform.RuntimeExecutableContentTransformerWorker] [http-bio-8443-exec-6] Transformation completed:
source: ContentAccessor[ contentUrl=store:///opt/alfresco/tomcat/temp/Alfresco/ComplextTransformer_intermediate_pdf_9017478201188837562.tiff, mimetype=image/tiff, size=24925880, encoding=UTF-8, locale=en_GB]
target: ContentAccessor[ contentUrl=store://2017/10/10/15/20/d3b4b9aa-ad28-4c8c-ae86-f99938bf4125.bin, mimetype=text/plain, size=1173, encoding=UTF-8, locale=en_GB]
options: {maxSourceSizeKBytes=-1, pageLimit=-1, use=index, timeoutMs=120000, maxPages=-1, contentReaderNodeRef=null, sourceContentProperty=null, readLimitKBytes=-1, contentWriterNodeRef=null, targetContentProperty=null, includeEmbedded=null, readLimitTimeMs=-1}
result:
Execution result:
os: Linux
command: /opt/alfresco/ocr.sh /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
succeeded: true
exit code: 0
out:
err: Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1

2017-10-10 15:20:17,183 TRACE [content.transform.TransformerLog] [http-bio-8443-exec-6] 4.1.2 tiff txt INFO <<TemporaryFile>> 23.7 MB 1,950 ms ocr.tiff<<Runtime>>
2017-10-10 15:20:17,183 TRACE [content.transform.TransformerDebug] [http-bio-8443-exec-6] 4.1.2 Finished in 1,950 ms

We can also take a look at the /tmp/ocrtransform.log file to see what files have been processed.

from /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff to /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
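Since every line in this file has the form "from <source> to <target>", the list of processed source files can be extracted with a short filter. This is a sketch only: the function name ocr_log_sources is hypothetical, and it assumes the paths do not themselves contain the text " to ".

```shell
# Print the source path from each "from <source> to <target>" line on stdin.
ocr_log_sources() {
    sed -n 's/^from \(.*\) to .*$/\1/p'
}

# Usage: count how many files have gone through ocr.sh so far
#   ocr_log_sources < /tmp/ocrtransform.log | wc -l
```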

That's it, you should now be able to search for the text contained in the image files.

References

Most of the information in this blog comes from the GitHub repository https://github.com/bchevallereau/alfresco-tesseract, with some additional adjustments and inclusions.

Indexing images with text in Alfresco with Tesseract-ocr