
- Purpose
- Tesseract
- Transformation context file
- OCR Script
- Tesseract properties
- Debugging
- Alfresco Transformation Service
- Tesseract execution
- References
Purpose
The purpose of this blog is to show how to scan images containing text so that the text is indexed and searchable by Alfresco. The following file types are supported: PNG, BMP, JPEG, GIF, TIFF and PDF (containing images).
For this exercise we are going to use Linux, but the solution should work equally well on Windows.
To scan images we are going to use Tesseract-ocr (tesseract). This package contains an OCR engine, libtesseract, and a command-line program, tesseract.
Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box".
Tesseract
Since we are using tesseract-ocr, we need to install Tesseract (version 3 or greater) for our Linux distribution.
Please follow the instructions explained here: Installing Tesseract
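Once the package is installed, it is worth confirming that the binary is visible before touching Alfresco. The following is a minimal sketch, assuming tesseract is on the PATH; the function name check_tesseract is made up for this example, and --list-langs may not be available on very old builds.

```shell
#!/bin/sh
# Report whether a usable tesseract binary is on the PATH.
# check_tesseract is a hypothetical helper name, not part of Tesseract or Alfresco.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH"
        return 1
    fi
    # Some releases print the version banner on stderr, so merge both streams.
    tesseract -v 2>&1 | head -n 1
    # List the installed language packs (ocr.sh later in this blog uses eng).
    tesseract --list-langs 2>&1
}

check_tesseract || echo "Install tesseract before continuing."
```

The -v flag is the same one the checkCommand bean in the next section uses to decide whether the transformer is available.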
Transformation context file
Create a file named transformer-context.xml in Alfresco's extension folder, i.e. tomcat/shared/classes/alfresco/extension, with the following content:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements. See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-->
<beans>

    <!-- Transforms from TIFF to plain text using Tesseract and a custom script -->
    <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
        <property name="mimetypeService">
            <ref bean="mimetypeService" />
        </property>
        <property name="checkCommand">
            <bean class="org.alfresco.util.exec.RuntimeExec">
                <property name="commandsAndArguments">
                    <map>
                        <entry key=".*">
                            <list>
                                <value>${tesseract.exe}</value>
                                <value>-v</value>
                            </list>
                        </entry>
                    </map>
                </property>
                <property name="errorCodes">
                    <value>2</value>
                </property>
            </bean>
        </property>
        <property name="transformCommand">
            <bean class="org.alfresco.util.exec.RuntimeExec">
                <property name="commandsAndArguments">
                    <map>
                        <entry key=".*">
                            <list>
                                <value>${ocr.script}</value>
                                <value>${source}</value>
                                <value>${target}</value>
                            </list>
                        </entry>
                    </map>
                </property>
                <property name="errorCodes">
                    <value>1,2</value>
                </property>
                <property name="waitForCompletion">
                    <value>true</value>
                </property>
            </bean>
        </property>
        <property name="transformerConfig">
            <ref bean="transformerConfig" />
        </property>
    </bean>

    <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
        <property name="worker">
            <ref bean="transformer.worker.ocr.tiff" />
        </property>
    </bean>

    <!-- Transforms from PDF to TIFF using Ghostscript -->
    <bean id="transformer.worker.pdf.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
        <property name="mimetypeService">
            <ref bean="mimetypeService" />
        </property>
        <property name="checkCommand">
            <bean name="transformer.ImageMagick.CheckCommand" class="org.alfresco.util.exec.RuntimeExec">
                <property name="commandsAndArguments">
                    <map>
                        <entry key=".*">
                            <list>
                                <value>${ghostscript.exe}</value>
                                <value>-v</value>
                            </list>
                        </entry>
                    </map>
                </property>
            </bean>
        </property>
        <property name="transformCommand">
            <bean class="org.alfresco.util.exec.RuntimeExec">
                <property name="commandsAndArguments">
                    <map>
                        <entry key=".*">
                            <list>
                                <value>${ghostscript.exe}</value>
                                <value>-o</value>
                                <value>${target}</value>
                                <value>-sDEVICE=tiff24nc</value>
                                <value>-r300</value>
                                <value>${source}</value>
                            </list>
                        </entry>
                    </map>
                </property>
                <property name="errorCodes">
                    <value>1,2</value>
                </property>
                <property name="waitForCompletion">
                    <value>true</value>
                </property>
            </bean>
        </property>
        <property name="transformerConfig">
            <ref bean="transformerConfig" />
        </property>
    </bean>

    <bean id="transformer.pdf.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
        <property name="worker">
            <ref bean="transformer.worker.pdf.tiff" />
        </property>
    </bean>

</beans>
We can see we are using a few variables here:
- tesseract.exe: this is the tesseract binary file, normally installed as /usr/bin/tesseract
- ocr.script: this is the script we are calling to transform images to text, installed in the Alfresco home folder as ocr.sh
- ghostscript.exe: this is the Ghostscript binary file, usually the gs binary
- source: this is the source image file
- target: this is the resulting text file
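To see what the PDF to TIFF step does outside Alfresco, the Ghostscript call can be reproduced by hand. This is only a sketch: the flags are the ones listed in the transformCommand bean above, while the helper name gs_command and the file names scan.pdf and scan.tiff are made up for illustration.

```shell
#!/bin/sh
# Build the Ghostscript command line that mirrors the pdf.tiff transformCommand bean.
# $1 = source PDF, $2 = target TIFF
# gs_command is a hypothetical helper name used only in this example.
gs_command() {
    printf 'gs -o %s -sDEVICE=tiff24nc -r300 %s\n' "$2" "$1"
}

# Show the command for two hypothetical file names.
gs_command scan.pdf scan.tiff

# Run it for real only when Ghostscript is installed and the sample file exists.
if command -v gs >/dev/null 2>&1 && [ -f scan.pdf ]; then
    gs -o scan.tiff -sDEVICE=tiff24nc -r300 scan.pdf
fi
```

Here -r300 renders the pages at 300 dpi and tiff24nc produces a 24-bit colour TIFF, which is the file later handed to ocr.sh.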
OCR Script
The next step is to create the ocr.sh script. The location of the script is also referenced in the alfresco-global.properties file by the property ocr.script, as shown later in this blog.
Assuming Alfresco is installed in /opt/alfresco, create a file named /opt/alfresco/ocr.sh with the following content:
# save arguments to variables
SOURCE=$1
TARGET=$2
TMPDIR=/tmp/tesseract
FILENAME=`basename $SOURCE`
OCRFILE=$FILENAME.tif
LD_LIBRARY_PATH=/usr/lib

# Create temp directory if it doesn't exist
mkdir -p $TMPDIR

# to see what happens
# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log

cp -f $SOURCE $TMPDIR/$OCRFILE

# call tesseract and redirect output to $TARGET
/usr/bin/tesseract $TMPDIR/$OCRFILE ${TARGET%\.*} -l eng

rm -f $TMPDIR/$OCRFILE
A couple of points to consider here:
- We are using LD_LIBRARY_PATH to point to the OS library path to find the libraries required by tesseract. If we don't do this it will be using the library path defined by Alfresco pointing to commons/lib folder, but the version of the libraries may not be the ones required by tesseract.
- We are defining the location of the tesseract binary file as /usr/bin/tesseract. If installed on a different location then adjust the path to tesseract accordingly.
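One detail of the script that is easy to miss is the ${TARGET%\.*} expansion. Tesseract appends .txt to the output base name it is given, so the script strips the extension from the target path first. A small sketch of that behaviour, using a made-up target path:

```shell
#!/bin/sh
# Hypothetical target path, shaped like the ones Alfresco passes to ocr.sh.
TARGET=/opt/alfresco/tomcat/temp/Alfresco/target_123.txt

# Strip the shortest trailing ".something" to get the output base name.
OUTPUT_BASE=${TARGET%.*}
echo "$OUTPUT_BASE"

# tesseract adds .txt itself, so the file it writes matches $TARGET again.
echo "$OUTPUT_BASE.txt"
```

Without the stripping step, tesseract would write to a path ending in .txt.txt and Alfresco would find an empty target file.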
Finally make sure the ocr.sh file has executable permission set. You can set it with the following command: chmod 755 /opt/alfresco/ocr.sh
Tesseract properties
The next step is to define a set of properties for tesseract in alfresco-global.properties.
# OCR Script
ocr.script=/opt/alfresco/ocr.sh

#GS executable
ghostscript.exe=gs

#Tesseract executable
tesseract.exe=tesseract

# Define a default priority for this transformer
content.transformer.ocr.tiff.priority=10
# List the transformations that are supported
content.transformer.ocr.tiff.extensions.tiff.txt.supported=true
content.transformer.ocr.tiff.extensions.tiff.txt.priority=10
content.transformer.ocr.tiff.extensions.jpg.txt.supported=true
content.transformer.ocr.tiff.extensions.jpg.txt.priority=10
content.transformer.ocr.tiff.extensions.png.txt.supported=true
content.transformer.ocr.tiff.extensions.png.txt.priority=10
content.transformer.ocr.tiff.extensions.gif.txt.supported=true
content.transformer.ocr.tiff.extensions.gif.txt.priority=10

# Define a default priority for this transformer
content.transformer.pdf.tiff.available=true
content.transformer.pdf.tiff.priority=10
# List the transformations that are supported
content.transformer.pdf.tiff.extensions.pdf.tiff.supported=true
content.transformer.pdf.tiff.extensions.pdf.tiff.priority=10

content.transformer.complex.Pdf2OCR.available=true
# Commented to be compatible with Alfresco 5.x
# content.transformer.complex.Pdf2OCR.failover=ocr.pdf
content.transformer.complex.Pdf2OCR.pipeline=pdf.tiff|tiff|ocr.tiff
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.supported=true
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.priority=10

# Disable the OOTB transformers
content.transformer.double.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.complex.PDF.Image.extensions.pdf.tiff.supported=false
content.transformer.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.PdfBox.extensions.pdf.txt.supported=false
content.transformer.TikaAuto.extensions.pdf.txt.supported=false
The main property to consider is ocr.script, which points to the location of the ocr.sh file; adjust it to match your installation. All other properties can be left as they are.
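Because a missing property here tends to fail silently, a quick check of the file can save a restart. The sketch below reports which of a few key properties are absent; the function name missing_ocr_properties is made up, and the path to alfresco-global.properties is an assumption based on a default /opt/alfresco install.

```shell
#!/bin/sh
# Print each required key that is not defined in a properties file.
# $1 = path to alfresco-global.properties
# missing_ocr_properties is a hypothetical helper name for this example.
missing_ocr_properties() {
    for key in ocr.script ghostscript.exe tesseract.exe \
               content.transformer.complex.Pdf2OCR.pipeline; do
        # Compare the text before the first "=" exactly, so commented-out
        # lines such as "# tesseract.exe=..." do not count as defined.
        awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$1" \
            || echo "missing: $key"
    done
}

# Assumed location for a default install; adjust to your environment.
PROPS=/opt/alfresco/tomcat/shared/classes/alfresco-global.properties
if [ -f "$PROPS" ]; then
    missing_ocr_properties "$PROPS"
fi
```

No output means all four keys were found.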
Debugging
There are two areas we can debug:
- The Alfresco transformation service
- Tesseract execution
Alfresco Transformation Service
To debug the transformation service edit the file tomcat/shared/classes/alfresco/extension/custom-log4j.properties and add the following line at the bottom:
log4j.logger.org.alfresco.repo.content.transform=trace
Alfresco needs restarting to pick up this debug entry.
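If you prefer to script this change, the line can be appended only when it is not already there, so running the step twice does no harm. This is a sketch; append_once is a made-up helper name and the file path assumes Alfresco lives in /opt/alfresco.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# $1 = file, $2 = line
# append_once is a hypothetical helper name for this example.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Assumed location for a default install; adjust to your environment.
LOG4J=/opt/alfresco/tomcat/shared/classes/alfresco/extension/custom-log4j.properties
if [ -d "$(dirname "$LOG4J")" ]; then
    append_once "$LOG4J" 'log4j.logger.org.alfresco.repo.content.transform=trace'
fi
```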
Tesseract execution
To get some execution information from tesseract edit the file /opt/alfresco/ocr.sh and uncomment the following entry by removing the '#' from the beginning of the line:
# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log
Now when an image file with text is loaded in Alfresco, we can see entries similar to the following in the alfresco.log file, showing the ocr.sh script being called.
2017-10-10 15:20:17,182 DEBUG [content.transform.RuntimeExecutableContentTransformerWorker] [http-bio-8443-exec-6] Transformation completed:
   source: ContentAccessor[ contentUrl=store:///opt/alfresco/tomcat/temp/Alfresco/ComplextTransformer_intermediate_pdf_9017478201188837562.tiff, mimetype=image/tiff, size=24925880, encoding=UTF-8, locale=en_GB]
   target: ContentAccessor[ contentUrl=store://2017/10/10/15/20/d3b4b9aa-ad28-4c8c-ae86-f99938bf4125.bin, mimetype=text/plain, size=1173, encoding=UTF-8, locale=en_GB]
   options: {maxSourceSizeKBytes=-1, pageLimit=-1, use=index, timeoutMs=120000, maxPages=-1, contentReaderNodeRef=null, sourceContentProperty=null, readLimitKBytes=-1, contentWriterNodeRef=null, targetContentProperty=null, includeEmbedded=null, readLimitTimeMs=-1}
   result:
Execution result:
   os: Linux
   command: /opt/alfresco/ocr.sh /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
   succeeded: true
   exit code: 0
   out:
   err: Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1
2017-10-10 15:20:17,183 TRACE [content.transform.TransformerLog] [http-bio-8443-exec-6] 4.1.2 tiff txt INFO <<TemporaryFile>> 23.7 MB 1,950 ms ocr.tiff<<Runtime>>
2017-10-10 15:20:17,183 TRACE [content.transform.TransformerDebug] [http-bio-8443-exec-6] 4.1.2 Finished in 1,950 ms
We can also take a look at the /tmp/ocrtransform.log file to see what files have been processed.
from /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff to /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
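When many files have been processed, a one-line summary is easier to read than the raw log. The following sketch pulls the source path out of each entry; it assumes the "from <source> to <target>" layout written by the echo line in ocr.sh and paths without spaces, and list_ocr_sources is a made-up helper name.

```shell
#!/bin/sh
# Print the source path from each "from <source> to <target>" log entry.
# $1 = path to the log written by ocr.sh
# list_ocr_sources is a hypothetical helper name for this example.
list_ocr_sources() {
    awk '$1 == "from" && $3 == "to" { print $2 }' "$1"
}

LOG=/tmp/ocrtransform.log
if [ -f "$LOG" ]; then
    # Count how many times each source file was sent through OCR.
    list_ocr_sources "$LOG" | sort | uniq -c
fi
```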
That's it, you should now be able to search for the text contained in the image files.
References
Most of the information on this blog comes from this GitHub repository https://github.com/bchevallereau/alfresco-tesseract, with some additional adjustments and inclusions.