The purpose of this blog is to show how to scan images containing text so that the text is indexed and searchable by Alfresco. The following file types are supported: PNG, BMP, JPEG, GIF, TIFF and PDF (containing images).
For this exercise we are going to use a Linux OS, but the same approach should also work on Windows.
To scan images we are going to use Tesseract-ocr (tesseract). This package contains an OCR engine, libtesseract, and a command line program, tesseract.
Tesseract has Unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".
Since we are using tesseract-ocr, we need to install the tesseract software (version 3 or greater) for our Linux distribution.
Please follow the instructions explained here: Installing Tesseract
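Once the package is installed, it is worth confirming that the version requirement is met. The following is a minimal sketch, not part of the original setup: the function names are made up for illustration, and it assumes the first line of the version banner looks like "tesseract 3.04.01" (some builds print a leading "v" before the number, which is stripped).

```shell
#!/bin/sh
# Hypothetical helpers for checking a Tesseract version banner.

# Print the major version found on the first line of the banner,
# e.g. "tesseract 3.04.01" gives 3.
tesseract_major_version() {
    printf '%s\n' "$1" | head -n 1 | awk '{print $2}' | sed 's/^v//' | cut -d. -f1
}

# Succeed only when the banner reports major version 3 or greater.
tesseract_version_ok() {
    major=$(tesseract_major_version "$1")
    case "$major" in
        ''|*[!0-9]*) return 1 ;;   # no number found: treat as unsupported
    esac
    [ "$major" -ge 3 ]
}
```

A possible way to use it is tesseract_version_ok "$(tesseract -v 2>&1)" && echo "Tesseract is recent enough". The 2>&1 is there because older releases print the banner on standard error.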
Create a file named transformer-context.xml in Alfresco's extension folder (i.e. tomcat/shared/classes/alfresco/extension) with the following content:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
You under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<beans>
<!-- Transforms from TIFF to plain text using Tesseract
and a custom script -->
<bean id="transformer.worker.ocr.tiff"
class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>${tesseract.exe}</value>
<value>-v</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>2</value>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>${ocr.script}</value>
<value>${source}</value>
<value>${target}</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
<property name="waitForCompletion">
<value>true</value>
</property>
</bean>
</property>
<property name="transformerConfig">
<ref bean="transformerConfig" />
</property>
</bean>
<bean id="transformer.ocr.tiff"
class="org.alfresco.repo.content.transform.ProxyContentTransformer"
parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.worker.ocr.tiff" />
</property>
</bean>
<!-- Transforms from PDF to TIFF using Ghostscript -->
<bean id="transformer.worker.pdf.tiff"
class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
<property name="mimetypeService">
<ref bean="mimetypeService" />
</property>
<property name="checkCommand">
<bean name="transformer.ImageMagick.CheckCommand" class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>${ghostscript.exe}</value>
<value>-v</value>
</list>
</entry>
</map>
</property>
</bean>
</property>
<property name="transformCommand">
<bean class="org.alfresco.util.exec.RuntimeExec">
<property name="commandsAndArguments">
<map>
<entry key=".*">
<list>
<value>${ghostscript.exe}</value>
<value>-o</value>
<value>${target}</value>
<value>-sDEVICE=tiff24nc</value>
<value>-r300</value>
<value>${source}</value>
</list>
</entry>
</map>
</property>
<property name="errorCodes">
<value>1,2</value>
</property>
<property name="waitForCompletion">
<value>true</value>
</property>
</bean>
</property>
<property name="transformerConfig">
<ref bean="transformerConfig" />
</property>
</bean>
<bean id="transformer.pdf.tiff"
class="org.alfresco.repo.content.transform.ProxyContentTransformer"
parent="baseContentTransformer">
<property name="worker">
<ref bean="transformer.worker.pdf.tiff" />
</property>
</bean>
</beans>
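We can see we are using a few variables here: ${tesseract.exe}, ${ghostscript.exe} and ${ocr.script} are defined in alfresco-global.properties, while ${source} and ${target} are replaced by Alfresco at run time with the paths of the temporary input and output files.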
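The PDF to TIFF step can be tried by hand before involving Alfresco at all. The sketch below wraps the same Ghostscript arguments used in the transformCommand above (-o, -sDEVICE=tiff24nc, -r300). The function name and the optional third argument are assumptions made for illustration, and gs is assumed to be on the PATH.

```shell
#!/bin/sh
# Hypothetical wrapper that mirrors the pdf.tiff transformCommand.
# Usage: pdf_to_tiff SOURCE.pdf TARGET.tiff [GHOSTSCRIPT_BINARY]
pdf_to_tiff() {
    src=$1
    dst=$2
    gs_bin=${3:-gs}   # assumption: Ghostscript is installed as "gs"
    # Render every page at 300 dpi into a 24-bit colour TIFF.
    "$gs_bin" -o "$dst" -sDEVICE=tiff24nc -r300 "$src"
}
```

For example, pdf_to_tiff scanned.pdf /tmp/scanned.tiff (both file names are placeholders) should leave a TIFF that can then be passed to tesseract. Passing echo as the third argument prints the arguments instead of running Ghostscript, which is handy for a dry run.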
The next step is to create the ocr.sh script. The location of the script is also referenced in the alfresco-global.properties file by the property ocr.script, as shown later in this blog.
Assuming Alfresco is installed in /opt/alfresco, create a file named /opt/alfresco/ocr.sh with the following content:
#!/bin/bash
# save arguments to variables
SOURCE=$1
TARGET=$2
TMPDIR=/tmp/tesseract
FILENAME=$(basename "$SOURCE")
OCRFILE=$FILENAME.tif
export LD_LIBRARY_PATH=/usr/lib
# Create temp directory if it doesn't exist
mkdir -p "$TMPDIR"
# to see what happens
# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log
cp -f "$SOURCE" "$TMPDIR/$OCRFILE"
# call tesseract; it appends .txt to the output base, so the extension is stripped from $TARGET
/usr/bin/tesseract "$TMPDIR/$OCRFILE" "${TARGET%\.*}" -l eng
rm -f "$TMPDIR/$OCRFILE"
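A couple of points to consider here: the script calls tesseract by its full path, /usr/bin/tesseract, so adjust it if your distribution installs the binary elsewhere; and the -l eng option selects English as the recognition language, so change it if your documents are in another language.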
Finally, make sure the ocr.sh file has the executable permission set. You can set it with the following command: chmod 755 /opt/alfresco/ocr.sh
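One detail of the script that is easy to miss is the ${TARGET%\.*} expansion. Tesseract treats its second argument as an output base and appends .txt to it, so the script removes the extension from the target path first. The sketch below illustrates this; the two function names are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Strip the last extension from a path, as ocr.sh does with ${TARGET%\.*}.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Path that Tesseract ends up writing for a given target path:
# the output base with ".txt" appended.
tesseract_output_path() {
    printf '%s.txt\n' "$(output_base "$1")"
}
```

For a target ending in .txt the result is the target itself, which is what Alfresco expects. For any other extension the text would land in a different file from the one Alfresco reads back. To test the whole script by hand, something like /opt/alfresco/ocr.sh /path/to/sample.tiff /tmp/sample.txt followed by cat /tmp/sample.txt should print the recognised text (the sample paths are placeholders).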
The next step is to define a set of properties for tesseract in alfresco-global.properties.
# OCR Script
ocr.script=/opt/alfresco/ocr.sh
#GS executable
ghostscript.exe=gs
#Tesseract executable
tesseract.exe=tesseract
# Define a default priority for this transformer
content.transformer.ocr.tiff.priority=10
# List the transformations that are supported
content.transformer.ocr.tiff.extensions.tiff.txt.supported=true
content.transformer.ocr.tiff.extensions.tiff.txt.priority=10
content.transformer.ocr.tiff.extensions.jpg.txt.supported=true
content.transformer.ocr.tiff.extensions.jpg.txt.priority=10
content.transformer.ocr.tiff.extensions.png.txt.supported=true
content.transformer.ocr.tiff.extensions.png.txt.priority=10
content.transformer.ocr.tiff.extensions.gif.txt.supported=true
content.transformer.ocr.tiff.extensions.gif.txt.priority=10
# Define a default priority for this transformer
content.transformer.pdf.tiff.available=true
content.transformer.pdf.tiff.priority=10
# List the transformations that are supported
content.transformer.pdf.tiff.extensions.pdf.tiff.supported=true
content.transformer.pdf.tiff.extensions.pdf.tiff.priority=10
content.transformer.complex.Pdf2OCR.available=true
# Commented to be compatible with Alfresco 5.x
# content.transformer.complex.Pdf2OCR.failover=ocr.pdf
content.transformer.complex.Pdf2OCR.pipeline=pdf.tiff|tiff|ocr.tiff
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.supported=true
content.transformer.complex.Pdf2OCR.extensions.pdf.txt.priority=10
# Disable the OOTB transformers
content.transformer.double.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.complex.PDF.Image.extensions.pdf.tiff.supported=false
content.transformer.ImageMagick.extensions.pdf.tiff.supported=false
content.transformer.PdfBox.extensions.pdf.txt.supported=false
content.transformer.TikaAuto.extensions.pdf.txt.supported=false
The main property to consider is ocr.script, which points to the location of the ocr.sh file; adjust it to match your installation. All other properties can be left as they are.
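Before restarting Alfresco, it can save time to confirm that the executables named in these properties can actually be run by the user that runs Alfresco. This is a minimal sketch under that assumption; the helper names are invented for illustration.

```shell
#!/bin/sh
# Succeed if the argument is a runnable command: an executable file when it
# contains a slash, otherwise a command found on the PATH.
is_runnable() {
    case "$1" in
        */*) [ -f "$1" ] && [ -x "$1" ] ;;
        *)   command -v "$1" >/dev/null 2>&1 ;;
    esac
}

# Print one "ok:" or "missing:" line per argument and return non-zero
# if anything is missing.
check_executables() {
    rc=0
    for cmd in "$@"; do
        if is_runnable "$cmd"; then
            echo "ok: $cmd"
        else
            echo "missing: $cmd"
            rc=1
        fi
    done
    return $rc
}
```

With the property values shown above, the call would be check_executables gs tesseract /opt/alfresco/ocr.sh.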
There are two areas we can debug: the Alfresco transformation service and the ocr.sh script itself.
To debug the transformation service edit the file tomcat/shared/classes/alfresco/extension/custom-log4j.properties and add the following line at the bottom:
log4j.logger.org.alfresco.repo.content.transform=trace
Alfresco needs restarting to pick up this debug entry.
To get some execution information from tesseract edit the file /opt/alfresco/ocr.sh and uncomment the following entry by removing the '#' from the beginning of the line:
# echo "from $SOURCE to $TARGET" >>/tmp/ocrtransform.log
Now, when an image file containing text is loaded into Alfresco, we can see entries similar to the following in the alfresco.log file, showing the ocr.sh script being called.
2017-10-10 15:20:17,182 DEBUG [content.transform.RuntimeExecutableContentTransformerWorker] [http-bio-8443-exec-6] Transformation completed:
source: ContentAccessor[ contentUrl=store:///opt/alfresco/tomcat/temp/Alfresco/ComplextTransformer_intermediate_pdf_9017478201188837562.tiff, mimetype=image/tiff, size=24925880, encoding=UTF-8, locale=en_GB]
target: ContentAccessor[ contentUrl=store://2017/10/10/15/20/d3b4b9aa-ad28-4c8c-ae86-f99938bf4125.bin, mimetype=text/plain, size=1173, encoding=UTF-8, locale=en_GB]
options: {maxSourceSizeKBytes=-1, pageLimit=-1, use=index, timeoutMs=120000, maxPages=-1, contentReaderNodeRef=null, sourceContentProperty=null, readLimitKBytes=-1, contentWriterNodeRef=null, targetContentProperty=null, includeEmbedded=null, readLimitTimeMs=-1}
result:
Execution result:
os: Linux
command: /opt/alfresco/ocr.sh /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
succeeded: true
exit code: 0
out:
err: Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1
2017-10-10 15:20:17,183 TRACE [content.transform.TransformerLog] [http-bio-8443-exec-6] 4.1.2 tiff txt INFO <<TemporaryFile>> 23.7 MB 1,950 ms ocr.tiff<<Runtime>>
2017-10-10 15:20:17,183 TRACE [content.transform.TransformerDebug] [http-bio-8443-exec-6] 4.1.2 Finished in 1,950 ms
We can also take a look at the /tmp/ocrtransform.log file to see what files have been processed.
from /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_5734790636289670188.tiff to /opt/alfresco/tomcat/temp/Alfresco/RuntimeExecutableContentTransformerWorker_target_1506982845420553983.txt
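Since every run appends one line of the form "from SOURCE to TARGET", the log is easy to summarise. The following sketch assumes that exact line format; the function names are made up for this example.

```shell
#!/bin/sh
# Count how many transformations were recorded in the given log file.
count_ocr_runs() {
    grep -c '^from .* to ' "$1"
}

# Print only the source path from each "from SOURCE to TARGET" line.
list_ocr_sources() {
    sed -n 's/^from \(.*\) to .*$/\1/p' "$1"
}
```

For example, count_ocr_runs /tmp/ocrtransform.log prints the number of files that have gone through ocr.sh since the echo line was uncommented.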
That's it, you should now be able to search for the text contained in the image files.
Most of the information in this blog comes from the GitHub repository https://github.com/bchevallereau/alfresco-tesseract, with some additional adjustments and inclusions.