Content transformer for OCR ABBYY OCR4LINUX client

deajan
Champ on-the-rise
Hello,

I've freshly installed an Alfresco Community 5.0.c on a CentOS 7 server and tried to copy my earlier OCR integration from my Alfresco 4.2.c install.

I've created a custom content transformer in /opt/alfresco-5.0.c/tomcat/shared/classes/alfresco/extension/ocr-tiff-pdf-transformer.xml containing the following:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
    <beans>
        <bean id="transformer.worker.ocrabbyy.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>
              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
                                     <value>/usr/local/bin/abbyyocr9</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>
              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
                                    <value>/usr/local/bin/abbyyocr9</value>
                                    <value>-if</value>
                                    <value>${source}</value>
                                    <value>-pem</value>
                                    <value>ImageOnText</value>
                                    <value>-pfpf</value>
                                    <value>Automatic</value>
                                    <value>-pfpr</value>
                                    <value>300</value>
                                    <value>-f</value>
                                    <value>PDF</value>
                                    <value>-of</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>
              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype">
                            <value>image/tiff</value>
                        </property>
                        <property name="targetMimetype">
                            <value>application/pdf</value>
                        </property>
                    </bean>
                 </list>
              </property>
        </bean>
        <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.ocrabbyy.tiff" />
            </property>
        </bean>
</beans>


I restarted my Alfresco instance and checked at http://mylocaltestserver:8080/alfresco/service/mimetypes?mimetype=application/pdf#application/pdf whether the transformer was registered.

The transformer doesn't appear.
I've enabled the following logging options in /opt/alfresco-5.0.c/tomcat/shared/classes/alfresco/extension/custom-log4j.properties and restarted the service again.


log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=debug
#log4j.logger.org.alfresco.util.exec.RuntimeExecBootstrapBean=debug
log4j.logger.org.alfresco.util.exec.RuntimeExec=debug


But there isn't even a trace of that transformer in /opt/alfresco-5.0.c/alfresco.log.

Is there anything I missed to enable custom transformers?

Regards,
Ozy.
2 REPLIES

deajan
Champ on-the-rise
Well, I found the solution.
Whenever a transformer is added, the name of its configuration file must end with -context.xml.
This wasn't the default behavior in 4.2.c, nor is it clearly stated in the wiki.

Anyway, I managed to write a nice ABBYY OCR wrapper that works with abbyyocr11 and converts TIFF or JPEG images to OCRed PDFs.
Here's the working code of my transformer. Hopefully this will help other underpaid sysadmins :)

# cat /opt/alfresco-5.0.c/tomcat/shared/classes/alfresco/extension/ocr-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
    <beans>
        <bean id="transformer.worker.ocrabbyy.tiffjpeg" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>
              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
                                     <value>/usr/local/bin/abbyyocr11</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>
              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
                                    <value>/usr/local/bin/abbyyocr11</value>
                                    <value>-if</value>
                                    <value>${source}</value>
                                    <value>-lpp</value>                     <!-- Predefined profile -->
                                    <value>TextExtraction_Accuracy</value>
                                    <value>-adb</value>                     <!-- Detect barcodes -->
                                    <value>-ido</value>                     <!-- Detect and rotate image orientation -->
                                    <value>-adtop</value>                   <!-- Detect text embedded in images -->
                                    <value>-rl</value>                      <!-- List of languages for the document -->
                                    <value>French,English,Spanish</value>
                                    <value>-recc</value>                    <!-- Enhanced character confidence -->
                                    <value>-pfs</value>                     <!-- PDF Export preset -->
                                    <value>Balanced</value>
                                    <value>-pacm</value>                    <!-- PDF Export format -->
                                    <value>Pdfa_3a</value>
                                    <value>-ptem</value>                    <!-- PDF Export text format -->
                                    <value>ImageOnText</value>
                                    <value>-f</value>                       <!-- Output format -->
                                    <value>PDF</value>
                                    <value>-of</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>
              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype">
                            <value>image/tiff</value>
                        </property>
                        <property name="targetMimetype">
                            <value>application/pdf</value>
                        </property>
                    </bean>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype">
                            <value>image/jpeg</value>
                        </property>
                        <property name="targetMimetype">
                            <value>application/pdf</value>
                        </property>
                    </bean>
                 </list>
              </property>
        </bean>
        <bean id="transformer.ocr.tiffjpeg" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.ocrabbyy.tiffjpeg" />
            </property>
        </bean>
</beans>


Also, to use that transformer, you should give it priority over ImageMagick's tiff-to-pdf transformer, or disable the ImageMagick one entirely if you don't want a fallback transformer without OCR.

In /opt/alfresco-5.0.c/tomcat/shared/classes/alfresco-global.properties, add the following lines (comment out the one you don't want):

### Extension priority
content.transformer.ocr.tiffjpeg.priority=50
content.transformer.ImageMagick.extensions.tiff.pdf.supported=false
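
Since the transformer above also accepts image/jpeg, the same choice presumably applies to JPEG sources. Here is a minimal sketch of the two variants spelled out separately. It assumes the per-extension property pattern used above also works with `jpg` (the extension Alfresco normally maps to image/jpeg). The `jpg` line is an untested assumption and not part of the configuration that was actually run:

```properties
### Variant A: prefer the OCR transformer, keep ImageMagick as a non-OCR fallback
content.transformer.ocr.tiffjpeg.priority=50

### Variant B: no fallback, ImageMagick is not used for tiff -> pdf
#content.transformer.ImageMagick.extensions.tiff.pdf.supported=false
# Assumption (untested): the same pattern for JPEG sources
#content.transformer.ImageMagick.extensions.jpg.pdf.supported=false
```

Uncomment the lines of Variant B only if a PDF without a text layer is worse for you than a failed conversion.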


Here you go: a fully working OCR transformer that wraps the ABBYY OCR4Linux client into Alfresco.

deajan
Champ on-the-rise
Just a note:
Once this transformer is declared, viewing JPEGs and TIFFs in Alfresco automatically triggers a PDF conversion that uses the content transformer.
In the case of ABBYY OCR, an OCRed page is counted for every TIFF/JPEG view.