
Content transformer PDF to PDF [SOLVED]

deajan
Champ on-the-rise
Hello,

A couple of days ago I posted a working content transformer from JPG and TIFF to PDF using the ABBYY OCR engine under Linux.
I would also like to be able to run this transformer on older PDF documents that were already scanned.

Is there any way to transform from PDF to PDF with a content transformer, without an ugly chain like pdf -> tiff -> pdf?

Regards,
Ozy.
3 REPLIES

sujaypillai
Confirmed Champ
Why would you like to run the transformer for that when you already have a PDF?

Well, as I said, I have some PDFs that are just plain scanned images, and I'd like to run them through the OCR content transformer I wrote here:
https://forums.alfresco.com/forum/installation-upgrades-configuration-integration/integration-other-...

So basically I want to make those PDFs searchable.

deajan
Champ on-the-rise
[Edit from 10 Apr 2015]
I added a unit of measurement for the TIFF file so the OCR handles document sizes properly.
[/Edit]

I finally did an ugly PDF to TIFF to PDF transformation, but ImageMagick does a really poor job of the default PDF to TIFF conversion (fax quality).
I wrote a content transformer that calls ImageMagick at 200 dpi, but a standard 1 MB PDF file then becomes a huge 15 MB TIFF, so I added some parameters (removing the alpha channel, specifying color depth and compression) that produce normally sized files.

200 dpi is what a standard scanner usually produces. Be aware that a wrong dpi value can increase the size of your document, and some OCR software that counts pages will count a page double if its size is larger than A4 (or whatever reference size it uses).
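
To make the link between dpi and page size concrete, here is a small sketch that computes the pixel dimensions of a page at a given resolution. The helper name page_pixels is made up for this example, and it only assumes the usual 25.4 mm per inch conversion; how a particular OCR engine decides what counts as "larger than A4" is not something this thread documents.

```shell
#!/bin/sh
# Print the pixel dimensions of a page at a given resolution.
# Usage: page_pixels WIDTH_MM HEIGHT_MM DPI
# (page_pixels is a hypothetical helper, not part of Alfresco or ImageMagick)
page_pixels() {
    awk -v w="$1" -v h="$2" -v dpi="$3" 'BEGIN {
        # 25.4 mm per inch; adding 0.5 before int() rounds to the nearest pixel
        printf "%d x %d\n", int(w / 25.4 * dpi + 0.5), int(h / 25.4 * dpi + 0.5)
    }'
}

page_pixels 210 297 200   # A4 at 200 dpi
page_pixels 210 297 300   # A4 at 300 dpi
```

This prints 1654 x 2339 for 200 dpi and 2480 x 3508 for 300 dpi. A TIFF with those pixel dimensions but no resolution unit (or a wrong density) may be interpreted as a much larger physical page, which is why the unit matters.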

Here's the custom content transformer to put into <alfresco_path>/tomcat/shared/classes/alfresco/extension/pdf2tiff-transform-context.xml, which will produce good quality TIFFs from PDFs.

It would of course be easier to modify the default transformer that ships with Alfresco, but I don't have a clue where to find that config file (hidden in some .jar?).


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
    <beans>
        <bean id="transformer.worker.imagemagickcustom.pdftiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
            <property name="mimetypeService">
                <ref bean="mimetypeService" />
            </property>
              <property name="checkCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
                                     <value>${img.exe}</value>
                                     <value>-version</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>2</value>
                    </property>
                 </bean>
              </property>
              <property name="transformCommand">
                 <bean class="org.alfresco.util.exec.RuntimeExec">
                    <property name="commandsAndArguments">
                        <map>
                            <entry key=".*">
                                <list>
                                    <value>${img.exe}</value>
                                    <value>-density</value> <!-- Resolution, use the same as your scanner, usually 200 or 300 dpi -->
                                    <value>200</value>
                                    <value>-depth</value> <!-- 8 bits per channel color depth, sufficient for long-term document storage -->
                                    <value>8</value>
                                    <value>+matte</value> <!-- Remove alpha channel -->
                                    <value>-compress</value> <!-- Compression, can be zip or lzw, got better results with zip -->
                                    <value>zip</value>
                                    <value>-units</value> <!-- Measurement unit for the TIFF file -->
                                    <value>pixelsperinch</value>
                                    <value>${source}</value>
                                    <value>${target}</value>
                                </list>
                            </entry>
                        </map>
                    </property>
                    <property name="errorCodes">
                       <value>1,2</value>
                    </property>
                 </bean>
              </property>
              <property name="explicitTransformations">
                 <list>
                    <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                        <property name="sourceMimetype">
                            <value>application/pdf</value>
                        </property>
                        <property name="targetMimetype">
                            <value>image/tiff</value>
                        </property>
                    </bean>
                 </list>
              </property>
        </bean>
        <bean id="transformer.imagemagickcustom.pdftiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
            <property name="worker">
                <ref bean="transformer.worker.imagemagickcustom.pdftiff" />
            </property>
        </bean>
</beans>
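
To try the ImageMagick options outside Alfresco before deploying the bean, the transformCommand above should be roughly equivalent to the command built by this sketch. The function name pdf2tiff, the IMG_EXE variable (standing in for Alfresco's ${img.exe}) and the DRY_RUN switch are all made up for this example, and it assumes the ImageMagick binary is called convert on your system.

```shell
#!/bin/sh
# Convert a PDF to TIFF with the same options as the transformCommand above.
# Usage: pdf2tiff SOURCE.pdf TARGET.tiff [DPI]
# Set DRY_RUN=1 to print the command instead of running it.
# (pdf2tiff, IMG_EXE and DRY_RUN are hypothetical names for this sketch)
pdf2tiff() {
    src="$1"; dst="$2"; dpi="${3:-200}"
    # IMG_EXE stands in for ${img.exe}; defaulting to "convert" is an assumption
    set -- "${IMG_EXE:-convert}" -density "$dpi" -depth 8 +matte \
        -compress zip -units pixelsperinch "$src" "$dst"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"     # show the command line that would be executed
    else
        "$@"          # run ImageMagick with the arguments built above
    fi
}
```

For example, running pdf2tiff scan.pdf scan.tiff 300 after sourcing the function would convert at 300 dpi instead of the default 200, which makes it easy to compare file sizes and OCR quality before changing the value in the XML.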


Also, don't forget to modify alfresco-global.properties by adding:


transformer.imagemagickcustom.pdftiff.priority=50
content.transformer.ImageMagick.extensions.pdf.tiff.supported=false