[Solved] OCRed PDF is not indexed!!

melttech
Champ in-the-making
Hi,

I uploaded a searchable PDF file into Alfresco and used the Alfresco Explorer advanced search to find a word in its content. That worked, and the search returned results.

BUT then I uploaded an OCRed searchable PDF (created with ABBYY FineReader 11) into Alfresco, and the advanced search didn't find anything, even when I copied a word from inside the PDF and pasted it into the search field.

How can I debug this?

I want to know the steps from upload to index. Where can I see all the indexed words?

Cheers,


3 REPLIES

jpfi
Champ in-the-making
Hi,
you can inspect your search index by using Luke: https://code.google.com/p/luke/
cheers, jan

melttech
Champ in-the-making
Thanks for replying.
I already had Luke before posting, but I'm not very familiar with it. Can you tell me how Luke can help me debug this? I can see the document's filename indexed in the alf_data directory, but not its content.

I verified that this transformer is executed without error:

2013-03-04 10:32:48,448  DEBUG [util.exec.RuntimeExec] [http-8080-2] Execution result:
   os:         Linux
   command:    pdftotext -enc UTF-8 /opt/apache-tomcat-6.0.36/temp/Alfresco/RuntimeExecutableContentTransformerWorker_source_922569492484223827.pdf
   succeeded:  true
   exit code:  0
   out:
   err:


When I open the result of the transformation in tomcat/temp/Alfresco, there is a txt file named Failover transformer intermediate tikaauto content transformer. I opened the file and it is empty.
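
One way to narrow this down is to run the same extraction by hand, outside Alfresco, and see whether pdftotext produces any text at all for the OCRed file. The following is only a minimal sketch: it assumes pdftotext is on the PATH, and the function names `count_words` and `check_pdf_text` are made up for illustration.

```shell
# Sketch: run pdftotext by hand and report how many words it extracted.
# Assumes pdftotext (e.g. from xpdf) is installed and on the PATH.

# Print the number of whitespace-separated words in a text file.
count_words() {
    # Arithmetic expansion strips the padding some wc implementations add.
    echo $(( $(wc -w < "$1") ))
}

# Extract text from a PDF into a temporary file and report the word count.
check_pdf_text() {
    pdf="$1"
    out="$(mktemp)" || return 1
    if ! pdftotext -enc UTF-8 "$pdf" "$out"; then
        echo "pdftotext failed for $pdf" >&2
        rm -f "$out"
        return 1
    fi
    echo "$pdf: $(count_words "$out") words extracted"
    rm -f "$out"
}
```

Calling `check_pdf_text` with the path of the OCRed PDF and then with a PDF that indexes correctly gives a quick comparison. A count of 0 for the OCRed file would suggest the problem is in the extraction step rather than in the index.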


I tried searching for a word in the content, but the Alfresco Explorer search doesn't return any results. Note that this problem only happens with PDFs that have been OCRed by ABBYY FineReader 11. If I use a searchable PDF file (not OCRed), the content is indexed correctly and the search returns results.

Cheers,

melttech
Champ in-the-making
I don't know why, but it works after I removed the options from the pdftotext transformCommand (I'm getting pdftotext from http://www.foolabs.com/xpdf/download.html). I'm sharing my configuration, which works for Alfresco v4.x.x. First, enable log4j debug logging:

log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG
log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG
log4j.logger.org.alfresco.repo.content.transform.ContentTransformerRegistry=DEBUG
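
With these loggers enabled, each external command shows up in the log as an "Execution result:" entry followed by its os, command, succeeded, exit code, out and err lines. A small sketch for pulling those entries out of the log; the function name `show_exec_results` is made up for illustration, and the log file path depends on your installation.

```shell
# Print each RuntimeExec "Execution result:" entry together with its detail lines.
show_exec_results() {
    # -A 6 keeps the six lines after each match:
    # os, command, succeeded, exit code, out, err.
    grep -A 6 'Execution result:' "$1"
}
```

Run it as `show_exec_results` followed by the path to your Alfresco log file, then check that the logged command line contains both the source and the target file.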

Hope this helps others:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
        <!-- disable standard pdfbox text transformer -->
        <bean id="transformer.PdfBox" class="java.lang.String"/>
        <!-- has the above injected, is newly created below -->
        <bean id="transformer.complex.OpenOffice.PdfBox" class="java.lang.String"/>

        <!-- pdftotext command line binary -->
       <bean id="transformer.PdfToTextTool" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
         <property name="worker">
            <ref bean="transformer.worker.PdfToTextTool" />
         </property>
      </bean>
        <bean id="transformer.worker.PdfToTextTool" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
                <property name="mimetypeService">
               <ref bean="mimetypeService" />
            </property>
            <property name="transformCommand">
                        <bean name="transformer.pdftotext.Command"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <!--<value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux -enc UTF-8 ${source} ${target}</value>-->
                                                        <value>pdftotext -enc UTF-8 ${source} ${target}</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-win32.exe -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <property name="explicitTransformations">
                        <list>
                                <!--<bean
                                        class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
                                        <constructor-arg>
                                                <value>application/pdf</value>
                                        </constructor-arg>
                                        <constructor-arg>
                                                <value>text/plain</value>
                                        </constructor-arg>
                                </bean>-->
                     <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
                         <property name="sourceMimetype"><value>application/pdf</value></property>
                         <property name="targetMimetype"><value>text/plain</value></property>
                     </bean>
                        </list>
                </property>
        </bean>

   <!-- replaces bean transformer.complex.OpenOffice.PdfBox -->
   <bean id="transformer.complex.OpenOffice.PdfToTextTool"
        class="org.alfresco.repo.content.transform.ComplexContentTransformer"
        parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.OpenOffice" />
            <ref bean="transformer.PdfToTextTool" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>application/pdf</value>
         </list>
      </property>
   </bean>
</beans>


Cheers,