PDF FullText Indexing Alfresco indexer ignores blanks/spaces

gsx
Champ in-the-making
Hi all,

I have a question about PDF indexing in Alfresco. I have many PDF files created with a document scanner (Fujitsu ScanSnap S510), which scans documents, runs OCR and produces PDF files. The Fujitsu software suite uses ABBYY tools for this purpose.

These PDF documents are of course searchable afterwards, and I have had no problems so far. In the past I used Windows Desktop Search to run full-text searches on these documents (both the Foxit and the Adobe IFilters work fine). Other search engines treat these documents like any other PDF file and have no problems with them.

The Problem:

After installing Alfresco 3.0 I am experiencing one big problem with the kind of PDF files mentioned above. The integrated indexer seems to ignore the blanks/spaces between words when indexing these files. To illustrate, here are some examples of what the index looks like:

***********************************

Original Text in PDF file: Hello World 123    
Indexed Text in Alfresco: HelloWorld123


Original Text in PDF file: Alfresco Open Source Enterprise Content Management System including document management
Indexed Text in Alfresco: AlfrescoOpenSourceEnterpriseContentManagementSystemincludingdocumentmanagement

***********************************

The main problem is that doing a full-text search on these PDF files is very hard. You have to be very careful and use many "*" wildcards, and I suppose that the indexing functionality in Alfresco is not intended to work like this. However, this problem only occurs with this kind of PDF file. Other PDF files are indexed correctly, including the spaces between the words.
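
To make the effect on searching concrete, here is a minimal Python sketch. It is only a simplified model and not Alfresco's actual Lucene analyzer: `tokenize` and `term_matches` are hypothetical helpers that split on whitespace and compare whole terms, with "*" as a wildcard.

```python
from fnmatch import fnmatchcase


def tokenize(text):
    """Split on whitespace and lowercase; a stand-in for a real analyzer."""
    return [token.lower() for token in text.split()]


def term_matches(index_terms, query):
    """Return True if any indexed term matches the query ('*' is a wildcard)."""
    pattern = query.lower()
    return any(fnmatchcase(term, pattern) for term in index_terms)


good_index = tokenize("Hello World 123")  # spaces preserved by the extractor
bad_index = tokenize("HelloWorld123")     # spaces lost by the extractor

print(term_matches(good_index, "World"))   # True
print(term_matches(bad_index, "World"))    # False
print(term_matches(bad_index, "*World*"))  # True
```

With the spaces lost, the whole sentence becomes a single term, so a plain word query cannot match it and only a wildcard query on both sides finds the document.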

Is there a possibility to use a different PDF indexer in Alfresco and rebuild the index from scratch or is there a fix, update or workaround available? I can create and upload some example files so that anyone can reproduce this problem.


Many thanks in advance!

gsx
Champ in-the-making
I was searching for the cause of this problem for a long time, and I think I finally know where it comes from. After finding out that Alfresco uses PDFBox to extract the text from PDF files, I searched the PDFBox bug reports and found many entries for exactly the same problem. As I don't believe this issue will be resolved soon (it has been open for 3 years now), I would like to switch to a different transformation/extraction tool and rebuild the index in Alfresco from scratch.

Is this approach possible somehow? Does anyone know of good alternatives to PDFBox that work with Alfresco?

BTW: you can find one of the reported issues in PDFBox here: http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832

You can test this behaviour with the following PDF document: http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502

I would appreciate any ideas. Thanks!

jitse
Champ in-the-making
Hi, you could replace the PDFBox transformer (which does the conversion to plain text) with pdftotext (from http://www.foolabs.com/xpdf/).
You can activate this transformer by following the wiki: http://wiki.alfresco.com/wiki/Content_Transformations
You should add a configuration somewhat like this (depending on where you place your pdftotext tool):


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
        <!-- disable standard pdfbox text transformer -->
        <bean id="transformer.PdfBox" class="java.lang.String"/>
        <!-- has the above injected, is newly created below -->
        <bean id="transformer.complex.OpenOffice.PdfBox" class="java.lang.String"/>

        <!-- pdftotext command line binary -->
        <bean id="transformer.PdfToTextTool"
                class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer"
                parent="baseContentTransformer">
                <property name="transformCommand">
                        <bean name="transformer.pdftotext.Command"
                                class="org.alfresco.util.exec.RuntimeExec">
                                <property name="commandMap">
                                        <map>
                                                <entry key="Linux.*">
                                                        <!--<value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux -enc UTF-8 ${options} ${source} ${target}</value>-->
                                                        <value>/usr/bin/pdftotext -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                                <entry key="Windows.*">
                                                        <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-win32.exe -enc UTF-8 ${options} ${source} ${target}</value>
                                                </entry>
                                        </map>
                                </property>
                                <property name="defaultProperties">
                                        <props>
                                                <prop key="options"></prop>
                                        </props>
                                </property>
                        </bean>
                </property>
                <property name="explicitTransformations">
                        <list>
                                <!--<bean
                                        class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
                                        <constructor-arg>
                                                <value>application/pdf</value>
                                        </constructor-arg>
                                        <constructor-arg>
                                                <value>text/plain</value>
                                        </constructor-arg>
                                </bean>-->
                                <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
                                        <property name="sourceMimetype"><value>application/pdf</value></property>
                                        <property name="targetMimetype"><value>text/plain</value></property>
                                </bean>
                        </list>
                </property>
        </bean>

   <!-- replaces bean transformer.complex.OpenOffice.PdfBox -->
   <bean id="transformer.complex.OpenOffice.PdfToTextTool"
        class="org.alfresco.repo.content.transform.ComplexContentTransformer"
        parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.OpenOffice" />
            <ref bean="transformer.PdfToTextTool" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>application/pdf</value>
         </list>
      </property>
   </bean>
</beans>
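
Before wiring pdftotext into Alfresco, it may be worth checking outside of Alfresco that it really keeps the spaces for the affected scans. The following Python sketch is one way to do that. It assumes Python 3.7+ and a `pdftotext` binary on the PATH; the helper names and the 30-character threshold are arbitrary choices for illustration and not part of Alfresco or Xpdf.

```python
import subprocess


def extract_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on pdf_path and return the extracted text.

    The trailing "-" asks pdftotext to write the text to stdout.
    """
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def run_together_tokens(text, max_len=30):
    """Return whitespace-separated tokens longer than max_len characters.

    Very long tokens are a hint (not proof) that spaces were lost.
    """
    return [token for token in text.split() if len(token) > max_len]
```

Calling `run_together_tokens(extract_text("sample.pdf"))` on one of the problematic files (the file name is a placeholder) should return an empty or near-empty list if the extraction keeps the spaces. Long URLs or identifiers can still show up as false positives.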

A German blogger did a nice comparison (http://thinkalfresco.blogspot.com/2009/03/speeding-up-pdf-indexing-alfresco-hack.html), which led me to this approach. I myself am still experiencing problems with the full-text indexing of PDFs in the AVM (staging sandbox), which I have yet to resolve.

I hope this helps.

Jitse

_sax
Champ in-the-making
Just for the record:
With 3.2 this doesn't seem to work, as the required class is missing.
http://forums.alfresco.com/en/viewtopic.php?f=10&t=19404

mwildam
Champ in-the-making
I am experiencing the same problem (missing RuntimeExecutableContentTransformer) and have filed a bug: https://issues.alfresco.com/jira/browse/ALFCOM-3288