PDF full-text indexing: Alfresco indexer ignores blanks/spaces
06-01-2009 05:58 AM
Hi all,
I have a question about the PDF indexing functionality in Alfresco. I have many PDF files that were created with a document scanner (Fujitsu ScanSnap S510), which scans documents, runs OCR, and creates PDF files. The Fujitsu software suite uses ABBYY tools for this purpose.
These PDF documents are of course searchable afterwards, and I have had no problems so far. In the past I used Windows Desktop Search (indexing functionality) to run full-text searches on these documents (both the Foxit and the Adobe IFilters work fine). Other search engines treat these documents like any other PDF file without problems.
The Problem:
After installing Alfresco 3.0, I am experiencing one big problem with the kind of PDF files mentioned above. The integrated indexer seems to ignore the blanks/spaces between words when it indexes these PDF files. To illustrate, here are some examples of what the index looks like:
***********************************
Original Text in PDF file: Hello World 123
Indexed Text in Alfresco: HelloWorld123
Original Text in PDF file: Alfresco Open Source Enterprise Content Management System including document management
Indexed Text in Alfresco: AlfrescoOpenSourceEnterpriseContentManagementSystemincludingdocumentmanagement
***********************************
The main problem is that doing a full-text search on these PDF files is very hard. You have to be very careful and use many "*" wildcards, and I suppose the indexing functionality in Alfresco is not intended to work like this. However, the problem only occurs with this kind of PDF file. Other PDF files are indexed correctly, including the spaces between the words.
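To make the effect on searching concrete, here is a minimal sketch of why a plain term query misses the glued text while a wildcard query still hits it. It assumes a simplified whitespace tokenizer and shell-style wildcards; it is not Alfresco's or Lucene's actual analyzer, and the helper names `tokens` and `matches` are made up for illustration.

```python
from fnmatch import fnmatchcase


def tokens(text):
    """Split text into lowercase terms on whitespace (simplified tokenizer)."""
    return text.lower().split()


def matches(query, text):
    """Return True if any indexed term matches the (possibly wildcarded) query."""
    pattern = query.lower()
    return any(fnmatchcase(term, pattern) for term in tokens(text))


original = "Hello World 123"   # text as it appears in the PDF
indexed = "HelloWorld123"      # text as it ends up in the index

print(matches("world", original))   # True: "world" is its own term
print(matches("world", indexed))    # False: the only term is "helloworld123"
print(matches("*world*", indexed))  # True: only a wildcard query finds it
```

With the spaces gone, the whole sentence becomes a single term, so every search word has to be wrapped in wildcards to match at all.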
Is it possible to use a different PDF indexer in Alfresco and rebuild the index from scratch, or is there a fix, update, or workaround available? I can create and upload some example files so that anyone can reproduce this problem.
Many thanks in advance!
4 REPLIES
06-01-2009 06:50 AM
I have been searching for the cause of this problem for a long time, and I think I finally know where it comes from. After finding out that Alfresco uses PDFBox to extract the text from PDF files, I searched the PDFBox bug reports and found many entries describing exactly the same problem. As I don't believe this issue will be resolved soon (it has been open for 3 years now), I would like to switch to a different transformation/extraction tool and rebuild the index in Alfresco from scratch.
Is this approach possible somehow? Does someone know good alternatives to PDFBox that are working together with Alfresco?
BTW: you can find one of the reported issues in PDFBox here: http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832
You can test this behaviour with the following PDF document: http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502
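When comparing extractors on a test document like this, one rough way to spot glued words in the extracted text is to look at the longest whitespace-separated token. This is only a heuristic sketch: the function names and the threshold of 40 characters are assumptions chosen for illustration, and short glued runs will slip through.

```python
def longest_token(text):
    """Length of the longest whitespace-separated token (0 for empty text)."""
    return max((len(token) for token in text.split()), default=0)


def looks_glued(text, max_word_length=40):
    """Heuristic: flag text containing a token longer than max_word_length."""
    return longest_token(text) > max_word_length


good = "Alfresco Open Source Enterprise Content Management System"
bad = "AlfrescoOpenSourceEnterpriseContentManagementSystemincludingdocumentmanagement"

print(looks_glued(good))             # False
print(looks_glued(bad))              # True
print(looks_glued("HelloWorld123"))  # False: too short for this heuristic
```

Running the same check over the output of two extraction tools gives a quick first impression of which one preserves the spaces, before reindexing anything.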
I would appreciate any ideas. Thanks!
07-15-2009 05:37 AM
Hi, you could change the PDFBox transformer (which does the conversion to plain text) to pdftotext (from http://www.foolabs.com/xpdf/).
You can activate this transformer by following the wiki: http://wiki.alfresco.com/wiki/Content_Transformations
You should add a configuration somewhat like this (depending on where you place your pdftotext tool):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>

  <!-- disable standard pdfbox text transformer -->
  <bean id="transformer.PdfBox" class="java.lang.String"/>

  <!-- has the above injected, is newly created below -->
  <bean id="transformer.complex.OpenOffice.PdfBox" class="java.lang.String"/>

  <!-- pdftotext command line binary -->
  <bean id="transformer.PdfToTextTool" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
    <property name="transformCommand">
      <bean name="transformer.pdftotext.Command" class="org.alfresco.util.exec.RuntimeExec">
        <property name="commandMap">
          <map>
            <entry key="Linux.*">
              <!--<value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux -enc UTF-8 ${options} ${source} ${target}</value>-->
              <value>/usr/bin/pdftotext -enc UTF-8 ${options} ${source} ${target}</value>
            </entry>
            <entry key="Windows.*">
              <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-win32.exe -enc UTF-8 ${options} ${source} ${target}</value>
            </entry>
          </map>
        </property>
        <property name="defaultProperties">
          <props>
            <prop key="options"></prop>
          </props>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <!--<bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
          <constructor-arg>
            <value>application/pdf</value>
          </constructor-arg>
          <constructor-arg>
            <value>text/plain</value>
          </constructor-arg>
        </bean>-->
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>application/pdf</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
      </list>
    </property>
  </bean>

  <!-- replaces bean transformer.complex.OpenOffice.PdfBox -->
  <bean id="transformer.complex.OpenOffice.PdfToTextTool" class="org.alfresco.repo.content.transform.ComplexContentTransformer" parent="baseContentTransformer">
    <property name="transformers">
      <list>
        <ref bean="transformer.OpenOffice" />
        <ref bean="transformer.PdfToTextTool" />
      </list>
    </property>
    <property name="intermediateMimetypes">
      <list>
        <value>application/pdf</value>
      </list>
    </property>
  </bean>

</beans>
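Before wiring the command into Alfresco, it can help to run the same pdftotext command line by hand on one of the problem PDFs and check that the output contains the spaces. The sketch below fills in the `${options}`, `${source}` and `${target}` placeholders of the Linux entry above and runs the binary if it is installed. It only imitates the substitution for testing outside Alfresco and is not how `RuntimeExec` is implemented; the function names are hypothetical.

```python
import shutil
import subprocess
from string import Template

# Same command template as the Linux entry in the configuration above.
COMMAND_TEMPLATE = "/usr/bin/pdftotext -enc UTF-8 ${options} ${source} ${target}"


def build_command(source, target, options=""):
    """Fill in the placeholders token by token so paths with spaces stay intact."""
    command = []
    for token in COMMAND_TEMPLATE.split():
        if token == "${options}":
            # An empty options string contributes no arguments at all.
            command.extend(options.split())
        else:
            command.append(Template(token).substitute(source=source, target=target))
    return command


def extract_text(source, target, options=""):
    """Run pdftotext if the binary exists; return True when it exits with 0."""
    command = build_command(source, target, options)
    if shutil.which(command[0]) is None:
        return False  # pdftotext is not installed at the configured path
    return subprocess.run(command).returncode == 0


print(build_command("scan.pdf", "scan.txt"))
# ['/usr/bin/pdftotext', '-enc', 'UTF-8', 'scan.pdf', 'scan.txt']
```

If the resulting text file has the words separated correctly, the same command should be a reasonable candidate for the `commandMap` entry.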
A German guy (http://thinkalfresco.blogspot.com/2009/03/speeding-up-pdf-indexing-alfresco-hack.html) did a nice comparison, which led me to it, although I myself have been experiencing problems with the full-text indexing of PDFs in the AVM (staging sandbox), which I have yet to resolve.
I hope this helps.
Jitse

07-17-2009 09:26 AM
Just for the record:
With 3.2 this doesn't seem to work, as the required class is missing.
http://forums.alfresco.com/en/viewtopic.php?f=10&t=19404

08-05-2009 09:12 AM
I am experiencing the same problem (missing RuntimeExecutableContentTransformer), so I have filed a bug: https://issues.alfresco.com/jira/browse/ALFCOM-3288
