PDF Indexing

smca
Champ in-the-making
I am new to Alfresco… downloaded and installed just a few hours ago. It's up and running and I can log in, upload files and create users.

The problem I am facing is that the PDF files I uploaded can only be searched by filename/title, while other files, e.g. Word .doc files, can be searched using words in the file.

So, if I have Raptors.doc and Raptors.pdf, and both contain the word "Toronto" in them, a search for "Raptors" retrieves both documents, but a search for "Toronto" retrieves just the .doc file, not the .pdf file.

As a side note, both the .pdf and the .doc files were created from the same source in Google Docs & Spreadsheet.

Am I missing something?

Thanks in advance.

____________________________________

added after initial post
____________________________________

I noticed that PDFs under 500kB have no issues with indexing. However, for *large* PDFs (say above 3 MB, which isn't really large, I have PDFs well over 70MB), I get the following error message:

Metadata extraction failed: reader: ContentAccessor[ contentURL=store://C:\Alfresco\tomcat\temp\Alfresco\alfresco39438.upload, mimetype=application/pdf, size=12976429, encoding=UTF-8] extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@17349d7

kevinr
Star Contributor
I noticed that PDFs under 500kB have no issues with indexing. However, for *large* PDFs (say above 3 MB, which isn't really large, I have PDFs well over 70MB), I get the following error message:

Metadata extraction failed: reader: ContentAccessor[ contentURL=store://C:\Alfresco\tomcat\temp\Alfresco\alfresco39438.upload, mimetype=application/pdf, size=12976429, encoding=UTF-8] extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@17349d7

Internally we use the open source PDFBox library to convert PDF documents to text. It is possible the library is having a problem with certain documents - are there any other errors (stack trace?) in the log? Can you enter the following search in the search box in the web client:

nift

It should return those documents that failed to index because the transformation engine failed.

If a transformation takes too long, it is shunted into a background thread, but it should still complete unless there is an error.
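To illustrate the general pattern described above (wait a bounded time for a conversion, then let it finish in the background), here is a minimal Python sketch. It is not Alfresco's implementation; `run_with_deadline`, `slow_to_text` and the timings are hypothetical names and values chosen only for the example.

```python
import concurrent.futures
import time


def run_with_deadline(executor, transform, content, deadline_s):
    """Start transform(content) on a worker thread and wait up to deadline_s.

    Returns (result, future). result is None if the deadline passed first;
    the future keeps running and can be checked later. If the transform
    raises, the exception surfaces when the future's result is requested.
    """
    future = executor.submit(transform, content)
    try:
        return future.result(timeout=deadline_s), future
    except concurrent.futures.TimeoutError:
        return None, future


def slow_to_text(content):
    # Hypothetical stand-in for a slow PDF-to-text conversion.
    time.sleep(0.3)
    return content.upper()


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        result, pending = run_with_deadline(pool, slow_to_text, "toronto", 0.05)
        print("finished in time:", result is not None)
        # The caller has moved on, but the conversion still completes.
        print("background result:", pending.result())
```

In this sketch a slow conversion only delays the result; the result goes missing only when the conversion itself raises an error.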

Thanks,

Kevin

smca
Champ in-the-making
I searched for nift - no results.

I searched for nitf - all the documents that failed indexing showed up!

I noticed that my setup can cope with large (i.e. >10MB) Word .doc and OpenOffice .sxw documents. It has trouble only with .pdf files.

By the way, I am pretty new to Tomcat and Alfresco. How do I get a stack trace and where is the log file usually located?

smca
Champ in-the-making
Is there any way around the issue of indexing failures (at least in hardware if not in software)? Like adding more RAM (I have 2GB) or a more powerful processor (I am using an Athlon X2 6000+). As far as the specs of my machine are concerned, it is nothing to sneeze at. It's quite powerful.

At this time, I am still testing various document management systems (Alfresco, DSpace, Plone, etc.) to keep track of all documents created by everybody in my family (currently over 5GB of .doc, .pdf, .xls, .ppt, .rtf … some are over 9 years old…). If hardware is the limiting factor on my "test" system, I would consider a more powerful system for the final deployment, like a quad-core CPU. How much more powerful can a home PC get beyond what I have in terms of CPU and RAM unless I move to 64-bit, which I may if push comes to shove?

kevinr
Star Contributor
The machine specs sound more than enough. The problem is the PDFBox library failing on certain PDF files - it's nothing to do with memory/CPU given your machine specs. It may be worth trying to update the PDFBox library used in Alfresco if there is a newer version available; otherwise we would need to submit bugs to the author. Failing that, if you can find another PDF->text Java library, it could be configured in as the transformer instead of PDFBox.

Thanks,

Kevin

manfred99
Champ in-the-making
Environment: Alfresco 2.0, Tomcat5, Linux

When uploading a 3.3MB PDF the following error occurs:

14:51:10,569 WARN  [org.alfresco.web.bean.repository.Repository] Metadata extraction failed: 
   reader: ContentAccessor[ contentUrl=store:///srv/www/tomcat5/base/temp/Alfresco/alfresco40141.upload, mimetype=application/pdf, size=0, encoding=UTF-8]
   extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@5cd46b
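
For anyone collecting these warnings from a log, a rough Python sketch like the one below pulls the content URL, MIME type and reported size out of each "Metadata extraction failed" entry. The regular expression is an assumption based only on the two log excerpts in this thread (it accepts both the `contentURL` and `contentUrl` spellings), and `failed_extractions` is a made-up helper name, so adjust it to whatever your log actually contains.

```python
import re

# Pattern inferred from the log excerpts quoted in this thread (an assumption,
# not an official log format). IGNORECASE covers contentURL / contentUrl.
ENTRY = re.compile(
    r"Metadata extraction failed:\s*"
    r"reader:\s*ContentAccessor\[\s*contentURL=(?P<url>[^,\s]+),\s*"
    r"mimetype=(?P<mimetype>[^,\s]+),\s*"
    r"size=(?P<size>\d+),\s*"
    r"encoding=(?P<encoding>[^\]\s]+)\]",
    re.IGNORECASE,
)


def failed_extractions(log_text):
    """Return one dict per 'Metadata extraction failed' entry in log_text."""
    return [
        {
            "url": m.group("url"),
            "mimetype": m.group("mimetype"),
            "size": int(m.group("size")),
            "encoding": m.group("encoding"),
        }
        for m in ENTRY.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "14:51:10,569 WARN  [org.alfresco.web.bean.repository.Repository] "
        "Metadata extraction failed: \n"
        "   reader: ContentAccessor[ contentUrl=store:///srv/www/tomcat5/base/"
        "temp/Alfresco/alfresco40141.upload, mimetype=application/pdf, size=0, "
        "encoding=UTF-8]\n"
        "   extracter: org.alfresco.repo.content.metadata."
        "PdfBoxMetadataExtracter@5cd46b\n"
    )
    for entry in failed_extractions(sample):
        print(entry["mimetype"], entry["size"], entry["url"])
```

Note that the entry above reports size=0 for a 3.3MB upload, whereas the Windows entry earlier in the thread reports the full byte count, so the size field is worth checking when comparing failures.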

forcev
Champ in-the-making
I've got a slightly different issue, in that I have .odt and .pdf files with exactly the same content (the .pdf is created from the .odt in OpenOffice).  When I run a search only the .pdf files are returned.

I am running v2.0.0 Community on SUSE 10.0, and have tried the search as several users including "guest" and "admin" and get the same result.  The documents are in the Guest Space, with consumer access to all.