Search inside PDF files

icarrara
Champ in-the-making
Hi! I'm testing Alfresco 2.1 and 2.2 and I have a question about searching content inside PDF files.

Using the Web Client, I'm trying to search for some strings that are present inside PDF files, but no PDF file is returned by the query.

The search works when I search for text that is present inside .doc and .txt files, but no results are returned for strings present in PDF files.

What am I doing wrong?

Thanks in advance for any help!

Ivano Carrara
9 REPLIES

icarrara
Champ in-the-making
P.S.: Note that OpenOffice is currently NOT installed on the server where Alfresco is running.

cricalix
Champ in-the-making
If you open the PDF file in a PDF reader such as Acrobat, can you select the words as text to paste them elsewhere, or does it only offer to 'Copy Image'?

If it's the latter, the PDF has no text content that Alfresco can index, and you'd need to OCR the document first.
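
When there are too many files to open one by one in a reader, the same check can be roughly approximated in a script. The following is a minimal sketch in Python, assuming that an image-only PDF declares no `/Font` resources anywhere in the file. `pdf_probably_has_text` is a hypothetical helper name, not part of Alfresco or PDFBox. Encrypted PDFs are not handled, and a `True` result does not guarantee that the text is actually extractable.

```python
import re
import zlib

# Matches the raw bytes between "stream" and "endstream" in a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def pdf_probably_has_text(data: bytes) -> bool:
    """Heuristic: True if the PDF bytes mention a /Font resource anywhere."""
    if b"/Font" in data:
        return True
    # Newer PDFs may keep their dictionaries inside Flate-compressed object
    # streams, so inflate each stream and look again.
    for match in STREAM_RE.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it
        if b"/Font" in inflated:
            return True
    return False
```

Reading each file in binary mode and passing the bytes to the function gives a first-pass list of candidates for OCR. Anything it flags as having no fonts is still worth confirming by hand with the copy-and-paste test above.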

kevinr
Star Contributor
OpenOffice is not used for PDF to text conversion. We use a library called PDFBox for that. It works with 99% of PDFs we have tried - but as user 'cricalix' noted, it can only index plain text in the document.

Thanks,

Kevin

braulio_moura
Champ in-the-making
Hi!

I've deployed alfresco.war in a Tomcat environment, and I'm seeing the same problem: search doesn't work with .pdf files.

My .pdf files contain text, so I can copy their content…

Can you help me?

kevinr
Star Contributor
Do you see any errors in the logs?

Any other users out there have the same issue?

Thanks,

Kevin

fugu
Champ in-the-making
We have an Alfresco 3.0 Labs installation with 270,000 PDF documents that have been through OCR, and Alfresco does not search inside them.

Can someone please let us know if we have to activate something?

Regards,

Adrian Cadena

savic_prvoslav
Champ on-the-rise
I must ask: have you turned on Advanced Search and set
Content Format: Adobe PDF Document
Show me results for: File names and contents
This works nicely for me.

chrisb
Champ in-the-making
I don't know if this is related, but it might be if the PDF content stored in Alfresco was generated by the OpenOffice PDF converter.

We use Alfresco 2.1.1E to let users upload MS Office documents to an Alfresco DM repository; Alfresco converts them to PDF using OpenOffice before storing them. Users report that searching inside the OpenOffice-generated PDFs in Adobe Reader doesn't work as expected: the search results in Adobe Reader don't take you to the correct location in the PDF.

Perhaps this also causes a problem when the Lucene search engine indexes the PDF content?

Clearly this won't apply if the PDFs are generated in another manner and then uploaded to the repo.

yunda
Champ in-the-making
You can try TextFinding.com to search inside PDF files. It is useful for searching PDFs.