Indexing of PDF custom fields?

pascalsartorett
Champ in-the-making
I would like to store scanned images in Alfresco together with some metadata (either recognized by OCR or entered by manual indexing). We usually do this by creating a PDF that carries the metadata in custom fields (e.g. "Customer" => "ACME Inc.").

Problem: Alfresco doesn't seem to index the PDF custom fields… I thought Lucene was to blame, but it seems (?) that it is not Lucene that handles the PDF but another converter, such as XPDF.

Hence my questions:

1- How could Alfresco also index the metadata? By using another converter?
2- Does anybody see another workaround?

I know that I could store the data and metadata in two separate files (TIFF + XML), but it would really be much better to have them in a single file.
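
One way to picture the single-file approach: once a PDF library has read the document-information entries into key/value pairs, a small mapping table decides which custom keys become repository properties. The following is a minimal sketch in plain Java, not Alfresco's or PDFBox's API; the class `CustomFieldMapper`, the property name `acme:customer`, and the sample values are all hypothetical, and reading the PDF itself is assumed to happen elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: maps raw PDF info-dictionary keys to repository property names.
class CustomFieldMapper {

    /**
     * Returns only the fields that have an entry in the mapping table.
     * Keys without a mapping, and blank values, are skipped.
     */
    static Map<String, String> mapFields(Map<String, String> rawFields, Properties mapping) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : rawFields.entrySet()) {
            String target = mapping.getProperty(entry.getKey());
            String value = entry.getValue();
            if (target != null && value != null && !value.trim().isEmpty()) {
                result.put(target, value.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Mapping table: PDF field name -> repository property (names are assumptions).
        Properties mapping = new Properties();
        mapping.setProperty("Author", "cm:author");
        mapping.setProperty("Customer", "acme:customer");

        // Stand-in for the key/value pairs read from the PDF.
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("Author", "Pascal");
        raw.put("Customer", "ACME Inc.");
        raw.put("Producer", "Some scanner software"); // no mapping, so it is skipped

        System.out.println(mapFields(raw, mapping));
    }
}
```

Keeping the mapping in a `Properties` table means a new custom field only needs a new mapping line, not a code change.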

Pascal
2 REPLIES

aatamer
Champ in-the-making
Pascal,
We have the same problem in our system.  I've found it's much easier to keep the metadata in the file and have a way of extracting it afterwards, rather than always carrying the metadata along in a separate database.  At least Alfresco can pull out the necessary Title/Description/Author fields and display them in your content view, based on content rules.

PDF metadata extraction in Alfresco doesn't seem to work properly.  Sometimes nothing gets extracted, sometimes gibberish gets extracted. So I would be as interested in a solution to this as you are.

amarendrakt
Champ on-the-rise
Hi Friends,

I'm also facing the same problem. I'm able to read data from PDF files (PDF forms) using PDFBox, but I want this information to be extracted as metadata, the same way the Name/Author/Description of a PDF document are, and displayed on screen.

I have my own custom aspect which shows the status of customers (Active/Inactive). For this we have PDF forms with text fields in which the customer status is entered. I have developed a class that reads these text fields, but now I want this data, which is simply the custom metadata for the customer, to be extracted automatically (not entered manually) when the file is added to Alfresco.
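
Because the status comes from a free-text form field, it probably needs validating before it is stored on the aspect. Below is a minimal sketch in plain Java, assuming the text-field value has already been read (for example by the PDFBox-based class mentioned above); `CustomerStatusParser` and its `Status` enum are hypothetical names and not part of Alfresco or PDFBox.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: turns a raw form-field value into a customer status.
class CustomerStatusParser {

    enum Status { ACTIVE, INACTIVE }

    /**
     * Accepts "Active" or "Inactive" in any letter case, with surrounding whitespace.
     * Anything else yields an empty Optional so the caller can decide how to react.
     */
    static Optional<Status> parse(String fieldValue) {
        if (fieldValue == null) {
            return Optional.empty();
        }
        String normalized = fieldValue.trim().toUpperCase(Locale.ROOT);
        switch (normalized) {
            case "ACTIVE":
                return Optional.of(Status.ACTIVE);
            case "INACTIVE":
                return Optional.of(Status.INACTIVE);
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Sample values standing in for what a PDF text field might contain.
        System.out.println(parse(" active "));
        System.out.println(parse("Inactive"));
        System.out.println(parse("unknown"));
    }
}
```

Returning an empty `Optional` for unrecognized text keeps a mistyped form value from being stored as if it were a valid status.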

Can anyone help me out in this?

I have already posted two questions on the forum about extracting metadata, but I think nobody knows how to do it or nobody is interested.