Extraction of content from PDF file

ashwini_g_krish
Champ in-the-making
Hi,

We are working with Alfresco, and our basic requirement is to extract the metadata and the content of a PDF file and save both in a text file, such as a .csv or .txt file.

Kindly help us with this.
8 REPLIES

mitpatoliya
Star Collaborator
I think you need to create a custom metadata extractor for your requirement.
Please refer to the following link and let me know if you have any questions.
http://wiki.alfresco.com/wiki/Metadata_Extraction

Hi,

Actually, it is already configured in Alfresco, but we cannot find where the extracted metadata is stored, and we do not see how to get that metadata into a text file. It would be helpful if you could provide screenshots for this problem.


Ashwini

mitpatoliya
Star Collaborator
Ashwini,

What the metadata extractor does is read the properties of a file that is being uploaded and attach them, as metadata, to the newly created content in Alfresco.
For example:
A Word or PDF document on your C: or D: drive has a set of properties such as author, name, title, etc.
When you upload it into Alfresco, the metadata extractor comes into the picture: it extracts those properties and attaches them to the newly created file as metadata, mapped according to the content model in Alfresco.

For your requirement, you can either extend the handler class of the metadata extractor to do what you are looking for
or
create a script that reads those properties and creates a .txt file (the simpler approach).
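
Since the original question mentions CSV output, here is a minimal sketch of how the properties could be turned into one CSV row. The helpers toCsvField and toCsvRow are hypothetical names, not part of Alfresco, and the property keys (cm:name, cm:title, cm:author) are assumed to come from the standard content model; a custom model would need its own keys.

```javascript
// Sketch only: toCsvField and toCsvRow are hypothetical helpers, not Alfresco APIs.

// Quote one value for CSV: wrap it in double quotes and double any embedded quotes.
function toCsvField(value) {
    var text = (value === null || value === undefined) ? "" : String(value);
    return '"' + text.replace(/"/g, '""') + '"';
}

// Build one CSV row from a properties map, in the order given by keys.
function toCsvRow(props, keys) {
    var fields = [];
    for (var i = 0; i < keys.length; i++) {
        fields.push(toCsvField(props[keys[i]]));
    }
    return fields.join(",");
}

// Inside an Alfresco rule script, "document" is the uploaded node.
// The guard lets the helpers be tried outside Alfresco as well.
if (typeof document !== "undefined" && document.properties) {
    var row = toCsvRow(document.properties, ["cm:name", "cm:title", "cm:author"]);
    logger.log(row);
}
```

Quoting every field keeps commas or quotes inside a title or author name from breaking the row.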

Hi Mits,

Thank you for the reply; it will be helpful for us. I will create the script, but I am confused about where to put it and how to run it.
We are able to run the rules, but we cannot see where the data is being stored. Can we specify the destination folder while creating rules?

mitpatoliya
Star Collaborator
Put the script in Data Dictionary > Scripts.
Create a rule that executes that script when PDF files arrive.
The script should create a .txt file, read all the metadata from the PDF file, and write it into the newly created file.
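
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested production script: buildReport and writeMetadataFile are hypothetical helper names, the ".metadata.txt" suffix and the list of property keys are arbitrary choices, and the transformDocument call assumes a PDF-to-text transformer is available on the server.

```javascript
// Sketch only: buildReport and writeMetadataFile are hypothetical helper names.

// Turn selected properties of a node into "key: value" lines.
function buildReport(doc, keys) {
    var lines = [];
    for (var i = 0; i < keys.length; i++) {
        var value = doc.properties[keys[i]];
        var text = (value === null || value === undefined) ? "" : String(value);
        lines.push(keys[i] + ": " + text);
    }
    return lines.join("\n");
}

// Create "<name>.metadata.txt" in the given folder and write the report into it.
function writeMetadataFile(doc, folder, keys) {
    var target = folder.createFile(doc.name + ".metadata.txt");
    target.content = buildReport(doc, keys);
    return target;
}

// In a rule script, "document" is the arriving file and its parent is the
// folder it landed in. The guard lets the helpers be tried outside Alfresco.
if (typeof document !== "undefined" && document.properties && document.parent) {
    writeMetadataFile(document, document.parent, ["cm:name", "cm:title", "cm:author"]);
    // Assumption: a text transformer for PDF is available; transformDocument
    // then creates a plain-text copy of the content next to the original.
    document.transformDocument("text/plain");
}
```

The destination is simply the folder passed to writeMetadataFile, so a different folder node (for instance one looked up with companyhome.childByNamePath) could be passed instead of document.parent. If the output goes into the same folder, restricting the rule to PDF files should keep the new .txt files from triggering it again.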

Thank you for the solution.
Can you please provide a simple script for getting the filename of the uploaded file that works in Alfresco?

mitpatoliya
Star Collaborator
var filename = document.properties.name;

The document object is readily available in the context when you invoke the script via a rule on document arrival.
It points to the current document that is being uploaded.
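
As a slightly fuller sketch, the filename can also be read through the name shortcut on the node and logged together with the MIME type. describeUpload is a hypothetical helper name, and the guard is there only so the function can be tried outside Alfresco, where no rule context exists.

```javascript
// Sketch only: describeUpload is a hypothetical helper name.
function describeUpload(doc) {
    // doc.name is a shortcut for the cm:name property, i.e. the filename.
    return "Uploaded: " + doc.name + " (" + doc.mimetype + ")";
}

// "document" and "logger" are provided by Alfresco when the script runs from a rule.
if (typeof document !== "undefined" && document.properties) {
    logger.log(describeUpload(document));
}
```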

Thank you so much for the reply.