topic Re: Extraction of content from PDF file in Alfresco Archive

Extraction of content from PDF file

ashwini_g_krish — Mon, 18 Mar 2013 12:27:07 GMT

Hi, we are working with Alfresco. Our basic requirement is to extract both the metadata and the content of a PDF file and save them in a text file, such as a .csv or .txt file. Kindly help us with this.

Re: Extraction of content from PDF file

mitpatoliya — Tue, 19 Mar 2013 11:52:05 GMT

I think you need to create a custom metadata extractor for your requirement.
Please refer to the following link and let me know if you have any doubts.
http://wiki.alfresco.com/wiki/Metadata_Extraction

Re: Extraction of content from PDF file

ashwini_g_krish — Tue, 19 Mar 2013 13:43:14 GMT

Hi,

Actually, it is already configured in Alfresco, but we cannot find where the extracted metadata is stored, and we do not see how to get that metadata into a text file. It would be helpful if you could provide screenshots for this problem.

Ashwini

Re: Extraction of content from PDF file

mitpatoliya — Wed, 20 Mar 2013 05:56:15 GMT

Ashwini,

What a metadata extractor does is read the properties of the files being uploaded and attach them, as metadata, to the newly created content in Alfresco.
For example:
Any Word or PDF document on your C or D drive has a set of properties such as author, name, title and so on, right?
When you upload it into Alfresco,
the metadata extractor comes into the picture: it extracts those properties and attaches them to the newly created file in Alfresco as metadata, according to the content model in Alfresco.

For your requirement, you can either extend the handler class of the metadata extractor to achieve what you are looking for
or
create a script which reads those properties and creates a txt file (the simpler approach).

Re: Extraction of content from PDF file

ashwini_g_krish — Wed, 20 Mar 2013 09:29:00 GMT

Hi Mits,

Thank you for the reply. It will be helpful for us. I will create the script, but I am confused about where to put it and how to run it.
We are able to run the rules, but we cannot see where the data is being stored. Also, can we specify the destination folder while creating rules?

Re: Extraction of content from PDF file

mitpatoliya — Thu, 21 Mar 2013 07:35:59 GMT

Put the script in Data Dictionary > Scripts.
Create a rule which executes that script on arrival of PDF files.
The script should create a txt file, read all the metadata from the PDF file, and write it into the newly created file.
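
As a rough illustration of such a script (a sketch, not a tested solution), the following writes a few standard properties of the uploaded PDF to a CSV file in the same folder. The helper names `csvField` and `toCsv` are made up for this example, the choice of `cm:` properties is an assumption about which metadata is wanted, and the syntax is kept old-style because Alfresco runs repository scripts on Rhino.

```javascript
// Quote one value for CSV: wrap it in double quotes and double any embedded quotes.
// (Hypothetical helper, not part of the Alfresco API.)
function csvField(value) {
  var text = (value === null || value === undefined) ? "" : String(value);
  return '"' + text.replace(/"/g, '""') + '"';
}

// Build a two-line CSV (header row, then value row) from a plain key/value object.
// (Hypothetical helper, not part of the Alfresco API.)
function toCsv(fields) {
  var names = [];
  var values = [];
  for (var key in fields) {
    if (Object.prototype.hasOwnProperty.call(fields, key)) {
      names.push(csvField(key));
      values.push(csvField(fields[key]));
    }
  }
  return names.join(",") + "\n" + values.join(",") + "\n";
}

// `document` is supplied by Alfresco when the script runs from a rule.
// The mimetype check keeps the script from reacting to the CSV file it creates.
if (typeof document !== "undefined" && document.properties &&
    String(document.mimetype) === "application/pdf") {
  var csv = toCsv({
    name: document.properties["cm:name"],
    title: document.properties["cm:title"],
    author: document.properties["cm:author"],
    created: document.properties["cm:created"]
  });
  // Assumption: the output goes next to the PDF; any folder node could be used instead.
  var out = document.parent.createFile(document.properties["cm:name"] + ".csv");
  out.content = csv;
}
```

This covers the metadata only. For the text content of the PDF, the JavaScript API's `transformDocument("text/plain")` on the document node may be an option, but check that it is available and configured in your Alfresco version. Note also that `createFile` will fail if a file with the same name already exists in the folder.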

Re: Extraction of content from PDF file

ashwini_g_krish — Thu, 21 Mar 2013 08:59:00 GMT

Thank you for the solution.
Can you please provide a simple script, one that works in Alfresco, for getting the filename of the uploaded file?

Re: Extraction of content from PDF file

mitpatoliya — Fri, 22 Mar 2013 06:56:00 GMT

var filename = document.properties.name;

The document object is readily available in the context when you invoke the script via a rule on document arrival.
It points to the current document, the one being uploaded.
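
To show the filename being used for something, here is a small sketch that derives a .txt name from it and logs both. `textFileNameFor` is a hypothetical helper written for this example. `logger.log` is the script logger of the JavaScript API, and its output normally appears only when debug logging for scripts is enabled.

```javascript
// Swap the last extension of a filename for ".txt" (hypothetical helper).
function textFileNameFor(name) {
  var base = String(name).replace(/\.[^.\/\\]+$/, "");
  return base + ".txt";
}

// `document` and `logger` exist only inside Alfresco, so guard their use.
if (typeof document !== "undefined" && document.properties) {
  var filename = document.properties.name;
  if (typeof logger !== "undefined") {
    logger.log("Uploaded file: " + filename + " -> " + textFileNameFor(filename));
  }
}
```

The guards let the helper be tried outside Alfresco as well. Inside a rule they are always true and could be dropped.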

Re: Extraction of content from PDF file

ashwini_g_krish — Fri, 22 Mar 2013 13:31:11 GMT

Thank you so much for the reply.