Metadata Extractor for PDF Forms broken

03-05-2012 10:59 AM
I think that PDFBox has a bug that prevents reading PDF Forms to populate metadata.
https://issues.apache.org/jira/browse/PDFBOX-1100
As a result, I need to develop a way to read the values from the fields in a PDF Form.
It appears that Acrobat is capable of running JavaScript (http://partners.adobe.com/public/developer/en/acrobat/sdk/AcroJSGuide.pdf), which got me thinking that perhaps Alfresco web scripts could read the PDF Forms.
Has anyone taken this approach? Is it feasible?
Is there another way to read the data from PDF Forms to populate the metadata in Alfresco?
14 REPLIES

08-06-2012 10:05 PM
The log message is an INFO message so that's okay.
I'm no expert, but I suspect that Solr uses Tika to extract metadata during the index process. So yes, it should use the new jars as well.
No override of the classes is required. The new jars enable Tika and PDFBox to extract form fields using the standard approach.
As you know, each PDF Form field must have a name defined. The extractor then maps the form field name to a custom metadata field for that content type.
Custom Content Types: http://docs.alfresco.com/4.0/topic/com.alfresco.enterprise.doc/tasks/kb-define-custom-model.html
Metadata Extraction: http://docs.alfresco.com/4.0/topic/com.alfresco.enterprise.doc/tasks/metadata-config.html
> For instance, say I have a test PDF form, created in LiveCycle, with a field in it named "testData". If I had also defined in our content model an aspect "my:testAspect" with a property "my:testData", would extracting common metadata on the PDF cause it to gain the my:testAspect aspect with the my:testData property set to whatever was entered in the form field (without further modification)?
My prototype was developed using a content type with specific properties. I am pretty sure that the aspect will be added and the field mapped to the aspect property as you suggest.
Here's an example where I defined a namespace (myns) and used it both as a prefix for the form fields in LiveCycle and as the namespace for custom metadata.
PDF Form Field: myns_projectName
Custom Metadata: myns:projectName

<!-- This adds in the extra mapping for the Open Document extractor -->
<bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter" parent="baseMetadataExtracter">
    <property name="inheritDefaultMapping">
        <value>true</value>
    </property>
    <property name="mappingProperties">
        <props>
            <!-- Metadata extraction -->
            <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
            <prop key="namespace.prefix.myns">http://www.acme.com/model/content/1.0</prop>
            <!-- My Namespace Project Model -->
            <prop key="myns_projectName">myns:projectName</prop>
            <prop key="myns_organizationName">myns:organizationName</prop>
            <prop key="myns_organizationAddress">myns:organizationAddress</prop>
        </props>
    </property>
</bean>
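For anyone trying to picture what a mapping like this does, here is a minimal sketch in plain Java. It is not Alfresco's actual implementation: the class name FieldMappingSketch and the method resolve are made up for illustration, and it assumes every mapping target has the form prefix:localName with the prefix declared under a namespace.prefix.* key.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical illustration of the mapping step; not Alfresco code. */
class FieldMappingSketch {

    private static final String PREFIX_KEY = "namespace.prefix.";

    /** Resolves each mapped form field name to a "{namespaceUri}localName" string. */
    static Map<String, String> resolve(Properties mapping) {
        Map<String, String> namespaces = new TreeMap<>();
        Map<String, String> resolved = new TreeMap<>();

        // First pass: collect the namespace prefix declarations.
        for (String key : mapping.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                namespaces.put(key.substring(PREFIX_KEY.length()), mapping.getProperty(key));
            }
        }

        // Second pass: expand every "prefix:localName" target.
        for (String key : mapping.stringPropertyNames()) {
            if (key.startsWith(PREFIX_KEY)) {
                continue;
            }
            String target = mapping.getProperty(key);
            int colon = target.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Target has no prefix: " + target);
            }
            String uri = namespaces.get(target.substring(0, colon));
            if (uri == null) {
                throw new IllegalArgumentException("Undeclared prefix in: " + target);
            }
            resolved.put(key, "{" + uri + "}" + target.substring(colon + 1));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Properties mapping = new Properties();
        mapping.setProperty("namespace.prefix.myns", "http://www.acme.com/model/content/1.0");
        mapping.setProperty("myns_projectName", "myns:projectName");
        mapping.setProperty("myns_organizationName", "myns:organizationName");

        // Print each form field name next to the qualified property it maps to.
        for (Map.Entry<String, String> entry : resolve(mapping).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The point is only that the form field name on the left is an arbitrary string, while the value on the right has to use a prefix that is declared in the same mapping.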
Just to emphasize, I developed this only far enough to prove the concept. I wasn't afforded the time to test it rigorously and it has not been put into production. My hope is that Tika gets updated to support PDF form field metadata extraction and that Alfresco is updated to use it with PDFBox 1.7.x, so that this level of customization is not necessary.
I hope this answers your questions.
I'm on the road visiting colleges with my son the next couple of days so it'll be the end of the week before I can follow up on this thread.
08-06-2012 11:35 PM
Hi Steve,
So does that work for you with properties entered only in the form fields? That is, does it work without the JavaScript from your other post setting custom properties from the form fields?
The only way I am able to get the test data out of the form (config as per my previous post) appears to be by setting a custom property. If I use LiveCycle to add a custom property called testData, whatever I put in for the value of that property is used as metadata by Alfresco. So that, at least, works. However, I do not get anything from form fields named testData. I tried to implement the JavaScript from your other post. I'm not sure if we use a different version or what, but as far as I can tell, the this.info object (from the scope of a field) doesn't exist. I tried form.info, xfa.info, and a few others, but to no avail.
If I read your post correctly, the JavaScript shouldn't matter with the changes made to the Tika parser; it should get the info from form fields directly. If this is the case, I am not sure what I am doing wrong, as the only difference I see between your config and mine is that you specified the mapping as part of the context whereas I offloaded mine to a properties file. I see absolutely no reason for that to matter a whit, but that's the next thing I'll try, just in case. If I have read this wrong and your solution only works with JavaScript updating the custom properties in line with the form fields, do you know a more absolute path than 'this' to whichever object should have the info property?
I hope your son finds a good college and enjoys his time there!
08-07-2012 07:33 AM
Hey there..!! Thank you for sharing this great information here..!!

08-07-2012 08:50 AM
Chris,
> So does that work for you with properties entered only in the form fields (which is to say, does it work without the JavaScript from your other post setting custom properties from the form fields)?
Nope, no JavaScript in the PDF Form here.
> The only way I am able to get the test data out of the form (config as per my previous post) appears to be by setting a custom property. If I use LiveCycle to add a custom property called testData, whatever I put in for the value of that property is used as metadata by Alfresco. So that, at least, works. However, I do not get anything from form fields named testData. I tried to implement the JavaScript from your other post. I'm not sure if we use a different version or what, but as far as I can tell, the this.info object (from the scope of a field) doesn't exist. I tried form.info, xfa.info, and a few others, but to no avail.
When I created the PDF Form I used Adobe Acrobat Pro X (not LiveCycle). The process itself was clumsy, thanks to the Adobe product. For each form field I set its Name property: http://help.adobe.com/en_US/acrobat/pro/using/WS75136AD2-894B-414e-B296-C590121A789B.w.html
For example, if I had a field on the form called Project Name, I would set the field name property to: myns_projectName
Then in the metadata extractor config file I would map it to: myns:projectName

Probably time to break it down.
I recommend that you create a fillable PDF form with a test field name set to form_fieldName (to differentiate it from document metadata). Then use Eclipse to run PDFBox and have it print out the values set for the field. That's how I diagnosed the problem originally and learned that PDFBox 1.6.0 wasn't parsing the form fields. Once you have a valid PDF Form field and PDFBox parses it correctly, you can add more complexity by mixing in Tika and Alfresco. I jumped directly from getting PDFBox to parse in Eclipse to extracting metadata with Alfresco, but YMMV.
Good luck.
08-16-2012 02:45 AM
Hi,
Sorry if I am hijacking your thread here, but this seems like something that will be relevant to anyone else following these steps.
So I pretty much have it figured out now. The problem above was that I was making the forms with LiveCycle: this process doesn't help get data from XFA forms, only from Acrobat-created forms. I was able to extract metadata fine from Acrobat forms until I tried one with a signed signature field in it. The signature isn't something I need in the metadata; however, when running the "extract common metadata" action I got no metadata at all. In the UI I saw no response, but in the logs I saw:
WARN [content.metadata.AbstractMappingMetadataExtracter] [http-8443-6] Metadata extraction failed (turn on DEBUG for full error): Extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@ae747d3 Content: ContentAccessor[ contentUrl=store://2012/8/16/15/6/51db06bc-26e2-4871-a613-5e25e823ffac.bin, mimetype=application/pdf, size=50453, encoding=UTF-8, locale=en_US] Failure: Can't get signature as String, use getSignature() instead.null
This seems to be related to a deprecated function in PDFBox. I am vaguely aware that there would be some way to override Tika to use the correct method, which is probably a far more elegant solution. My Java skills are scant, however, and I know that we do not need the signature data brought into Alfresco metadata, so I just used a kludgy workaround. In org.apache.tika.parser.pdf.PDFParser.java, around line 325, I made the following change:
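while (fIter.hasNext()) {
    PDField field = (PDField) fIter.next();
    String checkFieldType = field.getFieldType();
    // Skip signature fields; getValue() fails on them.
    // Use equals(): != on Strings compares references, not contents.
    if (!"Sig".equals(checkFieldType)) {
        addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
    }
}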
I realize this could have been accomplished on a single line (by testing field.getFieldType() directly in the if condition); I just did it this way because it was easier to debug with breakpoints.
Just to make this abundantly clear: with this change, no metadata will be extracted from signature fields. Ever. At all. All it does is prevent the parser from falling over when it hits a signature field.
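To show the skip-the-signature idea in isolation, here is a small self-contained sketch. It does not use PDFBox or Tika at all: FormField is a made-up stand-in for a form field whose getValue() throws for signature fields, and SignatureSkipSketch.extract is a hypothetical name for the loop. It also shows why the field type should be compared with equals() rather than != in Java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of skipping signature fields; not Tika or PDFBox code. */
class SignatureSkipSketch {

    /** Made-up stand-in for a PDF form field. */
    static final class FormField {
        final String name;
        final String type;
        final String value;

        FormField(String name, String type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }

        /** Mimics a field API that refuses to return a signature as a String. */
        String getValue() {
            if ("Sig".equals(type)) {
                throw new UnsupportedOperationException("Can't get signature as String");
            }
            return value;
        }
    }

    /** Collects name/value pairs from every field except signature fields. */
    static Map<String, String> extract(List<FormField> fields) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (FormField field : fields) {
            // equals() compares contents, so this also works when the type
            // string was built at runtime and is not the same object as "Sig".
            if ("Sig".equals(field.type)) {
                continue;
            }
            metadata.put(field.name, field.getValue());
        }
        return metadata;
    }

    public static void main(String[] args) {
        List<FormField> fields = new ArrayList<>();
        fields.add(new FormField("myns_projectName", "Tx", "Example project"));
        // new String("Sig") is a distinct object from the literal "Sig".
        fields.add(new FormField("approverSignature", new String("Sig"), null));

        System.out.println(extract(fields));
    }
}
```

An alternative would be to catch the exception per field instead of checking the type, which would also survive other field types that cannot be read as a String, at the cost of hiding real errors.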
