Metadata Extractor for PDF Forms broken

devodl
Champ in-the-making
I think that PDFBox has a bug that prevents reading PDF Forms to populate metadata.
https://issues.apache.org/jira/browse/PDFBOX-1100

As a result I need to develop a way to read the values from the fields in a PDF Form.
It appears that Acrobat can run JavaScript (http://partners.adobe.com/public/developer/en/acrobat/sdk/AcroJSGuide.pdf), which got me thinking that perhaps Alfresco Web Scripts could read the PDF Forms.

Has anyone taken this approach? Is it feasible?
Is there another way to read the data from PDF Forms to populate the metadata in Alfresco?
14 REPLIES

devodl
Champ in-the-making
It appears that Acrobat can run JavaScript (http://partners.adobe.com/public/developer/en/acrobat/sdk/AcroJSGuide.pdf), which got me thinking that perhaps Alfresco Web Scripts could read the PDF Forms.

Answering his own question Steve states:
PDF Forms can execute JavaScript to copy the form field value to a custom document property. Custom document properties can be read by the PDFBox metadata extractor.

Here is what I prototyped to copy PDF Form field data to a custom property:
1 - Open the PDF Form document and create a custom property: File=>Properties=>Custom tab, then give it a name and a null value
2 - Edit the PDF Form, select a field and open its Properties then select the Actions tab
3 - Add Action (trigger: Mouse Up, action: Run a JavaScript)
4 - Edit the action "Run a JavaScript" and add the code to copy the data
function writeToProperty() {
    // Read the form field and copy its value into the custom document property
    var fld = this.getField("dswf_clientName");
    this.info.kcms_clientName = fld.value;
}
writeToProperty(); // call my function
This is by no means a complete list of the steps required to extract PDF Form data into Alfresco metadata, but it should provide some direction for other developers.

FWIW,
Steve

tomdick
Champ in-the-making
I am looking for more information on this. Can you explain it in more detail?

devodl
Champ in-the-making
Open the PDF Form using Acrobat Pro X

Create the Custom Properties
File => Properties, Custom Tab: add name/value pairs

Add JavaScript to PDF Forms
Change to Forms Edit (Edit=>Tools=>Form then Edit)
Select a field, right-click, Properties
Select the Actions tab, trigger: Mouse Up, action: Run a JavaScript
Edit the JavaScript and enter:
function writeToProperty() {
    // Read the form field and copy its value into the custom document property
    var fld = this.getField("sourceFieldName");
    this.info.targetPropertyName = fld.value;
}
writeToProperty(); // call my function
Save PDF Form using: Save As => Reader Extended PDF => Enable Additional Features…

Caveat: While this will copy data from fields to properties when using Acrobat Pro X, it does not appear to work when using Acrobat Reader :(
reference: http://forums.adobe.com/thread/859315

We have not found a solution at this time.

devodl
Champ in-the-making
After much learning I believe I understand the problem and have a version 1.0 solution

Problem
PDFBox 1.6.0 (Alfresco 4.x) is not parsing all the objects in the PDF Form. Specifically it is not parsing the form fields that have been filled out using Acrobat Reader. As a result the PDF Form fields (filled out using Acrobat Reader) cannot be extracted by Alfresco using Tika and PDFBox.

Analysis
The parser in PDFBox 1.7.0 is being improved to handle stream objects in a more complete manner
https://issues.apache.org/jira/browse/PDFBOX-1199
This new code is contained in Rev 1333582 ==> NonSequentialPDFParser.java
http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/
Testing shows that with the new PDDocument.loadNonSeq(File, RandomAccess) method, form field values created by Acrobat Reader are now readable. :D

Tika 1.1 currently calls org.apache.pdfbox.pdmodel.PDDocument.load(), which correctly parses the metadata of the document but fails to parse the PDF Form fields. Conversely, the new org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq() method correctly parses the PDF Form fields but is not able to parse the metadata. Evidently you can't have it both ways. :o

Solution
- Checkout, build and deploy PDFBox-1.7.0-SNAPSHOT.jar containing the loadNonSeq() code
- Modify and deploy the Tika 1.1 package to use the new PDFBox code
  The InputStream needs to be processed twice, once for metadata and once for form field data so a temp file is used instead.
The org.apache.tika.parser.pdf.PDFParser class was edited as follows:
        try {
           // New - Use a temp file so it can be parsed twice
            tstream = TikaInputStream.get(stream, tmp);
            tsFile = tstream.getFile();

            // PDFBox can process entirely in memory, or can use a temp file
            //  for unpacked / processed resources
            // Decide which to do based on if we're reading from a file or not already
            if (tstream != null && tstream.hasFile()) {
               // File based, take that as a cue to use a temporary file
               scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
               pdfDocument = PDDocument.load(tsFile, scratchFile);
            } else {
               // Go for the normal, stream based in-memory parsing
               pdfDocument = PDDocument.load(tsFile);
            }
…snip code to cope with encrypted files…          
            metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
            extractMetadata(pdfDocument, metadata);
           
            // New - Now parse again but non-sequentially to retrieve any form field data
            pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);         
            extractFormFieldData(pdfFormDoc, metadata);           
           
            PDF2XHTML.process(pdfDocument, handler, metadata,
                    extractAnnotationText, enableAutoSpace,
                    suppressDuplicateOverlappingText, sortByPosition);
In addition to changing the parse() method above, a new method was added to process the AcroForm fields as follows:

    private void extractFormFieldData(PDDocument document, Metadata metadata)
            throws TikaException, IOException {
        PDDocumentCatalog docCatalog = document.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            List fldList = acroForm.getFields();
            Iterator fIter = fldList.iterator();
            while (fIter.hasNext()) {
                PDField field = (PDField) fIter.next();

                addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
                if (logger.isDebugEnabled()) {
                    String logMsg = "extracting: " + field.getFullyQualifiedName();
                    logMsg += "    value: " + field.getValue();
                    logger.debug(logMsg);
                }
            }
        }
    }
I'm sure that there are better ways of doing this but I chose to use a temp file just to get it working.
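For readers unfamiliar with the temp-file idea, here is a minimal, library-free sketch of it: a one-shot InputStream is copied to a temporary file once, and the file is then opened as many times as needed. The class TwoPassSpool and its methods are hypothetical names used only for illustration; they are not part of Tika or PDFBox, and countBytes() merely stands in for a real parser pass such as PDDocument.load() or PDDocument.loadNonSeq().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class TwoPassSpool {

    /** Copies a one-shot stream to a temp file so it can be opened repeatedly. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-two-pass-", ".tmp");
        // createTempFile already created the file, so allow overwriting it
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Stand-in for one parser pass: it simply counts the bytes it reads. */
    static long countBytes(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            long total = 0;
            byte[] buf = new byte[8192];
            for (int read; (read = in.read(buf)) != -1; ) {
                total += read;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream oneShot = new ByteArrayInputStream(
                "%PDF-1.4 demo".getBytes(StandardCharsets.US_ASCII));
        Path tmp = spool(oneShot);
        try {
            long firstPass = countBytes(tmp);   // e.g. the metadata pass
            long secondPass = countBytes(tmp);  // e.g. the form field pass
            System.out.println(firstPass + " bytes, then " + secondPass + " bytes");
        } finally {
            Files.deleteIfExists(tmp);          // always remove the temp file
        }
    }
}
```

In the modified PDFParser the same role is played by TikaInputStream.get(stream, tmp) and tstream.getFile(), with TemporaryResources.dispose() doing the cleanup.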
Perhaps the Tika and PDFBox developers will consider this problem as the two projects evolve.
I hope that this posting helps others.

chrisokelly
Champ on-the-rise
Hi Steve,

I'm really interested in applying this fix, as we use Adobe forms heavily and would like to do more with them in Alfresco. I've done a little with JavaScript and PHP but not much Java. Are you able to give a bit more detail on how to follow the steps you have there? I know how to use svn to check out from the URL you gave, but I am not sure how to build and deploy the jar file, or where I would put it once I have. I am also not sure how to edit the PDFParser class.

Thanks very much in advance for your help, if you have the time and inclination to give it.

devodl
Champ in-the-making
Chris,
It's been a couple of months since I worked on this so my memory isn't too fresh.
But at a very high level here's how I built the jar files:
Using Eclipse Indigo (3.7) with the Subversion and Maven plugins:
- Checkout the Tika 1.1 project from: http://svn.apache.org/repos/asf/tika/tags/1.1
- Checkout the PDFBox 1.7.0 SNAPSHOT (or higher) http://svn.apache.org/repos/asf/pdfbox/tags/1.7.1

Resolve dependencies (Tika is dependent on PDFBox)
This is where I learned how to use Maven with Eclipse
- Use Maven to build PDFBox with the Maven goals "clean install"  (Hint: Eclipse Run Configurations…)
- Modify the Tika class org.apache.tika.parser.pdf.PDFParser
  - Using the Eclipse editor, modify the class as described earlier in this thread
  - change the parse() method
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
      
        PDDocument pdfDocument = null;
        PDDocument pdfFormDoc = null;
       TikaInputStream tstream = null;
        File tsFile = null;
        TemporaryResources tmp = new TemporaryResources();
        RandomAccess scratchFile = null;

        try {
           // SMD - Use a temp file so it can be parsed twice
            tstream = TikaInputStream.get(stream, tmp);
            tsFile = tstream.getFile();

            // PDFBox can process entirely in memory, or can use a temp file
            //  for unpacked / processed resources
            // Decide which to do based on if we're reading from a file or not already
            if (tstream != null && tstream.hasFile()) {
               // File based, take that as a cue to use a temporary file
               scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
//               pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
               pdfDocument = PDDocument.load(tsFile, scratchFile);
            } else {
               // Go for the normal, stream based in-memory parsing
//               pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
               pdfDocument = PDDocument.load(tsFile);
            }
          
            if (pdfDocument.isEncrypted()) {
                String password = null;
               
                // Did they supply a new style Password Provider?
                PasswordProvider passwordProvider = context.get(PasswordProvider.class);
                if (passwordProvider != null) {
                   password = passwordProvider.getPassword(metadata);
                }
               
                // Fall back on the old style metadata if set
                if (password == null && metadata.get(PASSWORD) != null) {
                   password = metadata.get(PASSWORD);
                }
               
                // If no password is given, use an empty string as the default
                if (password == null) {
                   password = "";
                }
              
                try {
                    pdfDocument.decrypt(password);
                } catch (Exception e) {
                    // Ignore
                }
            }
            metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
            extractMetadata(pdfDocument, metadata);
           
            // SMD - Now parse non-sequentially to retrieve any form field data
            pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);         
            extractFormFieldData(pdfFormDoc, metadata);           
           
            PDF2XHTML.process(pdfDocument, handler, metadata,
                    extractAnnotationText, enableAutoSpace,
                    suppressDuplicateOverlappingText, sortByPosition);

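        } finally {
            // Close each document independently: pdfFormDoc is still null
            // if an exception was thrown before the second parse
            if (pdfDocument != null) {
               pdfDocument.close();
            }
            if (pdfFormDoc != null) {
               pdfFormDoc.close();
            }
            tmp.dispose();
        }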
    }

   - add the extractFormFieldData() method
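    // Imports this method needs, if PDFParser does not already have them:
    //   java.util.Iterator, java.util.List,
    //   org.apache.pdfbox.pdmodel.PDDocumentCatalog,
    //   org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm,
    //   org.apache.pdfbox.pdmodel.interactive.form.PDField
    // 'logger' is not declared in this excerpt; it must exist as a field in the class.

    /**
     * Steve Deal - Added to parse PDF Form fields
     *
     * @param document
     * @param metadata
     * @throws TikaException
     */
    private void extractFormFieldData(PDDocument document, Metadata metadata)
            throws TikaException, IOException {
        PDDocumentCatalog docCatalog = document.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            List fldList = acroForm.getFields();
            Iterator fIter = fldList.iterator();
            while (fIter.hasNext()) {
                PDField field = (PDField) fIter.next();

                // Store each form field as a metadata entry keyed by its full name
                addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
                if (logger.isDebugEnabled()) {
                    String logMsg = "extracting: " + field.getFullyQualifiedName();
                    logMsg += "    value: " + field.getValue();
                    logger.debug(logMsg);
                }
            }
        }
    }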

- Use Maven to build Tika with the Maven goals "clean install" (this was new to me, since I had been using Ant).

If you're new to these tools and the language it will require learning, but that's the fun of it :)

I hope this helps.

Steve

chrisokelly
Champ on-the-rise
Well, that sure was an adventure. I've pretty much managed to follow your steps (woohoo!) but I just need a leeedle bit more help to get over the line. Hope it doesn't put you out at all, and thanks very much for the help thus far.

In case someone else sees this post looking for the same thing, I'll just go over my steps here:

- For starters, I tried to check out the original projects using File>New>Project…>Maven>Checkout Maven Projects from SCM, but none of my SVN connectors were showing up in the SCM type field. I spent almost an hour trying to troubleshoot this issue (which appears to be prevalent among Eclipse Indigo users) before I gave up, uninstalled Indigo and installed Eclipse Juno (which looks a little more fancy anyway :P).
After installing Juno, m2e and Subversive, as well as the Maven SCM Handler for Subversive, I was able to use File>New>Project…>Maven>Checkout Maven Projects from SCM normally to check out the two URIs you posted.

- I had some trouble resolving dependencies due to two issues. The first was a basic misunderstanding on my part of the difference between a dependency and a folder on the classpath. Once I figured out how to work with the pom.xml files this was sorted. The second was that Eclipse took what I expected to be 2 projects and turned them into 14. For example, I had pdfbox-ant, pdfbox-app and so on, as well as pdfbox and pdfbox-parent. The same was true of the Tika projects.

- Spent most of my day trying to figure out why I couldn't compile Tika. I compiled PDFBox with no issues and made the changes to the PDFParser class, but I ran into a bunch of "error: cannot find symbol" messages when trying to compile. Eventually I had to learn what an import statement was and add a few of these (still not sure why I had to import java.io.File; it seems like something that would have been in the file already if it was needed, but I s'pose that's why I'm not a Java dev).

So now I have managed to build both of these with the changes. I have 5 jar files:
  • pdfbox-1.7.1.jar
  • tika-app-1.1.jar
  • tika-bundle-1.1.jar
  • tika-core-1.1.jar
  • tika-parsers-1.1.jar
If I look on the Alfresco VM, in /opt/alfresco-4.0.d/tomcat/webapps/alfresco/WEB-INF/lib/ I have the pdfbox jar, as well as tika-core and tika-parsers. (no pdfbox or tika jars in the share lib). So I assume my next move from here is to either move the pdfbox jar, the tika-core jar and the tika-parsers jar into /opt/alfresco-4.0.d/tomcat/webapps/alfresco/WEB-INF/lib/ or /opt/alfresco-4.0.d/tomcat/shared/lib/. So that's my first question - will the shared lib work for this case?

Secondly, I know that the filenames are not exactly the same, since the version numbers are different. So my second question is: should I delete or move the originals, or should I rename the jars I have built so that they overwrite the originals?

devodl
Champ in-the-making
Chris,
Excellent work!  I can relate to the journey you took.

Deployment
You only need to deploy the modified tika-parsers and pdfbox jar files to supersede the original files.  The actual names of the jar files are inconsequential; it is the specific packages and class names, as well as the method signatures, that are critical.
    package:  org.apache.tika.parser.pdf
    class:       PDFParser
    method:   public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
                                                 throws IOException, SAXException, TikaException

I see you read my other posting:  https://forums.alfresco.com/en/viewtopic.php?f=9&t=44804 back in May where I stated:
The only solution we have found is to rename the OOTB jar files and drop the modified jar files into the tomcat/webapps/alfresco/WEB-INF/lib.

I simply rename the OOTB jar files (e.g. tika-parsers-1.2-20120504.jar  ==> tika-parsers-1.2-20120504.jar.original), copy my files to that same directory, and the class loader only loads files with the .jar extension.
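The renaming step can also be scripted. The sketch below is only an illustration using the standard java.nio.file API; the class JarSidelining and its sideline() method are hypothetical names, and the lib directory and jar name prefixes are supplied by the caller rather than assumed. It renames every matching *.jar to *.jar.original, so that only the replacement jars keep the .jar extension. Stop Alfresco and take a backup before trying anything like this.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Alfresco, Tika or PDFBox.
final class JarSidelining {

    /** Renames every <prefix>*.jar in libDir to <name>.jar.original and returns the new paths. */
    static List<Path> sideline(Path libDir, String prefix) throws IOException {
        // Collect matches first so files are not renamed while the directory is being listed
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, prefix + "*.jar")) {
            for (Path jar : jars) {
                matches.add(jar);
            }
        }
        List<Path> renamed = new ArrayList<>();
        for (Path jar : matches) {
            Path target = jar.resolveSibling(jar.getFileName() + ".original");
            renamed.add(Files.move(jar, target));
        }
        return renamed;
    }

    /** Usage: java JarSidelining <libDir> <prefix> [<prefix> ...] */
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: JarSidelining <libDir> <prefix> [<prefix> ...]");
            return;
        }
        Path libDir = Paths.get(args[0]);
        for (int i = 1; i < args.length; i++) {
            for (Path p : sideline(libDir, args[i])) {
                System.out.println("renamed to " + p);
            }
        }
    }
}
```

Run it before copying the rebuilt jars into the directory, for example with the prefixes "tika-parsers-" and "pdfbox-", otherwise the new jars would be renamed as well.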

I hope that makes it clear.

chrisokelly
Champ on-the-rise
Whew, OK, so I renamed the two original jars to .orig, moved the two new jars into the lib, and restarted Alfresco. The server starts up fine, so I am left with just a few questions:

  • When the server starts up I see the following in the logs:
    INFO: Adding 'file:/opt/alfresco-4.0.d/alf_data/solr/lib/tika-parsers-1.1-20111128.jar' to classloader
    07/08/2012 8:18:22 AM org.apache.solr.core.SolrResourceLoader replaceClassLoader
    I know this jar is in the Solr lib, not the Alfresco lib. Does this indicate a problem? Should Solr be using the new jars too?
  • Moving from here to extracting data from pdf forms: Do I need to define a new metadata extractor or extend the existing PDF one now?

  • For instance, say I have a test pdf form, created in Livecycle, with a field in it named "testData". If I also had defined in our content model an aspect: "my:testAspect" with a property "my:testData", would extracting common metadata on the pdf cause it to gain the my:testAspect aspect with the my:testData property set to whatever was entered in the form field? (without further modification)

    Or would I need to first override the bean loading org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter and add a custom mapping for testData=my:testData?

Edit:

I've spent some time trying to get this to work, however I am still having trouble. My steps so far:
In /opt/alfresco-4.0.d/tomcat/shared/classes/alfresco/extension I added 2 files. The first is custom-metadata-extractors-context.xml:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
        <bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter" parent="baseMetadataExtracter" >
                <property name="inheritDefaultMapping">
                        <value>true</value>
                </property>
                <property name="mappingProperties">
                        <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
                                <property name="location">
                                        <value>classpath:alfresco/extension/custom-pdfbox-extractor-mappings.properties</value>
                                </property>
                        </bean>
                </property>
        </bean>
</beans>
and the second is custom-pdfbox-extractor-mappings.properties:
# Namespace Definitions
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.my=my.companyName.root

#Mapping Definitions
testData=my:testData
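
As a sanity check on a mapping file like the one above, the following sketch shows one way to read it: the namespace.prefix.* entries declare prefixes, and every other entry maps an extracted key to a prefix:localName target. The class MappingPreview is a hypothetical helper written for this thread, not Alfresco's own loader, and it assumes a single target per key. It only expands each target into {namespaceURI}localName form so that typos in prefixes show up early.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper for illustration; this is not Alfresco's mapping loader.
final class MappingPreview {
    private static final String NS_KEY = "namespace.prefix.";

    /** Expands each "key=prefix:local" entry to "{namespaceURI}local" using the declared prefixes. */
    static Map<String, String> resolve(Properties props) {
        Map<String, String> resolved = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_KEY)) {
                continue; // namespace declaration, not a mapping
            }
            String value = props.getProperty(key).trim();
            int colon = value.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("No prefix in mapping: " + key + "=" + value);
            }
            String prefix = value.substring(0, colon);
            String uri = props.getProperty(NS_KEY + prefix);
            if (uri == null) {
                throw new IllegalArgumentException("Undeclared prefix: " + prefix);
            }
            resolved.put(key, "{" + uri.trim() + "}" + value.substring(colon + 1));
        }
        return resolved;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(
                "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
                + "namespace.prefix.my=my.companyName.root\n"
                + "testData=my:testData\n"));
        System.out.println(resolve(props));
    }
}
```

The namespace URI declared for a prefix here has to match the URI declared for that prefix in the content model, otherwise the mapped property will not line up with the model's property.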

I already have a custom model deployed, so I added to it:
                <aspect name="my:testAspect">
                        <title>Test Aspect</title>
                        <properties>
                                <property name="my:testData">
                                        <type>d:text</type>
                                </property>
                        </properties>
                </aspect>

I have created a few forms in Livecycle, each with a single text field named testData . The first was a dynamic XML form and the second a static pdf. With each of these I tried the following:
  • Filling the form in using Acrobat X

  • Extending the form using Acrobat X, then filling in with Reader X

  • Distributing the form, opening in Reader X, submitting
Once I had uploaded them to Alfresco, I tried extracting common metadata with and without adding the testData aspect first but got no joy.
Is there something extra you had done to get this to work? I saw the javascript solution you posted in the other thread, and I am hoping this isn't it, as these forms will be filled in using Reader almost universally.

Thanks again btw for the help you've provided so far, which has been invaluable