Hyland Connect

shikarishambu · ‎03-09-2010

I have a custom content model with a number of attributes. When I create PDF or Word documents that need to be specialized to this content type, I add custom properties to the document and let the custom metadata extractor pull out the metadata.

This seems to work well with Word documents. I have a rule on the space where the documents are loaded: specialize the content type, then extract the metadata. However, for PDFs it seems to pull only the default PDF properties. What am I missing? How do I pull custom properties out of a PDF, i.e. pdfx:mycustomproperty and not pdf:mycustomproperty?

Here is a snippet of my custom-metadata-extrators-context.xml

  <bean id="extracter.OfficeDocument"
        class="org.alfresco.repo.content.metadata.OfficeMetadataExtracter"
        parent="baseMetadataExtracter">
    <property name="inheritDefaultMapping">
      <value>true</value>
    </property>
    <property name="mappingProperties">
      <props>
        <prop key="namespace.prefix.mymodel">http://www.mymodelsoln.com/model/content/1.0</prop>
        <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
        <prop key="docid">mymodel:docid</prop>
        <prop key="comments">mymodel:comments</prop>
        <prop key="category">mymodel:category</prop>
        <prop key="subcategory">mymodel:subcategory</prop>
        <prop key="policyno">mymodel:policyno</prop>
        <prop key="claimno">mymodel:claimno</prop>
        <prop key="agentno">mymodel:agentno</prop>
        <prop key="providerno">mymodel:providerno</prop>
        <prop key="contractno">mymodel:contractno</prop>
        <prop key="govtid">mymodel:govtid</prop>
        <prop key="effectivedate">mymodel:effectivedate</prop>
        <prop key="product">mymodel:product</prop>
        <prop key="region">mymodel:region</prop>
        <prop key="keywords">mymodel:keywords</prop>
      </props>
    </property>
  </bean>
  <bean id="extracter.PDFBox"
        class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter"
        parent="baseMetadataExtracter">
    <property name="inheritDefaultMapping">
      <value>true</value>
    </property>
    <property name="mappingProperties">
      <props>
        <prop key="namespace.prefix.mymodel">http://www.mymodelsoln.com/model/content/1.0</prop>
        <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
        <prop key="docid">mymodel:docid</prop>
        <prop key="comments">mymodel:comments</prop>
        <prop key="category">mymodel:category</prop>
        <prop key="subcategory">mymodel:subcategory</prop>
        <prop key="policyno">mymodel:policyno</prop>
        <prop key="claimno">mymodel:claimno</prop>
        <prop key="agentno">mymodel:agentno</prop>
        <prop key="providerno">mymodel:providerno</prop>
        <prop key="contractno">mymodel:contractno</prop>
        <prop key="govtid">mymodel:govtid</prop>
        <prop key="effectivedate">mymodel:effectivedate</prop>
        <prop key="product">mymodel:product</prop>
        <prop key="region">mymodel:region</prop>
        <prop key="keywords">mymodel:keywords</prop>
      </props>
    </property>
  </bean>

derek · ‎03-24-2010

Hi,
Look at the javadocs of the metadata extractor implementations. The PDFBox metadata extractor doesn't extract custom properties. It will be possible to do something like this:


                // Extract remaining custom properties: every mapped key that the
                // default extraction has not already filled in
                for (String customProp : super.getMapping().keySet())
                {
                    if (rawProperties.containsKey(customProp))
                    {
                        // Already extracted as a default property; skip it
                        continue;
                    }
                    String customValue = docInfo.getCustomMetadataValue(customProp);
                    putRawValue(customProp, customValue, rawProperties);
                }

after the catch block. I am putting this into the extractor and testing it now. To get the change right away, implement it in your own extractor class.
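For anyone who wants to see what that loop does without an Alfresco build at hand, here is a minimal stand-alone sketch in plain Java. It is not the Alfresco or PDFBox API: the class name `CustomPropertyMerge`, the method `mergeCustomProperties`, and the lookup function standing in for `docInfo.getCustomMetadataValue` are all hypothetical. Skipping null values is also an assumption about how missing custom properties should be handled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/**
 * Stand-alone illustration of the "extract remaining custom properties" loop.
 * All names are hypothetical; nothing here is Alfresco or PDFBox code.
 */
public class CustomPropertyMerge {

    /**
     * Returns a copy of rawProperties with one extra entry for every mapped
     * key that the default extraction has not already filled in.
     *
     * @param mappedKeys    keys declared in the extractor mapping
     * @param rawProperties values already produced by the default extraction
     * @param customLookup  stand-in for reading a custom property from the PDF
     */
    static Map<String, String> mergeCustomProperties(
            Set<String> mappedKeys,
            Map<String, String> rawProperties,
            Function<String, String> customLookup) {
        Map<String, String> result = new LinkedHashMap<>(rawProperties);
        for (String key : mappedKeys) {
            if (result.containsKey(key)) {
                continue; // already extracted as a default property
            }
            String value = customLookup.apply(key);
            if (value != null) { // assumption: missing custom properties are skipped
                result.put(key, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pretend the default extraction found only the author.
        Map<String, String> raw = new HashMap<>();
        raw.put("author", "Jane");

        // Pretend these are the custom entries stored in the PDF.
        Map<String, String> pdfCustomEntries = new HashMap<>();
        pdfCustomEntries.put("docid", "D-1001");

        Set<String> mapped = new LinkedHashSet<>(Arrays.asList("author", "docid", "policyno"));
        System.out.println(mergeCustomProperties(mapped, raw, pdfCustomEntries::get));
    }
}
```

In this sketch the key `docid` is picked up because it is in the mapping and missing from the default results, `author` keeps its default value, and `policyno` is left out because the stand-in PDF has no such entry.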

Regards

shikarishambu · ‎03-24-2010

Derek, thanks for the info. Are you committing the change to the codebase? Will it be available out-of-box in a later version of Alfresco? Or, should I continue to rely on my changes?

TIA

derek · ‎03-25-2010

Hi,
I have committed the change to 3.2.1 (Enterprise customers) and will merge it into 3.3 Community as soon as it passes sanity checking.
Regards

cerberos · ‎07-13-2010

Hi,
Is there any news on this? I've created a PDF using PDFBox and added a custom property called "prop1". The PDFBox library can read it.
I've set a custom mapping for the PDFBox extractor, using this line:


…
<prop key="prop1">cm:description</prop>
…

but when I add a new PDF file, nothing appears. I'm using Alfresco Community 3.2-r2. Is there any news for Alfresco Community 3.3?
How can I develop a custom extractor that works with PDFs and overrides the default extractor?

Thank you.

cerberos · ‎07-15-2010

OK, it seems to work with Alfresco 3.3g. The PDFBox metadata extractor now uses Apache Tika. I don't know how it works, but it seems to be OK!

Cheers


Cannot get pdf custom properties values - custom metadata ex