Custom metadata extraction from MS Word

col_edinburgh
Champ in-the-making
OK, after a couple of weeks I'm close to giving up. All I want to do is extend the Office metadata extractor so that I can collect a custom property called projectID.

I have followed the example in the book Alfresco Developer Guide (2008), chapter 4, and the wiki page
http://wiki.alfresco.com/wiki/Metadata_Extraction

but I can't do it.

Following the example in the book I can successfully map the 'keywords' property, but I am lost on the 'digging into the extractor class' example. I can't customise the class using the steps laid out.

I fear I am going to have to abandon this project, as I just can't make it work.

Regards
28 REPLIES

col_edinburgh
Champ in-the-making
Thanks, that has resolved the startup problem. Unfortunately it's still not picking up the custom:user1 property, but it does pick up Keywords=custom:keywords.

mrogers
Star Contributor
I suspect you have too many "custom"s. Is the property in your Word document called "custom:user1" or "user1"? Your extractor won't set the "user1" raw property, since you have configured it with "custom:user1".

I suggest you run the debugger or add some System.out.println statements to make sure that the metadata extractor is picking up the raw property from your word document and then setting the correct raw property.   If it is I'm fairly sure the mismatch between names will be obvious.
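
A minimal sketch of the System.out.println approach, for anyone who would rather not set up a debugger. It is not Alfresco code: a plain Map stands in for Tika's Metadata object, and the class name RawMetadataDump, the method describe and the sample values are all made up for illustration. Inside extractSpecific the same loop would presumably run over metadata.names() and metadata.get(name) instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: lists every raw name/value pair so that a naming
// mismatch (for example "user1" versus "custom:user1") is easy to spot.
final class RawMetadataDump
{
    // Formats each raw entry as "name = value", one per line, in map order.
    static String describe(Map<String, String> raw)
    {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : raw.entrySet())
        {
            out.append(entry.getKey()).append(" = ").append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args)
    {
        // Assumed sample values; a real run would copy them out of the parser's metadata.
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("title", "Quarterly report");
        raw.put("custom:user1", "P-1234");
        System.out.print(describe(raw));
    }
}
```

Whatever names show up in that listing are the ones the left-hand side of the mapping has to match.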

col_edinburgh
Champ in-the-making
It's just user1 in the Word document.

mrogers
Star Contributor
The test in your loop,
if (metadata.get(CUSTOM_PREFIX + key) != null)
probably needs to lose the concatenation of CUSTOM_PREFIX then.
And your configuration of the extractor needs to be just "user1".

Or just do

putRawValue("user1", metadata.get("user1"), properties);
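
Since it is not yet settled here whether the parser reports the property as "user1" or as "custom:user1", one way to hedge is a lookup that tries the bare name first and the prefixed name second. This is only a sketch under that assumption: a plain Map stands in for the Tika Metadata object, and RawKeyLookup and find are hypothetical names, not part of Alfresco or Tika.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: looks a raw property up under either naming scheme.
final class RawKeyLookup
{
    static final String CUSTOM_PREFIX = "custom:";

    // Returns the value stored under the bare key, or failing that under
    // "custom:" + key. Returns null when neither name is present.
    static String find(Map<String, String> raw, String key)
    {
        String value = raw.get(key);
        if (value == null)
        {
            value = raw.get(CUSTOM_PREFIX + key);
        }
        return value;
    }

    public static void main(String[] args)
    {
        // Assumed sample data: the parser used the prefixed name.
        Map<String, String> raw = new HashMap<>();
        raw.put("custom:user1", "P-1234");
        System.out.println(find(raw, "user1"));
    }
}
```

In the extractor, the value returned by such a lookup would then be passed to putRawValue under the key used in the mapping file.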

col_edinburgh
Champ in-the-making
I have now got the extractor working with Open Document files, but it still doesn't work with Microsoft .doc or .docx files.

EnhancedOpenOffice.java
package com.mpb.extracter;

import java.io.Serializable;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Map;
import java.util.Set;

import org.alfresco.repo.content.MimetypeMap;
import org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter;
import org.alfresco.service.namespace.QName;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.odf.OpenDocumentParser;


public class EnhancedOpenOffice extends TikaPoweredMetadataExtracter
{      
       private static final String CUSTOM_PREFIX = "custom:";

       public static ArrayList<String> SUPPORTED_MIMETYPES = buildSupportedMimetypes(
           new String[] {
               MimetypeMap.MIMETYPE_OPENDOCUMENT_TEXT,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_TEXT_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_GRAPHICS,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_GRAPHICS_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_PRESENTATION,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_PRESENTATION_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_SPREADSHEET,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_SPREADSHEET_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_CHART,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_CHART_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_IMAGE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_IMAGE_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_FORMULA,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_FORMULA_TEMPLATE,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_TEXT_MASTER,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_TEXT_WEB,
               MimetypeMap.MIMETYPE_OPENDOCUMENT_DATABASE
           }, new OpenDocumentParser()
       );

       // HH is the 24-hour field; the original hh (12-hour) would misread afternoon times
       private static final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");

       public EnhancedOpenOffice()
       {
           super(SUPPORTED_MIMETYPES);
       }
      
       @Override
       protected Parser getParser() {
          return new OpenDocumentParser();
       }

       @Override
       protected Map<String, Serializable> extractSpecific(Metadata metadata,
            Map<String, Serializable> properties, Map<String, String> headers)
            {
             
          // Handle user-defined properties dynamically
          Map<String, Set<QName>> mapping = super.getMapping();
          for (String key : mapping.keySet())
          {
              if (metadata.get(CUSTOM_PREFIX + key) != null)
              {
                   putRawValue(key, metadata.get(CUSTOM_PREFIX + key), properties);
              }
          }
         
          return properties;
       }
       private Date getDateOrNull(String dateString)
       {
           if (dateString != null && dateString.length() != 0)
           {
               try {
                  return dateFormat.parse(dateString);
               } catch(ParseException e) {
                  // Unparseable date: fall through and return null
               }
           }

           return null;
       }
}

EnhancedMicrosoft.java
package com.mpb.extracter;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Map;

import org.alfresco.repo.content.MimetypeMap;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.microsoft.OfficeParser;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Set;
import org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter;
import org.alfresco.service.namespace.QName;


public class EnhancedMicrosoft extends TikaPoweredMetadataExtracter
{
   private static final String CUSTOM_PREFIX = "custom:";
   
    public static ArrayList<String> SUPPORTED_MIMETYPES = buildSupportedMimetypes(
             new String[] {
                 MimetypeMap.MIMETYPE_WORD,
                 MimetypeMap.MIMETYPE_EXCEL,
                 MimetypeMap.MIMETYPE_PPT},
             new OfficeParser()
       );
       static {
          // Outlook has its own extractor
          SUPPORTED_MIMETYPES.remove(MimetypeMap.MIMETYPE_OUTLOOK_MSG);
       }
      
       // HH is the 24-hour field; the original hh (12-hour) would misread afternoon times
       private static final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");

       public EnhancedMicrosoft()
       {
           super(SUPPORTED_MIMETYPES);
       }
      
       @Override
       protected Parser getParser() {
         return new OfficeParser();
       }
      
       @Override
       protected Map<String, Serializable> extractSpecific(Metadata metadata,
            Map<String, Serializable> properties, Map<String, String> headers)
            {
             
          // Handle user-defined properties dynamically
          Map<String, Set<QName>> mapping = super.getMapping();
          for (String key : mapping.keySet())
          {
              if (metadata.get(CUSTOM_PREFIX + key) != null)
              {
                   putRawValue(key, metadata.get(CUSTOM_PREFIX + key), properties);
              }
          }
         
          return properties;
       }
       private Date getDateOrNull(String dateString)
       {
           if (dateString != null && dateString.length() != 0)
           {
               try {
                  return dateFormat.parse(dateString);
               } catch(ParseException e) {
                  // Unparseable date: fall through and return null
               }
           }

           return null;
       }
}

EnhancedOpenOffice.properties
#
# OpenDocumentMetadataExtracter - default mapping
#
# author: Derek Hulley

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.custom=custom.model


# Mappings
creationDate=cm:created
creator=cm:author
date=
description=
generator=
initialCreator=
keyword=
language=
printDate=
printedBy=
subject=cm:description
title=cm:title
# mine
user1=custom:user1

EnhancedMicrosoft.properties
#
# OfficeMetadataExtracter - default mapping
#
# author: Derek Hulley

# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.custom=custom.model

# Mappings
author=cm:author
title=cm:title
subject=cm:description
createDateTime=cm:created
lastSaveDateTime=cm:modified

# mine
user1=custom:user1
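
To make the two halves of these mapping files concrete: the namespace.prefix.* lines declare prefixes, and each mapping line pairs a raw property name with a prefixed target. The sketch below expands a target such as custom:user1 into {uri}user1 form using only java.util.Properties. It assumes the namespace.prefix value must equal the namespace URI declared in the content model (here custom.model, as posted); MappingResolver, parse and expand are hypothetical names and not how Alfresco itself resolves the mapping.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: resolves prefixed targets from an extractor mapping file.
final class MappingResolver
{
    static final String NS_KEY_PREFIX = "namespace.prefix.";

    // Loads mapping text in .properties syntax.
    static Properties parse(String text)
    {
        Properties props = new Properties();
        try
        {
            props.load(new StringReader(text));
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Expands "prefix:local" to "{uri}local" using the namespace.prefix.* entries.
    // Returns null when there is no colon or the prefix is not declared.
    static String expand(Properties props, String prefixedName)
    {
        int colon = prefixedName.indexOf(':');
        if (colon < 0)
        {
            return null;
        }
        String uri = props.getProperty(NS_KEY_PREFIX + prefixedName.substring(0, colon));
        if (uri == null)
        {
            return null;
        }
        return "{" + uri + "}" + prefixedName.substring(colon + 1);
    }

    public static void main(String[] args)
    {
        Properties props = parse("namespace.prefix.custom=custom.model\nuser1=custom:user1\n");
        System.out.println(expand(props, props.getProperty("user1")));
    }
}
```

A target whose prefix has no namespace.prefix entry comes back as null here, which is a quick way to spot a missing or misspelled prefix declaration.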

custom-metadata-extractors-context.xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<!--
       This sample shows how to modify the mapping properties of the new V2.1 Metadata Extractors.
       In this example, in addition to the default mappings, the field 'user1' is mapped to
       'custom:user1'.  The available source properties are described in the Javadocs of the
       extracter class.
-->
<beans>

    <!-- This adds in the extra mapping for the Open Document extractor -->
    <bean id="extracter.OpenDocument" class="com.mpb.extracter.EnhancedOpenOffice" parent="baseMetadataExtracter" >
        <property name="inheritDefaultMapping">
            <value>true</value>
        </property>
        <property name="mappingProperties">
            <props>
                <prop key="namespace.prefix.custom">custom.model</prop>
                <prop key="user1">custom:user1</prop>
        
            </props>
        </property>
    </bean>
   
     <!-- This adds in the extra mapping for the Office extractor -->
    <bean id="extracter.Office" class="com.mpb.extracter.EnhancedMicrosoft" parent="baseMetadataExtracter" >
        <property name="inheritDefaultMapping">
            <value>true</value>
        </property>
        <property name="mappingProperties">
            <props>
                <prop key="namespace.prefix.custom">custom.model</prop>
                <prop key="user1">custom:user1</prop>
        
            </props>
        </property>
    </bean>
</beans>

col_edinburgh
Champ in-the-making
putRawValue("user1", metadata.get("user1"), properties);

This didn't work either.

col_edinburgh
Champ in-the-making
With reference to http://wiki.alfresco.com/wiki/Metadata_Extraction

At the bottom of the page it says:
"Be sure to make the class's Javadocs reflect all the extracted values along with the default mappings."

What does that mean, and how do I do it?

Many thanks

col_edinburgh
Champ in-the-making
Is there no way to get this working?

mrogers
Star Contributor
What happened when you ran the debugger?    Is your custom property being extracted from the word document or is the problem that the value is being extracted but not mapped correctly?

col_edinburgh
Champ in-the-making
What happened when you ran the debugger?    Is your custom property being extracted from the word document or is the problem that the value is being extracted but not mapped correctly?

I don't know how to do that; I know very little about Eclipse.