XML metadata extraction

swithun
Champ in-the-making
I know this question gets asked a lot, and I've read the other threads, but without success. Most other people seemed to be trying to get XML metadata extraction within the web content management system. I don't know if my situation is significantly different.

I would like to import multiple TEI XML files into the system using Alfresco mounted as a network drive, have them recognised as TEI XML files, and have various pieces of information extracted into custom metadata fields. I have the network drive working, and I have created an aspect that defines the custom metadata fields. What I don't have working is the metadata extraction.

My TEI XML files look like this:


<TEI.2 id="_jamesi_t1428_7_1_d6_trans" n="jamesi_trans">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Procedure: preamble</title>
      </titleStmt>
      <editionStmt>
        <edition n="session">jamesi_t1428_7_1_d2_trans</edition>
      </editionStmt>
      <publicationStmt>
        <date>14280712</date>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>…</text>
</TEI.2>

(no XML declaration or DTD, and no namespaces)

The model for my data looks like this:


<model name="rps:rpsModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">
  …
  <namespaces>
    <namespace uri="rps.ns" prefix="rps"/>
  </namespaces>
  <types>
    <type name="rps:document">
      <title>RPS document</title>
      <parent>cm:content</parent>
      <mandatory-aspects>
        <aspect>cm:generalclassifiable</aspect>
      </mandatory-aspects>
    </type>
  </types>
  <aspects>
    <aspect name="rps:metadata">
      <title>RPS Metadata</title>
      <properties>
        <property name="rps:id">
          <type>d:text</type>
        </property>
        <property name="rps:reign">
          <type>d:text</type>
        </property>
        <property name="rps:session">
          <type>d:text</type>
        </property>
        <property name="rps:date">
          <type>d:int</type>
        </property>
      </properties>
    </aspect>
  </aspects>
</model>

With the following additions to web-client-config-custom.xml, I can add the RPS Metadata aspect to objects in Alfresco:


<alfresco-config>
  <config evaluator="aspect-name" condition="rps:metadata">
    <property-sheet>
      <show-property name="rps:id" display-label-id="rpsID"/>
      <show-property name="rps:reign" display-label-id="rpsReign"/>
      <show-property name="rps:session" display-label-id="rpsSession"/>
      <show-property name="rps:date" display-label-id="rpsDate"/>
    </property-sheet>
  </config>
  <config evaluator="string-compare" condition="Content Wizards">
    <content-types>
      <type name="rps:document"/>
    </content-types>
  </config>
  <config evaluator="string-compare" condition="Action Wizards">
    <aspects>
      <aspect name="rps:metadata"/>
    </aspects>
    <specialise-types>
      <type name="rps:document"/>
    </specialise-types>
  </config>
</alfresco-config>

In wcm-xml-metadata-extracter-context.xml, I have added an entry to the bean extracter.xml.sample.selector.XPathSelector:


<entry key="/TEI.2">
  <ref bean="extracter.TEIMetadataExtracter" />
</entry>

This should match the root element of my TEI XML files. Right?

In custom-metadata-extrators-context.xml, I've added the following bean:


<bean id="extracter.TEIMetadataExtracter" class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter" parent="baseMetadataExtracter" init-method="init" >
  <property name="mappingProperties">
    <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
      <property name="properties">
        <props>
          <prop key="namespace.prefix.rps">rps.ns</prop>
          <prop key="rpsID">rps:id</prop>
          <prop key="rpsReign">rps:reign</prop>
          <prop key="rpsSession">rps:session</prop>
          <prop key="rpsDate">rps:date</prop>
        </props>
      </property>
    </bean>
  </property>
  <property name="xpathMappingProperties">
    <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
      <property name="properties">
        <props>
          <!-- should there be some namespace prop here? -->
          <prop key="rpsID">/TEI.2/@id</prop>
          <prop key="rpsReign">/TEI.2/teiHeader/fileDesc/titleStmt/title/text()</prop>
          <prop key="rpsSession">/TEI.2/teiHeader/fileDesc/editionStmt/edition[@n='session']/text()</prop>
          <prop key="rpsDate">/TEI.2/teiHeader/fileDesc/publicationStmt/date/text()</prop>
        </props>
      </property>
    </bean>
  </property>
</bean>

Again, these XPath strings should match elements and attributes in my TEI XML documents, right?
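
One way to check this independently of Alfresco is to run the same expressions through the JDK's standard XPath API against a copy of the sample document. The sketch below does that. The class name `TeiXPathCheck` and its `evaluate` method are made up for illustration, and the body of `<text>` is left empty. A match here only shows that the expressions are valid for this document; it says nothing about whether Alfresco actually invokes the extracter.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Alfresco: evaluates the configured
// XPath mappings against a document using only the JDK.
class TeiXPathCheck {

    // Copy of the sample TEI document from the post, with the body of <text> left empty.
    static final String SAMPLE =
        "<TEI.2 id=\"_jamesi_t1428_7_1_d6_trans\" n=\"jamesi_trans\">"
      + "<teiHeader><fileDesc>"
      + "<titleStmt><title>Procedure: preamble</title></titleStmt>"
      + "<editionStmt><edition n=\"session\">jamesi_t1428_7_1_d2_trans</edition></editionStmt>"
      + "<publicationStmt><date>14280712</date></publicationStmt>"
      + "</fileDesc></teiHeader><text/></TEI.2>";

    // Returns a map from mapping key (rpsID, rpsReign, ...) to the string each XPath selects.
    static Map<String, String> evaluate(String xml) {
        Map<String, String> xpaths = new LinkedHashMap<>();
        xpaths.put("rpsID", "/TEI.2/@id");
        xpaths.put("rpsReign", "/TEI.2/teiHeader/fileDesc/titleStmt/title/text()");
        xpaths.put("rpsSession",
            "/TEI.2/teiHeader/fileDesc/editionStmt/edition[@n='session']/text()");
        xpaths.put("rpsDate", "/TEI.2/teiHeader/fileDesc/publicationStmt/date/text()");
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : xpaths.entrySet()) {
                // evaluate() returns the string value of the first selected node,
                // or an empty string when nothing matches.
                result.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath mappings", e);
        }
    }

    public static void main(String[] args) {
        evaluate(SAMPLE).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

An empty value for any key would point at a typo in the corresponding expression rather than at the Alfresco wiring.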

These XML config files are getting loaded, but to no effect. The custom metadata fields are not added upon ingest, nor are they populated when I add the custom aspect to the object and run the action for extracting metadata.

Do I have things in the right files? Are there other changes I should make to any of the config files? Is it a namespace issue? I made up a namespace for my custom aspect, but it isn't used in the TEI XML files I want to extract metadata from. These documents have no namespace. How does one handle that (an empty <prop> element caused an error)?
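
On the namespace question in general: in XPath 1.0, an unprefixed element name in an expression matches only elements that are in no namespace, so a path such as `/TEI.2/@id` needs no namespace declaration when the source documents have none. The sketch below shows this with the JDK's XPath API. The class name `NoNamespaceXPathDemo` and the namespace URI `urn:example:ns` are made up for illustration, and whether `XPathMetadataExtracter` needs anything beyond this is not something the example can show.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows how an unprefixed XPath step behaves
// for documents with and without a default namespace.
class NoNamespaceXPathDemo {

    // Evaluates /TEI.2/@id against the given XML; returns "" when nothing matches.
    static String idOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate("/TEI.2/@id", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Root element in no namespace: the unprefixed step matches it.
        System.out.println("[" + idOf("<TEI.2 id=\"a\"/>") + "]");
        // Root element in a (made-up) default namespace: the unprefixed step does not match.
        System.out.println("[" + idOf("<TEI.2 xmlns=\"urn:example:ns\" id=\"a\"/>") + "]");
    }
}
```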

Thanks.

(edit: my Alfresco version is Community - v3.3.0 (2765))
5 REPLIES

derek
Star Contributor
Hi,

At first glance, your config looks correct.  Where have you placed your *-context.xml files?
Do you get interesting stuff when you turn on debug?
   log4j.logger.org.alfresco.repo.content.metadata.xml=DEBUG

Regards

swithun
Champ in-the-making
My *-context.xml files are in /opt/alfresco/tomcat/shared/classes/alfresco/extension/, as is my model document. I think this is the right place for them.

When I turn on debugging, this is all that I get on startup:


10:15:06,981  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsDate to /TEI.2/teiHeader/fileDesc/publicationStmt/date/text()
10:15:06,982  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsSession to /TEI.2/teiHeader/fileDesc/editionStmt/edition[@n='session']/text()
10:15:06,982  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsReign to /TEI.2/teiHeader/fileDesc/titleStmt/title/text()
10:15:06,983  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsID to /TEI.2/@id
10:15:06,992  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from version to /model/version/text()
10:15:06,992  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from author to /model/author/text()
10:15:06,993  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from description to /model/description/text()
10:15:06,993  DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from title to /model/@name
10:15:06,999  WARN  [springframework.beans.GenericTypeAwarePropertyDescriptor] Invalid JavaBean property 'overwritePolicy' being accessed! Ambiguous write methods found next to actually used [public void org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter.setOverwritePolicy(java.lang.String)]: [public void org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter.setOverwritePolicy(org.alfresco.repo.content.metadata.MetadataExtracter$OverwritePolicy)]

This indicates that my custom stuff is being picked up. There isn't any more output if I ingest an XML document (and add the aspect which defines the custom metadata fields and run the extract common metadata action). Could the WARNing message be relevant? It has always been there, even before I started trying to customise things. And commenting out the offending bean property makes no difference, apart from removing the log file entry.

Namespace issues are always complicated. I have a NS defined for my custom aspect, type and metadata fields. But the actual documents I want to ingest have no namespace defined, not even an empty one.

I have read posts saying that you need to comment out the metadataExtracterRegistry property of the avmMetadataExtracter bean. I've tried this, with no effect.

Thanks for the quick reply. I'm sure I'm very close.

derek
Star Contributor
Hi,
The problem must lie with the wcm-xml-metadata-extracter-context.xml file, which is specifically tailored to work with WCM.  I'm guessing that the registry used there is not the correct one.

swithun
Champ in-the-making
In the end, I got what I wanted by writing a pair of Java classes. One extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter, and is registered as a bean in content-services-context.xml. The other extends org.xml.sax.helpers.DefaultHandler, and is called by the first to parse the XML files and return a HashMap of metadata properties which can then be put into rawProperties using putRawValue.

Probably not the most elegant way, but it works.
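
The classes themselves were not posted, so the following is only a rough sketch of what the second one (the SAX handler) might look like, written against the sample document and XPath mappings from the first post. The class name `TeiHeaderHandler`, the `parse` helper, and the use of the mapping keys `rpsID`, `rpsReign`, `rpsSession` and `rpsDate` as map keys are all assumptions. The Alfresco-side class that extends `AbstractMappingMetadataExtracter` is not shown; as described above, it would copy the returned values into the raw properties with `putRawValue`.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reconstruction, not the poster's actual class.
class TeiHeaderHandler extends DefaultHandler {
    private final Map<String, String> metadata = new HashMap<>();
    private final Deque<String> path = new ArrayDeque<>(); // open elements, root first
    private final StringBuilder text = new StringBuilder();
    private boolean sessionEdition; // true while inside <edition n="session">

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);
        if (path.size() == 1 && "TEI.2".equals(qName) && atts.getValue("id") != null) {
            metadata.put("rpsID", atts.getValue("id"));
        }
        // Assumes <edition> has no child elements, as in the sample document.
        sessionEdition = "edition".equals(qName) && "session".equals(atts.getValue("n"));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        switch (String.join("/", path)) {
            case "TEI.2/teiHeader/fileDesc/titleStmt/title":
                metadata.put("rpsReign", value);
                break;
            case "TEI.2/teiHeader/fileDesc/editionStmt/edition":
                if (sessionEdition) {
                    metadata.put("rpsSession", value);
                }
                break;
            case "TEI.2/teiHeader/fileDesc/publicationStmt/date":
                metadata.put("rpsDate", value); // left as a string; no conversion to int here
                break;
            default:
                break;
        }
        path.removeLast();
        text.setLength(0);
        sessionEdition = false;
    }

    // Convenience entry point: parses an XML string and returns the collected values.
    static Map<String, String> parse(String xml) {
        try {
            TeiHeaderHandler handler = new TeiHeaderHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.metadata;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse TEI document", e);
        }
    }
}
```

Matching on the full element path rather than on the element name alone keeps a `<title>` or `<date>` elsewhere in the document from overwriting the header values.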

derek
Star Contributor
@swithun
It seems elegant to me!  The generalized XML metadata extraction is quite nasty; I imagine that extracting specific values based on your expected XML is much neater - and deterministic.
Regards