Extracting XML meta data into aspects

samuel_penn
Champ in-the-making
Hi,

I'm currently trying to get my head around the XML Meta Data extractor as described at http://wiki.alfresco.com/wiki/Metadata_Extraction.

One thing that isn't made clear is what happens if the extractor is set up to write the data into a property which only exists in an aspect. Is the aspect automatically added to the document, or does it have to already exist on the document? I'd like to define an aspect which describes some of the form fields in the WCM form data, and have that aspect automatically added by the extractor as required. I can't see any mention of whether something needs to be set up to get it to happen, or whether it's impossible.

Having set up an extractor in wcm-xml-metadata-extracter-context.xml and switched on debug logging for metadata, I can see that my configuration is being picked up when the server starts:


16:27:55,613 DEBUG [content.metadata.AbstractMappingMetadataExtracter] Added mapping from atoz to [{http://www.centrom.com/alfresco/localgov/model}atoz]
16:27:55,629 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from atoz to /art:article/art:header/art:atoz/text()

However, when I save a suitable web form in WCM, I see the following in the logs:


16:29:18,083 DEBUG [content.metadata.MetadataExtracterRegistry] Finding extractors for text/xml
16:29:18,130 DEBUG [metadata.xml.XPathMetadataExtracter]
No working metadata extractor could be found:
   Document: ContentAccessor[ contentUrl=store://2008/9/29/16/29/cf7eb2e7-e0e5-4cca-972f-655a78f91e98.bin, mimetype=text/xml, size=760, encoding=UTF-8, locale=en_US]
16:29:18,130 DEBUG [metadata.xml.XPathMetadataExtracter]
XML metadata extractor redirected:
   Reader:    ContentAccessor[ contentUrl=store://2008/9/29/16/29/cf7eb2e7-e0e5-4cca-972f-655a78f91e98.bin, mimetype=text/xml, size=760, encoding=UTF-8, locale=en_US]
   Extracter: null
   Metadata: {
      {http://www.alfresco.org/model/content/1.0}name=metatest.xml,
      {http://www.alfresco.org/model/system/1.0}node-dbid=19105,
      {http://www.alfresco.org/model/system/1.0}store-identifier=hertsmere--admin--preview,
      {http://www.alfresco.org/model/wcmappmodel/1.0}orginalparentpath=hertsmere--admin--preview:/www/avm_webapps/ROOT,
      {http://www.alfresco.org/model/content/1.0}content=contentUrl=store://2008/9/29/16/29/cf7eb2e7-e0e5-4cca-972f-655a78f91e98.bin|mimetype=text/xml|size=760|encoding=UTF-8|locale=en_US_,
      {http://www.alfresco.org/model/content/1.0}owner=admin,
      {http://www.alfresco.org/model/content/1.0}title={en_US=metatest.xml},
      {http://www.alfresco.org/model/content/1.0}modified=Mon Sep 29 16:29:17 BST 2008,
      {http://www.alfresco.org/model/system/1.0}node-uuid=UNKNOWN,
      {http://www.alfresco.org/model/wcmappmodel/1.0}parentformname=web-article,
      {http://www.alfresco.org/model/content/1.0}created=Mon Sep 29 16:29:17 BST 2008,
      {http://www.alfresco.org/model/system/1.0}store-protocol=avm,
      {http://www.alfresco.org/model/content/1.0}creator=admin,
      {http://www.alfresco.org/model/content/1.0}modifier=admin,
      {http://www.alfresco.org/model/wcmappmodel/1.0}renditions=[/www/avm_webapps/ROOT/metatest.jsp]}

The 'no working metadata extractor could be found' suggests that it's not actually finding the extractor. I also had the impression that the extraction only happened when the form content was published to the staging sandbox - this debug is appearing when I save the form content in the user's sandbox, and I get no metadata debug at all when the form content is pushed to staging.
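
As a rough illustration of how a selector of this kind can end up with "Extracter: null", here is a minimal sketch using only the JDK's XPath support. It is not the Alfresco XPathContentWorkerSelector; the class name SelectorSketch, the worker names and the choice to treat an unresolvable prefix as "no match" are all assumptions made for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: tries each XPath key in order, returns the first worker that matches.
class SelectorSketch {

    static String select(String xml, Map<String, String> workersByXPath) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> entry : workersByXPath.entrySet()) {
            try {
                // A node-set converts to true when it contains at least one node.
                Boolean matched = (Boolean) xpath.evaluate(
                        entry.getKey(), doc, XPathConstants.BOOLEAN);
                if (matched) {
                    return entry.getValue();
                }
            } catch (XPathExpressionException | RuntimeException e) {
                // Assumption of this sketch: an expression that cannot be evaluated
                // (for example an unbound prefix) simply counts as "no match".
            }
        }
        return null; // nothing matched, so there is no worker to delegate to
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> workers = new LinkedHashMap<>();
        workers.put("/art:article", "articleExtracter");
        System.out.println(select("<article/>", workers));
    }
}
```

If no selector expression matches the document, there is nothing to redirect to, which is consistent with the "Extracter: null" line in the log above.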

Looking at any version of the metadata.xml file in the node browser shows that no aspect has been added, and no metadata has been added.

The meta data extraction config I'm using is below - could anyone tell me if it looks sensible?


   <bean id="extracter.xml.centrom.ArticleModelMetadataExtracter"
         class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter"
         parent="baseMetadataExtracter"
         init-method="init" >
      <property name="mappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="properties">
               <props>
                  <prop key="namespace.prefix.lg">http://www.centrom.com/alfresco/localgov/model</prop>
                  <prop key="atoz">lg:atoz</prop>
               </props>
            </property>
         </bean>
      </property>
     
      <property name="xpathMappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="properties">
               <props>
                  <prop key="namespace.prefix.art">http://www.centrom.com/localgov/wcm/article</prop>
                  <prop key="atoz">/art:article/art:header/art:atoz/text()</prop>
               </props>
            </property>
         </bean>
      </property>
   </bean>
  
  
   <!--
      This selector examines the XML documents, executing the given XPath statements until a
      match is made.
   -->
   <bean id="extracter.xml.centrom.selector.XPathSelector"
         class="org.alfresco.repo.content.selector.XPathContentWorkerSelector"
         init-method="init">
      <property name="workers">
         <map>
            <entry key="/art:article">
               <ref bean="extracter.xml.centrom.ArticleModelMetadataExtracter" />
            </entry>
         </map>
      </property>
   </bean>
  
   <bean id="extracter.xml.centrom.XMLMetadataExtracter"
         class="org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter"
         parent="baseMetadataExtracter">

      <property name="registry">
         <ref bean="avmMetadataExtracterRegistry" />
      </property>

      <property name="overwritePolicy">
         <value>EAGER</value>
      </property>
      <property name="selectors">
         <list>
            <ref bean="extracter.xml.centrom.selector.XPathSelector" />
         </list>
      </property>
   </bean>

My aspect is defined as follows:


   <namespaces>
      <namespace uri="http://www.centrom.com/alfresco/localgov/model" prefix="lg"/>
   </namespaces>
  
    <aspects>
        <aspect name="lg:article">
            <title>Article Aspect</title>
            <properties>
                <property name="lg:atoz">
                    <type>d:text</type>
                </property>
            </properties>
        </aspect>
    </aspects>


Thanks,
Sam.
17 REPLIES 17

pmonks
Star Contributor
what happens if the extractor is set up to write the data into a property field which only exists in an aspect? Is the aspect automatically added to the document, or does it have to already exist on the document?
This is exactly how it's intended to work, and in this case the aspect will get applied automatically by the XPath Metadata Extractor - it determines which aspect(s) to apply based on the target properties defined in the "mappingProperties" Spring bean property.
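
The idea of deriving aspects from target properties can be sketched in plain Java. This is only a toy model, not Alfresco code: the class AspectLookupSketch and its hard-coded map stand in for a real data dictionary lookup, and the QNames are taken from the configuration posted earlier in this thread.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: work out which aspects are needed for a set of target properties.
class AspectLookupSketch {

    static final String LG = "{http://www.centrom.com/alfresco/localgov/model}";

    // Toy dictionary: property QName -> QName of the aspect that defines it.
    // Properties not defined by any aspect are simply absent from the map.
    static final Map<String, String> ASPECT_BY_PROPERTY = new HashMap<>();
    static {
        ASPECT_BY_PROPERTY.put(LG + "atoz", LG + "article");
    }

    static Set<String> requiredAspects(Collection<String> targetProperties) {
        Set<String> aspects = new LinkedHashSet<>();
        for (String property : targetProperties) {
            String aspect = ASPECT_BY_PROPERTY.get(property);
            if (aspect != null) {
                aspects.add(aspect); // the set removes duplicates automatically
            }
        }
        return aspects;
    }

    public static void main(String[] args) {
        System.out.println(requiredAspects(Collections.singletonList(LG + "atoz")));
    }
}
```

In this model, mapping "atoz" to lg:atoz is enough to conclude that lg:article has to be applied, without the aspect being named anywhere in the extractor configuration.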

Also, is that the entire Spring configuration?  If so, it appears that it's missing the metadata extractor registry Spring bean that enables metadata extraction for WCM content (see http://wiki.alfresco.com/wiki/Metadata_Extraction#Activating_Meta-data_Extraction_for_WCM for details).

Cheers,
Peter

samuel_penn
Champ in-the-making
what happens if the extractor is set up to write the data into a property field which only exists in an aspect? Is the aspect automatically added to the document, or does it have to already exist on the document?
This is exactly how it's intended to work, and in this case the aspect will get applied automatically by the XPath Metadata Extractor - it determines which aspect(s) to apply based on the target properties defined in the "mappingProperties" Spring bean property.

Okay, excellent. I figured that it could be doing something like that, but wasn't sure.

Any idea as to whether it gets applied whenever form content is saved to a user's sandbox, or just when it gets pushed to staging?

Also, is that the entire Spring configuration?  If so, it appears that it's missing the metadata extractor registry Spring bean that enables metadata extraction for WCM content (see http://wiki.alfresco.com/wiki/Metadata_Extraction#Activating_Meta-data_Extraction_for_WCM for details).

No it's not - I only included the bits I changed, which was to replace the sample configs in the existing file. The avmMetadataExtracterRegistry, avmNodeService and avmMetadataExtracter are also defined at the top of the file.

Thanks,
Sam.

samuel_penn
Champ in-the-making
I've now got it working - I've gone back to removing all the namespaces from the various XML paths, and it is populating the properties in the user's sandbox and adding the aspect automatically. I've included the working definitions below just in case anyone finds them useful as another example.


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<!--
   Sample configuration of a XmlMetadataExtracter in use within the WCM environment.
  
   This shows how XML metadata extraction can be set up to extract metadata from different
   formats of XML.  It also shows how metadata can be extracted in WCM projects.
  
   Since: 2.1
   Author: Derek Hulley
-->
<beans>
  
   <!--
      In order to limit the number of extractors active for Web Content Management, a separate registry
      must be created for the extractors.
   -->
   <bean id="avmMetadataExtracterRegistry" class="org.alfresco.repo.content.metadata.MetadataExtracterRegistry" />
  
   <!--
      Configure the AVM services to broadcast the content update notifications.
   -->
   <bean id="avmNodeService" class="org.alfresco.repo.avm.AVMNodeService" init-method="init">
        <property name="dictionaryService">
            <ref bean="dictionaryService"/>
        </property>
        <property name="avmService">
            <ref bean="avmLockingAwareService"/>
        </property>
        <property name="policyComponent">
            <ref bean="policyComponent"/>
        </property>
        <property name="invokePolicies">
            <value>true</value>
        </property>
    </bean>
   <bean id="avmMetadataExtracter" class="org.alfresco.repo.avm.AvmMetadataExtracter" init-method="init">
      <property name="policyComponent">
         <ref bean="policyComponent"/>
      </property>
      <property name="extracterAction">
         <bean class="org.alfresco.repo.action.executer.ContentMetadataExtracter" >
            <property name="dictionaryService">
               <ref bean="dictionaryService"/>
            </property>
            <property name="nodeService">
               <ref bean="avmNodeService" />
            </property>
            <property name="contentService">
               <ref bean="contentService" />
            </property>
            <property name="metadataExtracterRegistry">
               <ref bean="avmMetadataExtracterRegistry" />
            </property>
            <property name="carryAspectProperties">
               <value>true</value>
            </property>
         </bean>
      </property>
   </bean>

   <bean id="extracter.xml.centrom.ArticleModelMetadataExtracter"
         class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter"
         parent="baseMetadataExtracter"
         init-method="init" >
      <property name="mappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="properties">
               <props>
                  <prop key="namespace.prefix.lg">http://www.centrom.com/alfresco/localgov/model</prop>
                  <prop key="atoz">lg:atoz</prop>
               </props>
            </property>
         </bean>
      </property>
     
      <property name="xpathMappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="properties">
               <props>
                  <prop key="namespace.prefix.art">http://www.centrom.com/localgov/wcm/article</prop>
                  <prop key="atoz">/article/header/atoz/text()</prop>
               </props>
            </property>
         </bean>
      </property>
   </bean>
  
  
   <!--
      This selector examines the XML documents, executing the given XPath statements until a
      match is made.
   -->
   <bean id="extracter.xml.centrom.selector.XPathSelector"
         class="org.alfresco.repo.content.selector.XPathContentWorkerSelector"
         init-method="init">
      <property name="workers">
         <map>
            <entry key="/article">
               <ref bean="extracter.xml.centrom.ArticleModelMetadataExtracter" />
            </entry>
         </map>
      </property>
   </bean>
  
   <bean id="extracter.xml.centrom.XMLMetadataExtracter"
         class="org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter"
         parent="baseMetadataExtracter">

      <property name="registry">
         <ref bean="avmMetadataExtracterRegistry" />
      </property>

      <property name="overwritePolicy">
         <value>EAGER</value>
      </property>
      <property name="selectors">
         <list>
            <ref bean="extracter.xml.centrom.selector.XPathSelector" />
         </list>
      </property>
   </bean>
</beans>

Thanks,
Sam.
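
For anyone wondering why dropping the prefixes could make a difference: a likely explanation (an assumption, since the generated form XML is not shown in this thread) is that the XML elements were not in the art namespace. In XPath 1.0 a prefixed step such as art:article only matches elements in the namespace bound to that prefix, and an unprefixed step only matches elements in no namespace. The sketch below shows this with the JDK's XPath API; the class name, the sample documents and the value "Bins" are made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: the same XPath expressions against plain and namespaced XML.
class XPathNamespaceDemo {

    static final String ART_NS = "http://www.centrom.com/localgov/wcm/article";

    // Evaluates the expression with the prefix "art" bound to ART_NS.
    // Returns the empty string when nothing matches.
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "art".equals(prefix) ? ART_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return ART_NS.equals(uri) ? "art" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return ART_NS.equals(uri)
                        ? Collections.singletonList("art").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String plain = "<article><header><atoz>Bins</atoz></header></article>";
        String qualified = "<article xmlns=\"" + ART_NS + "\">"
                + "<header><atoz>Bins</atoz></header></article>";
        String unprefixed = "/article/header/atoz/text()";
        String prefixed = "/art:article/art:header/art:atoz/text()";

        System.out.println("[" + evaluate(plain, unprefixed) + "]");     // matches
        System.out.println("[" + evaluate(plain, prefixed) + "]");       // no match
        System.out.println("[" + evaluate(qualified, prefixed) + "]");   // matches
        System.out.println("[" + evaluate(qualified, unprefixed) + "]"); // no match
    }
}
```

Running a form's saved XML through something like this is a quick way to check which form of the expression fits before changing the Spring configuration.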

steventux
Champ in-the-making
Hi Sam,
I am a bit lost with the configuration steps needed to enable WCM Metadata Extraction.
Following the Wiki, I've activated wcm-xml-metadata-extracter-context.xml (I'm assuming this is done by renaming the file from *.xml.sample to *.xml; I haven't changed its directory location).
I've adapted the example you gave in this topic and changed the logging level for the XPathMetadataExtracter to debug so I should see content being processed.
I'm wondering if I am missing something vital as I see no log activity like the output you were getting (I see [metadata.xml.XPathMetadataExtracter] preRegister called).
Do I need to activate the custom-repository-context.xml and register the extracter there as well?
Also, I understand the XPath parts of the configuration, as they are self-explanatory, but I am unsure how to alter the mappingProperties suitably. Is this simply a case of assigning an abbreviation to the full schema URI for later use in searching?
Thanks
Steve

steventux
Champ in-the-making
Sorry, please ignore these questions. Something very basic was wrong: the SVN build script wasn't copying the wcm-xml-metadata-extracter-context.xml file into the webapp.

samuel_penn
Champ in-the-making
All the above seems to be working okay, however I've now added a date field to my aspect which I need to be set to be the current date[1] when the aspect is added.

To do this, I've created a Java class which implements NodeServicePolicies.OnAddAspectPolicy and sets the property to the current date/time when the aspect is first applied. After adding a bean definition pointing to this class in custom-model-context.xml, I tested this in the DM side of Alfresco and it works as expected: adding the aspect causes the date to be set to the current date.

However, when the meta-data extractor adds the aspect in WCM, this code isn't run, and the date property isn't set. Is there something else that needs to be configured to get the policy to fire in WCM?

Thanks,
Sam.

[1] It needs to be writable, so though it will normally be the created date I need my own property. Being able to set a default value for a date property of 'now' would also work (and be easier), but as far as I can tell, this isn't possible.

pmonks
Star Contributor
The AVM doesn't support behavioural aspects, so the best way to accomplish this is to add the current date into the XML (via a dynamic include) and then extract it (via XML Metadata Extractor) into the aspect.

More generally, it's best not to think of AVM aspects as anything more than an intermediate, temporary storage area between XML element values and the Lucene indexes.  In fact in a future release we hope to do away with all this aspect / XML Metadata Extractor stuff for Web Form backed content, and instead allow the indexing behaviour to be configured directly in the Web Form XSD (via annotations, much the same way that we allow the Web Form UI widgets to be configured).

Cheers,
Peter

samuel_penn
Champ in-the-making
The AVM doesn't support behavioural aspects, so the best way to accomplish this is to add the current date into the XML (via a dynamic include) and then extract it (via XML Metadata Extractor) into the aspect.

I had moved to this way of doing things because I couldn't see a way of getting the current date into the XML. I hadn't thought of doing a dynamic include - presumably this would set a default value for the form element. Seems a bit of a messy way of doing it however.

Forms also don't seem to support datetimes, which was another (less serious) downside to sticking the data into the form.

More generally, it's best not to think of AVM aspects as anything more than an intermediate, temporary storage area between XML element values and the Lucene indexes.  In fact in a future release we hope to do away with all this aspect / XML Metadata Extractor stuff for Web Form backed content, and instead allow the indexing behaviour to be configured directly in the Web Form XSD (via annotations, much the same way that we allow the Web Form UI widgets to be configured).

In this case it was just a workaround for the (apparent) lack of good date support in web forms. Annotations to control indexing (plus maybe auto-population of field values from data sources?) would be nice however.

I'll give the dynamic include a try.

Thanks,
Sam.

samuel_penn
Champ in-the-making
That seems to work, with one problem - because Web Forms don't allow datetime values, I have it stored in two different fields (one date, one time). Is it possible to combine these two fields into a single datetime on the aspect during the metadata extraction?

Thanks,
Sam.
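
Whether the stock extractor can merge two fields is not answered in this thread. If the combination were done in a custom step, the merge itself could look like the sketch below. It assumes the two form fields hold xs:date and xs:time style values without a time zone (for example 2008-09-29 and 16:29:17); the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: join a date string and a time string into one ISO-8601 datetime.
class DateTimeCombineSketch {

    static String combine(String date, String time) {
        // Both parse calls expect ISO formats and throw DateTimeParseException otherwise.
        LocalDateTime combined = LocalDateTime.of(LocalDate.parse(date), LocalTime.parse(time));
        return combined.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(combine("2008-09-29", "16:29:17")); // prints 2008-09-29T16:29:17
    }
}
```

Values carrying a time zone offset would need OffsetTime or similar instead of LocalTime, so the input format is worth checking against what the web form actually saves.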