XML metadata extraction
04-22-2010 10:20 AM
I know this question gets asked a lot, and I've read the other threads, but without success. Most other people seemed to be trying to get XML metadata extraction within the web content management system. I don't know if my situation is significantly different.
I would like to be able to import multiple TEI XML files into the system using Alfresco mounted as a network drive, have them recognised as TEI XML files, and have various bits of information extracted and put into some custom metadata fields. I have the network drive working, and I have created an aspect which defines the custom metadata fields. What I don't have working is the metadata extraction.
My TEI XML files look like this:
<TEI.2 id="_jamesi_t1428_7_1_d6_trans" n="jamesi_trans">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Procedure: preamble</title>
      </titleStmt>
      <editionStmt>
        <edition n="session">jamesi_t1428_7_1_d2_trans</edition>
      </editionStmt>
      <publicationStmt>
        <date>14280712</date>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>…</text>
</TEI.2>
(no XML processing instruction or DTD and no namespaces)
The model for my data looks like this:
<model name="rps:rpsModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">
  …
  <namespaces>
    <namespace uri="rps.ns" prefix="rps"/>
  </namespaces>
  <types>
    <type name="rps:document">
      <title>RPS document</title>
      <parent>cm:content</parent>
      <mandatory-aspects>
        <aspect>cm:generalclassifiable</aspect>
      </mandatory-aspects>
    </type>
  </types>
  <aspects>
    <aspect name="rps:metadata">
      <title>RPS Metadata</title>
      <properties>
        <property name="rps:id">
          <type>d:text</type>
        </property>
        <property name="rps:reign">
          <type>d:text</type>
        </property>
        <property name="rps:session">
          <type>d:text</type>
        </property>
        <property name="rps:date">
          <type>d:int</type>
        </property>
      </properties>
    </aspect>
  </aspects>
</model>
With the following additions to web-client-config-custom.xml, I can add the RPS Metadata aspect to objects in Alfresco:
<alfresco-config>
  <config evaluator="aspect-name" condition="rps:metadata">
    <property-sheet>
      <show-property name="rps:id" display-label-id="rpsID"/>
      <show-property name="rps:reign" display-label-id="rpsReign"/>
      <show-property name="rps:session" display-label-id="rpsSession"/>
      <show-property name="rps:date" display-label-id="rpsDate"/>
    </property-sheet>
  </config>
  <config evaluator="string-compare" condition="Content Wizards">
    <content-types>
      <type name="rps:document"/>
    </content-types>
  </config>
  <config evaluator="string-compare" condition="Action Wizards">
    <aspects>
      <aspect name="rps:metadata"/>
    </aspects>
    <specialise-types>
      <type name="rps:document"/>
    </specialise-types>
  </config>
</alfresco-config>
In wcm-xml-metadata-extracter-context.xml, I have added an entry to the bean extracter.xml.sample.selector.XPathSelector:
<entry key="/TEI.2">
  <ref bean="extracter.TEIMetadataExtracter" />
</entry>
This should match the root element of my TEI XML files. Right?
In custom-metadata-extrators-context.xml, I've added the following bean:
<bean id="extracter.TEIMetadataExtracter"
      class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter"
      parent="baseMetadataExtracter"
      init-method="init">
  <property name="mappingProperties">
    <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
      <property name="properties">
        <props>
          <prop key="namespace.prefix.rps">rps.ns</prop>
          <prop key="rpsID">rps:id</prop>
          <prop key="rpsReign">rps:reign</prop>
          <prop key="rpsSession">rps:session</prop>
          <prop key="rpsDate">rps:date</prop>
        </props>
      </property>
    </bean>
  </property>
  <property name="xpathMappingProperties">
    <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
      <property name="properties">
        <props>
          <!-- should there be some namespace prop here? -->
          <prop key="rpsID">/TEI.2/@id</prop>
          <prop key="rpsReign">/TEI.2/teiHeader/fileDesc/titleStmt/title/text()</prop>
          <prop key="rpsSession">/TEI.2/teiHeader/fileDesc/editionStmt/edition[@n='session']/text()</prop>
          <prop key="rpsDate">/TEI.2/teiHeader/fileDesc/publicationStmt/date/text()</prop>
        </props>
      </property>
    </bean>
  </property>
</bean>
Again, these XPath strings should match elements and attributes in my TEI XML documents, right?
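One way to sanity-check the XPath strings independently of Alfresco is to evaluate them with the JDK's own XPath engine against the sample document. The sketch below is only that: the class name TeiXPathCheck and its helper are made up for illustration, the elided <text> body is replaced by an empty element, and a match here does not prove that Alfresco's extracter is being invoked. It does rely on one standard XPath 1.0 rule: an unprefixed name in an expression matches elements in no namespace, so no namespace declaration is needed for these documents in plain XPath.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Alfresco: evaluates the mapping XPaths with the JDK.
class TeiXPathCheck {

    // The sample document from the post; the elided <text> body is left empty.
    static final String SAMPLE =
          "<TEI.2 id=\"_jamesi_t1428_7_1_d6_trans\" n=\"jamesi_trans\">"
        + "<teiHeader><fileDesc>"
        + "<titleStmt><title>Procedure: preamble</title></titleStmt>"
        + "<editionStmt><edition n=\"session\">jamesi_t1428_7_1_d2_trans</edition></editionStmt>"
        + "<publicationStmt><date>14280712</date></publicationStmt>"
        + "</fileDesc></teiHeader><text/></TEI.2>";

    // Parses the XML string and returns the string value of the first node the expression selects.
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Same keys and expressions as in xpathMappingProperties above.
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("rpsID", "/TEI.2/@id");
        mappings.put("rpsReign", "/TEI.2/teiHeader/fileDesc/titleStmt/title/text()");
        mappings.put("rpsSession",
                "/TEI.2/teiHeader/fileDesc/editionStmt/edition[@n='session']/text()");
        mappings.put("rpsDate", "/TEI.2/teiHeader/fileDesc/publicationStmt/date/text()");

        for (Map.Entry<String, String> mapping : mappings.entrySet()) {
            System.out.println(mapping.getKey() + " = " + evaluate(SAMPLE, mapping.getValue()));
        }
    }
}
```

If every key prints a non-empty value, the expressions themselves are sound and the problem is more likely in how the extracter is registered or selected.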
These XML config files are getting loaded, but to no effect. The custom metadata fields are not added upon ingest, nor are they populated when I add the custom aspect to the object and run the action for extracting metadata.
Do I have things in the right files? Are there other changes I should make to any of the config files? Is it a namespace issue? I made up a namespace for my custom aspect, but it isn't used in the TEI XML files I want to extract metadata from. These documents have no namespace. How does one handle that (an empty <prop> element caused an error)?
Thanks.
(edit: my Alfresco version is Community - v3.3.0 (2765))
5 REPLIES
04-22-2010 01:39 PM
Hi,
At first glance, your config looks correct. Where have you placed your *-context.xml files?
Do you get interesting stuff when you turn on debug?
log4j.logger.org.alfresco.repo.content.metadata.xml=DEBUG
Regards
04-23-2010 05:50 AM
My *-context.xml files are in /opt/alfresco/tomcat/shared/classes/alfresco/extension/, as is my model document. I think this is the right place for them.
When I turn on debugging, this is all that I get on startup:
10:15:06,981 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsDate to /TEI.2/teiHeader/fileDesc/publicationStmt/date/text()
10:15:06,982 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsSession to /TEI.2/teiHeader/fileDesc/editionStmt/edition[@n='session']/text()
10:15:06,982 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsReign to /TEI.2/teiHeader/fileDesc/titleStmt/title/text()
10:15:06,983 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from rpsID to /TEI.2/@id
10:15:06,992 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from version to /model/version/text()
10:15:06,992 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from author to /model/author/text()
10:15:06,993 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from description to /model/description/text()
10:15:06,993 DEBUG [metadata.xml.XPathMetadataExtracter] Added mapping from title to /model/@name
10:15:06,999 WARN [springframework.beans.GenericTypeAwarePropertyDescriptor] Invalid JavaBean property 'overwritePolicy' being accessed! Ambiguous write methods found next to actually used [public void org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter.setOverwritePolicy(java.lang.String)]: [public void org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter.setOverwritePolicy(org.alfresco.repo.content.metadata.MetadataExtracter$OverwritePolicy)]
This indicates that my custom mappings are being picked up. There is no further output when I ingest an XML document, add the aspect which defines the custom metadata fields, and run the extract common metadata action. Could the WARN message be relevant? It has always been there, even before I started trying to customise things, and commenting out the offending bean property makes no difference apart from removing the log file entry.
Namespace issues are always complicated. I have a namespace defined for my custom aspect, type and metadata fields, but the actual documents I want to ingest have no namespace defined, not even an empty one.
I have read posts saying that you need to comment out the metadataExtracterRegistry property of the avmMetadataExtracter bean. I've tried this, with no effect.
Thanks for the quick reply. I'm sure I'm very close.
04-23-2010 07:25 AM
Hi,
The problem must lie with the wcm-xml-metadata-extracter-context.xml file, which is specifically tailored to work with WCM. I'm guessing that the registry used there is not the correct one.
05-12-2010 05:02 AM
In the end, I got what I wanted by writing a pair of Java classes. One extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter, and is registered as a bean in content-services-context.xml. The other extends org.xml.sax.helpers.DefaultHandler, and is called by the first to parse the XML files and return a HashMap of metadata properties which can then be put into rawProperties using putRawValue.
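The SAX half of that approach might look roughly like the sketch below. It is a reconstruction under assumptions, not the poster's actual code: the class name TeiHeaderHandler, the map keys (borrowed from the mapping keys earlier in the thread) and the parse helper are illustrative, and the Alfresco-side extracter class that would pass these values to putRawValue is omitted because its code is not shown in the thread. Unlike the XPath expressions, this handler is not fully path-aware; it simply keeps the first title and date found inside teiHeader.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects a few TEI header values into a map.
class TeiHeaderHandler extends DefaultHandler {

    private final Map<String, String> values = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHeader;        // true while inside <teiHeader>
    private boolean sessionEdition;  // true while inside <edition n="session">

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);  // start collecting character data afresh for this element
        if ("TEI.2".equals(qName)) {
            String id = atts.getValue("id");
            if (id != null) {
                values.put("rpsID", id);
            }
        } else if ("teiHeader".equals(qName)) {
            inHeader = true;
        } else if ("edition".equals(qName)) {
            sessionEdition = "session".equals(atts.getValue("n"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if ("teiHeader".equals(qName)) {
            inHeader = false;
        } else if (inHeader && "title".equals(qName)) {
            values.putIfAbsent("rpsReign", value);   // first header title only
        } else if (inHeader && "edition".equals(qName) && sessionEdition) {
            values.put("rpsSession", value);
            sessionEdition = false;
        } else if (inHeader && "date".equals(qName)) {
            values.putIfAbsent("rpsDate", value);    // first header date only
        }
        text.setLength(0);
    }

    Map<String, String> getValues() {
        return values;
    }

    // Convenience entry point: parse an XML string and return the collected values.
    static Map<String, String> parse(String xml) throws Exception {
        TeiHeaderHandler handler = new TeiHeaderHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getValues();
    }
}
```

The default SAXParserFactory is not namespace-aware, so the handler compares qName rather than localName, which suits documents like these that declare no namespace.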
Probably not the most elegant way, but it works.
05-12-2010 06:26 AM
@swithum
It seems elegant to me! The generalized XML metadata extraction is quite nasty; I imagine that extracting specific values based on your expected XML is much neater - and deterministic.
Regards
