topic The meaning of XML extractor selector in Alfresco Archive

The meaning of XML extractor selector

kilo — Fri, 29 Jan 2010 23:26:56 GMT

Hello Gurus,

I'm trying to understand Alfresco's built-in XML metadata extraction, which I understand requires 3 configurations:

1. Configure the selector class (where I set the "worker" property)
2. Map a local variable to a content type property, where the extracted value will go
3. Map the local

Re: The meaning of XML extractor selector

derek — Mon, 01 Feb 2010 16:26:42 GMT

Hi,
The mimetype "XML" is really an infinitely variable document format; we can't rely on it to be anything except well-formed. The simplest way for the extractor to know what 'type' of XML it is dealing with is to "peek" into the document. The selector runs XPath statements until it gets a hit. It then passes the document to the corresponding XPathMetadataExtracter, which runs multiple XPath statements to extract values from the document. The extracted values are then passed through the normal mapping phase, which pushes the values into a form that will be sent for persistence.
The XmlMetadataExtracterTest extracts values from two different types of XML: an Alfresco content model and an Eclipse project definition.
Regards
Derek
PS. Recent context 'subsystem' work added some extra complexity to the code.
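
To make the "peek" step concrete, here is a minimal sketch of the selection idea using only the JDK's XPath API. It is not Alfresco's actual selector code: the class `SelectorSketch`, the method `selectWorker`, the XPath expressions and the worker names are all hypothetical, chosen only to mirror the two document types mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class SelectorSketch {

    /**
     * Hypothetical helper: tries each XPath in insertion order and returns the
     * worker name registered for the first one that matches at least one node.
     * Returns null when no XPath gets a hit.
     */
    public static String selectWorker(String xml, Map<String, String> workersByXPath) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> entry : workersByXPath.entrySet()) {
                NodeList hits = (NodeList) xpath.evaluate(entry.getKey(), doc, XPathConstants.NODESET);
                if (hits.getLength() > 0) {
                    return entry.getValue(); // first hit wins
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not inspect XML document", e);
        }
    }

    public static void main(String[] args) {
        // Assumed selector configuration: XPath -> name of the extracter to use.
        Map<String, String> workers = new LinkedHashMap<>();
        workers.put("/model", "alfrescoModelExtracter");
        workers.put("/projectDescription", "eclipseProjectExtracter");

        String eclipseProject = "<projectDescription><name>demo</name></projectDescription>";
        System.out.println(selectWorker(eclipseProject, workers));
    }
}
```

The `LinkedHashMap` keeps the XPath statements in a fixed order, so "until it gets a hit" is deterministic. The sketch uses namespace-free sample documents; documents with namespaces would need a namespace-aware builder and a `NamespaceContext`.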

Re: The meaning of XML extractor selector

kilo — Tue, 02 Feb 2010 16:19:25 GMT

Thanks, Derek. Your explanation of the intent of the XML selector is very good. Does the selector process also validate the document (i.e. if a DOCTYPE is present)?

I also understand why there is a two-step mapping from extracted values to content properties (extracted value -> local variable -> content property). It provides an opportunity to transform the extracted value before assigning it to a content property.

Thanks.

Re: The meaning of XML extractor selector

derek — Wed, 03 Feb 2010 11:41:59 GMT

Hi,
How strict the document builder is depends on the parser that Java chooses at runtime:

documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

We have xercesImpl-2.8.0.jar on our classpath by default.

Regards
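
As a rough illustration of the validation question, the sketch below compares a factory left at its defaults (as in the line quoted above) with one where `setValidating(true)` has been called. In JAXP, `DocumentBuilderFactory` is non-validating unless told otherwise, so a document that breaks its own DTD should still parse with the default settings. The class `ValidationSketch`, the method `parses` and the sample document are hypothetical, and the sketch runs against whichever parser the JDK selects, which may behave differently from the Xerces jar mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ValidationSketch {

    /**
     * Hypothetical helper: returns true if the document parses without errors.
     * With validating == false only well-formedness is enforced; with
     * validating == true, DTD violations are reported as errors.
     */
    public static boolean parses(String xml, boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Turn validation errors into exceptions instead of stderr output.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed, but <from> is not allowed by the internal DTD.
        String invalid = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>"
                + "<note><from>x</from></note>";
        System.out.println("default factory:    " + parses(invalid, false));
        System.out.println("validating factory: " + parses(invalid, true));
    }
}
```

Even in non-validating mode a parser may still read the DOCTYPE (for example to resolve entities), so "not validated" does not necessarily mean "DOCTYPE ignored".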