Hyland Connect

kilo · 01-29-2010

Hello Gurus,

I'm trying to understand Alfresco's built-in XML metadata extraction, which, as I understand it, requires 3 configurations:

I'm trying to understand the design. Apparently, the selector class only peeks inside the XML (for validation?). It is the XPath extractor that does the real work. So why does the selector need to be configured? Why do I also need to provide the root of my XPath in the selector configuration, when I already provided it while mapping the XPath to a local variable?

I'm confused. Why is there a roundabout way of mapping a value extracted via XPath to a content property? Why do we need an intermediate mapping?

I'm trying to understand. I will appreciate any pointers.

derek · 02-01-2010

Hi,
The mimetype "XML" is really an infinitely variable document format; we can't rely on it to be anything except well-formed. The simplest way for the extractor to know what 'type' of XML it is dealing with is to "peek" into the document. The selector runs XPath statements until it gets a hit. It then passes the document to the corresponding XPathMetadataExtracter, which runs multiple XPath statements to extract values from the document. The extracted values are then passed through the normal mapping phase, which pushes the values into a form that will be sent for persistence.
The XmlMetadataExtracterTest extracts values from different types of XML: an Alfresco content model and an Eclipse project definition.
Regards
Derek
PS. Recent context 'subsystem' work added some extra complexity to the code.
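
To illustrate the flow described above (peek with a selector XPath, hand the document to the matching extractor, collect raw values), here is a minimal sketch using only the standard JAXP APIs. It is not Alfresco's implementation: the class `SelectorSketch`, the `WORKERS` table, and the `/book` document type are hypothetical names made up for the example. Only the `projectDescription` root mirrors a real Eclipse project file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class SelectorSketch {
    // Selector XPath -> (local name -> extraction XPath). Hypothetical configuration.
    static final Map<String, Map<String, String>> WORKERS = new LinkedHashMap<>();
    static {
        Map<String, String> eclipse = new LinkedHashMap<>();
        eclipse.put("name", "/projectDescription/name");
        eclipse.put("comment", "/projectDescription/comment");
        WORKERS.put("/projectDescription", eclipse);

        Map<String, String> book = new LinkedHashMap<>();
        book.put("title", "/book/title");
        WORKERS.put("/book", book);
    }

    // Returns the raw (unmapped) values, or an empty map if no selector matches.
    static Map<String, String> extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, Map<String, String>> worker : WORKERS.entrySet()) {
                // "Peek": does this selector XPath hit anything in the document?
                Object hit = xpath.evaluate(worker.getKey(), doc, XPathConstants.NODE);
                if (hit == null) {
                    continue; // not this type of XML, try the next selector
                }
                // First hit wins: run that extractor's XPath statements.
                Map<String, String> raw = new LinkedHashMap<>();
                for (Map.Entry<String, String> e : worker.getValue().entrySet()) {
                    raw.put(e.getKey(), xpath.evaluate(e.getValue(), doc));
                }
                return raw;
            }
            return new LinkedHashMap<>();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<projectDescription><name>demo</name><comment>test</comment></projectDescription>"));
        System.out.println(extract("<unknown/>"));
    }
}
```

The selector XPath and the extraction XPaths answer different questions ("which kind of XML is this?" versus "which values do I want from it?"), which is one way to read why both are configured even though they may share a root.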

kilo · 02-02-2010

Thanks, Derek. Your explanation of the intent of the XML selector is very good. Does the selector also validate the document (i.e. if a DOCTYPE is present)?

I also understand why there is a two-step mapping from extracted values to content properties (extracted value -> local variable -> content property). It provides an opportunity to transform the extracted value before assigning it to a content property.

Thanks.
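
The two-step mapping described above (extracted value -> local variable -> content property) can be sketched roughly as follows. This is only an illustration of the idea, not Alfresco's mapping code: `MappingSketch`, `mapToProperties`, and the `trim()` transform are hypothetical, and the property names in the usage example are used purely as sample targets.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MappingSketch {
    // raw:     local variable name -> value pulled out by XPath
    // mapping: local variable name -> content properties that should receive it
    static Map<String, String> mapToProperties(Map<String, String> raw,
                                               Map<String, List<String>> mapping) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            List<String> targets = mapping.get(e.getKey());
            if (targets == null) {
                continue; // extracted but unmapped values are simply not persisted
            }
            // The intermediate step is where a value can be transformed (here: trimmed).
            String value = e.getValue().trim();
            for (String target : targets) {
                properties.put(target, value);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("name", "  demo  ");
        raw.put("comment", "not mapped");

        Map<String, List<String>> mapping = new LinkedHashMap<>();
        mapping.put("name", Arrays.asList("cm:name", "cm:title"));

        System.out.println(mapToProperties(raw, mapping));
    }
}
```

Keeping the XPath statements and the property mapping separate means the same extracted value can feed more than one property, and the mapping can change without touching the XPath configuration.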

derek · 02-03-2010

Hi,
How strict the document builder is depends on the parser that Java chooses at runtime:

documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

We have xercesImpl-2.8.0.jar on our classpath by default.

Regards
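
For anyone wanting to check this on their own classpath, a small sketch follows. It prints which factory implementation JAXP picked and shows that a fresh `DocumentBuilderFactory` is non-validating unless `setValidating(true)` is called. The class `ParserCheck` and the helper `newBuilder` are hypothetical names for the example; whether Alfresco configures the factory differently elsewhere is not covered here.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class ParserCheck {
    // Builds a DocumentBuilder with DTD validation switched on or off.
    static DocumentBuilder newBuilder(boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            return factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser could not be configured", e);
        }
    }

    public static void main(String[] args) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Which implementation was chosen at runtime (JDK built-in, Xerces jar, ...)
        System.out.println("Factory: " + factory.getClass().getName());
        // JAXP default: validation is off for a newly created factory
        System.out.println("Validating by default: " + factory.isValidating());
        System.out.println("After setValidating(true): " + newBuilder(true).isValidating());
    }
}
```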


The meaning of XML extractor selector