topic Re: Indexing of web content in Alfresco Archive

Indexing of web content

mark_smithson — Sun, 05 Aug 2007 16:28:01 GMT

It seems that the indexing of content (rather than custom attributes) for XML documents is not very intelligent. The tag names are included in the index, meaning that a search which includes one of the tag names will return all the documents of that type. Is there a way of changing this behaviour?

Re: Indexing of web content

andy — Mon, 06 Aug 2007 15:07:24 GMT

Hi

See http://wiki.alfresco.com/wiki/Metadata_Extraction#XML_Metadata_Extraction.

The idea is to pull out the data you want as metadata. There is no way to select tokenisation by mimetype, so you cannot have XML tokenised by a specific Lucene tokeniser.

Andy

Re: Indexing of web content

mark_smithson — Mon, 06 Aug 2007 19:32:32 GMT

Ah,

So if we had a number of elements whose content we wanted indexed, we could extract that using XPath unions and map it to the cm:content property.

Is that what you mean, or am I off track?

Re: Indexing of web content

andy — Fri, 31 Aug 2007 15:14:15 GMT

Hi

Create your own aspect to hold the extracted metadata in properties. Use XPath expressions to map XML elements to these properties. You could use one catch-all property or several; it depends on what you want to do. The properties are likely to be of type d:text.

You cannot extract metadata into properties of type d:content.
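As a rough illustration of mapping XML elements to properties with XPath, here is a minimal sketch using only the JDK's built-in XPath 1.0 engine. It is not Alfresco's extracter or its configuration format. The class name, the property names (my:title, my:summary) and the sample document are all hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression per target property.
final class XPathPropertyMappingSketch {

    // mappings: property name -> XPath 1.0 expression (both illustrative).
    static Map<String, String> extract(String xml, Map<String, String> mappings)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> values = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : mappings.entrySet()) {
            // A fresh InputSource per expression, because the reader is consumed on each parse.
            InputSource source = new InputSource(new StringReader(xml));
            values.put(mapping.getKey(), xpath.evaluate(mapping.getValue(), source));
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Annual report</title><summary>Sales rose</summary></doc>";
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("my:title", "/doc/title/text()");     // hypothetical property
        mappings.put("my:summary", "/doc/summary/text()"); // hypothetical property
        System.out.println(extract(xml, mappings));
    }
}
```

Each extracted value is a plain string, which matches the suggestion that the target properties would be of type d:text.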

Andy

Re: Indexing of web content

mark_smithson — Fri, 31 Aug 2007 20:09:45 GMT

Thanks for the reply.

We are using an XPath expression of the form "concat(/node/text(), ' ',/node2/text())" and mapping the result to a text field.
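For anyone wanting to try this kind of expression outside Alfresco, here is a minimal sketch using the JDK's built-in XPath 1.0 engine. The class name, element names and sample document are made up for illustration; only the concat() pattern comes from the expression above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath 1.0 expression to a single string.
final class XPathConcatSketch {

    static String extract(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, InputSource) parses the document and returns the string result.
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Annual report</title><summary>Sales rose</summary></doc>";

        // concat() converts each argument to a string and joins them in order.
        System.out.println(extract(xml, "concat(/doc/title/text(), ' ', /doc/summary/text())"));

        // A union converted to a string keeps only the first node in document order,
        // which is why concat() is the safer choice for a single text field in XPath 1.0.
        System.out.println(extract(xml, "/doc/title | /doc/summary"));
    }
}
```

In XPath 1.0, converting a node-set to a string uses only its first node in document order, so a bare union would not combine the text of several elements into one value.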

It is a shame that we don't have XPath 2.0 support, as some of its functions could be quite useful for extracting metadata from XML documents.