Indexing of web content

mark_smithson
Champ in-the-making
It seems that the indexing of content (as opposed to custom attributes) for XML documents is not very intelligent.

The tag names are included in the index, meaning that a search which includes one of the tag names will return all the documents of that type.

Is there a way of changing this behaviour - perhaps using a different tokeniser? Can anyone point me in the right direction?
4 REPLIES

andy
Champ on-the-rise
Hi

See http://wiki.alfresco.com/wiki/Metadata_Extraction#XML_Metadata_Extraction.

The idea is to pull out the data you want as metadata. There is no way to select tokenisation by mimetype, so you cannot have XML tokenised with a specific Lucene tokeniser.

Andy

mark_smithson
Champ in-the-making
Ah,

So if we had a number of elements whose content we wanted indexed, we could extract that using XPath unions and map it to the cm:content property.

Is that what you mean, or am I off track?

andy
Champ on-the-rise
Hi

Create your own aspect to hold the extracted metadata in properties. Use XPath expressions to map XML elements to these properties. You could use a single catch-all property or several; it depends on what you want to do. The properties are likely to be of type d:text.

You cannot extract metadata into properties of type d:content.
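
As a rough illustration of the mapping idea (not Alfresco's extractor API), the sketch below uses only the JDK's javax.xml.xpath classes to evaluate one XPath expression per property. The property names (my:title, my:summary), the sample document and the class name are made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathPropertyMapping {

    // Evaluates each XPath expression against the document and stores the
    // resulting string under the (hypothetical) property name it is mapped to.
    static Map<String, String> extract(String xml, Map<String, String> mapping) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            // With a String result, XPath 1.0 returns the string value of the first matching node.
            properties.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc));
        }
        return properties;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><title>Indexing notes</title>"
                + "<summary>How content is indexed</summary></article>";

        // Hypothetical property names; in a real model these would be d:text properties on your aspect.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("my:title", "/article/title/text()");
        mapping.put("my:summary", "/article/summary/text()");

        extract(xml, mapping).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Only the element text ends up in the extracted values, so the tag names themselves never reach the indexed properties.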

Andy

mark_smithson
Champ in-the-making
Thanks for the reply.

We are using an XPath expression of the form "concat(/node/text(), ' ',/node2/text())" and mapping the result to a text field.
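
For anyone wanting to try this kind of expression outside the repository, here is a minimal sketch using the JDK's built-in XPath 1.0 engine. The element names (article, title, summary) and the class name are assumptions for the example, not taken from our actual documents.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathConcatExample {

    // Evaluates an XPath 1.0 expression against an XML string and returns the string result.
    static String evaluate(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article><title>Indexing notes</title>"
                + "<summary>How content is indexed</summary></article>";

        // concat() joins the text of both elements with a single space between them.
        String expression = "concat(/article/title/text(), ' ', /article/summary/text())";

        System.out.println(evaluate(xml, expression));
        // prints: Indexing notes How content is indexed
    }
}
```

One limitation to be aware of: in XPath 1.0, each argument to concat() is converted to a string using only the first matching node, so repeated elements contribute just their first occurrence.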

It is a shame that we don't have XPath 2.0 support, as some of its functions could be quite useful for extracting metadata from XML documents.