topic Re: Weakness of full-text indexing design? in Alfresco Archive

Weakness of full-text indexing design?

kgeis — Thu, 01 Apr 2010 05:39:23 GMT

I am wondering about the design of the full-text indexing in Alfresco. I'm reading the wiki and the code for ADMLuceneIndexerImpl, and it's clear that content properties are converted to text and then that text is indexed by Lucene. Now what about the case where I have structured data inside my cont

Re: Weakness of full-text indexing design?

invictus9 — Thu, 01 Apr 2010 14:38:04 GMT

This is an Information Architecture question.

My approach, at an IA level, would be to identify the metadata that you are interested in, such as the name fields and the occupation/job title fields, and extract it from the document into an aspect attached to the document. Lucene would then allow you to search for @custom\:surname:Geis and find it within the context you want.
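
As a rough illustration of the query form mentioned above, here is a minimal Python sketch that builds such a property query string. The helper name property_query is hypothetical and not part of Alfresco; the only things taken from the post are the custom prefix, the surname property, and the escaped colon. Quoting multi-word values as a phrase is an assumption based on standard Lucene query syntax.

```python
def property_query(prefix: str, local_name: str, value: str) -> str:
    """Build a query of the form @prefix\\:localName:value (hypothetical helper)."""
    # The colon between prefix and local name is escaped with a backslash,
    # as in the @custom\:surname:Geis example from the post.
    field = f"@{prefix}\\:{local_name}"
    # Assumption: multi-word values are wrapped in double quotes so that
    # they are treated as a single phrase.
    if any(ch.isspace() for ch in value):
        escaped = value.replace('"', '\\"')
        value = f'"{escaped}"'
    return f"{field}:{value}"


print(property_query("custom", "surname", "Geis"))  # prints @custom\:surname:Geis
```

This only helps when the property is known in advance, which is the limitation raised in the next reply.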

Re: Weakness of full-text indexing design?

kgeis — Thu, 01 Apr 2010 16:10:30 GMT

That's the obvious answer, so I guess I didn't state my problem well enough. I don't know the structure at design-time. I want all of the fields to be searchable but I do not want "Geis Programmer" to return any hits.

Re: Weakness of full-text indexing design?

invictus9 — Mon, 05 Apr 2010 14:46:29 GMT

One of the wonderful things about Alfresco is that Aspects are run-time property packets. As you build up your understanding of the data that arrives, you can create new aspects, populating them from the underlying data using the metadata extraction capabilities of Alfresco.

That being said, I can see your point. However, you are swimming upstream, asking for a context-free search engine to destroy some of its context.

Re: Weakness of full-text indexing design?

kgeis — Tue, 06 Apr 2010 04:19:27 GMT

Again, the weakness of my examples and problem description has been highlighted.

Assume deeply structured content. Hierarchical data within the content does not map to Alfresco metadata on a content object.

Also assume that you will have thousands of documents with hundreds of schemas. At that point, extracting specific metadata to help search is probably not helpful.

The point is, the content is structured, and its content is data not metadata. Can it be indexed in a somewhat structured way so that data from two fields do not bleed into each other as I mentioned?

Re: Weakness of full-text indexing design?

andy — Tue, 06 Apr 2010 13:41:01 GMT

Hi

Out of the box, the answer is: "Metadata extraction is for this use case, as has already been discussed. If your use case is more complex than this, you probably need an XML database. Alfresco is not an XML database."

If your use case is "fairly" simple you may be able to get away with a custom transformer for XML and some custom tokenisation.

Your transformer needs to keep the structural location of each token, and the tokeniser has to understand this and not split the token up.
The transformer is then generic, and so are the tokens it generates.

At search time you can then look for these special tokens - you will have to generate suitable extended tokens for each field.

If you then want to ask structural queries, you have to do some pattern matching on the structural part of the token.
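
The idea described above can be sketched outside Alfresco in a few lines of Python. This is not Alfresco or Lucene code: the function names, the "=" separator, and the sample record are all assumptions made for illustration. Each word is emitted as one token that carries the path of the element it came from, and a structural query is a regular expression applied to the path part of the token.

```python
import re
import xml.etree.ElementTree as ET

SEP = "="  # hypothetical separator between structural path and word


def structural_tokens(xml_text):
    """Return one token per word, prefixed with the path of its element."""
    tokens = []

    def add_words(path, text):
        for word in re.findall(r"\w+", text or ""):
            tokens.append(f"{path}{SEP}{word.lower()}")

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        add_words(here, elem.text)
        for child in elem:
            walk(child, here)
            # Text following a child element belongs to the parent element.
            add_words(here, child.tail)

    walk(ET.fromstring(xml_text), "")
    return tokens


def matches(tokens, path_pattern, word):
    """True if a token has this word and a path fully matching the regex."""
    rx = re.compile(path_pattern)
    for token in tokens:
        path, _, token_word = token.rpartition(SEP)
        if token_word == word.lower() and rx.fullmatch(path):
            return True
    return False


# Hypothetical record: two fields that would be adjacent in flattened text.
doc = "<record><surname>Geis</surname><occupation>Programmer</occupation></record>"
tokens = structural_tokens(doc)
print(tokens)
print(matches(tokens, r".*/surname", "geis"))        # word found in a surname field
print(matches(tokens, r".*/surname", "programmer"))  # not found in any surname field
```

Because the path travels with each word, "geis" and "programmer" never share a field in the index, so a search scoped to one field cannot match text from the other. The cost is that every search term has to be expanded into these extended tokens at query time.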

A pluggable index model is on the to-do list. Then you could add to indexing as you see fit, effectively writing your own structural extractor for XML.

There is nothing special in the index code that means you can use \u0000 for anything.

Andy