
Weakness of full-text indexing design?

kgeis
Champ on-the-rise
I am wondering about the design of the full-text indexing in Alfresco.  I'm reading the wiki and the code for ADMLuceneIndexerImpl, and it's clear that content properties are converted to text and then that text is indexed by Lucene.

Now what about the case where I have structured data inside my content and I want it to maintain some of that structure when indexed.  For example, let's say I have an XML document with information captured from a feedback form.

<feedback>
  <first-name>Ken</first-name>
  <last-name>Geis</last-name>
  <title>Programmer</title>
</feedback>
The most obvious transform from this XML to text yields

Ken Geis Programmer
However, I want to prevent Lucene from thinking that this is a single text field.  I don't want to be able to search for the phrase "Geis Programmer" and retrieve this document.  I see some black magic done in ADMLuceneIndexerImpl that makes me wonder, is it as easy as transforming the input to

Ken\u0000Geis\u0000Programmer
If this isn't possible, I might have a problem.
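To make the problem concrete, here is a minimal sketch in plain Python (not Alfresco or Lucene code) of the naive flattening and of how a phrase query then matches across field boundaries. The helper names `flatten` and `phrase_matches` are made up for illustration, and the phrase matcher is only a toy stand-in for a real phrase query.

```python
import xml.etree.ElementTree as ET

FEEDBACK_XML = """<feedback>
  <first-name>Ken</first-name>
  <last-name>Geis</last-name>
  <title>Programmer</title>
</feedback>"""


def flatten(xml_text):
    # Naive transform: join every non-blank text node with a single space.
    root = ET.fromstring(xml_text)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def phrase_matches(text, phrase):
    # Toy phrase search: the phrase's words must appear as consecutive tokens.
    tokens = text.lower().split()
    words = phrase.lower().split()
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))


flat = flatten(FEEDBACK_XML)
print(flat)
# The last name and the title came from different fields, yet the phrase matches.
print(phrase_matches(flat, "Geis Programmer"))
```

Once the text is flattened, nothing in the token stream records where one field ends and the next begins, which is why the phrase bleeds across fields.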

invictus9
Champ in-the-making
This is an Information Architecture question.

My approach, at an IA level, would be to establish metadata that you are interested in, such as the name fields, the occupation/job title fields, and export them from the document to an aspect attached to the document. Lucene would then allow you to search for @custom\:surname:Geis and find it within the context you want.
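A rough sketch of that idea in plain Python, assuming the field names are known up front. The `custom:` property names and the `FIELD_TO_PROPERTY` mapping are hypothetical, and this is not Alfresco's metadata extraction API. It only shows the shape of the mapping and of the resulting property query.

```python
import xml.etree.ElementTree as ET

FEEDBACK_XML = """<feedback>
  <first-name>Ken</first-name>
  <last-name>Geis</last-name>
  <title>Programmer</title>
</feedback>"""

# Hypothetical mapping from XML element names to properties of a custom aspect.
FIELD_TO_PROPERTY = {
    "first-name": "custom:firstName",
    "last-name": "custom:surname",
    "title": "custom:jobTitle",
}


def extract_properties(xml_text):
    # Copy each mapped element's text into a property map for the aspect.
    root = ET.fromstring(xml_text)
    props = {}
    for child in root:
        prop = FIELD_TO_PROPERTY.get(child.tag)
        if prop and child.text:
            props[prop] = child.text.strip()
    return props


def lucene_property_query(prop, value):
    # The colon in the prefixed property name is escaped with a backslash,
    # matching the query form quoted above.
    return "@" + prop.replace(":", "\\:") + ":" + value


props = extract_properties(FEEDBACK_XML)
query = lucene_property_query("custom:surname", props["custom:surname"])
print(query)
```

Because each value lives in its own property, a search on the surname never sees the job title.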

kgeis
Champ on-the-rise
That's the obvious answer, so I guess I didn't state my problem well enough.  I don't know the structure at design-time.  I want all of the fields to be searchable but I do not want "Geis Programmer" to return any hits.

invictus9
Champ in-the-making
One of the wonderful things about Alfresco is that Aspects are run-time property packets. As you build up your understanding of the data that arrives, you can create new aspects, populating them from the underlying data using the metadata extraction capabilities of Alfresco.

That being said, I can see your point. However, you are swimming upstream, asking for a context-free search engine to destroy some of its context.

kgeis
Champ on-the-rise
Again, the weakness of my examples and problem description has been highlighted.

Assume deeply structured content.  Hierarchical data within the content does not map to Alfresco metadata on a content object.

Also assume that you will have thousands of documents with hundreds of schemas.  At that point, extracting specific metadata to help search is probably not helpful.

The point is, the content is structured, and its content is data not metadata.  Can it be indexed in a somewhat structured way so that data from two fields do not bleed into each other as I mentioned?

andy
Champ on-the-rise
Hi

Out of the box, the answer is: "Metadata extraction is for this use case, as has already been discussed. If your use case is more complex than this, you probably need an XML database. Alfresco is not an XML database."

If your use case is "fairly" simple you may be able to get away with a custom transformer for XML and some custom tokenisation.

Your transformer needs to keep the structural location of the token and the tokeniser has to understand this and not split it up.
The transformer is then generic and so are the tokens it generates.

At search time you can then look for these special tokens - you will have to generate suitable extended tokens for each field.

If you then want to ask structural queries, you have to do some pattern matching on the structural part of the token.
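As a minimal sketch of the transformer and extended-token idea, in plain Python rather than Alfresco or Lucene code: the token layout `path|word` and the names `structural_tokens`, `field_token`, and `find_by_structure` are assumptions made for illustration only. Text that follows a child element (tail text) is ignored to keep the example short.

```python
import re
import xml.etree.ElementTree as ET

FEEDBACK_XML = """<feedback>
  <first-name>Ken</first-name>
  <last-name>Geis</last-name>
  <title>Programmer</title>
</feedback>"""


def structural_tokens(xml_text):
    # Generic transformer: emit one token per word, prefixed with the path of
    # the element it came from. A tokeniser must keep each token whole.
    tokens = []

    def walk(elem, path):
        here = path + "/" + elem.tag
        for word in (elem.text or "").split():
            tokens.append(here + "|" + word.lower())
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return tokens


def field_token(path, word):
    # Search time: build the extended token for a known field.
    return path + "|" + word.lower()


def find_by_structure(tokens, path_pattern, word):
    # Structural query: regex match on the path part, exact match on the word.
    rx = re.compile(path_pattern)
    hits = []
    for tok in tokens:
        path, _, value = tok.partition("|")
        if value == word.lower() and rx.fullmatch(path):
            hits.append(tok)
    return hits


tokens = structural_tokens(FEEDBACK_XML)
print(tokens)
print(field_token("/feedback/last-name", "Geis") in tokens)
print(find_by_structure(tokens, r".*/last-name", "Geis"))
```

The bare words "geis" and "programmer" never appear in the token stream, so a plain phrase search for "Geis Programmer" has nothing to match. The cost is that every query has to be rewritten into extended tokens, or into a pattern over them.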

A pluggable index model is on the todo list. Then you could add to indexing as you see fit - effectively write your own structural extractor for XML.

Nothing in the index code gives \u0000 any special meaning, so you cannot use it for anything.

Andy