Lucene and stop words in Alfresco

hbf
Champ on-the-rise
Dear list,

Suppose I have a text document in Alfresco containing the phrase "time is money". I want users to be able to enter "money is time" and find the document. That is, I want to find all documents that contain all the words the user enters, in any order.

Reading Alfresco's Search documentation I could not find a way to formulate a query for this.

Maybe I am missing something?! If so, apologies!

Here comes what I have found out:

The query
TEXT:"money is time"
internally drops the stop word "is" and therefore searches for "money" followed by one or more stop words followed by "time". The document contains the words in the opposite order, so the query will NOT match.

The query
TEXT:"money" AND TEXT:"is" AND TEXT:"time"
searches for all documents containing the three words "money", "is", and "time". As "is" is a stop word, it does not occur in the index and therefore the query returns NO result.

The query
TEXT:"money" AND TEXT:"time"
searches for all documents containing the two words "money" and "time". It finds the document …

… however, I cannot easily generate this query as it requires me to drop all words that Alfresco's analyzer considers stop words.

Is there another way to perform a query for all documents containing a given set of words (possibly including stop words)?

If not, I see two ways out:

* Alfresco exposes the list of stop words (not nice).
* Alfresco's query parser recognizes stop words and handles them accordingly. (It would drop the clause 'AND TEXT:"is"' from the query 'TEXT:"money" AND TEXT:"is" AND TEXT:"time"' for example.)

Many thanks,
Kaspar

P.S. This question is Lucene related. However, I post it here and not to the Lucene mailing list as it depends on Alfresco's particular Lucene adaption. Not knowing the details, I might be wrong, of course.
4 REPLIES

hbf
Champ on-the-rise
(For those who need a temporary fix (like me): the stop-words Alfresco is using seem to be in file AlfrescoStandardAnalyser.java. My code reads them and drops all stop-words from the query. Not nice, but works.)
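A minimal sketch of this workaround, for illustration only: it splits the user's input on whitespace, skips every word found in a stop word set, and joins the rest into TEXT clauses with AND. The class name StopWordQueryBuilder and the short STOP_WORDS set are hypothetical; the real list would have to be taken from AlfrescoStandardAnalyser.java and may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordQueryBuilder {

    // ASSUMPTION: a small illustrative list, NOT Alfresco's actual stop words.
    // Replace it with the list read from AlfrescoStandardAnalyser.java.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Builds 'TEXT:"w1" AND TEXT:"w2" ...' from the non-stop words of the input.
    // Returns an empty string if no searchable word is left.
    public static String build(String userInput) {
        List<String> clauses = new ArrayList<>();
        for (String word : userInput.trim().split("\\s+")) {
            // Compare in lower case so that "Is" and "is" are treated alike.
            if (word.isEmpty() || STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            // Escape backslashes and double quotes so the word stays inside the quotes.
            String escaped = word.replace("\\", "\\\\").replace("\"", "\\\"");
            clauses.add("TEXT:\"" + escaped + "\"");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // prints: TEXT:"money" AND TEXT:"time"
        System.out.println(build("money is time"));
    }
}
```

The drawback discussed in this thread remains: the set has to be kept in sync with whatever the analyser on the server actually uses.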

andy
Champ on-the-rise
Hi

Alfresco supports the standard lucene query syntax so you can write proximity queries (but not span queries). See http://lucene.apache.org/java/2_1_0/queryparsersyntax.html#Proximity%20Searches
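As a rough illustration of the proximity syntax from the linked page (a quoted phrase followed by ~ and a number), the sketch below only assembles such a query string for the TEXT field. The class name ProximityQuery and the slop value 10 are arbitrary choices for the example. Whether a particular slop is large enough to match the words in a different order depends on how the analyser treats the positions of dropped stop words, which is not verified here.

```java
public class ProximityQuery {

    // Builds a Lucene proximity query of the form TEXT:"some words"~N.
    public static String build(String phrase, int slop) {
        if (slop < 0) {
            throw new IllegalArgumentException("slop must not be negative");
        }
        // Escape backslashes and double quotes so the phrase stays inside the quotes.
        String escaped = phrase.replace("\\", "\\\\").replace("\"", "\\\"");
        return "TEXT:\"" + escaped + "\"~" + slop;
    }

    public static void main(String[] args) {
        // prints: TEXT:"money is time"~10
        System.out.println(build("money is time", 10));
    }
}
```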

Andy

hbf
Champ on-the-rise
Hm, thanks for the hint, Andy. It seems to me, however, that this does not solve the problem entirely.

If you want to create a custom "Advanced Search" page for Alfresco where users can enter arbitrary words (see my examples in the previous post), you are forced to know the stop word list (from AlfrescoStandardAnalyser). That creates a dependency which is not nice at all.

I use Alfresco as the backend engine for a CMS, so I find myself in exactly this situation. Assuming I am not the only one who does (or will do) something like this, do you want me to open a JIRA for this?

Regards,
Kaspar

andy
Champ on-the-rise
Hi

You could use a customized version of the analyser and set an empty stop word list. This way there will be no stop words. It requires only a simple wrapper class and a change to the tokenisation config.

Andy