Lucene search with accented characters

pchoe
Champ in-the-making
I have two folders whose names are identical except for an accent mark:

Hernández-Monrreal, Juan Gabriel
Hernandez-Monrreal, Juan Gabriel

I am trying to search for folders using the following query:
+@cm\:name:"cm:Hernández-Monrreal_x002c__x0020_Juan_x0020_Gabriel" +TYPE:"cm:folder" +PATH:"/app:company_home/st:sites/cm:CLCHURCHLIFE/cm:documentLibrary/cm:G_x0020_-_x0020_H/cm:H/*"

I would expect to get back only the folder that matches exactly, including the accent mark, but I get back both folders, with and without the accent. It seems Lucene ignores the accent mark and matches on the base letters alone.

Is there a way to make Lucene take accent marks into account when searching?
3 REPLIES

andy
Champ on-the-rise
Hi

The default analysis is to strip accents.
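
To illustrate why both folders come back, here is a minimal plain-Java sketch. It is not Alfresco or Lucene code: `AccentFoldingDemo` and `fold` are hypothetical names, and `fold` only approximates what an accent-stripping, lower-casing analysis step does. Once both names are folded they are identical, so an index built that way cannot tell them apart.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentFoldingDemo {

    // Hypothetical helper (an assumption for illustration): decompose each
    // character (NFD), drop the combining marks, then lower-case the result.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u00e1 is the accented "a", escaped to avoid source-encoding issues.
        String accented = "Hern\u00e1ndez-Monrreal, Juan Gabriel";
        String plain = "Hernandez-Monrreal, Juan Gabriel";

        // With accent folding, both names reduce to the same text.
        System.out.println(fold(accented).equals(fold(plain)));   // prints "true"

        // With lower-casing only, the two names stay distinct.
        System.out.println(accented.toLowerCase(Locale.ROOT)
                .equals(plain.toLowerCase(Locale.ROOT)));         // prints "false"
    }
}
```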

The analysers are configurable.
You would need to change the analyser for d:text and then reindex.

Change the setting in alfresco/model/dataTypeAnalyzers.properties to:
d_dictionary.datatype.d_text.analyzer=your.analyzer.class

or copy alfresco/model/dataTypeAnalyzers.properties and related files to a new location and change the definition of the bean that loads this file, "dictionaryBootstrap", currently defined in core-services-context.xml.

Andy

pchoe
Champ in-the-making
I wrote the following custom analyzer:

package com.microstrat.alfresco.lucene;

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Custom Lucene analyzer that does not apply ISOLatin1AccentFilter,
 * so accented characters are kept in the tokens it produces.
 *
 * @author pchoe
 */
public class MSICustomStrictAnalyzer extends Analyzer {
    /** Default stop words, taken from Lucene's StopAnalyzer. */
    public static final String[] STOP_WORDS = StopAnalyzer.ENGLISH_STOP_WORDS;

    private final Set stopSet;

    public MSICustomStrictAnalyzer()
    {
        this(STOP_WORDS);
    }

    public MSICustomStrictAnalyzer(String stopWords[])
    {
        stopSet = StopFilter.makeStopSet(stopWords);
    }

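    /**
     * Builds the token chain: standard tokenization, lower-casing and
     * stop-word removal, with no accent-stripping step.
     *
     * @see org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader)
     */
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        // Accent stripping is intentionally left out so that accented and
        // unaccented names stay distinct:
        //result = new ISOLatin1AccentFilter(result);
        return result;
    }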

}

and modified dataTypeAnalyzers.properties to:
# Data Type Index Analyzers

d_dictionary.datatype.d_any.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
#d_dictionary.datatype.d_text.analyzer=org.alfresco.repo.search.impl.lucene.analysis.AlfrescoStandardAnalyser
#d_dictionary.datatype.d_content.analyzer=org.alfresco.repo.search.impl.lucene.analysis.AlfrescoStandardAnalyser
d_dictionary.datatype.d_int.analyzer=org.alfresco.repo.search.impl.lucene.analysis.IntegerAnalyser
d_dictionary.datatype.d_long.analyzer=org.alfresco.repo.search.impl.lucene.analysis.LongAnalyser
d_dictionary.datatype.d_float.analyzer=org.alfresco.repo.search.impl.lucene.analysis.FloatAnalyser
d_dictionary.datatype.d_double.analyzer=org.alfresco.repo.search.impl.lucene.analysis.DoubleAnalyser
d_dictionary.datatype.d_date.analyzer=org.alfresco.repo.search.impl.lucene.analysis.DateAnalyser
d_dictionary.datatype.d_datetime.analyzer=org.alfresco.repo.search.impl.lucene.analysis.DateTimeAnalyser
d_dictionary.datatype.d_boolean.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_qname.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_guid.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_category.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_noderef.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_path.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_locale.analyzer=org.alfresco.repo.search.impl.lucene.analysis.LowerCaseVerbatimAnalyser
d_dictionary.datatype.d_text.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer
d_dictionary.datatype.d_content.analyzer=com.microstrat.alfresco.lucene.MSICustomStrictAnalyzer

so that the custom analyzer would be used. When I run Alfresco in debug mode from Eclipse, I can see execution go into the custom analyzer. But when I run a Lucene search, even after reindexing, I still get the same results as with the default analyzer.

Am I missing a configuration somewhere?

andy
Champ on-the-rise
Hi

Where have you made the changes?
It sounds like your changes are lost after deployment (probably overwritten in Tomcat's expanded copy of the webapp?).

It is best to add an extension to wire up the changes to avoid this.

Andy