First, I should say that I have read a lot about the indexing process, the searching process, and the effect of languages on both. I read the following topics, and a few others:
My problem is that my users come from different countries/languages and mix CIFS and webclient usage (for file uploading or file searching), and the search results they get are poor. What I mean is: they like the way CIFS/Windows search works, but they have a lot of trouble getting the right results from the webclient or a search portlet (via web service), because of all the stemming/analysis that is carried out during the indexing process.
So, I'd like to configure a simple indexing analysis that would just strip accents (for French and Spanish words) but keep the words unstemmed. If the users search for "procedure", they want to find files containing "procédure" or even "ProCéDUre", whatever the locale of their webclient, the locale of the document, or the way the file was uploaded.
I was wondering if it was as simple as:
- declaring the same LuceneCustomAnalyzer in each DataTypeAnalyzers_locale.properties file
- creating this LuceneCustomAnalyzer from the French one, removing the call to FrenchStemmer, and customizing it to strip accents.
Am I on the right track?
Is there anything I forgot (like the fact that, with this setup, a search for "procedureS" (plural) will not return files containing "procedure" (singular))?
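To make the behaviour I'm after concrete, here is a minimal sketch in plain Java. It is not Lucene or Alfresco code: the class AccentFoldDemo and its methods fold and matches are hypothetical names used only for illustration, and the JDK's java.text.Normalizer stands in for whatever accent filter the real analyzer would use. It lowercases a term and strips its accents, without any stemming.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only: not an Alfresco or Lucene class.
class AccentFoldDemo {

    // Lowercase the term and strip accents; no stemming is applied.
    static String fold(String term) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Drop the combining marks, keeping only the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    // A query term matches an indexed term when both fold to the same string.
    static boolean matches(String query, String indexed) {
        return fold(query).equals(fold(indexed));
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; escapes avoid source-encoding problems.
        System.out.println(matches("procedure", "proc\u00e9dure"));   // true
        System.out.println(matches("procedure", "ProC\u00e9DUre"));   // true
        // Without a stemmer, plural and singular stay distinct terms.
        System.out.println(matches("procedures", "procedure"));       // false
    }
}
```

The last call shows the trade-off mentioned above: since nothing is stemmed, "procedures" and "procedure" are different terms unless the user searches with a wildcard.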
Hmm, I'm wondering whether to use FrenchAnalyzer (without FrenchStemmer) and ISOLatin1AccentFilter, or AlfrescoStandardAnalyzer.
One last question: when I'm done with the config changes, will a full reindex (index.recovery.mode=FULL) rebuild the indexes taking the analyzer changes into account?