
Lucene/SOLR Stemming Analyser

stevegreenbaum
Champ in-the-making
I am using the Porter stemming analyser for d_content, but noticed that stop words are not removed from the index for new documents I add.  I added the stop-word filter to schema.xml underneath the lowercase filter, but the stop words still exist in the index.  Is this the correct approach?  Is there another approach for combining Porter and stop words via configuration, or does the Porter class need to be modified?

  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
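For context, a conventional Solr analyzer chain combining stop-word removal with Porter stemming would look something like the following. This is a hedged illustration, not Alfresco's stock schema.xml: the field type name is made up, and the point is the filter ordering, with the stop filter running before the stemmer so stop words are dropped before they can be stemmed.

```xml
<!-- Illustrative field type, not from Alfresco's shipped schema.xml.
     Order matters: tokenize, lowercase, drop stop words, then stem. -->
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```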


Also, regarding the whitespace tokenizer set by default in schema.xml for alfrescoDataType: when is the whitespace tokenizer executed relative to the analysers associated with the Alfresco property data types (e.g., content, text), which are specified in the locale-specific property files?
5 REPLIES

kaynezhang
World-Class Innovator
Adding a filter to the alfrescoDataType field type in schema.xml will have no effect on the d:content data type; Alfresco 4.x does not use the analyser configured in schema.xml for d:content.
It uses the data type index analysers configured in
${SOLR_CONFIG_ROOT}/workspace-SpacesStore/alfrescoResources/alfresco/model/dataTypeAnalyzers_{your locale}.properties.
If there are no properties for your locale, it will use the default AlfrescoStandardAnalyser.

Are you saying that schema.xml is not used at all for setting analysers, or just that it doesn't have an effect on d:content?  For which property types does Alfresco use the whitespace analyser declared in schema.xml?

Is there a way to apply a filter to the analyser set in dataTypeAnalyzers_{your locale}.properties, so I can add the stop-word filter to the Porter/Snowball analyser?

kaynezhang
World-Class Innovator
1. In Alfresco 4.x all fields are dynamic fields, and all field types are alfrescoDataType (you can see this in the schema.xml file).
2. alfrescoDataType uses SolrLuceneAnalyser as its index analyser.
3. SolrLuceneAnalyser is a wrapper analyser; it analyses properties according to the property definition. For example:
   a) for some regular fields (FIELD_ID, FIELD_DBID) it uses a fixed analyser (LongAnalyser, VerbatimAnalyser);
   b) for d:content/d:text/d:mltext it uses MLAnalayser.
4. MLAnalayser loads an analyser according to the locale configured in
   ${SOLR_CONFIG_ROOT}/workspace-SpacesStore/alfrescoResources/alfresco/model/.
   For example, for the English locale it uses the analysers configured in the dataTypeAnalyzers_en.properties file:
   

   d_dictionary.datatype.d_text.analyzer=org.alfresco.repo.search.impl.lucene.analysis.AlfrescoStandardAnalyser
   d_dictionary.datatype.d_content.analyzer=org.alfresco.repo.search.impl.lucene.analysis.AlfrescoStandardAnalyser
   

That is, for d:text and d:content properties, org.alfresco.repo.search.impl.lucene.analysis.AlfrescoStandardAnalyser is used.


If you want to use the Porter/Snowball analyser and apply a filter to it, you can implement a new index analyser that extends SnowballAnalyzer, override the tokenStream method, and wrap the stream with your StopFilter. Then configure your custom analyser in
${SOLR_CONFIG_ROOT}/workspace-SpacesStore/alfrescoResources/alfresco/model/dataTypeAnalyzers_{your locale}.properties
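A minimal sketch of such an analyser, assuming the Lucene 2.x API bundled with Alfresco 4.x (check the class and constructor signatures against your installed Lucene version; the package and class names here are made up). Note that SnowballAnalyzer already accepts a stop-word list in its constructor, which inserts a StopFilter ahead of the stemming filter in its token chain, so the simplest variant may not need a tokenStream override at all:

```java
package com.example.search; // hypothetical package

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

/**
 * Porter/Snowball stemming plus English stop-word removal.
 * Alfresco instantiates analysers by class name, so a public
 * no-argument constructor is required.
 */
public class StoppedSnowballAnalyser extends SnowballAnalyzer {

    public StoppedSnowballAnalyser() {
        // Passing a stop-word list makes SnowballAnalyzer place a
        // StopFilter before the SnowballFilter, so stop words are
        // removed before stemming.
        super("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}
```

It would then be referenced from the locale properties file, for example:

   d_dictionary.datatype.d_content.analyzer=com.example.search.StoppedSnowballAnalyser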

stevegreenbaum
Champ in-the-making
Thank you!  I appreciate your thoughtful response.  Steve

I wanted to share a response I received via Alfresco support this morning.  They are indicating that even if the stop words are removed, a search using those stop words will still find the document, because a non-tokenized version is also stored (I assume even if you don't specify "Both" in the model for tokenization).  The AlfrescoStandardAnalyser is supposed to remove stop words, but I was finding that I could still search on them.

Here is their response: "In Alfresco's implementation of Solr, every document is indexed in two different ways (so it uses twice the space of plain Lucene): one using the locale and its defined analyzer (and stop words), and another raw indexation done for the "cross-language" (multilingual) case. In your case they are found by the cross-language search.  As stop words are not the same across languages, we leave all the words in; as a result … you cannot "not find" stop words when you specifically search for them."

kaynezhang
World-Class Innovator
I can't fully agree.
I guess that for the "d:content/d:text/d:mltext" types, cross-language search is implemented like this:
   For every definite locale, the locale analyser defined in dataTypeAnalyzers_**.properties is used.
   For the cross-language one, there should also be an analyser in use, either the server's default locale analyser or just AlfrescoStandardAnalyser.
So either way, you should be able to implement your requirement by customizing your locale's analyser and AlfrescoStandardAnalyser.

This is just my guess; I have not tested it, so correct me if I'm wrong.