SOLR configuration for search tokenization

venur
Star Contributor

Hello all,

I am looking for the right documentation or steps to handle a request we have.

We want to disable tokenization on special characters.

I tried searching this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.

1 ACCEPTED ANSWER

abhinavmishra14
World-Class Innovator

@venur I have been curious about this and spent some time on it over the last couple of weeks. I think I have a solution that may fit your case. It worked for me in the few tests that I did.

It is based on the links I shared above.

Here is what I did:

  • In your $SOLR_HOME\alfresco\conf directory (your setup may differ; e.g. "C:\alfresco-search-services\solrhome\alfresco\conf"), add the following configs to tweak the tokenization process:
    • Create a file named "Latin-break-only-on-whitespace.rbbi" in $SOLR_HOME\alfresco\conf
    • Add the following content:
!!forward;

$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];

# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;

# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace*   {200};

# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace*   {100};

# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+   {1};
  • Create a file named "characters.txt" in $SOLR_HOME\alfresco\conf
    • Add the following content:
_ => ALPHA 
- => ALPHA 
$ => ALPHA 
! => ALPHA 
  • Edit $SOLR_HOME\alfresco\conf\schema.xml
    • Find the "fieldType" with the analyzer settings below:
      • <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
    • Update the tokenizer setting "<tokenizer class="solr.ICUTokenizerFactory" ....>" as below:
      • <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
              <analyzer>
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
                <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
                <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
                <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
                        generateWordParts="1"
                        generateNumberParts="1"
                        catenateWords="1"
                        catenateNumbers="1"
                        catenateAll="1"
                        splitOnCaseChange="1"
                        splitOnNumerics="1"
                        preserveOriginal="1"
                        stemEnglishPossessive="1"
                        types="characters.txt"/>
                <filter class="solr.ICUFoldingFilterFactory"/>
              </analyzer>
            </fieldType>
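
As a rough illustration of what the .rbbi rules above are meant to do, here is a small Python sketch that emulates them: it breaks only on whitespace and tags each token with the rule status from the rule comments (200 if the token contains a letter, 100 if it contains a number but no letter, 1 otherwise). This is only an emulation, not how Solr executes the rules. `classify` and `tokenize` are hypothetical helper names, Python's `str.split()` only approximates `\p{Whitespace}`, and the sketch assumes that when several rules match the same text the numerically largest status is the one reported. It also ignores the WordDelimiterFilter that runs afterwards in the analyzer chain.

```python
import unicodedata
from typing import List, Tuple

# Rule status values taken from the comments in the .rbbi file
WORD_LETTER = 200  # mapped to <ALPHANUM>
WORD_NUM = 100     # mapped to <NUM>
WORD_OTHER = 1     # mapped to <OTHER>


def classify(token: str) -> int:
    """Return the rule status the sketch assigns to a whitespace-free token."""
    categories = [unicodedata.category(ch) for ch in token]
    # \p{Letter} corresponds to the Unicode general categories starting with "L"
    if any(cat.startswith("L") for cat in categories):
        return WORD_LETTER
    # \p{Number} corresponds to the general categories starting with "N"
    if any(cat.startswith("N") for cat in categories):
        return WORD_NUM
    return WORD_OTHER


def tokenize(text: str) -> List[Tuple[str, int]]:
    """Break only on whitespace (approximation) and tag every token."""
    return [(token, classify(token)) for token in text.split()]


if __name__ == "__main__":
    for token, status in tokenize("INV-2021_001 $100 !!! foo"):
        print(token, status)
```

With this sketch, "INV-2021_001" stays a single token with status 200, "$100" gets 100, and "!!!" gets 1, which is the behaviour the rule file is aiming for on Latin-script text.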

If you want to configure the same settings for the archive store, follow the same steps for "$SOLR_HOME\archive\conf".

Note: You will have to do a full re-index for these settings to take effect on tokenization.
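
One way to sanity-check the analyzer chain is Solr's field analysis handler (`/analysis/field`), which reports the tokens each stage produces for a sample value. The sketch below only builds the request URL. The base URL `http://localhost:8983/solr`, the core name `alfresco` and the helper name `analysis_url` are assumptions for illustration, and your Solr instance may additionally require mutual TLS or a shared secret, which this sketch does not handle.

```python
from urllib.parse import urlencode


def analysis_url(base_url: str, core: str, field_type: str, sample_text: str) -> str:
    """Build a URL for Solr's field analysis handler (hypothetical helper)."""
    params = {
        "analysis.fieldtype": field_type,    # fieldType name from schema.xml
        "analysis.fieldvalue": sample_text,  # text to run through the index analyzer
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # Host, port and core name are assumptions; adjust them to your setup.
    print(analysis_url("http://localhost:8983/solr", "alfresco", "text___", "INV-2021_001 $100"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) after Solr has picked up the new schema should show whether values such as "INV-2021_001" come out of the tokenizer as one token.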

Hope this helps.

~Abhinav
(ACSCE, AWS SAA, Azure Admin)

11 REPLIES

Thank you very much @abhinavmishra14 for the support, this works. We had not been able to implement it so far, so we had left it, but your solution works. We also did a full re-index, as you said.