SOLR configuration for search tokenization

venur
Star Contributor

Hello all,

I am looking for the right documentation or steps to handle a request we have.

We want to disable tokenization on special characters.

I tried searching this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.

1 ACCEPTED ANSWER

abhinavmishra14
World-Class Innovator

@venur I have been curious about this and spent some time on it over the last couple of weeks. I think I have a solution that may fit your case. It worked for me in the few tests I did.

It is based on the links I shared above.

Here is what I did:

  • In your $SOLR_HOME\alfresco\conf directory (for example "C:\alfresco-search-services\solrhome\alfresco\conf"; your setup may differ), add the following configuration to adjust the tokenization process:
    • Create a file named "Latin-break-only-on-whitespace.rbbi" in $SOLR_HOME\alfresco\conf
    • Add the following content:
!!forward;

$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];

# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;

# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace*   {200};

# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace*   {100};

# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+   {1};
  • Create a file named "characters.txt" in $SOLR_HOME\alfresco\conf
    • Add the following content:
_ => ALPHA 
- => ALPHA 
$ => ALPHA 
! => ALPHA 
  • Edit $SOLR_HOME\alfresco\conf\schema.xml
    • Find the "fieldType" that starts with the line below:
      • <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
    • Update the tokenizer setting "<tokenizer class="solr.ICUTokenizerFactory" ....>" and add the "types" attribute to the WordDelimiterFilterFactory, as below:
      • <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
              <analyzer>
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
                <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
                <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
                <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
                        generateWordParts="1"
                        generateNumberParts="1"
                        catenateWords="1"
                        catenateNumbers="1"
                        catenateAll="1"
                        splitOnCaseChange="1"
                        splitOnNumerics="1"
                        preserveOriginal="1"
                        stemEnglishPossessive="1"
                        types="characters.txt"/>
                <filter class="solr.ICUFoldingFilterFactory"/>
              </analyzer>
            </fieldType>
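
To make the effect of the .rbbi rules above easier to see, here is a rough Python approximation of what they do: break only on whitespace, then label each token by whether it contains a letter, a number, or neither. This is only a sketch and not the ICU implementation. The function names `rule_status` and `tokenize` are made up for illustration, Python's notion of whitespace differs slightly from Unicode `\p{Whitespace}`, and the later filters (WordDelimiterFilter, ICUFoldingFilter) are not modelled. Also keep in mind that the `Latn:` prefix in `rulefiles` scopes the custom rules to Latin-script text.

```python
import unicodedata
from typing import List, Tuple

# Rule status values used in Latin-break-only-on-whitespace.rbbi, with the
# token types that the comments in that file say they map to.
STATUS_TO_TYPE = {200: "<ALPHANUM>", 100: "<NUM>", 1: "<OTHER>"}


def rule_status(token: str) -> int:
    """Approximate the status the rules assign to one whitespace-free run."""
    categories = [unicodedata.category(ch) for ch in token]
    if any(cat.startswith("L") for cat in categories):  # \p{Letter}
        return 200
    if any(cat.startswith("N") for cat in categories):  # \p{Number}
        return 100
    return 1  # neither a letter nor a number


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split on whitespace only and label each token with its type."""
    # str.split() without arguments breaks on runs of whitespace and drops
    # them, similar to the $Whitespace rule that emits no token.
    return [(tok, STATUS_TO_TYPE[rule_status(tok)]) for tok in text.split()]


if __name__ == "__main__":
    for token, token_type in tokenize("invoice_2021-final! 42-7 $$ report.pdf"):
        print(token, token_type)
```

Special characters such as "_", "-", "!" and "." stay inside their tokens because the only break opportunity is whitespace.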

If you want to configure the same settings for the archive store, follow the same steps for "$SOLR_HOME\archive\conf".

Note: You will have to do a full re-index for these settings to take effect on tokenization.
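
The role of characters.txt in the steps above can be illustrated with a small Python sketch. It parses the `char => TYPE` lines and then splits a token on every character that is neither alphanumeric nor mapped to ALPHA. This is a deliberately simplified model and not Solr's WordDelimiterFilter: it ignores splitOnCaseChange, splitOnNumerics, the catenate options, preserveOriginal and stemEnglishPossessive. The names `parse_alpha_overrides` and `split_on_delimiters` are hypothetical helpers for this example only.

```python
from typing import List, Set

# Same content as the characters.txt file described in the steps above.
CHARACTERS_TXT = """\
_ => ALPHA
- => ALPHA
$ => ALPHA
! => ALPHA
"""


def parse_alpha_overrides(types_text: str) -> Set[str]:
    """Collect the characters that the types file maps to ALPHA."""
    overrides = set()
    for line in types_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        char, _, char_type = line.partition("=>")
        if char_type.strip() == "ALPHA":
            overrides.add(char.strip())
    return overrides


def split_on_delimiters(token: str, alpha_overrides: Set[str]) -> List[str]:
    """Split on characters that are neither alphanumeric nor overridden."""
    parts, current = [], []
    for ch in token:
        if ch.isalnum() or ch in alpha_overrides:
            current.append(ch)
        elif current:
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    return parts


if __name__ == "__main__":
    overrides = parse_alpha_overrides(CHARACTERS_TXT)
    print(split_on_delimiters("foo-bar_baz.qux", set()))      # no types file
    print(split_on_delimiters("foo-bar_baz.qux", overrides))  # with characters.txt
```

Without the overrides the token falls apart at "-", "_" and "."; with them only "." still acts as a delimiter, so "foo-bar_baz" survives as one part.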

Hope this helps.

~Abhinav
(ACSCE, AWS SAA, Azure Admin)

12 REPLIES

Thank you very much @abhinavmishra14 for your support, this works. We had not been able to implement this so far and had set it aside, but your solution works. We also did a full re-index, as you said.

Great writeup. I managed to replicate it with ease, but this tokenizer now behaves a lot like the whitespace tokenizer because of how the RBBI rules are set up.

How would one build a tokenizer that more closely resembles the classic or standard tokenizer but doesn't split at specific characters, such as hyphens?

I tried to adapt the RBBI file, but I keep ending up rewriting the entire set of tokenization rules, and then only my RBBI rules apply, which shrinks my token output to a very small subset of what the standard tokenizer would normally produce. Thank you for any help.