01-04-2022 11:57 PM
Hello all
I am looking for right documentation or steps to deal with a request we have.
We want to disable tokenization on special characters.
I tried searching this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.
03-04-2022 11:46 PM
@venur I have been curious about this and have spent some time on this issue over the last couple of weeks. I think I have a solution that may fit your case; it works in the few tests that I ran.
It is based on the links I shared above.
Here is what I did:
!!forward;
$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];
# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;
# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace* {200};
# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace* {100};
# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+ {1};
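To see roughly what these rules produce without spinning up Solr, here is a stdlib Python approximation (it does not use ICU; `str.isalpha`/`str.isdigit` only approximate `\p{Letter}`/`\p{Number}`): a token is a maximal run of non-whitespace, classified the same way DefaultICUTokenizerConfig maps the rule status values above.

```python
import re

# Rough stdlib approximation (NOT ICU) of the RBBI rules above:
# a token is a maximal run of non-whitespace, and its type is assigned
# the way DefaultICUTokenizerConfig maps the rule status values.
def classify(token):
    if any(ch.isalpha() for ch in token):
        return "<ALPHANUM>"  # rule status {200} = RBBI.WORD_LETTER
    if any(ch.isdigit() for ch in token):
        return "<NUM>"       # rule status {100} = RBBI.WORD_NUM
    return "<OTHER>"         # rule status {1}

def tokenize(text):
    return [(t, classify(t)) for t in re.split(r"\s+", text) if t]

print(tokenize("foo-bar 42 $!%"))
# [('foo-bar', '<ALPHANUM>'), ('42', '<NUM>'), ('$!%', '<OTHER>')]
```

Note that `foo-bar` stays a single token: with these rules, only whitespace breaks tokens, which is exactly the "disable tokenization on special characters" behavior being asked for.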
Contents of characters.txt (this maps the listed characters to the ALPHA type, so WordDelimiterFilterFactory treats them as letters rather than delimiters):

_ => ALPHA
- => ALPHA
$ => ALPHA
! => ALPHA
<fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
<!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
<filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
splitOnCaseChange="1"
splitOnNumerics="1"
preserveOriginal="1"
stemEnglishPossessive="1"
types="characters.txt"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>
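The role of `types="characters.txt"` in the chain above can be illustrated with a simplified stdlib Python sketch (this is not Lucene's actual WordDelimiterFilterFactory, and the delimiter set below is illustrative only): the filter splits tokens on characters it treats as delimiters, and remapping `_ - $ !` to ALPHA makes it treat them as letters, leaving such tokens intact.

```python
import re

# Simplified sketch (NOT Lucene's WordDelimiterFilterFactory) of the effect
# of remapping characters to ALPHA via a types file.
DEFAULT_DELIMS = set("-_$!@#%&*")    # illustrative subset, not the real table
REMAPPED_TO_ALPHA = set("_-$!")      # the mappings from characters.txt

def word_parts(token, use_types_file):
    # Characters remapped to ALPHA are no longer treated as delimiters.
    delims = DEFAULT_DELIMS - (REMAPPED_TO_ALPHA if use_types_file else set())
    if not delims:
        return [token]
    pattern = "[" + re.escape("".join(sorted(delims))) + "]+"
    return [p for p in re.split(pattern, token) if p] or [token]

print(word_parts("wi-fi", use_types_file=False))  # ['wi', 'fi']
print(word_parts("wi-fi", use_types_file=True))   # ['wi-fi']
```

Characters not listed in characters.txt (here `@`) still split the token, so the filter keeps its normal behavior everywhere else.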
If you want to configure the same settings for the archive store, then follow the same steps for "$SOLR_HOME\archive\conf".
Note: You will have to do a full re-index for these settings to take effect on tokenization.
Hope this helps.
03-07-2022 11:22 PM
Thank you very much @abhinavmishra14 for the support; this works. We had not been able to implement it ourselves, so we had set it aside, but your solution works. We also did a full re-index, as you said.
02-02-2026 03:29 AM
Great writeup. I managed to replicate it with ease, but this tokenizer now behaves much like a whitespace tokenizer, due to how the RBBI rules are set up.
How would one write rules that behave more like the classic or standard tokenizer, but without splitting on specific characters, such as hyphens?
I tried to adapt the RBBI file, but I keep ending up rewriting the entire set of tokenization rules, and then only my RBBI rules apply, which shrinks my token output to a small subset of what the standard tokenizer would normally produce. Thank you for any help.