01-04-2022 11:57 PM
Hello all,
I am looking for the right documentation or steps to handle a request we have.
We want to disable tokenization on special characters.
I tried searching this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.
03-04-2022 11:46 PM
@venur I have been curious about this and spent some time on it over the last couple of weeks. I think I have a solution that may fit your case. It worked for me in the few tests that I did.
It is based on the links I shared above.
Here is what I did. First, the break rules (the field type below references them as Latin-break-only-on-whitespace.rbbi):
!!forward;
$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];
# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;
# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace* {200};
# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace* {100};
# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+ {1};
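To make the effect of these rules easier to see, here is a rough Python sketch of what they amount to: break only on whitespace, then tag each token by what it contains. It is an approximation and not the ICU implementation. The names `rule_status` and `tokenize` are made up for illustration, and Python's `str.split()` is assumed to be close enough to `\p{Whitespace}` for ordinary text.

```python
import unicodedata

# Status values taken from the comments in the rules above.
WORD_LETTER = 200  # token contains at least one letter
WORD_NUM = 100     # token contains a number but no letter
WORD_OTHER = 1     # token contains neither


def rule_status(token):
    """Approximate the rule status the rules above assign to one token."""
    categories = [unicodedata.category(ch) for ch in token]
    if any(cat.startswith("L") for cat in categories):
        return WORD_LETTER
    if any(cat.startswith("N") for cat in categories):
        return WORD_NUM
    return WORD_OTHER


def tokenize(text):
    """Split on whitespace only, so special characters stay inside tokens."""
    return [(tok, rule_status(tok)) for tok in text.split()]


print(tokenize("foo-bar 12.5 !!! a1"))
# [('foo-bar', 200), ('12.5', 100), ('!!!', 1), ('a1', 200)]
```

Note that "foo-bar" and "12.5" come out as single tokens, which is the whole point of the rules: nothing except whitespace causes a break.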
Next, the character type mappings (the field type below references them as characters.txt):
_ => ALPHA
- => ALPHA
$ => ALPHA
! => ALPHA
<fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
    <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
    <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"
            stemEnglishPossessive="1"
            types="characters.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
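The tokenizer alone is not enough, because the word delimiter filter would normally split tokens again on special characters. That is why those characters are mapped to ALPHA in the types file. The sketch below is a heavily simplified Python model of that one behaviour, not the Lucene code: it ignores case changes, catenation, preserveOriginal and possessive stemming, and `char_type`, `split_word` and `TYPES` are names invented for the example.

```python
import itertools

# Hypothetical in-memory version of the four mappings in the types file.
TYPES = {"_": "ALPHA", "-": "ALPHA", "$": "ALPHA", "!": "ALPHA"}


def char_type(ch, overrides):
    """Classify a character, letting the overrides win over the defaults."""
    if ch in overrides:
        return overrides[ch]
    if ch.isalpha():
        return "ALPHA"
    if ch.isdigit():
        return "DIGIT"
    return "DELIM"  # by default everything else is a sub-word delimiter


def split_word(token, overrides=None):
    """Split a token into runs of same-type characters, dropping delimiters."""
    overrides = overrides or {}
    parts = []
    runs = itertools.groupby(token, key=lambda ch: char_type(ch, overrides))
    for kind, run in runs:
        if kind != "DELIM":
            parts.append("".join(run))
    return parts


print(split_word("foo_bar"))         # ['foo', 'bar']
print(split_word("foo_bar", TYPES))  # ['foo_bar']
print(split_word("wi-fi$5", TYPES))  # ['wi-fi$', '5']
```

With the overrides in place the mapped characters behave like letters, so the filter no longer treats them as split points. The split between letters and digits in the last example corresponds to splitOnNumerics="1".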
If you want to configure the same settings for the archive store, follow the same steps under "$SOLR_HOME\archive\conf".
Note: You will have to do a full re-index for these settings to take effect on tokenization.
Hope this helps.
03-07-2022 11:22 PM
Thank you very much @abhinavmishra14 for the support, this works. We had not been able to implement it so far, so we had left it, but your solution works. We also did a full re-index as you said.