01-04-2022 11:57 PM
Hello all,
I am looking for the right documentation or steps to handle a request we have.
We want to disable tokenization on special characters.
I tried searching this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide us.
03-04-2022 11:46 PM
@venur I have been curious about this and spent some time on it over the last couple of weeks. I think I have a solution that may fit your case. It worked for me in the few tests that I did.
It is based on the links I shared above.
Here is what I did:
Rule file Latin-break-only-on-whitespace.rbbi (the file named in the tokenizer's rulefiles attribute):

!!forward;

$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];

# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;

# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace* {200};

# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace* {100};

# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+ {1};
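To illustrate what these rules are meant to do, here is a rough Python sketch. It is not how ICU evaluates the rules; it only mimics the outcome described in the rule comments: break on whitespace only, then label each token by whether it contains a letter, a number, or neither. The function names are made up for this example, and Python's notion of whitespace may differ slightly from `\p{Whitespace}`.

```python
import unicodedata


def classify(token):
    """Approximate the rule-status mapping described in the rule file comments."""
    categories = [unicodedata.category(ch) for ch in token]
    if any(cat.startswith("L") for cat in categories):
        return "<ALPHANUM>"  # corresponds to rule status {200}
    if any(cat.startswith("N") for cat in categories):
        return "<NUM>"       # corresponds to rule status {100}
    return "<OTHER>"         # corresponds to rule status {1}


def tokenize(text):
    # str.split() with no arguments breaks on runs of whitespace only,
    # so special characters stay inside their tokens.
    return [(tok, classify(tok)) for tok in text.split()]


print(tokenize("invoice_2022-01 $100 !!! foo!bar"))
```

With this input, `invoice_2022-01` and `foo!bar` stay whole and are labelled `<ALPHANUM>`, `$100` is labelled `<NUM>`, and `!!!` is labelled `<OTHER>`.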
characters.txt (the file named in the types attribute of the word delimiter filter):

_ => ALPHA
- => ALPHA
$ => ALPHA
! => ALPHA
<fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
    <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
    <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"
            stemEnglishPossessive="1"
            types="characters.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
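As a rough illustration of why the `=> ALPHA` mappings matter, the Python sketch below imitates only one aspect of the word delimiter filter: splitting a token at characters that count as delimiters. It is a simplification under stated assumptions and not the real filter; it ignores case-change and numeric splitting, catenation, and preserveOriginal. The names `ALPHA_OVERRIDES` and `split_on_delimiters` are hypothetical.

```python
# Characters mapped to ALPHA in characters.txt (taken from the post).
ALPHA_OVERRIDES = frozenset({"_", "-", "$", "!"})


def split_on_delimiters(token, alpha_overrides=frozenset()):
    """Split a token wherever a character is neither alphanumeric
    nor overridden to behave like a letter."""
    parts, current = [], []
    for ch in token:
        if ch.isalnum() or ch in alpha_overrides:
            current.append(ch)
        elif current:
            # Delimiter reached: close the part collected so far.
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    return parts


print(split_on_delimiters("foo-bar_baz"))                   # no overrides
print(split_on_delimiters("foo-bar_baz", ALPHA_OVERRIDES))  # with overrides
```

Without the overrides the token is split at `-` and `_`; with them it stays in one piece, which is the behaviour the mappings are there to produce.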
If you want to configure the same settings for the archive store, follow the same steps in "$SOLR_HOME\archive\conf".
Note: You will have to do a full re-index for these settings to take effect on tokenization.
Hope this helps.
03-07-2022 11:22 PM
Thank you very much @abhinavmishra14 for the support, this works. We had not been able to implement it so far, so we had left it, but your solution works. We also did a full re-index as you said.