Hyland Connect

venur · ‎01-04-2022

Hello all

I am looking for the right documentation or steps to handle a request we have.

We want to disable tokenization on special characters.

I searched this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.

abhinavmishra14 · ‎03-04-2022

@venur I have been curious about this and spent some time on it over the last couple of weeks. I think I have a solution that may fit your case. It worked for me in the few tests that I did.

It is based on the links I shared above.

Here is what I did:

In your $SOLR_HOME\alfresco\conf directory (e.g. "C:\alfresco-search-services\solrhome\alfresco\conf"; your setup may differ), add the following configuration to tweak the tokenization process:
- Create a file named "Latin-break-only-on-whitespace.rbbi" in $SOLR_HOME\alfresco\conf
- Add the following content:

!!forward;

$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];

# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;

# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace*   {200};

# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace*   {100};

# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+   {1};
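To make the intent of these rules easier to see, here is a rough Python approximation. It does not use ICU at all: it only mimics what the rules describe, assuming tokens break on whitespace and each token gets the status of the highest-priority rule it satisfies (letter, then number, then other). The names `tokenize` and `classify` are made up for this illustration, and Python's `str.split()` whitespace set is only close to, not identical with, the Unicode Whitespace property.

```python
import unicodedata

# Rule status values used in the .rbbi file above.
OTHER, WORD_NUM, WORD_LETTER = 1, 100, 200


def classify(token):
    """Return the rule status a token would get under the rules above."""
    categories = [unicodedata.category(ch) for ch in token]
    if any(cat.startswith("L") for cat in categories):
        return WORD_LETTER  # contains at least one letter
    if any(cat.startswith("N") for cat in categories):
        return WORD_NUM     # no letter, but at least one numeric char
    return OTHER            # neither letters nor numbers


def tokenize(text):
    """Break only on whitespace and attach a status to every token."""
    return [(token, classify(token)) for token in text.split()]


if __name__ == "__main__":
    for token, status in tokenize("file_name-v2 $100 !!! 42"):
        print(token, status)
```

The point to notice is that "file_name-v2" and "$100" stay whole, because special characters no longer act as break points at the tokenizer stage.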

- Create a file named "characters.txt" in $SOLR_HOME\alfresco\conf
- Add the following content:

_ => ALPHA 
- => ALPHA 
$ => ALPHA 
! => ALPHA
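As a rough illustration of what this mapping is for, the Python sketch below parses the same "char => TYPE" lines and splits a token only on characters that are neither alphanumeric nor mapped to ALPHA. It is a simplification, not the real WordDelimiterFilter: it ignores options such as splitOnCaseChange and splitOnNumerics, and the function names `parse_types` and `split_on_delimiters` are hypothetical.

```python
def parse_types(text):
    """Parse lines like '_ => ALPHA' into a {char: type} dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        char, _, kind = line.partition("=>")
        mapping[char.strip()] = kind.strip()
    return mapping


def split_on_delimiters(token, types):
    """Split a token on characters that are not alphanumeric or ALPHA-mapped."""
    parts, current = [], ""
    for ch in token:
        if ch.isalnum() or types.get(ch) == "ALPHA":
            current += ch
        else:
            if current:
                parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    types = parse_types("_ => ALPHA\n- => ALPHA\n$ => ALPHA\n! => ALPHA")
    print(split_on_delimiters("inv_2022-01", types))  # kept as one part
    print(split_on_delimiters("inv_2022-01", {}))     # split on _ and -
```

With the mapping in place, characters such as "_" and "-" are treated like letters instead of delimiters, which is why they stop breaking the token apart.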

Edit the $SOLR_HOME/alfresco/schema.xml

Find the "fieldType" with the analyzer settings below:
- <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">

Update the tokenizer settings "<tokenizer class="solr.ICUTokenizerFactory" ....>" as shown below:

<fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
        <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
        <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="1"
                splitOnCaseChange="1"
                splitOnNumerics="1"
                preserveOriginal="1"
                stemEnglishPossessive="1"
                types="characters.txt"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>
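After editing, a quick sanity check can confirm that the fieldType really points at the two new files. The sketch below is only an illustration using Python's standard xml.etree module on a trimmed copy of the snippet above (charFilters and most filter attributes left out); the helper name `analysis_settings` is hypothetical, and in practice you would feed it the fieldType element taken from your own schema.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fieldType above, for illustration only.
SNIPPET = """<fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory" preserveOriginal="1" types="characters.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>"""


def analysis_settings(field_type_xml):
    """Extract the tokenizer rule files and the word-delimiter types file."""
    root = ET.fromstring(field_type_xml)
    tokenizer = root.find("./analyzer/tokenizer")
    # rulefiles is a comma-separated list of script:file pairs.
    rulefiles = dict(
        pair.strip().split(":", 1)
        for pair in (tokenizer.get("rulefiles") or "").split(",")
        if pair.strip()
    )
    types = None
    for flt in root.findall("./analyzer/filter"):
        if flt.get("class", "").endswith("WordDelimiterFilterFactory"):
            types = flt.get("types")
    return {
        "tokenizer": tokenizer.get("class"),
        "rulefiles": rulefiles,
        "types": types,
    }


if __name__ == "__main__":
    print(analysis_settings(SNIPPET))
```

If "rulefiles" or "types" comes back empty, the tokenizer or filter element was probably not updated in the fieldType you meant to change.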

If you want to configure the same settings for the archive store, follow the same steps for "$SOLR_HOME\archive\conf".

Note: You will have to perform a full re-index for these settings to take effect on tokenization.

Hope this helps.

~Abhinav
(ACSCE, AWS SAA, Azure Admin)
