
SOLR configuration for search tokenization

venur
Star Contributor

Hello all

I am looking for the right documentation or steps to handle a request we have.

We want to disable tokenization on special characters.

I searched this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.

1 ACCEPTED ANSWER

abhinavmishra14
World-Class Innovator

@venur I have been curious about this and spent some time on the issue over the last couple of weeks. I think I have a solution that may fit your case. It worked in the few tests I ran.

It is based on the links I shared earlier in this thread.

Here is what I did:

  • In your $SOLR_HOME\alfresco\conf directory (for example "C:\alfresco-search-services\solrhome\alfresco\conf"; your setup may differ), add the following configuration to adjust the tokenization process:
    • Create a file named "Latin-break-only-on-whitespace.rbbi" in $SOLR_HOME\alfresco\conf
    • Add the following content:
!!forward;

$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];

# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;

# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace*   {200};

# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace*   {100};

# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+   {1};
  • Create a file named "characters.txt" in $SOLR_HOME\alfresco\conf
    • Add the following content:
_ => ALPHA 
- => ALPHA 
$ => ALPHA 
! => ALPHA 
  • Edit $SOLR_HOME\alfresco\conf\schema.xml
    • Find the "fieldType" with the analyzer settings below:
      • <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
    • Update the tokenizer setting "<tokenizer class="solr.ICUTokenizerFactory" ....>" and the word delimiter filter as below:
      • <fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
              <analyzer>
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
                <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
                <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
                <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
                <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
                        generateWordParts="1"
                        generateNumberParts="1"
                        catenateWords="1"
                        catenateNumbers="1"
                        catenateAll="1"
                        splitOnCaseChange="1"
                        splitOnNumerics="1"
                        preserveOriginal="1"
                        stemEnglishPossessive="1"
                        types="characters.txt"/>
                <filter class="solr.ICUFoldingFilterFactory"/>
              </analyzer>
            </fieldType>

If you want to configure the same settings for the archive store, follow the same steps for "$SOLR_HOME\archive\conf".

Note: You will have to perform a full re-index for these settings to take effect on tokenization.
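To make the intent of the rules in "Latin-break-only-on-whitespace.rbbi" easier to follow, here is a rough Python sketch of what they aim for: text is split only on whitespace, and each token is tagged with a rule status. This is not the ICU tokenizer. The function names are hypothetical, Python's whitespace and Unicode category handling only approximates \p{Whitespace}, \p{Letter} and \p{Number}, and giving the letter rule precedence over the number rule for mixed tokens is an assumption.

```python
import unicodedata

# Rule status values taken from the comments in the .rbbi file above
WORD_LETTER = 200  # mapped to <ALPHANUM>
WORD_NUM = 100     # mapped to <NUM>
WORD_OTHER = 1     # mapped to <OTHER>


def rule_status(token):
    """Approximate the rule status assigned to one whitespace-delimited token."""
    categories = [unicodedata.category(ch) for ch in token]
    # Assumption: a token containing both letters and digits gets the letter status
    if any(cat.startswith("L") for cat in categories):
        return WORD_LETTER
    if any(cat.startswith("N") for cat in categories):
        return WORD_NUM
    return WORD_OTHER


def tokenize(text):
    """Split only on whitespace, so $, -, _ and ! stay inside the token."""
    return [(token, rule_status(token)) for token in text.split()]


if __name__ == "__main__":
    for token, status in tokenize("restored$image.png 2021-03 !!!"):
        print(token, status)
```

With this sketch, "restored$image.png" stays a single token instead of being broken into "restored", "image" and "png".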

Hope this helps.

~Abhinav
(ACSCE, AWS SAA, Azure Admin)


11 REPLIES

venur
Star Contributor

Any guidance or pointers would be really helpful.

Hi @angelborroy @abhinavmishra14 @afaust, if you have any suggestions, please share. I am stuck right now.

angelborroy
Community Manager

Can you provide additional details on your requirement?

Hyland Developer Evangelist

Hi @venur, I have not dealt with such scenarios, so I will have to check. @angelborroy may be able to provide some guidance. As Angel mentioned, please share exactly what you want to achieve so we can try the scenario.

I found a couple of links on the internet, but I am not sure whether they fit your requirement:

https://prowave.io/indexing-special-terms-using-solr

https://soft29.ru/blog/entry/alfresco-solr-enable-search-of

https://stackoverflow.com/questions/18277609/search-in-solr-with-special-characters

~Abhinav
(ACSCE, AWS SAA, Azure Admin)

Thank you for responding @angelborroy

We are importing images and video files from a third-party DAM into the Alfresco repository. Several images and files have special characters in their names, and these are intentional for some business use cases.

Some examples of these special characters: $, -, _ and !

By default, Solr tokenizes a name wherever it contains these special characters, treating them as whitespace. I read in some documentation that this is the default behavior. But in our case, a user searching for one file name gets a lot of results, because many names share an identical prefix or suffix.
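A small Python sketch may help show why this produces extra results. It is only an approximation of the behavior described here, not Solr's actual analyzer chain: the file names are hypothetical, the splitter simply breaks on every non-alphanumeric character, and a file "matches" when it contains all of the query's tokens.

```python
import re

# Hypothetical file names that share parts separated by special characters
NAMES = ["restored$image.png", "restored-image_v2.png", "image!restored.png"]


def split_on_special_characters(name):
    """Stand-in for the default behavior: special characters act like whitespace."""
    return [part.lower() for part in re.split(r"[^0-9A-Za-z]+", name) if part]


def split_on_whitespace_only(name):
    """Stand-in for the desired behavior: the whole name stays one token."""
    return name.lower().split()


def matching(query, tokenizer):
    """Return the names whose tokens contain every token of the query."""
    query_tokens = set(tokenizer(query))
    return [name for name in NAMES if query_tokens <= set(tokenizer(name))]


if __name__ == "__main__":
    print(matching("restored$image.png", split_on_special_characters))
    print(matching("restored$image.png", split_on_whitespace_only))
```

Under the first splitter, the query turns into "restored", "image" and "png", so all three names match. Under the second, only the exact name does.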

For testing, I tried this to show you the results I am getting:

[image: search results]

As you can see above, the results include all the files I don't need.
I also tried enclosing the name in "" but the result remains the same.

Could you please guide me on how to change this default behavior?

venur
Star Contributor

Thank you @abhinavmishra14, I will check those as well.

angelborroy
Community Manager

I guess you can't change that behaviour, since they are special SOLR characters.

You may try escaping those characters in your search string:

https://solr.apache.org/guide/6_6/the-standard-query-parser.html#TheStandardQueryParser-EscapingSpec...

Apart from that, I don't see any other alternative.
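As a minimal sketch of what escaping could look like on the client side, the helper below puts a backslash in front of the characters that the Solr standard query parser treats as special. The function name is hypothetical. Note that of the four characters discussed in this thread, only - and ! are query-parser special characters; $ and _ pass through unchanged, and escaping affects only how the query is parsed, not how names were tokenized at index time.

```python
import re

# Special characters of the Solr standard query parser, plus the backslash itself:
# + - && || ! ( ) { } [ ] ^ " ~ * ? : /
_SPECIAL = re.compile(r'(\\|&&|\|\||[+\-!(){}\[\]^"~*?:/])')


def escape_solr_term(term):
    """Prefix each query-parser special character (or && / ||) with a backslash."""
    return _SPECIAL.sub(r"\\\1", term)


if __name__ == "__main__":
    print(escape_solr_term("restored-image!.png"))
    print(escape_solr_term("restored$image_1.png"))
```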

Hyland Developer Evangelist

Hi @angelborroy, thanks for the response.

We also considered this option, but escaping the characters won't help now, right? Solr has already created the indexes by dropping the special characters and treating them all as whitespace. Based on what I have read so far, there won't be an index entry at all for a word that includes those special characters, e.g.:

restored$image.png

Do you mean Solr would still have one index entry for the whole name, including the special characters I mentioned? Or am I misunderstanding something?

angelborroy
Community Manager

I guess you're right. I don't see any out-of-the-box alternative to get those results including special characters.

Hyland Developer Evangelist

Thanks @angelborroy for the response. Yes, we know it is not possible by default, and that is what we are looking to extend.
We are aware of the default behavior and are looking for steps to change it, either from Solr or from Alfresco.

Your inputs or directions would be helpful.