01-04-2022 11:57 PM
Hello all,
I am looking for the right documentation or steps to handle a request we have.
We want to disable tokenization on special characters.
I searched this forum and the documentation but found no pointers on how to proceed. If anyone has done this or knows how to proceed, please guide me.
03-04-2022 11:46 PM
@venur I have been curious about this and spent some time on it over the last couple of weeks. I think I have a solution that may fit your case. It worked for me in the few tests I ran.
It is based on the links I shared above.
Here is what I did:

Rule file (referenced as Latin-break-only-on-whitespace.rbbi in the field type below):

!!forward;

$Whitespace = [\p{Whitespace}];
$NonWhitespace = [\P{Whitespace}];
$Letter = [\p{Letter}];
$Number = [\p{Number}];

# Default rule status is {0}=RBBI.WORD_NONE => not tokenized by ICUTokenizer
$Whitespace;

# Assign rule status {200}=RBBI.WORD_LETTER when the token contains a letter char
# Mapped to <ALPHANUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Letter $NonWhitespace* {200};

# Assign rule status {100}=RBBI.WORD_NUM when the token contains a numeric char
# Mapped to <NUM> token type by DefaultICUTokenizerConfig
$NonWhitespace* $Number $NonWhitespace* {100};

# Assign rule status {1} (no RBBI equivalent) when the token contains neither a letter nor a numeric char
# Mapped to <OTHER> token type by DefaultICUTokenizerConfig
$NonWhitespace+ {1};

Character type mappings (referenced as characters.txt in the field type below):

_ => ALPHA
- => ALPHA
$ => ALPHA
! => ALPHA

Field type definition:

<fieldType name="text___" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\x{0000}.*\x{0000}" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
    <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" /> -->
    <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"
            stemEnglishPossessive="1"
            types="characters.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
If you want to configure the same settings for the archive store, follow the same steps for "$SOLR_HOME\archive\conf".
Note: You will have to do a full re-index for these settings to take effect on tokenization.
Hope this helps.
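For anyone who wants to see the effect of the ALPHA mappings without a Solr instance, here is a rough Python sketch. It is not Solr code: it assumes the text is first split on whitespace only, and it ignores most WordDelimiterFilter options (catenation, case and numeric splitting, preserveOriginal). The function names are made up for illustration. It only shows where a word-delimiter style split would fall with and without treating _ - $ ! as letters.

```python
# Characters that the characters.txt mappings above declare as ALPHA.
ALPHA_OVERRIDES = frozenset("_-$!")


def split_on_delimiters(token, alpha_overrides=frozenset()):
    """Split a token at every character that is neither alphanumeric
    nor listed in alpha_overrides (simplified word-delimiter behavior)."""
    parts, current = [], []
    for ch in token:
        if ch.isalnum() or ch in alpha_overrides:
            current.append(ch)
        elif current:
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    return parts


def analyze(text, alpha_overrides=frozenset()):
    """Break on whitespace only, then apply the simplified delimiter split."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_on_delimiters(raw, alpha_overrides))
    return tokens


# Without the mappings, '$' and '.' both act as delimiters.
print(analyze("restored$image.png"))
# With the mappings, '$' is kept inside the word; '.' still splits.
print(analyze("restored$image.png", ALPHA_OVERRIDES))
```

In this simplified model the dot still splits the extension off, because it is not in the mapping list. In the real filter, preserveOriginal="1" should additionally keep the unsplit name as a token, but verify that on the Solr analysis screen rather than trusting this sketch.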
01-09-2022 11:13 PM
Any guidance or pointers would be really helpful.
Hi @angelborroy @abhinavmishra14 @afaust, if you have any suggestions, please share. I am stuck right now.
01-10-2022 04:42 AM
Can you provide additional details on your requirement?
01-10-2022 03:58 PM
Hi @venur, I have not dealt with such scenarios, so I will have to check. @angelborroy may be able to provide some guidance. As mentioned by Angel, please share what exactly you want to achieve so we can try the scenario.
I found a couple of links on the internet, but I am not sure whether they fit your requirement:
https://prowave.io/indexing-special-terms-using-solr
https://soft29.ru/blog/entry/alfresco-solr-enable-search-of
https://stackoverflow.com/questions/18277609/search-in-solr-with-special-characters
01-10-2022 11:53 PM
Thank you for responding, @angelborroy.
We are importing images and video files from a third-party DAM into the Alfresco repo. Several images and files have special characters in their names, and they are there on purpose for some business use cases.
Some example special characters are below:
$
-
_
and
!
Solr tokenizes the names by default whenever a name has these special characters, treating them as whitespace. I read in some documentation that this is the default behavior. But in our case we get a lot of search results when a user searches for one file name and other files share the same prefix/postfix.
For testing, I tried this to show you the results I am getting.
As you can see above, I get all the files that I don't need in the results.
I also tried with "" (double quotes), but the result remains the same.
Can you please guide me on how to change this default behavior?
01-10-2022 11:54 PM
Thank you @abhinavmishra14, I will check those as well.
01-11-2022 04:42 AM
I guess you can't change that behaviour, since they are special SOLR characters.
You may try escaping those characters in your search string.
Apart from that, I don't see any other alternative.
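To make the escaping suggestion concrete, here is a minimal Python sketch that backslash-escapes the characters reserved by the Lucene classic query syntax, as far as I know them. The helper name escape_query_term and the second file name are hypothetical, and Alfresco's own FTS query language may have different rules, so treat this as an illustration only. Note that, in that syntax, - and ! are operators while $ and _ are not, so they pass through unchanged.

```python
# Characters reserved by the Lucene classic query parser
# (&& and || are handled by escaping each '&' and '|').
LUCENE_SPECIAL = frozenset('+-&|!(){}[]^"~*?:\\/')


def escape_query_term(term):
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)


print(escape_query_term("restored-image!.png"))
print(escape_query_term("restored$image.png"))
```

Escaping only changes how the query string is parsed. It does not change which tokens were written to the index when the file names were analyzed.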
01-11-2022 10:43 PM
Hi @angelborroy, thanks for the response.
We also considered this option, but escaping the characters will not help now, right? Solr has already created the indexes with the special characters dropped and treated as whitespace. Based on what I have read so far, there won't be an index entry at all for a word that includes those special characters, e.g.:
restored$image.png
Do you mean Solr would still have one index entry for the whole name, including the special characters I mentioned? Or am I misunderstanding something?
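My understanding can be sketched as a toy inverted index in Python. This is not how Solr is implemented. It assumes every non-alphanumeric character acts as a separator and that a query matches when all of its tokens are present. Apart from restored$image.png, the file names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(name):
    """Assumed default behavior: split on any run of non-alphanumeric characters."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", name) if t]


def build_index(names):
    """Map each token to the set of file names that contain it."""
    index = defaultdict(set)
    for name in names:
        for token in tokenize(name):
            index[token].add(name)
    return index


def search(index, query):
    """Return the names that contain every token of the query (AND semantics, assumed)."""
    token_sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()


names = ["restored$image.png", "restored-image.png", "restored_image!.png"]
index = build_index(names)

# The full name is never stored as a single term.
print("restored$image.png" in index)
# All three names share the tokens restored, image and png, so all of them match.
print(sorted(search(index, "restored$image.png")))
```

If the index really looks like this, escaping the query cannot single out one file, because the distinguishing characters were discarded at index time.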
01-12-2022 03:22 AM
I guess you're right. I don't see any out-of-the-box alternative to get results that include the special characters.
01-13-2022 07:07 PM
Thanks @angelborroy for the response. Yes, we know it's not possible by default, and that is what we are looking to extend.
We are aware of the default behavior and are looking for steps to change it, either from Solr or from Alfresco.
Your inputs or directions will be helpful.