02-02-2026 03:37 AM
I am trying to modify the ICU tokenizer so that it does not split text into tokens on certain characters, such as hyphens.
I got the custom rules loaded in SolrCloud and working, but it appears that only the rules inside the RBBI file apply, not the tokenizer's defaults. I found this working example, but the baseline of its ruleset behaves like a whitespace tokenizer. I, on the other hand, want it to behave like the classic or standard tokenizer, except that it should not split on hyphens and @, for example. These are the rules I have so far:
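!!chain;
# Character classes: any letter, any number, and the ASCII hyphen.
$ALetter = [:L:];
$Numeric = [:N:];
$MidHyphen = [-];
# A letter followed by one or more hyphen-letter pairs; status {200} marks a letter token.
$ALetter ($MidHyphen $ALetter)+ {200};
# Same for numbers (ICU's usual status for number tokens is {100}).
$Numeric ($MidHyphen $Numeric)+ {200};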
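To make the goal concrete, here is a rough sketch of the output I am after, written as a plain Python regex rather than ICU rules. It is only an approximation under my own assumptions: the function name `tokenize_keep_joiners` is made up for illustration, it treats letters and digits alike, and it does not reproduce the full word-break rules that the standard tokenizer follows.

```python
import re

# Hypothetical helper, not part of ICU or Solr: it only illustrates the
# tokens I would like to get back.
# A token is a run of letters/digits, optionally continued by a single
# '-' or '@' as long as more letters/digits follow.
_TOKEN = re.compile(r"[^\W_]+(?:[-@][^\W_]+)*")


def tokenize_keep_joiners(text):
    """Split on whitespace and punctuation, but keep '-' and '@' inside tokens."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    sample = "state-of-the-art e-mail to user@example, 2024-01-15!"
    print(tokenize_keep_joiners(sample))
```

For the sample string, the tokens I would expect are state-of-the-art, e-mail, to, user@example and 2024-01-15, with the comma and the exclamation mark dropped. A trailing hyphen (as in "foo- bar") should not become part of a token.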