Custom RBBI rules for ICU tokenizer in SOLR that doesn't split tokens on hyphens

quantumbit — Mon, 02 Feb 2026 08:37:53 GMT

I am trying to modify the ICU tokenizer so that it does not split text into tokens on certain characters, such as hyphens.

I got the custom rules up and working in SolrCloud, but it appears that only the rules inside the RBBI file apply. I found this working example, but the baseline of its ruleset behaves like a whitespace tokenizer. I, on the other hand, want to adapt it to behave like the Classic or Standard tokenizer, except that it should not split on hyphens and @, for example.

!!chain;
$ALetter = [:L:];
$Numeric = [:N:];
$MidHyphen = [-];
$ALetter ($MidHyphen $ALetter)+ {200};
$Numeric ($MidHyphen $Numeric)+ {200};
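
For context, a rule file like the one above is normally wired in through the `rulefiles` attribute of `solr.ICUTokenizerFactory`, which takes comma-separated `script:file` pairs. The following is only a minimal sketch: the field type name `text_icu_hyphen` and the file name `Latn-keep-hyphens.rbbi` are placeholders made up for illustration, not an actual config. As far as I can tell, a file registered this way replaces the default rules for that script rather than extending them, which would match the behaviour described above.

```xml
<!-- Hypothetical field type name; pick whatever fits your schema. -->
<fieldType name="text_icu_hyphen" class="solr.TextField">
  <analyzer>
    <!-- rulefiles maps a four-letter ISO 15924 script code to a rule file.
         The file name is a placeholder; the file lives in the collection's
         config set next to the schema. Scripts not listed here keep the
         default ICU tokenizer rules. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latn-keep-hyphens.rbbi"/>
  </analyzer>
</fieldType>
```

If that reading is right, the rule file for a script has to spell out the complete word-break behaviour wanted for it, not just the hyphen exception.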
