Custom RBBI rules for ICU tokenizer in SOLR that doesn't split tokens on hyphens

Hyland Community

I am trying to modify the ICU tokenizer so that it does not split text into tokens on certain characters, such as hyphens.

I got the custom rules loaded in SolrCloud and working, but it appears that only the rules inside the RBBI file apply. I found this working example, but as a baseline this ruleset behaves like a whitespace tokenizer. I, on the other hand, want to adapt it to behave like the classic or standard tokenizer, except that it should not split on hyphens and @, for example.

!!chain;

$ALetter = [:L:];
$Numeric = [:N:];
$MidHyphen = [-];

$ALetter ($MidHyphen $ALetter)+ {200};
$Numeric ($MidHyphen $Numeric)+ {200};
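
For context, a rule file like the one above is normally registered through the `rulefiles` attribute of `solr.ICUTokenizerFactory`, which maps a four-letter ISO 15924 script code to a file in the configset. The sketch below is only an illustration: the field type name and the file name are hypothetical placeholders, not taken from an actual setup. As far as I understand, the file registered for a script is used in place of the default rules for that script, which would match the observation that only the rules inside the RBBI file apply.

```xml
<!-- Hypothetical field type; the name and the rule file name are placeholders. -->
<fieldType name="text_icu_keep_hyphens" class="solr.TextField">
  <analyzer>
    <!-- rulefiles takes comma-separated script:file pairs; Latn is the Latin script. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-keep-hyphens.rbbi"/>
  </analyzer>
</fieldType>
```

Under that assumption, getting standard-tokenizer behavior would mean the rule file itself has to carry the full set of word-break rules, with the hyphen and @ handling added on top.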
