
Custom RBBI rules for the ICU tokenizer in Solr that don't split tokens on hyphens

quantumbit
Champ in-the-making

I am trying to modify the ICU tokenizer so that it does not split text into tokens on certain characters, such as hyphens.

I got the custom rules deployed and working in SolrCloud, but it appears that only the rules inside the RBBI file apply. I found this working example, but the baseline of its ruleset behaves like a whitespace tokenizer. I, on the other hand, want to adapt it to behave like the classic or standard tokenizer, except that it should not split on hyphens and @, for example.

 

!!chain;

$ALetter = [:L:];
$Numeric = [:N:];
$MidHyphen = [-];

$ALetter ($MidHyphen $ALetter)+ {200};
$Numeric ($MidHyphen $Numeric)+ {200};
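
For context, a rule file like the one above is typically attached to the tokenizer through the `rulefiles` attribute of `solr.ICUTokenizerFactory`. The following is only a minimal sketch: the field type name `text_icu_custom` and the file name `custom-latin.rbbi` are hypothetical placeholders, not taken from the actual setup.

```xml
<!-- Sketch only: field type name and rule file name are hypothetical placeholders. -->
<fieldType name="text_icu_custom" class="solr.TextField">
  <analyzer>
    <!-- rulefiles takes comma-separated "script:file" pairs, where the script
         is a four-letter ISO 15924 code (Latn = Latin) and the file is a
         resource path in the collection's config set. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:custom-latin.rbbi"/>
  </analyzer>
</fieldType>
```

Because each entry maps one script code to one rule file, the file presumably replaces the built-in break rules for that script rather than extending them, which would be consistent with only the rules inside the RBBI file taking effect.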

 
