07-13-2024 02:59 PM
Hi,
I'm trying to return highlight snippets via the Search API. This works perfectly fine when searching for individual words combined with AND or OR. But when I search for a phrase, the API returns the correct document but no highlight section in the result. Do you have any pointers as to why that is?
The version is 7.2. This query, for example, works as expected:
{
  "query": {
    "language": "afts",
    "query": "cm:content:this AND cm:content:is AND cm:content:a AND cm:content:test"
  },
  "include": [ "path" ],
  "paging": { "maxItems": 10, "skipCount": 0 },
  "highlight": {
    "snippetCount": 3,
    "mergeContiguous": true,
    "fragmentSize": 300,
    "fields": [ { "field": "cm:content" } ]
  }
}
However, if I change the query to
"=cm:content:\"this is a test\""
no highlights are returned - only the (correct) hit.
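To make the two cases easy to compare side by side, here is a minimal Python sketch that builds both request bodies with the same highlight settings and checks a response for hits that lack highlight data. The helper names are hypothetical, and the response layout (`list.entries[].entry.search.highlight`) is an assumption about the Search API response shape, so adjust it to what your server actually returns.

```python
import json

# The two AFTS queries from the post: the word query that highlights,
# and the exact-phrase query that does not.
WORD_QUERY = "cm:content:this AND cm:content:is AND cm:content:a AND cm:content:test"
PHRASE_QUERY = '=cm:content:"this is a test"'


def build_search_body(afts_query, field="cm:content"):
    """Build a request body using the highlight settings from the post."""
    return {
        "query": {"language": "afts", "query": afts_query},
        "include": ["path"],
        "paging": {"maxItems": 10, "skipCount": 0},
        "highlight": {
            "snippetCount": 3,
            "mergeContiguous": True,
            "fragmentSize": 300,
            "fields": [{"field": field}],
        },
    }


def entries_without_highlight(response_json):
    """Return the ids of hits that carry no highlight data.

    Assumes each hit looks like {"entry": {"id": ..., "search": {"highlight": [...]}}}.
    """
    entries = response_json.get("list", {}).get("entries", [])
    return [
        hit["entry"].get("id")
        for hit in entries
        if not hit["entry"].get("search", {}).get("highlight")
    ]


# Print both bodies so they can be pasted into any HTTP client.
for query in (WORD_QUERY, PHRASE_QUERY):
    print(json.dumps(build_search_body(query), indent=2))
```

Posting each body to the search endpoint of your repository and passing the parsed JSON to `entries_without_highlight` should show whether only the phrase query comes back without snippets.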
thanks
12-05-2024 10:07 AM - edited 12-05-2024 10:13 AM
I'm clueless on that one too, but I've been working on it and here are my conclusions so far:
I'm not sure how Search Services uses the highlighter parameters (set in solrconfig.xml or at query time), especially hl.usePhraseHighlighter. It is true by default and can be overridden at query time, but doing so doesn't seem to change anything:
https://solr.apache.org/guide/6_6/highlighting.html
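For anyone who wants to repeat that experiment, the following Python sketch builds two Solr request URLs that differ only in hl.usePhraseHighlighter. The `hl.*` parameter names come from the Solr highlighting guide linked above. The handler URL and the field name passed to `hl.fl` are assumptions: Search Services may expose a different handler and store content under a different internal field name.

```python
from urllib.parse import urlencode


def highlight_params(query, field, use_phrase_highlighter):
    """Standard Solr highlighting parameters, mirroring the values from the question."""
    return {
        "q": query,
        "hl": "true",
        "hl.fl": field,  # assumption: replace with the real Solr field name
        "hl.snippets": 3,
        "hl.fragsize": 300,
        "hl.mergeContiguous": "true",
        "hl.usePhraseHighlighter": "true" if use_phrase_highlighter else "false",
    }


def build_url(handler_url, params):
    """Append URL-encoded parameters to a request handler URL."""
    return handler_url + "?" + urlencode(params)


# Assumption: a local Solr with a core named "alfresco" and a /select handler.
HANDLER_URL = "http://localhost:8983/solr/alfresco/select"

for flag in (True, False):
    print(build_url(HANDLER_URL, highlight_params('"this is a test"', "cm:content", flag)))
```

Requesting both URLs and comparing the `highlighting` section of the responses would show whether the flag has any effect in this setup.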
In my opinion, it has something to do with the way Alfresco Search Services uses Solr to tokenize the content of your document: it splits the content into words and then applies some filtering (such as removing 's and linking words).
If you look into schema.xml, you'll find the configuration it uses:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    ...
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>
I have noticed that the highlighted text has its own tokenizer and filter chain. Hence, I increasingly suspect a misconfiguration in one of the two:
<fieldType name="highlighted_text_en" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(#0;.*#0;)" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            preserveOriginal="1"
            stemEnglishPossessive="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
For more information, you can check the Solr documentation, which is pretty good at describing the different tokenizers: https://solr.apache.org/guide/6_6/about-tokenizers.html
You can also test your Solr config in the analysis tab: http://localhost:8983/solr/#/alfresco/analysis
For this, make sure to use the proper field type (text_en in this case).
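The same comparison can be scripted instead of clicked through. The sketch below builds requests for Solr's field analysis handler (`/analysis/field` with the `analysis.fieldtype`, `analysis.fieldvalue` and `analysis.query` parameters) for both field types shown above, so the token streams of text_en and highlighted_text_en can be compared for the same phrase. It assumes the handler is enabled on the `alfresco` core at the host and port used in the link above.

```python
from urllib.parse import urlencode

# Assumption: same host, port and core as the analysis tab URL.
CORE_URL = "http://localhost:8983/solr/alfresco"


def analysis_url(field_type, text, query=None, core_url=CORE_URL):
    """Build a field analysis request for one field type.

    `text` is run through the index-time analyzer; `query`, if given,
    is run through the query-time analyzer.
    """
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    }
    if query is not None:
        params["analysis.query"] = query
    return core_url + "/analysis/field?" + urlencode(params)


# Compare how the indexed field type and the highlighting field type
# break up the phrase from the question.
for field_type in ("text_en", "highlighted_text_en"):
    print(analysis_url(field_type, "this is a test", query="this is a test"))
```

If the tokens or positions produced by the two field types differ for the phrase, that would support the suspicion that the highlighter cannot match the phrase against its own analysis of the text.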