
Configuring SOLR for MoreLikeThis functionality

cszamudio
Champ on-the-rise
Hi,

I've been trying to generate "More Like This" results from searches using SOLR and have not been successful in getting any results.
I assume I need to reference the TEXT field similar to the Lucene search expression.

I'm testing this using a URL like the following, using the default mlt parameters:

https://localhost:8493/solr/alfresco/afts?q=patent&mlt=true&mlt.count=10&mlt.fl=TEXT&rows=10

My repository is full of related patents, so I know there should be related entries.

I get the appropriate hits back from the search in the JSON response, but nothing in the moreLikeThis portion of the JSON response, e.g.,

"LEAF-4934":{"numFound":0,"start":0,"docs":[]} for each hit.
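For anyone reproducing this, here is a minimal Python sketch that rebuilds the request URL above and lists which hits came back with an empty moreLikeThis entry. It assumes the `moreLikeThis` section of the JSON response is an object keyed by document id, as the snippet above suggests; the helper names `build_mlt_url` and `empty_mlt_keys` are made up for illustration and are not part of Alfresco or SOLR.

```python
import json
from urllib.parse import urlencode


def build_mlt_url(base, query, field="TEXT", count=10, rows=10):
    """Build a search URL with the MoreLikeThis parameters used in this post."""
    params = {
        "q": query,
        "mlt": "true",
        "mlt.count": count,
        "mlt.fl": field,
        "rows": rows,
    }
    return base + "?" + urlencode(params)


def empty_mlt_keys(response_text):
    """Return the ids whose moreLikeThis entry reports numFound == 0.

    Assumption: 'moreLikeThis' is a JSON object keyed by document id.
    """
    data = json.loads(response_text)
    mlt = data.get("moreLikeThis", {})
    return [key for key, value in mlt.items() if value.get("numFound", 0) == 0]


if __name__ == "__main__":
    print(build_mlt_url("https://localhost:8493/solr/alfresco/afts", "patent"))
    sample = '{"moreLikeThis": {"LEAF-4934": {"numFound": 0, "start": 0, "docs": []}}}'
    print(empty_mlt_keys(sample))
```

Feeding the real response body into `empty_mlt_keys` makes it easy to confirm whether every hit is empty or only some of them.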

I can't find any reference to setting up a MoreLikeThisHandler for SOLR in Alfresco, although the SOLR documentation seems to indicate that one needs to be set up. I've noticed that the Alfresco SOLR configuration already lists the mlt component in the SearchHandler, so perhaps setting up a new handler is not necessary:


<requestHandler name="/afts" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">afts</str>
  </lst>
  <arr name="components">
    <str>setLocale</str>
    <str>query</str>
    <str>facet</str>
    <str>mlt</str>
    <str>highlight</str>
    <str>stats</str>
    <str>debug</str>
    <str>clearLocale</str>
  </arr>
</requestHandler>


Has anyone had success generating MoreLikeThis results from a search?

Thanks,
Carlos S. Zamudio
7 REPLIES

afaust
Legendary Innovator
Hello,

I have recently worked on getting the MoreLikeThis feature running in Alfresco for one of our commercial modules. In short, you can forget about using this feature without adding some additional code to Alfresco SOLR and modifying Alfresco base code. Due to the way that Alfresco builds its index, the MoreLikeThisComponent has nothing to work with in an out-of-the-box install.

1) Alfresco does not include term vectors in its index fields for most of the really important elements, such as content, properties and types/aspects. It also does not store the actual value of a field (which would blow up the size of the index). The MoreLikeThisComponent needs one of these two to build a similarity query for a document / search result.

2) Alfresco stores multiple index documents for the same piece of content. There is a main document (LEAF-*) with core information about the type/aspect, metadata and content, as well as multiple secondary documents (AUX-*) for PATH information. Without a custom MoreLikeThisComponent that merges data from both types of documents into a single query, you will not be able to include PATH in the calculation of MoreLikeThis results.

3) The standard MoreLikeThisComponent of SOLR does not respect any filter queries. Alfresco bases its tenant / permission check on filter queries. Without a custom MoreLikeThisComponent that includes these filter queries in the evaluation of MoreLikeThis results, you may end up exposing the existence of sensitive documents to end users who should not see them.
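
To make points 1) and 3) more concrete, here is a toy Python sketch. It is not SOLR's or Alfresco's actual implementation: it only illustrates that a similarity query is built from per-document term frequencies (which is what term vectors or stored values provide), and that candidates still have to be restricted to what the user may see. The tf-idf style scoring, the function names and the `LEAF-*` ids in the example are assumptions chosen for illustration.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, max_terms=3, min_tf=1, min_df=1):
    """Pick the highest scoring terms of one document (toy tf-idf ranking).

    doc_tokens stands in for what term vectors or a stored value would supply.
    With no tokens available, no similarity query can be built.
    """
    tf = Counter(doc_tokens)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each term once per document
    scores = {}
    for term, freq in tf.items():
        if freq < min_tf or df[term] < min_df:
            continue
        idf = math.log(len(corpus) / (1 + df[term])) + 1.0
        scores[term] = freq * idf
    # Sort by score descending, then alphabetically for a stable order
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_terms]


def visible_only(candidate_ids, permitted_ids):
    """Keep only candidates the user may see (stand-in for filter queries)."""
    return [doc_id for doc_id in candidate_ids if doc_id in permitted_ids]


if __name__ == "__main__":
    corpus = [
        ["patent", "claim", "widget"],
        ["patent", "claim", "gear"],
        ["patent", "gear", "gear"],
    ]
    print(interesting_terms(corpus[0], corpus, max_terms=2))
    print(interesting_terms([], corpus))  # nothing to work with
    print(visible_only(["LEAF-1", "LEAF-2"], {"LEAF-2"}))
```

The second call shows the out-of-the-box situation described in 1): without term data for the source document, the list of query terms is empty and so is the result.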


In short: it is doable to get Alfresco SOLR to include MoreLikeThis results, but it requires very low-level modifications to Alfresco SOLR. It is definitely not possible to enable this feature with configuration alone.

Regards
Axel

cszamudio
Champ on-the-rise
Thanks so much for taking the time to reply. You've saved me some time pursuing this further. I am using Lucene's MoreLikeThis capability in other parts of my project and find it quite useful. Perhaps a feature request is in order.

It's funny though since the Alfresco configuration includes references to "mlt" which is what got me started down this path. (-;

Thanks again.

vincent-kali
Star Contributor
Hi Axel,
I'm working on the same topic, and saw in the Solr documentation that "If termVectors are not stored, MoreLikeThis will generate terms from stored fields".
Does that mean we would simply have to enable termVectors on the text field types to get the MLT feature working (storing content is of course not an option in most production scenarios)? That is assuming we skip the PATH and permission topics for the moment…

Thanks,
Vincent

afaust
Legendary Innovator
It's not just a matter of enabling termVectors. Alfresco does some special handling for fields, which means the content of a simple text property can end up in up to 6 individual SOLR fields (for different purposes / use cases). Also, Alfresco may introduce technical markers into the indexed content, such as prefixes that denote the locale of texts / value fragments. The standard SOLR MoreLikeThisComponent cannot compensate for or aggregate across these in a way that gives you a meaningful result.

Also, a lot has changed since this thread was originally started. The SOLR 4 now used in Alfresco 5.x already provides spellcheck capabilities to provide similar queries - although this is only shown to the user if the query did not return (m)any results. Most of my initial tests regarding MoreLikeThis may also need to be re-evaluated for potential improvements or additional hurdles.

vincent-kali
Star Contributor
Thanks, Axel, for your response.
Do you know if there is any documentation available about the way Alfresco interfaces with Solr?
Starting from the source code is really hard in this case…
Does that mean I should rather work with a dedicated Solr instance (outside Alfresco) for classification purposes? Am I right?

afaust
Legendary Innovator
If you are looking for technical details on how Alfresco leverages SOLR features, you may be out of luck. The documentation at http://docs.alfresco.com contains all that is relevant for any administrator / developer who needs to use the Alfresco SOLR integration / system, but it offers no implementation details. SOLR is not considered a component that developers / users should ever need to extend / customize.

If you need to do some fancy stuff that you can't cover with Alfresco SOLR features, then yes, using a dedicated SOLR for your custom needs may be the best option. I wouldn't do it just for the sake of a single feature, though: it can't really be worth it to go a completely custom / unsupported route. But for me, starting from the source code is a natural part of using open source and I've grown quite used to it…

vincent-kali
Star Contributor
We're working on a document classifier, and the SOLR MLT feature is really interesting for that. We'll work with another Solr instance then (or maybe just a dedicated Solr core?).
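
As a rough idea of how MLT results could feed a classifier, here is a minimal Python sketch that takes the similar documents returned for one input and lets them vote on a label. It assumes each returned document is a dict with a hypothetical `category` field and an optional `score`; neither the field name nor the function `classify_from_similar` comes from Alfresco or Solr, they are only placeholders for whatever the dedicated index would store.

```python
from collections import defaultdict


def classify_from_similar(similar_docs, label_field="category"):
    """Return the label with the highest summed score among similar documents.

    Documents without the label field are ignored; a missing score counts as 1.0.
    Returns None when no document carries a label.
    """
    votes = defaultdict(float)
    for doc in similar_docs:
        label = doc.get(label_field)
        if label is None:
            continue
        votes[label] += float(doc.get("score", 1.0))
    if not votes:
        return None
    return max(votes, key=votes.get)


if __name__ == "__main__":
    docs = [
        {"id": "a", "category": "patent", "score": 2.0},
        {"id": "b", "category": "invoice", "score": 1.5},
        {"id": "c", "category": "invoice", "score": 1.0},
    ]
    print(classify_from_similar(docs))
```

Weighting by score is just one possible choice; a plain majority vote over the top results would work the same way with every score left at 1.0.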
Thanks.