
Custom Analyzer - Different Query/Index behaviors

akaisora
Champ on-the-rise

Good day,

I have a fresh installation of Alfresco 5.1, running by default Solr 4.

I wrote a custom analyzer (for my custom needs) which first detects the language of the document/query and then delegates to the correct analyzer for that language. After doing that, I went to the Solr schema files and updated all the locale fields "text_[locale]" to use my custom analyzer for both "index" and "query".

I had been using this analyzer with a previous Alfresco version, 4.2.c, which uses Solr 1.4. So of course I have updated my code against the correct Lucene API version and made all the required changes.

The issue I am having is that during the indexing process my analyzer is not able to read the document text (in fact it only reads values like "doclib" and "Company home") when I upload a document, and thus cannot detect the language. However, during the query phase, if I type a query in the search bar, my analyzer is able to get the query text and detect its language (even from the real-time search).

I have tested the analyzer from the Solr admin panel, and it was able to read and handle the input for both the Query and Index operations. So I think this is somewhat specific to Alfresco.

Thank you for taking the time to read through all of this.

Have a good day!

1 ACCEPTED ANSWER

akaisora
Champ on-the-rise

I have found the issue: it seems that during indexing, Solr does not pass the actual data to the filter chain at analyzer construction time. Instead, it takes the filter chain out of the analyzer and then feeds it the data, so there is no way to get at the data during analyzer construction.

My bad, this issue is not related to Alfresco. Thank you for the help!


4 REPLIES

afaust
Legendary Innovator

I have read through this but could not find a question. Did you intend this as a kind of blog post or is there something that you want 3rd-party input on?

One of the main differences between SOLR 1 and SOLR 4 is that localized text is no longer indexed in one composite field where the values are prefixed with the locale - instead, each locale now gets its own index field. Since the field is already locale-specific, there is no point in including a locale prefix anymore, and for that reason I assume your analyzer is no longer able to detect the locale during indexing. Now, I don't know what you are using to detect a locale at query time, since it stands to reason that queries are now targeted at the specific field(s) depending on the requested locales, and the query text should also no longer include any locale prefixes.

Generally, 98% of Alfresco community members will never customize SOLR-tier components, and as such only very few people will actually be familiar with any SOLR internals. Specific / helpful responses might be few and far between...

akaisora
Champ on-the-rise

Hello,

Thank you very much for your reply.

I do understand that each locale has its own field in Solr 4, as opposed to Solr 1.4. The locale can and will be different from the document's language. That's why I'm using Tika (and experimenting with other libraries) to detect the actual document language from within my analyzer. After detecting the language, my analyzer calls the right tokenizer and filters for that language.

This solution works perfectly when I run it on plain Lucene (v4.9.1), and it also works great on the Alfresco Solr admin panel (for both query and index analysis). I get the same result when I search for text (query) from Alfresco's search. The only problem is that during document upload my analyzer is not able to capture or read the document's text; it only (apparently) captures metadata.

My only guess is that at index time Alfresco does something I am not aware of, which makes my analyzer unable to read the document text. So the question is: what might that something be?

Note: this might be Solr/Lucene specific, but this is how my analyzer works:

From the entry point "createComponents(String string, Reader reader)", my analyzer reads the entire "reader" into a string, detects the language, and then constructs a StringReader that is sent to the correct analyzer for that language.
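To give a rough idea of the shape of it (this is only a stripped-down sketch, not my exact code, assuming Lucene 4.9 and Tika's LanguageIdentifier; the class name and the per-language tokenizer/filter choices are just illustrative):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
import org.apache.tika.language.LanguageIdentifier;

public class LanguageRoutingAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        try {
            // Drain the incoming reader so the language can be detected up front.
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8192];
            for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                sb.append(buf, 0, n);
            }
            String text = sb.toString();

            // Tika guesses the language from the raw text, e.g. "en" or "fr".
            String lang = new LanguageIdentifier(text).getLanguage();

            // Re-wrap the already-consumed text so the tokenizer still sees all of it.
            Tokenizer source = new StandardTokenizer(Version.LUCENE_4_9, new StringReader(text));
            TokenStream chain = new LowerCaseFilter(Version.LUCENE_4_9, source);
            chain = "fr".equals(lang)
                    ? new StopFilter(Version.LUCENE_4_9, chain, FrenchAnalyzer.getDefaultStopSet())
                    : new StopFilter(Version.LUCENE_4_9, chain, EnglishAnalyzer.getDefaultStopSet());
            return new TokenStreamComponents(source, chain);
        } catch (IOException e) {
            throw new RuntimeException("Could not read field content for language detection", e);
        }
    }
}

The important part is that the whole decision happens inside createComponents(), before any tokens are produced.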

Thank you very much!

-- sora

afaust
Legendary Innovator

Metadata and document content are very likely indexed in two separate operations - at least that is the way it has been in Alfresco for a very long time, even with Lucene and SOLR 1. This is because it can take some time to convert a document into indexable text, and node indexing is typically batched - so separating the metadata from the content during indexing ensures that all batched nodes are at least metadata-indexed within a reasonable amount of time, while content indexing may lag behind a bit longer.

akaisora
Champ on-the-rise

I have found the issue: it seems that during indexing, Solr does not pass the actual data to the filter chain at analyzer construction time. Instead, it takes the filter chain out of the analyzer and then feeds it the data, so there is no way to get at the data during analyzer construction.
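For anyone hitting the same thing, here is a tiny standalone test against plain Lucene 4.9 (nothing Alfresco-specific, just a rough sketch) that makes the reuse behaviour visible:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class AnalyzerReuseDemo {

    public static void main(String[] args) throws IOException {
        final AtomicInteger constructions = new AtomicInteger();

        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                // Only the very first tokenStream() call on this thread lands here.
                constructions.incrementAndGet();
                Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_4_9, reader);
                return new TokenStreamComponents(source);
            }
        };

        // First document: the reader is handed to createComponents().
        drain(analyzer.tokenStream("content", new StringReader("first document text")));
        // Second document: the cached components are reused and only setReader() sees the text,
        // so any language detection done inside createComponents() never runs again.
        drain(analyzer.tokenStream("content", new StringReader("second document text")));

        System.out.println("createComponents() was called " + constructions.get() + " time(s)"); // -> 1
    }

    private static void drain(TokenStream ts) throws IOException {
        ts.reset();
        while (ts.incrementToken()) {
            // consume tokens
        }
        ts.end();
        ts.close();
    }
}

The second call never reaches createComponents(); the cached tokenizer just receives the new content through setReader(), so any detection done at construction time only ever sees the first input handed to the analyzer on that thread.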

My bad, this issue is not related to Alfresco. Thank you for the help!