
Lucene Index Erosion

dbachem
Champ in-the-making
I have worked with Alfresco Labs 3.0 for a year and have repeatedly had problems with the results of the Lucene search engine. The problems first came to light in an extension built around a new store. Some content usage information, which was kept in a separate store, could no longer be found reliably after restarting the system. It turned out that searches by TYPE, ASPECT and properties of type d:boolean and d:category worked as expected. Only the search on a d:text property failed after the restart (with index.recovery.mode = AUTO). So I started an index rebuild (restart with index.recovery.mode = FULL), which made the problem disappear temporarily. But a short time later, after working with the system a little, the problem returned, probably triggered by some ordinary indexing process, e.g. after changing any node, or even after uploading some JS WebScripts via WebDAV.

After importing 4,000 PDF documents, the problem took on a new dimension. The documents were indexed, but only the basic node properties (Title, Summary, Categories) could be found fairly reliably by full-text search. Searching the content of the PDFs yielded a few lucky hits at best. After several tests it was clear that this was neither a problem of protected PDFs nor a problem of language (mainly German). A debug breakpoint in PdfBoxContentTransformer.transformInternal() finally proved that asynchronous indexing was performed after a PDF upload (after the end of the transaction). But not all terms arrived in the index. The search found only about 30% of the words I selected (all meaningful nouns); after restarting Alfresco this even fell to about 10%.

Moreover, it turned out that even simple properties such as creator and modifier do not work reliably. Of 11 documents that were created and modified exclusively by the admin user, a search for @cm\:creator:"admin" returned only 4. So I re-indexed once again. After that, 7 of the 11 documents were found with creator == admin. After I had worked with the system for a short time and uploaded another PDF via the web client, the number of hits dropped again from 7 to 4. So it seems that some Alfresco indexing process removes parts of our (intact) index. Overall I would sum up the whole problem with the term index erosion.

Installation details
  • Alfresco Labs 3.0 Stable (Tomcat + MySQL) on Windows and Linux
  • Content inventory that has grown historically, some of it with its own model extensions
  • Additional stores since August / September
  • In September the entire code was reviewed again with regard to ResultSet.close() (since then, ResultSets are consistently closed in the finally block, as described in the Alfresco wiki)
Has anyone had similar experiences with Alfresco + Lucene, or perhaps any idea what the problem could be?
3 REPLIES

dbachem
Champ in-the-making
So, since nobody has replied to this, I will answer publicly myself. Maybe this helps some of you who are trying to get along with multilingual content, as we are:

  • we are dealing with content in different languages (German, English, Italian, Spanish)

  • not only the content but also the metadata (like cm:summary) is localized to different languages

  • this brings different analyzers into play, as defined in WEB-INF/classes/alfresco/model/dataTypeAnalyzers_??.properties

  • the AlfrescoStandardAnalyser (used for English content) doesn't use stemming, but other language-specific analyzers do!

  • this means that all tokenised properties of nodes localized to German are processed by the GermanAnalyzer

  • in other words: "cm:creator", "cm:modifier" and all other properties of type d:text and d:content will be stemmed by a German stemmer!!
As a result, for German content in the Lucene index, admin becomes "admi", while for English content it remains "admin". And you won't be able to successfully search for @cm\:creator:"admin" across German and English content in one query, because the query terms themselves are localized to one (and only one) language, so they are either stemmed (query localized to German) or not (English).

From my point of view, the only solution to this problem is to switch from the language-specific analysers to the AlfrescoStandardAnalyser.
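
To make the mismatch concrete, here is a minimal, self-contained Java sketch. It does not use any Lucene or Alfresco classes: `toyGermanStem` is a hypothetical stand-in that simply strips a trailing "n" (just enough to turn "admin" into "admi"), and the class name, method names, locale codes and document IDs are all invented for illustration. It only models the idea that index terms are analyzed per document locale while a query is analyzed with a single locale.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class AnalyzerMismatchSketch {

    // Hypothetical stand-in for a stemming analyzer: lower-cases the term and
    // strips one trailing "n". This is NOT the real GermanAnalyzer algorithm.
    public static String toyGermanStem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        return (t.length() > 3 && t.endsWith("n")) ? t.substring(0, t.length() - 1) : t;
    }

    // Chooses the token form for a locale. With perLanguage == false every
    // locale gets the same non-stemming treatment (the proposed fix).
    public static String analyze(String locale, String term, boolean perLanguage) {
        if (perLanguage && "de".equals(locale)) {
            return toyGermanStem(term);
        }
        return term.toLowerCase(Locale.ROOT);
    }

    // Builds a tiny inverted index (token -> document IDs) for one property value.
    public static Map<String, Set<String>> index(Map<String, String> docLocales,
                                                 String creator, boolean perLanguage) {
        Map<String, Set<String>> idx = new TreeMap<>();
        for (Map.Entry<String, String> doc : docLocales.entrySet()) {
            String token = analyze(doc.getValue(), creator, perLanguage);
            idx.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
        }
        return idx;
    }

    // The query term is analyzed with exactly one locale, then looked up.
    public static Set<String> search(Map<String, Set<String>> idx, String queryLocale,
                                     String term, boolean perLanguage) {
        Set<String> hits = idx.get(analyze(queryLocale, term, perLanguage));
        return hits == null ? Collections.<String>emptySet() : hits;
    }

    // Invented sample data: document ID -> locale of the node.
    public static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc-de-1", "de");
        docs.put("doc-de-2", "de");
        docs.put("doc-en-1", "en");
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> docs = sampleDocs();

        Map<String, Set<String>> stemmed = index(docs, "admin", true);
        System.out.println("per-language, query as en: " + search(stemmed, "en", "admin", true));
        System.out.println("per-language, query as de: " + search(stemmed, "de", "admin", true));

        Map<String, Set<String>> plain = index(docs, "admin", false);
        System.out.println("single analyzer, query as en: " + search(plain, "en", "admin", false));
        System.out.println("single analyzer, query as de: " + search(plain, "de", "admin", false));
    }
}
```

With the toy stemmer active, the English-localized query finds only the English document and the German-localized query finds only the two German ones, so no single query returns all three. With one shared non-stemming analyzer, either query finds all three documents, which is the behaviour the switch to the AlfrescoStandardAnalyser is meant to achieve.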

If any of you have any ideas, please let me know!
Kind regards, Dirk

spanishjohnny
Champ in-the-making
You are my hero! We struggled with this search problem for 2 weeks; with your advice everything seems to work fine.

roberto_negrete
Champ in-the-making
Hi,

I did the same and it seems to work as I expect.

Regards.