Lucene Index Erosion

‎10-22-2009 05:50 AM
I have worked with Alfresco Labs 3.0 for one year and have repeatedly had problems with the results of the Lucene search engine. The problems came to light most clearly in an extension built around a new store. Some content usage information, which was kept in a separate store, was no longer found properly after rebooting the system. It turned out that searches for TYPE, ASPECT and properties of type d:boolean and d:category worked as expected. Only the search for a d:text property failed after the reboot (with index.recovery.mode = AUTO). So I started an index rebuild (restart with index.recovery.mode = FULL), which made the problem disappear temporarily. But a short time later, after working a little with the system, the problem returned, probably caused by arbitrary indexing processes, e.g. after changing any node, or even after uploading some JS WebScripts via WebDAV.
After importing 4,000 PDF documents, the problem took on a new dimension. Although the documents were indexed, only basic node properties (Title, Summary, Categories) could be found fairly reliably by full-text search. Searching the content of the PDFs, however, yielded at best a few lucky hits. After several tests it was clear that this was neither a problem of protected PDFs nor a problem of language (mainly German). A debug breakpoint in PdfBoxContentTransformer.transformInternal() finally proved that asynchronous indexing was performed after a PDF was uploaded (after the end of the transaction). But not all terms arrived in the index. I could find only about 30% of the selected words (all meaningful nouns) via search; after restarting Alfresco this fell to about 10%.
Moreover, it turned out that even simple properties such as creator and modifier do not work reliably. Of 11 documents that were created and modified exclusively by the admin user, a search for @cm\:creator:"admin" returned only 4! So I re-indexed once again. After that, 7 of the 11 documents were found with creator == admin. After I had worked with the system for a short time and uploaded another PDF via the web client, the number dropped again from 7 to 4 hits. So it seems that some Alfresco indexing processes eliminate parts of our (intact) index. Overall I would sum up the whole problem with the term index erosion.
Installation details
- Alfresco Labs 3.0 Stable (Tomcat + MySQL) on Windows and Linux
- A content repository that has grown historically, partly with its own model extensions
- Additional stores added since August / September
- In September the entire code was reviewed again with regard to ResultSet.close() (since then, ResultSets are consistently closed in the finally block, as described in the Alfresco wiki)
3 REPLIES

‎11-10-2009 09:29 AM
Since nobody has replied to this, I will answer publicly myself. Maybe this helps those of you who are trying to get along with multilingual content, as we do:
- We are dealing with content in different languages (German, English, Italian, Spanish).
- Not only the content but also metadata (like cm:summary) is localized to different languages.
- This leads to the use of different analyzers, as defined in WEB-INF/classes/alfresco/model/dataTypeAnalyzers_??.properties.
- The AlfrescoStandardAnalyser (used for English content) doesn't use stemming, but other language-specific analyzers do!
- This means that all tokenised properties of nodes localized to German are transformed by the GermanAnalyzer.
- In other words: "cm:creator", "cm:modifier" and all other properties of type d:text and d:content will be stemmed by a German stemmer!!
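The effect described in the list above can be reproduced in miniature. The following is only a sketch under stated assumptions: it uses a tiny in-memory inverted index written in Python, and `toy_stem` is a hypothetical, deliberately crude suffix stripper, not Lucene's GermanStemmer and not Alfresco code. It only illustrates that a term stemmed at index time is no longer found by a query that is analyzed without stemming.

```python
def standard_analyze(text):
    """Stand-in for a non-stemming analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def toy_stem(token):
    """Hypothetical crude suffix stripper (assumption, NOT Lucene's GermanStemmer)."""
    for suffix in ("en", "er", "es", "e", "s", "n", "t"):
        if len(token) > 4 and token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def stemming_analyze(text):
    """Stand-in for a language-specific analyzer that stems every token."""
    return [toy_stem(token) for token in standard_analyze(text)]


def build_index(docs, analyze):
    """Build a term -> set of document ids mapping using the given analyzer."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyze):
    """Return the ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Two documents whose creator value is "admin", indexed with the stemming analyzer.
creators = {1: "admin", 2: "admin"}
index = build_index(creators, stemming_analyze)

# Query analyzed WITHOUT stemming: looks up "admin", but the index holds "admi".
hits_mismatch = search(index, "admin", standard_analyze)

# Query analyzed WITH the same stemming analyzer: both sides agree on "admi".
hits_match = search(index, "admin", stemming_analyze)

print(sorted(hits_mismatch))  # []
print(sorted(hits_match))     # [1, 2]
```

In this toy setup the value "admin" is stored as "admi", so an unstemmed lookup misses both documents while a stemmed lookup finds them. Whether the real German stemmer alters a particular value such as "admin" in the same way would have to be checked against the Lucene version in use.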
From my point of view, the only solution to this problem is to switch from the language-specific analyzers to the AlfrescoStandardAnalyser.
If any of you have other ideas, please let me know!
Kind regards, Dirk

‎05-28-2010 05:09 AM
You are my hero! We struggled with this search problem for 2 weeks; with your advice everything seems to work fine.

‎07-09-2010 01:15 PM
Hi,
I did the same and it seems to work as I expect.
Regards.
