Bad Lucene index distribution among segments

rivarola
Champ on-the-rise
Hello,

On a server with a bit more than 1 million documents, I have noticed this bad distribution of documents across the Lucene index segments:

09:00:00,440 DEBUG [org.alfresco.repo.search.impl.lucene.index.IndexInfo]
Entry List
1         Name=cb6b6534-1162-403c-acea-59b5d9c55dba Type=INDEX Status=COMMITTED Docs=2445933 Deletions=0
2         Name=e9b269b1-2206-448d-b98d-b52e45093000 Type=INDEX Status=COMMITTED Docs=13284 Deletions=0
3         Name=16aa5e8a-f42f-413b-b01f-30b22e77f29f Type=INDEX Status=COMMITTED Docs=4199 Deletions=0
4         Name=784c8097-1758-4d1f-8815-4ed84623781d Type=INDEX Status=COMMITTED Docs=2688 Deletions=0
5         Name=ee0adccf-247a-4a9c-85c9-a43386bb7a5a Type=INDEX Status=COMMITTED Docs=321 Deletions=0
6         Name=657e4ccb-b69f-4525-95c4-388bfb7a920e Type=INDEX Status=COMMITTED Docs=64 Deletions=0
7         Name=b3e937ca-c0d6-43ea-af39-a0f7ca88a6bf Type=INDEX Status=COMMITTED Docs=27 Deletions=0
8         Name=bc0f647b-0e3f-4bfd-a487-6d47062ff42a Type=INDEX Status=COMMITTED Docs=12 Deletions=0
9         Name=514d7c32-93eb-4035-852d-e685f68a608a Type=DELTA Status=ACTIVE Docs=0 Deletions=0


I use Alfresco 4.2-b and the configuration in alfresco-global.properties is:

### Lucene indexing ###
index.subsystem.name=lucene
lucene.indexer.mergerTargetIndexCount=10
lucene.indexer.mergerMergeFactor=10
lucene.indexer.writerMergeFactor=10


It is not good at all. On another server, with the same index configuration and a comparable number of documents, I get a far better segment distribution.
How can I force Lucene to optimize its index without making a full reindex?
4 REPLIES

afaust
Legendary Innovator
Hello,

what precisely do you consider bad with regard to this distribution? Your distribution actually looks OK to me - not perfect, but a realistic result of normal operations.

In your case, the only thing that seems off to me is the relation between index segments #1 and #2. Ideally, neighbouring segments should differ by no more than about one order of magnitude, not by several. But this might be the result of a recent round of merging, with segment #2 only now starting to fill up again as new documents / changes come in.
Otherwise the progression is quite decent - you might even make do with fewer segments to improve search performance, since you (currently) have very few documents in the three highest-numbered segments.
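
To put numbers on that progression, here is a minimal Python sketch that computes the size ratio between each pair of neighbouring segments, using the Docs counts from the Entry List in the original post. The variable and function names are purely illustrative and are not part of Alfresco or Lucene.

```python
# Docs counts copied from the Entry List in the original post (segments 1-8).
docs_per_segment = [2445933, 13284, 4199, 2688, 321, 64, 27, 12]


def adjacent_ratios(sizes):
    """Return sizes[i] / sizes[i + 1] for each pair of neighbouring segments."""
    return [larger / smaller for larger, smaller in zip(sizes, sizes[1:])]


ratios = adjacent_ratios(docs_per_segment)
for position, ratio in enumerate(ratios, start=1):
    print(f"segment {position} / segment {position + 1}: {ratio:.1f}x")
```

Only the first ratio (segment 1 to segment 2) exceeds a factor of 100; every other pair stays below a factor of 10.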

I am not aware of any method to optimize the segment structure apart from performing a full reindex. The only way I can think of involves writing a low-level Lucene tool to basically create a new index by piping the contents from the old into it, resulting in an implicit optimisation by "writing anew", without any actual reindexing that requires access to Alfresco DB / content.

Regards
Axel

rivarola
Champ on-the-rise
Hi Axel,
Two things seem strange to me:
- I asked for 10 segments and there are only 8
- the first one is far too big: 200 times the second. Usually the biggest is only 3 times the second.

  Philippe

afaust
Legendary Innovator
Hello Philippe,

you have set a "target factor" of 10, which Lucene uses as input for its optimization / distribution, but it is a "target", not a guarantee. Just as the MERGE index segments are created / deleted as needed and almost never match the exact number specified by the merger target factor, the normal index also goes through creation / deletion cycles. The 200-times larger first segment may very well be the result of Lucene merging the segments previously at positions 1 through 3 into one large segment at position 1, temporarily leaving 2 fewer segments than the target until Lucene re-creates them when needed. With those previous segments, the progression might have been less radical and more in line with your expectations.
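
As a rough illustration of that scenario, the Python sketch below compares a segment list before and after merging the first three segments. The three "before" sizes are HYPOTHETICAL: only their sum (2445933) and the remaining seven counts come from the Entry List, so this shows the mechanism and is not a reconstruction of what actually happened on the server.

```python
# HYPOTHETICAL sizes of the three segments that may have existed before the
# merge. Only their sum (2445933) is taken from the Entry List; the split is
# invented for illustration.
hypothetical_before = [2000000, 400000, 45933]

# Docs counts of segments 2-8 as reported in the Entry List.
reported_rest = [13284, 4199, 2688, 321, 64, 27, 12]

before = hypothetical_before + reported_rest
after = [sum(hypothetical_before)] + reported_rest  # segments 1-3 merged


def largest_ratio(sizes):
    """Largest size ratio between any two neighbouring segments."""
    return max(a / b for a, b in zip(sizes, sizes[1:]))


print(len(before), "segments before the merge,", len(after), "after")
print(f"largest neighbour ratio before: {largest_ratio(before):.1f}")
print(f"largest neighbour ratio after:  {largest_ratio(after):.1f}")
```

Under these assumed sizes, one merge is enough to turn 10 segments with a gentle progression into 8 segments with one very large leading segment.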

Regards
Axel

mrogers
Star Contributor
You aren't looking for an even spread of documents across segments. Lucene does not use segments to partition the data the way, say, hash buckets would; it simply creates a segment as and when it needs to persist its in-memory state. A background process then merges old segments together and removes deleted documents from the new segment.
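
A toy Python model can make this concrete. It is a deliberately simplified sketch, not Lucene's actual merge policy and not Alfresco's index merger: every flush writes one small segment, and whenever the newest `merge_factor` segments have the same size they are merged into one. The function name and parameters are assumptions made for this example.

```python
def simulate_flush_and_merge(num_flushes, merge_factor=10, docs_per_flush=1):
    """Toy model: flush small segments, merge equal-sized runs of merge_factor."""
    segments = []  # segment sizes, oldest first
    for _ in range(num_flushes):
        # Persisting the in-memory state creates a new small segment.
        segments.append(docs_per_flush)
        # Merge as long as the newest merge_factor segments are all the same size.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
    return segments


sizes = simulate_flush_and_merge(1234, merge_factor=10)
print(sizes)
```

With 1234 flushes and a merge factor of 10, this model ends with the sizes 1000, 100, 100, 10, 10, 10, 1, 1, 1, 1: a few large old segments followed by progressively smaller new ones, rather than an even spread.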