I'm setting up an Alfresco 3.4.d Community server on a 64-bit Linux VM. So far I've run some file import tests, and I've been surprised by how much memory allocation differs depending on whether I upload files by importing a zip through Alfresco's web import interface or by using FTP.

Despite some tuning of the Lucene parameters, every FTP test shows very high heap usage: for example, about 400 MB after uploading 50k files, against 50 MB after importing 120k files as 12 zips of 10k files each. For these tests I'm using generated PDF files that each contain only a 5 KB raw image. Uploading more files over FTP always ends in an OOM error.

Digging into the heap dumps reveals, among other things, a huge number of Lucene RAMFile and SegmentReader instances (about 5500 and 1800 respectively). The IndexInfo.deletableReaders linked queue also takes up much more space than with the zip method. After the FTP upload, the IndexInfo file shows a good spread of documents across the Lucene indexes, with the highest-numbered index containing around 100 documents. However, I've noticed a high number of segment folders (1745 in this case) in SpacesStore, which might be related to the number of SegmentReader instances.

My guess is that the code behind the two methods is different: with an FTP deposit Alfresco has to handle each file individually, while importing a zip allows some optimizations.
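For anyone who wants to compare the segment-folder count mentioned above on their own install, a count like this can be obtained with a short script. This is only a minimal sketch: the function name `count_subfolders` is hypothetical, it simply counts the immediate subdirectories of whatever path it is given, and it assumes that each index segment lives in its own folder directly under that path.

```python
from pathlib import Path


def count_subfolders(index_root):
    """Return the number of immediate subdirectories of index_root.

    Assumption: each index segment is stored in its own folder directly
    under index_root; plain files in index_root are ignored.
    """
    root = Path(index_root)
    return sum(1 for entry in root.iterdir() if entry.is_dir())
```

Calling `count_subfolders` on the SpacesStore index directory after each upload run would give a figure that can be compared between the FTP and zip import methods.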
Also, following a recommendation on the forum about an issue, I tried importing 60k files with in-transaction indexing disabled, and I left the recovery mode set to AUTO. On the next restart Alfresco indexed everything, but around 48k files ended up in the first index and 12k in the second. Given the documentation I've read and what other users have reported, that is not a good distribution.
I'm planning to run the same kind of tests on Alfresco 4 to check whether the problem is also there, but the project manager wants the server set up with 3.4.d.
So, my questions are:
- Is there something to tune, maybe among the Lucene parameters? The documentation is quite light on many of them, and even more so on their impact on the indexer and the merger. Tweaking them one by one to see the effect on each deposit method takes a long time.
- Just for my own information, how does Alfresco use Lucene? From what I've understood, Alfresco builds several Lucene indexes. Why?
Alfresco is an interesting solution, but it's a big project that uses third-party tools, so I'm having difficulty finding an "entry point" that leads to the origin of the issue.