Hyland Connect

chini · 02-14-2013

Hello,

I am currently running into performance problems with a custom bulk import process. Hopefully someone can help me find the bottleneck.

First some background information:
We plan to import a large number of documents into Alfresco 3.4.7 EE, in controlled and monitored batches.

The steps involved are the following:
1. Batch bulk import via Peter Monks' bulk import tool (in-memory, no streaming of binary data needed)
2. Creation of missing Lucene indexes for new content
3. Run of a custom scheduled job that moves the new document nodes and adds metadata to them from an external database.

Additional information about each step:

#1: Works flawlessly, the tool runs stable and fast.
The import job registers binary content only; no specific metadata is added via XML files.
The content to be processed is physically stored in a custom content store folder inside the Alfresco default content store, with the following structure:
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 1}/{UNIQUE DOC FOLDER 1}/{FILE 1}
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 1}/{UNIQUE DOC FOLDER 2}/{FILE 2}
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 1}/{UNIQUE DOC FOLDER 3}/{FILE 3}
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 2}/{UNIQUE DOC FOLDER 4}/{FILE 4}
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 2}/{UNIQUE DOC FOLDER 5}/{FILE 5}
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 3}/{UNIQUE DOC FOLDER 6}/{FILE 6}
{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 3}/{UNIQUE DOC FOLDER 7}/{FILE 7}
Each dossier folder can contain up to 250 document folders (average 125), containing exactly 1 file each.
During this phase the in-transaction creation of indexes is DISABLED, the reindexing mode is set to NONE.

#2: Seems to work fine as well, as far as the logs state this process runs through without problems.
During this phase the in-transaction creation of indexes is ENABLED, the reindexing mode is set to AUTO.

#3: This job is a custom Java job, scheduled as cron job and wired in through Spring via a custom Alfresco context file.
Simplified, each run performs the following actions:
3.1) retrieve a list of X dossiers and the contained Y documents to be processed from an external migration database (connected via a custom ibatis connection)
3.2) iterate over all dossier entities and perform the following actions for each dossier:
3.2.1) Check via Lucene (aspect search) whether the current dossier exists in Alfresco; if not create a new space for the dossier
3.2.2) If a new dossier space has been created in {3.2.1} and can be found, iterate over all document entities and perform for each:
3.2.2.1) Move the document from the original space (as imported in {step 1}) to the new archive dossier location (using FileFolderService.move)
3.2.2.2) Modify the type of the document to a custom type
3.2.2.3) Add new custom metadata to the document, as provided through the ibatis database entity
3.2.2.4) Persist the changes made to the repository (NodeService.setProperties).
Since this job relies on Lucene while processing the repository changes, it is set up to use in-transaction indexing, and the reindexing mode is set to AUTO.

So far we have processed the import for about 30000 documents.

What we noticed (after enabling debug log settings) is that:
A) For the indexer component, the largest committed index contains 400000 documents
B) At times the cron job does not run at all, most likely because Alfresco's indexer/merging service is too busy and uses too many threads concurrently for the index merging process
C) Also we see a pattern in the run frequencies; e.g. if we set up the job to run once every minute:
it does so 8-9 times, then it stops running for about 3 minutes; later on it even takes breaks of up to 30 minutes.
A few hours later the pattern repeats as described, without restarting Alfresco.

Looking at the logs, we find a lot of activity by the index merger during the breaks mentioned.

So to start with, I have three questions I hope you can help us with:
1. Which default Alfresco runtime settings could/should we override so that the import runs smoothly and stably?
2. Is there a limit to the execution of configured quartz/cron jobs, like maximal thread count, cpu, mem, etc.?
3. How come the index can grow so quickly and contain that many documents, while (according to Alfresco wiki/forum recommendations) it should ideally only contain a few hundred at most?

If required I can submit more details about our configuration, like alfresco-global.properties, alfresco.log and such.

Any hints or direction welcome!

andy · 03-08-2013

Hi

It looks like you are not batching your docs up in transactions and are instead doing each individual operation (add metadata, move, change type, etc.) in its own transaction.
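
To illustrate the batching idea, here is a minimal sketch in plain Java. It only shows the grouping logic: `BatchSketch`, `BatchWork`, `partition` and `processInBatches` are hypothetical names made up for this example, not Alfresco API. In a real job, the body of `process` would presumably be wrapped in one repository transaction (for example via `RetryingTransactionHelper.doInTransaction`), so that the move, type change and metadata update for a whole batch of documents commit together.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {

    /** Hypothetical stand-in for one transactional unit of work. */
    public interface BatchWork<T> {
        void process(List<T> batch);
    }

    /** Splits items into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Calls work.process once per batch and returns the number of batches,
     * which would equal the number of transactions if each call is wrapped
     * in a single transaction (an assumption of this sketch).
     */
    public static <T> int processInBatches(List<T> items, int batchSize, BatchWork<T> work) {
        int batchCount = 0;
        for (List<T> batch : partition(items, batchSize)) {
            work.process(batch); // move + set type + set properties for every doc in the batch
            batchCount++;
        }
        return batchCount;
    }
}
```

With a dossier of 250 documents and a batch size of 100, this produces three units of work rather than several per document. The batch size is something you would have to tune for your setup.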

Why do you need the indexer to find the dossier? I assume you know where to create it if you do not find it, so find it by XPath.
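
A rough sketch of that find-or-create logic, again in plain Java with made-up names (`DossierLookupSketch`, `FolderStore`, `InMemoryStore` are hypothetical, and the in-memory map only stands in for the repository). In Alfresco the lookup would be a direct child lookup under the known parent space, for example `NodeService.getChildByName` or an XPath via `SearchService.selectNodes`, neither of which, to my knowledge, depends on the Lucene index being up to date.

```java
import java.util.HashMap;
import java.util.Map;

public class DossierLookupSketch {

    /** Hypothetical stand-in for direct (non-indexed) access to a known parent space. */
    public interface FolderStore {
        /** Returns the child folder id, or null if no child with that name exists. */
        String findChildByName(String parentId, String name);

        /** Creates the child folder and returns its id. */
        String createFolder(String parentId, String name);
    }

    /** Looks the dossier up under its known parent and creates it only if missing. */
    public static String findOrCreate(FolderStore store, String parentId, String dossierName) {
        String existing = store.findChildByName(parentId, dossierName);
        return existing != null ? existing : store.createFolder(parentId, dossierName);
    }

    /** In-memory store used only to demonstrate the logic. */
    public static class InMemoryStore implements FolderStore {
        private final Map<String, String> folders = new HashMap<>();
        public int created = 0;

        @Override
        public String findChildByName(String parentId, String name) {
            return folders.get(parentId + "/" + name);
        }

        @Override
        public String createFolder(String parentId, String name) {
            String id = "folder-" + (++created);
            folders.put(parentId + "/" + name, id);
            return id;
        }
    }
}
```

Calling `findOrCreate` twice for the same dossier name creates the space once and returns the same id both times, without any index search in between.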

You need to work out why you have made so many documents! Are you deleting things but in fact archiving them?

Andy

Large Lucene index size prevents cron job from running?