<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Large Lucene index size prevents cron job from running? in Alfresco Archive</title>
    <link>https://connect.hyland.com/t5/alfresco-archive/large-lucene-index-size-prevents-cron-job-from-running/m-p/278127#M231257</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hello,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I am currently running into performance problems with a custom bulk import process. Hopefully someone can help me find the bottleneck.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;First some background information:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;We planned to import a large number of documents into Alfresco 3.4.7 EE, in controlled and monitored batches.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The steps involved are the following:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;1. Batch bulk import via Peter Monks' bulk import tool (in-memory, no streaming of binary data needed)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;2. Creation of missing Lucene indexes for new content&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3. Run of a custom scheduled job that moves the new document nodes and adds metadata to them from an external database.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Additional information about each step:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;#1: Works flawlessly; the tool is stable and fast. &lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;The import job registers binary content only; no specific metadata is added via XML files. 
&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;The content to be processed is physically stored in a custom content store folder inside the Alfresco default content store, with the following structure:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 1}/{UNIQUE DOC FOLDER 1}/{FILE 1}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 1}/{UNIQUE DOC FOLDER 2}/{FILE 2}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 1}/{UNIQUE DOC FOLDER 3}/{FILE 3}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 2}/{UNIQUE DOC FOLDER 4}/{FILE 4}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 2}/{UNIQUE DOC FOLDER 5}/{FILE 5}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 3}/{UNIQUE DOC FOLDER 6}/{FILE 6}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;{CS BASE FOLDER}/{UNIQUE DOSSIER FOLDER ID 3}/{UNIQUE DOC FOLDER 7}/{FILE 7}&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Each dossier folder can contain up to 250 document folders (average 125), each containing exactly 1 file.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;During this phase the in-transaction creation of indexes is DISABLED, and the reindexing mode is set to NONE.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;#2: Seems to work fine as well; according to the logs, this process completes without problems. 
&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;During this phase the in-transaction creation of indexes is ENABLED, and the reindexing mode is set to AUTO.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;#3: This job is a custom Java job, scheduled as a cron job and wired in through Spring via a custom Alfresco context file.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Simplified, each run performs the following actions:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.1) Retrieve a list of X dossiers and the contained Y documents to be processed from an external migration database (connected via a custom ibatis connection)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2) Iterate over all dossier entities and perform the following actions for each dossier:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2.1) Check via Lucene (aspect search) whether the current dossier exists in Alfresco; if not, create a new space for the dossier&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2.2) If a new dossier space has been created in {3.2.1} and can be found, iterate over all document entities and perform for each:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2.2.1) Move the document from the original space (as imported in {step 1}) to the new archive dossier location (using FileFolderService.move)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2.2.2) Modify the type of the document to a custom type&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2.2.3) Add new custom metadata to the document, as provided through the ibatis database entity&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3.2.4. 
Persist the changes to the repository (NodeService.setProperties).&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Since this job makes use of Lucene while processing the repository changes, it is set up to use in-transaction indexing, and the reindexing mode is set to AUTO.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;So far we have processed the import for about 30000 documents.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;What we noticed (after enabling debug log settings) is that:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;A) For the indexer component, we see that the largest committed index contains 400000 documents&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;B) At times the cron job does not run at all, most likely because Alfresco's indexer/merging service is too busy and uses too many threads concurrently for the index merging process&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;C) We also see a pattern in the run frequencies; e.g. if we set up the job to run once every minute:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;it does so 8-9 times, then it stops running for about 3 minutes; later on it even takes breaks of up to 30 minutes.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;A few hours later the pattern repeats as described, without restarting Alfresco.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Looking at the logs, we find a lot of activity by the index merger during the breaks mentioned.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;So to start with I have three questions I hope you can help us with:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;1. Which default Alfresco runtime settings could/should we override so that the import runs smoothly and stably?&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;2. Is there a limit to the execution of configured quartz/cron jobs, such as maximum thread count, CPU, memory, etc.?&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;3. 
How come the index can grow so quickly and contain that many documents, while (according to Alfresco wiki/forum recommendations) it should ideally contain only a few hundred at most?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;If required I can submit more details about our configuration, such as alfresco-global.properties, alfresco.log and the like.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Any hints or direction welcome!&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Thu, 14 Feb 2013 08:47:41 GMT</pubDate>
    <dc:creator>chini</dc:creator>
    <dc:date>2013-02-14T08:47:41Z</dc:date>
    <item>
      <title>Large Lucene index size prevents cron job from running?</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/large-lucene-index-size-prevents-cron-job-from-running/m-p/278127#M231257</link>
      <description>Hello, I am currently running into performance problems with a custom bulk import process. Hopefully someone can help me find the bottleneck. First some background information: We planned to import a large number of documents into Alfresco 3.4.7 EE, in controlled and monitored batches. The steps involve</description>
      <pubDate>Thu, 14 Feb 2013 08:47:41 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/large-lucene-index-size-prevents-cron-job-from-running/m-p/278127#M231257</guid>
      <dc:creator>chini</dc:creator>
      <dc:date>2013-02-14T08:47:41Z</dc:date>
    </item>
    <item>
      <title>Re: Large Lucene index size prevents cron job from running?</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/large-lucene-index-size-prevents-cron-job-from-running/m-p/278128#M231258</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;It looks like you are not batching your docs up in transactions and are instead doing each individual operation in its own transaction:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;add metadata, move, change type, etc.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Why do you need the indexer to find the dossier? I assume you know where to create it if you do not find it, so find it by XPath.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;You need to work out why you have made so many documents! Are you deleting things but in fact archiving them?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Andy&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 08 Mar 2013 15:18:31 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/large-lucene-index-size-prevents-cron-job-from-running/m-p/278128#M231258</guid>
      <dc:creator>andy</dc:creator>
      <dc:date>2013-03-08T15:18:31Z</dc:date>
    </item>
  </channel>
</rss>

