<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Lucene indexing and performance issue in Alfresco Archive</title>
    <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162210#M116102</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi netdata,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The speed at which content can be loaded into Alfresco is negatively impacted by mixing queries and node creation operations. Specifically, I found that checking for the existence of a node just prior to insertion caused heavy database contention and slowed throughput significantly. The alternative is to check for existence in one pass and then insert or update as required in a subsequent pass. The only downside is that a node could be deleted or added by an external process between passes, potentially causing a name collision. This rarely if ever happens in practice and can be dealt with through properly designed recovery logic.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I am not sure how significant an impact the above database contention issue will be with Oracle since the project was deployed on SQL Server. I have many years of experience developing against Oracle 7.x - 10g for other applications and am inclined to believe that the performance hit would not be as bad due to Oracle's superior concurrency model. At the very least you will not see the dreaded SQL Server "deadlock" exceptions that occur when record insertions trip over read-only query operations on the same tables &lt;img id="smileyhappy" class="emoticon emoticon-smileyhappy" src="https://connect.hyland.com/i/smilies/16x16_smiley-happy.png" alt="Smiley Happy" title="Smiley Happy" /&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Regarding the indexing improvements, I will need to check with my client before getting into any specific details. The optimization process was a very significant effort and I am not certain that I am at liberty to discuss the details under my NDA. 
Please contact me by private message if this is something you would like to pursue.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Thanks,&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;David&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 07 Mar 2008 18:00:25 GMT</pubDate>
    <dc:creator>davidtaylor</dc:creator>
    <dc:date>2008-03-07T18:00:25Z</dc:date>
    <item>
      <title>Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162203#M116095</link>
      <description>We are using Alfresco to store content that is automatically retrieved from external content providers. We are using a single default content store for storing all of our content. This is because we have to make queries from basically all of the content, which would not be possible if we would have</description>
      <pubDate>Tue, 05 Feb 2008 14:56:35 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162203#M116095</guid>
      <dc:creator>jannek</dc:creator>
      <dc:date>2008-02-05T14:56:35Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162204#M116096</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Like most hierarchical data storage systems (e.g. filesystems), Alfresco performance suffers if you attempt to store too much content in a single hierarchy node, and I wouldn't be surprised if that's a large part of the problem you're seeing.&amp;nbsp; Even if you don't require a space hierarchy, I would strongly suggest using one - perhaps a date / time based "hash bucket" structure like that used in the content store.&amp;nbsp; This will help to distribute your content across multiple spaces, resulting in better performance.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sun, 10 Feb 2008 23:49:35 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162204#M116096</guid>
      <dc:creator>pmonks</dc:creator>
      <dc:date>2008-02-10T23:49:35Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162205#M116097</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Related to this: is it possible to prevent indexing of some of the files that are attached to content? The instructions explain how to prevent indexing of certain metadata attributes, but how do you prevent certain attached files from being indexed? Indexing files that do not need to be indexed consumes a lot of time and processing power.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 11 Feb 2008 12:07:30 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162205#M116097</guid>
      <dc:creator>amh11</dc:creator>
      <dc:date>2008-02-11T12:07:30Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162206#M116098</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;To clarify: certain text file attachments only need to be stored and fetched; it is not necessary to index the contents of those text files.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 11 Feb 2008 13:46:23 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162206#M116098</guid>
      <dc:creator>amh11</dc:creator>
      <dc:date>2008-02-11T13:46:23Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162207#M116099</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;If you add your own property of type d:content it can be defined as unindexed. The default cm:content property is indexed. &lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Andy&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 19 Feb 2008 14:45:31 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162207#M116099</guid>
      <dc:creator>andy</dc:creator>
      <dc:date>2008-02-19T14:45:31Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162208#M116100</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I can appreciate your concerns since I have implemented an Alfresco-based platform which currently houses 15+ million images (and growing) and makes use of the web client.&amp;nbsp; We implemented our data loader using the JCR interface in preference to web services since testing showed the performance was much better. Currently, after much tuning, we can load 20,000 new images and apply a custom aspect in just under three minutes. Originally the same file took 3+ hours to load due to a variety of indexing and other design issues.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Here are some things we learned in the process:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;- Enabling versioning on nodes has a huge negative impact on performance. We saw a 40% improvement in throughput by not applying the versionable aspect to new nodes.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;- Distributing content across multiple folder nodes is essential for good performance. We generally try to keep the number of nodes per folder under 10k. Since our image names are serialized, we take a 4-character segment of the file name and use it as a sort of hash value to select a subdirectory. The images also have a natural distribution by client ID.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;- The order of operations when loading files greatly impacts throughput due to database contention issues. Checking for file name collisions inline with the code that performed node insertion caused a major performance hit and occasional database deadlocks (an MS SQL Server specific problem). Separating these operations into a multi-pass process greatly improved throughput.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;- We discovered and corrected some database indexing issues that severely impacted MS SQL Server. 
The addition of some indexes greatly reduced lock contention and eliminated a full table scan that was a major drag on performance.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Hope this information helps.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;David&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 20 Feb 2008 07:53:25 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162208#M116100</guid>
      <dc:creator>davidtaylor</dc:creator>
      <dc:date>2008-02-20T07:53:25Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162209#M116101</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi David,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Can you help me with this as well please?&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;We are seeing the same issues as described here.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;We are doing practically the same as you.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Can you please explain which tables you created extra indexes for in Oracle?&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;We have disabled versioning on all our files so this cannot be an issue.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;All our spaces contain at most 100 files or subspaces.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Adding a document using the API takes about 5 seconds.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;I see you are able to do 110 files per second.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;And what do you mean by separating the order of operations?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;You could help us a lot.&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 07 Mar 2008 13:49:57 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162209#M116101</guid>
      <dc:creator>netdata</dc:creator>
      <dc:date>2008-03-07T13:49:57Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162210#M116102</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Hi netdata,&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The speed at which content can be loaded into Alfresco is negatively impacted by mixing queries and node creation operations. Specifically, I found that checking for the existence of a node just prior to insertion caused heavy database contention and slowed throughput significantly. The alternative is to check for existence in one pass and then insert or update as required in a subsequent pass. The only downside is that a node could be deleted or added by an external process between passes, potentially causing a name collision. This rarely if ever happens in practice and can be dealt with through properly designed recovery logic.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;I am not sure how significant an impact the above database contention issue will be with Oracle since the project was deployed on SQL Server. I have many years of experience developing against Oracle 7.x - 10g for other applications and am inclined to believe that the performance hit would not be as bad due to Oracle's superior concurrency model. At the very least you will not see the dreaded SQL Server "deadlock" exceptions that occur when record insertions trip over read-only query operations on the same tables &lt;img id="smileyhappy" class="emoticon emoticon-smileyhappy" src="https://connect.hyland.com/i/smilies/16x16_smiley-happy.png" alt="Smiley Happy" title="Smiley Happy" /&gt;&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Regarding the indexing improvements, I will need to check with my client before getting into any specific details. The optimization process was a very significant effort and I am not certain that I am at liberty to discuss the details under my NDA. 
Please contact me by private message if this is something you would like to pursue.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Thanks,&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;David&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 07 Mar 2008 18:00:25 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162210#M116102</guid>
      <dc:creator>davidtaylor</dc:creator>
      <dc:date>2008-03-07T18:00:25Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162211#M116103</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;STRONG&gt;Hi David&lt;/STRONG&gt;&lt;SPAN&gt;, at the moment we are evaluating Alfresco for a relatively big, multiportal publishing website with lots of parallel reads (maybe 1000) and writes (up to 200). I heard from a well-experienced Alfresco consultant about performance problems caused by how Alfresco uses Lucene indexing (or something like this), so I googled to find anything new about this issue. Of course we need a highly scalable eCMS platform (without workarounds). Could you please tell me about the current state of this kind of performance problem? You don't have a public bug-tracking system for Alfresco development, do you?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Kind regards, Dirk&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 29 May 2008 08:57:12 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162211#M116103</guid>
      <dc:creator>dbachem</dc:creator>
      <dc:date>2008-05-29T08:57:12Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162212#M116104</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;I have another question about performance:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Our Alfresco instance in production is sometimes sluggish. Basically, all is OK.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;But occasionally, when I click on a folder or a link, for example, I have to wait 5 seconds or more.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Why does this happen? (Lucene background indexing? Multiple users? Lucene queries?)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Otherwise, most of the time, the application responds well &lt;img id="smileyhappy" class="emoticon emoticon-smileyhappy" src="https://connect.hyland.com/i/smilies/16x16_smiley-happy.png" alt="Smiley Happy" title="Smiley Happy" /&gt;.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Which files do we have to tune to improve performance?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Thanks&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Wed, 18 Feb 2009 15:04:24 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162212#M116104</guid>
      <dc:creator>zomurn</dc:creator>
      <dc:date>2009-02-18T15:04:24Z</dc:date>
    </item>
    <item>
      <title>Re: Lucene indexing and performance issue</title>
      <link>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162213#M116105</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;SPAN&gt;Some basic thoughts that come to mind:&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;The disk file system: did you know that ext4 performs much better than ext3? Perhaps it can help. At the least, choosing a good file system is worth looking into.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Externalize Lucene to a different server? Is this possible? Is it possible to replace it with, for example, Xapian?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Load-balanced DB? Master-slave DB (one DB to write, one to read)?&lt;/SPAN&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 28 Aug 2009 11:05:51 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-archive/lucene-indexing-and-performance-issue/m-p/162213#M116105</guid>
      <dc:creator>bwakkie</dc:creator>
      <dc:date>2009-08-28T11:05:51Z</dc:date>
    </item>
  </channel>
</rss>

