Hyland Connect

jannek · ‎02-05-2008

We are using Alfresco to store content that is automatically retrieved from external content providers. We are using a single default content store for all of our content. This is because we have to run queries across basically all of the content, which would not be possible if we had several content stores.

We currently have about 2.5 million+ content items (nodes) in our content store. Items consist of images and custom attributes that contain additional content and metadata. This leads to a total of approx. 20 million+ attributes in the node properties table. All of this content is under one folder node (space) because we don't really require several spaces and we are not using the Alfresco UI for authoring and editing content.

We are going to have several times this amount of data in the future. We need to automatically insert new content from external content providers / data sources. It is expected that we are going to have tens of millions of content items / nodes in the near future. This leads to a problem with Lucene indexing and Alfresco performance.

We are using the Web services API for automatically inserting new content. It seems that we cannot insert items while the Lucene index is being updated. Even now it takes about one minute to insert a single "product" item, which consists of about 25 nodes and 200 properties, before the Web service gives a response. This is way too slow for our purposes, since we need to update and insert tens of thousands of "products" (content nodes) per day. It also seems that Alfresco is under very heavy load in such circumstances, since it takes about 70 percent of CPU time while the automated insert process is running.

Does anyone have any ideas how we could improve the performance and optimize this?

pmonks · ‎02-10-2008

Like most hierarchical data storage systems (e.g. filesystems), Alfresco performance suffers if you attempt to store too much content in a single hierarchy node, and I wouldn't be surprised if that's a large part of the problem you're seeing. Even if you don't require a space hierarchy, I would strongly suggest using one - perhaps a date / time based "hash bucket" structure like that used in the content store. This will help to distribute your content across multiple spaces, resulting in better performance.
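As a rough illustration of the date / time based bucket idea, the sketch below derives a folder path from an item's creation time. The function name `date_bucket_path`, the `/content` root and the year/month/day/hour depth are assumptions made for this example, not anything prescribed by Alfresco; pick a depth that keeps each space comfortably small for your insert rate.

```python
from datetime import datetime


def date_bucket_path(created: datetime, root: str = "/content") -> str:
    """Return a year/month/day/hour folder path for a new item.

    Hypothetical helper: the root and the bucket depth are assumptions.
    Items created in the same hour land in the same space.
    """
    return f"{root}/{created.strftime('%Y/%m/%d/%H')}"


if __name__ == "__main__":
    print(date_bucket_path(datetime(2008, 2, 10, 14, 30)))
    # prints /content/2008/02/10/14
```

The loader would create any missing spaces along that path once and then insert the node into the leaf space.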

amh11 · ‎02-11-2008

Related to this: is it possible to prevent indexing of some of the files that are attached to content? The instructions explain how to prevent certain metadata attributes from being indexed, but how do you prevent certain attached files from being indexed? Indexing files that do not need to be indexed consumes a lot of time and processing power.

amh11 · ‎02-11-2008

To clarify: for certain text file attachments, we only need to store them and be able to fetch them; it is not necessary to index the contents of the text files.

andy · ‎02-19-2008

Hi

If you add your own property of type d:content it can be defined as unindexed. The default cm:content property is indexed.

Andy

davidtaylor · ‎02-20-2008

Hi,

I can appreciate your concerns since I have implemented an Alfresco-based platform which currently houses 15+ million images (and growing) and makes use of the web client. We implemented our data loader using the JCR interface in preference to web services since testing showed the performance was much better. Currently, after much tuning, we can load 20,000 new images and apply a custom aspect in just under three minutes. Originally the same file took 3+ hours to load due to a variety of indexing and other design issues.

Here are some things we learned in the process:

- Enabling versioning on nodes has a huge negative impact on performance. We saw a 40% improvement in throughput by not applying the versionable aspect to new nodes.

- Distributing content across multiple folder nodes is essential for good performance. We generally try to keep the number of nodes per folder under 10k. Since our image names are serialized, we take a 4 character segment of the file name and use it as a sort of hash value to select a subdirectory. The images also have a natural distribution by client ID.

- The order of operations when loading files greatly impacts throughput due to database contention issues. Checking for file name collisions inline with the code that performed node insertion caused a major performance hit and occasional database deadlocks (a MS SQL Server specific problem). Separating these operations into a multi-pass process greatly improved throughput.

- We discovered and corrected some database indexing issues that severely impacted MS SQL Server. The addition of some indexes greatly reduced lock contention and eliminated a full table scan that was a major drag on performance.
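The file-name bucketing described in the list above could look roughly like the sketch below. The post does not say which four characters are used or what the names look like, so this assumes 8-digit serialized names such as `00123456.jpg` and takes the leading four characters; with that assumption each bucket holds at most 10,000 names. `bucket_for` and `folder_path` are hypothetical helper names.

```python
def bucket_for(file_name: str, start: int = 0, length: int = 4) -> str:
    """Pick a subdirectory name from a fixed segment of the file name.

    Assumption: which segment to use is not stated in the thread; the
    default takes the first four characters of the name without extension.
    Short names are padded with "_" so every bucket name has equal length.
    """
    stem = file_name.rsplit(".", 1)[0]
    return stem[start:start + length].ljust(length, "_")


def folder_path(client_id: str, file_name: str) -> str:
    """Combine the client ID with the file-name bucket."""
    return f"/{client_id}/{bucket_for(file_name)}"


if __name__ == "__main__":
    print(folder_path("client42", "00123456.jpg"))
    # prints /client42/0012
```

Which segment works best depends on how the serial numbers are assigned; the point is only that the chosen segment should spread names evenly and cap the number of nodes per folder.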

Hope this information helps.

David

netdata · ‎03-07-2008

Hi David,

Can you help me with this as well please?
We are seeing the same issues as described here.

We are practically doing the same as you do.
Can you please explain to me which tables you created extra indexes for? We are on Oracle.
We have disabled versioning on all our files so this cannot be an issue.
All our spaces contain at the most 100 files or subspaces.

Adding a document using the API takes about 5 seconds.
I see you are able to do 110 files per second.
And what do you mean by separating the operations into a multi-pass process?

You could help us a lot.

davidtaylor · ‎03-07-2008

Hi netdata,

The speed at which content can be loaded into Alfresco is negatively impacted by mixing queries and node creation operations. Specifically, I found that checking for the existence of a node just prior to insertion caused heavy database contention and slowed throughput significantly. The alternative is to check for existence in one pass and then insert or update as required in a subsequent pass. The only downside is that a node could be deleted or added by an external process between passes, potentially causing a name collision. This rarely if ever happens in practice and can be dealt with through properly designed recovery logic.
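A minimal sketch of that two-pass idea, using an in-memory stand-in for the repository. `FakeRepo`, `DuplicateName` and `two_pass_load` are hypothetical names invented for this illustration and are not part of any Alfresco API; a real loader would replace them with JCR or Web service calls.

```python
class DuplicateName(Exception):
    """Raised by the stand-in repository on a name collision."""


class FakeRepo:
    """In-memory stand-in for one folder (assumption, not an Alfresco API)."""

    def __init__(self, nodes=None):
        self.nodes = dict(nodes or {})

    def list_names(self):
        return set(self.nodes)

    def create(self, name, data):
        if name in self.nodes:
            raise DuplicateName(name)
        self.nodes[name] = data

    def update(self, name, data):
        if name not in self.nodes:
            raise KeyError(name)
        self.nodes[name] = data


def two_pass_load(repo, incoming):
    """Load `incoming` (name -> data) and return names needing recovery."""
    # Pass 1: read only. Decide insert vs update from a single query.
    existing = repo.list_names()
    inserts = {n: d for n, d in incoming.items() if n not in existing}
    updates = {n: d for n, d in incoming.items() if n in existing}

    # Pass 2: write only. No existence queries are mixed in; a node added
    # or deleted by another process between passes is caught and deferred.
    failed = []
    for name, data in inserts.items():
        try:
            repo.create(name, data)
        except DuplicateName:
            failed.append(name)
    for name, data in updates.items():
        try:
            repo.update(name, data)
        except KeyError:
            failed.append(name)
    return failed
```

The returned list is where the recovery logic would hook in, for example by re-running the two passes for just those names.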

I am not sure how significant an impact the above database contention issue will have with Oracle, since the project was deployed on SQL Server. I have many years of experience developing against Oracle 7.x - 10g for other applications and am inclined to believe that the performance hit would not be as bad, due to Oracle's superior concurrency model. At the very least you will not see the dreaded SQL Server "deadlock" exceptions that occur when record insertions trip over read-only query operations on the same tables.

Regarding the indexing improvements, I will need to check with my client before getting into any specific details. The optimization process was a very significant effort and I am not certain that I am at liberty to discuss the details under my NDA. Please contact me by private message if this is something you would like to pursue.

Thanks,
David

dbachem · ‎05-29-2008

Hi David, at the moment we are evaluating Alfresco for a relatively big, multi-portal publishing website with lots of parallel reads (maybe 1000) and writes (up to 200). I heard from a very experienced Alfresco consultant about performance problems caused by how Alfresco uses Lucene indexing (or something like this), so I googled to find anything new about this issue. Of course we need a highly scalable eCMS platform (without workarounds). Could you please tell me the current state of these kinds of performance problems? You don't have a public bug tracking system for Alfresco development, do you?

Kind regards, Dirk

zomurn · ‎02-18-2009

I have another question about performance:

Our Alfresco in production is sometimes sluggish. Most of the time everything is OK.
But occasionally, when I click on a folder or a link for example, I can wait 5 seconds or more.
Why does this happen? (Lucene background indexing? Multiple users? Lucene queries?)
Otherwise, most of the time, the application responds well.

Which files do we have to tune to improve performance?

Thanks

Lucene indexing and performance issue