We use Alfresco to store content that is retrieved automatically from external content providers. All of it goes into a single default content store, because we need to run queries across essentially all of the content, which would not be possible if we had several content stores.
We currently have more than 2.5 million content items (nodes) in our content store. The items consist of images and custom attributes that hold additional content and metadata, which adds up to more than 20 million attributes in the node properties table. All of this content sits under one folder node (space), because we don't really need several spaces and we don't use the Alfresco UI for authoring or editing content.
We will have several times this amount of data in the future. New content has to be inserted automatically from external content providers / data sources, and we expect to reach tens of millions of content items / nodes in the near future. This leads to a problem with Lucene indexing and Alfresco performance.
We use the Web services API to insert new content automatically. It seems that we cannot insert items while the Lucene index is being updated. Even now it takes about one minute before the Web service returns a response when we insert a single "product" item, which consists of about 25 nodes and 200 properties. This is far too slow for our purposes, since we need to update and insert tens of thousands of "products" per day. Alfresco also appears to be under very heavy load in this situation: it uses about 70 percent of CPU time while the automated insert process is running.
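To put rough numbers on the gap, here is a back-of-the-envelope sketch in Python. It assumes inserts run strictly one after another at the observed rate of about one minute per "product"; the daily target of 20,000 is only an illustrative stand-in for "tens of thousands", and the function names are ours, not part of any Alfresco API.

```python
# Rough throughput estimate from the figures in this post.
# Assumption: inserts are sequential and each takes about the same time.

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_PRODUCT = 60   # observed: about one minute per "product"
NODES_PER_PRODUCT = 25     # observed: about 25 nodes per "product"


def max_products_per_day(seconds_per_product):
    """Upper bound on sequential inserts that fit into one day."""
    return SECONDS_PER_DAY // seconds_per_product


def required_seconds_per_product(target_per_day):
    """Time budget per insert needed to reach a daily target."""
    return SECONDS_PER_DAY / target_per_day


ceiling = max_products_per_day(SECONDS_PER_PRODUCT)
nodes_per_second = NODES_PER_PRODUCT / SECONDS_PER_PRODUCT

# Hypothetical target, standing in for "tens of thousands" per day.
TARGET_PER_DAY = 20_000
budget = required_seconds_per_product(TARGET_PER_DAY)

print(f"Ceiling at current speed: {ceiling} products/day")
print(f"Current node rate: {nodes_per_second:.2f} nodes/s")
print(f"Budget for {TARGET_PER_DAY} products/day: {budget:.2f} s per product")
```

Under these assumptions the current speed caps us at 1,440 products per day, so each insert would have to become roughly an order of magnitude faster (a few seconds per product) to reach the volume we need.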
Does anyone have any ideas how we could improve the performance and optimize this?