Additional DB table for search/retrieval of hundred million documents

tgrozdek
Champ in-the-making
Hi,

I wonder how to implement a solution for fast search/retrieval of documents when there may be hundreds of millions of documents in Alfresco 5.0 (One or Community version).

Alfresco is used primarily as a rudimentary document management system: import a document, then search for and retrieve it, nothing else.
Documents are mostly standard office documents with an average size of 100 kB. They arrive in the tens of thousands every month (a few thousand per day).

The idea is that when a document arrives in Alfresco, its attributes and docID are copied to a DB table (either online, directly on arrival, or offline by a job). Searching would be done with fast queries on the DB table, and retrieval would be done directly on Alfresco.
A solution for importing documents into Alfresco and fetching them already exists (web services) and could be changed.
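
To make the idea concrete, here is a minimal sketch of such a side table. It is only an illustration under assumptions: SQLite stands in for whatever database would really be used, and the table name `doc_index`, its columns, and the function names are hypothetical. The query returns only document IDs; fetching the content would still go through Alfresco.

```python
import sqlite3


def create_index_table(conn):
    # Hypothetical side table: one row per document, keyed by the Alfresco docID.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS doc_index (
            node_id  TEXT PRIMARY KEY,
            doc_type TEXT NOT NULL,
            customer TEXT NOT NULL,
            created  TEXT NOT NULL  -- ISO date, so string order equals date order
        )
        """
    )
    # Composite index matching the search below (equality first, then range).
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_doc_customer_created "
        "ON doc_index (customer, created)"
    )


def register_document(conn, node_id, doc_type, customer, created):
    # Upsert, so re-registering the same document does not create duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO doc_index (node_id, doc_type, customer, created) "
        "VALUES (?, ?, ?, ?)",
        (node_id, doc_type, customer, created),
    )


def find_node_ids(conn, customer, since):
    # Return only the IDs; the documents themselves stay in Alfresco.
    rows = conn.execute(
        "SELECT node_id FROM doc_index "
        "WHERE customer = ? AND created >= ? ORDER BY created",
        (customer, since),
    )
    return [row[0] for row in rows]
```

The index columns should follow the attributes that are actually searched on; the ones above are placeholders.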

I'm interested in whether this is the right way to go, and how to implement it with the best performance for search/retrieval (import will work with these document numbers). Our general concern is how Alfresco will cope with this number of documents and what the possible bottlenecks in Alfresco are.
What would you use to copy document attributes to the table (events, Java), would you do it online or offline, and how exactly would you retrieve the documents from Alfresco?
Currently we have two virtual servers, one for the whole of Alfresco and the other as the index server.
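
For the offline variant of the copy step, a rough sketch of a periodic job could look like the following. Everything here is an assumption for illustration: `fetch_created_since` is a hypothetical callable that would wrap the existing web services, the `doc_index` and `sync_state` tables are made up, and SQLite again stands in for the real database. The job remembers a watermark so that each run only copies documents newer than the previous run.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# (node_id, doc_type, customer, created as ISO timestamp)
Doc = Tuple[str, str, str, str]


def sync_batch(conn, fetch_created_since: Callable[[str], Iterable[Doc]]) -> int:
    """Copy documents created after the stored watermark; return how many were copied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_index ("
        "node_id TEXT PRIMARY KEY, doc_type TEXT, customer TEXT, created TEXT)"
    )
    # Single-row table holding the newest 'created' value seen so far.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sync_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), watermark TEXT)"
    )
    row = conn.execute("SELECT watermark FROM sync_state WHERE id = 1").fetchone()
    watermark = row[0] if row else ""  # empty string sorts before any timestamp

    copied = 0
    newest = watermark
    for node_id, doc_type, customer, created in fetch_created_since(watermark):
        conn.execute(
            "INSERT OR REPLACE INTO doc_index VALUES (?, ?, ?, ?)",
            (node_id, doc_type, customer, created),
        )
        newest = max(newest, created)
        copied += 1

    conn.execute(
        "INSERT OR REPLACE INTO sync_state (id, watermark) VALUES (1, ?)", (newest,)
    )
    conn.commit()
    return copied
```

An online variant would call the same insert from whatever hook fires on document arrival; the batch form is shown only because it is easier to test in isolation. Note that a strict "newer than watermark" filter can miss documents that share the exact same timestamp as the watermark, so a real job would need a tie-breaker.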

Thanks in advance.
Tom

PS.
I know that it is possible to scale the server architecture horizontally and vertically with a number of servers, but I'm interested in this kind of solution with an external table.
6 REPLIES

mrogers
Star Contributor
There shouldn't be any need for that. What is odd about searching that drives you in that direction?

If you do want to continue then I suggest you implement a new search subsystem to work on your own table.

tgrozdek
Champ in-the-making
Hi,

I have not seen any really big Alfresco implementation. For instance, all Alfresco implementations together hold about 7 billion documents. Quote from the Alfresco site: "Alfresco manages over seven billion documents for more than 1,800 customers in 195 countries, supporting 11 million users in their daily work."
Other examples on the official Alfresco site mention tens of millions of documents in some companies, but all of these numbers are pretty low for our possible needs, where the numbers may easily reach hundreds of millions of documents, or even a billion if we broaden our scope.

Also, I heard on an Alfresco course that Alfresco has or may have performance issues when the number of archived documents reaches a million.

My biggest concern about Alfresco is performance, and we don't want to end up scaling out to dozens of servers.

So we tried to find some other solution for supporting large numbers of documents, and one of them is using a DB table for import/search/retrieval.

Thanks for Your answer.
Tom.

mrogers
Star Contributor
I'm not aware of any archiving problem around a million. Is there a JIRA reference for that?

Also, I suspect those numbers are a) conservative and b) out of date. There have been some huge implementations in the last couple of years.

I suggest you try out Alfresco 5.1 before re-inventing the wheel.

tgrozdek
Champ in-the-making
I don't know if there is a JIRA issue for that; it is just what I heard.

We already have Alfresco 5.1.
So, if we put one hundred million documents on two average Alfresco servers, there would be no major performance issues?

Thanks in advance. Tom.

mrogers
Star Contributor
I suggest you try out your use case before trying to design a solution to a problem that may not exist. At 100M docs SOLR may need to be sharded, but if you were prepared to do your queries in the database then perhaps you don't need it at all.
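
As a first sanity check before any load test, the volumes quoted earlier in the thread can be turned into rough numbers. This is back-of-the-envelope arithmetic only: the 100 kB average and the 100M target come from the thread, while the figure of 3,000 documents per day is an assumed reading of "a few thousand per day", and any existing backlog is unknown and defaults to zero.

```python
import math

AVG_DOC_BYTES = 100 * 1000    # "average 100 kB" from the thread (decimal kB)
TARGET_DOCS = 100_000_000     # the 100M figure discussed in the thread
DOCS_PER_DAY = 3_000          # ASSUMPTION: one reading of "a few thousand per day"


def raw_content_terabytes(docs, avg_bytes=AVG_DOC_BYTES):
    # Raw content only; ignores indexes, renditions, versions and DB overhead.
    return docs * avg_bytes / 1e12


def days_to_reach(target_docs, docs_per_day, existing_docs=0):
    # Whole days of steady intake needed to reach the target from the backlog.
    remaining = max(target_docs - existing_docs, 0)
    return math.ceil(remaining / docs_per_day)


if __name__ == "__main__":
    print(raw_content_terabytes(TARGET_DOCS), "TB of raw content")
    print(days_to_reach(TARGET_DOCS, DOCS_PER_DAY), "days at the assumed intake rate")
```

With these inputs the raw content alone comes to 10 TB; plugging in the real intake rate and backlog shows how soon the 100M mark would actually be reached, which helps decide how much test data a realistic trial needs.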

tgrozdek
Champ in-the-making
Yes, we should try it first, though it's not an easy thing to do: the test environment is weaker than the production one, and the production environment can't be messed around with because of other production solutions in use …

Anyway, big thanks for your time; I think we can go in the suggested direction.
Tom.