Alfresco folder structure problem with millions of documents

spilby
Confirmed Champ
Good evening!

Currently we have, with Alfresco 4.1.6, nearly two million documents distributed in folders that are structured as follows:

Repository: /User Homes/APP/*

Where * is a 10-digit numeric code, and there are approximately 3000 such folders. (Each code represents a user.)

Each user builds the folder tree however they want, but each folder can hold thousands and thousands of documents, and some trees end up with more than 200 directories in one of their branches.

The problem is that access keeps getting slower, to the point that when I browse with Alfresco Share and it loads the contents of /User Homes/APP, after several minutes it just gives an error and doesn't show the elements.

But not only that: our Java application, which navigates and creates folders using the Java API, gets slower month by month, and some searches now time out. We invoke Java methods such as finding children or Lucene queries, and they take too long to respond.

If this is a folder structure problem, I have a question. Is it right to hang folders from User Homes? Would it be more efficient, for example, to create a different Site for each of the 3000 users? Or would it be the same, with searching or displaying the sites still hanging? Any ideas for a different main structure?

Thank you!



8 REPLIES

afaust
Legendary Innovator
Hello,

"is it right to hang folders from User Homes?" - Well, there is nothing particularly wrong with it. I have customers with up to 28 million documents and they often pretty much structure documents in a way where they have a couple of thousand folders on the root level (not User Homes, but somewhere else) and then two levels down they place the actual documents.

There are several potential issues that may need to be investigated:

- The Share document library navigation tree by default evaluates not only the next level of folders to be loaded but also checks whether each of those folders contains other elements. In a structure where you load 100 folders and each of those holds up to 100 folders/documents, you effectively load/process 10000 elements, which can put a serious strain on the system. There is an "evaluate-child-folders" option that you can set for the "tree" in a "DocumentLibrary" config section in share-config-custom.xml to disable this feature.

- Are you using SOLR? If you are performing queries and still using embedded Lucene, the permission checking of results can be quite expensive, especially when paging through large folder contents.

- Also, depending on the query, the performance of the query itself may be suboptimal and degrade the more content you have in the system (regardless of whether it is in the same folder tree or not). I have one customer still using Lucene, and it's the size of the index, slow disk access and RAM / heap constraints that cause some of the suboptimal queries to time out or even bog down the entire system. E.g. PATH queries, greater-than / less-than comparisons on textual ID-like properties or too loose / too "wildcarded" queries can be problematic.

- Have you adapted the Alfresco internal caches to allow more nodes to be kept in memory? This is only possible if you have enough RAM / heap available, but can dramatically improve performance. The internal caches are limited in size and if you have structures where folders can contain large numbers of folders / documents, you'll easily end up in a scenario where a few users navigating through the structure can saturate the caches and cause partial clearing / removals to make room. Every time you hit a structure level where contents / folders can't be found in the caches, Alfresco has to do a retrieval from the DB. Depending on DB performance, this can be several orders of magnitude slower / more expensive.
This is the issue where storing fewer contents / folders per folder would have a dramatic impact since you reduce the amount of elements that potentially need to be loaded from caches / DB, avoid cache over-saturation and even if caches are too small, the DB has to do less work to load the elements for the next level of the navigation.



Before you go and change anything in your structure, you need to determine precisely where your performance issues lie. Look at Java process monitors to check memory and CPU usage, and do a couple of profiler runs to identify hotspot operations. Based on that, you (or others) can determine the best course of action.

Also, since you are using Alfresco 4.1.6 (an Enterprise release), I'd advise you to contact support and discuss with them options / techniques to determine performance bottlenecks (if you haven't done already).

Regards
Axel

spilby
Confirmed Champ
Hi Axel. First of all, many thanks for your recommendations.

The nodes in our Alfresco structure have over 30 custom properties each, in addition to the common ones (name, title, dates, etc.). Our custom model defines them. We can change the Share config parameters, thanks. I'm sure this will give us better performance in Share.

But if the check of other elements that you mentioned affects performance with a lot of folders, then when we use our Java application (a portlet to create, move, and navigate inside the folder tree, assigning these custom properties) and obtain the nodes and properties with the Alfresco API, I suppose these 30 properties affect performance too. Maybe 30 properties are too many for a node? But we need all of them.

Yes, we are using SOLR 1.4 with our Alfresco 4.1.6. We use the Alfresco FTS language for the queries, and we run them through the query method of the SearchService API.

Two examples of queries that we use:

String query_by_expedient_number = "TYPE:\"expedient\" AND =@exp\\:expedientNumber:\"4362413\"";
String query_by_title_or_text = "TYPE:\"expedient\" AND (cm\\:title:\"test\" OR TEXT:\"abc\")";


We also used PATH in the queries before, in order to limit the folders searched, but we read that combining PATH and metadata in the same query hurts performance, and that for a quick response it is more efficient to leave PATH out of the query and use metadata only.

Do you think these queries are optimal?

On the other hand, I think another poorly performing operation is simply this:


getNodeService().getChildAssocs(nodeRef);


to obtain the children of a node. For a folder with a lot of children, this operation takes too long. And there are no queries involved, only a simple call to an API method. I thought it was the optimal way to obtain the children. Isn't it?

Thanks again.

Oops, one more thing... If we had more than one workspace (imagine 50 workspaces to distribute the folders that hang from User Homes), might that improve performance? Depending on the user, we would connect to one workspace or another. I don't think it's a horrible idea, but I'd like to know about other possibilities.




afaust
Legendary Innovator
Hello,

doing an unfiltered getChildAssocs(NodeRef) in a structure like yours is performance-suicide. You may get better results with paginated / filtered access, i.e. FileFolderService.list. You could also use FTS queries with the PARENT keyword to select only those children that you actually need (and not load all of them, which is what getChildAssocs does). In Alfresco 4.2, this kind of FTS query could also be executed in a transactional manner on the DB side without using the index, so you would benefit more after an upgrade.
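
A minimal sketch of the paged-access idea, written against a hypothetical `PageSource` interface instead of the real Alfresco classes so that it stands alone. In a real repository the `fetch` call would be backed by a paged API such as FileFolderService.list; the names `PageSource`, `forEachChild` and `inMemory` are illustrative assumptions, not Alfresco API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class PagedChildrenSketch {

    /** Hypothetical stand-in for a paged repository call (skip count + max items). */
    interface PageSource<T> {
        List<T> fetch(int skipCount, int maxItems);
    }

    /**
     * Walks all children one page at a time instead of loading them in a single call.
     * Returns the number of non-empty pages that were fetched.
     */
    static <T> int forEachChild(PageSource<T> source, int pageSize, Consumer<T> action) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int skip = 0;
        int pages = 0;
        while (true) {
            List<T> page = source.fetch(skip, pageSize);
            if (page.isEmpty()) {
                break; // nothing left
            }
            pages++;
            page.forEach(action);
            if (page.size() < pageSize) {
                break; // a short page means this was the last one
            }
            skip += page.size();
        }
        return pages;
    }

    /** In-memory page source, used here only to make the sketch runnable. */
    static PageSource<String> inMemory(List<String> all) {
        return (skip, max) -> all.subList(
                Math.min(skip, all.size()),
                Math.min(skip + max, all.size()));
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4", "doc-5");
        int pages = forEachChild(inMemory(children), 2, System.out::println);
        System.out.println("pages fetched: " + pages);
    }
}
```

The point of the loop is that only one page of children is held at a time, so a folder with many thousands of entries never has to be materialised in one call.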

Your queries look fine and simple enough for the most part. Only the TEXT query fragment could be quite expensive, as this searches across all d:text properties. It could be sensible to replace it with a query template and define that query template to only search on the d:text properties of your expedient type.

The number of properties should not be that significant a factor. We consult on / support installations where some types of nodes can have up to 80 or more properties. The number of properties only limits how far you can scale the Alfresco caches: since each node requires more memory to be cached, you can cache fewer nodes with the same JVM heap.

Of course anything that distributes your content structure into a less wide base under user homes should help improve the perceived performance. For this it wouldn't be necessary to set up a different workspace. You could sub-divide your user folders based on a subset of the 10-digit ID, e.g. all with prefixes 00 to 10 in one folder, and so on and so forth. For this Alfresco already provides a regex based home folder provider that we have used in one installation with 70,000 users synchronized from LDAP-AD: user folders were subdivided by initial letter, then the first 3 digits of the personnel ID and then finally the actual user name.
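
The subdivision idea can be sketched as a pure path computation. This is not the regex based home folder provider itself; the class name, the two-digit bucket widths and the resulting layout are assumptions chosen only for illustration.

```java
class HomeFolderBucketSketch {

    /**
     * Maps a 10-digit user code to a relative folder path with two bucket levels,
     * e.g. 0123456789 -> 01/23/0123456789.
     * The bucket widths (2 digits + 2 digits) are an assumption for this sketch.
     */
    static String bucketedPath(String userCode) {
        if (userCode == null || !userCode.matches("\\d{10}")) {
            throw new IllegalArgumentException("expected a 10-digit numeric code: " + userCode);
        }
        String first = userCode.substring(0, 2);   // at most 100 folders on the top level
        String second = userCode.substring(2, 4);  // at most 100 folders per top-level bucket
        return first + "/" + second + "/" + userCode;
    }

    public static void main(String[] args) {
        System.out.println(bucketedPath("0123456789"));
    }
}
```

With two levels of at most 100 buckets each, no folder on the way down to a user folder holds more than 100 sub-folders. How evenly the 3000 users spread depends on how the codes are assigned; if the leading digits are skewed, bucketing on trailing digits may distribute better.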

Regards
Axel

spilby
Confirmed Champ
Ok Axel, thanks a lot! I'll try to replace all getChildAssocs calls with PARENT queries (or a FileFolderService method).

Just one more question... I tried this query:


TYPE:"{customModel}exp" AND PARENT:"workspace://SpacesStore/30da316f-9d2a-4e37-a28b-89d86bff6582" AND =@exp\:num_exp:"TE 432"


using PARENT instead of PATH. But it returns zero results. The parent node is a directory and the node I'm searching for is inside it, but more than one directory level down.

Does PARENT only search the first level of children? Is there a syntax to search recursively through several levels of children?

Thanks again!



afaust
Legendary Innovator
Hello,

there is also ANCESTOR, which basically selects anything below a certain node. But there is a major difference between PARENT and ANCESTOR: PARENT can be executed transactionally against the database, while ANCESTOR always has to use the index. If you replace PATH with ANCESTOR, that effectively makes no difference, but if possible (e.g. you know you only have one level) use PARENT for the transactionality.
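
As a rough illustration of that choice, the sketch below only assembles the FTS query string, using PARENT when the match is known to be a direct child and ANCESTOR otherwise. The class and method names are hypothetical, and the backslash escaping of quoted values is an assumption; the resulting string would still have to be passed to the search API.

```java
class FtsScopeSketch {

    /** Escapes the prefix separator of a prefixed property name: exp:num_exp -> exp\:num_exp */
    static String escapeProperty(String prefixedName) {
        return prefixedName.replace(":", "\\:");
    }

    /** Wraps a value in double quotes, escaping backslashes and quotes inside it. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** PARENT for direct children only, ANCESTOR for anything below the node. */
    static String scope(String nodeRef, boolean directChildrenOnly) {
        return (directChildrenOnly ? "PARENT:" : "ANCESTOR:") + quote(nodeRef);
    }

    /** Combines a type restriction, a scope clause and an exact property match. */
    static String exactMatch(String type, String scopeClause, String property, String value) {
        return "TYPE:" + quote(type)
                + " AND " + scopeClause
                + " AND =@" + escapeProperty(property) + ":" + quote(value);
    }

    public static void main(String[] args) {
        String folder = "workspace://SpacesStore/30da316f-9d2a-4e37-a28b-89d86bff6582";
        // One level below the folder: PARENT
        System.out.println(exactMatch("{customModel}exp", scope(folder, true), "exp:num_exp", "TE 432"));
        // Any depth below the folder: ANCESTOR
        System.out.println(exactMatch("{customModel}exp", scope(folder, false), "exp:num_exp", "TE 432"));
    }
}
```

Keeping the scope clause as a separate step makes it easy to default to PARENT and fall back to ANCESTOR only for the searches that really need to descend several levels.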

Regards
Axel

spilby
Confirmed Champ
Ok! Thanks. In that case, I will use PARENT only when I search a single level of children.

So between PATH and ANCESTOR there is no difference in response speed on Alfresco 4.1.6. Is that right? PATH isn't recommended in a query combined with other metadata. I understand that ANCESTOR isn't recommended either. Or is ANCESTOR a little quicker than PATH?

afaust
Legendary Innovator
ANCESTOR should generally be more efficient, but this will only be noticeable when your comparison PATH contains some wildcards, e.g. /app:company_home/st:sites/*//* (a PATH that selects everything in any site apart from the site itself). It also depends on the variation of paths in the system, e.g. when a lot of content shares large parts of the path, you will notice the cost of PATH less than in a naturally grown, free-form structure.

Keep in mind that PATH performance differs between search systems and Alfresco versions. E.g. there have been a lot of improvements in Alfresco 5 / SOLR 4 that should not be discounted.

ANCESTOR may be listed in the wiki as not recommended, but that entry is quite old and should not be considered official. But yes, ANCESTOR is not listed in the <a href="http://docs.alfresco.com/community/concepts/rm-searchsyntax-fields.html">official documentation</a>; whether this is by choice or accident I can't say.

spilby
Confirmed Champ
Okay, perfect! I understand. Thanks again for your reply! We will upgrade to SOLR 4 and Alfresco 5 soon, and then I can apply all of your advice. Thanks!