Is there still a limit on the amount of files in a folder?

miroslav
Star Contributor

Hello,

Is there still a performance limit on the number of files in a folder? Is there any other limit on the number of files/folders, or on how deeply they can be nested? What are the ways to work around this limit? I would like to use the Share and content-app UIs.

Thank you!

1 ACCEPTED ANSWER

abhinavmishra14
World-Class Innovator

There is no hard limit; however, you may start seeing performance issues in the Share UI once there are more than 1000 files/folders in the same folder. On the repo side, the recommendation is to keep no more than 3000 files/folders in the same folder.

The recommended way to store files/folders is in a time-stamped structure.

So instead of this:

Folder 1
  File 1.1
  File 1.2
  ...
  File 1.1000
  ...
  File 1.2000


Store the files/folders like this (timestamped structure): 

Folder 1
   2022  <-- Year
     08    <-- Month
       17  <-- Day
         03 <-- Hour
           24 <-- Minute
             File 1.1
    ............
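
As a rough illustration of the layout above, the target folder for a new file can be derived from its creation time. This is only a sketch: the function name `timestamped_folder_path` and its `depth` parameter are made up for this example and are not part of any Alfresco API. Creating the folders in the repository would still have to be done by a rule, behaviour, or script.

```python
from datetime import datetime


def timestamped_folder_path(root: str, when: datetime, depth: int = 5) -> str:
    """Build root/YYYY/MM/DD/HH/mm, truncated to `depth` levels (hypothetical helper)."""
    if not 1 <= depth <= 5:
        raise ValueError("depth must be between 1 and 5")
    parts = [
        f"{when.year:04d}",    # Year
        f"{when.month:02d}",   # Month
        f"{when.day:02d}",     # Day
        f"{when.hour:02d}",    # Hour
        f"{when.minute:02d}",  # Minute
    ]
    return "/".join([root.rstrip("/")] + parts[:depth])


print(timestamped_folder_path("Folder 1", datetime(2022, 8, 17, 3, 24)))
# prints: Folder 1/2022/08/17/03/24
```

A lower `depth` (for example 3, stopping at the day) may be enough when the ingestion rate is low; the right depth depends on how many files arrive per minute, hour, or day.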

~Abhinav
(ACSCE, AWS SAA, Azure Admin)


4 REPLIES

Hi, sorry, is this an official statement from Alfresco? If not, is there any official statement from Alfresco?
I need to justify this, because posts in the forum variously say 1000, 2000, 5000, or 10,000 to 20,000.

Hi, there's no official statement. This is because it depends on your use case and your hardware/resources.

The number of direct children affects performance every time you need to access that folder, especially as a non-admin user (since the repo needs to check the permissions on every node). If performance degrades too much, you will start hitting timeouts and errors.


Leo Mattioli - Technical Account Manager @Hyland.

An Alfresco folder (or container more generally) can contain a large number of items that are typically files and/or sub-folders. A user-visible folder containing thousands of items can be accessed via any of the official Alfresco user interfaces, other applications, APIs, or protocols. The repository does not place any limits on how many items are able to be stored in a single container.

As a rule of thumb, a folder that contains more than 5,000 (if you use Share) to 10,000 (if you only use the API) items should be considered for restructuring, possibly by splitting it into a set of sub-folders. What makes a folder "too large" depends on how those items are accessed, especially when trying to browse (i.e. "list") the items within the folder. Operations that list the contents of the container, such as "getChildren" on a folder, can really hurt the system. The size of the child items does not matter, only their number.

Listing large folders is inherently a resource intensive process. Over time, we have improved the system's behavior with large folders and we continue to look for more improvements. However, we do not expect major changes to this aspect of the system's behavior in the near future.

If you don't respect that, at some point you'll have bad performance, OOM errors, and an unstable solution. Granted, it is much better now with Elasticsearch and newer, more powerful hardware.


Big Folder issue explanation
----------------------------

Folders were originally designed with the expectation that a user would most likely be browsing the contents of the folder. With that in mind, practical limits such as 5000 children made perfect sense. Today the challenges around large folders exist on multiple levels: hardware, resources, interfaces, etc.

UI Performance
--------------
Rendering large folders becomes a challenge for User Interfaces.

- Whatever is being rendered is generally stored in memory on the client machine (e.g. the browser).

- Large sets of data being returned to the UI can also impact user experience if the browser response time is perceived as too lengthy.

- While Share has paging for the main display, the sidebar display is not paged and represents an area of risk.

- Share will truncate document lists to 1000. The biggest impact of this is the inability to choose a folder in a "copy to" or "move to" dialog due to the truncated list.

- Browsing is also problematic. There is essentially no way to browse to a folder that is significantly "deep" in the list. 

- Search is a decent workaround, for example, if you need to go to the 9999th folder you can search for "test folder 9999" and it'll show up in the search results.

ACL Evaluation
--------------
When a call to getChildren is made, each of the nodes returned must be evaluated to see whether the requesting user can see it. This permission check removes any nodes for which the user does not have at least ReadProperties permission. This two-step process makes paging difficult to implement. It also strains the server's memory, because it has to build a matrix of rights for every node in that big fat folder.
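
To make the two-step problem concrete, here is a toy model in Python. It is not Alfresco's actual implementation; `list_visible_children` and `can_read` are invented names, and the only point is that a page cannot be cut until every child has been permission-checked.

```python
from typing import Callable, List


def list_visible_children(
    children: List[str],
    can_read: Callable[[str], bool],
    skip_count: int,
    max_items: int,
) -> List[str]:
    """Toy model: filter by permission first, then slice out one page."""
    # Step 1: the full child list has already been loaded (cost grows with folder size).
    # Step 2: every child is checked against the caller's permissions.
    visible = [child for child in children if can_read(child)]
    # Only now is it known which items belong on the requested page.
    return visible[skip_count : skip_count + max_items]


calls = {"n": 0}


def can_read(name: str) -> bool:
    """Pretend ACL check that counts how often it runs."""
    calls["n"] += 1
    return not name.endswith("0")


page = list_visible_children([f"doc-{i}" for i in range(10_000)], can_read, 0, 25)
print(len(page), calls["n"])
# prints: 25 10000
```

In this model, asking for 25 items still costs 10,000 permission checks, which is the shape of the problem described above.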

Here are some observations from the field:
- We have seen cases where calls to getChildren on folders that had 60,000-120,000 objects performed poorly. Not only that, but on older versions it would generally end up in an OOM whatever amount of RAM you threw at the server. It is better now with Solr6 and Elastic, but I would still not recommend that structure. If you really need a big flat folder, maybe Nuxeo would be better suited to your use case.
- In some of those cases the customer was able to run getChildren as the System user, which bypassed the permissions check. But then it BYPASSES the permissions check!
- In other cases the customer used a search query on metadata instead. Path queries will kill your server in such an organisation.
- It is worth pointing out that Solr-based searches can be paged because the ACLs are part of the Solr index, and paths are indexed in Elastic. There are generally fewer issues with Elastic in such a folder organisation.
- Searches done in the Transactional Metadata Query mode (e.g. against the DB) are also subject to the two-step process. This is consistent with the original use case: being able to immediately retrieve an object (or small collection of objects) by their metadata without needing to wait for Solr to index them.

DB Performance
--------------
Because the entire set of child associations and their ACLs must be retrieved when calling getChildren on a large folder, it puts a strain on the database. It is not rare to see the DB take more than 100 seconds to answer for a folder containing 100k nodes.

Memory
------

Since Alfresco has to evaluate all children and all ACLs in the folder, it puts everything in memory in one big matrix, and that will eat up your RAM like the Cookie Monster, to the point that basic garbage collection operations can't free enough memory to continue working and you'll face an OOM, especially if your system is heavily used.

Use Cases
---------
A large folder is either intended to be browseable by the user, or is non-browseable.

There is no immediate performance problem with having a non-browseable container with many children, but there is a risk that someone will accidentally trigger a call to "getChildren" that causes system performance to degrade until the query completes. The easiest way to guard against this risk to system performance is to hash the folder contents across sub-folders. This should not impact the user experience, as the folder is not intended to be browsed so users will never see the sub-folders. If a hashing mechanism is not implemented, then steps should be taken to prevent "getChildren" from being executed. If you really need a business path, store it as metadata in an aspect; it will be easier to search for than using path queries, and you'll still be able to display it in your business UI.

If a folder is intended to be browsed, then the information architect should think carefully about the use case. It is unlikely that a folder can be usefully browsed with tens of thousands of items, and there is likely a system of categorization that will better serve the users than having everything in the same container.

Guidelines
----------
When folder content approaches 5,000 to 10,000 items, we recommend using a hashing mechanism to spread content through sub-folders so as to not tax the system. Adding additional folders to hash content has minimal overhead. It would not impact any of the search queries and would only cause 0.1% to 0.2% overhead in additional nodes being generated. The management of a hashed file plan can be done automatically by using content policies, rules and actions, or a scheduled job.
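
One possible hashing scheme, sketched in Python purely for illustration: the sub-folder names are taken from the first characters of a hash of the item name. The helper `hashed_folder_path`, the choice of SHA-256, and the `levels`/`width` parameters are assumptions made for this example and not something Alfresco prescribes; in a real deployment this logic would live in a policy, rule, or scheduled job.

```python
import hashlib


def hashed_folder_path(root: str, name: str, levels: int = 2, width: int = 2) -> str:
    """Spread items across sub-folders named after hex slices of a hash (hypothetical helper)."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # With levels=2 and width=2 there are 256 * 256 = 65,536 possible leaf folders.
    buckets = [digest[i * width : (i + 1) * width] for i in range(levels)]
    return "/".join([root.rstrip("/")] + buckets + [name])


path = hashed_folder_path("Folder 1", "File 1.1")
print(path)  # Folder 1/<2 hex chars>/<2 hex chars>/File 1.1
```

The same name always maps to the same sub-folder, so an item can be located again without listing anything. As a sanity check on the overhead figure quoted above: if each leaf folder ends up holding about 1,000 items, the extra folder nodes amount to roughly 1 in 1,000, i.e. about 0.1%.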

In addition
-----------
Containers that have more than tens of thousands of items can be stored within the repository. These items can be accessed directly by their NodeRef, ObjectID, or by a qname-path. The content can be located by Search, categories, or tags.

When using the APIs (e.g. CMIS or REST API) to get/list children, the results should be paged using the skipCount & maxItems query parameters. For example, the client may choose to list X items at a time. If maxItems is not specified then a default paging size may be used. This is 100 items for both OpenCMIS (Apache Chemistry) and the V1 REST API "list children".
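
A minimal paging loop might look like the sketch below. It is deliberately transport-agnostic: `fetch_page` is a placeholder for whatever call you use to request one page with given skipCount and maxItems values. The response shape used here (`list.entries[].entry` and `list.pagination.hasMoreItems`) is an assumption based on the V1 REST API and should be checked against your version; `fake_fetch` exists only so the example runs without a server.

```python
from typing import Callable, Dict, Iterator


def iter_children(
    fetch_page: Callable[[int, int], Dict],
    max_items: int = 100,
) -> Iterator[Dict]:
    """Yield children one page at a time instead of asking for everything at once."""
    skip_count = 0
    while True:
        page = fetch_page(skip_count, max_items)["list"]
        entries = page["entries"]
        for item in entries:
            yield item["entry"]
        # Stop on an empty page or when the server reports no further items.
        if not entries or not page["pagination"]["hasMoreItems"]:
            break
        skip_count += len(entries)


def fake_fetch(skip_count: int, max_items: int, total: int = 250) -> Dict:
    """Stand-in for a real 'list children' call (assumed response shape)."""
    names = [f"doc-{i}" for i in range(total)][skip_count : skip_count + max_items]
    return {
        "list": {
            "pagination": {"hasMoreItems": skip_count + len(names) < total},
            "entries": [{"entry": {"name": n}} for n in names],
        }
    }


print(sum(1 for _ in iter_children(fake_fetch)))
# prints: 250
```

Because the function is a generator, the caller can stop early (for example after finding the item it needs) without fetching the remaining pages.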

In the case of file protocols such as WebDAV and FTP, only the first 5000 items will be returned unless the system admin has increased the "system.filefolderservice.defaultListMaxResults" property.

In the rare case that getChildren must be used on a large folder, running the command as the System user will avoid expensive permission checks.

You should be cautious about browsing "User Homes" as an admin. This could be a slow operation on systems with a large number of users. It may make sense to configure a "home folder provider" to split the user home directories across a set of sub-folders.