
Setup for importing 1 million files

ustraub
Champ in-the-making
Hi,

I want to import several million files into Alfresco. The file names are numerical values like 1000001, 1000002, etc. The files are later accessed by file name only, so the directory in which they are stored in Alfresco is irrelevant for later access from outside Alfresco. Alfresco is accessed via CMIS.

Is there a performance difference between the following two configurations?

(1) all files are stored in one directory e.g. /base_dir/1000001, /base_dir/1000002, …

(2) the files are stored under an additional intermediate subdirectory derived from the last digit of the file name:
    /base_dir/00/1000000, /base_dir/00/1000010, /base_dir/00/1000020, …
    /base_dir/01/1000001, /base_dir/01/1000011, /base_dir/01/1000021, …
    /base_dir/02/1000002, /base_dir/02/1000012, /base_dir/02/1000022, …
    …
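
To make the mapping in configuration (2) concrete, here is a minimal sketch of how a client could derive the target path from a file name. The function name `sharded_path` and the default `base_dir` value are illustrative assumptions; the only rule taken from the listing above is that the subdirectory is the last digit of the file name, zero-padded to two characters.

```python
def sharded_path(file_name: str, base_dir: str = "/base_dir") -> str:
    """Return the configuration (2) path for a numeric file name.

    Hypothetical helper: the name and signature are not part of Alfresco or CMIS.
    """
    if not file_name.isdigit():
        raise ValueError(f"expected a numeric file name, got {file_name!r}")
    # The subdirectory is the last digit, zero-padded to two characters.
    subdir = file_name[-1].zfill(2)
    return f"{base_dir}/{subdir}/{file_name}"


for name in ("1000000", "1000011", "1000022"):
    print(sharded_path(name))
```

With this rule there are at most ten subdirectories, so each one would still hold roughly a tenth of all files.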

Is the creation of files in a single Alfresco directory (like /base_dir) sequential or concurrent?
If it is sequential within one directory, would creation in different directories (e.g. /base_dir/00 and /base_dir/01) run concurrently?
Would configuration (2) be faster than configuration (1) when filled by multiple threads?

Regards
U.Straub
5 REPLIES

mrogers
Star Contributor
Some answers.

Content ingestion is concurrent with retries.

Yes, multiple threads will help.

You may also like to look at the bulk upload tool.
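
As a rough illustration of what a multi-threaded import with retries could look like on the client side, here is a sketch using Python's standard thread pool. It is only an assumption about how a client might be structured and says nothing about how Alfresco handles concurrency or retries internally. The `upload` callable stands in for whatever performs the actual CMIS create-document call; `upload_with_retry`, `import_all` and `fake_upload` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_with_retry(upload, name, attempts=3, delay=0.0):
    """Call upload(name), retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return upload(name)
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)  # simple linear back-off


def import_all(upload, names, workers=10):
    """Upload all names concurrently; return (succeeded, failed) name lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload_with_retry, upload, n): n for n in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                succeeded.append(name)
            except Exception:
                failed.append(name)
    return succeeded, failed


def fake_upload(name):
    """Placeholder for the real CMIS call (assumption: not a real API)."""
    return name


ok, bad = import_all(fake_upload, ["1000001", "1000002", "1000003"], workers=3)
print(len(ok), len(bad))
```

Whether more workers actually help depends on the server; the worker count here is just a parameter to experiment with.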

ustraub
Champ in-the-making
Hi,

what do you mean by "Content ingestion is concurrent with retries."?

If I understand you correctly, your recommendation is to use configuration (2) with 10 threads? Theoretically it should be at best 10 times faster than configuration (1) with 10 threads (in practice a factor of 5 would be great). Is that correct?
Configuration (1) with 1 thread would be as fast as with 10 threads, correct?

What performance do you expect with your bulk upload tool?
Currently we measure 0.3 seconds per uploaded file with configuration (1) and 1 thread (and it gets worse with more threads).
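
To put these numbers in perspective, here is a small back-of-the-envelope calculation. It assumes a constant per-file cost and uses 1 million files (the figure from the thread title) together with the 0.3 seconds per file and the hoped-for factor of 5 mentioned above; `import_hours` is an illustrative helper, not a measurement.

```python
def import_hours(file_count: int, seconds_per_file: float, speedup: float = 1.0) -> float:
    """Estimated wall-clock hours, assuming a constant per-file cost."""
    return file_count * seconds_per_file / speedup / 3600


# 1 million files at 0.3 s each, single thread
print(round(import_hours(1_000_000, 0.3), 1))     # 83.3
# Same load if multi-threading gave an overall factor of 5
print(round(import_hours(1_000_000, 0.3, 5), 1))  # 16.7
```

So at the current rate, one million files would take about three and a half days, and each additional million scales the estimate linearly.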

Regards
U.Straub

mdutoo
Champ on-the-rise
Hi

In addition to the bulk upload tool, there is the Alfresco ETL Connector for Talend, which uses an ETL-optimized version of the native bulk import services (per-file transaction and error code, …):

http://knowledge.openwide.fr/Main/AlfrescoETLConnector

Regards

jego
Star Contributor
There is also migration-center, which can import files from the filesystem into Alfresco. You can download a free version here:
http://www.migration-center.com/free-versions/

Thanks,
Jens

lista
Star Contributor