Loading 20k Users

martyphee
Champ in-the-making
A third party created a Java application to load users using the API (/api/people/).  We are currently building out a QA environment and loading about 20k users.  This comes after creating 20+ Share sites, loading content, creating groups, and setting up categories.

Right now the script is loading users at a rate of about 0.3 users per second.  Obviously extremely slow.

Is there a quicker way to load them?  I'm still learning the API and the other tools available.

The machines (two of them) are HP G6s, I think: 6 cores, 24 GB memory, 1 Gb Ethernet, RHEL 5 64-bit, Oracle RAC, and two mirrored disks (15k RPM, I believe).  The content is on an NFS mount (which does introduce latency), but the indexes are local.  I did try moving the content to a local mount and saw a significant improvement.

My best guess is that indexing is causing contention and slowing the process down.

afaust
Legendary Innovator
Why is a third party application pushing users into Alfresco and not Alfresco pulling users from a directory (Synchronization subsystem)?

Is the application loading users into Alfresco using multiple processes/threads? We run a user/group synchronisation job at one of our customers with about 20 parallel threads and have observed a nearly linear increase in throughput, meaning that no contention actually took place and the server mostly waited on other resources to complete (mostly the DB and ORM mapping).
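To illustrate the parallel approach: below is a minimal, hypothetical sketch (not the actual third-party loader or our synch job) of fanning the user list out over a fixed thread pool. The createUser() body is a placeholder; you would swap in your real HTTP client call posting to /api/people.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelUserLoader {

    // Deal items round-robin into one batch per worker thread.
    static <T> List<List<T>> partition(List<T> items, int workers) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            batches.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            batches.get(i % workers).add(items.get(i));
        }
        return batches;
    }

    // Placeholder for the real call, e.g. a POST to
    // /alfresco/service/api/people with a JSON body for the user.
    static void createUser(String userName) {
        // no-op in this sketch
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> users = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            users.add("user" + i);
        }
        int workers = 20; // roughly the thread count from the synch job above
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (List<String> batch : partition(users, workers)) {
            pool.submit(() -> batch.forEach(ParallelUserLoader::createUser));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

Since each user creation mostly waits on the DB, threads overlap that wait time, which is why throughput scales close to linearly until the DB itself saturates.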

As users are just content in the repository, creating them involves the usual content-related DB interaction and indexing (which in turn also talks to the DB). The application server usually idles merrily along, mostly waiting on queries to the DB to return. This is especially true if, due to some policies, rules or group constellations, the amount of DB interaction grows far beyond what is actually necessary to create a user (e.g. running into the Hibernate non-lazy fetch and cache or "shallow-container-with-lots-of-stuff" problem [descending into a container with a couple thousand objects as direct children, which get preloaded into the transactional cache]). Depending on what your rules and policies do in that scenario, moving content from NFS to local storage (or using an asynchronously replicating content store) may or may not improve your user import significantly.

As a rule I always try to follow this sequence:
1) synch with LDAP/AD
2) setup categories
3) setup sites
4) either via a patch or a client application, bulk-associate users with groups (if not already associated through the LDAP/AD synch)
5) setup/activate rules and content policies
6) load content


Suggestions:

1) use parallel import if not already performed
2) disable indexing completely during bulk load and rebuild your indices after bulk load completed
3) isolate and disable inefficient policies/rules that are not required for the bulk load (in my experience, quite a lot of policies/rules get evaluated but actually do nothing, due to the developer forgetting to properly check applicability before the logic is executed)
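On point 3, the applicability check I mean looks like this in plain Java (illustrative only, not Alfresco's actual policy API; the property name is made up). The guard bails out cheaply before the expensive body, instead of letting the policy fire on every update and do nothing useful:

```java
import java.util.Map;
import java.util.Objects;

public class GuardedPolicy {

    public static int expensiveRuns = 0; // counter, just to make the effect visible

    // The property this policy actually cares about (hypothetical name).
    static final String WATCHED_PROP = "cm:userStatus";

    public static void onUpdateProperties(Map<String, Object> before, Map<String, Object> after) {
        // Applicability check first: only proceed if the watched property changed.
        if (Objects.equals(before.get(WATCHED_PROP), after.get(WATCHED_PROP))) {
            return; // nothing to do -- skip the expensive part entirely
        }
        expensiveRuns++;
        // ... expensive logic (DB lookups, notifications, etc.) would go here ...
    }

    public static void main(String[] args) {
        Map<String, Object> a = Map.of(WATCHED_PROP, "ENABLED", "cm:name", "u1");
        Map<String, Object> b = Map.of(WATCHED_PROP, "ENABLED", "cm:name", "u1-renamed");
        onUpdateProperties(a, b); // watched property unchanged: guard short-circuits
        Map<String, Object> c = Map.of(WATCHED_PROP, "DISABLED", "cm:name", "u1-renamed");
        onUpdateProperties(b, c); // watched property changed: body runs once
    }
}
```

During a bulk load of 20k users, a policy that fires on every property update but short-circuits like this costs almost nothing; one that runs its full body each time can dominate the import.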

martyphee
Champ in-the-making
The users are actually coming from OID (Oracle Internet Directory).  Not sure why they designed it this way.  I came in later to help figure out the problems they were having.

How do I disable the indexing?

afaust
Legendary Innovator
You can set this in your alfresco-global.properties:


index.tracking.disableInTransactionIndexing=true
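One caveat, as a hedged note: with in-transaction indexing disabled, the indexes will be stale after the bulk load, so you will want to remove the property again and have the indexes caught up or rebuilt. On the Lucene-based subsystem a full rebuild on the next startup can be requested via the recovery mode; check the behaviour for your specific Alfresco version before relying on it:

```
# during the bulk load
index.tracking.disableInTransactionIndexing=true

# after the load: remove the line above, then trigger a rebuild on restart
index.recovery.mode=FULL
```

A FULL recovery rebuilds the indexes from the repository, which on 20k users plus the existing sites and content will take a while, but it runs once instead of taxing every create transaction.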