How to apply an aspect for all documents in a big repo

cesarista
World-Class Innovator

Hi Community:

Sometimes I face a painful requirement when dealing with large repositories (say, more than 10M documents).
I have to apply an aspect (cm:indexControl) and some properties (cm:isIndexed=true and cm:isContentIndexed=false) to every document in the repository. What strategies would you use in a very large repository? Is there a safer or more controlled way of doing this?

In the past I did this in smaller repos with a basic script, which was useful, but I don't think it is enough for this case.

- I used the REST API to obtain the full set of nodeRefs to update. Basically, I ran TYPE-based paginated searches for every document type.
- Then I iterated over that set of nodeRefs, calling a simple custom webscript that applied the aspect and properties to each node.

Surely this is not the most effective or fastest way of doing it. What do you think? Is there a way to avoid doing this one by one? How would you improve each part?

I use Alfresco 5.2 EE and Alfresco Search Services 1.3.

Kind regards and thanks in advance.
--C.

P.S.: Yes, the idea is to reindex SOLR afterwards, to get a smaller SOLR content store and smaller indices.

3 REPLIES

angelborroy
Community Manager

I guess the safest way is to build something on the repo side, using the Java API.

Developing a Scheduled Job that applies the aspect to the nodes using a paginated search will be faster than using the external API.
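
To make the idea concrete, here is a minimal sketch of the loop such a job could run. It is plain Java with no Alfresco dependencies: `PagedAspectJob`, `fetchPage` and `applyToBatch` are hypothetical names invented for illustration, not Alfresco APIs. In a real repo module I would assume `fetchPage` wraps a paginated search (skip count plus max items) and `applyToBatch` adds cm:indexControl with its properties to every node of the page inside a single transaction.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

/** Hypothetical sketch: none of these names are Alfresco APIs. */
class PagedAspectJob {

    /**
     * Walks a result set page by page and hands each page to a worker.
     *
     * @param fetchPage    stand-in for a paginated search: (skipCount, maxItems) -> nodeRefs
     * @param applyToBatch stand-in for "add the aspect and properties to these nodes",
     *                     ideally executed in one transaction per page
     * @param pageSize     number of nodes requested per page
     * @return total number of nodes handed to the worker
     */
    static int run(BiFunction<Integer, Integer, List<String>> fetchPage,
                   Consumer<List<String>> applyToBatch,
                   int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int skip = 0;
        int processed = 0;
        while (true) {
            List<String> page = fetchPage.apply(skip, pageSize);
            if (page.isEmpty()) {
                break; // an empty page means the result set is exhausted
            }
            applyToBatch.accept(page); // one unit of work per page keeps transactions small
            processed += page.size();
            skip += page.size();
        }
        return processed;
    }
}
```

Keeping one page per transaction means a failure only rolls back that page, and the job can log the last skip value to resume later. Note that the sketch assumes the search results do not shift while the job runs; if the query filters on the aspect being added, the pagination would need a different stop condition.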

Hyland Developer Evangelist

cesarista
World-Class Innovator

Thanks for the idea, Angel:

It seems reasonable to develop a scheduled job. It reminds me a little of the SOLR cron job strategy (but in this case it would run on the repo side).

But do you know how you would query all living and relevant nodes in an effective way?

Regards.

--C.

You may use the DB or the Search Service to get each batch of nodes to be updated. Using the DB will be more efficient, but the choice may depend on your requirements.
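
As a rough illustration of the DB option, one common pattern is to walk the node id space in fixed-size windows instead of paging through search results. The sketch below is plain Java and makes no claim about the actual Alfresco DAO layer: `IdRangeWalker`, `RangeQuery` and `idsBetween` are hypothetical names, and the query behind `idsBetween` (for example, "ids of live documents of a given type in this id window") is an assumption you would have to implement against your own database.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: none of these names are Alfresco APIs. */
class IdRangeWalker {

    /** Stand-in for a DB query returning the matching node ids in [fromInclusive, toExclusive). */
    interface RangeQuery {
        List<Long> idsBetween(long fromInclusive, long toExclusive);
    }

    /**
     * Visits the id space [minId, maxId] in windows of the given size.
     * Each non-empty window is handed to the worker as one batch.
     *
     * @return total number of ids handed to the worker
     */
    static long walk(long minId, long maxId, long window,
                     RangeQuery query, Consumer<List<Long>> worker) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        long processed = 0;
        for (long from = minId; from <= maxId; from += window) {
            long to = Math.min(from + window, maxId + 1); // exclusive upper bound
            List<Long> ids = query.idsBetween(from, to);
            if (ids.isEmpty()) {
                continue; // sparse id ranges are expected, just move on
            }
            worker.accept(ids); // e.g. apply the aspect to this batch in one transaction
            processed += ids.size();
        }
        return processed;
    }
}
```

The appeal of id windows is that they stay stable while nodes are being modified, and the job can persist the last `from` value to resume after a restart.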

If you need some inspiration, take a look at the implementation of the Trashcan Cleaner addon:

https://github.com/Alfresco/alfresco-trashcan-cleaner-module/blob/master/src/main/java/org/alfresco/...

Hyland Developer Evangelist