
safely crawl all documents via webscript

jaeni
Champ in-the-making
What I am trying to do is:

Find all nodes in the repo and get their file size. Also get all versions of each node and calculate the overall file size of the node and its versions.

How can I safely crawl every document in the repository?

The SearchService is going to hit the result limit easily; even if I increase the limit, the searches won't return results.

By traversing recursively through the repository I also seem to fill up the Solr caches:



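// Depth-first walk over the repository, starting from the given children.
private static void traverse(List<FileInfo> context) {
    for (FileInfo node : context) {
        if (node.isFolder()) {
            // list() returns the files and folders directly under this folder
            traverse(fileFolderService.list(node.getNodeRef()));
        } else {
            // is file = do stuff
        }
    }
}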



… :44,186  INFO  [solr.component.AsyncBuildSuggestComponent] [Suggestor-alfresco-1] Loaded suggester shingleBasedSuggestions, took 267411 ms
… :53,005  WARN  [cache.node.nodesTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.nodesTransactionalCache' is full (125000).
… :17,075  WARN  [cache.node.aspectsTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.aspectsTransactionalCache' is full (65000).
… :17,081  WARN  [cache.node.propertiesTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.propertiesTransactionalCache' is full (65000).
… :19,938  WARN  [alfresco.cache.contentUrlTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.contentUrlTransactionalCache' is full (65000).
… :19,991  WARN  [alfresco.cache.contentDataTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.contentDataTransactionalCache' is full (65000).
… :49,599  WARN  [org.alfresco.nodeOwnerTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.nodeOwnerTransactionalCache' is full (40000).
… :27,516  WARN  [cache.node.childByNameTransactionalCache] [http-apr-8080-exec-1] Transactional update cache 'org.alfresco.cache.node.childByNameTransactionalCache' is full (65000).



I understand it is an anti-pattern to grab everything at once, but I don't know of any service/API that lets me page the results into batches.
Please enlighten me.

version: 5.0.c
2 REPLIES

mrogers
Star Contributor
It's not the Solr cache that is filling up, it's the transactional cache.

The node service does give you all results, so apart from being slow that is O.K.

The problem is that your code is executing in a single transaction, and at some point there will likely be a limit on the number of database rows that can be updated.

What the Alfresco code does itself in those situations is to use the batch processor to break up your huge transaction into smaller chunks. That's probably what you want to do here.
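
The chunking idea can be sketched in plain Java. This is only an illustration of the pattern, not Alfresco's `BatchProcessor` (in `org.alfresco.repo.batch`) and not code from this thread: `Node`, `TxRunner`, and `crawl` are hypothetical names invented here, and the tree is held in memory instead of coming from `fileFolderService`. The point it tries to show is that the recursion becomes an explicit work queue, and each transaction handles at most `batchSize` nodes, so the per-transaction caches should never have to hold the whole repository.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class BatchedCrawl {

    /** Hypothetical stand-in for a repository node; not an Alfresco type. */
    static final class Node {
        final String name;
        final long size;           // content size in bytes (0 for folders)
        final List<Node> children; // null for files

        private Node(String name, long size, List<Node> children) {
            this.name = name;
            this.size = size;
            this.children = children;
        }

        static Node file(String name, long size) {
            return new Node(name, size, null);
        }

        static Node folder(String name, List<Node> children) {
            return new Node(name, 0L, children);
        }

        boolean isFolder() {
            return children != null;
        }
    }

    /** Hypothetical hook: runs one unit of work in its own transaction. */
    interface TxRunner {
        void inNewTransaction(Runnable work);
    }

    /**
     * Walks the tree under root and returns the summed size of all files.
     * Each call to tx.inNewTransaction handles at most batchSize nodes.
     */
    static long crawl(Node root, final int batchSize, TxRunner tx) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        // Explicit stack replaces the recursion; it survives across transactions.
        final Deque<Node> pending = new ArrayDeque<>();
        final long[] totalBytes = {0L};
        pending.push(root);

        while (!pending.isEmpty()) {
            tx.inNewTransaction(() -> {
                for (int i = 0; i < batchSize && !pending.isEmpty(); i++) {
                    Node node = pending.pop();
                    if (node.isFolder()) {
                        // Queue the children instead of recursing into them now.
                        for (Node child : node.children) {
                            pending.push(child);
                        }
                    } else {
                        // "do stuff" for a file; here we only add up sizes.
                        totalBytes[0] += node.size;
                    }
                }
            });
        }
        return totalBytes[0];
    }
}
```

For a quick local run, `TxRunner` can simply be `work -> work.run()`. In a real webscript the hook would presumably wrap `RetryingTransactionHelper.doInTransaction(callback, readOnly, requiresNew)` with `requiresNew` set to `true`, and the queue would hold `NodeRef`s rather than whole nodes; both of those are assumptions about how you would wire it up, not something tested against 5.0.c.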

jaeni
Champ in-the-making
Is there an easy example that I could re-use? I found the user-rename tool, but I am even more puzzled now.