Hyland Connect

sebp · ‎12-15-2009

I want to search my repository for duplicate documents. In a first step, I walk through it recursively and give each content node a hash sum property. In a second step, to find out how many duplicates each document has, I have to go through the repository again and check, for each node, whether there are other nodes with the same hash sum. So for each node the following method is called:


   /**
    * Sets the number of duplicates on each duplicate node with the given
    * hashsum.
    * 
    * @param storeRef
    *            the store to search in
    * @param hashSum
    *            the hash sum to look up
    * @param deletedNode
    *            the number will not be set on this node. If null it is
    *            ignored.
    */
   void setOthersForHashsum(StoreRef storeRef, String hashSum,
         NodeRef deletedNode) {
      ResultSet rs = searchService.query(storeRef, "lucene",
            "@dup\\:hashsum:" + hashSum);
      try {
         if (logger.isDebugEnabled())
            logger.debug("Found " + rs.length() + " equals for hashsum " + hashSum);

         List<NodeRef> nodeRefs = rs.getNodeRefs();
         if (deletedNode != null) {
            // Drop the deleted node (at most one match) from the result list.
            for (int i = 0; i < nodeRefs.size(); i++)
               if (nodeRefs.get(i).getId().equals(deletedNode.getId())) {
                  nodeRefs.remove(i);
                  break;
               }
         }

         // Each remaining node has (number of nodes - 1) duplicates.
         int duplicateCount = nodeRefs.size() - 1;
         for (NodeRef current : nodeRefs)
            if (nodeService.exists(current))
               nodeService.setProperty(current, PROP_COUNT, duplicateCount);
      } finally {
         // Always release the result set, even if a property update fails.
         rs.close();
      }
   }

My problem is that this method gets slower and slower the more nodes I process. In the beginning it takes 100 ms to process a node, but after 10,000 nodes it takes 1 s to process a single node with this method. My repository contains ~55,000 nodes in ~2,000 folders. With my current implementation it would take days to process it completely.
In the beginning I forgot to close the search ResultSet, which had a huge impact on performance. Now that I close the ResultSet the method runs much faster, but it still gets slower the more nodes I process.
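
For illustration, the duplicate count described above (every node in a group of n nodes with the same hash sum has n - 1 duplicates) can also be computed in a single in-memory pass instead of one query per node. This is a minimal sketch, not the code from the project: the class `DuplicateCounter`, the method `countDuplicates`, and the plain string node ids are hypothetical, and it assumes the node-to-hash mapping has already been collected during the first step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCounter {

    /**
     * Maps each node id to the number of other nodes sharing its hash sum.
     *
     * @param hashByNode node id -> hash sum (hypothetical input, assumed to
     *                   have been collected during the first pass)
     */
    public static Map<String, Integer> countDuplicates(Map<String, String> hashByNode) {
        // Group node ids by hash sum.
        Map<String, List<String>> nodesByHash = new HashMap<>();
        for (Map.Entry<String, String> e : hashByNode.entrySet()) {
            nodesByHash.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                       .add(e.getKey());
        }

        // Every node in a group of size n has n - 1 duplicates.
        Map<String, Integer> result = new HashMap<>();
        for (List<String> group : nodesByHash.values()) {
            for (String nodeId : group) {
                result.put(nodeId, group.size() - 1);
            }
        }
        return result;
    }
}
```

Whether this fits depends on the mapping fitting in memory; for ~55,000 nodes that seems plausible, but it is an assumption rather than something measured here.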

Btw: The whole project can be found here http://forge.alfresco.com/projects/duplicates/

rogier_oudshoor · ‎12-17-2009

In Alfresco, most processing is done inside a transaction. What you're doing is adding more and more information to a single transaction, which increases the transaction overhead. You can try to break your batch down into multiple transactions and see if that speeds things up.
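
A minimal sketch of splitting a batch into several transactions. The `TxRunner` interface here is a hypothetical stand-in for whatever transaction API is actually used (in Alfresco that would typically be something like the RetryingTransactionHelper, which is not shown), and `BatchedProcessor` and `processInBatches` are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchedProcessor {

    /** Hypothetical stand-in: runs the given work in its own transaction. */
    public interface TxRunner {
        void runInNewTransaction(Runnable work);
    }

    /**
     * Processes the items in batches, each batch inside its own transaction,
     * so that no single transaction keeps growing.
     *
     * @return the number of transactions that were used
     */
    public static <T> int processInBatches(List<T> items, int batchSize,
            TxRunner tx, Consumer<T> action) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int transactions = 0;
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so the batch does not depend on the source list.
            List<T> batch = new ArrayList<>(items.subList(start, end));
            tx.runInNewTransaction(() -> batch.forEach(action));
            transactions++;
        }
        return transactions;
    }
}
```

With 55,000 nodes and a batch size of, say, 100, this would run 550 small transactions instead of one large one; the right batch size is something to find by experiment.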

loftux · ‎12-17-2009

Side note: Be aware of how Alfresco makes copies. When you copy nodeA, which points to contentA, to nodeB, the new nodeB will still point to contentA. That holds until either nodeA or nodeB is versioned; at that point each node will point to its own unique content.
So if your hash sum is computed on the file content, you may get false positives. You can probably take this into account by checking the versionlabel and/or checking whether the node has the copiedfrom aspect.

sebp · ‎12-17-2009

Thanks for your replies. I think the problem was that I let the administrator start the "Update duplicates" action from within the web client UI, so there was automatically a surrounding transaction that grew bigger and bigger. Now I update the duplicates using a CronScheduledQueryBasedTemplateActionDefinition. If it runs in ISOLATED_TRANSACTIONS mode, everything is fine.

ganeshkolhe · ‎12-23-2009

Hi sebp,
Can you please share your implementation with us? How did you use CronScheduledQueryBasedTemplateActionDefinition, and how did you set the ISOLATED_TRANSACTIONS mode?

Regards,
Ganesh

sebp · ‎12-23-2009

Hi Ganesh,
Have a look at the duplicate finder's manual page: http://www.hmedia.de/wiki/doku.php?id=products:duplicates
There you will find the scheduled-action-services-context.xml file where the action is started as a CronScheduledQueryBasedTemplateActionDefinition. You can find more details about scheduled actions and the available transaction modes at http://wiki.alfresco.com/wiki/Scheduled_Actions
The implementation can be found at http://forge.alfresco.com/projects/duplicates/


[solved] The more search queries, the slower Lucene gets