Hyland Connect

mcook · ‎01-29-2009

We are running Alfresco version 2.9 in 4.2.2.GA on Linux 2.6.18-8.el5 (i386). We have created some custom scripts to manage driver documents. They check whether a driver has all the required documents on file and when those documents need to be updated (they have expired or will expire soon).

We are currently having an issue with the documents returned by the Lucene search we are doing. We are supposed to send an email to those concerned listing the documents that have expired or are going to expire; however, the list does not show all the documents when the task runs automatically every morning. The list is different (it contains all the documents it should) when the same script is run manually by accessing it via its URL in a browser.

I cannot show the document sets because the file names contain sensitive information and we would be in violation of regulations. My supervisor said we could provide dummy data if you would like an example of our results. Here is part of the script we are running:

<import resource="/Company Home/Data Dictionary/Scripts/burris-common.js">
<import resource="/Company Home/Data Dictionary/Scripts/exclusion-set.js">

function transportation_dailyTasks() {
   checkExpires();
   checkRetention();
   checkDocumentSets();
}

///////////////////////////////////////////////////
function checkExpires() {
   // records that are burris, are expireable, aren't superseded, aren't terminated
   var results = search.luceneSearch("TYPE:\""+BURRIS_DOC_TYPE+"\" +ASPECT:\""+EXPIREABLE_ASPECT+"\" -ASPECT:\""+SUPERCEDED_ASPECT+"\" -ASPECT:\""+TERMINATED_ASPECT+"\"");
   var expireds = findExpiredRecords(results);

   sendExpiredNotifications(expireds);
}

function findExpiredRecords(results) {
   var found = new Array();
   
   // find the docs that are expired and store according to group responsible so that they get one email with all the expirations
   for each(var result in results) {

      // files that are sitting at the facility level are unfiled and not approved yet, ignore them
      if(result.parent.parent.name == "Transportation") continue;
   
      var properties = result.properties;
      var expirationDate = properties[EXPIRATION_DATE];
      var now = new Date();
      
      var groupEntry = found[result.properties[GROUP_RESPONSIBLE]];
      if(!groupEntry) {
         groupEntry = new Array();
         groupEntry["group"] = result.properties[GROUP_RESPONSIBLE];
         found[result.properties[GROUP_RESPONSIBLE]] = groupEntry;
      }
      
      var entry = new Array();
      entry["file"] = result.name;

      var difference = getDayDifference(expirationDate, now);
      if(difference <= 0) {
         result.properties[EXPIRED] = true;
         result.save();
      
         entry["isExpired"] = true;
         groupEntry.push(entry);
      } else if(difference <= 30) {
         entry["isExpired"] = false;
         groupEntry.push(entry);
      }
   }

   return found;
}

function sendExpiredNotifications(results) {
   for each(var group in results) {
      // the only reason a file won't have a group is if it hasn't been approved yet
      if(group["group"] && !group["group"].match(/_x0020_/)) {
         var users = people.getMembers(people.getGroup(group["group"]));
   
         for each(var p in users) {
            if (!p.properties.email || p.properties.email == " ") {
               continue;
            }
         
            var mail = actions.create("mail");
            
            mail.parameters.to = p.properties.email;
            mail.parameters.from = "Burris Alfresco <noreply@burrislogistics.com>";
            mail.parameters.subject = "Results from Alfresco Expiration Check";
            mail.parameters.text  = "Hello " + p.properties.firstName + ",\r\n\r\n";
            mail.parameters.text += "This is an automated email, the result of an ";
            mail.parameters.text += "Alfresco Expiring Record check. The results are ";
            mail.parameters.text += "listed below.\r\n\r\n";
         
            var hasFiles = false;
            for each(var item in group) {
               if (!item["file"]) continue;

               hasFiles = true;
               mail.parameters.text += "   * " + item["file"] + ": ";
               mail.parameters.text += item["isExpired"] ? "expired" : "expiring in less than 30 days";
               mail.parameters.text += "\r\n";
            }
         
            mail.parameters.text += "\r\n";
            mail.parameters.text += "Please login to Alfresco (" + ALFRESCO_URL + ") ";
            mail.parameters.text += "to update these documents.\r\n\r\n";
            mail.parameters.text += "Thank you.";
      
            if(hasFiles) {
               mail.execute(roothome);
            }

            appendLog( mail.parameters.text );
         }
         sendLog();
      }
      else {
         appendLog("These items have a Group Name that does not behave:");
         for each(var item in group) {
            if(item["file"] == undefined) continue;
            appendLog("   * " + item["file"]);
         }
         sendLog();
      }
   }
}

Please help us resolve this as this issue is time sensitive for our company. Any help will be greatly appreciated.

dhalupa · ‎01-29-2009

How many hits do you think your Lucene query typically returns? If the number of Lucene hits is large (close to 1000 or more), then the ACL evaluation process could affect the number of nodes found.
There are a few parameters that affect this behaviour:
1. the state of the ACL cache before the repository is queried
2. the size of the ACL cache
3. system.acl.maxPermissionCheckTimeMillis, which defaults to 10000 ms and defines the maximum time that will be spent on ACL evaluation
4. system.acl.maxPermissionChecks, which defaults to 1000 and defines the maximum number of ACL evaluations that will be performed for the collection of hits returned from a Lucene query. If the number of hits exceeds this number, the rest will simply be discarded
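
To make the effect of the fourth parameter concrete, here is a minimal stand-alone sketch of a capped permission check. It is a simplified simulation, not Alfresco's actual implementation: the function name filterWithPermissionCap, the canRead callback, and the 1500 dummy hits are all invented for illustration, and the time limit from the third parameter is left out.

```javascript
// Simplified simulation of a capped permission check.
// NOT Alfresco code: all names and data here are hypothetical.
function filterWithPermissionCap(hits, maxPermissionChecks, canRead) {
  var visible = [];
  var checks = 0;
  for (var i = 0; i < hits.length; i++) {
    if (checks >= maxPermissionChecks) {
      break; // remaining hits are dropped without being checked
    }
    checks++;
    if (canRead(hits[i])) {
      visible.push(hits[i]);
    }
  }
  // cutOff is true when some hits were never evaluated
  return { visible: visible, checked: checks, cutOff: checks < hits.length };
}

// Dummy data: 1500 hits, all of them readable by the current user.
var hits = [];
for (var n = 0; n < 1500; n++) {
  hits.push("doc-" + n);
}

var run = filterWithPermissionCap(hits, 1000, function () { return true; });
console.log(run.visible.length + " of " + hits.length +
            " hits returned; cut off: " + run.cutOff);
```

With a cap of 1000, the last 500 hits never reach the caller even though the user could read them. A result set of about 450 hits would pass through such a cap untouched.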

Hope this helps,

Denis

mcook · ‎01-30-2009

Thank you for the advice. However, I do not think that is the cause of our current issue. Right now, the search is returning about 450 files, but in the future it could return many more. I'll look into changing those settings, and I hope this fixes things, but I remain skeptical that it will.

dhalupa · ‎01-30-2009

If I understood you correctly, this is some kind of system task which runs periodically. If this is the case, then there is one other thing you might consider. You could try using the SearchService without the transaction, security and audit interceptors applied. The id of that bean is "searchService", and in that case you will bypass ACL evaluation completely. The only thing you have to be careful about is that you have to create the transaction boundaries yourself, since they will not be handled by Spring. I'm not sure how you would access this bean from JavaScript though, since I'm pretty sure that it is not injected into the JavaScript context.

Kind regards,

Denis

mcook · ‎02-04-2009

I have solved our issue, but I never found out the root cause of it. I would like to figure that out, however, to prevent this from happening again. The documents were either not indexed or improperly indexed. A server restart solved the issue because by default Alfresco does some reindexing on startup. At least this is my best guess. Our server needs to remain up as much as possible (it is only restarted during maintenance windows); however, if this happens again we will inevitably require a restart unless another solution is found. Does anyone have any thoughts?

lotharm · ‎02-05-2009

How is the daily task triggered? Perhaps it runs as the guest user?

However, I would like to comment on the general approach you took, because we once took the same search-the-index approach ourselves but concluded it is not the right way.
Querying the index for a certain property value (valid, expired or the like) to produce report-like lists seems like a reasonable idea. But it turns out that the result list is not stable. That is, running the same query twice may return lists of different lengths. I say "may" because the effect is usually not visible on small systems, but under heavier load it can happen more often. The contributing factors are:
* a loaded system may hit the limit system.acl.maxPermissionCheckTimeMillis (10 seconds).
* a big result list may hit the limit system.acl.maxPermissionChecks (1000).
* in a cluster setup, where every instance has its own index, a document might be missing from the index for a short time.
…and the result will just be cut. It is very difficult to reproduce, too.
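
One way to pin down this kind of instability is to record the file names from two runs (for example the scheduled run and the manual run) and compare them. The following is a small stand-alone sketch of that comparison. The function name missingNames and the file names are made up for illustration; inside an Alfresco script the two arrays would have to be built from the names of the search results.

```javascript
// Return the names that appear in `expected` but not in `actual`.
// Hypothetical helper, not part of any Alfresco API.
function missingNames(expected, actual) {
  var seen = {};
  for (var i = 0; i < actual.length; i++) {
    seen[actual[i]] = true;
  }
  var missing = [];
  for (var j = 0; j < expected.length; j++) {
    if (!Object.prototype.hasOwnProperty.call(seen, expected[j])) {
      missing.push(expected[j]);
    }
  }
  return missing;
}

// Dummy data standing in for the two result lists.
var manualRun = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"];
var scheduledRun = ["a.pdf", "c.pdf"];

console.log("Missing from scheduled run: " +
            missingNames(manualRun, scheduledRun).join(", "));
```

If the missing names change from one day to the next, that points toward load or timing effects rather than a fixed permission or indexing problem with particular documents.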

Using the index for something other than full-text search is therefore debatable. In my eyes, a lookup by property value as above is not a job for a full-text index; it is a job for the relational database.

I think Alfresco did a good job on the full-text index. It is fast, respects transactions, and the results for full-text search are well ranked. But it has to be used for what it was built for. By design the index cuts the result list. This improves search speed, and there is no need for the 10,000th hit in a full-text search. Does anyone ever go to the second page on Google?

We solved our issue successfully by using direct database queries through Hibernate.

Hoping this gives some guidance,
lothar

mcook · ‎02-09-2009

Is it possible to run a Hibernate query from the JavaScript API? A quick Google search did not turn up much.

dhalupa · ‎02-09-2009

lotharm wrote: "How the daily is triggered? Perhaps it runs as guest user? [...] By design the index is cutting the result list."

It is not the "index" (I guess you meant the Lucene query) that cuts the result list, but rather ACL evaluation. The number of hits returned from Lucene is correct.

dhalupa · ‎02-09-2009

mcook wrote: "Is it possible to do a hibernate query from the Javascript API? A quick google search did not turn up much."

No, it is not possible.

lotharm · ‎02-09-2009

dhalupa wrote: "It is not 'index' (I guess you meant Lucene query) which is cutting result list but rather acl evaluation."

To be clear, I was talking about the indexing-and-searching component as a whole: the thing you get as the SearchService bean, which one has to use as a black box.
There is Lucene hidden underneath it, wrapped by an ACL checker, which also cuts the results due to time constraints.

dhalupa wrote: "Number of hits returned from Lucene is correct"

Not in a cluster environment. Some transactions might be missing from the local Lucene index of an Alfresco instance.

Regards,
lothar

Weird Search Result Issue