Age-Off Unaccessed Files

jsb
Champ on-the-rise
We have about 1 million files in our Alfresco repository that have not been <strong>accessed</strong> (i.e. viewed in Share/Explorer) in over a year. We want to remove these files. Beyond that, we want to implement an age-off policy that automatically removes files once they haven't been accessed for a year.

I think the best way to do this would be with a Scheduled Action. I have two ideas for how to do this.

————————————————-
Approach #1
————————————————-
I have the scheduled action running, but I don't know how to query for what I want. Here are my two files:
scheduled-action-services-context.xml (More or less copied from somewhere else on the internet…)
<blockcode>
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
    <!--
    Define the model factory used to generate object models suitable for use with freemarker templates.
    -->
    <bean id="templateActionModelFactory" class="org.alfresco.repo.action.scheduled.FreeMarkerWithLuceneExtensionsModelFactory">
        <property name="serviceRegistry">
            <ref bean="ServiceRegistry"/>
        </property>
    </bean>
    <!-- Execute the script /Company Home/Data Dictionary/Scripts/ageoff.js -->
    <bean id="runScriptAction" class="org.alfresco.repo.action.scheduled.SimpleTemplateActionDefinition">
        <property name="actionName">
            <value>script</value>
        </property>
        <property name="parameterTemplates">
            <map>
                <entry>
                    <key>
                        <value>script-ref</value>
                    </key>
                    <!-- Note that as of Alfresco 4.0, due to a Spring upgrade, the FreeMarker ${foo} entries must be escaped -->
                    <value>\$\{selectSingleNode('workspace://SpacesStore', 'lucene', 'PATH:"/app:company_home/app:dictionary/app:scripts/cm:ageoff.js"' )\}</value>
                </entry>
            </map>
        </property>
        <property name="templateActionModelFactory">
            <ref bean="templateActionModelFactory"/>
        </property>
        <property name="dictionaryService">
            <ref bean="DictionaryService"/>
        </property>
        <property name="actionService">
            <ref bean="ActionService"/>
        </property>
        <property name="templateService">
            <ref bean="TemplateService"/>
        </property>
    </bean>
    <bean id="runScript" class="org.alfresco.repo.action.scheduled.CronScheduledQueryBasedTemplateActionDefinition">
        <property name="transactionMode">
            <value>UNTIL_FIRST_FAILURE</value>
        </property>
        <property name="compensatingActionMode">
            <value>IGNORE</value>
        </property>
        <property name="searchService">
            <ref bean="SearchService"/>
        </property>
        <property name="templateService">
            <ref bean="TemplateService"/>
        </property>
        <property name="queryLanguage">
            <value>lucene</value>
        </property>
        <property name="stores">
            <list>
                <value>workspace://SpacesStore</value>
            </list>
        </property>
        <property name="queryTemplate">
            <value>PATH:"/app:company_home"</value>
        </property>
        <property name="cronExpression">
            <!-- In reality this will be once a day, this is just for testing -->
            <value>0 0/3 * * * ?</value>
        </property>
        <property name="jobName">
            <value>jobD</value>
        </property>
        <property name="jobGroup">
            <value>jobGroup</value>
        </property>
        <property name="triggerName">
            <value>triggerD</value>
        </property>
        <property name="triggerGroup">
            <value>triggerGroup</value>
        </property>
        <property name="scheduler">
            <ref bean="schedulerFactory"/>
        </property>
        <property name="actionService">
            <ref bean="ActionService"/>
        </property>
        <property name="templateActionModelFactory">
            <ref bean="templateActionModelFactory"/>
        </property>
        <property name="templateActionDefinition">
            <ref bean="runScriptAction"/> <!-- This is the name of the action (bean) that gets run -->
        </property>
        <property name="transactionService">
            <ref bean="TransactionService"/>
        </property>
        <property name="runAsUser">
            <value>System</value>
        </property>
    </bean>
</beans>

</blockcode>


ageoff.js
<blockcode>

// I am testing with this date range because I am testing in a temporary 4.2.e instance.
var temp = "NOW-1YEAR/DAY TO NOW/DAY+1DAY"
// Real date range will be something like "MIN TO NOW-1YEAR/DAY"

// This is kind of what I want, but it doesn't work. I think my query is somehow wrong. Also, I don't
// think "@cm\\:accessed" exists, but when I replace it with "@cm\\:created" it doesn't seem to work anyway.
var docs = search.luceneSearch("PATH:\"/app:company_home/app:user_homes//*\" AND @cm\\:accessed:[" + temp + "] AND TYPE:\"cm:content\" AND -TYPE:\"cm:folder\"");

//———————————————————————————–
//This will get a list of everything in user homes. This works! (but not what I want)
//———————————————————————————–
//var docs = search.luceneSearch("PATH:\"/app:company_home/app:user_homes//*\" AND TYPE:\"cm:content\" AND -TYPE:\"cm:folder\"");

var dest;
for(dest=0; dest < docs.length; dest++) {
        // Instead of remove I think I want to set the sys:temporary aspect?
        var success = docs[dest].remove();
}
</blockcode>

<strong>Question</strong>: Is there a way to query based on when the documents were accessed or viewed (even just viewed on the share site, not necessarily downloaded)? Google returns hits for cm:created and cm:modified, but not cm:accessed. I think this approach is somewhat dead for this reason.
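Independently of whether the property is populated, the date range itself may be worth ruling out as a cause. The following is a minimal sketch, assuming that an explicit ISO 8601 cutoff is accepted in a Lucene range (instead of date math such as NOW-1YEAR/DAY). `AgeOffQuery` and `buildQuery` are hypothetical names, and `cm:modified` is only a stand-in for a date property that the repository actually maintains.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class AgeOffQuery {

    /**
     * Builds a Lucene query for content under User Homes whose given date
     * property is at least one year older than "today".
     */
    static String buildQuery(String dateProperty, LocalDate today) {
        // Explicit cutoff at midnight one year ago (assumption: ISO 8601 is accepted here).
        String cutoff = today.minusYears(1).format(DateTimeFormatter.ISO_LOCAL_DATE) + "T00:00:00";
        // The colon in the prefixed property name has to be escaped for Lucene.
        String escaped = dateProperty.replace(":", "\\:");
        return "PATH:\"/app:company_home/app:user_homes//*\""
                + " AND TYPE:\"cm:content\""
                + " AND @" + escaped + ":[MIN TO " + cutoff + "]";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("cm:modified", LocalDate.now()));
    }
}
```

The printed string has the same shape as the query in ageoff.js, so it could be pasted into the script for a quick test against a property that is known to be set.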

——————————————————————————–
Approach #2
——————————————————————————–
I have this java class that can correctly find the files that have not been accessed in a year if I run it from the contentstore root directory (alf_data/contentstore). According to this page https://wiki.alfresco.com/wiki/Custom_Actions I believe I shouldn't have too hard of a time converting this class into a custom action.


import java.util.Map;
import java.lang.ProcessBuilder;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;


// Find un-accessed files:
//   find ./ -atime +366 -type f -exec ls -l --time=atime {} \;
//

public class AgeOff {
   /**
    * Reads the given stream to the end and returns its contents as a string.
    */
   private static String loadStream(InputStream s) throws Exception {
      BufferedReader br = new BufferedReader(new InputStreamReader(s));
      StringBuilder sb = new StringBuilder();
      String line;
      while((line = br.readLine()) != null) {
         sb.append(line).append("\n");
      }
      return sb.toString();
   }      

   public static void main(String[] args) {
      ProcessBuilder pb = new ProcessBuilder("/bin/bash", "-c", "find ./ -atime +366 -type f -exec ls -l --time=atime {} \\;");
      try {
         Process p = pb.start();
         String output = loadStream(p.getInputStream());
         String outerr = loadStream(p.getErrorStream());

         System.out.println(output);
         System.err.println("-------------ERRORS-------------");
         System.err.println(outerr);
      } catch(Exception e) {
         System.out.println("EXCEPTION: " + e.getMessage());
      }
     }
}



The problem with this approach is that I now have a list of files in the contentstore, and I need to somehow translate them to Alfresco nodes so that I can delete them or set sys:temporary. Is there a way to translate a contentstore path to an Alfresco node?
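The file scan itself does not need to shell out to find. Below is a minimal sketch of the same check using only java.nio.file, assuming the file system records access times at all (mount options such as noatime would defeat it). `AgeOffScan` and `findStale` are hypothetical names, and the sketch only lists contentstore files; it does not resolve them to nodes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class AgeOffScan {

    /** Returns the regular files under root whose last access time is before cutoff. */
    static List<Path> findStale(Path root, Instant cutoff) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> lastAccess(p).isBefore(cutoff))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static Instant lastAccess(Path p) {
        try {
            return Files.readAttributes(p, BasicFileAttributes.class)
                    .lastAccessTime().toInstant();
        } catch (IOException e) {
            // Unreadable files are treated as recently accessed so they are never reported.
            return Instant.MAX;
        }
    }

    public static void main(String[] args) throws IOException {
        // Root defaults to the current directory, e.g. alf_data/contentstore.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Instant cutoff = Instant.now().minus(Duration.ofDays(366));
        for (Path p : findStale(root, cutoff)) {
            System.out.println(p);
        }
    }
}
```

Reading attributes this way does not open the files, so the scan itself should not modify the access times it is inspecting.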

———————————————————————–
Summary
———————————————————————–
1) Is there a valid way to query for cm:accessed?
2) Is there a way to translate a contentstore path/id to an alf_node? (I hope that is the correct terminology)

I have looked into the Records Management module and it doesn't seem to be what I want. But maybe I am missing something.
This post (https://forums.alfresco.com/forum/end-user-discussions/alfresco-share/automatically-deleting-documen...) is relevant, but I want accessed, not created.

We have two Alfresco instances running for different purposes: 4.2.e and 3.4.8. Ideally I need something that works for both, but I really just want a push in the right direction. I have only tested the above with 4.2.e because it is Community and I can spin up a temporary instance to test with, so I don't touch production. If there is no common solution, I would rather have help with 4.2.e.
OS is CentOS.
1 ACCEPTED ANSWER

afaust
Legendary Innovator
Hello,

cm:accessed exists in the data model but is not maintained by default to avoid performance overhead for something that most people won't use. Also, the meaning of cm:accessed can be quite different (accessed metadata vs. accessed content; user access vs. system access) depending on user / customer requirements, so it is difficult to come up with an efficient way to track this that can cater to all potential parties. As long as cm:accessed is not maintained you can't query for it. You could write custom functionality that does maintain that property using your semantics of "accessed".

That covers #1…

The URL / path in a content store may not belong to exactly one node, so it cannot always be translated to a single, unique node. Also, some paths in the content store may be in the "orphaned" stage, where no node claims ownership of them. A reverse lookup of path to node is possible via the database, but there is no API / service operation for it that an addon could use.
Due to the non-uniqueness of path-to-node relations, you might also end up in a situation where accessing one node in Alfresco keeps another one alive even though that other node has not been accessed at all. Additionally, the access time in the file system may be updated by any process that runs on your server, e.g. backup & recovery processes, which skews the evaluation from the user's perspective.


Personally, I would go with option #1 and implement the various hooks / extensions to maintain the cm:accessed property. But this is not at all trivial due to the special features of all the various interfaces Alfresco provides. So for any casual user / customer of Alfresco, I would not advise tackling that without professional support (you could sink a lot of time into it).

Regards
Axel
