Obsolete Pages{{Obsolete}}
The official documentation is at: http://docs.alfresco.com
Design Document: Activities Service 3.0
DRAFT/WIP
Activities High-level Design Approach
The following provides a high-level design approach to support the 3.0 Activities Requirements.
Split & Parallelize
- keep processing as light as possible, but allow for the pre-calculation of user feeds to be split across 'n' CPUs
- maintain a running list of activity posts (continuously added to)
- background 'feed' job activates on a regular cycle
- evenly allocates posts to 'n' 'feed' tasks (one per CPU)
- each 'feed' task processes its allocation of posts and generates activities for relevant users/sites - posts are marked as processed
- background cleaner jobs activate on regular cycles (or during the night)
- 'feed cleaner' job removes feeds that are out of date and/or possibly user/site feeds that are greater than a system max size
- 'post cleaner' job removes posts that are processed - could be kept for a period of time to aid debugging and/or troubleshooting
- background 'post lookup' job activates on a regular cycle
- to provide secondary lookup of activity data for a well-defined entity (eg. node ref)
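The 'feed cleaner' rule described above can be sketched as follows. This is only an illustration, assuming a feed is held as a list of (post date, summary) pairs; the names `clean_feed`, `max_age` and `max_size` are hypothetical and do not come from the design.

```python
from datetime import datetime, timedelta

def clean_feed(entries, now, max_age, max_size):
    """Return the feed entries that survive cleaning.

    entries:  list of (post_date, summary) tuples for one user or site feed
    max_age:  hypothetical retention period (timedelta)
    max_size: hypothetical system max number of entries per feed
    """
    # Drop entries that are out of date.
    fresh = [e for e in entries if now - e[0] <= max_age]
    # Keep only the newest max_size entries.
    fresh.sort(key=lambda e: e[0], reverse=True)
    return fresh[:max_size]

now = datetime(2008, 6, 1, 12, 0)
feed = [(now - timedelta(days=d), f"activity {d}") for d in (0, 1, 5, 40)]
kept = clean_feed(feed, now, max_age=timedelta(days=30), max_size=2)
print([summary for _, summary in kept])  # ['activity 0', 'activity 1']
```

In a database-backed implementation the same two conditions would more likely be expressed as delete statements run by the scheduled cleaner job.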
Data Schema
Activity Post
sequence id (pk)
posting userid
site network
app tool
post date
activity type
activity data
job task node
status
last modified
- 'sequence id' is an incrementing sequence used for limiting the posts processed by a task (while new posts are continuously added)
- 'posting userid' originating user who posted this activity
- 'site network' site id context
- 'app tool' app id context
- NOTE: if not the site name then may need the site name in the activity data to generate certain feed views
- 'post date' date+time when activity raised (posted)
- 'activity type' named type
- 'activity data' JSON format, so that it can be converted to Freemarker model - in order to apply activity templates
- 'job task node' node hash - is used to partition and allocate posts to a job task node
- calculated as mod('posting user id'.hash(), no. of task CPUs)
- storage may also be partitioned by job task (e.g. sql table partition)
- NOTE: this is pre-calculated on post to simplify & improve performance of query - however, the number of actual CPUs may vary, if task nodes are added, removed or die during the post period. This is ok, it may mean that some CPUs are not used, or at worst, some CPUs have more work than others. The number of available CPUs can be retrieved at the end of each job cycle.
- 'status' can be immediately posted or pending an additional lookup, once posted and processed then eligible for cleanup - transitions are (PENDING ->) POSTED -> PROCESSED
- 'last modified' for debug/troubleshooting only, set to post date when inserted, then updated when status changes
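As a rough sketch of the 'job task node' calculation and the status transitions listed above: the function and class names are hypothetical, and `zlib.crc32` is an assumption that stands in for the `hash()` call in the design, because Python's built-in string hash differs between processes.

```python
import zlib
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"      # waiting for the additional 'post lookup'
    POSTED = "POSTED"        # ready to be picked up by a feed task
    PROCESSED = "PROCESSED"  # processed, eligible for cleanup

# Allowed transitions: (PENDING ->) POSTED -> PROCESSED
NEXT_STATUS = {Status.PENDING: Status.POSTED, Status.POSTED: Status.PROCESSED}

def job_task_node(posting_userid: str, task_count: int) -> int:
    """mod(hash(posting userid), number of task CPUs), using a stable hash."""
    return zlib.crc32(posting_userid.encode("utf-8")) % task_count

def advance(status: Status) -> Status:
    """Move a post to its next status; PROCESSED has no successor."""
    if status not in NEXT_STATUS:
        raise ValueError(f"no transition from {status.name}")
    return NEXT_STATUS[status]

# The same user always maps to the same task node for a given task count.
for user in ("alice", "bob", "carol"):
    print(user, job_task_node(user, 4))
```

Because the node is derived only from the posting userid and the task count, it can be computed once at post time and stored with the row, as the note above describes.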
Activity Feed
feed userid
posting userid
site network
app tool
post date
activity type
activity summary
activity format
id (pk)
feed date
post id
- 'feed userid' may be used to partition to support parallel user feed queries, userid can also be a site id (for site activities feed)
- 'posting userid' originating user who posted this activity
- 'site network' site id context
- 'app tool' app id context
- 'post date' date+time of activity
- 'activity type' named type
- 'activity summary' generated activity summary, can also be pass-through of JSON activity data
- 'activity format' format of activity summary, eg. atom, html, json ...
- 'id' DB-generated PK, for debug/troubleshooting only
- 'feed date' for debug/troubleshooting only - date+time when feed generated, as opposed to post date
- 'post id' for debug/troubleshooting only - not a FK, can dangle when posts are cleaned, might be used to implement re-generate
Activity Feed Control
feed userid
site network
app tool
last modified
- 'feed userid' feed user can have zero or more opt-out feed controls
- NOTE: userid can also be a site id (feed controls for site activities feed, set by a site admin)
- 'site network' site id - if set, opt out for this site
- 'app tool' app id - if set, opt out for this app tool
- NOTE: can combine with site - ie. opt-out of app tool for given site
- 'last modified' for debug/troubleshooting only, set when inserted
NOTE: in future release, could add 'activity type', 'posting userid' etc
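The opt-out matching implied by the feed control fields might look like the sketch below. The names `FeedControl` and `is_opted_out` are hypothetical, and treating a control with neither site nor app tool set as matching nothing is an assumption; the design does not say how such a row should behave.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FeedControl:
    feed_userid: str
    site_network: Optional[str] = None  # if set, opt out for this site
    app_tool: Optional[str] = None      # if set, opt out for this app tool

def is_opted_out(controls, feed_userid, site_network, app_tool):
    """True if any control of this feed user covers the given site/app tool."""
    for c in controls:
        if c.feed_userid != feed_userid:
            continue
        if c.site_network is None and c.app_tool is None:
            continue  # assumption: an empty control opts out of nothing
        site_matches = c.site_network is None or c.site_network == site_network
        tool_matches = c.app_tool is None or c.app_tool == app_tool
        if site_matches and tool_matches:
            return True
    return False

controls = [
    FeedControl("alice", site_network="site1"),                 # whole site
    FeedControl("alice", site_network="site2", app_tool="wiki"),  # tool within a site
]
print(is_opted_out(controls, "alice", "site1", "blog"))  # True
print(is_opted_out(controls, "alice", "site2", "blog"))  # False
```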
Posting an Activity
- creation of an activity post - fast (eg. insert row), possibly asynchronous?? - handle tx error/rollback
- thread pool?
- only post in accordance with posting user privacy controls
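A minimal sketch of the "insert row, handle tx error/rollback" idea, using an in-memory SQLite table purely for illustration. The table layout, column names and the `post_activity` helper are assumptions, not the actual schema or API; asynchronous posting and privacy checks are left out.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE activity_post (
        sequence_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        posting_userid TEXT NOT NULL,
        site_network   TEXT,
        app_tool       TEXT,
        activity_type  TEXT NOT NULL,
        activity_data  TEXT NOT NULL,
        status         TEXT NOT NULL
    )
""")

def post_activity(conn, userid, site, tool, activity_type, data):
    """Insert one post; return its sequence id, or None if the insert failed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "INSERT INTO activity_post (posting_userid, site_network, app_tool,"
                " activity_type, activity_data, status) VALUES (?, ?, ?, ?, ?, ?)",
                (userid, site, tool, activity_type, json.dumps(data), "POSTED"),
            )
            return cur.lastrowid
    except sqlite3.Error:
        return None

seq = post_activity(conn, "alice", "site1", "documents",
                    "org.alfresco.folder.create", {"name": "Reports"})
print(seq)
```

Keeping the post to a single row insert is what makes it cheap enough to run in the caller's request; whether to hand it to a thread pool instead remains an open question in the list above.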
Feed Generator
Feed Job
- simple task scheduler - job initiator
- scheduled job - eg. run every X minutes (if not already busy) - should probably be less than 10 minutes, to keep feeds reasonably up-to-date
select max(sequence id), job task node from post group by job task node
for each job task node
start feed task(job task node, max(sequence id))
end for
- tuning parameters
- frequency of job cycle
- number of posts processed by each task (NOTE: this throttles the processing, but may result in a growing list of posts - bad - need more CPUs in that case)
- cluster-aware, to avoid contention
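The job-initiator pseudocode above can be tried out against an in-memory SQLite table, as in the sketch below. The table and column names are assumptions, and `start_feed_task` is a hypothetical placeholder that only records what would be dispatched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_post ("
    " sequence_id INTEGER PRIMARY KEY,"
    " job_task_node INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO activity_post VALUES (?, ?)",
    [(1, 0), (2, 1), (3, 0), (4, 1)],
)

def start_feed_task(job_task_node, max_sequence_id):
    # Placeholder: a real task would be handed to a worker or grid node.
    return (job_task_node, max_sequence_id)

def run_feed_job(conn):
    """One job cycle: find the upper sequence bound per task node, then dispatch."""
    rows = conn.execute(
        "SELECT job_task_node, MAX(sequence_id) FROM activity_post"
        " GROUP BY job_task_node ORDER BY job_task_node"
    ).fetchall()
    return [start_feed_task(node, max_seq) for node, max_seq in rows]

print(run_feed_job(conn))  # [(0, 3), (1, 4)]
```

Passing the maximum sequence id to each task fixes the slice of posts it works on, so posts added while the cycle is running are left for the next cycle.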
Feed Task
- simple activity generator
get activities
- select posting userid, site network, app tool, activity type, activity data, post date
- from activity post
- where job task node = [job task id]
- and sequence id <= [max sequence id]
- activity templates are named <base activity type>.<format>.ftl, eg. create.atomentry.ftl
- stored in data dictionary
- Company Home -> Data Dictionary -> Activity Templates
- stored in namespace hierarchy, eg. reserved namespace is org.alfresco which maps to Company Home -> Data Dictionary -> Activity Templates -> org -> alfresco
- simple fallback mechanism
- if 'org.alfresco.folder.create' does not exist then will fallback to 'org.alfresco.create', or if this does not exist then 'org.alfresco.generic'
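The fallback order can be sketched as a short candidate list. This assumes the namespace (here `org.alfresco`) is known to the caller and that the base activity type is the last dot-separated segment; both are inferences from the example above rather than stated rules, and the function names are hypothetical.

```python
def template_candidates(activity_type, namespace):
    """Names to try, most specific first, ending with the generic template."""
    base = activity_type.rsplit(".", 1)[-1]
    candidates = [activity_type, f"{namespace}.{base}", f"{namespace}.generic"]
    # Remove duplicates while keeping order (e.g. when the type is already the base).
    return list(dict.fromkeys(candidates))

def resolve_template(activity_type, namespace, existing):
    """Return the first candidate present in `existing`, or None."""
    for name in template_candidates(activity_type, namespace):
        if name in existing:
            return name
    return None

print(template_candidates("org.alfresco.folder.create", "org.alfresco"))
# ['org.alfresco.folder.create', 'org.alfresco.create', 'org.alfresco.generic']
print(resolve_template("org.alfresco.folder.create", "org.alfresco",
                       {"org.alfresco.create", "org.alfresco.generic"}))
# org.alfresco.create
```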
Implementation Choices
Audit Trail
Discarded since the amount of re-use is likely to be minimal at this level. Also the detailed requirements/use-cases for fine-grained repository-driven audit events are different to those for coarse-grained application-driven activity events. For example, audit trails are typically required to be more long-lived compared to the more transitory nature of activity feeds.
The current audit mechanism is used by the repository to provide a low-level audit trail, using audit interceptors for public API methods. The audit mechanism also provides a simple interface for applications to set arbitrary custom audit events. In theory, one might consider using this custom audit trail in lieu of the activity posts. However, this would also require enhancements to provide additional features to enable grid-based processing, including a non-Hibernate data access layer, an option to delete audit entries, etc.
Hadoop/HBase
Discarded due to where the project is in its lifetime. Also, concerns over single master node, HBase reliability and black box algorithms, data stores. Requires much more research to understand fully before relying on in production systems.
DB, Processing Grid
Currently prototyping with MySQL & GridGain and/or JPPF. This gives us much more control over the implementation.
Tasks are distributed, hence, in addition to the actual task context, the associated class dependencies would need to be either installed at each grid node or ideally distributable via a distributed/p2p classloader:
- third-party libs
- JDBC driver
- SQL mapping layer (eg. iBatis)
- Freemarker engine
Can also optionally plug-in a local implementation by default, which can then be re-configured to a grid implementation, as needed.
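One way to picture the pluggable local/grid choice is to write the job against a generic executor interface and supply a local implementation by default. The sketch below uses Python's standard `concurrent.futures` as a stand-in; it does not show GridGain or JPPF APIs, and `feed_task` and `run_tasks` are hypothetical names.

```python
from concurrent.futures import Executor, ThreadPoolExecutor

def feed_task(job_task_node, max_sequence_id):
    # Placeholder for the real work of generating feed entries.
    return f"node {job_task_node}: processed posts up to {max_sequence_id}"

def run_tasks(executor: Executor, assignments):
    """Submit one feed task per (node, max sequence id) pair and collect results."""
    futures = [executor.submit(feed_task, node, max_seq)
               for node, max_seq in assignments]
    return [f.result() for f in futures]

# Default: a local thread pool. A grid-backed executor exposing the same
# submit() interface could be configured in its place.
with ThreadPoolExecutor(max_workers=2) as local_executor:
    results = run_tasks(local_executor, [(0, 3), (1, 4)])
print(results)
```

The calling code only depends on `submit()`, so swapping the executor does not change the job logic; the class-distribution concerns listed above apply only to the grid case.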