Recently I have been involved in an investigation task about the Alfresco Tracker Subsystem, with a specific focus on the Content Tracker. This post is the output of the analysis, which can also be found in the SearchServices repository, under the documentation folder.
The ContentTracker is part of the Alfresco Tracker Subsystem which is composed by the following members:
Each Solr that composes your search infrastructure (regardless it holds a monolithic core or a shard) has a singleton instance of each tracker type, which is registered, configured and scheduled at SolrCore startup. The SolrCore holds a TrackerRegistry for maintaining a list of all "active" tracker instances.
The ContentTracker queries for documents with "unclean" content (i.e. data whose content has been modified in Alfresco), and then updates them. Periodically, at a configurable frequency, the ContentTracker checks for transactions containing nodes that have been marked as "Dirty" (changed) or "New". Then,
Later, the CommitTracker will persist those changes.
The class diagram below provides a high-level overview about the main classes involved in the Tracker Subsystem.
As you can see from the diagram, there's an abstract interface definition (Tracker) which declares what is expected by a Tracker and a Layer Supertype (AbstractTracker) which adopts a TemplateMethod [1] approach. It provides the common behavior and features inherited by all trackers, mainly in terms of:
The Tracker behavior is defined in the track() method that each tracker must implement. As said above, the AbstractTracker forces a common behaviour on all trackers by declaring a final version of that method, and then it delegates to the concrete trackers (subclasses) the specific logic by requiring them the implementation of the doTrack() method.
Each tracker is a stateful object which is initialized, registered in a TrackerRegistry and scheduled at startup in the SolrCoreLoadRegistration class. The other relevant classes depicted in the diagram are:
The Trackers startup and registration flow is depicted in the following sequence diagram:
Solr provides, through the interface SolrEventListener, a notification mechanism for registering custom plugins during a SolrCorelifecycle. The Tracker Subsystem is initialized, configured and scheduled in the SolrCoreLoadListener which delegates the concrete work to SolrCoreLoadRegistration. Here, a new instance of each tracker is created, configured, registered and then scheduled by means of a Quartz Scheduler. Trackers can share a common frequency (as defined in the alfresco.cronproperty) or they can have a specific configuration (e.g. alfresco.content.tracker.cron).
The SolrCoreLoadRegistration also registers a shutdown hook which makes sure all registered trackers will follow the same hosting SolrCore lifecycle.
The sequence diagram below details what happens in a single tracking task executed by the ContentTracker:
At a given frequency (which again, can be the same for each tracker or overriden per tracker type) the Quartz Scheduler invokes the doTrack() method of the ContentTracker. Prior to that, the logic in the AbstractTracker is executed following the TemplateMethod [1] described above; specifically the "Running" lock is acquired and the tracker is put in a "Running" state.
Then the ContentTracker does the following:
In order to do that, the ContentTracker never uses directly the proxy towards ACS (i.e. the SOLRAPIClient instance); instead, it delegates that logic to the SolrInformationServer class. The first step (getDocsWithUncleanContent) searches in the local index all transactions which are associated to documents that have been marked as "Dirty" or "New". The field where this information is recorded is FTSSTATUS; it could have one of the following values:
The "Dirty" documents are returned as triples containing the tenant, the ACL identifier and the DB identifier.
NOTE: this first phase uses only the local Solr index, no remote call is involved.
If the list of Tenant/ACLID/DBID triples is not empty, that means we need to fetch and update the text content of the corresponding documents. In order to do that, each document is wrapped in a Runnable object and submitted to a thread pool executor. That makes each document content processing asynchronous.
The ContentIndexWorkerRunnable, once executed, delegates the actual update to the SolrInformationServer which, as said above, contains the logic needed for dealing with the underlying Solr infrastructure; specifically:
The Rollback sequence diagram illustrates how the rollback process works:
The commit/rollback process is a responsibility of the CommitTracker, so the ContentTracker is involved in these processes only indirectly.
When it is executed, the CommitTracker acquires the execution locks from the MetadataTracker and the AclTracker. Then it checks if one of them is in a rollback state. As we can imagine, that check will return true if some unhandled exception has occurred during indexing.
If one of the two trackers above reports an active rollback state, the CommitTracker lists all trackers, invalidates their state and issues a rollback command to Solr. That means any update sent to Solr by any tracker will be reverted.
The only source that the ContentTracker checks in order to determine the "unclean" content that needs to be updated is the local index. As consequence of that, the ContentTracker behavior is the same regardless the search infrastructure shape and the context where the hosting Solr instance lives. That is, if we are running a standalone Solr instance there will be one a ContentTracker for each core watching the corresponding (monolithic) index. If instead we are in a sharded scenario, each shard will have a ContentTracker instance that will use the local shard index.
In order to properly work in a Master/Slave infrastructure, the Tracker Subsystem (not the only ContentTracker) needs to be
The only exceptions to that rule are about:
The document file in the SearchService repository provides an additional paragraph with the configuration attributes related with the Tracker subsystem. I didn't put that long table in this post because it doesn't add any information: if you need to configure the trackers just have a look at the end of that document.
The Tracker Subsystem is one of the main areas where the Search Team is devolving analysis and investigation efforts: that will allow to find a space for introducing further improvements in the architecture.
-------
[1] https://en.wikipedia.org/wiki/Template_method_pattern
[2] http://docs.alfresco.com/5.1/concepts/solr-shard-config.html