
This page is obsolete.

The official documentation is at: http://docs.alfresco.com



High Availability
Back to Server Configuration




Introduction


This page describes some of the components and configuration options for running multiple server instances together in a cluster. Sample configurations and configuration snippets drawn from real-world solutions are given; many other combinations of configurations and levels of complexity are left out. Feel free to contribute further examples.



Note: The configurations here apply from Alfresco V2.1.3 until they became obsolete in 4.2, and do not apply to WCM installations (which cannot be clustered in the 2.x or 3.x series).

Also note that clustering is Enterprise-only as of 4.2, so please refer to the Enterprise documentation. While some of the principles still apply, some of the technologies, such as JGroups, are no longer used.

This document assumes knowledge of how to extend the server configuration (see Repository Configuration).
The following naming convention applies:


  • <configRoot> - the location of the default configuration files for a standard Alfresco installation (for example, file <configRoot>/alfresco/application-context.xml)
  • <extConfigRoot> - the externally located configuration location (for example, file <extConfigRoot>/alfresco/extension/custom-repository-context.xml)

High availability components


NB: As of V3.1, the cluster components will only initialize if the following property is defined in alfresco-global.properties:

 alfresco.cluster.name=<CLUSTERNAME>

See the page Configuring JGroups and Alfresco Clusters.


JLAN clustering


The JLAN protocols (FTP, CIFS, NFS) can be clustered from Alfresco version 4.0 onwards. For Enterprise versions, CIFS/JLAN clustering is available as of Alfresco 4.0.1.
Configuring Hazelcast for JLAN clustering


Client session replication


This is supported in Alfresco V2.2.0 onwards.


Content store replication


The underlying content binaries are distributed either by sharing a common content store between all machines in a cluster or by replicating content between the clustered machines and one or more shared stores.


Index synchronization


The indexes provide searchable references to all nodes in Alfresco. The index store is transaction-aware and cannot be shared between servers. The indexes can be recreated from references to the database tables. In order to keep the indexes up to date (indicated by boxes 'Index A' and 'Index B') a timed thread updates each index directly from the database. When a node is created (this may be a user node, content node, space node, and so on) metadata is indexed and the transaction information is persisted in the database. When the synchronization thread runs, the index is updated using the transaction information stored in the database.

On the ADM side of the database schema, the tables that contain the index tracking information are:


  • alf_server

Each committing server has an entry in this table.


  • alf_transaction

Each write operation to nodes in the ADM generates a unique transaction ID. The transaction is attached to the entry in the server table. The column commit_time_ms ensures that the transactions can be ordered by commit time. A unique GUID (due to historical reasons) is also attached here.


  • alf_node (See alf_node_status on 2.2SP2 and earlier)

An entry is created here for each node, including nodes that have been deleted. With each modification, the status is updated to refer to the transaction entry for the committing transaction.

When indexes are replayed on a machine (for rebuilding or tracking), the transactions are time-ordered. The server can then determine which nodes have been modified or deleted in a particular transaction. Using this information, it submits a list of modifications to the Lucene-based indexing code.
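As a rough illustration of the tracking described above, the following conceptual sketch replays transactions in commit-time order and lists the nodes touched by each one. It is not Alfresco code: the join column alf_node.transaction_id, the JDBC URL and the credentials are assumptions for illustration only, and the real logic lives inside Alfresco's index tracking components.

import java.sql.*;

// Conceptual sketch only: replay transactions in commit-time order and find the
// nodes each one touched. Requires a JDBC driver on the classpath; connection
// details and alf_node.transaction_id are illustrative assumptions.
public class IndexTrackingSketch {
    public static void main(String[] args) throws SQLException {
        long lastTrackedCommitTimeMs = Long.parseLong(args[0]); // where tracking last stopped

        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://db-host/alfresco", "alfresco", "secret")) {

            // 1. Fetch transactions committed since the last tracked point, ordered by commit time.
            try (PreparedStatement txnStmt = con.prepareStatement(
                    "SELECT id, commit_time_ms FROM alf_transaction " +
                    "WHERE commit_time_ms > ? ORDER BY commit_time_ms ASC")) {
                txnStmt.setLong(1, lastTrackedCommitTimeMs);
                try (ResultSet txns = txnStmt.executeQuery()) {
                    while (txns.next()) {
                        long txnId = txns.getLong("id");

                        // 2. Find the nodes created, updated or deleted in this transaction;
                        //    the real tracker would hand these to the Lucene indexing code.
                        try (PreparedStatement nodeStmt = con.prepareStatement(
                                "SELECT id, node_deleted FROM alf_node WHERE transaction_id = ?")) {
                            nodeStmt.setLong(1, txnId);
                            try (ResultSet nodes = nodeStmt.executeQuery()) {
                                while (nodes.next()) {
                                    System.out.printf("txn %d -> node %d (deleted=%b)%n",
                                            txnId, nodes.getLong("id"), nodes.getBoolean("node_deleted"));
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}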


Cache replication


The Level 2 cache provides out-of-transaction caching of Java objects inside the Alfresco system. Alfresco only provides support for EHCache. This guide describes the synchronization of EHCache across clusters. Using EHCache does not restrict the Alfresco system to any particular application server, so it is completely portable.


Database synchronization


It is possible to have co-located databases which synchronize with each other. This guide describes the setup for a shared database. If you wish to use co-located databases then refer to your database vendor's documentation to do this.
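In the shared-database case, every node in the cluster simply points at the same database server. A minimal sketch of the relevant alfresco-global.properties entries (host name and credentials are illustrative):

db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://shared-db-host:3306/alfresco
db.username=alfresco
db.password=alfresco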


Scenarios


Simple repository clustering


(Diagram: Alfresco_LB_Diagram.png)


  • Single, shared content store
  • Sticky client sessions

In this scenario, we have a single repository database and file system (for the content store) and multiple web app servers accessing the content simultaneously. This configuration does not guard against repository file system or database failure, but allows multiple web servers to share the web load, and provides redundancy in case of a web server failure. Each web server has local indexes (on the local file system).



For this example we will utilize a hardware load balancer to balance the web requests among multiple web servers. The load balancer must support 'sticky' sessions so that each client always connects to the same server during the session. The file system and database will reside on separate servers, which allows us to use alternative means for file system and database replication. The configuration in this case will consist of L2 Cache replication and index synchronization.
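Although this example assumes a hardware load balancer, a software balancer works the same way. A minimal sketch using Apache httpd's mod_proxy_balancer with sticky sessions (host names, ports and route names are illustrative; each route must match the jvmRoute configured on the corresponding Tomcat):

<Proxy balancer://alfrescocluster>
    BalancerMember http://server-a:8080 route=node1
    BalancerMember http://server-b:8080 route=node2
    ProxySet stickysession=JSESSIONID|jsessionid
</Proxy>
ProxyPass        /alfresco balancer://alfrescocluster/alfresco
ProxyPassReverse /alfresco balancer://alfrescocluster/alfresco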


Sample cluster setup


A sample guide to setting up a two-node cluster as per the above diagram can be found here.


L2 cache replication


We will use EHCache replication to update every cache in the cluster with changes from each server. This is accomplished by overriding the default EHCache configuration. The complete configuration is given in the sample extension file ehcache-custom.xml.sample.cluster.

ehcache-custom.xml



<ehcache>
    <diskStore
      path='java.io.tmpdir'/>

    <cacheManagerPeerProviderFactory
            class='net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory'
            properties='peerDiscovery=automatic,
                        multicastGroupAddress=230.0.0.1,
                        multicastGroupPort=4446'/>

    <cacheManagerPeerListenerFactory
        class='net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory'
    />

...
</ehcache>

NOTE: From 3.1 on, cache replication is based on JGroups (see the ECM admin guide for more information). In particular,



alfresco.cluster.name=someclustername

has to be set in alfresco-global.properties so that JGroups starts.
JGroups issues can be debugged using:



log4j.logger.org.alfresco.repo.jgroups=debug
log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=debug

Lucene index synchronization


The above configuration has been pulled into internal context files.  Properties changing the reindex behaviour are defined and available for overriding in the general repository properties file:

<conf>/alfresco/repository.properties



# ######################################### #
# Index Recovery and Tracking Configuration #
# ######################################### #
#
# Recovery types are:
#    NONE:     Ignore
#    VALIDATE: Checks that the first and last transaction for each store is represented in the indexes
#    AUTO:     Validates and auto-recovers if validation fails
#    FULL:     Full index rebuild, processing all transactions in order.  The server is temporarily suspended.
index.recovery.mode=VALIDATE
# FULL recovery continues when encountering errors
index.recovery.stopOnError=false
index.recovery.maximumPoolSize=5
# Set the frequency with which the index tracking is triggered.
# For more information on index tracking in a cluster:
#    http://wiki.alfresco.com/wiki/High_Availability_Configuration_V1.4_to_V2.1#Version_1.4.5.2C_2.1.1_and_later
# By default, this is effectively never, but can be modified as required.
#    Examples:
#       Never:                   * * * * * ? 2099
#       Once every five seconds: 0/5 * * * * ?
#       Once every two seconds : 0/2 * * * * ?
#       See http://quartz.sourceforge.net/javadoc/org/quartz/CronTrigger.html
index.tracking.cronExpression=* * * * * ? 2099
index.tracking.adm.cronExpression=${index.tracking.cronExpression}
index.tracking.avm.cronExpression=${index.tracking.cronExpression}
# Other properties.
index.tracking.maxTxnDurationMinutes=10
index.tracking.reindexLagMs=1000
index.tracking.maxRecordSetSize=1000
index.tracking.maxTransactionsPerLuceneCommit=100
index.tracking.disableInTransactionIndexing=false

The triggers for ADM (document management) and AVM (web content management, where applicable) index tracking are combined into one property for simplicity. These can be set separately, if required (an example follows further below). The following properties should typically be modified in the clustered environment:

<ext-conf>/alfresco/extension/custom-repository.properties:



index.tracking.cronExpression=0/5 * * * * ?
index.recovery.mode=AUTO

Setting the recovery mode to AUTO will ensure that the indexes are fully recovered if missing or corrupt, and will top the indexes up during bootstrap in the case where the indexes are out of date. This happens frequently when a server node is introduced to the cluster. With AUTO, a server can therefore be started with backup, stale or no indexes at all.
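As noted above, the ADM and AVM tracking triggers can also be scheduled independently rather than through the combined property. A small sketch with hypothetical schedules:

index.tracking.adm.cronExpression=0/5 * * * * ?
index.tracking.avm.cronExpression=0/30 * * * * ?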

NOTE: The index tracking relies heavily on the approximate commit time of transactions. This means that machines in a cluster need to be time-synchronized; the more accurate, the better. The configuration above only triggers tracking every 5 seconds and enforces a minimum transaction age of 1 second, which is controlled by the property index.tracking.reindexLagMs. For example, if the clocks of the machines in the cluster can only be guaranteed to within 5 seconds, then the tracking properties might look like this:

<ext-conf>/alfresco/extension/custom-repository.properties:



index.tracking.cronExpression=0/5 * * * * ?
index.recovery.mode=AUTO
index.tracking.reindexLagMs=10000

The index tracking keeps a history of transactions that might yet be committed based on the IDs of new transactions that appear.  In order to ensure that long-running transactions are properly detected in a cluster, the maximum transaction duration needs to be set:

<ext-conf>/alfresco/extension/custom-repository.properties:



index.tracking.maxTxnDurationMinutes=10

The number of concurrent threads that will be assigned to index rebuilding can be set. Each of these threads picks up a batch of transactions and indexes them as a single Lucene batch. This reduces IO and contention for Lucene merges and makes index rebuilding faster, especially during a full index rebuild. The defaults cater for a low throughput of small write transactions, but will perform full index rebuilding in good-sized batches. If your use cases include many large transactions, it would be better to lower the number of transactions in the index batch.

<ext-conf>/alfresco/extension/custom-repository.properties:



index.recovery.maximumPoolSize=5
index.tracking.maxTransactionsPerLuceneCommit=100
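For example, for a workload dominated by large transactions, the Lucene commit batch might be reduced (hypothetical value):

index.tracking.maxTransactionsPerLuceneCommit=10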

Local and shared content stores


(Diagram: HA_ClusteredServer.jpg)


Scenario


  • Local content store replicated to shared store

Read the Content Store Configuration as an introduction.

This scenario extends the previous examples by showing how to give each machine a local content store that replicates data to and from a shared location. This may be required if there is high latency when communicating with the shared device, or if the shared device doesn't support fast random-access read/write file access.


Content replication


Assume that both Server A and Server B (and all servers in the cluster) store their content locally in /var/alfresco/content-store.  The Shared Backup Store is visible to all servers as /share/alfresco/content-store.  The following configuration override must be applied to all servers:
<ext-config>/alfresco/extension/replicating-content-services-context.xml.sample:



   <!-- Primary store: content is written to each server's local disk -->
   <bean id='localDriveContentStore' class='org.alfresco.repo.content.filestore.FileContentStore'>
      <constructor-arg>
         <value>/var/alfresco/content-store</value>
      </constructor-arg>
   </bean>
   <!-- Secondary store: the shared location visible to all servers -->
   <bean id='networkContentStore' class='org.alfresco.repo.content.filestore.FileContentStore'>
      <constructor-arg>
         <value>/share/alfresco/content-store</value>
      </constructor-arg>
   </bean>
   <!-- Replicating store: 'inbound' pulls content missing locally from the shared store,
        'outbound' pushes locally written content out to the shared store -->
   <bean id='fileContentStore' class='org.alfresco.repo.content.replication.ReplicatingContentStore' >
      <property name='primaryStore'>
         <ref bean='localDriveContentStore' />
      </property>
      <property name='secondaryStores'>
         <list>
            <ref bean='networkContentStore' />
         </list>
      </property>
      <property name='inbound'>
         <value>true</value>
      </property>
      <property name='outbound'>
         <value>true</value>
      </property>
      <property name='retryingTransactionHelper'>
         <ref bean='retryingTransactionHelper'/>
      </property>
   </bean>

Read-only clustered server


Scenario


  • Some servers are read-only

It is possible to bring a server up as part of the cluster, but force all transactions to be read-only.  This effectively prevents any database writes.


Custom repository properties


<ext-config>/alfresco/extension/custom-repository.properties:



# the properties below should change in tandem
server.transaction.mode.default=PROPAGATION_REQUIRED, readOnly
server.transaction.allow-writes=false
#server.transaction.mode.default=PROPAGATION_REQUIRED
#server.transaction.allow-writes=true

Verifying the cluster


This section addresses the steps required to start the clustered servers and test the clustering after the necessary configuration changes have been made to the servers.


Installing and upgrading on a cluster


Clean installation


  • Start Alfresco on one of the clustered machines and ensure that the bootstrap process proceeds normally.
  • Test the system's basic functionality:
    • Login as different users
    • Access some content
    • Perform a search
    • Modify some properties

Upgrades


  • Upgrade the WAR and config for all machines in the cluster.
  • Start Alfresco on one of the clustered machines and ensure that the upgrade proceeds normally.
  • Test the system's basic functionality.
  • Start each machine in the cluster one at a time.

Testing the cluster


There is a set of steps that can be followed to verify that clustering is working for the various components involved. You will need direct web client access to each of the machines in the cluster. Each operation is done on machine M1 and verified on the other machines Mx. The process can be switched around, with any machine being chosen as M1.


Cache clustering


  • M1: Login as admin.
  • M1: Browse to the Guest Home space, locate the tutorial PDF document and view the document properties.
  • Mx: Login as admin.
  • Mx: Browse to the Guest Home space, locate the tutorial PDF document and view the document properties.
  • M1: Modify the tutorial PDF's description field, adding 'abcdef'.
  • Mx: Refresh the document properties view of the tutorial PDF document.
  • Mx: The description field must have changed to include 'abcdef'.

Index clustering


  • ... Repeat Cache Clustering.
  • M1: Perform an advanced search of the description field for 'abcdef' and verify that the tutorial document is returned.
  • Mx: Search the description field for 'abcdef'.  As long as enough time was left for the index tracking (10s or so), the document must show up in the search results.

Content replication/sharing


  • M1: Add a text file to Guest Home containing 'abcdef'.
  • Mx: Refresh the view of Guest Home and verify that the document is visible.
  • Mx: Open the document and ensure that the correct text is visible.
  • Mx: Perform a simple search for 'abcdef' and ensure that the new document is retrieved.  This relies on index tracking, so it may take a few seconds for the document to be visible.

Tracking issues


The following log categories can be enabled to help track issues in the cluster:


  • log4j.logger.net.sf.ehcache.distribution=DEBUG: Check that heartbeats are received from live machines (Community)
  • log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=DEBUG: Check heartbeat message sending and receiving (Enterprise)
  • log4j.logger.org.alfresco.repo.node.index.IndexTransactionTracker=DEBUG: Remote index tracking for ADM.
  • log4j.logger.org.alfresco.repo.node.index.AVMRemoteSnapshotTracker=DEBUG: Remote index tracking for AVM.



For JGroups-based clusters, the following is also useful:



log4j.logger.org.alfresco.repo.jgroups=debug
log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=debug

With debug logging on, check that the IP addresses used are correct for both the JGroups connection and the EHCache peer URLs. These can be changed using configuration.



<cacheManagerPeerListenerFactory
class='net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory'
properties='hostName=w.x.y.z, port=40001, socketTimeoutMillis=5000'/>

If cache clustering is not working, the EHCache website describes some common problems: EHCache Documentation.  The remote debugger can be downloaded as a separate bundle from the EHCache distribution library  and executed:

> tar -xzvf ehcache-debugger-1.7.1-distribution.tgz
> cd ehcache-debugger-1.7.1/target
> mkdir db
> cd db
> jar xf ../ehcache-debugger-1.7.1.jar
> cd ../..
> java -cp slf4j-api-1.5.8.jar:slf4j-jdk14-1.5.8.jar:target/db:/usr/share/alfresco/alfresco.war/WEB-INF/lib/\* \
  net.sf.ehcache.distribution.RemoteDebugger  \
  /usr/share/alfresco/extensions/extension/ehcache-custom.xml \
  org.alfresco.repo.domain.hibernate.NodeImpl
> java -jar ehcache-1.3.0-remote-debugger.jar

Cluster Validation Tool


The cluster validation tool validates the connection between each node in the cluster by checking whether their mutual shared caches are synchronising. When a validation check is initiated a request is sent to each node in the cluster, triggering the node to validate its connection with every other node in the cluster. The validating node authenticates as the admin user and generates a ticket which it then sends to every other node in the cluster as part of a validation request. The receiving node checks whether the ticket is present in its cache; if it is, the connection with the validating node is regarded as valid and the response indicates so, otherwise the connection is regarded as invalid and the response indicates this. The presence of nodes in the cluster is remembered unless they are explicitly removed from the tracking.

The cluster validation tool is exposed through JMX under 'Alfresco - Cluster' (in 4.2 onwards). Each discovered cluster node, uniquely identified by a GUID, has an entry showing the connection status of the other nodes it knows about. A cluster check can be initiated by executing the 'clusterCheck' operation. The 'getFailingNodePairs' operation will return a list of those cluster node pairs that are not working. The 'stopChecking' operation, when passed the GUID of a cluster node, will stop that node from being tracked (if, for example, it has been removed from the cluster). Subscribers to notifications will be informed when nodes are detected and when node pair status changes.


Common mistakes


Several clusters


If you have several Alfresco clusters on the same network, you need to pay special attention to the addresses and ports used for EHCache UDP broadcasts (see the parameters multicastGroupAddress and multicastGroupPort above). These parameters must be unique to each cluster. If you use the same parameters for all the clusters, then each cluster will try to communicate with the other clusters, leading to all sorts of issues.
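For example, each environment could be given its own multicast group in its ehcache-custom.xml (illustrative addresses and ports):

<!-- Production cluster -->
properties='peerDiscovery=automatic,
            multicastGroupAddress=230.0.0.1,
            multicastGroupPort=4446'

<!-- Test/validation cluster -->
properties='peerDiscovery=automatic,
            multicastGroupAddress=230.0.0.2,
            multicastGroupPort=4447'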

A common situation is where you have a Testing/Validation Alfresco cluster environment and a production Alfresco cluster environment on the same network. When you copy/transfer your testing configuration files to production, you need to remember to change the parameters.


Hostnames


Make sure that /etc/hosts does not include the machine's hostname on the localhost line, otherwise the EHCache cluster will not work.
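A sketch of the problem and the fix (host name and address are illustrative):

# Problematic: the hostname resolves to the loopback address, so cache peers advertise 127.0.0.1
127.0.0.1    localhost alfresco-node1

# Preferred: keep localhost on its own line and map the hostname to its real interface
127.0.0.1    localhost
192.168.1.10 alfresco-node1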




Back to Server Administration Guide