HA Clustered Servers issues

viv
Champ in-the-making
Hi everyone,

I've configured a high-availability setup between two servers, as shown in this case.

After some tests, I've run into a few problems:
If a network interruption occurs, some of the multicast packets used by Ehcache replication are lost, and file properties aren't updated. In that case there are differences between my two servers.
Another problem occurs when a server goes down. The NFS link goes down too, and the content store replication fails. In that case the server that is still up and running can't add files to its own content store.

In a nutshell, Alfresco replication works like a charm, but problems appear when one of the servers breaks down.

Is there anything I should change to resolve these problems?

Thanks.

derek
Star Contributor
Hi,

If the multicast packets are lost, then the servers will deregister each other from the cluster and you can have each server acting on its own, effectively allowing a mismatch between the servers.  If this is a frequent occurrence, then give the caches a timeout value.
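For example, in ehcache-custom.xml you can cap how long entries live, using the standard Ehcache attributes (a minimal sketch; the cache name and sizes here are just placeholders):

<cache name="org.alfresco.cache.ticketsCache"
       maxElementsInMemory="1000"
       eternal="false"
       timeToLiveSeconds="300"
       overflowToDisk="false"/>

With timeToLiveSeconds set, an entry that missed a replication message can only stay stale for that long before it is dropped and re-read.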

The filesystem going down is something you'll need to fix yourself.  You can configure Alfresco to use any number of stores, but they all need to be valid and available for the server.  In the Wiki config, the servers both use a shared filesystem.  The servers will need to get access to that before they can run.
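For instance, in a 2.x install both nodes would point dir.root at the same shared mount (a sketch; the path is a placeholder):

# custom-repository.properties, identical on both nodes
dir.root=/mnt/alfresco-shared/alf_data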

Regards

subspawn
Champ in-the-making
viv:
The NFS issue can be resolved by setting the timeout, retransmission & soft/hard parameters on the client (mounting) side correctly. Please see http://www.faqs.org/docs/linux_network/x-087-2-nfs.mountd.html or man nfs(5) for more information on that.
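For example, an /etc/fstab entry along these lines keeps the client from hanging forever when the server disappears (host, export path and values are placeholders; see nfs(5) for the exact semantics of soft mounts):

# soft mount: fail I/O after 3 retries of 10 s each instead of blocking forever
nfsserver:/export/alf_data  /mnt/alfresco-shared  nfs  soft,timeo=100,retrans=3  0  0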

You'll probably want your cluster nodes to survive an STP/RSTP topology rebuild. In the case of RSTP I'd say 15-20 seconds should be sufficient. With STP, on the other hand, try to establish a direct node-to-node link if you can; STP rebuilds can clog your network for quite a long time.


I'm having a similar problem concerning multicast, but I think it's not that easily fixed. Alfresco uses Ehcache for replication of indexes and such, and by default multicasting is assumed to work. I'm stuck in an environment where it simply does not work (for various technical reasons). In the Ehcache documentation I found a way around multicasting by making both nodes aware of each other in the config files:

node1

<cacheManagerPeerProviderFactory
        class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
        properties="peerDiscovery=manual,
                    rmiUrls=//node2:40000/sampleCache1|//node2:40000/sampleCache2"/>
<cacheManagerPeerListenerFactory
        class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
        properties="port=40000, socketTimeoutMillis=5000"/>

node2

<cacheManagerPeerProviderFactory
        class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
        properties="peerDiscovery=manual,
                    rmiUrls=//node1:40000/sampleCache1|//node1:40000/sampleCache2"/>
<cacheManagerPeerListenerFactory
        class="net.sf.ehcache.distribution.RMICacheManagerPeerListenerFactory"
        properties="port=40000, socketTimeoutMillis=5000"/>

Has anyone ever tried (and hopefully succeeded) in manually linking 2 nodes to each other? If so … how? 🙂 Index replication does not seem to work with the above config. The content store is also a mutual NFS share, for ease of use 🙂

derek
Star Contributor
Hi,
The indexes don't replicate, per se.  They follow along via a Quartz job running locally on each server that replays transactions against the index.
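A sketch of the idea only: a locally scheduled Quartz trigger that fires the index tracker on each node. The bean id, job class, and cron value below are illustrative assumptions on my part, not the exact wiki definitions; check the HA wiki page for the real beans.

<bean id="indexTrackerTrigger" class="org.alfresco.util.CronTriggerBean">
    <property name="jobDetail">
        <bean class="org.springframework.scheduling.quartz.JobDetailBean">
            <!-- hypothetical job class that invokes the local index tracker -->
            <property name="jobClass">
                <value>org.alfresco.repo.node.index.IndexRecoveryJob</value>
            </property>
        </bean>
    </property>
    <property name="scheduler">
        <ref bean="schedulerFactory"/>
    </property>
    <!-- replay outstanding transactions into the local index every 10 s -->
    <property name="cronExpression">
        <value>0/10 * * * * ?</value>
    </property>
</bean>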

subspawn
Champ in-the-making
I see. I did have that configured as described at http://wiki.alfresco.com/wiki/High_Availability_Configuration_V1.4_to_V2.1 under 'Lucene Index Synchronization'.

But I'm left with the following issues:
If I shut down both nodes and restart them, they need a full index rebuild or they give "CONTENT INTEGRITY ERROR: Indexes not found for 1 stores."
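For reference, a full rebuild can be forced via the index recovery mode; a sketch, assuming I have the 2.x property name right:

# custom-repository.properties; property name assumed from the 2.x docs
index.recovery.mode=FULL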

Is there an easy way to check in the logs that both nodes actually communicate with each other? All the tests on the wiki page fail. When I turn on the debug levels, I see the following on both nodes:
DEBUG [node.index.IndexRemoteTransactionTracker] Performing index tracking from txn 204
etc… but no errors and no kind of heartbeat message like "hello, I'm node 1 with IP…" in the log of node 2.

If I check netstat, both nodes have port 40000 open, but no established connection. Perhaps I have to change the RMI URL names "sampleCache1" & 2? How should they be named then?
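For reference, this is the kind of check I mean (run on either node):

# the RMI listener is up on both nodes, but no peer connection appears
netstat -an | grep 40000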

derek
Star Contributor
You have to list all the caches in that manual config.  It's a pain.

subspawn
Champ in-the-making
I presume you're talking about the caches listed here?:
tomcat/webapps/alfresco/WEB-INF/classes/alfresco/cache-context.xml

I've added the ones with "Shared" in the name to the rmiUrls:
//node1:40000/parentAssocsSharedCache
//node1:40000/userToAuthoritySharedCache
//node1:40000/permissionsAccessSharedCache
//node1:40000/nodeOwnerSharedCache
//node1:40000/personSharedCache
//node1:40000/ticketsSharedCache
//node2:40000/parentAssocsSharedCache
//node2:40000/userToAuthoritySharedCache
//node2:40000/permissionsAccessSharedCache
//node2:40000/nodeOwnerSharedCache
//node2:40000/personSharedCache
//node2:40000/ticketsSharedCache

Or am I totally mistaken and/or missing some caches that should be manually added (and if so, where do I find them)? I still haven't got it to work. Thanks for the great pointers so far, by the way 🙂

derek
Star Contributor
When I said it was a pain, I really meant it.  No, you need to list all the caches in ehcache-custom.xml.sample.cluster.

subspawn
Champ in-the-making
You definitely weren't kidding 🙂

Damn, what a list. Anyhow… I got it to work using this config:

<cacheManagerPeerProviderFactory
        class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
        properties="peerDiscovery=manual,
rmiUrls=//node1:40000/org.hibernate.cache.StandardQueryCache|
//node1:40000/org.hibernate.cache.UpdateTimestampsCache|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.QNameEntityImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeStatusImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeImpl.aspects|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeImpl.properties|
//node1:40000/org.alfresco.repo.domain.hibernate.ChildAssocImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeImpl.sourceNodeAssocs|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeImpl.targetNodeAssocs|
//node1:40000/org.alfresco.repo.domain.hibernate.NodeAssocImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.StoreImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.VersionCountImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.AppliedPatchImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.DbAccessControlListImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.DbAccessControlListImpl.entries|
//node1:40000/org.alfresco.repo.domain.hibernate.DbAccessControlEntryImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.DbPermissionImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.DbAuthorityImpl|
//node1:40000/org.alfresco.repo.domain.hibernate.DbAuthorityImpl.externalKeys|
//node1:40000/org.alfresco.cache.parentAssocsCache|
//node1:40000/org.alfresco.cache.userToAuthorityCache|
//node1:40000/org.alfresco.cache.permissionsAccessCache|
//node1:40000/org.alfresco.cache.nodeOwnerCache|
//node1:40000/org.alfresco.cache.personCache|
//node1:40000/org.alfresco.cache.ticketsCache|
//node2:40000/org.hibernate.cache.StandardQueryCache|
//node2:40000/org.hibernate.cache.UpdateTimestampsCache|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.QNameEntityImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeStatusImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeImpl.aspects|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeImpl.properties|
//node2:40000/org.alfresco.repo.domain.hibernate.ChildAssocImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeImpl.sourceNodeAssocs|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeImpl.targetNodeAssocs|
//node2:40000/org.alfresco.repo.domain.hibernate.NodeAssocImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.StoreImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.VersionCountImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.AppliedPatchImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.DbAccessControlListImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.DbAccessControlListImpl.entries|
//node2:40000/org.alfresco.repo.domain.hibernate.DbAccessControlEntryImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.DbPermissionImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.DbAuthorityImpl|
//node2:40000/org.alfresco.repo.domain.hibernate.DbAuthorityImpl.externalKeys|
//node2:40000/org.alfresco.cache.parentAssocsCache|
//node2:40000/org.alfresco.cache.userToAuthorityCache|
//node2:40000/org.alfresco.cache.permissionsAccessCache|
//node2:40000/org.alfresco.cache.nodeOwnerCache|
//node2:40000/org.alfresco.cache.personCache|
//node2:40000/org.alfresco.cache.ticketsCache"/>
I had to start both nodes at the same time, but once started they appear to stay in sync and update each other 🙂

Thank you very much 🙂 !!!

yulimin
Champ in-the-making
Do we really HAVE TO start both nodes at the same time?