Cluster content inconsistencies

crichsource360
Champ in-the-making
Hello,

We have a two-node Alfresco cluster, with content being pushed to the repository via WebDAV. Every once in a while, the opposite node does not have the updated file. The database has the updated information, but the cached object (from Hibernate, I presume) is not updated. So when any code accesses the file on node1, it's correct, but when you access it on node2, it is out of date. The servers have to be restarted for node2 to reflect the latest changes.
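
To catch the divergence as it happens rather than after someone notices, something like the following could be run periodically against both nodes. This is only a rough sketch: the base URLs, the example path, and the helper names (`http_fetcher`, `find_mismatches`) are made up for illustration, and authentication is left out entirely, so a real WebDAV endpoint would need credentials added.

```python
import hashlib
import urllib.request
from typing import Callable, Iterable, List

# Hypothetical WebDAV base URLs for the two cluster nodes (assumptions).
NODE1 = "http://node1.example.com:8080/alfresco/webdav"
NODE2 = "http://node2.example.com:8080/alfresco/webdav"


def http_fetcher(base_url: str) -> Callable[[str], bytes]:
    """Build a function that downloads a repository path from one node.

    No authentication is configured here; add it for a real server.
    """
    def fetch(path: str) -> bytes:
        with urllib.request.urlopen(base_url + path, timeout=30) as resp:
            return resp.read()
    return fetch


def digest(data: bytes) -> str:
    """Hash the content so the two copies can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def find_mismatches(
    fetch_a: Callable[[str], bytes],
    fetch_b: Callable[[str], bytes],
    paths: Iterable[str],
) -> List[str]:
    """Return the paths whose content differs between the two nodes."""
    return [p for p in paths if digest(fetch_a(p)) != digest(fetch_b(p))]


# Example use (performs real HTTP requests, so it is left commented out):
# stale = find_mismatches(http_fetcher(NODE1), http_fetcher(NODE2),
#                         ["/Company%20Home/example.txt"])
# print(stale)
```

Logging the time of each mismatch would at least narrow down which stretch of the JGroups/EHCache logs to look at.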

We increased the RMI socket timeout, which seems to have made a small improvement. However, it's still happening, and the timeout is LONG (hours).
alfresco.ehcache.rmi.sockettimeoutmillis

We are using TCP for JGroups, and I've increased logging, but it's like finding a needle in a haystack.

However, I also realize that the ehcache objects are configured to never expire.  Our ehcache_custom.xml has entries like this:
    <cache
        name="org.alfresco.repo.domain.hibernate.NodeImpl"
        maxElementsInMemory="10000"
        eternal="false"
        timeToIdleSeconds="900"
        timeToLiveSeconds="900"
        overflowToDisk="false">

            <cacheEventListenerFactory
                    class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
                    properties="replicatePuts = false,
                                replicateUpdates = true,
                                replicateRemovals = true,
                                replicateUpdatesViaCopy = false,
                                replicateAsynchronously = false"/>
    </cache>

I'm concerned that expiring cache entries would shift load onto the database, which is not desirable.

Does anybody have any advice to help improve the situation?

UPDATE: Digging a little more, I realize I may be going down the wrong path. I believe it's using JGroups instead of EHCache, so my change probably didn't affect anything! Here is the relevant entry from ehcache_custom.xml:
    <cacheManagerPeerProviderFactory
        class="org.alfresco.repo.cache.AlfrescoCacheManagerPeerProviderFactory"
        properties="heartbeatInterval=5000,
                    peerDiscovery=automatic,
                    multicastGroupAddress=230.0.0.1,
                    multicastGroupPort=4446"/>

I guess this means it's using JGroups?  So maybe a timeout in JGroups?  The alfresco-jgroups-TCP.xml is this:
<config>
    <TCP bind_port="${alfresco.tcp.start_port:7800}"
         loopback="true"
         recv_buf_size="20000000"
         send_buf_size="640000"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="false"
         sock_conn_timeout="300"
         skip_suspected_members="true"
        
         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="25"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="run"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="run"/>
                        
    <TCPPING timeout="3000"
             initial_hosts="${alfresco.tcp.initial_hosts:localhost[7800]}"
             port_range="${alfresco.tcp.port_range:3}"
             num_initial_members="2"/>
    <MERGE2 max_interval="30000"
              min_interval="10000"/>
    <FD_SIMPLE timeout="10000" max_missed_hbs="10" />
    <VERIFY_SUSPECT timeout="1500"  />
    <BARRIER />
    <pbcast.NAKACK
                   use_mcast_xmit="false" gc_lag="0"
                   retransmit_timeout="300,600,1200,2400,4800"
                   discard_delivered_msgs="true"/>
    <UNICAST timeout="300,600,1200" />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>
    <VIEW_SYNC avg_send_interval="60000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <FC max_credits="2000000"
        min_threshold="0.10"/>
    <FRAG2 frag_size="60000"  />
    <pbcast.STREAMING_STATE_TRANSFER/>
    <!-- <pbcast.STATE_TRANSFER/> -->
</config>


Any help would be appreciated.


Thanks,
4 REPLIES

yogeshpj
Star Contributor
Hi,

Which version of Alfresco are you using?
As a quick check, can you please verify that both nodes' system clocks are in sync?

crichsource360
Champ in-the-making
This is Version 3.1 Enterprise.
Definitely, the clocks are synchronized and verified to be working correctly.

mrogers
Star Contributor
If it's 3.1 Enterprise, then please call Alfresco support.

It's Alfresco OEM'd into Adobe LiveCycle and support is ending for this.  I would like to solve this outside of support.