Hyland Connect

crichsource360 · ‎06-04-2014

Hello,

We have a two-node Alfresco cluster, with content being pushed to the repository via WebDAV. Every once in a while, the opposite node does not have the updated file. The database has the updated information, but the cached object (from Hibernate, I presume) is not updated. So when any code accesses the file on node1 it's correct, but when you access it on node2 it is stale. The servers have to be restarted for node2 to reflect the latest changes.
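
One way to catch the divergence as it happens is to fetch the same path from each node and compare checksums. The following is only a minimal Python sketch under stated assumptions: the helper names `http_fetch`, `digest_by_node`, and `nodes_disagree` are hypothetical, and it assumes the content can be read with a plain HTTP GET without authentication, which may not hold for a real installation.

```python
import hashlib
from urllib.request import urlopen


def http_fetch(url):
    """Return the raw bytes served at url (assumes no authentication is needed)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def digest_by_node(node_base_urls, path, fetch=http_fetch):
    """Map each node's base URL to the SHA-256 hex digest of the content at path."""
    digests = {}
    for base in node_base_urls:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        digests[base] = hashlib.sha256(fetch(url)).hexdigest()
    return digests


def nodes_disagree(digests):
    """True when at least two nodes returned different content."""
    return len(set(digests.values())) > 1
```

Calling `digest_by_node` with the WebDAV base URL of each node and the path of a recently updated file, then passing the result to `nodes_disagree`, would show whether node2 is still serving the old version. Running it on a schedule and logging the timestamps of mismatches might help narrow down which log lines to look at.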

We increased the RMI socket timeout (alfresco.ehcache.rmi.sockettimeoutmillis), which seems to have made a small improvement. However, it's still happening, and the timeout is already LONG (hours).

We are using TCP for JGroups and I've increased logging, but it's like finding a needle in a haystack.

However, I also realize that the ehcache objects are configured to never expire. Our ehcache_custom.xml has entries like this:
    <cache
        name="org.alfresco.repo.domain.hibernate.NodeImpl"
        maxElementsInMemory="10000"
        eternal="false"
        timeToIdleSeconds="900"
        timeToLiveSeconds="900"
        overflowToDisk="false">

            <cacheEventListenerFactory
                    class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
                    properties="replicatePuts = false,
                                replicateUpdates = true,
                                replicateRemovals = true,
                                replicateUpdatesViaCopy = false,
                                replicateAsynchronously = false"/>
    </cache>

I'm concerned that expiring the cache entries would move load onto the database, which is not desirable.

Does anybody have any advice to help improve the situation?

UPDATE: Digging a little more, I realize I may be going down the wrong path. I believe it's using JGroups instead of EHCache, so my change probably didn't affect anything! Here is the ehcache_custom.xml:
    <cacheManagerPeerProviderFactory
        class="org.alfresco.repo.cache.AlfrescoCacheManagerPeerProviderFactory"
        properties="heartbeatInterval=5000,
                    peerDiscovery=automatic,
                    multicastGroupAddress=230.0.0.1,
                    multicastGroupPort=4446"/>


I guess this means it's using JGroups? So maybe a timeout in JGroups? The alfresco-jgroups-TCP.xml is this:
<config>
    <TCP bind_port="${alfresco.tcp.start_port:7800}"
         loopback="true"
         recv_buf_size="20000000"
         send_buf_size="640000"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         enable_bundling="true"
         use_send_queues="false"
         sock_conn_timeout="300"
         skip_suspected_members="true"

         thread_pool.enabled="true"
         thread_pool.min_threads="1"
         thread_pool.max_threads="25"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="run"

         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="1"
         oob_thread_pool.max_threads="8"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="run"/>

    <TCPPING timeout="3000"
             initial_hosts="${alfresco.tcp.initial_hosts:localhost[7800]}"
             port_range="${alfresco.tcp.port_range:3}"
             num_initial_members="2"/>
    <MERGE2 max_interval="30000"
              min_interval="10000"/>
    <FD_SIMPLE timeout="10000" max_missed_hbs="10" />
    <VERIFY_SUSPECT timeout="1500" />
    <BARRIER />
    <pbcast.NAKACK
                   use_mcast_xmit="false" gc_lag="0"
                   retransmit_timeout="300,600,1200,2400,4800"
                   discard_delivered_msgs="true"/>
    <UNICAST timeout="300,600,1200" />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="400000"/>
    <VIEW_SYNC avg_send_interval="60000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <FC max_credits="2000000"
        min_threshold="0.10"/>
    <FRAG2 frag_size="60000" />
    <pbcast.STREAMING_STATE_TRANSFER/>
    <!-- <pbcast.STATE_TRANSFER/> -->
</config>
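
Since this stack uses TCP with a fixed TCPPING host list, one basic thing to rule out is whether each node can actually open a connection to the other node's JGroups port. Below is a rough Python sketch for that check. The helper names `port_is_open` and `probe_ports` are hypothetical. The start port 7800 is the default from the config above, and how many consecutive ports to probe is an assumption, since the exact meaning of `port_range` may differ between JGroups versions.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def probe_ports(host, start_port, count, timeout=3.0):
    """Check `count` consecutive ports starting at start_port; returns {port: bool}."""
    return {
        port: port_is_open(host, port, timeout)
        for port in range(start_port, start_port + count)
    }
```

Run from node1 against node2's host name (and the other way round), something like `probe_ports("node2", 7800, 3)` would show which of the candidate ports accept connections. A successful connect only shows the port is reachable; it says nothing about whether the two nodes ended up in the same JGroups view.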

Any help would be appreciated.

Thanks,

yogeshpj · ‎06-09-2014

Hi,

Which version of Alfresco are you using?
As a quick suggestion, can you please check that both nodes' system clocks are in sync?

crichsource360 · ‎06-10-2014

This is Version 3.1 Enterprise.
Definitely, the clocks are synchronized and verified to be working correctly.

mrogers · ‎06-10-2014

If it's 3.1 Enterprise, then please call Alfresco support.

crichsource360 · ‎06-10-2014

It's Alfresco OEM'd into Adobe LiveCycle and support is ending for this. I would like to solve this outside of support.

Cluster content inconsistencies