topic Re: Unable to rolling restart my cluster due to Hazelcast timeouts in Alfresco Forum

Unable to rolling restart my cluster due to Hazelcast timeouts

josh_barrett — Thu, 19 Oct 2017 13:30:25 GMT

I am running 5.1.1 in an environment and ran into an issue yesterday under peak load. We had a couple of servers get into a bad state, so we tried to do a rolling restart of Alfresco. The servers wouldn't start up because of a Hazelcast timeout, probably because the cluster was so busy. We had to stop

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

afaust — Fri, 20 Oct 2017 12:00:40 GMT

Hazelcast timeouts can be caused by many things, ranging from actual networking issues and CPU overload to memory / garbage collection issues on the other cluster node. The issue I have seen most often is the latter: a poorly configured system sitting very close to garbage collection hell, where only a slight change in circumstances would bring down the entire cluster. You need to investigate which issue you were actually suffering from. I'd advise running some JVM monitoring, e.g. via jvisualvm, during startup (on all cluster nodes) to get a picture of what's going on.
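As a rough complement to jvisualvm, GC pressure can also be estimated from two `jstat -gcutil <pid>` samples taken a known interval apart. The following is only a sketch: it assumes the output has a header row that includes a `GCT` column (cumulative GC time in seconds), as on Java 8 JVMs. The helper names and the sample numbers are made up for illustration.

```python
def parse_gcutil(output):
    """Parse `jstat -gcutil` output (header row + value row) into a dict."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    values = lines[-1].split()
    parsed = {}
    for name, raw in zip(header, values):
        try:
            parsed[name] = float(raw)
        except ValueError:
            parsed[name] = None  # jstat prints "-" when a column has no value
    return parsed


def gc_overhead(first, second, interval_seconds):
    """Fraction of wall-clock time spent in GC between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (second["GCT"] - first["GCT"]) / interval_seconds


# Hypothetical samples taken 10 seconds apart on the same JVM.
SAMPLE_A = """
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  97.02  70.31  66.80  95.52  89.14    103    3.531     2    0.200    3.731
"""
SAMPLE_B = """
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
 12.40   0.00  99.10  98.75  95.60  89.20    140    5.231     9    4.500    9.731
"""

overhead = gc_overhead(parse_gcutil(SAMPLE_A), parse_gcutil(SAMPLE_B), 10)
print(f"GC overhead: {overhead:.0%}")
```

A node that spends most of its time in GC, as in these invented samples, is unlikely to answer Hazelcast heartbeats on time, which fits the timeout symptom.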

In some circumstances you might even be able to avoid doing a full restart of your entire cluster, e.g. if only the communication / cluster state is affected. Using the JavaScript Console you can restart only the Hazelcast layer, and using the Caches tool of the OOTBee Support Tools addon you can purge data caches to remove potentially stale data.

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

jpotts — Fri, 20 Oct 2017 20:52:46 GMT

Have you tried disabling multicast and instead listing the members of the cluster individually in the hazelcast config?

It looks something like:

            <hz:join>
               <hz:multicast enabled="false"
                     multicast-group="224.2.2.5"
                     multicast-port="54327"/>
               <hz:tcp-ip enabled="true">
                  <hz:members>10.84.1.151,10.84.1.152</hz:members>
               </hz:tcp-ip>
            </hz:join>
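
Before relying on a static member list, it may be worth confirming that each listed member is actually reachable over TCP on the Hazelcast port. A minimal sketch in Python; the function names are hypothetical, and the port is left as a parameter because the thread does not state which one is in use (5701 is Hazelcast's own default, which your setup may override):

```python
import socket


def parse_members(members):
    """Split a comma-separated member list, as used in <hz:members>."""
    return [m.strip() for m in members.split(",") if m.strip()]


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unroutable hosts
        return False


def check_members(members, port, timeout=2.0):
    """Map each listed member to whether its port accepts connections."""
    return {
        host: is_reachable(host, port, timeout)
        for host in parse_members(members)
    }
```

Calling `check_members("10.84.1.151,10.84.1.152", 5701)` from each node would show whether the peers accept connections on that port. A successful connection only proves that the port is open, not that the Hazelcast member behind it is healthy.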

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

afaust — Mon, 23 Oct 2017 08:57:36 GMT

With Hazelcast on the Repository tier, multicast is disabled by default. The config example from Jeff applies only to the Share tier, where the Hazelcast config is embedded in Spring. For Share, the Alfresco documentation provides the configuration with multicast enabled. The error messages in the logs point to Repository-tier issues, though.

Re: Unable to rolling restart my cluster due to Hazelcast timeouts

josh_barrett — Fri, 27 Oct 2017 20:42:59 GMT

Thanks for the replies, Axel Faust and Jeff Potts. The actual root problem was that all of the Alfresco servers in our cluster were close to being maxed out on CPU.

The issue was that, under peak load, a few custom background processes kicked off and pushed the servers over the edge.

In the heat of the moment we removed all of the servers from the cluster and simply had our API layer talk to the unclustered Alfresco servers via CMIS through a load balancer. We thought we were all good. The servers seemed healthy in terms of CPU, JVM, and the number of requests we were handling. But after looking into the logs, we found that a majority of the document update calls were failing with messages like the following in our custom API logs:
Expected xxxx bytes but retrieved 0 bytes!

We reproduced this issue in our Performance environment and resolved it by adding the servers back into the cluster. The weird thing is that only updates triggered it: adding new documents worked fine, and only binary updates failed. I wonder if this is a bug in the CMIS implementation.
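
One way to surface this class of failure early would be for the API layer to read the content back after each update and compare its length to what was sent. A minimal sketch, assuming the CMIS client exposes the document content as a file-like stream; `count_bytes` and `verify_content_length` are hypothetical helpers, not part of any CMIS library:

```python
import io


def count_bytes(stream, chunk_size=64 * 1024):
    """Read a binary file-like stream to the end and return its length."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


def verify_content_length(expected, stream):
    """Raise if the retrieved content is not the size that was uploaded."""
    actual = count_bytes(stream)
    if actual != expected:
        raise IOError(
            f"Expected {expected} bytes but retrieved {actual} bytes!"
        )
    return actual


# Illustration with in-memory streams standing in for CMIS content streams.
verify_content_length(5, io.BytesIO(b"hello"))  # passes silently
```

Such a check would not explain why unclustered nodes return empty content after an update, but it would make the failures visible at write time instead of only in the logs.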