01-19-2017 10:19 AM
Hello - we have an issue: once we go above roughly 100 users, we start seeing the errors below, and our application can no longer authenticate against Alfresco for any API calls, new tickets or logoff events.
2017-01-19 03:49:28,254 ERROR [org.alfresco.util.transaction.TransactionSupportUtil] After completion (committed) TransactionalCache exception
org.alfresco.error.AlfrescoRuntimeException: 00191108829 Failed to transfer updates to shared cache
2017-01-19 04:59:02,914 ERROR [org.springframework.extensions.webscripts.AbstractRuntime] Exception from executeScript - redirecting to status template error: [CONCURRENT_MAP_PUT] Redo threshold[90] exceeded! Last redo cause: REDO_MAP_OVER_CAPACITY, Name: c:cache.ticketsCache
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Redo threshold[90] exceeded! Last redo cause: REDO_MAP_OVER_CAPACITY, Name: c:cache.ticketsCache
01-19-2017 11:04 AM
Please give more detail about your environment (including exact build versions of Alfresco, database, O/S etc). Is this using Alfresco Share with Alfresco Platform?
Is this a Community or an Enterprise cluster? See also [ACE-5184] Tomcat 7 classloader serializes authentication ticket retrieval - Alfresco JIRA (to see if there is any correlation).
Thanks,
Jan
01-19-2017 12:56 PM
Hi Jan,
We use Alfresco 5.0.3.1 enterprise version.
It runs on Windows Server 2012 R2.
We have two Alfresco nodes running in the cluster.
We do not use Alfresco Share but a third party user interface.
We have also logged a ticket for this with Alfresco Support, but wanted to see if the community has also run into this issue.
Thanks
01-19-2017 01:04 PM
Thanks for the details. In addition to your support ticket (and any community feedback), please take a look at ACE-5184 in case there is any correlation.
Regards,
Jan
01-19-2017 06:34 PM
Unfortunately the Hazelcast cache for tickets (like most other caches) has been configured to use synchronous replication of data to other cluster nodes. This can cause various issues to propagate across multiple members of the cluster. E.g. when a cluster node is suffering from excessive GC overhead, this might introduce delays significant enough to trigger timeouts in the communication and, in the worst case, can even cause the cluster to dissolve.
In your case it would be interesting to get more information out of the Hazelcast layer at the time of these errors, to see why redos have to be performed. I was previously able to analyze internal issues with (the older version of) Hazelcast by setting the appropriate logger (com.hazelcast) to DEBUG via the Alfresco Support Tools addon. The resulting logs can then be passed on to Alfresco Support or used here for additional analysis.
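For a first pass over the collected log, it may help to tally the redo errors per cache and cause before digging into the DEBUG output. Below is a minimal Java sketch, assuming the messages follow the same format as the excerpt at the top of this thread. The class name RedoErrorScanner and its methods are hypothetical helpers for illustration, not part of Alfresco or Hazelcast.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedoErrorScanner {

    // Assumed message format, taken from the log excerpt in this thread:
    // "[CONCURRENT_MAP_PUT] Redo threshold[90] exceeded! Last redo cause: REDO_MAP_OVER_CAPACITY, Name: c:cache.ticketsCache"
    private static final Pattern REDO = Pattern.compile(
            "\\[(\\w+)\\] Redo threshold\\[(\\d+)\\] exceeded! Last redo cause: (\\w+), Name: (\\S+)");

    /** Returns "operation|threshold|cause|name" for a matching line, or null if the line does not match. */
    public static String summarize(String line) {
        Matcher m = REDO.matcher(line);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3) + "|" + m.group(4);
    }

    /** Counts matching lines per summary string, preserving first-seen order. */
    public static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String key = summarize(line);
            if (key != null) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RedoErrorScanner <path-to-log-file>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for very large logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map.Entry<String, Integer> e : tally(lines).entrySet()) {
            System.out.println(e.getValue() + " x " + e.getKey());
        }
    }
}
```

Run it against a copy of the log file from each cluster node. If only c:cache.ticketsCache shows up with REDO_MAP_OVER_CAPACITY, that narrows the analysis to the ticket cache rather than the cluster communication in general.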