Deadlock when running multiple executors

03-09-2016 07:00 PM
I am using Activiti 5.19.0.1.
The total number of process instances is 10k. It takes 10 to 16 minutes to complete the 10k processes on a Mac Pro with 16 GB of RAM. The database used is Postgres.
All tasks in the process are set to lazy and exclusive.
Attached is a zip file containing the process definition as well as the Java code used.
Observations:
1. In almost all application runs, tens of jobs end up with retries_ < 1 and need to be reset via a SQL statement so that the engine completes them. Question: if a job fails, does it fail before even executing the "execute(DelegateExecution execution)" method, or does it rely on transaction rollback to revert any changes? In other words, if a job sends an email but fails the first time, retrying and successfully completing the job the second time does not result in the email being sent twice, right? (See the sketch at the end of this post for the kind of delegate I have in mind.)
2. Often, the engine stops processing instances after it completes ~9.5k out of the 10k instances. Sometimes a deadlock is detected by the application; at other times the deadlock is detected by the database, as you can see in the screenshot in the attached doc file. Worst of all, sometimes the engine blocks forever (nothing happens at all). From analyzing the process thread dumps (again, see the attached doc file), it seems the engine hits an undetected deadlock. Resetting the retries_ count for jobs with retries_ < 1 does complete some jobs, but then the engine blocks again. As you can see from the thread dumps, many threads are in a runnable state waiting on the database, and too few remain available for further processing. This might explain the slowness that sometimes occurs and why the engine needs to be restarted in order to continue.
Most of the runnable threads are inside one of these two methods:
org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.updateProcessInstanceLockTime line: 205
org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.clearProcessInstanceLockTime line: 212
By the way, I have encountered similar situations with fewer executors and more threads. In case you wonder why I chose those specific numbers of executors and threads: that is the configuration with which I can most frequently reproduce the issue.
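To make observation 1 concrete, here is the kind of hypothetical delegate I have in mind (class, method and variable names are made up, not taken from the attached code): the email is an external side effect the engine cannot roll back, while the variable update runs inside the job's transaction.

```java
import org.activiti.engine.delegate.DelegateExecution;
import org.activiti.engine.delegate.JavaDelegate;

// Hypothetical delegate illustrating observation 1. The email is an external
// side effect that a transaction rollback cannot undo; the variable update
// below is part of the job's transaction.
public class SendOfferEmailDelegate implements JavaDelegate {

  @Override
  public void execute(DelegateExecution execution) {
    String recipient = (String) execution.getVariable("candidateEmail");

    // Non-transactional side effect: once this returns, rolling back the
    // surrounding transaction does not "unsend" the mail.
    sendEmail(recipient);

    // If anything after this point fails, the transaction rolls back,
    // retries_ is decremented and the job is retried later -- which would
    // run sendEmail() again.
    execution.setVariable("offerSent", true);
  }

  private void sendEmail(String recipient) {
    // Imagine an SMTP call here; omitted to keep the sketch self-contained.
  }
}
```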
Thank you!
Dan

03-19-2016 11:11 PM
Thank you!
03-22-2016 02:30 PM
I have this topic high on my todo list to:
1) verify the deadlocks in your original posts
2) read more about the impact of 'select for update nowait' and how it behaves on different databases
"Say a job does increase the salary by 5%. If the job is retried, the salary would have been increased twice which is not acceptable from a business standpoint."
No; although if the increase in salary is done in a non-transactional way, then you are correct. That is why it's always important in Activiti to make that logic transactional.
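For anyone reading along later, a minimal sketch of what "transactional" can look like, assuming a Spring setup in which the engine (SpringProcessEngineConfiguration) and the business code share the same DataSource and transaction manager; the table, column and variable names are illustrative only:

```java
import org.activiti.engine.delegate.DelegateExecution;
import org.activiti.engine.delegate.JavaDelegate;
import org.springframework.jdbc.core.JdbcTemplate;

// Illustrative delegate. Because the JdbcTemplate uses the same DataSource
// and transaction manager as the engine, the UPDATE joins the job's
// transaction: if the job fails and rolls back, the salary update rolls back
// with it, so a retry does not apply the +5% twice.
public class IncreaseSalaryDelegate implements JavaDelegate {

  private final JdbcTemplate jdbcTemplate;

  public IncreaseSalaryDelegate(JdbcTemplate jdbcTemplate) {
    this.jdbcTemplate = jdbcTemplate;
  }

  @Override
  public void execute(DelegateExecution execution) {
    Long employeeId = (Long) execution.getVariable("employeeId");
    jdbcTemplate.update(
        "UPDATE employee SET salary = salary * 1.05 WHERE id = ?", employeeId);
  }
}
```

Calling a mail server or an external REST service from the delegate is exactly the opposite, non-transactional case, where a retry does repeat the side effect.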
"I am afraid, this is flaw in the design of how jobs are selected for execution (multiple threads can execute same job). "
I must disagree with this: multiple threads won't execute the same job. When it is being retried, yes. But the actual execution of the logic _and_ the continuation of the process instance: no.
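To illustrate why, here is a rough sketch (plain JDBC, deliberately simplified, not the actual engine source) of what the two methods from your thread dumps, updateProcessInstanceLockTime and clearProcessInstanceLockTime, boil down to: an executor may only continue an exclusive job after it has stamped a lock time on the process-instance row, and only one executor can win that update.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Simplified illustration only -- not the engine's actual SQL or code paths.
public class ProcessInstanceLockSketch {

  /**
   * Returns true if this executor obtained the lock; false means another
   * executor already holds an unexpired lock (the real engine surfaces that
   * as an optimistic-locking failure and leaves the job for a later attempt).
   */
  public static boolean tryLockProcessInstance(Connection connection,
      String processInstanceId, Timestamp lockExpirationTime, Timestamp now)
      throws SQLException {
    String sql = "UPDATE ACT_RU_EXECUTION SET LOCK_TIME_ = ? "
        + "WHERE ID_ = ? AND (LOCK_TIME_ IS NULL OR LOCK_TIME_ <= ?)";
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
      ps.setTimestamp(1, lockExpirationTime);
      ps.setString(2, processInstanceId);
      ps.setTimestamp(3, now);
      return ps.executeUpdate() == 1; // 0 rows => another executor won the race
    }
  }

  // clearProcessInstanceLockTime is the mirror image: set LOCK_TIME_ back to
  // NULL once the instance's exclusive work has been handled.
}
```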
I'll post back when we've looked closer into the 'nowait' semantics and have tried to reproduce the deadlocking.
Many thanks for the great comments with loads of information!

03-22-2016 03:02 PM
Thanks for clarifying the job retry mechanism. In light of what you have explained, do you think it makes sense to set a limit on the number of job retries? I think it should be unlimited, since a job should be retried until it succeeds. That is where my confusion came from.
Thanks,
Dan
03-22-2016 03:16 PM
Of course; you can set it to Integer.MAX_VALUE if you like.
Alternatively, to have full control, you can inject your own implementation of org.activiti.engine.impl.jobexecutor.FailedJobCommandFactory (which by default creates a RetryJobCmd that does the default behavior) and do whatever you please.
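If you go the custom-factory route, a minimal sketch could look like the following; note that the entity lookup inside the command relies on Activiti 5.x internals and may need adjusting for your exact version:

```java
import org.activiti.engine.impl.interceptor.Command;
import org.activiti.engine.impl.interceptor.CommandContext;
import org.activiti.engine.impl.jobexecutor.FailedJobCommandFactory;
import org.activiti.engine.impl.persistence.entity.JobEntity;

// Sketch of a "retry until it succeeds" policy: instead of the default
// decrement-and-eventually-give-up behaviour, keep the job's retry count
// topped up and release its lock so an executor picks it up again.
public class AlwaysRetryFailedJobCommandFactory implements FailedJobCommandFactory {

  private static final int RETRIES = 3;

  @Override
  public Command<Object> getCommand(final String jobId, final Throwable exception) {
    return new Command<Object>() {
      @Override
      public Object execute(CommandContext commandContext) {
        // Internal API; verify against the 5.x version you are running.
        JobEntity job = commandContext.getJobEntityManager().findJobById(jobId);
        if (job != null) {
          job.setRetries(RETRIES);          // never let retries_ reach 0
          job.setLockOwner(null);           // release the lock so the job
          job.setLockExpirationTime(null);  // can be acquired again
        }
        return null;
      }
    };
  }
}
```

You would then register it on the process engine configuration, e.g. processEngineConfiguration.setFailedJobCommandFactory(new AlwaysRetryFailedJobCommandFactory()), or via the corresponding property in your Spring/XML configuration.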
03-24-2016 11:15 AM
When using 5.19.0.1, I'm seeing:
Exception in thread "pool-1-thread-6" org.activiti.engine.ActivitiOptimisticLockingException: Could not lock process instance
at org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.updateProcessInstanceLockTime(ExecutionEntityManager.java:208)
at org.activiti.engine.impl.cmd.LockExclusiveJobCmd.execute(LockExclusiveJobCmd.java:55)
Not sure if it's the same thing as you're seeing … BUT when I run on 5.19.0.2 (and master, 5.20-SNAPSHOT), I'm not seeing any exception anymore, and all process instances get completed. Can you give that a go on your side too?

03-25-2016 04:52 PM
I did test it on 5.19.0.2 with 10k instances; it works with no problem, but I observed a significant throughput decrease. I will repeat the test with 100k and 1M processes and report back on that.
Thank you for taking this issue seriously.
Dan
03-29-2016 12:16 PM
In turn, I want to report that we were able to successfully run 1,000,000 processes on 4 different nodes; it took 9 hours. Best of all, no deadlocks were detected.
Thanks for the effort you put into fixing these issues.
06-03-2016 09:57 AM
I had the same issue using Activiti 5.20 deployed on WildFly with SQL Server when enabling async continuations for some service tasks in my business process.
Note: My set-up of Activiti might be a little bit atypical as I use both activiti-cdi and activiti-rest-api.
Changing the contextReusePossible value to false fixed it for me too. Could you explain what exactly the goal of this property is, and what context reuse means exactly for the job executor?
Thanks,
Rémi
