
Deadlock when having multiple executors

dan_
Champ in-the-making
I am writing a simple application to run 8 executors, each with a max thread pool of 2.
I am using Activiti 5.19.0.1.
The total number of process instances is 10k. Completing the 10k processes takes 10 to 16 minutes on a Mac Pro with 16 GB of RAM. The database used is Postgres.
All tasks in the process are set to lazy and exclusive.
Attached is a zip file containing the process definition as well as the Java code used.

Observations:
1. In almost every run, tens of jobs end up with retries_ < 1 and need to be reset via a SQL statement so that the engine completes them. Question: if a job fails, does it fail before even executing the "execute(DelegateExecution execution)" method, or does it rely on transaction rollback to revert any changes? I want to know: if a job sends an email but fails the first time, retrying and successfully completing the job the second time does not result in the email being sent twice, right?

2. Often, the engine stops processing instances after it completes ~9.5k out of the 10k instances. Sometimes a deadlock is detected by the application; at other times, the deadlock is detected by the database, as you can see in the screenshot in the attached doc file. Worst of all, sometimes the engine blocks forever (nothing happens). From analyzing the process thread dumps (again, see the attached doc file), it seems the engine hit an undetected deadlock. Resetting the retries_ count for jobs with retries_ < 1 does complete some jobs, but then it blocks again. As you can see from the thread dumps, many threads are in a runnable state waiting on the database and too few remain available for further processing. This might explain the slowness that sometimes occurs and why the engine then needs to be restarted in order to continue.

Most of the runnable threads have passed through these two methods:
org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.updateProcessInstanceLockTime line: 205
org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.clearProcessInstanceLockTime line: 212

By the way, I have encountered similar situations with fewer executors and more threads. In case you wonder why these specific numbers of executors and threads: this is the configuration with which I can most frequently reproduce the issue.

Thank you!

Dan
20 REPLIES

dan_
Champ in-the-making
Any feedback? Partly based on that feedback, I need to decide whether to use the engine or not. I am afraid this is a flaw in the design of how jobs are selected for execution (multiple threads can execute the same job). Additionally, it is not clear whether the job retry mechanism may cause side effects. Say a job increases a salary by 5%. If the job is retried, the salary would have been increased twice, which is not acceptable from a business standpoint.

Thank you!

jbarrez
Star Contributor
Hi dan_,

I have this topic high on my todo list to:
1) verify the deadlocks in your original post
2) read more about the impact of 'select for update nowait' and how it behaves on different databases

"Say a job does increase the salary by 5%. If the job is retried, the salary would have been increased twice which is not acceptable from a business standpoint."

If the increase in salary is done in a non-transactional way, then you are correct: it could be applied twice. It's always important in Activiti to make that logic transactional.
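To make the concern concrete, here is a minimal sketch in plain Java (the class and method names are made up for illustration, not Activiti API): if the side effect is guarded by the job id, a retried job becomes a no-op, so a raise (or an email) is applied only once even when the first attempt's transaction rolls back and the job runs again.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only -- hypothetical names, not Activiti classes.
// Guarding the side effect with the job id makes a retry idempotent.
public class IdempotentJobDemo {

    static final Map<String, Double> salaries = new HashMap<>();
    static final Set<String> appliedJobs = new HashSet<>();

    // Applies a 5% raise at most once per job id, even if the job is retried.
    static void applyRaise(String jobId, String employee) {
        if (!appliedJobs.add(jobId)) {
            return; // this job already completed once; the retry does nothing
        }
        salaries.computeIfPresent(employee, (name, salary) -> salary * 1.05);
    }

    public static void main(String[] args) {
        salaries.put("dan", 1000.0);
        applyRaise("job-42", "dan"); // first attempt
        applyRaise("job-42", "dan"); // simulated retry of the same job
        System.out.println(salaries.get("dan")); // 1050.0, not 1102.5
    }
}
```

In a real deployment the "applied jobs" set would live in the same database transaction as the business update, so it rolls back together with the work it guards.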

"I am afraid, this is flaw in the design of how jobs are selected for execution (multiple threads can execute same job). "

I must disagree with this: multiple threads won't execute the same job. It may be retried, yes, but the actual execution of the logic _and_ the continuation of the process instance happen only once.

I'll post back once we've looked closer into the 'nowait' semantics and tried to reproduce the deadlocking.

Many thanks for the great comments with loads of information!

dan_
Champ in-the-making
Hi Jerome,

Thanks for clarifying the job retry mechanism. In light of what you have explained, do you think it makes sense to set a limit on the number of job retries? I think it should be unlimited, since a job should be retried until it succeeds. That was the source of my confusion.

Thanks,

Dan

jbarrez
Star Contributor
The idea of a finite number of retries is to avoid hammering an already-dead system forever … the job then goes into a sort of 'dead-letter queue' (as in messaging) where an admin can manually retry it.

Of course, you can set it to Integer.MAX_VALUE if you like.

Alternatively, to have full control, you can inject your own implementation of org.activiti.engine.impl.jobexecutor.FailedJobCommandFactory (which by default creates a RetryJobCmd that does the default behavior) and do whatever you please.
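The pattern behind such a custom factory can be sketched without the Activiti classes. The types below are simplified stand-ins I made up for illustration (the real contract is org.activiti.engine.impl.jobexecutor.FailedJobCommandFactory); the point is only the difference between the default decrement-retries policy and an "unlimited retries" policy you could plug in instead.

```java
// Sketch with hypothetical stand-in types, not the real Activiti classes.
public class UnlimitedRetryDemo {

    // Stand-in for a persisted job row with a retries_ column.
    static class Job {
        int retries = 3;
    }

    // Mirrors the role of FailedJobCommandFactory: decide what happens on failure.
    interface FailureHandler {
        void onFailure(Job job, Throwable cause);
    }

    // Default-style policy: decrement retries, so the job dies when it reaches 0.
    static final FailureHandler DECREMENT = (job, cause) -> job.retries--;

    // Custom policy: leave retries untouched, i.e. retry forever.
    static final FailureHandler UNLIMITED = (job, cause) -> { };

    public static void main(String[] args) {
        Job a = new Job();
        Job b = new Job();
        for (int i = 0; i < 5; i++) {
            DECREMENT.onFailure(a, new RuntimeException("boom"));
            UNLIMITED.onFailure(b, new RuntimeException("boom"));
        }
        System.out.println(a.retries); // -2: exhausted under the default policy
        System.out.println(b.retries); // 3: still alive under the custom policy
    }
}
```

A real implementation would return a Command from getCommand(jobId, exception) and be registered on the process engine configuration; check the exact signatures against your Activiti version.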

jbarrez
Star Contributor
Hi dan_,

When using 5.19.0.1, I'm seeing:

Exception in thread "pool-1-thread-6" org.activiti.engine.ActivitiOptimisticLockingException: Could not lock process instance
at org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.updateProcessInstanceLockTime(ExecutionEntityManager.java:208)
at org.activiti.engine.impl.cmd.LockExclusiveJobCmd.execute(LockExclusiveJobCmd.java:55)

Not sure if it's the same thing as you see … BUT when I run on 5.19.0.2 (and master, 5.20-SNAPSHOT), I'm not seeing any exception anymore, and all process instances get completed. Can you give that a go on your side too?

dan_
Champ in-the-making
Thanks Joram,

I did test it on 5.19.0.2 with 10k process instances; it worked with no problem, but I observed a significant throughput decrease. I will repeat the test with 100k and 1M processes and report back.

Thank you for taking this issue seriously.

Dan

jbarrez
Star Contributor
Thanks dan_ for the update. A throughput decrease is not wanted, but maybe it's a side effect of the fix we made. I can't recall the exact details, but there was definitely something broken: the queue size was not honored properly, which meant that jobs were executed even when the queue was full. So definitely have a go at playing with the queue-size parameters too, as they will have a serious impact in that release!
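For reference, if you configure the async executor programmatically, the knobs look roughly like this. This is a sketch under the assumption of an Activiti 5.19.x-style API; the setter names should be double-checked against the exact version in use.

```java
import org.activiti.engine.impl.asyncexecutor.DefaultAsyncJobExecutor;
import org.activiti.engine.impl.cfg.StandaloneProcessEngineConfiguration;

// Sketch: tuning the async executor's pool and queue sizes (verify the
// setter names against your Activiti 5.19.x version before relying on this).
public class AsyncExecutorConfigSketch {
    public static void main(String[] args) {
        DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
        asyncExecutor.setCorePoolSize(2);
        asyncExecutor.setMaxPoolSize(2);
        asyncExecutor.setQueueSize(100); // jobs beyond this should wait, not pile in

        StandaloneProcessEngineConfiguration config =
                new StandaloneProcessEngineConfiguration();
        config.setAsyncExecutor(asyncExecutor);
        config.setAsyncExecutorEnabled(true);
        config.setAsyncExecutorActivate(true);
    }
}
```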

dan_
Champ in-the-making
Thanks Joram for the information,

In turn, I want to report that we were able to successfully run 1,000,000 processes on 4 different nodes; it took 9 hours. Best of all, no deadlocks were detected.

Thanks for the effort you expended on fixing issues.

jbarrez
Star Contributor
dan_, that is very awesome to hear! Thanks for posting back!

remibantos
Champ in-the-making
Hi,

I had the same issue using Activiti 5.20 deployed on WildFly with SQL Server, when enabling async continuations for some service tasks in my business process.
Note: my Activiti set-up might be a little atypical, as I use both activiti-cdi and activiti-rest-api.

Changing the contextReusePossible value to false fixed it for me too. Could you explain what exactly the goal of this property is, and what context reuse means exactly for the job executor?

Thanks,
Rémi