
Deadlock when having multiple executors

dan_
Champ in-the-making
I am writing a simple application to run 8 executors, each with a max thread pool of 2.
I am using Activiti 5.19.0.1.
The total number of process instances is 10k. It takes 10 to 16 minutes to complete the 10k processes on a mac pro with 16G. The database used is Postgres.
All tasks in the process are set to lazy and exclusive.
Attached a zip file containing the process definition as well as the java code used.

Observations:
1. In almost all application runs, tens of jobs end up with retries_ < 1 and need to be reset via an SQL statement so that the engine completes them. Question: if a job fails, does it fail before even executing the "execute(DelegateExecution execution)" method, or does it rely on transaction rollback to revert any changes? I want to know: if a job sends an email but fails the first time, retrying and successfully completing the job the second time does not result in the email being sent twice, right?

2. Often, the engine stops processing instances after it completes roughly 9,500 out of the 10k instances. Sometimes a deadlock is detected by the application; at other times the deadlock is detected by the database, as you can see in the screenshot in the attached doc file. Worst of all, sometimes the engine blocks forever (nothing happens). From analyzing the process thread dumps (again, see the attached doc file), it seems the engine hits an undetected deadlock. Resetting the retries_ count for jobs with retries_ < 1 does complete some jobs, but then it blocks again. As you can see from the thread dumps, many threads are in a runnable state waiting on the database, and too few remain available for further processing. This might explain the slowness that sometimes happens and why the engine needs to be restarted in order to continue.

Most of the runnable threads have passed through these 2 methods:
org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.updateProcessInstanceLockTime line: 205
org.activiti.engine.impl.persistence.entity.ExecutionEntityManager.clearProcessInstanceLockTime line: 212

By the way, I have encountered similar situations with fewer executors and more threads. In case you wonder why these specific numbers of executors and threads: this is the configuration with which I can most frequently reproduce the issue.

Thank you!

Dan
20 REPLIES

dan_
Champ in-the-making
I am sorry, I could not attach the doc file since only text files can be uploaded.

dan_
Champ in-the-making
Attached the pg_log that shows the deadlocks.

dan_
Champ in-the-making
I have fixed the issue I described above, which occurs in highly concurrent environments.

It seems that the engine's context reuse delays commits on the database side, which leads to deadlocks: a lot of commits end up waiting on the database side while not being visible from the application side.

I think the retry count should be unlimited, or at least have a high threshold, so that a job that could not execute because a lock was unavailable gets the chance to be retried until it executes successfully.

My fix does:

1- Set this.contextReusePossible to false by default in the default constructor of the CommandConfig class.
2- Use select for update nowait to update the process instance lock time in the Execution.xml file.
3- Set the number of retries to 10 in the JobEntity class.
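To make the intent of fix #2 concrete, here is a minimal in-process sketch of the difference between a blocking row lock (the default select ... for update) and the fail-fast nowait variant, modeled with a plain ReentrantLock. The class and method names are illustrative only, not Activiti API; the point is that with nowait a contending executor fails immediately instead of pinning a pool thread and a database connection while it waits.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of job-lock acquisition; not the actual Activiti code.
public class LockAcquisition {
    private final ReentrantLock rowLock = new ReentrantLock();

    // Default behavior: block until the "row lock" frees up,
    // tying up an executor thread (and a DB connection) while waiting.
    public void acquireBlocking() throws InterruptedException {
        rowLock.lockInterruptibly();
    }

    // NOWAIT behavior: fail immediately if another executor holds the lock;
    // the job stays queued and can be retried later instead of pinning a thread.
    public boolean acquireNoWait() {
        return rowLock.tryLock();
    }

    public void release() {
        rowLock.unlock();
    }

    public static void main(String[] args) throws Exception {
        LockAcquisition lock = new LockAcquisition();
        lock.acquireNoWait(); // first "executor" takes the row lock

        // A second executor on another thread tries the same row with NOWAIT:
        // it returns false right away instead of blocking.
        final boolean[] contended = new boolean[1];
        Thread other = new Thread(() -> contended[0] = lock.acquireNoWait());
        other.start();
        other.join();
        System.out.println("second executor got lock: " + contended[0]);
        lock.release();
    }
}
```

With eight executors competing for the same process instance rows, failing fast like this is what frees threads for other jobs instead of letting them stack up waiting on the database.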

I am happy to contribute the attached patch and am eager for your feedback.

Thanks


jwestra
Champ in-the-making
Have you been able to repeat on Activiti 6?  Just curious, since they have a different persistence model/approach.

dan_
Champ in-the-making
Not yet, but am going to try it soon.

dan_
Champ in-the-making
Tried Activiti 6 rc2; the same issue occurs.

jbarrez
Star Contributor
@dan_: thanks for looking into this in such detail.

Some remarks:
- The contextReusePossible general switch to false is not generally applicable, as it breaks many other use cases (service calls in Java delegates being one). I wonder, why was that needed in this context?
- The default number of retries is a configurable option anyway.
- But the SQL change is interesting! … but it's probably Postgres-only, right? Why is this query better at locking than the default one?

dan_
Champ in-the-making
Thanks for taking the time to give feedback.

As for why contextReusePossible is set to false: when running the application (attached to the initial post), which starts 10k instances, I systematically reach a point where roughly 9,500 instances have terminated but the engine blocks, with all threads waiting except one or two. I suspected a deadlock, which is usually the case. However, I was surprised to see bunches of commit and update processes, both idle and active, on the database side, unable to proceed due to deadlocks (server status view). In contrast, on the application side only one or two threads are still in a runnable state.

I started looking into the code where the commits are done. In the CommandContextInterceptor class, the call to context.close() performs the commit. However, I found out that if contextReusePossible is true, the commit is not done: "if (!contextReused) { context.close(); }". It seems to me that this explains the bunch of idle commit processes on the database side. I wanted the commits to be done as soon as the job is executed. Consequently, switching contextReusePossible to false by default does the job for me, and my application managed to terminate all processes without blocking or deadlocks.
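The effect of that "if (!contextReused)" guard can be sketched with a small stand-in (class and method names are hypothetical, not the real Activiti classes): when a nested command reuses the outer context, only the outermost close() fires the commit, so the database transaction stays open across the whole nested call chain; with reuse disabled, every command commits on its own.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal model of context reuse in a command interceptor; illustrative only.
public class ContextReuseDemo {
    static final AtomicInteger commits = new AtomicInteger();

    static class CommandContext {
        void close() { commits.incrementAndGet(); } // stands in for the transaction commit
    }

    static CommandContext current; // stands in for the thread's current context

    // Mirrors the quoted logic: "if (!contextReused) { context.close(); }"
    static void execute(Runnable command, boolean contextReusePossible) {
        boolean contextReused = contextReusePossible && current != null;
        if (!contextReused) {
            current = new CommandContext();
        }
        CommandContext context = current;
        try {
            command.run();
        } finally {
            if (!contextReused) {
                context.close(); // commit only when this level owns the context
                current = null;
            }
        }
    }

    public static void main(String[] args) {
        // Nested command with reuse enabled: a single commit covers both commands.
        execute(() -> execute(() -> {}, true), true);
        System.out.println("commits with reuse: " + commits.get());

        commits.set(0);
        // Reuse disabled (fix #1 above): each command commits on its own.
        execute(() -> execute(() -> {}, false), false);
        System.out.println("commits without reuse: " + commits.get());
    }
}
```

Under high concurrency, the longer-lived transaction in the reuse case is what widens the window in which two executors can block each other.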

What I understand from you is that setting contextReusePossible to false by default breaks service calls. You mean developer-provided service tasks, right? If that is the case, can I conclude that an application with no service tasks will behave correctly? Or do you mean that it breaks Activiti's own services? In that case, I am eager to hear your input on why my application succeeded in terminating all process instances.

Concerning retries: I am wondering whether a retried job gets executed as many times as it is retried. By getting executed, I mean that the execute method of the developer's DelegateExecution implementation is invoked with every job retry. If that is the case, it might create side effects, right? Consider the case where a job sends an email: it would send the email once per retry, right? I understand that such a situation is tolerable in the case of transactional data changes, as the transaction is rolled back; nevertheless, it requires careful job design to mitigate side effects.
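One common shape for that "careful job design" is to make the side-effecting job idempotent, so a retry is a no-op once the effect has happened. The sketch below uses an in-memory ledger keyed by job id; this is an illustration only, and in practice the ledger would be a unique key in durable storage (or an idempotency key passed to the mail provider), committed independently of the job's rolled-back transaction.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical idempotent job handler; not an Activiti class.
public class IdempotentEmailJob {
    private final Set<String> sentEmails = new HashSet<>(); // dedup ledger keyed by job id
    int sendCount = 0;

    // Invoked on every retry; the send happens at most once per job id.
    public void execute(String jobId) {
        if (!sentEmails.add(jobId)) {
            return; // already sent on an earlier attempt; skip the side effect
        }
        sendCount++; // stands in for the actual SMTP send
    }

    public static void main(String[] args) {
        IdempotentEmailJob job = new IdempotentEmailJob();
        job.execute("job-42"); // first attempt sends
        job.execute("job-42"); // retry: no-op
        job.execute("job-42"); // another retry: still a no-op
        System.out.println("emails sent: " + job.sendCount);
    }
}
```

With a guard like this, raising the retry count (fix #3 above) stops being risky for non-transactional side effects.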

When I read the code of ExecuteAsyncRunnable, I understand that a job is retried if an OptimisticLockException is thrown or if the developer-provided implementation of the DelegateExecution itself throws an exception, right? But when I run my application, I find that the DelegateExecution is executed on every retry. There is something I do not understand here. Can you please enlighten me on that?

Regarding the SQL change "select … for update nowait": it is available in all major databases (Oracle, SQL Server, Postgres). If a lock cannot be obtained immediately, it raises an error and returns right away without blocking, which is very interesting for Activiti's job locking (less resource consumption, more CPU cycles for actual work, and, more interestingly, fewer deadlocks). BTW, I noticed huge performance gains, a 2x to 3x improvement, when I applied my changes and ran the same application attached to the initial post.

Thanks!

dan_
Champ in-the-making
What I meant by performance improvement is completed-process throughput per second.