
Failed timers and due timers causing login and restart problems

jflukkien
Champ in-the-making
Hi all,

For the last 3 days I've been working on an issue we ran into with Activiti, and I'm wondering how to make our setup more resilient so it doesn't pop up again.

Last Saturday we had 30 processes fail because of a mismatch between the data our Activiti instance expected and the data it got back. The processes all reached a state where they were no longer being retried, and our entire application became unresponsive. Trying to log in would just hang. When we inspected the host where the application was running, we saw a lot of futex calls, and the number of locks on the Postgres instances behind Activiti was getting high enough to worry us.

At first we thought it might be an infrastructure problem. But this morning we were able to get our application up and running again. I did this by manually moving the due dates of all the job timers of the failed processes into the future, so that Activiti wouldn't try to execute them on startup anymore.

We are running Activiti with two application nodes and a replicated Postgres database behind it, and we use Tomcat to serve the Activiti web interface. The only things we develop ourselves are the BPMNs and the Java classes used in them.

It seems our problem was caused by Activiti's guarantee that it runs jobs whose due dates have passed, coupled with our Activiti application going belly up due to the exceptions it encountered while running our processes.

Now I'm left wondering how we can prevent this from happening again. Our approach up till now has been to not develop anything directly in Activiti, not touch the database behind it, and do most of our work with Java delegates. But of course something can always go wrong. This is why I'm wondering whether other people have encountered something like this, and how they solved it.

Thanks in advance for reading this long post :P

kind regards,

Jonathan


3 REPLIES

warper
Star Contributor
What engine properties do you use?
How long does it take to complete (or fail) one synchronous part of a job?
It's possible to get into trouble if you use several engines, your job takes too long, and the job lock is released before the initial job execution commits.
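
To make the lock-expiry scenario concrete, here is a minimal sketch in plain Java. It is a toy model, not Activiti code: the class `JobLockModel`, its method names, and the 5-minute lock time are all assumptions made for illustration. It only captures the idea that a job which is still running when its lock expires can be picked up by a second engine.

```java
import java.time.Duration;
import java.time.Instant;

/** Toy model of job locking across several engines. Hypothetical, not part of Activiti. */
final class JobLockModel {

    /** Instant at which a lock acquired at acquiredAt stops protecting the job. */
    static Instant lockExpiry(Instant acquiredAt, Duration lockTime) {
        return acquiredAt.plus(lockTime);
    }

    /**
     * True when the job is still running after its lock has expired,
     * which is the window in which another engine could acquire the same job.
     */
    static boolean canBePickedUpTwice(Duration executionTime, Duration lockTime) {
        return executionTime.compareTo(lockTime) > 0;
    }

    public static void main(String[] args) {
        // Assumed lock time for illustration only; check your own engine configuration.
        Duration lockTime = Duration.ofMinutes(5);

        Duration fastJob = Duration.ofMillis(50);   // short synchronous step
        Duration slowJob = Duration.ofMinutes(10);  // hypothetical job that outlives its lock

        System.out.println("fast job at risk: " + canBePickedUpTwice(fastJob, lockTime));
        System.out.println("slow job at risk: " + canBePickedUpTwice(slowJob, lockTime));
    }
}
```

Under this model, a synchronous step only becomes a problem once it runs longer than the configured lock time, which is why the duration of the longest step matters.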

jflukkien
Champ in-the-making
We have the following Process Engine Config:

JobExecutorActivate is true
EnableDatabaseEventLogging is true
DatabaseSchemaUpdate is true

One of the workflows that ran this morning took a total of 86 ms from start to finish, and the longest synchronous part of a job takes 50 ms (but that is an HTTP call, so that is not that strange).

warper
Star Contributor
Such a speed is beyond my understanding, to be honest. My local Activiti instance, running one process test on mocks, does something like 1.5 script tasks per second.
How do you set up retry counts/times for failed processes? What you describe can happen if you have a nearly zero retry timer and a lot of retries, so the failed job gets executed over and over for each of the 30 failing processes. Otherwise Activiti should try them once and sleep happily for the duration of the retry interval.
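
As a rough illustration of why the retry settings matter, the sketch below estimates how many executions a batch of failing jobs generates and over what time span. It is plain Java, not Activiti code: the class `RetryLoad` and its methods are hypothetical, the cycle strings such as `R5/PT10M` are illustrative ISO 8601 repeating intervals, and treating the repeat count as the total number of attempts is an assumption.

```java
import java.time.Duration;

/** Back-of-the-envelope retry load estimate. Hypothetical helper, not part of Activiti. */
final class RetryLoad {

    /** Repeat count from a cycle such as "R5/PT10M" (assumed to mean 5 attempts). */
    static int repeats(String cycle) {
        String repeatPart = cycle.split("/")[0]; // e.g. "R5"
        return Integer.parseInt(repeatPart.substring(1));
    }

    /** Wait time between attempts from a cycle such as "R5/PT10M". */
    static Duration interval(String cycle) {
        return Duration.parse(cycle.split("/")[1]); // e.g. "PT10M"
    }

    /** Total executions if every failing job uses up all of its attempts. */
    static long totalAttempts(int failingJobs, String cycle) {
        return (long) failingJobs * repeats(cycle);
    }

    /** Time between the first and the last attempt of a single job. */
    static Duration retryWindow(String cycle) {
        int gaps = Math.max(repeats(cycle) - 1, 0);
        return interval(cycle).multipliedBy(gaps);
    }

    public static void main(String[] args) {
        int failingJobs = 30; // number of failing processes mentioned in the thread

        for (String cycle : new String[] {"R5/PT10M", "R100/PT1S"}) {
            System.out.println(cycle + ": " + totalAttempts(failingJobs, cycle)
                    + " executions spread over " + retryWindow(cycle) + " per job");
        }
    }
}
```

Comparing a few widely spaced retries with many closely spaced ones shows how the second kind of setting concentrates a large number of failing executions into a short period.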