Hi,
Just a few thoughts on this subject, based on my (short) experience with Activiti:
Recovering from system errors tends to be a recurring issue, especially with tasks that talk to external systems (e.g. email, web services, external commands). I usually set the async flag on those tasks so that, in case of a transient failure, I can retry the job that the engine implicitly schedules to handle the async task.
Maybe this error-handling pattern could be made easier to implement by allowing a global error recovery strategy to be defined at the engine level. Such a strategy would receive a DelegateExecution and a Throwable and return a value that tells the engine how to proceed. Something like:
DEFAULT - Do whatever the command chain is supposed to do (eg: rollback)
IGNORE - Ignore the error and pretend the execution was ok (useful during development)
ROLLBACK_AND_RETRY_NOW - Side effects can be an issue here…
ROLLBACK_AND_RETRY_LATER - Pretty much what happens when an async task fails, but the handler should have a way to specify when the retry should happen.
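To make the idea a bit more concrete, here is a rough Java sketch of what such a strategy could look like. None of these types exist in Activiti: RecoveryAction, RecoveryDecision, ErrorRecoveryStrategy and RETRY_IO_LATER are names I made up, and ExecutionInfo is only a stand-in for DelegateExecution so that the snippet compiles on its own. The five-minute delay is arbitrary.

```java
import java.io.IOException;
import java.time.Duration;

// Hypothetical sketch: none of these types are part of Activiti.
class RecoveryStrategySketch {

    /** The four outcomes listed above. */
    enum RecoveryAction { DEFAULT, IGNORE, ROLLBACK_AND_RETRY_NOW, ROLLBACK_AND_RETRY_LATER }

    /** Stand-in for Activiti's DelegateExecution, so the sketch compiles alone. */
    interface ExecutionInfo {
        String getCurrentActivityId();
    }

    /** An action plus the delay that only ROLLBACK_AND_RETRY_LATER uses. */
    static final class RecoveryDecision {
        final RecoveryAction action;
        final Duration retryDelay;

        private RecoveryDecision(RecoveryAction action, Duration retryDelay) {
            this.action = action;
            this.retryDelay = retryDelay;
        }

        static RecoveryDecision of(RecoveryAction action) {
            return new RecoveryDecision(action, Duration.ZERO);
        }

        static RecoveryDecision retryLater(Duration delay) {
            return new RecoveryDecision(RecoveryAction.ROLLBACK_AND_RETRY_LATER, delay);
        }
    }

    /** The global strategy the engine would consult when a task throws. */
    interface ErrorRecoveryStrategy {
        RecoveryDecision onError(ExecutionInfo execution, Throwable error);
    }

    /** Example: retry I/O failures in five minutes, leave the rest to the engine. */
    static final ErrorRecoveryStrategy RETRY_IO_LATER = (execution, error) ->
            error instanceof IOException
                    ? RecoveryDecision.retryLater(Duration.ofMinutes(5))
                    : RecoveryDecision.of(RecoveryAction.DEFAULT);
}
```

Returning a small decision object instead of the bare enum is just one way to let the handler say when a later retry should happen.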
To avoid an infinite retry loop, local variables containing the number of failed attempts and the date of the last one should be made available to the recovery handler. In any case, the engine should have additional default settings to handle double-fault scenarios (e.g. an exception in the error handler itself) more or less gracefully.
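A rough sketch of how the engine side could guard against both problems, again with made-up names (FailureInfo, MAX_ATTEMPTS, decide) and an assumed limit of three attempts. Falling back to DEFAULT in both cases is my assumption, not something Activiti does today:

```java
import java.time.Instant;

// Hypothetical sketch: none of these types are part of Activiti.
class RetryGuardSketch {

    enum RecoveryAction { DEFAULT, IGNORE, ROLLBACK_AND_RETRY_NOW, ROLLBACK_AND_RETRY_LATER }

    /** Failure bookkeeping the engine would expose to the handler. */
    static final class FailureInfo {
        final int failedAttempts;   // includes the failure being handled
        final Instant lastFailure;  // time of the previous failure, may be null

        FailureInfo(int failedAttempts, Instant lastFailure) {
            this.failedAttempts = failedAttempts;
            this.lastFailure = lastFailure;
        }
    }

    interface ErrorRecoveryStrategy {
        RecoveryAction onError(FailureInfo failures, Throwable error);
    }

    /** Assumed engine-level setting; the value is arbitrary. */
    static final int MAX_ATTEMPTS = 3;

    /** Asks the handler what to do, but never lets it loop forever or crash the engine. */
    static RecoveryAction decide(ErrorRecoveryStrategy strategy, FailureInfo failures, Throwable error) {
        RecoveryAction action;
        try {
            action = strategy.onError(failures, error);
        } catch (RuntimeException handlerFailure) {
            // Double fault: the handler itself failed, so use the engine default.
            return RecoveryAction.DEFAULT;
        }
        if (action == null) {
            return RecoveryAction.DEFAULT;
        }
        boolean wantsRetry = action == RecoveryAction.ROLLBACK_AND_RETRY_NOW
                || action == RecoveryAction.ROLLBACK_AND_RETRY_LATER;
        // Stop retrying once the attempt limit is reached.
        if (wantsRetry && failures.failedAttempts >= MAX_ATTEMPTS) {
            return RecoveryAction.DEFAULT;
        }
        return action;
    }
}
```

A handler that always asks for a retry would get ROLLBACK_AND_RETRY_NOW on the first two failures and DEFAULT from the third one on.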