Using SERIALIZABLE transaction isolation with Alfresco

matthewpatin
Champ in-the-making
This is a crosspost from stackoverflow:
http://stackoverflow.com/questions/33962022/using-serializable-transaction-isolation-with-alfresco


Background

Alfresco uses the database's default transaction isolation level, which for our Oracle database is READ_COMMITTED. I'm working on a project where non-repeatable and phantom reads could be a major issue, so I'm looking into using SERIALIZABLE transaction isolation to avoid them.

On one hand, we'll have a custom API service that groups changes into atomic transactions - basically CRUD operations on documents. On the other hand, we'll have background processes running in parallel, updating the metadata of these documents.

These operations will use Transactional Metadata Queries, so as not to further complicate matters by adding eventually consistent SOLR queries into the mix.
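
For illustration, forcing a metadata query to run transactionally against the database (rather than the SOLR index) looks roughly like the sketch below. This assumes an injected searchService, and the query value is made up:


import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.QueryConsistency;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchParameters;
import org.alfresco.service.cmr.search.SearchService;

SearchParameters sp = new SearchParameters();
sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
// TRANSACTIONAL answers the query from the DB, never from the SOLR index
sp.setQueryConsistency(QueryConsistency.TRANSACTIONAL);
sp.setQuery("=mymodel:doctypes1:\"VALUE_TO_REMAP\"");
ResultSet results = searchService.query(sp);
try {
    for (NodeRef nodeRef : results.getNodeRefs()) {
        // process each candidate node
    }
} finally {
    results.close(); // always release the result set
}
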

The goal is to be able to undertake a major metadata model migration while the API service is running. For the purpose of this question, I'm going to use one attribute as the example, but IRL there will be many changes of this sort. For example, we currently have a metadata field with a constraint: mymodel:doctypes1. But we need to remap the values in mymodel:doctypes1 to a new field with a different constraint: mymodel:doctypes2. (Don't ask me why, I have no control over this decision and I personally question the wisdom of this kind of change).

My understanding of READ_COMMITTED isolation tells me that in this scenario, we are very vulnerable to the following situation:

1. The background process starts a transaction and reads the value of mymodel:doctypes1.
2. The API writes a change to mymodel:doctypes1 before the background process commits.
3. The background process updates the value of mymodel:doctypes2 based on the original value of mymodel:doctypes1.

The two values are now inconsistent; I believe this anomaly is called a non-repeatable read.
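
To make the window concrete, here is a rough sketch of the background process's read-then-write; the QName namespace below is made up purely for illustration, and remap() is a placeholder:


import org.alfresco.service.namespace.QName;

// Namespace URI is an assumption for illustration only
QName PROP_DOCTYPES_1 = QName.createQName("http://example.com/model/mymodel/1.0", "doctypes1");
QName PROP_DOCTYPES_2 = QName.createQName("http://example.com/model/mymodel/1.0", "doctypes2");

String oldValue = (String) nodeService.getProperty(nodeRef, PROP_DOCTYPES_1);
// <-- under READ_COMMITTED, a concurrent API transaction can commit a new
//     doctypes1 value right here, after our read but before our write
nodeService.setProperty(nodeRef, PROP_DOCTYPES_2, remap(oldValue));
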

The questions

Would setting the Oracle database to SERIALIZABLE prevent this issue? Alfresco uses Spring transactions under the hood, and my understanding is that SERIALIZABLE transaction isolation with Spring transactions would prevent this issue from happening "transparently".
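
For reference, requesting SERIALIZABLE isolation through Spring's transaction abstraction would look roughly like this minimal sketch (plain Spring API, not Alfresco's actual wiring; txManager is assumed to be an injected PlatformTransactionManager):


import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionDefinition;
import org.springframework.transaction.TransactionStatus;
import org.springframework.transaction.support.DefaultTransactionDefinition;

DefaultTransactionDefinition def = new DefaultTransactionDefinition();
def.setIsolationLevel(TransactionDefinition.ISOLATION_SERIALIZABLE);
TransactionStatus status = txManager.getTransaction(def);
try {
    // ... transactional work ...
    txManager.commit(status);
} catch (RuntimeException e) {
    txManager.rollback(status);
    throw e;
}
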

Does anyone have any real-world experience setting an Alfresco database to SERIALIZABLE? Were you trying to solve a similar issue? Did it work? What kind of performance impact did it have for you?

Thank you very much for sharing your experiences!
4 Replies

afaust
Legendary Innovator
Hello,

Apart from transaction isolation, Alfresco also uses optimistic locking to manage concurrent updates to the same node. This means that in your scenario, either the background process or the "API" will get an update-conflict exception from the DAO layer, roll back the transaction and try again (automatic retry is another feature of Alfresco's DB handling). Thus it is - barring any freak errors - not possible that you'll end up with inconsistent property values on the same node.

Most likely it is the background process (the longer-running of the two) that will be affected and rolled back. To avoid wasting the time already spent on remapping, you could adapt the background process to work in intervals / batches (using the BatchProcessor utility). This of course means losing the transactionality of the background process as a whole, but it introduces multiple savepoints and reduces the chance of concurrent update issues that require a rollback.
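
For illustration, a batch-based remapping with the BatchProcessor utility could be sketched like this. It is only a sketch: getNextPageOfCandidateNodes() and remapDoctypes() are placeholders, the thread / batch sizes are arbitrary, and transactionService is assumed to be injected:


import java.util.Collection;
import org.alfresco.repo.batch.BatchProcessWorkProvider;
import org.alfresco.repo.batch.BatchProcessor;
import org.alfresco.service.cmr.repository.NodeRef;
import org.apache.commons.logging.LogFactory;

BatchProcessWorkProvider<NodeRef> workProvider = new BatchProcessWorkProvider<NodeRef>() {
    public int getTotalEstimatedWorkSize() { return -1; } // unknown up front
    public Collection<NodeRef> getNextWork() {
        return getNextPageOfCandidateNodes(); // placeholder paging query
    }
};

BatchProcessor<NodeRef> processor = new BatchProcessor<NodeRef>(
        "doctypes-remapping",                 // process name, used in logging
        transactionService.getRetryingTransactionHelper(),
        workProvider,
        4,      // worker threads
        100,    // entries per batch
        null,   // no ApplicationEventPublisher
        LogFactory.getLog("doctypes-remapping"),
        1000);  // log progress every 1000 entries

processor.process(new BatchProcessor.BatchProcessWorkerAdaptor<NodeRef>() {
    public void process(NodeRef nodeRef) throws Throwable {
        remapDoctypes(nodeRef); // placeholder for the per-node remapping
    }
}, true); // splitTxns = true: each batch commits in its own transaction
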

In my experience, changing the transaction isolation level is never necessary - Alfresco provides all the tooling to work with single and mass update scenarios while still providing consistent states, even when background processes and user interactions occur at around the same time.

The optimistic locking / versioning I mentioned deals with the possibility of dirty reads. Phantom reads can occur nonetheless. Sure, setting the SERIALIZABLE isolation level would help with that, but it would also introduce a possible DB error ("cannot serialize access") that Alfresco does not deal with, leading to user interactions failing to complete, with potentially cryptic messages for the end user. It would also be overkill to change the isolation level to handle a potential issue in one specific functionality out of hundreds…

Using batch processing, the simplest way to deal with phantom reads is to re-run the batch processor, which will then pick up any nodes it missed in the previous run.
Since you apparently plan to have an active functionality modifying the old property while the re-mapping is running, you'd also have to account for the possibility that this functionality modifies the old property after the re-mapping has completed (regardless of transaction isolation). Only if you disable the functionality that does this type of update before you start the re-mapping can you be entirely sure you won't end up with inconsistent / partial states.

Regards
Axel

matthewpatin
Champ in-the-making
Thanks, Axel, for your well thought-out answer. I agree with most of your points: as long as we get an exception on update, a properly designed process wouldn't run into issues. However, I'm pretty sure we wouldn't get an exception, and I was hoping you could help me see why you think we would. Is it the optimistic locking specifically?

First, I wasn't planning on using the batch processor; I was probably going to write a custom job, which I'll explain. Ideally, I want to optimize this process quite a bit, because we are dealing with a very large repository of millions of documents, and I'm still pretty sure that the kind of optimizations I want to make would leave us very vulnerable to non-repeatable read errors.

In theory, the fastest way to process the data would be to grab the nodes to update and their properties in one pass. I believe this is possible from past experience, but it's been a long time since I tried, so I'm not sure whether we still can. If this were the case, we could avoid iterating over every candidate node and inspecting its properties individually: we could just grab 1,000 nodes and their properties and create the updates from the original values. With this approach, I believe we are quite open to non-repeatable read errors, especially if we don't change the isolation level.

Do you still think a job like this can't put the data in an inconsistent state? Would the batch select optimistically lock all of the nodes, so the custom API would get an exception if it tried to make a change? That would be fine too!

Best regards,
Matthew

afaust
Legendary Innovator
Hello,

If you try to use aggressive optimizations that end up bypassing the standard Alfresco NodeService / NodeDAO, you will be in danger of consistency issues. For example, if you write your own DB query / update scripts (which is a possibility, and one I have used in very specific migration use cases), you inevitably render the optimistic locking feature moot and also risk running into stale cache data issues, especially if you are running a clustered Enterprise setup.

If you write a Quartz job and use the standard NodeService / NodeDAO, you can still run into DB consistency issues - including non-repeatable reads - if your job is not made transactionally safe. This is more a matter of proper job implementation than of transaction isolation. Any job that needs to be transactionally safe must use the RetryingTransactionHelper.doInTransaction utility to encapsulate its code in one atomic transaction - you can't rely on service-level transaction handling if the job has to orchestrate more than one service call. Even a transaction isolation level of SERIALIZABLE would not help if the job isn't properly implemented.
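
For illustration, wrapping the remapping of a single node in one atomic, retryable unit would look roughly like this sketch (remap() and the QName constants are placeholders from earlier in the thread; transactionService and nodeService are assumed to be injected):


import org.alfresco.repo.transaction.RetryingTransactionHelper;
import org.alfresco.repo.transaction.RetryingTransactionHelper.RetryingTransactionCallback;

RetryingTransactionHelper txHelper = transactionService.getRetryingTransactionHelper();
txHelper.doInTransaction(new RetryingTransactionCallback<Void>() {
    public Void execute() throws Throwable {
        // read and write happen in the same transaction; on a concurrency
        // conflict the whole callback is rolled back and re-executed
        String oldValue = (String) nodeService.getProperty(nodeRef, PROP_DOCTYPES_1);
        nodeService.setProperty(nodeRef, PROP_DOCTYPES_2, remap(oldValue));
        return null;
    }
}, false, true); // readOnly = false, requiresNew = true
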

If you use NodeService / NodeDAO and the job is executed transactionally, then - from past experience, including processing hundreds of thousands to millions of nodes - you will not run into inconsistent state, and optimistic locking will catch any concurrent update conflicts. You will not necessarily get an exception, because on a concurrent update conflict the RetryingTransactionHelper will swallow the exception if the transaction can be completed successfully in another attempt (the default is up to 40 attempts / retries). Only when all retries / attempts are exhausted is the last exception propagated to the next layer / end user. In a lot of systems there are concurrent update conflicts on a regular basis without anyone being aware of them, as they are resolved within one or two retries.
Only if you enable debug logging for the RetryingTransactionHelper will you see the amount of silent magic this helper is doing.
So if your custom API tries to make a change and runs into an update conflict, the transaction will roll back, retry and likely succeed (based on the then transactionally re-read, up-to-date node state) - without any exception escalated to the user.

Regards
Axel

matthewpatin
Champ in-the-making
Thank you so much for this, Axel. Just to make sure, I did a little mock-up to provoke a ConcurrencyFailureException and check that it does indeed behave correctly.

I use a master class to submit 49 updates for the same node to the thread pool:


         // Submit 49 concurrent update tasks for the same node
         for (int i = 0; i < 49; i++) {
            TestIncrementer ti = new TestIncrementer(node);
            threadPoolExecuter.submit(ti);
         }



My incrementer just reads the existing title, sleeps for a random interval, and then appends to it:


// Log the active transaction's hash so each retry (a new transaction) is visible
int hash = RetryingTransactionHelper.getActiveUserTransaction().hashCode();
log.debug("Thread id: " + Thread.currentThread().getId() + " tx " + hash);
// Read the current title, sleep up to a second, then write back an appended value
String title = (String) nodeService.getProperty(nodeRef, ContentModel.PROP_TITLE);
Thread.sleep(Math.round(Math.random() * 1000));
nodeService.setProperty(nodeRef, ContentModel.PROP_TITLE, title + "1");
log.debug("Title: " + title);


As expected, I'm seeing concurrency exceptions and retries:

Transaction commit failed:
   Thread: defaultAsyncAction4
   Txn:    UserTransaction[object=org.alfresco.util.transaction.SpringAwareUserTransaction@65b249f2, status=0]
   Iteration: 8
   Exception follows:
org.springframework.dao.ConcurrencyFailureException: Failed to update node 126094

But they are getting retried.

The final retry leaves the node with exactly 49 appended "1"s - which is exactly what needs to happen!

Thanks for your help!

Best regards,
Matthew