cancel
Showing results for 
Search instead for 
Did you mean: 

Thread/Transaction-safe action rules for massive concurrency

iru
Confirmed Champ
Confirmed Champ

Hi there,

Got a task that seems easy but is getting us lots of headaches.
Functionally speaking, what we need is to create a custom rule in a folder, that given a file that is being uploaded, moves it into another folder, based on its creation date.

  • File enters: Inbox
  • File is moved to:  Folderized / yyyy / MM / dd / ... (in production up to minutes, for example)


Rule must implement two subtasks

  1. create the folderized structure if not existing
  2. move the file into the folderized structure

____


Now, because this must work thread-safely, on a concurrent batch of file upload, it may happen that multiple threads enter the rule at the same time, and here start the problems.

  • We started using the synchronized word, in the upper method (ActionExecuterAbstractBase#executeImpl), inner methods, or blocks of code, but this does not seem to work because even if we synchronize, the transaction may have not been committed, hence, no task1 output is available to other threads. Added a sleep but cannot ensure that it's done anyway and fails sometimes.

  • We also tried the RetryingTransactionHelper (RTH) workaround exposed by Alexey Vasyukov in the lightning talk about 'Alfresco repo under concurrent write load', of this year's beecon, but got several problems too.

I've already asked how this helper is supposed to work when the flag 'requiresNew' is set to false, because it seems not to retry, or we may be misunderstanding something.
https://community.alfresco.com/message/815007-re-retryingtransactionhelper-learnings?commentID=81500...

The problem with this utility is that it behaves differently on a postgres based server vs. mysql based server (we're using latest alfresco520,  mysql server-core 5.7, with 5.1.32 driver , Ubuntu 16; Updated  checked InnoDB for all tables for transacational management)

Updated Tested in SQLServer too, and gives different behaviour from previous ones too.

The approach that works fine in Postgres is having task 1 under a RetryingTransactionHelper (and new transaction), and don't go into task 2 until it's successfully performed.
Task 2 is not executed under a RTH.

However when we try this with mysql, fails. Guess this error is launched because task 1's output is still not fully operable (due to being made inside a transaction) to the parent transaction.

Updated- MySQL  (first task with RTH, second task not - NonRootNodeWithoutParentsException)
- src: https://pastebin.com/raw/DD0bWjW2
- action context: https://pastebin.com/CVWS9U9q
- error: https://pastebin.com/raw/gx2TkWxn

Updated-SQLServer (DataIntegrityViolationException)

- error: https://pastebin.com/raw/BZSTKDUX 

We also tried to insert the second task inside RTH.
With the requireNew flag set to true seems as if the node is not visible to the code within the transaction and when trying to move it, the node is not found.
Guess because it's a new transaction and does not know about the "working in progress node" (seems some AOP implementation)

Updated Clarify that this happens no matter the database provider, so it's nice to see it's coherent.

Updated (both tasks with RTH, with requiresNew - InvalidNodeRefException: Node does not exist)
- src: https://pastebin.com/CvrRSt3X
- action context: same as before

- error: https://pastebin.com/raw/yVNc8YWf

If we insert it with a requireNewFlag set to false, it gives the previous NonRootNodeWithoutParentsException error.

______

We've tried other approaches/combinations and got several kinds of errors, but tried to summarize everything as much as possible.

The only solution that has seemed to work (in any database provider, concurrently) is when rule is executed in background. However, this is not desired; we would like it to be sequential, so final user can know if the action has been correctly performed.

We could also have a cron-like task, that populates the folderized structure, but will lead to lots of empty folders if no data is uploaded and would have to purge it afterwards.

Any suggestions on how to approach this?


Would provide code, but its something quite simple using fileFolderService, both to create and move. We've  analysed the code that is in the default "MoveActionExecuter" action rule provided in the core of Alfresco, and its quite similar to ours.

Thanks in advance.

12 REPLIES 12

afaust
Legendary Innovator
Legendary Innovator

Ok - so you are not using the default retryingTransactionHelper but a custom instance with some custom extra exceptions. This is not something that I would recommend. For one, most of the exceptions included are not exceptions where a retry will have a chance to "fix" anything - some of them are indicators of serious developer errors, e.g. regarding not validating input / trigger conditions. Basically you are retrying to try and work around (potentially your own) developer errors...
Never have I encountered a situation where the default retrying transaction helper would not suffice.

I also see some potential issue with using the FileFolderUtil.makeFolders utility. This will also traverse and retrieve folder paths that the current user is not allowed to access, and the result may be unusable due to restricted permissions.

One problem you face in both of your constellations is that the FileFolderService.exists() check in line 90/91 will/should always fail on the first try of the action when the target folder does not exist. This is because the new folder will be created in a nested transaction and will not be visible to the outside transaction for the exists() check. As a result of that (I assume) the outer transaction will rollback and retry the entire action.

The issue with the NonRootNodeWithoutParentsException looks like it may have something to do with incorrect isolation levels on the DB tier and/or a potential bug in the parentAssocs cache inside the NodeDAO. Unfortunately, that cache is a custom implementation and not one of the regular caches, so it may be prone to issues not noticed in any of the other caches.
Is the MySQL server configured with the right default transaction level, and with the configuration parameters as recommended in Alfresco documentation? Normally, before the exception is thrown, Alfresco will check the database to load the parents of a node if they have not been cached yet, and for a node that has just been uploaded it should be able to find the parents (unless some other custom code has messed with the DB in the meantime or the isolation level is not working properly).

iru
Confirmed Champ
Confirmed Champ

Thanks for the recommendations regarding the RTH and FileFolderUtil.makeFolders usage, will take them into consideration.

Regarding the FileFolderService.exists, seems not to fail due to target folder being the object that is returned from the previous RTH task (plus we're not seeing the trace placed in the else statement).

As shown in the traces, the problem occurs when we're trying to move the just uploaded document into the just created folder.
__

NonRootNodeWithoutParentsException its an odd one and could understand it's due to some cache bug, but what about the second one; InvalidNodeRefException: Node does not exist?
This, if I understood it correctly, is referring to the fact that the transaction has no context of the node that is being uploaded itself.
__

We've followed the instructions within the configuration documentation http://docs.alfresco.com/5.2/tasks/mysql-config.html

The only aspect it talks about transaction levels, is that the database tables are to use InnoDB, and that's done correctly by the population script of Alfresco itself.

Have also read  http://docs.alfresco.com/4.0/concepts/mysql-config-settings.html

where it recommends to set the "innodb_locks_unsafe_for_binlog = 1"

I'm no DBA, but is there any specific "transaction level"  or "configuration parameters" we might be missing?

As read, the default transaction isolation level for mysql innodb is REPEATABLE-READ.
__

Been going back to previous simpler implementations, taking into consideration your recommendation regarding "CMIS endpoints should always trigger a retrying transaction automatically which would be propagated to any action called from a rule", but still don't see the correct way to implement it.

If multiple threads enter the rule, and only one is to create the folder structure... how should we develop the rule thread-safely?

FileFolderUtils.makeFolders, for instance, should work, as it states in the javadoc: checks for the presence of, and creates as necessary, the folder structure in the provided path

But "FileExistsException" is thrown for non first threads that arrive, and CMIS RTH does not contemplate it as retryable.

- src: https://pastebin.com/nJwQwTmG

- error: (4 thread execution in concurrency, 1 ok, 3 fail)  https://pastebin.com/raw/M23c5j8C

Could force a while statement, until node is created or resolved, but again we're going into deep mud.
It's silly.. but any recommendations on this? Will try anything x)


Best workaround up to the point, using 1 RTH for the folder creation, to be tested in sql server and hope for the best :rolleyes:

afaust
Legendary Innovator
Legendary Innovator

If you look at your own error stacktrace (https://pastebin.com/raw/yVNc8YWf ) you can see that a couple of lines above "org.apache.chemistry.opencmis.server.impl.atompub.CmisAtomPubServlet.service" there is an instance of "org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction" triggered by the RetryingTransactionInterceptor. That is the retrying behaviour of the CMIS endpoint I am referring to - and this will apply to most (if not all) of the default Alfresco endpoints.

I wish you would stop referring to "thread-safety" - what you need/want is "transaction-safety", which is a completely separate technical concept, and by continuing to use "thread-safety" you risk confusing other people / community members that may stumble over this thread.

"If multiple threads enter the rule, and only one is to create the folder structure... how should we develop the rule thread-safely?"
Typically this would be very simple: Stop trying to manage transactions manually and rely on the retrying behaviour of Alfresco. As I already outlined in previous responses, multiple transactions may enter the rule and create the folder structure in parallel, but only one will be allowed to commit while all others will be rejected by the DB with update conflicts, causing the transactions to retry and then they will find the folder structure already created by the one transaction that succeeded.

"But "FileExistsException" is thrown for non first threads that arrive, and CMIS RTH does not contemplate it as retryable."
And that is correct - a FileExistsException is never retryable. If a file already exists with the same name that means there exists a node already committed to the DB, and there is no chance at all that simply retrying the current transactions will see the other node vanish (unless some user / code may be deleting it in a parallel transaction).

"NonRootNodeWithoutParentsException its an odd one and could understand it's due to some cache bug, but what about the second one; InvalidNodeRefException: Node does not exist?"

And that would be the correct exception when you create the folder structure in a nested transaction using a RTH, and then try to use that to move something in the outer transaction. Due to transaction isolation, the NodeRef does indeed not exist (yet) for the outer transaction, triggering an InvalidNodeRefException.

"FileFolderUtils.makeFolders, for instance, should work, as it states in the javadoc: checks for the presence of, and creates as necessary, the folder structure in the provided path"

Yes, but the JavaDoc does NOT mention that the lookup will be done with system privileges which means it can return existing folders that the user is not allowed to access and thus cause AccessDeniedException failures later on. Never take JavaDoc comments for 100% accurate without checking the implementation.

iru
Confirmed Champ
Confirmed Champ

Morning,

Some quick clarifications.

- Yep, it's clear to me the RTH involvement in the CMIS endpoint. Thanks for re-pointing it out

- Yep, you're right, maybe it's more a transaction issue, but from the client's point of view, it's a multiple-concurrent thread test, even if on the server side, as seen, it's leading to multiple-transactions.

- Related to the InvalidNodeRefException, the nodeRef ID the exception is pointing to its not of the just created folder, but of the just uploaded file.
To me, the problem is not in the just created folder, but in the fact that in an inner transaction snapshot, due to isolation, the just just to be uploaded node does not exist.
To reinforce my statement, this happens too when we have 1 RTH with both actions (create and move) within the same block of code.
Just for clarification, this happens in both our mysql and postgres setup.

- Yep, FileFolderUtils.makeFolders is done with system privileges, checked code besides javadoc.
However, no AccessDeniedException is raised in this tests and it's not what we're discussing, although thanks for re-pointing it out. What I was empathizing is how it's programmed, first check, then create if not existing (just what we want).

"If a file already exists with the same name that means there exists a node already committed to the DB"
That's what FileFolderUtils.makeFolders deals with;
1. first checks > node does not exists
2. then tries to create > node exists and gives the exception (created by another thread / transaction)

However, this method does throw "FileExistsException" internally, which, in first instance does not make sense if its job is to check first.

Guess database isolation again.
We do had some custom code in order to create the hierarchical structure of folders, using FileFolderService in a recursive check/creation loop, but wanted to simplify the test and that's why we opted for FileFolderUtils, thought it would be better a product provided solution than our own code.
__

Yesterday we tested it on an SQLServer based installation, and got other round of different exceptions, will update first post now.
__

To sum it up, we still don't know howto solve this within the rule implementation.
- Using 1 RTH for the folder creation, leads to different kind of exceptions, different depending on the database provider, like  mysql and sqlserver (although not in postgres)
- Not using RTH, leads to some threads not to be aware of the "initizliation" of the folderization structure that only one of the batch will create.
When multiple threads (with multiple transactions) try to create the folders FileExistsException is thrown on all but one. As this is not retryable, n-1 errors are thrown to the client-side, being batch size n.
Relying on the RTH of the cmis endpoint seems not to be a valid solution, as the exception is not handled as a retryable one.

We will try to move part of the code to our custom API, hoping CMIS deals ok with it and hope we got some clarifying results.
__

Even if we seem to be in a vicious replying circle, insight appreciated @Axel, I'm sure your highlights will help someone else.
pd: Feel free to change any misleading word/concept that may disrupt other community members. If not you, some moderator should be able to do so.

afaust
Legendary Innovator
Legendary Innovator

I am actually a moderator, so there should be no issues. Though I really don't want to go about editing any replies because it could be even more confusing.

Ok, the InvalidNodeRef error inside a nested transaction is of course an issue with the isolation / non-committed state.

Multiple threads in multiple transactions running through the makeFolders code should not run into FileExistsException (again, issue with isolation may be cause here) - there should be something along the line of DeadlockLoserException or ConcurrencyFailureException, which are retryable.

I wouldn't call the reply-cycle "vicious" - just long-running since we don't seem to come to a simple, clear-cut answer to what is going wrong.