Nuxeo cluster and SQL Server shared instance problem

arubiom00_
Star Contributor

We have a Nuxeo 5.9.3 cluster consisting of two Tomcat servers and a SQL Server geo-cluster. The SQL instance is shared among many applications.

We have observed that the cleanup of the cluster_invals and cluster_nodes tables was not working properly, and Nuxeo performance plummeted (due to some 30 rows in cluster_nodes and millions in cluster_invals). After doing some research, we found that the way Nuxeo handles the cleanup of those tables only works if you have a dedicated SQL Server instance.
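For anyone who wants to check for the same symptom, a couple of plain T-SQL queries against the Nuxeo database make the imbalance visible (the table and column names below are the ones from the standard Nuxeo schema):

    -- Rough health check: cluster_nodes should stay small (roughly one row per
    -- live Nuxeo node) and cluster_invals should be drained by the running nodes.
    SELECT COUNT(*) AS node_rows FROM [cluster_nodes];
    SELECT COUNT(*) AS inval_rows FROM [cluster_invals];

    -- When the cleanup fails, stale node rows pile up with old "created" timestamps.
    SELECT nodeid, created FROM [cluster_nodes] ORDER BY created;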

We were able to work around the problem by:

  • Adding a column to cluster_nodes (hostname nvarchar(128)); an ALTER TABLE sketch is shown below.

  • Modifying the file sqlserver.sql.txt included in nuxeo-core-storage-sql-5.9.3.jar. Nodes are now deleted and inserted this way:

      -- DB Collation is CI but Nuxeo DB is CS
      DELETE FROM [cluster_nodes] WHERE hostname =
        (SELECT host_name COLLATE DATABASE_DEFAULT FROM sys.dm_exec_sessions S
         WHERE S.session_id = @@SPID)
    
      INSERT INTO [cluster_nodes] (nodeid, created, hostname)
        SELECT @@SPID, CURRENT_TIMESTAMP, host_name FROM sys.dm_exec_sessions S
        WHERE S.session_id = @@SPID
    

The work-around can easily be improved (it doesn't currently work for multiple Nuxeo instances on the same node).
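For reference, the first bullet corresponds to something like the following (a minimal sketch; the column name and size are the ones given above):

    -- Extra column used by the modified cleanup: the client host name,
    -- as reported by sys.dm_exec_sessions.host_name.
    ALTER TABLE [cluster_nodes] ADD hostname NVARCHAR(128) NULL;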

Do you think there is a better approach for this problem? Should it be reported as a bug?

Thanks in advance.

1 ACCEPTED ANSWER

Florent_Guillau
World-Class Innovator

I can confirm that 5.9.4 indeed has this problem, but it's fixed for the upcoming 5.9.5 (to be released in around two weeks). I don't have an exact ticket to pinpoint when the problem was introduced or when it was fixed, but we've made a number of changes to the repository shutdown mechanism and one of them fixed it.

You can try a 5.9.5-SNAPSHOT from http://qa.nuxeo.org/jenkins/job/IT-nuxeo-master-build/ if you want to confirm.


14 REPLIES

Florent_Guillau
World-Class Innovator

Thanks for the report. I opened NXP-14795 to track this issue and its resolution.

We haven't tested or validated Nuxeo with a clustered SQL Server. Can you explain why the host_name is needed in addition to the session_id? Isn't the session_id globally unique on the server cluster? Also, host_name is provided by the client application and is marked as unreliable in the documentation.

arubiom00_
Star Contributor

Hi Florent,

The problem is not the SQL cluster but the use of a shared instance. The original cleanup is:

    DELETE FROM N FROM [cluster_nodes] N WHERE
      HAS_PERMS_BY_NAME(null, null, 'VIEW SERVER STATE') = 1
      AND NOT EXISTS(
        SELECT 1 FROM sys.dm_exec_sessions S WHERE
          S.is_user_process = 1 AND N.nodeid = S.session_id);

When there are connections that do not originate from Nuxeo, this doesn't work as expected. The session_id is unique, but it may not belong to a Nuxeo session under these circumstances. Consider the following scenario (a diagnostic query to inspect the colliding session follows the list):

  1. One Nuxeo server is started and connects to the database with session_id=50 (one row with that session_id is added to cluster_nodes).
  2. That Nuxeo server goes offline.
  3. Another session is started from a different application (e.g. Team Foundation Server); it connects to the database and gets session_id=50.
  4. The Nuxeo instance is started again. The old row in cluster_nodes is not removed because there is an active session with session_id=50.
  5. A new row is added. As a result, the same node has added two rows to cluster_nodes. In a fairly busy database, after many restarts you end up with far too many rows in cluster_nodes.
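If you want to see who actually holds a given session_id at that point, sys.dm_exec_sessions shows it (session 50 is just the number from the example above):

    -- Inspect which client currently owns the session id from the example;
    -- on a shared instance this can be any application, not just Nuxeo.
    SELECT session_id, host_name, program_name, login_name, login_time
    FROM sys.dm_exec_sessions
    WHERE session_id = 50 AND is_user_process = 1;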

Adding host_name allows us to clean up all rows previously inserted from one node. It's not a full solution and there are still some situations where it doesn't work:

  • Multiple Nuxeo Instances in the same host.
  • Another application in the same host that connects to the same database.

If there were some field or parameter that uniquely identified the instance, using it would solve the problem. Is there such a thing?
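Just to illustrate the kind of marker I have in mind (purely an idea, not something Nuxeo does today): SQL Server lets a connection tag itself with CONTEXT_INFO, which other sessions can read back from sys.dm_exec_sessions, so each instance could in principle stamp its own identifier:

    -- Illustration only: tag the current connection with a per-instance marker.
    -- 'nuxeo-node-1' is a hypothetical identifier, not something Nuxeo sets.
    DECLARE @instance_id VARBINARY(128) = CAST('nuxeo-node-1' AS VARBINARY(128));
    SET CONTEXT_INFO @instance_id;

    -- The marker is then visible to other sessions, e.g. for cleanup queries.
    SELECT session_id, host_name, context_info
    FROM sys.dm_exec_sessions
    WHERE session_id = @@SPID;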

But in step 2, if the Nuxeo server is correctly shut down then it will clean up its entry in cluster_nodes.

In that case we must have some problem in our instances, because cluster_nodes is only cleaned up when an instance starts. That's been the behaviour since we installed Nuxeo.

Hi Florent,

If you look at sqlserver.sql.txt you'll see a section

As the test showed, apparently no DELETE is executed during shutdown.

No SQL traces appear during shutdown, even if I enable more logging.

It's working correctly in the current 5.9.5-SNAPSHOT version; we have refactored a few things related to repository shutdown. Maybe try 5.9.4 first? Your 5.9.3 is a Fast Track version, so you should always upgrade as soon as possible.