Hyland Connect

konsultex · ‎10-20-2022

Hi,

I'm running a custom scheduled job that processes documents in the repository in batches of 9,000. The repository holds many millions of documents, but that is not a problem for me because Alfresco can still be used in the meantime. This question is related to my other post https://hub.alfresco.com/t5/alfresco-content-services-forum/java-heap-space-with-java-class-custom-a... . I can process about 36,000 documents per hour because the job is started every 15 minutes. Processing a batch takes between 6 and 11 minutes, so Alfresco has some time to "settle down". The job runs an AFTS query on Solr like this:

+TYPE:"cm:content" AND -cm:creator:"System" AND -TYPE:"cm:person" AND -TYPE:"lnk:link" AND -TYPE:"fm:post" AND -TYPE:"dl.issue" AND -TYPE:"dl.event" AND -TYPE:"dl.todolist" AND -TYPE:"cm.post" AND +cm:created:["1900-01-01" TO "2022-10-21T01:53:00.356281"]

During development and testing with the SDK I noticed occasional Solr time-outs, so I handled them in the code. The code tries to execute the query up to 10 times, waiting 45 seconds between tries. If it still can't get a reply after that, the entire batch is retried. Obviously this is a major drag on overall performance. When a job starts, a time-out is very unlikely (though it does happen sometimes), but as more batches are processed the time-out problem gets worse and worse. Sometimes the time-out happens when the query is executed (searchService.query(sp);) and sometimes when the next set of 900 documents is fetched (rs.getNodeRefs();). Although Alfresco could be used during processing, right now it isn't, so the only load is the job.
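The retry behaviour described above could be sketched roughly as follows. This is not the job's actual code: the class name SolrRetry and the method withRetry are made up for illustration, and the sketch assumes the search call signals a time-out by throwing a RuntimeException.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Alfresco: retries a call a fixed number of times.
final class SolrRetry {

    /**
     * Runs the call up to maxAttempts times, sleeping waitMillis between failed attempts.
     * If every attempt fails, the last exception is rethrown so the caller can
     * decide to retry the whole batch.
     */
    static <T> T withRetry(Supplier<T> call, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // assumption: a Solr time-out surfaces as a RuntimeException
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
        throw last;
    }
}
```

With the values from the post, the call would look something like SolrRetry.withRetry(() -> searchService.query(sp), 10, 45_000L). Note that ten failed attempts with 45 seconds between them already cost more than six minutes of waiting before the batch is retried.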

After about 12 hours of processing it gets so bad that entire batches are redone a few times before they can finish. We then restart Alfresco and the job picks up where it stopped. In the testing environment this is OK, but I could not do this in production.

The processing logic reads a query offset (skipCount for the query) and does 10 queries of 900 documents. After the batch is processed, the offset is updated (in a Postgres table) and the job waits to be run again.
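To make the paging arithmetic concrete, here is a minimal sketch of how the skipCount values for one batch follow from the stored offset. The class and method names are hypothetical; only the figures (10 queries of 900 documents) come from the description above.

```java
// Hypothetical illustration of the offset arithmetic, not the job's actual code.
final class BatchPaging {
    static final int PAGE_SIZE = 900;       // documents per query, from the post
    static final int PAGES_PER_BATCH = 10;  // queries per batch, from the post

    /** skipCount for each of the 10 queries in a batch that starts at storedOffset. */
    static int[] skipCounts(int storedOffset) {
        int[] result = new int[PAGES_PER_BATCH];
        for (int page = 0; page < PAGES_PER_BATCH; page++) {
            result[page] = storedOffset + page * PAGE_SIZE;
        }
        return result;
    }

    /** Offset to persist once the whole batch has been processed. */
    static int nextOffset(int storedOffset) {
        return storedOffset + PAGES_PER_BATCH * PAGE_SIZE;
    }
}
```

At four batches per hour, 12 hours is 48 batches, so the stored offset reaches 48 × 9,000 = 432,000 and every later query asks Solr to skip at least that many results.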

This is the part of docker-compose.yml relevant to Solr.

solr6:
        restart: always
        build:
          context: ./search
          args:
            SEARCH_TAG: ${SEARCH_CE_TAG}
            SOLR_HOSTNAME: solr6
            ALFRESCO_HOSTNAME: alfresco
            ALFRESCO_COMMS: secret 
            CROSS_LOCALE: "true"
        mem_limit: 24576m
        environment:
            #Solr needs to know how to register itself with Alfresco
            SOLR_ALFRESCO_HOST: "alfresco"
            SOLR_ALFRESCO_PORT:  "8080" 
            #Alfresco needs to know how to call solr
            SOLR_SOLR_HOST: "solr6"
            SOLR_SOLR_PORT: "8983"
            #Create the default alfresco and archive cores
            SOLR_CREATE_ALFRESCO_DEFAULTS: "alfresco,archive"
            SOLR_JAVA_MEM: "-Xms10240m -Xmx16384m" 
            SOLR_OPTS: "
                -XX:NewSize=5120m
                -XX:MaxNewSize=5120m
                -Dalfresco.secureComms.secret=xamw94vet9o 
            "
        volumes: 
            - ./data/solr-data:/opt/alfresco-search-services/data

(See SOLR_JAVA_MEM: "-Xms10240m -Xmx16384m" )

I increased the Solr container size but it did not help.

I also noticed this drop in performance a few weeks ago when trying an external shell script (/bin/bash) that used the REST API with curl, but back then I wasn't sure what caused the problem and switched to Java running in Alfresco. This makes me think it's not a Java problem. Solr appears to be too busy or stuck on something.

My question is: what causes Solr to time out more and more often, and how can this be tuned?

Thanks.

angelborroy · ‎10-21-2022

I guess that the problem may come from this offset / skipCount approach. Solr performance is affected when skipping a large number of results.

I'd try to set a marker on processed nodes so they are excluded from future runs of the scheduled job. That way you can get a set of nodes without using the "skipCount" setting.
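As a rough sketch of that idea, the job's query could exclude already-processed nodes instead of paging past them. The aspect name my:processed is purely hypothetical (it would have to be defined in a custom content model and applied by the job); a tag or a property would work along the same lines.

```java
// Hypothetical sketch: build an AFTS query that skips nodes already marked as processed.
final class UnprocessedQuery {

    // Assumption: a custom marker aspect with this name exists in the content model.
    static final String PROCESSED_ASPECT = "my:processed";

    /** Appends an exclusion clause for the marker aspect to an existing AFTS query. */
    static String build(String baseQuery) {
        return baseQuery + " AND -ASPECT:\"" + PROCESSED_ASPECT + "\"";
    }
}
```

Each run would then request the first 900 matches with skipCount 0 and mark them once processed. One caveat: the Solr index is eventually consistent, so freshly marked nodes can still show up in results until the change has been indexed, and the job should tolerate seeing a node twice.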

Hyland Developer Evangelist


konsultex · ‎10-23-2022

Hi Angel,

Thanks for the explanation of the limitation of skipCount. Before the skipCount approach I used to set a tag on the documents, but the Java heap problem didn't help me when using an action 😉 I'll now go back to that and mark the nodes with a tag.

Alfresco 7.2 container solr time-outs