10-14-2022 12:41 PM
Hi,
I have Alfresco ECM 7.2 (Community) running in Docker with docker-compose in a VM (Ubuntu 22.04.1) with very good specs: 12 CPU cores and 128 GB RAM in total. This is a default 7.2 installation done with Angel Borroy's tool, and the JVM is OpenJDK. The repository has about 15 million documents, some of which have up to 20 versions. I need to clean up this system. I have an Oracle table with a column holding the nodeIds that need to be kept (almost 4.5 million), and for the nodes to keep I need to delete all versions except the first 3. The nodeIds should be the same after the process.
My algorithm is:
1) Do an AFTS query with batches of 900 nodes and keep doing this until there are no more results. The AFTS query filters out system documents, posts, links, persons, etc. The idea is to only bring documents that are relevant to the business.
2) For each node check to see if it's in the Oracle table.
3.1) If it is, check how many versions it has and delete all but the most recent 3. I also tag this document if versions are deleted.
3.2) If it is not, delete it.
This processes about 60,000 documents per hour (even though the server is good, the attached disk in testing is slow).
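The per-node decision in steps 2 to 3.2 could be sketched roughly as below. This is only an illustration, not the actual action code: `CleanupPlanner`, `Action`, and both method names are hypothetical, the keep list is assumed to be available as an in-memory `Set`, and version labels are assumed to arrive ordered oldest first. It follows step 3.1 and keeps the most recent versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch of the per-node decision in steps 2 to 3.2; all names are hypothetical. */
final class CleanupPlanner {

    /** Outcome for a single node. */
    enum Action { DELETE_NODE, PRUNE_VERSIONS, NO_CHANGE }

    /** Nodes missing from the keep list are deleted; kept nodes are pruned only if needed. */
    static Action decide(String nodeId, Set<String> keepIds, int versionCount, int versionsToKeep) {
        if (!keepIds.contains(nodeId)) {
            return Action.DELETE_NODE;
        }
        return versionCount > versionsToKeep ? Action.PRUNE_VERSIONS : Action.NO_CHANGE;
    }

    /**
     * Returns the version labels to delete, assuming the input is ordered
     * oldest first and the newest versionsToKeep entries must survive.
     */
    static List<String> versionsToDelete(List<String> labelsOldestFirst, int versionsToKeep) {
        if (versionsToKeep < 0) {
            throw new IllegalArgumentException("versionsToKeep must not be negative");
        }
        int surplus = Math.max(0, labelsOldestFirst.size() - versionsToKeep);
        return new ArrayList<>(labelsOldestFirst.subList(0, surplus));
    }
}
```

Keeping this logic free of repository calls makes it easy to unit test separately from the code that actually deletes nodes or versions.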
I have set it up so that a trial run in production only writes a log of what would be done. Whether the action does a trial run or actually touches documents is controlled by a parameter in alfresco-global.properties.
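A trial-run switch of this kind might look something like the following sketch. The property name `cleanup.dryRun` and the `DryRunGate` class are made up for illustration (the real parameter name is not given here), and the sketch defaults to a trial run when the property is missing so that nothing is touched by accident.

```java
import java.util.Properties;
import java.util.function.Consumer;

/** Sketch of a trial-run gate; the class and property name are hypothetical. */
final class DryRunGate {

    /** Hypothetical property name, not the one actually used in alfresco-global.properties. */
    static final String PROP = "cleanup.dryRun";

    /** Defaults to a trial run when the property is not set. */
    static boolean isDryRun(Properties props) {
        return Boolean.parseBoolean(props.getProperty(PROP, "true"));
    }

    /** Logs the intended change and runs it only outside a trial run; returns true if it ran. */
    static boolean apply(boolean dryRun, String description, Runnable change, Consumer<String> log) {
        log.accept((dryRun ? "WOULD " : "DOING ") + description);
        if (dryRun) {
            return false;
        }
        change.run();
        return true;
    }
}
```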
The class is written to make it easy for the garbage collector to keep the heap small. My maximum heap is 32 GB as reported by OpenJDK. I release result sets and connections as soon as I no longer need them. I have very few class variables, and those are private static. The class extends ActionExecuterAbstractBase.
I have a development environment (SDK 4.x, for Alfresco 7.2) and a test environment which is a copy of production. Everything works great in development, both when doing a trial run and when deleting.
The problems are:
1) Even when doing a trial run without touching the documents I notice that heap space grows up to the limit after about 20% is processed.
2) I notice that during operation, changes are not visible until the action stops (for example the tags). This makes me think that uncommitted changes are taking up the heap. Could this be possible? In JavaScript there's a document.save() method, but I haven't found an equivalent in Java.
3) I noticed rather frequent Solr time-outs and handled them by retrying, but sometimes even after 5 minutes the repo and Solr don't talk to each other.
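For reference, retrying a flaky call with an increasing wait between attempts could be sketched as follows. This is a generic illustration and not the retry code actually used in the action: the `Retry` class is hypothetical, and the failing call is assumed to signal a time-out with an unchecked exception.

```java
import java.util.function.Supplier;

/** Sketch of retry with exponential backoff; the class name is hypothetical. */
final class Retry {

    /**
     * Calls the supplier until it succeeds or maxAttempts is reached,
     * doubling the wait after every failed attempt.
     */
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up and surface the last failure
                }
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", ie);
            }
            delay *= 2;
        }
    }
}
```

Capping the number of attempts means a search service that stays unreachable shows up as an error instead of a silent stall.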
docker-compose.yml:
version: "2"
services:
  alfresco:
    restart: always
    build:
      context: ./alfresco
      args:
        ALFRESCO_TAG: ${ALFRESCO_CE_TAG}
        DB: postgres
        SOLR_COMMS: secret
    mem_limit: 65536m
    depends_on:
      - postgres
    environment:
      JAVA_TOOL_OPTIONS: "
        -Dencryption.keystore.type=JCEKS
        -Dencryption.cipherAlgorithm=DESede/CBC/PKCS5Padding
        -Dencryption.keyAlgorithm=DESede
        -Dencryption.keystore.location=/usr/local/tomcat/shared/classes/alfresco/extension/keystore/keystore
        -Dmetadata-keystore.password=mp6yc0UD9e
        -Dmetadata-keystore.aliases=metadata
        -Dmetadata-keystore.metadata.password=oKIWzVdEdA
        -Dmetadata-keystore.metadata.algorithm=DESede
        "
      JAVA_OPTS: '
        -Ddb.username=alfresco
        -Ddb.password=alfresco
        -Ddb.driver=org.postgresql.Driver
        -Ddb.url=jdbc:postgresql://postgres:5432/new_alfresco
        -Dalfresco_user_store.adminpassword=209c6174da490caeb422f3fa5a7ae644
        -Dsystem.preferred.password.encoding=bcrypt10
        -Dsolr.host=solr6
        -Dsolr.port=8983
        -Dsolr.port.ssl=8983
        -Dsolr.secureComms=secret
        -Dsolr.baseUrl=/solr
        -Dindex.subsystem.name=solr6
        -Dsolr.sharedSecret=xamw94vet9o
        -Dalfresco.host=${SERVER_NAME}
        -Dalfresco.port=8433
        -Dapi-explorer.url=https://${SERVER_NAME}:8433/api-explorer
        -Dalfresco.protocol=https
        -Dshare.host=${SERVER_NAME}
        -Dshare.port=8433
        -Dshare.protocol=https
        -Daos.baseUrlOverwrite=https://${SERVER_NAME}/alfresco/aos
        -Dmessaging.broker.url="failover:(nio://activemq:61616)?timeout=15000&jms.useCompression=true"
        -Ddeployment.method=DOCKER_COMPOSE
        -Dcsrf.filter.enabled=false
        -Dftp.enabled=true
        -Dftp.port=2121
        -Dftp.dataPortFrom=2433
        -Dftp.dataPortTo=2434
        -Dopencmis.server.override=true
        -Dopencmis.server.value=https://${SERVER_NAME}:8433
        -DlocalTransform.core-aio.url=http://transform-core-aio:8090/
        -Dcsrf.filter.enabled=false
        -Dalfresco.restApi.basicAuthScheme=true
        -Dauthentication.protection.enabled=false
        -XX:+UseG1GC
        -XX:+UseStringDeduplication
        -Dgoogledocs.enabled=true
        -XX:MinRAMPercentage=50
        -XX:MaxRAMPercentage=90
        '
    volumes:
      - /opt/alfresco/alf_data:/usr/local/tomcat/alf_data
      - ./logs/alfresco:/usr/local/tomcat/logs
    ports:
      - 2121:2121
      - 2433:2433
      - 2434:2434
  transform-core-aio:
    restart: always
    image: alfresco/alfresco-transform-core-aio:${TRANSFORM_ENGINE_TAG}
    mem_limit: 6144m
    environment:
      JAVA_OPTS: "
        -XX:MinRAMPercentage=50
        -XX:MaxRAMPercentage=80
        -Dserver.tomcat.threads.max=12
        -Dserver.tomcat.threads.min=4
        -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
        "
  share:
    restart: always
    build:
      context: ./share
      args:
        SHARE_TAG: ${SHARE_TAG}
        SERVER_NAME: ${SERVER_NAME}
    mem_limit: 2024m
    environment:
      REPO_HOST: "alfresco"
      REPO_PORT: "8080"
      CSRF_FILTER_REFERER: "https://localhost:8433/.*"
      CSRF_FILTER_ORIGIN: "https://localhost:8433"
      JAVA_OPTS: "
        -Dalfresco.context=alfresco
        -Dalfresco.protocol=https
        -XX:MinRAMPercentage=50
        -XX:MaxRAMPercentage=80
        "
    volumes:
      - ./logs/share:/usr/local/tomcat/logs
  postgres:
    restart: always
    image: postgres:${POSTGRES_TAG}
    shm_size: 2Gb
    mem_limit: 1872m
    environment:
      - POSTGRES_PASSWORD=alfresco
      - POSTGRES_USER=alfresco
      - POSTGRES_DB=new_alfresco
    command: "
      postgres
        -c max_connections=200
        -c logging_collector=on
        -c log_min_messages=LOG
        -c log_directory=/var/log/postgresql"
    ports:
      - 5432:5432
    volumes:
      - ./data/postgres-data:/var/lib/postgresql/data
      - ./logs/postgres:/var/log/postgresql
      # - /opt/alfresco/banco_qa_atual:/tmp
  solr6:
    restart: always
    build:
      context: ./search
      args:
        SEARCH_TAG: ${SEARCH_CE_TAG}
        SOLR_HOSTNAME: solr6
        ALFRESCO_HOSTNAME: alfresco
        ALFRESCO_COMMS: secret
        CROSS_LOCALE: "true"
    mem_limit: 12288m
    environment:
      #Solr needs to know how to register itself with Alfresco
      SOLR_ALFRESCO_HOST: "alfresco"
      SOLR_ALFRESCO_PORT: "8080"
      #Alfresco needs to know how to call solr
      SOLR_SOLR_HOST: "solr6"
      SOLR_SOLR_PORT: "8983"
      #Create the default alfresco and archive cores
      SOLR_CREATE_ALFRESCO_DEFAULTS: "alfresco,archive"
      SOLR_JAVA_MEM: "-Xms10240m -Xmx10240m"
      SOLR_OPTS: "
        -XX:NewSize=5120m
        -XX:MaxNewSize=5120m
        -Dalfresco.secureComms.secret=xamw94vet9o
        "
    volumes:
      - ./data/solr-data:/opt/alfresco-search-services/data
  activemq:
    restart: always
    image: alfresco/alfresco-activemq:${ACTIVEMQ_TAG}
    mem_limit: 8g
    environment:
      JAVA_OPTS: "
        -XX:MinRAMPercentage=50
        -XX:MaxRAMPercentage=80
        "
    ports:
      - 8161:8161
    volumes:
      - ./data/activemq-data:/opt/activemq/data
  content-app:
    restart: always
    image: alfresco/alfresco-content-app:${ACA_TAG}
    mem_limit: 512m
    depends_on:
      - alfresco
      - share
  # HTTP proxy to provide HTTP Default port access to services
  # SOLR API and SOLR Web Console are protected to avoid unauthenticated access
  proxy:
    restart: always
    image: nginx:stable-alpine
    mem_limit: 128m
    depends_on:
      - alfresco
      - solr6
      - share
      - content-app
    volumes:
      - ./config/nginx.conf:/etc/nginx/nginx.conf
      - ./config/nginx.htpasswd:/etc/nginx/conf.d/nginx.htpasswd
      - ./config/cert/localhost.cer:/etc/nginx/localhost.cer
      - ./config/cert/localhost.key:/etc/nginx/localhost.key
    ports:
      - 8433:8433
Is this a good approach to the problem, and how can I make it run to the end?
Thanks.
10-17-2022 03:37 AM
Using an Alfresco Action doesn't seem to be the best approach here. All the action steps are executed inside the same DB transaction, which may explain the problems you're noticing.
I'd suggest moving your logic to a Scheduled Job (https://docs.alfresco.com/content-services/latest/develop/repo-ext-points/scheduled-jobs/) that is started periodically, once for every batch of 900 nodes. That way, the number of changes held in a single DB transaction can be reduced a lot.
Another approach would be an external tool that uses the Alfresco REST API or the CMIS protocol. With this method you can run some tasks in parallel, and the overall performance could be more or less equivalent to what you're getting with the action.
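As a rough illustration of keeping each transaction small, the sketch below splits the work into fixed-size batches and hands every batch to its own transaction. `BatchRunner` and `TxRunner` are hypothetical stand-ins; inside the repository, something like `RetryingTransactionHelper.doInTransaction(callback, false, true)` could plausibly take the `TxRunner` role, but that mapping is an assumption and is not shown here.

```java
import java.util.List;
import java.util.function.Consumer;

/** Sketch of one-transaction-per-batch processing; all names are hypothetical. */
final class BatchRunner {

    /** Stand-in for whatever opens and commits a new transaction around the work. */
    interface TxRunner {
        void inNewTransaction(Runnable work);
    }

    /** Processes items in batches of batchSize and returns how many transactions were used. */
    static <T> int processInBatches(List<T> items, int batchSize, TxRunner tx, Consumer<List<T>> work) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        int batches = 0;
        for (int from = 0; from < items.size(); from += batchSize) {
            List<T> batch = items.subList(from, Math.min(from + batchSize, items.size()));
            // Each batch commits on its own, so pending changes never span more than one batch.
            tx.inNewTransaction(() -> work.accept(batch));
            batches++;
        }
        return batches;
    }
}
```

With 900 nodes per batch, the pending changes that have to be kept around at any one time are bounded by a single batch rather than by the whole run.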
10-17-2022 08:59 AM
Hi Angel,
Thanks for the idea! I had already tried some things, but not a scheduled job. I had an external shell script using the REST API, but it was too slow (with my logic), probably due to connection delays, and it seemed to get slower and slower after some hours, so I didn't go that way. I'm now running a shell script that controls the action every 900 * 10 batches with good performance. If I run into problems I'll redo this as a scheduled action.
Thanks.