10-14-2022 12:41 PM
Hi,
I run Alfresco ECM Community 7.2 in Docker with docker-compose in a VM (Ubuntu 22.04.1) with very good specs: 12 CPU cores and 128 GB RAM in total. This is a default 7.2 installation done with Angel Borroy's tool, and OpenJDK is the JVM. The repository has about 15 million documents, some of which have up to 20 versions. I need to clean up this system. I have an Oracle table with a column holding the nodeIds that need to be kept (almost 4.5 million), and for the nodes to keep I need to delete all versions except the first 3. The nodeIds should be the same after the process.
My algorithm is:
1) Do an AFTS query with batches of 900 nodes and keep doing this until there are no more results. The AFTS query filters out system documents, posts, links, persons, etc. The idea is to only bring documents that are relevant to the business.
2) For each node check to see if it's in the Oracle table.
3.1) If it is, check how many versions it has and delete all but the most recent 3. I also tag this document if versions are deleted.
3.2) If it is not, delete it.
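Step 3.1 can be illustrated with a minimal sketch in plain Java. It assumes the version labels arrive ordered newest first and that "the most recent 3" is the intended rule. `VersionTrimmer` and `versionsToDelete` are hypothetical names, not Alfresco APIs; the real action would read the history from the repository's version service.

```java
import java.util.ArrayList;
import java.util.List;

/** Picks which versions to delete so that only the newest `keep` remain. */
final class VersionTrimmer {

    /**
     * @param versionsNewestFirst version labels, newest first (assumed ordering)
     * @param keep                how many of the newest versions to retain
     * @return the labels that would be deleted, newest first
     */
    static List<String> versionsToDelete(List<String> versionsNewestFirst, int keep) {
        if (keep < 0) {
            throw new IllegalArgumentException("keep must be >= 0");
        }
        if (versionsNewestFirst.size() <= keep) {
            return new ArrayList<>(); // nothing to trim
        }
        // Everything after the first `keep` entries is older and can go.
        return new ArrayList<>(versionsNewestFirst.subList(keep, versionsNewestFirst.size()));
    }

    public static void main(String[] args) {
        List<String> history = List.of("1.4", "1.3", "1.2", "1.1", "1.0");
        System.out.println(versionsToDelete(history, 3)); // prints [1.1, 1.0]
    }
}
```

Returning the list instead of deleting inside the method keeps the decision testable and lets a trial run log exactly what a real run would remove.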
This processes about 60,000 documents per hour (even though the server is good, the attached disk in testing is slow).
I have it set up to do a trial run in production that only writes what would be done to a log file. Whether the action does a trial run or actually touches documents is controlled by a parameter in alfresco-global.properties.
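The overall pass, including the trial-run switch, might look roughly like the sketch below. It is plain Java under stated assumptions: the Oracle lookup is abstracted as a `Set` of ids to keep, the AFTS paging as an `Iterable` of batches, and `NodeOps` is a hypothetical stand-in for the repository calls, not an Alfresco interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class CleanupPass {

    /** Hypothetical stand-in for the repository operations; not an Alfresco API. */
    interface NodeOps {
        void trimVersions(String nodeId, int keep);
        void delete(String nodeId);
    }

    /** Walks every batch and returns one log line per decision. */
    static List<String> run(Iterable<List<String>> batches, Set<String> keepIds,
                            int keepVersions, boolean dryRun, NodeOps ops) {
        List<String> log = new ArrayList<>();
        for (List<String> batch : batches) {
            for (String nodeId : batch) {
                if (keepIds.contains(nodeId)) {
                    log.add("TRIM " + nodeId);
                    if (!dryRun) {
                        ops.trimVersions(nodeId, keepVersions);
                    }
                } else {
                    log.add("DELETE " + nodeId);
                    if (!dryRun) {
                        ops.delete(nodeId);
                    }
                }
            }
        }
        return log;
    }

    public static void main(String[] args) {
        NodeOps printing = new NodeOps() {
            public void trimVersions(String nodeId, int keep) {
                System.out.println("trimming " + nodeId + " to " + keep + " versions");
            }
            public void delete(String nodeId) {
                System.out.println("deleting " + nodeId);
            }
        };
        List<List<String>> batches = List.of(List.of("a", "b"), List.of("c"));
        // Trial run: decisions are logged, but `printing` is never called.
        System.out.println(run(batches, Set.of("a", "c"), 3, true, printing));
    }
}
```

Because the trial run and the real run share one code path and differ only in the `dryRun` guard, the log from a trial run describes the same decisions a real run would make.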
The class is written to make it easy for the garbage collector to keep the heap small. My maximum heap is 32 GB as reported by OpenJDK. I release result sets and connections as soon as I don't need them. I have very few class variables and those are private static. The class extends ActionExecuterAbstractBase.
I have a development environment (SDK 4.x for Alfresco 7.2) and a test environment which is a copy of production. Everything works great in development, both doing a trial run and deleting.
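To watch how the heap behaves from inside the action, something like the following could be logged after each batch. This is only a sketch using the standard `Runtime` API; `HeapReport` and `snapshot` are hypothetical names.

```java
/** Reports current heap usage using only the standard Runtime API. */
final class HeapReport {

    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long max = rt.maxMemory() / mb;                       // upper bound the JVM will try to use
        long used = (rt.totalMemory() - rt.freeMemory()) / mb; // currently occupied heap
        return "heap used=" + used + " MB, max=" + max + " MB";
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

A steadily rising "used" value across batches, even right after a garbage collection, would suggest that something still holds references to the processed nodes.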
The problems are:
1) Even when doing a trial run without touching the documents I notice that heap space grows up to the limit after about 20% is processed.
2) I notice that during operation, changes are not visible until the action stops (for example the tags). This makes me think that uncommitted changes are taking up the heap. Could this be possible? In JavaScript there's a document.save() method, but I haven't found an equivalent in Java.
3) I noticed rather frequent Solr time-outs and handled them by retrying, but sometimes even after 5 minutes the repo and Solr don't talk to each other.
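The retry handling mentioned in point 3 could be structured as a retry with exponential backoff. The sketch below is plain Java, not the code actually used; `Retry.withBackoff`, the attempt count and the delays are all illustrative assumptions.

```java
import java.util.concurrent.Callable;

/** Retries a call, doubling the wait between attempts. */
final class Retry {

    static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // wait before the next attempt
                    delay *= 2;          // back off: 1x, 2x, 4x, ...
                }
            }
        }
        throw last; // every attempt failed; surface the last error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated search time-out");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints ok after 3 attempts
    }
}
```

Capping the number of attempts makes a long outage fail loudly instead of stalling the whole run.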
docker-compose.yml:
version: "2"
services:
  alfresco:
    restart: always
    build:
      context: ./alfresco
      args:
        ALFRESCO_TAG: ${ALFRESCO_CE_TAG}
        DB: postgres
        SOLR_COMMS: secret
    mem_limit: 65536m
    depends_on:
      - postgres
    environment:
      JAVA_TOOL_OPTIONS: "
        -Dencryption.keystore.type=JCEKS
        -Dencryption.cipherAlgorithm=DESede/CBC/PKCS5Padding
        -Dencryption.keyAlgorithm=DESede
        -Dencryption.keystore.location=/usr/local/tomcat/shared/classes/alfresco/extension/keystore/keystore
        -Dmetadata-keystore.password=mp6yc0UD9e
        -Dmetadata-keystore.aliases=metadata
        -Dmetadata-keystore.metadata.password=oKIWzVdEdA
        -Dmetadata-keystore.metadata.algorithm=DESede
        "
      JAVA_OPTS : '
        -Ddb.username=alfresco
        -Ddb.password=alfresco
        -Ddb.driver=org.postgresql.Driver
        -Ddb.url=jdbc:postgresql://postgres:5432/new_alfresco
        -Dalfresco_user_store.adminpassword=209c6174da490caeb422f3fa5a7ae644
        -Dsystem.preferred.password.encoding=bcrypt10
        -Dsolr.host=solr6
        -Dsolr.port=8983
        -Dsolr.port.ssl=8983
        -Dsolr.secureComms=secret
        -Dsolr.baseUrl=/solr
        -Dindex.subsystem.name=solr6
        -Dsolr.sharedSecret=xamw94vet9o
        -Dalfresco.host=${SERVER_NAME}
        -Dalfresco.port=8433
        -Dapi-explorer.url=https://${SERVER_NAME}:8433/api-explorer
        -Dalfresco.protocol=https
        -Dshare.host=${SERVER_NAME}
        -Dshare.port=8433
        -Dshare.protocol=https
        -Daos.baseUrlOverwrite=https://${SERVER_NAME}/alfresco/aos
        -Dmessaging.broker.url="failover:(nio://activemq:61616)?timeout=15000&jms.useCompression=true"
        -Ddeployment.method=DOCKER_COMPOSE
        -Dcsrf.filter.enabled=false
        -Dftp.enabled=true
        -Dftp.port=2121
        -Dftp.dataPortFrom=2433
        -Dftp.dataPortTo=2434
        -Dopencmis.server.override=true
        -Dopencmis.server.value=https://${SERVER_NAME}:8433
        -DlocalTransform.core-aio.url=http://transform-core-aio:8090/
        -Dcsrf.filter.enabled=false
        -Dalfresco.restApi.basicAuthScheme=true
        -Dauthentication.protection.enabled=false
        -XX:+UseG1GC -XX:+UseStringDeduplication
        -Dgoogledocs.enabled=true
        -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=90
        '
    volumes:
      - /opt/alfresco/alf_data:/usr/local/tomcat/alf_data
      - ./logs/alfresco:/usr/local/tomcat/logs
    ports:
      - 2121:2121
      - 2433:2433
      - 2434:2434
  transform-core-aio:
    restart: always
    image: alfresco/alfresco-transform-core-aio:${TRANSFORM_ENGINE_TAG}
    mem_limit: 6144m
    environment:
      JAVA_OPTS: "
        -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
        -Dserver.tomcat.threads.max=12
        -Dserver.tomcat.threads.min=4
        -Dlogging.level.org.alfresco.transform.router.TransformerDebug=ERROR
        "
  share:
    restart: always
    build:
      context: ./share
      args:
        SHARE_TAG: ${SHARE_TAG}
        SERVER_NAME: ${SERVER_NAME}
    mem_limit: 2024m
    environment:
      REPO_HOST: "alfresco"
      REPO_PORT: "8080"
      CSRF_FILTER_REFERER: "https://localhost:8433/.*"
      CSRF_FILTER_ORIGIN: "https://localhost:8433"
      JAVA_OPTS: "
        -Dalfresco.context=alfresco
        -Dalfresco.protocol=https
        -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
        "
    volumes:
      - ./logs/share:/usr/local/tomcat/logs
  postgres:
    restart: always
    image: postgres:${POSTGRES_TAG}
    shm_size: 2Gb
    mem_limit: 1872m
    environment:
      - POSTGRES_PASSWORD=alfresco
      - POSTGRES_USER=alfresco
      - POSTGRES_DB=new_alfresco
    command: "
      postgres
      -c max_connections=200
      -c logging_collector=on
      -c log_min_messages=LOG
      -c log_directory=/var/log/postgresql"
    ports:
      - 5432:5432
    volumes:
      - ./data/postgres-data:/var/lib/postgresql/data
      - ./logs/postgres:/var/log/postgresql
      # - /opt/alfresco/banco_qa_atual:/tmp
  solr6:
    restart: always
    build:
      context: ./search
      args:
        SEARCH_TAG: ${SEARCH_CE_TAG}
        SOLR_HOSTNAME: solr6
        ALFRESCO_HOSTNAME: alfresco
        ALFRESCO_COMMS: secret
        CROSS_LOCALE: "true"
    mem_limit: 12288m
    environment:
      #Solr needs to know how to register itself with Alfresco
      SOLR_ALFRESCO_HOST: "alfresco"
      SOLR_ALFRESCO_PORT: "8080"
      #Alfresco needs to know how to call solr
      SOLR_SOLR_HOST: "solr6"
      SOLR_SOLR_PORT: "8983"
      #Create the default alfresco and archive cores
      SOLR_CREATE_ALFRESCO_DEFAULTS: "alfresco,archive"
      SOLR_JAVA_MEM: "-Xms10240m -Xmx10240m"
      SOLR_OPTS: "
        -XX:NewSize=5120m
        -XX:MaxNewSize=5120m
        -Dalfresco.secureComms.secret=xamw94vet9o
        "
    volumes:
      - ./data/solr-data:/opt/alfresco-search-services/data
  activemq:
    restart: always
    image: alfresco/alfresco-activemq:${ACTIVEMQ_TAG}
    mem_limit: 8g
    environment:
      JAVA_OPTS: "
        -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80
        "
    ports:
      - 8161:8161
    volumes:
      - ./data/activemq-data:/opt/activemq/data
  content-app:
    restart: always
    image: alfresco/alfresco-content-app:${ACA_TAG}
    mem_limit: 512m
    depends_on:
      - alfresco
      - share
  # HTTP proxy to provide HTTP Default port access to services
  # SOLR API and SOLR Web Console are protected to avoid unauthenticated access
  proxy:
    restart: always
    image: nginx:stable-alpine
    mem_limit: 128m
    depends_on:
      - alfresco
      - solr6
      - share
      - content-app
    volumes:
      - ./config/nginx.conf:/etc/nginx/nginx.conf
      - ./config/nginx.htpasswd:/etc/nginx/conf.d/nginx.htpasswd
      - ./config/cert/localhost.cer:/etc/nginx/localhost.cer
      - ./config/cert/localhost.key:/etc/nginx/localhost.key
    ports:
      - 8433:8433
Is this a good approach to the problem, and how can I make it run to the end?
Thanks.
10-17-2022 03:37 AM
Using an Alfresco Action doesn't seem to be the best approach. All the action steps are executed inside the same DB transaction, which may explain the problems you're noticing.
I'd suggest moving your logic to a Scheduled Job (https://docs.alfresco.com/content-services/latest/develop/repo-ext-points/scheduled-jobs/) that is started periodically, once for every batch of 900 nodes. That way, the number of changes held in a single DB transaction is reduced a lot.
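The idea of one transaction per batch can be sketched in plain Java as below. `TxRunner` is a hypothetical stand-in for "run this work in its own transaction and commit"; inside the repository that role would typically be played by `RetryingTransactionHelper.doInTransaction(callback, readOnly, requiresNew)` with `requiresNew` set to true, which this sketch does not call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchCommitter {

    /** Hypothetical stand-in for a transaction boundary; not an Alfresco API. */
    interface TxRunner {
        void inNewTransaction(Runnable work);
    }

    /** Splits ids into chunks and runs each chunk in its own transaction. */
    static int processInBatches(List<String> ids, int batchSize,
                                TxRunner tx, Consumer<String> perNode) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int transactions = 0;
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<String> chunk = ids.subList(from, Math.min(from + batchSize, ids.size()));
            // Changes made for this chunk are committed before the next chunk starts.
            tx.inNewTransaction(() -> chunk.forEach(perNode));
            transactions++;
        }
        return transactions;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            ids.add("node-" + i);
        }
        TxRunner fakeTx = work -> work.run(); // a real runner would begin and commit here
        int count = processInBatches(ids, 900, fakeTx, id -> { /* trim or delete */ });
        System.out.println(count + " transactions"); // prints 3 transactions
    }
}
```

With a commit after every 900 nodes, tags and deletions become visible as the run progresses, and the amount of uncommitted state held at any moment is bounded by the batch size.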
Another approach would be to use an external tool using the Alfresco REST API or the CMIS protocol. You can start some tasks in parallel when using this method and the overall performance could be more or less equivalent to the one you're getting with the action.
10-17-2022 08:59 AM
Hi Angel,
Thanks for the idea! I had already tried some things, but not a scheduled job. I had an external shell script using the REST API, but it was too slow (with my logic), probably due to connection delays, and it seemed to get slower and slower after some hours, so I didn't go that way. I'm now running a shell script that controls the action every 900 * 10 batches, with good performance. If I run into problems I'll redo this as a scheduled action.
Thanks.