Hyland Connect

karan97425 · ‎10-24-2023

Working with Elasticsearch gives us the flexibility to customize it. One can choose to get the index ready with just the metadata of uploaded files or with content of the files as well. Having content indexed will have time and cost implications and thus is chosen if only necessary. If we consider the data volume of up to 1 billion files, even getting the Metadata indexed can be a lengthy job. Here we will be looking at the approach that was used for indexing and the speed that was achieved with our deployment setup hosted over Amazon Web Service (AWS)
Alfresco Elasticsearch Connector can be used with multiple threads to perform indexing based on Node ID of the files present in the database, with right set of configurations of Elasticsearch i.e. number of primary and replica shards, refresh time we can achieved better indexing speed. Below is the complete set of infrastructure and configuration that was used.

Elasticsearch Data Node:

AWS Elasticsearch v7.10 with Availability Zone: 1-AZ
Number of Data Nodes: 3
Instance Type: r6g.2xlarge.search
Storage type: EBS
EBS volume type: Provisioned IOPS (SSD)
EBS volume size: 1000 GiB per node
Fielddata cache allocation: 20
Max clause count: 1024

Indexing Instance

EC2 Indexing Instance to run alfresco-elasticsearch-connector-distribution-3.1.1
Indexing Instance type: t2.2xlarge (8vCPUs, 32GiB RAM)
Number of EC2 Instances for Indexing: 3
Number of threads running on Instance 1 and 2 is 7 each with 6 threads on instance 3. Total threads running in parallel is 20
Maximum Heap allocated to each thread is 4GB (-Xmx4G)

Elasticsearch Master Node:

Number Of Master Nodes: 3
Master Node Instance type: m5.large.search
Master node is added for resilience, and may be avoided without having significant impact.

Elasticsearch Settings

Number of Primary shards: 32
Number of Replica Shards: 0
Refresh Time: Disabled
Translog flush threshold: 2GB

Other Components

Active MQ: mq.m4.large
RDS is used as Database with db.r5.2xlarge running PostgreSQL
EC2 Instance running ACS & Transform Service: m5a.xlarge
ACS Version used is 7.2.0

The deployment architecture of the setup looks like this:

Setting Up Elasticsearch

At first, shard creation is performed which depends on the data volume and size of each shard that has to be created. Abiding by AWS recommendations, each shard should be not more than 50GB and after estimating the total indexed data size of to be indexed files, the number of shards can be calculated. Like for 1 Billion, estimated size of metadata indexed data is approximately 1.3TB and keeping each shard at 40GB, we get number of shards as 32. Below command be be run from the same VPC to create such a setup

curl -XPUT 'https://<Elasticsearch DNS>:443/alfresco?pretty' -H 'Content-Type: application/json' -d'
{
  "settings" :{
        "number_of_shards":32,
        "number_of_replicas":0
  }
}'

Other critical parameters like refresh interval and translog flush threshold can be set. Refresh time is the time in which indexed data gets searchable and should be disabled (by passing -1) or put to a higher value during indexing to avoid unnecessary usage of resources. Translog flush threshold is set to a higher size e.g. 2GB to avoid frequent flushed during indexing. Both can be set using below commands

// To set the refresh time to -1 for disabling it
curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "refresh_interval" : "-1"  }}'
// To set flush threshold at 2GB
curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings?pretty" -H 'Content-Type: application/json' -d '{"index":{"translog.flush_threshold_size" : "2GB"}}'
//To verify all the setting, run below command
curl -XGET "https://<Elasticsearch DNS>:443/alfresco/_settings?pretty" -H 'Content-Type: application/json' -d '{ "index" : { "refresh_interval" }}'

Setting Up Re-Indexing Instance

Deploy 3 EC2 instances using configuration from table above in same VPC as all other services
Attach these EC2 instances to security group such that all incoming traffic is allowed from other services
Install Java on all the 3 instances
Copy alfresco-elasticsearch-connector-distribution-3.1.1 to 3 EC2 Instance
We were running 7 threads each on two instances and 6 on third to achieve total 20 thread count
Browse to folder where alfresco-elasticsearch-reindexing-3.1.1-app.jar is located
Run below code with following necessary modifications.
- server.port: provide unique port numbers to run required number of threads needed from an instance. For example, to run 7 threads from Instance1; copy below code 7 times providing unique port in each of 7 sets of commands. Similarly, update other parameters as below.
- Provide unique nodeID for each thread using parameter alfresco.reindex.fromId and alfresco.reindex.toId. Idea is to equally distribute total file count among the threads. In this case, 1B among 20 threads, each thread getting 50 million each. For example,
  - For Thread 1: alfresco.reindex.fromId=0 alfresco.reindex.toId=50000000
  - For Thread 2: alfresco.reindex.fromId=50000001 alfresco.reindex.toId=100000000
  - For Thread 3: alfresco.reindex.fromId=100000001 alfresco.reindex.toId=150000000
  - ......
  - For Thread 20: alfresco.reindex.fromId=950000001 alfresco.reindex.toId=1000000000
  - Running more than 20 threads at this infrastructure has not shown improvement in indexing speed.

Indexing Command

nohup java -Xmx4G -jar alfresco-elasticsearch-reindexing-3.1.1-app.jar \
--server.port=<unique port> \
--alfresco.reindex.jobName=reindexByIds \
--spring.elasticsearch.rest.uris=https://<Elasticsearch DNS>:443 \
--spring.datasource.url=jdbc:postgresql://<DB Writer URL>:5432/alfresco \
--spring.datasource.username=**** \
--spring.datasource.password=**** \
--alfresco.accepted-content-media-types-cache.enabled=false \
--spring.activemq.broker-url=failover:\(ssl://<Broker 1>:61617,ssl://<Broker 2>:61617\) \
--spring.activemq.user=alfresco \
--spring.activemq.password=***** \
--alfresco.reindex.fromId=0 \
--alfresco.reindex.toId=50000000 \
--alfresco.reindex.multithreadedStepEnabled=true \
--alfresco.reindex.concurrentProcessors=30 \
--alfresco.reindex.metadataIndexingEnabled=true \
--alfresco.reindex.contentIndexingEnabled=false \
--alfresco.reindex.pathIndexingEnabled=true \
--alfresco.reindex.pagesize=10000 \
--alfresco.reindex.batchSize=1000  &

Indexing Speed

Below table summarizes different indexing obtained with different data volumes with identical infrastructure and configurations.

Metadata Indexing	20 Millions	200 Millions	300 Millions	1 Billion
Total Time Consumed	40 minutes	8 hours 9 minutes	13 hours 43 minutes	47 hours 30 minutes
Average Indexing Speed	30 million files/hour	24 million files/hour	21 million files/hour	21.2 million files/hour
Elasticsearch Operations	300k-400k per minute	200k-350k per minute	200k-350k per minute	200k-350k per minute
Data \| Master Node CPU%	85% \| 10%	95% \| 15%	95% \| 15%	96% \| 15%
Data Node Memory %	80%	97%	98%	98%
RDS Thread Count	411	411	411	411
RDS CPU%	5%	5%	5%	5%
Re-Indexer CPU%	5%	5%	5%	5%

Conclusion

With certain tricks we can customize Elasticsearch to cater to our needs. With right set of infrastructure and configurations, one can achieve greater indexing performance. Here, we have an optimum setup which can be scaled up further to achieve higher indexing speed.

Hyland Connect

Achieving Higher Metadata Indexing Speed with Elasticsearch