Working with Elasticsearch gives us the flexibility to customize it. One can choose to build the index with just the metadata of the uploaded files, or with the content of the files as well. Indexing content has time and cost implications and should therefore be chosen only when necessary. At a data volume of up to 1 billion files, even getting the metadata indexed can be a lengthy job. Here we will look at the approach that was used for indexing and the speed that was achieved with our deployment setup hosted on Amazon Web Services (AWS).
The Alfresco Elasticsearch Connector can run multiple indexing threads that work on ranges of the Node IDs of the files present in the database. With the right Elasticsearch configuration, i.e. the number of primary and replica shards and the refresh interval, we can achieve better indexing speed. Below is the complete set of infrastructure and configuration that was used.
Elasticsearch Data Node
Indexing Instance
Elasticsearch Master Node
Elasticsearch Settings
Other Components
The deployment architecture of the setup looks like this:
Setting Up Elasticsearch
First, the shards are created; how many depends on the data volume and the target size of each shard. Following AWS recommendations, each shard should be no more than 50GB, so once the total size of the data to be indexed has been estimated, the number of shards can be calculated. For 1 billion files, the estimated size of the metadata index is approximately 1.3TB; keeping each shard at around 40GB gives 32 shards. The command below can be run from the same VPC to create such an index.
curl -XPUT 'https://<Elasticsearch DNS>:443/alfresco?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 32,
    "number_of_replicas": 0
  }
}'
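Once the index exists, the shard layout can be verified before any data is loaded. A minimal check, assuming the same <Elasticsearch DNS> endpoint, uses the _cat APIs:

// To confirm that 32 primary shards were allocated and no replicas exist
curl -XGET 'https://<Elasticsearch DNS>:443/_cat/shards/alfresco?v'

// To see index health, document count and size at a glance
curl -XGET 'https://<Elasticsearch DNS>:443/_cat/indices/alfresco?v'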
Other critical parameters such as the refresh interval and the translog flush threshold can also be set. The refresh interval is the time after which indexed data becomes searchable; during indexing it should be disabled (by setting it to -1) or raised to a higher value to avoid unnecessary use of resources. The translog flush threshold is set to a larger size, e.g. 2GB, to avoid frequent flushes during indexing. Both can be set with the commands below.
// To set the refresh interval to -1, disabling it
curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "refresh_interval" : "-1" } }'

// To set the translog flush threshold to 2GB
curl -XPUT "https://<Elasticsearch DNS>:443/alfresco/_settings?pretty" -H 'Content-Type: application/json' -d '{ "index" : { "translog.flush_threshold_size" : "2GB" } }'

// To verify all the settings
curl -XGET "https://<Elasticsearch DNS>:443/alfresco/_settings?pretty"
Setting Up Re-Indexing Instance
Indexing Command
nohup java -Xmx4G -jar alfresco-elasticsearch-reindexing-3.1.1-app.jar \
  --server.port=<unique port> \
  --alfresco.reindex.jobName=reindexByIds \
  --spring.elasticsearch.rest.uris=https://<Elasticsearch DNS>:443 \
  --spring.datasource.url=jdbc:postgresql://<DB Writer URL>:5432/alfresco \
  --spring.datasource.username=**** \
  --spring.datasource.password=**** \
  --alfresco.accepted-content-media-types-cache.enabled=false \
  --spring.activemq.broker-url=failover:\(ssl://<Broker 1>:61617,ssl://<Broker 2>:61617\) \
  --spring.activemq.user=alfresco \
  --spring.activemq.password=***** \
  --alfresco.reindex.fromId=0 \
  --alfresco.reindex.toId=50000000 \
  --alfresco.reindex.multithreadedStepEnabled=true \
  --alfresco.reindex.concurrentProcessors=30 \
  --alfresco.reindex.metadataIndexingEnabled=true \
  --alfresco.reindex.contentIndexingEnabled=false \
  --alfresco.reindex.pathIndexingEnabled=true \
  --alfresco.reindex.pagesize=10000 \
  --alfresco.reindex.batchSize=1000 &
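Since reindexByIds works on a node ID window, additional re-indexer instances can be launched in parallel, each on its own port with its own fromId/toId range. Below is a sketch of a second instance covering the next 50 million node IDs; the range boundaries and the number of parallel instances are illustrative assumptions, not the exact values used in this setup.

// A second re-indexer instance covering the next node ID range on its own port;
// all other settings stay identical to the command above
nohup java -Xmx4G -jar alfresco-elasticsearch-reindexing-3.1.1-app.jar \
  --server.port=<another unique port> \
  --alfresco.reindex.jobName=reindexByIds \
  --spring.elasticsearch.rest.uris=https://<Elasticsearch DNS>:443 \
  --spring.datasource.url=jdbc:postgresql://<DB Writer URL>:5432/alfresco \
  --spring.datasource.username=**** \
  --spring.datasource.password=**** \
  --alfresco.accepted-content-media-types-cache.enabled=false \
  --spring.activemq.broker-url=failover:\(ssl://<Broker 1>:61617,ssl://<Broker 2>:61617\) \
  --spring.activemq.user=alfresco \
  --spring.activemq.password=***** \
  --alfresco.reindex.fromId=50000001 \
  --alfresco.reindex.toId=100000000 \
  --alfresco.reindex.multithreadedStepEnabled=true \
  --alfresco.reindex.concurrentProcessors=30 \
  --alfresco.reindex.metadataIndexingEnabled=true \
  --alfresco.reindex.contentIndexingEnabled=false \
  --alfresco.reindex.pathIndexingEnabled=true \
  --alfresco.reindex.pagesize=10000 \
  --alfresco.reindex.batchSize=1000 &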
Indexing Speed
The table below summarizes the indexing speeds obtained at different data volumes with identical infrastructure and configuration.
| Metadata Indexing | 20 Million | 200 Million | 300 Million | 1 Billion |
|---|---|---|---|---|
| Total Time Consumed | 40 minutes | 8 hours 9 minutes | 13 hours 43 minutes | 47 hours 30 minutes |
| Average Indexing Speed | 30 million files/hour | 24 million files/hour | 21 million files/hour | 21.2 million files/hour |
| Elasticsearch Operations | 300k-400k per minute | 200k-350k per minute | 200k-350k per minute | 200k-350k per minute |
| Data Node CPU% | 85% | 95% | 95% | 96% |
| Master Node CPU% | 10% | 15% | 15% | 15% |
| Data Node Memory% | 80% | 97% | 98% | 98% |
| RDS Thread Count | 411 | 411 | 411 | 411 |
| RDS CPU% | 5% | 5% | 5% | 5% |
| Re-Indexer CPU% | 5% | 5% | 5% | 5% |
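The Elasticsearch operation rates shown in the table can be observed while a run is in progress through the index stats and _cat APIs. A simple way to sample them, again assuming the same <Elasticsearch DNS> endpoint:

// Snapshot the cumulative index_total counter; sampling it twice, a minute apart, gives the per-minute indexing rate
curl -XGET "https://<Elasticsearch DNS>:443/alfresco/_stats/indexing?pretty"

// Check for indexing back-pressure: active, queued and rejected tasks in the write thread pool
curl -XGET "https://<Elasticsearch DNS>:443/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"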
Conclusion
Elasticsearch can be customized to cater to our needs. With the right infrastructure and configuration, one can achieve strong indexing performance. The setup described here is already well optimized and can be scaled up further to reach even higher indexing speeds.