Elasticsearch is a search engine built on Apache Lucene. It was originally developed by Elastic as an open-source project and has become popular for its free-form search capabilities. It can work with large volumes of structured or unstructured data, including text and numeric content. Together with Logstash (used for data ingestion) and Kibana (used for data visualization), it forms the Elastic Stack (also known as the ELK stack). Some prominent features are:
Full-text Search: Covers the metadata, content and path of files in different formats such as .docx, .pdf, .pptx and .jpg
Log Analysis
Monitoring and Management of infrastructure
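As an illustration of the full-text search feature listed above, here is a minimal sketch of a search request body written in the Elasticsearch Query DSL. The field names (name, content, path) and the search text are assumptions made for this example and not an actual Alfresco index mapping.

```python
import json


def build_full_text_query(text, fields, size=10):
    """Return a Query DSL body that searches `text` across several fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
            }
        },
    }


# Hypothetical field names standing in for file metadata, content and path.
body = build_full_text_query("quarterly report", ["name", "content", "path"])
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent to the _search endpoint of an index with any HTTP client.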
As an open source* tool, it can be set up and configured on any machine at a lower cost, although several manual steps are needed to get it working. An easier but more expensive option is a cloud-based solution such as the one from Amazon (referred to here as the AWS Managed Service), which gives us a stable and resilient setup. The AWS Managed Service for Elasticsearch provides a complete solution by hosting the data and master nodes along with multiple configuration, security and backup options. The Alfresco Performance team has primarily used AWS for performance testing and benchmarking. However, we also conducted some comparative studies of the AWS Managed Service versus a self-managed, on-premise style installation. The purpose was to understand whether there were any significant performance differences between the two deployment models, and to understand the cost/benefit ratio of using a managed service for Alfresco search. The difference between the two approaches lies in the deployment mechanism: in the first we use the AWS Managed Service to host Elasticsearch, and in the second we run Elasticsearch ourselves on EC2 instances. Let us look at both approaches in detail.
* From v7.11 onward, Elasticsearch cannot be considered fully open source, as the license now includes restrictions on its use as part of a managed service.
Benefits of AWS Managed Service for Elasticsearch
Setup: It is easy to set up using the AWS Elasticsearch service, which supports Elasticsearch versions 6.8 and 7.10. The major flexibility is that the service can be used without any code changes
Resilience: We can deploy Elasticsearch across multiple availability zones (up to 3) to improve the availability of nodes
Integration: It can be used with other in-house AWS services
Automated Snapshots: AWS Elasticsearch provides automated hourly snapshots
Auto-Tune: Auto-Tune analyzes cluster performance over time and suggests optimizations based on your workload; these can be scheduled in an off-peak window
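To make the setup and resilience points above more concrete, the sketch below builds a request body for a managed domain spread across three availability zones. The key names follow the shape of the boto3 create_elasticsearch_domain request as we understand it, so please check the current AWS API reference before relying on them. The domain name, instance types and volume size are placeholders chosen for illustration.

```python
def managed_domain_request(domain_name, data_nodes, az_count=3):
    """Build a request body for a multi-AZ managed Elasticsearch domain."""
    if az_count not in (2, 3):
        raise ValueError("zone awareness spans 2 or 3 availability zones")
    return {
        "DomainName": domain_name,
        "ElasticsearchVersion": "7.10",
        "ElasticsearchClusterConfig": {
            "InstanceType": "r5.large.elasticsearch",  # placeholder type
            "InstanceCount": data_nodes,
            "DedicatedMasterEnabled": True,
            "DedicatedMasterType": "c5.large.elasticsearch",  # placeholder type
            "DedicatedMasterCount": 3,
            "ZoneAwarenessEnabled": True,
            "ZoneAwarenessConfig": {"AvailabilityZoneCount": az_count},
        },
        "EBSOptions": {
            "EBSEnabled": True,
            "VolumeType": "gp2",
            "VolumeSize": 100,  # GiB per data node, placeholder value
        },
    }


# Assumption: with boto3 installed and credentials configured, this could be
# sent as boto3.client("es").create_elasticsearch_domain(**request).
request = managed_domain_request("alfresco-search", data_nodes=3)
print(request["ElasticsearchClusterConfig"]["ZoneAwarenessConfig"])
```

Building the request as plain data keeps the example runnable without an AWS account; only the final call needs credentials.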
Limitations of AWS Managed Service for Elasticsearch
High Cost: Charges depend on the infrastructure chosen, the region and usage. For 80 million files, the cost can be around $4000/month in the London region, assuming 24-hour usage of the services; it is driven by the data and master node instance types, the EBS volume type, and the attached IOPS and throughput
Infrastructure Limitations: Only a limited set of Elasticsearch versions (at present 6.8 and 7.10) is available, and only with certain instance types. At present AWS OpenSearch version 1.3 is also supported
Configuration Limitations: Some configuration settings (Elasticsearch version, data/master node instance type, EBS, IOPS and throughput in the cluster) are available, but we do not have full control over the Elasticsearch cluster settings. This can limit scalability for particular use cases, such as hosting Elasticsearch in multiple geographical regions
Limited Plugins: AWS supports a limited set of plugins; for example, the X-Pack features for security, alerting, monitoring, reporting and graph analytics are not supported
Oversight need: Although AWS Elasticsearch is easy to set up and reduces the operational effort of managing an Elasticsearch cluster, we still need expert knowledge to manage day-to-day operations
Benefits of AWS Self-Managed Service for Elasticsearch (using EC2 instances)
Low Cost: The AWS managed service charges are not levied and only the instance cost has to be paid. For example, self-managed can cost roughly 30% of what the managed service would cost
Elasticsearch Version availability: We can install any available version of Elasticsearch from Elastic.co or Open Distro
Infrastructure Flexibility: We can choose from a variety of instance types, tailoring the setup to usage and cost
Other Flexibility: We can set up a complex cluster that is available in multiple AWS regions and leverages non-AWS resources
Greater Control: We can control all underlying settings of Elasticsearch to cater to specific use cases
Plugin Compatibility: All available open-source plugins can be used
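To put the cost figures quoted in this article side by side, here is a small worked calculation. It uses only the two numbers given here (around $4000/month for the managed service and roughly 30% of that for self-managed) and ignores staff time and other operational costs, so treat the result as a rough indication only.

```python
MANAGED_MONTHLY_USD = 4000  # approximate managed-service cost for ~80 million files
SELF_MANAGED_RATIO = 0.30   # self-managed cost as a fraction of the managed cost


def compare_costs(managed_monthly, ratio):
    """Return the monthly self-managed cost and the resulting savings."""
    self_managed = round(managed_monthly * ratio, 2)
    monthly_saving = round(managed_monthly - self_managed, 2)
    return {
        "self_managed_monthly": self_managed,
        "monthly_saving": monthly_saving,
        "yearly_saving": round(monthly_saving * 12, 2),
    }


costs = compare_costs(MANAGED_MONTHLY_USD, SELF_MANAGED_RATIO)
print(costs)
```

With these inputs the self-managed setup comes to about $1200/month, a saving of about $2800/month in infrastructure charges.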
Limitations of AWS Self-Managed Service for Elasticsearch (using EC2 instances)
Oversight need: We may need an expert to operate it
Setup Overhead: We have to perform multiple manual steps to set up, configure and run the service
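One of the manual steps mentioned above is writing an elasticsearch.yml file for every node. The sketch below renders a minimal configuration using standard Elasticsearch 7.x setting names; the cluster name, node names and IP addresses are hypothetical, and a real deployment would need further settings (security, paths, JVM heap) that are not shown here.

```python
import json


def render_node_config(cluster_name, node_name, bind_host, seed_hosts, initial_masters):
    """Render a minimal elasticsearch.yml for one node of a 7.x cluster."""
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        f"network.host: {bind_host}",
        # A JSON list is also a valid YAML flow sequence.
        f"discovery.seed_hosts: {json.dumps(list(seed_hosts))}",
        f"cluster.initial_master_nodes: {json.dumps(list(initial_masters))}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical private IPs and node names for three EC2 instances.
hosts = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
names = ["es-node-1", "es-node-2", "es-node-3"]
config = render_node_config("alfresco-search", names[0], hosts[0], hosts, names)
print(config)
```

Generating the file from one function keeps the per-node differences (node name and bind address) explicit while the cluster-wide values stay identical on every instance.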
Conclusion: Both approaches have their pros and cons, and one can choose based on requirements and budget. For a hassle-free, stable and resilient but more expensive solution, one can choose the AWS Managed Service for Elasticsearch; for a cheaper solution with much greater control, one can go for a self-managed installation with manual setup of Elasticsearch. The cost of the self-managed approach on EC2 instances is ~30% of the cost of the AWS Managed Service. The performance of both depends on the infrastructure used, with the number of shards and data nodes, IOPS, throughput and EBS volume type as the deciding factors.
Detailed steps to set up Elasticsearch on EC2 instances can be found here
Detailed steps to set up Elasticsearch with the AWS Elasticsearch service can be found here