Introduction
As an experienced DevOps consulting services company, ITGix is dedicated to providing innovative solutions and streamlining processes for our clients. Today, we are excited to share a use case where we partnered with a client to deploy, optimize, and scale Elasticsearch. Leveraging our expertise in Elasticsearch and our commitment to delivering reliable and scalable solutions, we aimed to address their specific needs and empower their search capabilities.
Identifying the Need for Elasticsearch
Our client recognized the importance of incorporating a robust search functionality into their application to enhance user experience and improve data retrieval. They understood that traditional database systems were not well-suited for complex full-text searches, which required features like keyword matching, relevance scoring, and faceted search.
Elasticsearch emerged as the ideal solution for their search requirements due to its powerful and flexible nature. As a distributed, open-source search and analytics engine, Elasticsearch offers a range of features specifically designed for efficient full-text searches. It leverages the Apache Lucene library, which provides advanced indexing and querying capabilities, making it highly performant even with large volumes of data.
Moreover, Elasticsearch’s distributed architecture makes it highly scalable and resilient. By distributing data across multiple nodes and shards, it can handle high query volumes and provide high availability for search operations. This scalability ensures that as the client’s application grows and data volumes increase, Elasticsearch can easily scale to meet the demands of the search functionality without compromising performance.
Deployment and Initial Challenges
In collaboration with our client, we embarked on the journey of deploying Elasticsearch for their application. However, as with any new implementation, challenges emerged. Initially, the Elasticsearch deployment was met with performance issues due to resource limitations and suboptimal configurations. The application’s demands were straining the Elasticsearch instance, impacting search functionality and responsiveness.
Optimizing Elasticsearch for Performance
In our analysis, we identified that the performance issues were mainly related to the different characteristics and sizes of the two indices in the client’s application. One index had a size of approximately 12 GB, while the other index was much larger, with a size of around 500 GB. The auto-scaling process, which involved redistributing shards across new data nodes, proved to be a time-consuming operation, leading to issues during search operations while the auto-scaling was in progress.
To address this challenge, we designed a solution that involved creating two separate Elasticsearch clusters, each dedicated to handling one of the indices. By separating the indices into different clusters, we could allocate the necessary resources and optimize the configurations based on the specific requirements of each index.
For the 12 GB index, we created a dedicated Elasticsearch cluster. This cluster was designed to handle the smaller index’s workload and enabled us to auto-scale quickly and efficiently. The allocation of resources and shard management for this cluster was optimized to provide fast and responsive search operations.
On the other hand, for the 500 GB index, we determined that auto-scaling was not necessary due to the already sufficient resources available in the cluster. With careful planning and resource allocation, we ensured that this cluster had the necessary computing power and storage capacity to handle the larger index without the need for dynamic scaling.
By separating the indices into distinct Elasticsearch clusters, we were able to overcome the performance challenges associated with auto-scaling and shard re-allocation. The dedicated cluster for the 12 GB index allowed for fast and responsive scaling, ensuring that search operations were not impacted by delays. Simultaneously, the cluster housing the 500 GB index benefited from optimized resource allocation, avoiding unnecessary scaling operations while maintaining adequate performance and stability.
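To make the difference between the two indices concrete, the following is a rough back-of-the-envelope sketch of how much data has to be copied to a newly added data node when shards are rebalanced. It assumes shards are spread evenly across nodes; the replica count and node counts are hypothetical values chosen for illustration and do not come from the client's setup.

```python
def data_moved_to_new_node_gb(primary_size_gb: float, replicas: int, nodes_after: int) -> float:
    """Estimate the data (in GB) relocated to one newly added data node.

    Assumes an even shard distribution, so the new node ends up holding
    1/nodes_after of all primary and replica data.
    """
    if primary_size_gb < 0 or replicas < 0 or nodes_after < 1:
        raise ValueError("invalid input")
    total_gb = primary_size_gb * (1 + replicas)  # primaries plus replica copies
    return total_gb / nodes_after


# Hypothetical scenario: 1 replica, cluster grows from 3 to 4 data nodes.
small_index = data_moved_to_new_node_gb(12, replicas=1, nodes_after=4)
large_index = data_moved_to_new_node_gb(500, replicas=1, nodes_after=4)
print(f"12 GB index:  ~{small_index:.0f} GB to relocate")
print(f"500 GB index: ~{large_index:.0f} GB to relocate")
```

Under these assumptions the larger index needs roughly forty times more data copied per scaling step, which is consistent with the slow shard redistribution described above.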
Optimizing Resource Allocation and Node Group Separation
By separating the Elasticsearch StatefulSets into different worker node groups, we optimized resource allocation and enhanced the overall performance of the system. This approach allowed us to distribute the workload more effectively and allocate resources based on specific operational requirements.
The master node group, consisting of dedicated master nodes, handled global configurations and coordinated the cluster operations. These nodes were responsible for tasks such as managing metadata, handling cluster state changes, and coordinating the addition or removal of nodes. By isolating these tasks to the master node group, we ensured the stability and efficient management of the Elasticsearch cluster.
The client node group, made up of coordinating-only nodes, acted as a load balancer, distributing incoming requests across the Elasticsearch cluster. These nodes were designed to handle client interactions, forwarding search and indexing requests to the appropriate data nodes and providing a seamless interface for application integration. By offloading the load balancing responsibilities to the client node group, we optimized the distribution of incoming traffic and improved the responsiveness of the Elasticsearch cluster.
The data worker node group played a vital role in handling CPU-intensive operations, such as search and indexing. By designating these nodes specifically for data-related tasks, we could allocate resources tailored to the demands of search and indexing operations. This separation ensured that the data worker nodes had ample CPU resources to efficiently execute these compute-intensive tasks, resulting in improved search performance and faster indexing operations.
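The three node groups map naturally onto Elasticsearch node roles. The sketch below renders the corresponding `node.roles` line for each group. It assumes an Elasticsearch version that supports the `node.roles` setting (7.9 or later); the group names and the helper function are illustrative and are not taken from the client's configuration.

```python
# Illustrative mapping from node group to Elasticsearch roles.
# An empty role list makes a node coordinating-only (a "client" node).
NODE_GROUP_ROLES = {
    "master": ["master"],
    "client": [],
    "data": ["data"],
}


def render_node_roles(group: str) -> str:
    """Return the elasticsearch.yml line that sets the roles for a node group."""
    roles = NODE_GROUP_ROLES[group]
    return "node.roles: [" + ", ".join(roles) + "]"


for group in NODE_GROUP_ROLES:
    print(f"{group:>6} -> {render_node_roles(group)}")
```

Keeping the mapping in one place makes it easy to see that each group carries exactly one responsibility, which is the point of the separation.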
Understanding esJavaOpts and Its Impact on Elasticsearch
In our journey to optimize Elasticsearch, we also recognized the importance of adjusting the esJavaOpts setting. esJavaOpts holds the Java Virtual Machine (JVM) options passed to Elasticsearch, which allow administrators to set, among other things, the JVM heap size. We allocated 50% of the available RAM to the JVM heap, which gives Elasticsearch enough heap for its own data structures while leaving the other half to the operating system’s filesystem cache that Lucene relies on for fast reads. This improved memory management and overall Elasticsearch performance.
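As an illustration of the 50% rule, the sketch below derives an esJavaOpts string from the memory available to a node. The function name is hypothetical, and the cap of 31 GiB is an assumption based on the common guidance to keep the heap below the JVM's compressed object pointer threshold; it is not a value from the client's deployment.

```python
def es_java_opts(memory_mib: int, heap_fraction: float = 0.5,
                 max_heap_mib: int = 31 * 1024) -> str:
    """Build JVM heap flags using a fraction of available memory.

    Minimum (-Xms) and maximum (-Xmx) heap are set to the same value so
    the heap is not resized at runtime.
    """
    if memory_mib <= 0:
        raise ValueError("memory_mib must be positive")
    heap_mib = min(int(memory_mib * heap_fraction), max_heap_mib)
    return f"-Xms{heap_mib}m -Xmx{heap_mib}m"


print(es_java_opts(8192))    # node with 8 GiB of RAM
print(es_java_opts(131072))  # node with 128 GiB of RAM, heap is capped
```

For a node with 8 GiB of RAM this yields a 4 GiB heap, leaving the remaining 4 GiB for the filesystem cache.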
Scaling Elasticsearch for Growth
As the client’s application continued to grow, scalability became a critical factor in maintaining performance. Because Elasticsearch does not provide a built-in auto-scaling mechanism, we designed a custom scaling solution: scripts that monitored CPU usage on the data worker node group and automatically deployed additional data nodes when load increased. This allowed us to adapt the cluster’s capacity to demand, so the system could handle growing workloads without compromising performance or responsiveness.
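The core of a CPU-based scaling script is a small decision function. The following is a minimal sketch of that logic, not the actual scripts used in this project: the threshold, step size, and node limits are hypothetical, and collecting the CPU metric and applying the new node count are left out.

```python
def desired_data_nodes(current: int, avg_cpu_percent: float,
                       scale_up_at: float = 75.0, step: int = 1,
                       min_nodes: int = 3, max_nodes: int = 10) -> int:
    """Return how many data nodes the cluster should run.

    Adds `step` nodes when average CPU usage reaches the threshold,
    never exceeding `max_nodes` and never dropping below `min_nodes`.
    """
    if avg_cpu_percent >= scale_up_at:
        return min(current + step, max_nodes)
    return max(current, min_nodes)


print(desired_data_nodes(current=3, avg_cpu_percent=82.0))   # above threshold
print(desired_data_nodes(current=3, avg_cpu_percent=40.0))   # below threshold
print(desired_data_nodes(current=10, avg_cpu_percent=95.0))  # already at max
```

Capping the node count protects against runaway scaling if the CPU metric stays high for reasons that extra nodes cannot fix, such as an expensive query pattern.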
Implementing Monitoring and Alerting
To ensure the stability and reliability of the Elasticsearch infrastructure, we implemented comprehensive monitoring and alerting systems. As part of our solution, we integrated Prometheus monitoring with Alertmanager, providing real-time insights into the cluster’s health and performance. Additionally, we leveraged Slack to receive notifications about critical events and alerts related to the Elasticsearch cluster.
By configuring Prometheus to collect and analyze key metrics such as resource utilization, indexing rates, and query latencies, we gained valuable visibility into the Elasticsearch cluster’s performance. This allowed us to proactively identify potential issues or bottlenecks before they impacted the client’s application.
With Alertmanager integrated into the monitoring setup, we established a flexible and customizable alerting mechanism. Alerting rules with predefined thresholds were evaluated in Prometheus, and Alertmanager grouped the resulting alerts and forwarded notifications to our dedicated Slack channel. This ensured that our team was promptly notified about any critical events and could take immediate action.
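The idea behind threshold-based alerting can be sketched in a few lines. The example below is a simplified model, not Prometheus or Alertmanager code: it fires only when a metric stays above a threshold for several consecutive samples (similar in spirit to a rule's hold duration) and builds a message in the `{"text": ...}` shape that Slack incoming webhooks accept. The alert name and values are hypothetical.

```python
from typing import Sequence


def is_firing(samples: Sequence[float], threshold: float, for_samples: int) -> bool:
    """Return True if the last `for_samples` samples all exceed the threshold."""
    if for_samples <= 0:
        raise ValueError("for_samples must be positive")
    if len(samples) < for_samples:
        return False  # not enough data to decide yet
    return all(value > threshold for value in samples[-for_samples:])


def slack_payload(alert_name: str, value: float, threshold: float) -> dict:
    """Build a Slack-style message body for a firing alert."""
    return {
        "text": f"[FIRING] {alert_name}: value {value:.1f} "
                f"exceeded threshold {threshold:.1f}"
    }


cpu_samples = [50.0, 91.0, 93.0, 95.0]  # hypothetical CPU usage in percent
if is_firing(cpu_samples, threshold=90.0, for_samples=3):
    print(slack_payload("ElasticsearchHighCPU", cpu_samples[-1], 90.0))
```

Requiring several consecutive samples above the threshold avoids notifying the team about short spikes that resolve on their own.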
Conclusion
ITGix’s comprehensive approach included adjustments to critical settings such as esJavaOpts, ensuring efficient memory management, and using custom scaling scripts to tackle Elasticsearch’s built-in scaling limitations. This proactive strategy toward scalability allowed the client’s application to effortlessly handle growing workloads while sustaining optimal performance.