Cloudibn News

Monitoring a critical part of your infrastructure: Amazon Elasticsearch domain

February 8, 2018 by cbn

I have used Elasticsearch in various projects: to add rich search functionality to applications and to collect and analyze logs with the help of Kibana. In both cases, either your users or your operators rely on the Elasticsearch infrastructure.

In one of my past projects, the team used Elasticsearch to store the logs of EC2 instances. Over time, more and more applications were moved to AWS, so the volume of logs shipped to Elasticsearch increased as well. One Sunday, the Elasticsearch cluster suddenly became unavailable, and the log shippers started throwing errors. Luckily, the log shippers were monitored, and someone was paged to look at the issue. It took some time to find out that the Elasticsearch cluster had run out of disk space. Situations like this are avoidable: monitor the available disk space, and you can react before the disks are full.

Amazon Elasticsearch Service provides Elasticsearch as a service. The fully managed service takes care of many of the challenges of operating a search engine (e.g., cluster management and patching the operating system and the search engine). But you are still responsible for some operational aspects, such as sizing and performance optimization. Therefore, you need to monitor every Elasticsearch domain that serves production workloads.

Monitoring your whole cloud infrastructure is a complex task, as Andreas pointed out in his AWS Monitoring Primer. In this blog post, I will focus on the relevant parts for monitoring your Elasticsearch domain:

  1. I guide you to the relevant monitoring services and features offered by AWS.
  2. I present best practices based on real-world client projects.
  3. I provide a CloudFormation template that implements all ideas in the post.
  4. You can use the template to monitor any Elasticsearch domain in a minute.

Let’s get started!

Identifying important CloudWatch metrics

Each Elasticsearch domain sends metrics to CloudWatch.

CloudWatch metrics expose internals of the Elasticsearch domain

The most important metrics are:

| Area | Metric | Description | Relevance |
|---|---|---|---|
| Storage | FreeStorageSpace | The free space, in megabytes, for all data nodes in the cluster. | ES throws a ClusterBlockException when this metric reaches 0. |
| CPU | CPUUtilization | The maximum percentage of CPU resources used for data nodes in the cluster. | 100% CPU utilization isn't uncommon, but sustained high averages are problematic. |
| CPU | CPUCreditBalance | The remaining CPU credits available for data nodes in the cluster (only applies to the t2 family). | If you run out of burst credits, performance will drop significantly. |
| CPU | MasterCPUUtilization | The maximum percentage of CPU resources used by the dedicated master nodes. | Because of their role in cluster stability, dedicated master nodes should have lower average usage than data nodes. |
| CPU | MasterCPUCreditBalance | The remaining CPU credits available for dedicated master nodes in the cluster (only applies to the t2 family). | If you run out of burst credits, performance will drop significantly. |
| Memory | JVMMemoryPressure | The maximum percentage of the Java heap used for all data nodes in the cluster. | The cluster could encounter out of memory errors if usage increases. |
| Memory | MasterJVMMemoryPressure | The maximum percentage of the Java heap used for all dedicated master nodes in the cluster. | Because of their role in cluster stability, dedicated master nodes should have lower average usage than data nodes. |
| Cluster | ClusterStatus.yellow | At least one replica shard is not allocated to a node. | Your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation. |
| Cluster | ClusterStatus.red | At least one primary shard is not allocated to a node. | You are missing data: searches will return partial results, and indexing into that shard will return an exception. |
| Cluster | ClusterIndexWritesBlocked | Indicates whether your cluster is accepting or blocking incoming write requests. | A value of 1 means that the cluster is blocking write requests. |
| Cluster | AutomatedSnapshotFailure | The number of failed automated snapshots for the cluster. | A value of 1 indicates that no automated snapshot was taken for the domain in the previous 36 hours. |
| Cluster | KibanaHealthyNodes | A health check for Kibana. | A value of 0 indicates that Kibana is inaccessible. |
| Cluster | KMSKeyError | Indicates whether your cluster can use the configured KMS key. | A value of 1 indicates that the KMS customer master key used to encrypt data at rest has been disabled. |
| Cluster | KMSKeyInaccessible | Indicates whether your cluster can use the configured KMS key. | A value of 1 indicates that the KMS customer master key used to encrypt data at rest has been deleted or has had its grants to Amazon ES revoked. |

Once important metrics are identified, you can use them to understand how a healthy system differs from an impacted system.
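
To look at how one of these metrics behaved in the past, you can query CloudWatch directly. The following is a minimal sketch that only builds the request parameters for the CloudWatch GetMetricStatistics API. Amazon ES publishes its metrics in the AWS/ES namespace with the dimensions DomainName and ClientId (your AWS account ID). The function name, the domain name, the account ID, and the 5-minute period are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def metric_query(domain_name, account_id, metric_name, statistic, hours=24):
    """Build keyword arguments for CloudWatch GetMetricStatistics (hypothetical helper)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint; an assumption, adjust as needed
        "Statistics": [statistic],
    }


# Placeholder domain name and account ID.
query = metric_query("my-logs-domain", "123456789012", "FreeStorageSpace", "Minimum")

# With boto3 installed and AWS credentials configured, the query could be sent like this:
#   import boto3
#   datapoints = boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]
```

Plotting the returned datapoints for a quiet week and for a week with an incident is a simple way to see where healthy and impacted behavior diverge.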

Defining thresholds

One of the hardest parts of monitoring is to define what healthy means. For each metric, you have to define a threshold between healthy and impacted. For example, you might regard CPU utilization under 80% as healthy because the application was never impacted while utilization stayed below that level. Thresholds are based on observations from the past and might need adjustment in the future.

We don’t know about the whole application here. We can only reason about one component: the search engine. Application monitoring (e.g., HTTP 5XX responses, latency, sign-ups) is a different topic.

From our experience and the AWS documentation, we usually start with the following thresholds to identify unhealthy behavior and adjust them over time.

| Area | Metric | Comparison operator | Threshold | Rationale |
|---|---|---|---|---|
| Storage | FreeStorageSpace | < | 2000 MB | 2 GB usually provides enough time to a) fix why so much space is consumed or b) add capacity. You can also change this value to 10% of your domain's storage capacity. |
| CPU | CPUUtilization | > | 80 % | Queuing theory tells us that latency rises steeply as utilization increases. In practice, we see higher latency when utilization exceeds 80% and unacceptably high latency with utilization above 90%. |
| CPU | CPUCreditBalance | < | 20 | One credit equals 1 minute of 100% usage of a vCPU. 20 credits should give you enough time to a) fix the inefficiency, b) add capacity, or c) move off the t2 instance type. |
| CPU | MasterCPUUtilization | > | 50 % | Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower average CPU usage than data nodes. |
| CPU | MasterCPUCreditBalance | < | 20 | One credit equals 1 minute of 100% usage of a vCPU. 20 credits should give you enough time to a) fix the inefficiency, b) add capacity, or c) move off the t2 instance type. |
| Memory | JVMMemoryPressure | > | 80 % | The cluster could encounter out of memory errors if usage increases. |
| Memory | MasterJVMMemoryPressure | > | 80 % | The cluster could encounter out of memory errors if usage increases. |
| Cluster | ClusterStatus.yellow | > | 0 | Your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation. |
| Cluster | ClusterStatus.red | > | 0 | You are missing data: searches will return partial results, and indexing into that shard will return an exception. |
| Cluster | ClusterIndexWritesBlocked | > | 0 | The cluster is blocking write requests. |
| Cluster | AutomatedSnapshotFailure | > | 0 | No automated snapshot was taken for the domain in the previous 36 hours. |
| Cluster | KibanaHealthyNodes | < | 1 | Kibana is inaccessible. |
| Cluster | KMSKeyError | > | 0 | The KMS customer master key used to encrypt data at rest has been disabled. |
| Cluster | KMSKeyInaccessible | > | 0 | The KMS customer master key used to encrypt data at rest has been deleted or has had its grants to Amazon ES revoked. |
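
The starting thresholds listed above can also be written down as data, which makes them easy to review and adjust over time. The following is a minimal sketch, not part of our template: the names THRESHOLDS and unhealthy_metrics and the sample values are made up for illustration.

```python
import operator

# (comparison operator, threshold) per metric, copied from the thresholds above.
THRESHOLDS = {
    "FreeStorageSpace": ("<", 2000),  # MB
    "CPUUtilization": (">", 80),  # %
    "CPUCreditBalance": ("<", 20),
    "MasterCPUUtilization": (">", 50),  # %
    "MasterCPUCreditBalance": ("<", 20),
    "JVMMemoryPressure": (">", 80),  # %
    "MasterJVMMemoryPressure": (">", 80),  # %
    "ClusterStatus.yellow": (">", 0),
    "ClusterStatus.red": (">", 0),
    "ClusterIndexWritesBlocked": (">", 0),
    "AutomatedSnapshotFailure": (">", 0),
    "KibanaHealthyNodes": ("<", 1),
    "KMSKeyError": (">", 0),
    "KMSKeyInaccessible": (">", 0),
}

COMPARE = {"<": operator.lt, ">": operator.gt}


def unhealthy_metrics(samples):
    """Return the sorted names of metrics whose value crosses its threshold."""
    crossed = []
    for name, value in samples.items():
        comparison, threshold = THRESHOLDS[name]
        if COMPARE[comparison](value, threshold):
            crossed.append(name)
    return sorted(crossed)


# Hypothetical latest values for a few metrics.
samples = {
    "FreeStorageSpace": 1500,
    "CPUUtilization": 65,
    "ClusterStatus.red": 0,
    "KibanaHealthyNodes": 1,
}
print(unhealthy_metrics(samples))  # ['FreeStorageSpace']
```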

Now you know what healthy/unhealthy means. It’s time to define CloudWatch Alarms that alert you when a metric crosses its threshold.
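
As a rough sketch of what such an alarm definition could look like, the following function builds the parameters for the CloudWatch PutMetricAlarm API for the CPUUtilization threshold of 80%. The function name, the alarm name, the statistic, the 10-minute period, and the SNS topic ARN that receives the notification are assumptions for illustration; tune them to your workload.

```python
def cpu_alarm_definition(domain_name, account_id, topic_arn):
    """Build keyword arguments for CloudWatch PutMetricAlarm (hypothetical helper)."""
    return {
        "AlarmName": f"{domain_name}-cpu-utilization-too-high",
        "Namespace": "AWS/ES",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Average",  # assumption: alert on sustained averages, not spikes
        "Period": 600,  # seconds; assumption
        "EvaluationPeriods": 1,
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 80,
        "AlarmActions": [topic_arn],  # notified when the alarm fires
    }


# Placeholder domain name, account ID, and topic ARN.
alarm = cpu_alarm_definition(
    "my-logs-domain",
    "123456789012",
    "arn:aws:sns:eu-west-1:123456789012:alerts",
)

# With boto3 installed and AWS credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```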

Observing metrics with CloudWatch Alarms and marbot

A CloudWatch Alarm continuously watches a metric. Once the threshold is reached, an action is performed that sends a message to an SNS topic. From this topic, you can then send yourself an email. We found that emails are not a good way to handle alerts. In a team, multiple people are responsible. If you send an email to a group email address:

  1. Your team has no idea if someone already started to work on solving the issue.
  2. You disturb the whole team for each alert.
  3. It’s easy to ignore an email.
  4. You have no statistics about how many alerts are generated. Too many alerts are an indication that your team is no longer able to handle them.
  5. You get no help investigating the issue, such as links to the AWS Management Console.

To solve the problem, we built marbot: a Slack chatbot that manages and escalates AWS alerts for you.

marbot forwards alerts to Slack

marbot sends alerts to a single user from the Slack channel via a direct message. If the user doesn’t acknowledge the alert within 5 minutes, marbot will escalate to the next level. Escalations minimize distraction while keeping response time low. Try marbot for free now.

CloudFormation template

We developed a CloudFormation template to monitor an Elasticsearch domain in any region. The template integrates with marbot, but you can modify it to send out emails. The template is available on GitHub for free.
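
To give an idea of the email variant, here is a heavily simplified sketch that generates a CloudFormation template with an SNS topic, an email subscription, and a single alarm for FreeStorageSpace. It is not the template from GitHub: the function name, the resource names, the parameter name, and the email address are made up for illustration, and a real setup would add one alarm per metric.

```python
import json


def monitoring_template(email):
    """Return a minimal CloudFormation template as a dict (illustrative only)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"DomainName": {"Type": "String"}},
        "Resources": {
            "Topic": {
                "Type": "AWS::SNS::Topic",
                "Properties": {
                    "Subscription": [{"Protocol": "email", "Endpoint": email}]
                },
            },
            "FreeStorageSpaceTooLowAlarm": {
                "Type": "AWS::CloudWatch::Alarm",
                "Properties": {
                    "Namespace": "AWS/ES",
                    "MetricName": "FreeStorageSpace",
                    "Dimensions": [
                        {"Name": "DomainName", "Value": {"Ref": "DomainName"}},
                        {"Name": "ClientId", "Value": {"Ref": "AWS::AccountId"}},
                    ],
                    "Statistic": "Minimum",
                    "Period": 60,
                    "EvaluationPeriods": 1,
                    "ComparisonOperator": "LessThanThreshold",
                    "Threshold": 2000,  # MB
                    # Ref on an SNS topic resolves to the topic ARN.
                    "AlarmActions": [{"Ref": "Topic"}],
                },
            },
        },
    }


template = monitoring_template("ops@example.com")
print(json.dumps(template, indent=2))
```

The printed JSON can be saved to a file and deployed like any other CloudFormation template. SNS sends a confirmation email to the address first, and alerts are only delivered after the subscription is confirmed.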

If you have already installed marbot, you can also ask marbot to monitor your Elasticsearch domain or read more detailed setup instructions. Otherwise: Try marbot for free now.

Summary

Monitoring an Elasticsearch domain requires looking at 14 different CloudWatch metrics.

CloudWatch Alarms can trigger actions. The obvious choice is to send out an email if a metric crosses its threshold, but we recommend against using email for alerts. Instead, use a tool like marbot. marbot comes with alert escalation, deduplication, and context-aware links to the AWS Management Console.
