EC2 is one of the fundamental services on AWS. If you are not 100% Serverless, your application health depends on the health of your EC2 instances. When I do AWS architecture reviews for our clients, I check that CPU burst capacity is monitored for EC2 instances of type t2. A widespread mistake is that credits of burstable EC2 instances (t2 family) are not monitored. If your burstable EC2 instance runs out of credits, the performance drops by 70-95%.
Let’s look at the
t2.large in more detail. Your baseline performance is 0.6 vCPU (70% performance drop) while you can burst up to 2 vCPUs as long as you have credits. But the baseline can be much worse. Let’s look at the
t2.nano, the cheapest instance type. Your baseline is 0.05 vCPU (95% performance drop) while you can burst up to 1 vCPU. The following table shows the information for all t2 instance types.
|instance type||performance drop||baseline vCPUs||maximum vCPUs|
Your t2 instance only consumes credits if your application requires more than the baseline performance. This means that you can only run out of credits if your instance is bursting. In other words, credits are only consumed if your application needs performance.
Now imagine what happens if the t2 instance is no longer able to burst and drops to the baseline from one second to the other. This is precisely what happens if you run out of burst credits. In other words, performance drops by up to 95% from one second to the other. This will have a significant impact on your application. Most likely, the application will not be responsive anymore.
First of all, your t2 instances are earning credits while not bursting. For Example, a
t2.large earns 36 credits per hour. One CPU credit is equal to one vCPU running at 100% utilization for one minute. As long as you don’t run out of credits, performance will not drop to the baseline. But how do you know? First, each t2 EC2 instance publishes the
CPUCreditBalance metric to CloudWatch. The metric reports the remaining CPU credits available. Second, you can define a CloudWatch Alarm that continuously monitors the
CPUCreditBalance metric. As soon as you run out of credits, an alert is triggered and you can react. E.g., increase capacity or reschedule work.
We found that emails are not a good way to handle alerts. In a team, multiple people are responsible. If you send an email to a group email address:
- Your team has no idea if someone already started to work on solving the issue.
- You disturb the whole team for each alert.
- It’s easy to ignore an email.
- You have no statistics about how many alerts are generated. Too many alerts are an indication that your team is no longer able to handle them.
- No help to investigate the issue is available, like links to the AWS Management Console.
To solve the problem, we built marbot: a Slack bot supporting your DevOps team to detect and solve incidents on AWS.
marbot sends alerts to a single user from the Slack channel via a direct message. If the user doesn’t acknowledge the alert within 5 minutes, marbot will escalate to the next level. Escalations minimize distraction while keeping response time low. Try marbot for free now.
We developed a CloudFormation template to monitor an EC2 instance in any region (includes
CPUCreditBalance metric monitoring). The template integrates with marbot, but you can modify it to send out emails. The template is available on GitHub for free. We also offer a version that works with a fleet of EC2 instances managed by an Auto Scaling Group.
t2 EC2 instances are cheap, but they increase complexity. You have to monitor burst credits to ensure that you will not suffer from a 95% performance drop. We usually advise not to use the t2 family in production systems that serve user traffic. But we like t2 instances for test environments and internal applications like Jenkins. If you want to stay with the low-cost t2 family, you can enable T2 Unlimited which provides a way to continue to burst without credits but with an additional charge.