AWS is a complex system that no one can understand end-to-end. As AWS professionals, we still have to deal with it daily: we design for AWS, and we debug our applications running in AWS. How can we deal with a system that is too complex to understand in detail? A good mental model can help us.
A mental model is a simple representation of the system that is good enough to predict how the system behaves. If AWS behaves differently than I predicted, I know that my mental model is not good enough, and I tweak it. Over the years, I end up with a mental model that works in most situations. Be aware that AWS is evolving quickly. Sometimes, I have to adjust my mental model to take this into account. You can build your mental model by using AWS extensively; reading the documentation, blogs, and books; listening to podcasts; and talking to peers. Today, you also have the chance to look into my head: I present my mental model of AWS to you. I also added a few exercises to challenge your mental model.
All responses from the AWS API (using CLI or SDK) are eventually consistent. A recent change might not appear in the result. Making the same request from two clients at the same time can result in different responses. There is no guarantee to read your writes.
Calls to the AWS API should be retried if they fail and are retriable.
All APIs are rate limited. When retrying, apply an exponential backoff with a random component (jitter).
CloudTrail can be used to debug failed requests.
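The retry advice above can be sketched in a few lines. This is a minimal illustration, not what the AWS SDKs do internally (they ship their own retry logic): `RetriableError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for the example.

```python
import random
import time


class RetriableError(Exception):
    """Hypothetical marker for failures that are safe to retry (e.g., throttling)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation() and retry retriable failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetriableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The upper bound doubles with every attempt (capped at max_delay);
            # the random factor keeps many clients from retrying in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(cap * rng())
```

Errors that are not retriable (anything other than `RetriableError` here) propagate immediately. The `sleep` and `rng` parameters exist only so the function can be exercised without waiting.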
Exercise: If you write a script to copy snapshots from one AWS account to another, what are your assumptions?
Most AWS resources send metrics to CloudWatch. You have to create CloudWatch Alarms to monitor those metrics. Pro tip: We offer a product to configure your AWS monitoring and manage your incidents using Slack.
An Infrastructure as Code (IaC) tool (e.g., CloudFormation) is used to create/update/delete resources and rolls back on error. A deployment pipeline (e.g., CodePipeline) invokes the IaC tool.
Exercise: Pick three AWS resources with finite resources (CPU, Memory, Disk, …) and check if you monitor them with CloudWatch Alarms.
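To make the alarm idea concrete, here is a hedged sketch of a CPU alarm for a single EC2 instance using boto3's `put_metric_alarm`. The alarm name, the 80% threshold, the period, and the number of evaluation periods are assumptions for illustration; a real alarm would also need `AlarmActions` (e.g., an SNS topic) to notify anyone.

```python
def cpu_alarm_params(instance_id, threshold_percent=80.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{instance_id}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per data point (assumed)
        "EvaluationPeriods": 3,   # consecutive breaching periods (assumed)
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_cpu_alarm(instance_id):
    """Create the alarm in the account and region of the default credentials."""
    import boto3  # imported lazily so the builder above works without the SDK
    boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(instance_id))
```

Keeping the parameters in a plain function makes it easy to review or reuse them in an IaC template instead of calling the API directly.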
Labor is expensive. Comparing costs of AWS services should take this into account (e.g., running a database on EC2 seems cheaper than RDS, but how many hours of labor does the EC2 solution require?).
Managed services from AWS are a good choice.
There are many ways to solve a problem with AWS. Know what you optimize for and design accordingly.
Exercise: Pick an infrastructure service that your team operates and calculate how many hours/month you work to maintain the solution.
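The comparison boils down to simple arithmetic: total cost is the infrastructure bill plus labor hours times an hourly rate. The sketch below uses made-up numbers purely to show the calculation; none of them are real prices, and `monthly_total_cost` is a hypothetical helper.

```python
def monthly_total_cost(infrastructure_usd, labor_hours, hourly_rate_usd):
    """Infrastructure bill plus the cost of the hours spent operating it."""
    return infrastructure_usd + labor_hours * hourly_rate_usd


# All numbers are invented for illustration; plug in your own.
self_managed = monthly_total_cost(infrastructure_usd=100, labor_hours=10, hourly_rate_usd=80)
managed = monthly_total_cost(infrastructure_usd=250, labor_hours=1, hourly_rate_usd=80)

print(self_managed, managed)  # 900 330
```

With these assumed inputs, the option with the lower infrastructure bill ends up costing more once labor is included.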
The smallest unit to reason about is the Elastic Network Interface (ENI). Internal traffic in AWS is received on or sent from an ENI. An EC2 instance comes with at least one ENI, as does an RDS instance, an ElastiCache node, and so on.
If a security group or NACL blocks a packet, Flow Logs can be used to see this (with a delay). Issues with route tables are not visible in Flow Logs.
Security groups provide enough security to control network traffic. NACLs are not needed most of the time.
Traffic inside VPCs is referenced using Security Groups (not IP addresses).
Network Load Balancers (NLB) are very different. I still have no working mental model. They are always a surprise.
Exercise: Use Flow Logs to track a request you made. How can you debug a problem that is caused by a routing table?
An IAM user is either a human or a technical user for workloads outside of AWS.
Before an IAM role can be assumed, authentication happens using an IAM user, an AWS service, or Identity Federation.
The trust policy of an IAM role can give access to the outside of the AWS account.
IAM policies manage access to resources. If the resources are in the same AWS account, the IAM policy controls access (except for KMS CMKs, where you have to allow IAM access explicitly using a resource policy, aka key policy). If the resources are from another AWS account, a resource policy in the other AWS account controls access as well.
Only some actions support resource-level permissions.
Granting least privileges is the goal. Only allow the minimum set of actions on resources.
Any IAM user/group/role with administrator access to IAM can escalate its own privileges. With IAM permission boundaries, privilege escalation can be prevented.
I still have no working mental model for IAM permission boundaries.
Service Control Policies (SCPs) can only control access on the action level. Not at the resource level. SCPs take precedence over IAM policies. A Deny can never be allowed again in the chain of SCPs and IAM policies. An IAM policy can only Allow what is allowed by the SCP.
Many services offer additional ways to authenticate/authorize (S3 bucket policies, S3 signed URLs, SQS queue policies, …), and they can give access to the outside of the AWS account.
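The interplay of SCPs and IAM policies described above can be captured in a toy evaluator. This is a deliberately simplified sketch of the mental model, not the real IAM evaluation logic: statements are plain `(effect, action)` tuples, only the wildcard `*` is supported, and resources, conditions, permission boundaries, and resource policies are ignored. `is_allowed` is a hypothetical name.

```python
def is_allowed(action, scp_statements, iam_statements):
    """Simplified evaluation: statements are (effect, action) tuples."""

    def matches(effect, statements):
        return any(e == effect and a in (action, "*") for e, a in statements)

    # An explicit Deny in either layer wins and cannot be overridden.
    if matches("Deny", scp_statements) or matches("Deny", iam_statements):
        return False
    # Otherwise both layers must allow the action.
    return matches("Allow", scp_statements) and matches("Allow", iam_statements)


scp = [("Allow", "*"), ("Deny", "iam:CreateUser")]
iam = [("Allow", "iam:CreateUser"), ("Allow", "s3:GetObject")]

print(is_allowed("iam:CreateUser", scp, iam))    # False: denied by the SCP
print(is_allowed("s3:GetObject", scp, iam))      # True: allowed by both
print(is_allowed("ec2:RunInstances", scp, iam))  # False: no IAM Allow
```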
Exercise: Pick a random IAM role and understand the attached policies in detail.
Many AWS resources do not provide stable performance. Instead, a burst mode is used to provide high performance for short periods of time. In constant and high load scenarios this can be a problem.
Many EC2 instance types come with burstable performance. Some are obvious and predictable (CPU of the t family), some are not (network bandwidth of most EC2 instances is burstable).
gp2 volumes burst using a credit system.
Load tests have to run long enough (e.g., 1 hour). Otherwise, you measure burst performance which suddenly drops significantly.
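A quick calculation shows why short load tests mislead on gp2. The sketch below assumes the gp2 figures as I recall them from the EBS documentation (3 IOPS per GiB baseline with a floor of 100 and a cap of 16,000, bursts up to 3,000 IOPS, and a full bucket of 5.4 million I/O credits); verify them against the current docs before relying on the numbers.

```python
GP2_BURST_IOPS = 3000
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket (assumed from the EBS docs)


def gp2_baseline_iops(size_gib):
    """Baseline IOPS: 3 per GiB, at least 100, at most 16,000."""
    return min(16_000, max(100, 3 * size_gib))


def gp2_burst_seconds(size_gib):
    """Seconds a full credit bucket lasts at the burst rate, or None if the volume never bursts."""
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None  # baseline is already at or above the burst rate
    # Credits drain at the burst rate and refill at the baseline rate.
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


print(gp2_burst_seconds(100))  # 2000.0
```

Under these assumptions, a 100 GiB volume sustains the burst rate for about 33 minutes from a full bucket, so a 15-minute load test would only ever see burst performance.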
Exercise: How can you monitor burstable performance?
|Service|Access|Maximum storage volume|Latency|Storage cost|Notes|
|---|---|---|---|---|---|
|S3|AWS API (SDKs, CLI), third-party tools|unlimited|high|very low|Replicated in the region|
|Glacier|S3, AWS API (SDKs, CLI), third-party tools|unlimited|extremely high|extremely low|Replicated in the region|
|EBS (SSD)|Attached to an EC2 instance via network|17.5 TB|low|low|Replicated in the AZ; magnetic disks are also available|
|EC2 Instance Store (SSD)|Attached to an EC2 instance directly|51.5 TB|very low|very low|Data is lost if instance is stopped/terminated/fails; data is not replicated; magnetic disks are also available|
|EFS|NFSv4.1, e.g., from EC2 instance or on-premises|unlimited|medium|medium|Replicated in the region; no native backup available|
|RDS (MySQL, SSD)|SQL|17.5 TB|medium|low|Use Multi-AZ to replicate to a second AZ; other engines available as well|
For AWS services, in-transit encryption can be enabled (if not enabled by default). Certificates are usually issued and managed by the Certificate Manager.
AWS services that persist data offer server-side encryption (SSE) that you usually have to enable. SSE usually uses a secret key managed by KMS.
You can delete a KMS customer managed key (CMK) that a resource (e.g., EBS volume) uses for SSE. Usage can also be passive until you want to use the resource again (e.g., you cannot restore an RDS snapshot if the CMK is deleted). Backups (e.g., RDS snapshots) should be copied to another AWS account using a KMS CMK of that other AWS account to protect against data loss caused by the deletion of the key.
Exercise: If you use KMS: what is your strategy to prevent data loss caused by (accidental) KMS CMK deletion?
AWS services are either global (e.g., Route 53, CloudFront), regional (e.g., Lambda), or zonal (e.g., EBS). EC2 (and services based on EC2) are different and rely on a single hypervisor.
CPU, RAM, local disks, network bandwidth, EBS bandwidth (only dedicated for EBS optimized instances, otherwise shared with network bandwidth) are finite resources and can be saturated. CloudWatch Alarms are needed to know when saturation happens.
Resource utilization of more than 80% can affect latency significantly.
A single EC2 instance is always at risk (e.g., hardware failure). Even the EC2 SLA does not apply to single instances.
A backup strategy is needed if data is stored on EC2 instances. Backups have to be consistent.
- On Linux, `fsfreeze -f` to flush open writes in combination with an EBS snapshot can be used to back up non-root volumes. Linux root volumes cannot be consistently backed up while the instance is running (using `fsfreeze -f` on the root volume can crash the instance). Two options: Stopping the instance and taking a snapshot or moving the valuable data out of the root volume.
- On Windows, snapshots of all volumes can be created while the instance is running.
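For the Linux case, the freeze, snapshot, unfreeze sequence could look roughly like this. It is a sketch under assumptions: it shells out to `fsfreeze` and the AWS CLI's `aws ec2 create-snapshot`, needs root privileges, and relies on the snapshot being point-in-time once the API call returns. The function name and the `run` parameter are hypothetical; `run` exists so the sequence can be tried without touching a real system.

```python
import subprocess


def snapshot_frozen_volume(mountpoint, volume_id, run=subprocess.check_call):
    """Freeze a non-root filesystem, start an EBS snapshot, and always unfreeze."""
    run(["fsfreeze", "-f", mountpoint])
    try:
        run(["aws", "ec2", "create-snapshot", "--volume-id", volume_id,
             "--description", f"consistent snapshot of {mountpoint}"])
    finally:
        # Unfreeze even if the snapshot call failed; a frozen filesystem blocks all writes.
        run(["fsfreeze", "-u", mountpoint])
```

The `finally` block is the important design choice: leaving a filesystem frozen after a failed API call would stall the application. Do not point this at the root volume.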
A group of EC2 instances managed by an Auto Scaling Group (ASG) is tolerant to EC2 instance failure and AZ failure. The EC2 SLA covers a group of instances.
Operating EC2 instances takes a lot of effort (e.g., patching, logging, deployment). Containers are much easier to operate using Fargate. Lambda might also be a good choice.
Logs should flow to a central service (e.g., CloudWatch Logs, Kibana provided by the Elasticsearch service).
Exercise: Which of your EC2 instances need a backup? How do you achieve a consistent backup?
RDS instances are based on EC2 instances and share the same resource characteristics (finite CPU, RAM, network, …).
Different engines work very differently from each other, and so do Aurora MySQL/PostgreSQL/Serverless.
Exercise: What happens if a minor/major update to your Multi-AZ engine of choice happens? There might be a downtime!
A Lambda function implementation has to be idempotent.
The order of Lambda function invocations is not the order of Lambda function executions.
A Lambda function execution can be aborted at any time in the code.
If multiple functions work together, Step Functions should be used.
Local testing of a Lambda function is possible as with any other code. You have to mock calls to the AWS API, as you would for any other external service you integrate with. An integration test against real AWS APIs is the second step after unit tests pass.
Exercise: Take one of your Lambda functions. For each line, figure out what happens if you call this line two times instead of once. Does it still work?
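One common way to make a function idempotent is to remember which events were already processed. The sketch below is an assumption-heavy illustration: the `idempotency_key` and `amount` fields are hypothetical, and the in-memory dict stands in for a durable store. In a real function, the check and the write would have to be a single atomic operation (e.g., a conditional write), because two executions can run at the same time.

```python
processed = {}  # stand-in for a durable store such as a database table
charges = []    # records side effects so duplicates become visible


def handler(event, context=None):
    """Process each event at most once, keyed by a caller-supplied ID."""
    key = event["idempotency_key"]  # hypothetical field name
    if key in processed:
        return processed[key]  # duplicate invocation: return the earlier result
    charges.append(event["amount"])  # the side effect happens only once per key
    result = {"charged": event["amount"]}
    processed[key] = result
    return result


first = handler({"idempotency_key": "order-1", "amount": 42})
second = handler({"idempotency_key": "order-1", "amount": 42})
print(first == second, len(charges))  # True 1
```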
|SNS||An SNS subscription is triggered at-least-once. Processing has to be idempotent.
HTTP(S) subscriptions are only retried in case of 5XX or timeouts. Not 429. The
Processing has to be idempotent.
|Kinesis||Order within partition key.|
|DynamoDB||Operations are atomic.
No concurrent writes to the same primary key (partition + sort key).
Multi-region tables are a different game. Your application should ensure that you do not write to the same primary key in two regions. Otherwise, the outcome is not predictable ("last" write wins).
A mental model can help you to navigate in a complex system such as AWS. I hope you improved your mental model of AWS by reading this article. I’m interested in learning about your mental models. Meet me at re:Invent or send me an email with your feedback.