I’m frustrated. A major AWS service has been broken for 24 days. The Simple Notification Service (SNS) delivers messages to HTTPS subscriptions with a delay of more than 30 minutes. That issue impacts our SaaS business. But AWS has not fixed the problem yet and has not even revealed an ETA for resolving the issue.
Our SaaS business runs on a Serverless architecture. Our customers send all kinds of alarms from their AWS infrastructure to our chatbot. To do so, our solution configures CloudWatch alarms and an SNS topic within our customer’s AWS accounts. The SNS topic forwards all incoming alarms to our API Gateway by using an HTTPS subscription.
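To make the setup concrete, here is a minimal boto3 sketch of that wiring. All identifiers are placeholders (the endpoint path in particular is hypothetical); in reality, the topic and subscription are created inside the customer’s account:

```python
import boto3

sns = boto3.client('sns')

# Create the topic that receives all CloudWatch alarms (name is a placeholder).
topic_arn = sns.create_topic(Name='marbot-alarms')['TopicArn']

# Forward every incoming alarm to our API Gateway via an HTTPS subscription.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='https',
    Endpoint='https://api.marbot.io/example-path',  # hypothetical path
)
```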
On September 1st, a customer wrote in: they observed that CloudWatch alarms showed up in Slack with a delay of more than 30 minutes. A monitoring and incident management solution is kind of worthless with delayed alarms. Therefore, I started investigating immediately.
First of all, I had a look at the AWS Service Health Dashboard: “All systems operating normally.”
Next, I analyzed our log messages to track down the issue. But I could not find any delayed messages: incoming alarms were delivered to Slack within milliseconds of arriving at our API Gateway.
I suspected that SNS caused the delay. But how to investigate an issue like that? Luckily, I stumbled upon delivery status logging: an SNS topic is capable of writing delivery logs to CloudWatch Logs. The perfect way to debug a problem like that.
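Enabling delivery status logging boils down to setting a few topic attributes. A minimal boto3 sketch, assuming placeholder ARNs and an IAM role that allows SNS to write to CloudWatch Logs:

```python
import boto3

sns = boto3.client('sns')
topic_arn = 'arn:aws:sns:us-east-1:111111111111:marbot-alarms'  # placeholder
role_arn = 'arn:aws:iam::111111111111:role/SNSFeedback'         # placeholder

# Log successful HTTP/HTTPS deliveries, sampling 100% of the messages.
sns.set_topic_attributes(TopicArn=topic_arn,
                         AttributeName='HTTPSuccessFeedbackRoleArn',
                         AttributeValue=role_arn)
sns.set_topic_attributes(TopicArn=topic_arn,
                         AttributeName='HTTPSuccessFeedbackSampleRate',
                         AttributeValue='100')
# Log failed deliveries as well.
sns.set_topic_attributes(TopicArn=topic_arn,
                         AttributeName='HTTPFailureFeedbackRoleArn',
                         AttributeValue=role_arn)
```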
I found log messages similar to this one: SNS sent a message to api.marbot.io, and our API Gateway answered with status code 204. SNS tried to deliver the message once. The important information is dwellTimeMs = 2748244. It took SNS about 45 minutes to send the alarm to our backend.
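If you want to hunt for slow deliveries yourself, you can scan the delivery log group for large dwellTimeMs values. A rough sketch, assuming the log group name SNS generated for our topic (placeholders below) and the field layout I found in our logs:

```python
import json
import boto3

logs = boto3.client('logs')

# SNS names the log group after the topic; region, account id,
# and topic name are placeholders here.
log_group = 'sns/us-east-1/111111111111/marbot-alarms'

paginator = logs.get_paginator('filter_log_events')
for page in paginator.paginate(logGroupName=log_group):
    for event in page['events']:
        entry = json.loads(event['message'])
        delivery = entry.get('delivery', {})
        dwell_ms = delivery.get('dwellTimeMs', 0)
        if dwell_ms > 60000:  # sat in SNS for more than a minute
            print(dwell_ms, delivery.get('statusCode'))
```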
And this is not an outlier. The same is true for all other messages as well. I contacted AWS Support immediately. Unfortunately, the story does not end here.
After the typical back and forth between the support engineer and the service team, I was told to remove the rate limit from our HTTPS subscription. AWS announced that feature in December 2011, so I would expect it to be pretty stable, but hey. I removed the throttling policy (the parameter is named maxReceivesPerSecond), and indeed, doing so fixed the problem.
Let me clarify that. We use a rate limit of 1 message per second to avoid flooding our API Gateway in case of misconfigured alarms. Typically, only a few messages pass through the SNS topic per hour! We are far from reaching the rate limit. Nevertheless, AWS Support recommended removing the throttle policy as a workaround for the delayed messages.
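For reference, the rate limit lives in the subscription’s DeliveryPolicy. Removing it looks roughly like this with boto3; the subscription ARN is a placeholder, and the retry values are the SNS defaults restated explicitly:

```python
import json
import boto3

sns = boto3.client('sns')

# Placeholder ARN; we would have to repeat this for about 1,000 subscriptions.
subscription_arn = 'arn:aws:sns:us-east-1:111111111111:marbot-alarms:example'

# Write the delivery policy back without the throttlePolicy section.
# The part we drop is: {"throttlePolicy": {"maxReceivesPerSecond": 1}}
sns.set_subscription_attributes(
    SubscriptionArn=subscription_arn,
    AttributeName='DeliveryPolicy',
    AttributeValue=json.dumps({
        'healthyRetryPolicy': {
            'numRetries': 3,
            'minDelayTarget': 20,
            'maxDelayTarget': 20,
        }
    }),
)
```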
Fine, there is a workaround for the issue. However, implementing it is a challenge for us. We need to update about 1,000 SNS topics. And unfortunately, those SNS topics are not part of our own AWS accounts but are managed by our customers. Therefore, rolling out the recommended workaround is quite expensive.
We love our Serverless architecture. However, our business depends on AWS to operate all the involved services (SNS, API Gateway, Lambda, DynamoDB, Kinesis, Step Functions, etc.) professionally and fix any issues within hours.
Unfortunately, more than 24 days have passed, and AWS has not fixed the problem. SNS messages are still delayed. Neither AWS Support nor any AWS employee I have contacted could help with the issue. To this day, we do not even know when AWS is planning to fix the problem: “Unfortunately, I cannot provide any ETA.” The AWS Service Health Dashboard still states “All systems operating normally.” They are not!
I’ve been building on AWS for more than eight years. I have never experienced anything like this before. My trust in AWS has decreased significantly over the past month.