Cloudibn News


Lessons learned: Serverless Chatbot architecture for marbot

June 30, 2017 by cbn

marbot forwards alerts from AWS to your DevOps team via Slack. marbot was one of the winners of the AWS Serverless Chatbot Competition in 2016. Today I want to show you how marbot works and what we learned so far.

Let’s start with the architecture diagram.

[Figure: marbot architecture diagram, created with Cloudcraft]

Architecture

The marbot API is provided by an API Gateway. Most requests come from:

  • AWS SNS: Our users use SNS topics with an HTTPS subscription to transport alerts, notifications, and incidents from within their AWS accounts to marbot. At the time of writing, marbot can integrate with CloudWatch Alarms, CloudWatch Events, Budget Notifications, Auto Scaling Notifications, and Elastic Beanstalk notifications. Everything else is handled by the generic integration.
  • Non-AWS sources that make HTTPS calls like New Relic, Uptime Robot, or just curl clients.
  • Slack: marbot uses the Slack Events API to drive conversations with its users and Slack Buttons to allow users to acknowledge, close, or pass alerts.

The API Gateway forwards HTTP requests to one of our Lambda functions. All of them are implemented in Node.js and store their state in DynamoDB tables.
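As a minimal sketch of that pattern (the table name and attribute schema are assumptions for illustration, not marbot's actual data model), persisting an alert's state in DynamoDB might look like this:

```javascript
// Hypothetical sketch: persist an alert's state in a DynamoDB table.
// Table name and attributes are assumptions, not marbot's real schema.
function buildAlertItem(tableName, alertId, status) {
  return {
    TableName: tableName,
    Item: {
      alertId: {S: alertId},              // partition key
      status: {S: status},                // e.g. OPEN, ACKNOWLEDGED, CLOSED
      createdAt: {N: String(Date.now())}  // DynamoDB numbers are sent as strings
    }
  };
}

// Usage with the AWS SDK for Node.js:
// const AWS = require('aws-sdk');
// const dynamodb = new AWS.DynamoDB({apiVersion: '2012-08-10'});
// dynamodb.putItem(buildAlertItem('marbot-alerts', 'a-123', 'OPEN'), callback);
```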

One special case is the Slack Button API. When you press a button in a Slack message, marbot has 3 seconds to respond to this message. To respond to a button press, marbot may need to make a bunch of calls to the Slack API.

Learnings

Decoupling the process

Looking at our CloudWatch data, we learned that we missed the 2-second timeout very often. To stay within the timeout, we now only put a record containing all relevant data into a Kinesis stream before we respond to the API request. Writing to Kinesis is a quick operation, and we haven't seen 2-second timeouts since we switched to Kinesis streams.
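A minimal sketch of that pattern (the stream name and payload shape are assumptions): build the Kinesis record first, and respond to the API request only after putRecord succeeds.

```javascript
// Hypothetical sketch: enqueue the event to Kinesis before answering Slack.
// Stream name and payload shape are assumptions for illustration.
function buildKinesisRecord(streamName, teamId, payload) {
  return {
    StreamName: streamName,
    PartitionKey: teamId,          // keeps one team's events on the same shard, in order
    Data: JSON.stringify(payload)  // up to 1 MB per record
  };
}

// Usage inside a Lambda handler behind API Gateway:
// const AWS = require('aws-sdk');
// const kinesis = new AWS.Kinesis({apiVersion: '2013-12-02'});
// kinesis.putRecord(buildKinesisRecord('marbot-events', teamId, body), function(err) {
//   if (err) return callback(err);
//   callback(null, {statusCode: 200, body: ''}); // fast, empty acknowledgement
// });
```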

As soon as possible we read the Kinesis stream and process the records within a Lambda function. Kinesis comes with its challenges. If you fail to process a record, the Lambda Kinesis integration will retry this record until it is removed from the stream (records expire after the stream's retention period). All newer records on the shard will not be processed until the failed record expires or you fix the bug!
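To illustrate why one bad record blocks the pipeline, here is a sketch of a Kinesis-triggered handler (the record-processing logic is stubbed out as a parameter): if the callback receives an error, Lambda retries the same batch, so a poison record blocks all newer records on its shard.

```javascript
// Hypothetical sketch of a Kinesis-triggered Lambda handler.
// If the callback receives an error, Lambda retries the whole batch,
// so one poison record blocks all newer records on its shard.
function makeHandler(processRecord) {
  return function(event, context, callback) {
    try {
      event.Records.forEach(function(record) {
        // Kinesis delivers the payload base64-encoded.
        const payload = JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString());
        processRecord(payload);
      });
      callback(null); // success: Lambda advances the shard checkpoint
    } catch (err) {
      callback(err);  // failure: the batch is retried until it expires
    }
  };
}
```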

We also thought about using SQS, but:

  • there is no native SQS Lambda integration
  • we cannot build one ourselves that is serverless and responds within a second

So we decided to use Kinesis knowing that an error can stop our whole processing pipeline.

Resilient remote calls

HTTP requests are hard. A lot of things can go wrong. Two things we learned early on when talking to the Slack API:

  1. Set timeouts: We use 3 seconds at the moment and are thinking about reducing this to 2 seconds.
  2. Retry on failures like timeouts or 5XX responses.

Our Node.js implementation of Slack API calls relies on the requestretry package:

const requestretry = require('requestretry');
const AWSXRay = require('aws-xray-sdk');

function invokeSlack(method, qs, cb) {
  requestretry({
    method: 'GET',
    url: `https://slack.com/api/${method}`,
    qs: qs,
    json: true,
    maxAttempts: 3, // at most 3 attempts in total
    retryDelay: 100, // wait 0.1 seconds between two retries
    timeout: 3000, // timeout after 3 seconds
    httpModules: {
      'http:': AWSXRay.captureHTTPs(require('http')), // enable X-Ray tracing for http calls
      'https:': AWSXRay.captureHTTPs(require('https')) // enable X-Ray tracing for https calls
    }
  }, function(err, res, body) { /* ... */ });
}

The following screenshot shows an X-Ray trace where the code retried Slack API calls because of the 3-second timeout.

[Screenshot: X-Ray trace]

Implementing timers on AWS

For every alert that arrives in marbot, we keep a timer. Five minutes after the alert is received, we check if someone acknowledged it. If not, we escalate the alert to another engineer or the whole team. We decided to use SQS queues for that. When you send a message to an SQS queue, you can set a delay; only after the delay does the message become visible in the queue. Exactly what we need! The only downside to this solution is that there is no native way to connect Lambda and SQS. But with a few lines of code, you can implement this on your own.
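A sketch of such a timer (the queue URL and message shape are assumptions): SQS limits DelaySeconds to 900, so the 5-minute escalation delay fits comfortably.

```javascript
// Hypothetical sketch: a delayed SQS message as an escalation timer.
function buildTimerMessage(queueUrl, alertId, delaySeconds) {
  if (delaySeconds < 0 || delaySeconds > 900) {
    throw new Error('SQS DelaySeconds must be between 0 and 900');
  }
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({alertId: alertId}),
    DelaySeconds: delaySeconds // message stays invisible for this long
  };
}

// Usage with the AWS SDK for Node.js:
// const AWS = require('aws-sdk');
// const sqs = new AWS.SQS({apiVersion: '2012-11-05'});
// 5 minutes = 300 seconds: escalate if nobody acknowledged by then.
// sqs.sendMessage(buildTimerMessage(queueUrl, 'a-123', 300), callback);
```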

Keeping secrets secure

We use git to version our source code. To communicate with the Slack API, we need to store a secret that we use to authenticate with Slack. We keep those secrets in a JSON file that is added to git as well. But before we put the file into git, we encrypt it with KMS using the AWS CLI:

aws kms encrypt --key-id XXX --plaintext fileb://config_plain.json --output text --query CiphertextBlob | base64 --decode > config.json

Make sure to put config_plain.json into your .gitignore file!

Outside of the Lambda handler code, we use this code snippet to decrypt the configuration:

const fs = require('fs');
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const kms = new AWS.KMS({apiVersion: '2014-11-01'});

const config = new Promise(function(resolve, reject) {
  fs.readFile('config.json', function(err, data) {
    if (err) {
      reject(err);
    } else {
      kms.decrypt({CiphertextBlob: data}, function(err, data) {
        if (err) {
          reject(err);
        } else {
          try {
            resolve(JSON.parse(data.Plaintext.toString())); // Plaintext is already a Buffer
          } catch (err) {
            reject(err);
          }
        }
      });
    }
  });
});

Inside the Lambda handler code, you can access the config like this:

config
  .then(function(c) {
    // do something
  })
  .catch(function(err) {
    // handle error
  });

Using this approach, you only make one API call to KMS per Lambda container: the decryption runs once when the function is initialized, not on every invocation.

Getting insights

We use custom CloudWatch metrics to get insights into:

  • How many Slack teams installed marbot
  • Number of alerts and escalations created

We use a CloudWatch Dashboard to display those business metrics together with some technical metrics.
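Publishing such a business metric could look like this (the namespace and metric name are assumptions for illustration):

```javascript
// Hypothetical sketch: build a custom CloudWatch metric datum.
function buildMetric(namespace, metricName, value) {
  return {
    Namespace: namespace,
    MetricData: [{
      MetricName: metricName,
      Value: value,
      Unit: 'Count'
    }]
  };
}

// Usage with the AWS SDK for Node.js:
// const AWS = require('aws-sdk');
// const cloudwatch = new AWS.CloudWatch({apiVersion: '2010-08-01'});
// cloudwatch.putMetricData(buildMetric('marbot', 'AlertsCreated', 1), callback);
```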

[Screenshot: marbot dashboard]

Deploying the infrastructure

Our pipeline for deploying marbot works like this:

  1. Download dependencies (npm install)
  2. Lint code
  3. Run unit tests (we mock all external HTTP calls with nock)
  4. cloudformation package
  5. cloudformation deploy to an integration stack
  6. Run integration tests with newman
  7. cloudformation deploy to a prod stack

Jenkins runs the pipeline. Since our code is hosted on Bitbucket, we cannot easily use CodePipeline at the moment.
