
Scaling Microsoft Kaizala on Azure

September 30, 2020

This post was co-authored by Anubhav Mehendru, Group Engineering Manager, Kaizala.

Mobile-only workers depend on Microsoft Kaizala, a simple and secure work management and mobile messaging app, to get work done. Since COVID-19 forced many of us around the world to work from home, Kaizala usage has surged to close to 3x its pre-COVID-19 level. While this is a good opportunity for the product to grow, it has increased pressure on the engineering team to ensure that the service scales along with the increased usage while maintaining the customer-promised SLA of 99.99 percent.

Today, we're sharing some of our learnings about managing and scaling an enterprise-grade, secure productivity app and the backend service behind it.

Foundation of Kaizala

Kaizala is a productivity tool targeted primarily at mobile-only users and is built on a microservice architecture with Microsoft Azure as the core cloud platform. Our workload runs on Azure Cloud Services, with Azure SQL Database and Azure Blob Storage as primary storage. We use Azure Cache for Redis for caching, while Azure Service Bus and Azure Notification Hubs handle asynchronous processing of events. Azure Active Directory (Azure AD) is used for user authentication, and Azure Data Explorer and Azure Monitor for data analytics. Azure Pipelines gives us automated, safe deployments, letting us ship updates multiple times a week with high confidence.

We follow a safe deployment process that ensures minimal customer impact and a stage-wise release of new features and optimizations, with full control over exposure and the ability to roll back.

In addition, we use a centralized configuration management system through which all of our configuration can be controlled, such as exposing a new feature to a set of users, groups, or tenants. It gives us fine-grained control over message processing rate, receipt processing, user classification, priority processing, slowing down non-core functionality, and so on. This allows us to rapidly prototype a new feature or optimization over a chosen set of users.
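
As a rough illustration, the sketch below shows how such config-driven exposure might be evaluated per user. The config shape, tenant names, and the is_feature_enabled helper are illustrative assumptions, not Kaizala's actual implementation.

```python
import hashlib

# Hypothetical in-memory snapshot of the centralized config store
# (in practice this would be fetched and refreshed from the config service).
FEATURE_CONFIG = {
    "priority_message_processing": {
        "enabled_tenants": {"contoso", "fabrikam"},   # explicit tenant allowlist
        "rollout_percent": 10,                        # gradual exposure for everyone else
    },
}

def is_feature_enabled(feature: str, tenant_id: str, user_id: str) -> bool:
    """Decide exposure for one user, via a tenant allowlist or percentage rollout."""
    cfg = FEATURE_CONFIG.get(feature)
    if cfg is None:
        return False
    if tenant_id in cfg["enabled_tenants"]:
        return True
    # Deterministic bucketing: the same user always lands in the same bucket,
    # so ramping rollout_percent up or down is purely a config change.
    bucket = int(hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < cfg["rollout_percent"]

print(is_feature_enabled("priority_message_processing", "adatum", "user-42"))
```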

Key resiliency strategies

We employ the following key resilience strategies:

API rate limit

To protect the service from misuse, we need to keep the incoming calls from multiple clients within a safe limit. We incorporated a rate limiter based entirely on in-memory caching, which does the job with negligible latency impact on customer operations.
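
A minimal sketch of an in-memory rate limiter in this spirit is shown below; it uses fixed-window, per-client counters held in process memory, and the limit and window size are illustrative assumptions.

```python
import time
from collections import defaultdict
from threading import Lock

class InMemoryRateLimiter:
    """Fixed-window, per-client counters kept entirely in process memory,
    so each check adds no network hop and negligible latency."""

    def __init__(self, limit: int = 100, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self._counters = defaultdict(lambda: [0, 0.0])  # client -> [count, window_start]
        self._lock = Lock()

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        with self._lock:
            count, start = self._counters[client_id]
            if now - start >= self.window:          # window expired: start a new one
                self._counters[client_id] = [1, now]
                return True
            if count < self.limit:                  # still under the limit
                self._counters[client_id][0] = count + 1
                return True
            return False                            # throttle this call

limiter = InMemoryRateLimiter(limit=5, window_seconds=1)
print([limiter.allow("client-a") for _ in range(7)])  # last two calls are rejected
```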

Optimized caching

To provide an optimal user experience, we created a generic in-memory caching infrastructure in which multiple compute nodes can quickly sync state changes using Azure Cache for Redis pub/sub. This avoided a significant number of external API calls, which effectively reduced our SQL load.
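
The sketch below illustrates the general pattern of keeping per-node in-memory caches in sync through Redis pub/sub, assuming the redis-py client; the channel name, payload shape, and connection details are placeholders rather than Kaizala's actual infrastructure.

```python
import json
import redis

# Hypothetical channel and connection details; in production this would point
# at Azure Cache for Redis and carry richer entity metadata.
INVALIDATION_CHANNEL = "cache-invalidation"
r = redis.Redis(host="myapp.redis.cache.windows.net", port=6380, ssl=True, password="...")

local_cache: dict[str, dict] = {}   # per-node in-memory cache

def publish_change(entity_id: str) -> None:
    """Called by the node that mutates an entity: notify every other node."""
    r.publish(INVALIDATION_CHANNEL, json.dumps({"entity_id": entity_id}))

def invalidation_listener() -> None:
    """Runs on every compute node; drops stale entries as change events arrive."""
    pubsub = r.pubsub()
    pubsub.subscribe(INVALIDATION_CHANNEL)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        entity_id = json.loads(message["data"])["entity_id"]
        local_cache.pop(entity_id, None)   # next read repopulates from SQL or the API
```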

Prioritize critical operations

When the service is overloaded due to heavy customer traffic, we prioritize critical customer operations, such as messaging, over non-core operations, such as receipts.
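
A toy sketch of this prioritization follows: a dispatcher that always drains core messaging work first and only processes receipts when the service has headroom. The queues and the overload signal are illustrative assumptions.

```python
from collections import deque

# Two logical work queues: core (messages) and non-core (receipts).
message_queue: deque = deque()
receipt_queue: deque = deque()

def service_overloaded() -> bool:
    # In practice this would look at CPU, unprocessed-message depth, and so on.
    return len(message_queue) > 1000

def next_work_item():
    """Drain core messaging work first; only touch receipts when there is
    headroom, so overload never delays message delivery."""
    if message_queue:
        return message_queue.popleft()
    if receipt_queue and not service_overloaded():
        return receipt_queue.popleft()
    return None
```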

Isolation of core components

Core system components that support messaging are now fully isolated from non-core parts, so that an overload in one does not impact core messaging operations. The isolation is done at every resource level: separate compute nodes, a separate service bus for event processing, and entirely separate storage for non-core operations.

Reduction in intra-node communication

We made multiple enhancements to our message processing system that significantly reduced scenarios of intra-node communication, which had created heavy intra-node dependencies and slowed down the entire message processing pipeline.

Controlled service rollout

We made several changes to our rollout process to ensure a controlled rollout of new features and optimizations and to minimize any negative customer impact. Deployments moved to early morning slots, when customer load is minimal, to prevent any downtime.

Monitoring and telemetry

We set up dedicated monitoring dashboards that give a quick overview of service health by tracking important parameters such as CPU consumption, thread count, garbage collection (GC) rate, rate of incoming messages, unprocessed messages, lock contention rate, and connected clients.

GC rate

We have fine-tuned the options that control the rate of Gen2 GC in the cloud service, according to the needs of the web and worker instances, to ensure that GC has minimal latency impact during customer operations.

Node partitioning

Users are partitioned across multiple nodes using a consistent hashing mechanism to distribute ownership responsibility. This master ownership ensures that only the required users' information is stored in the in-memory cache on a particular node.
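
Below is a minimal consistent-hash ring of the kind described, mapping each user to an owning node; the node names and virtual-node count are illustrative, and a real system would also handle node joins, departures, and failover.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps each user to an owning node; adding or removing a node only
    remaps the users that fall in the affected arc of the ring."""

    def __init__(self, nodes: list[str], vnodes: int = 100):
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):          # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, user_id: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-0", "node-1", "node-2"])
print(ring.owner("user-12345"))   # only this node caches user-12345's state in memory
```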

Active and passive users

In large group messaging operations, there are always some users who are actively using the app while many others are not. Our idea is to prioritize message delivery for active users, so that even the last active user in the group receives the message quickly.

Serialization optimization

Default JSON serialization is costly when input/output operations are very frequent, and it burns precious CPU cycles. Protobuf offers a fast binary serialization format, which we leveraged to optimize operations on large data structures.
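
As a rough illustration, the snippet below contrasts a JSON payload with a Protobuf message. The .proto definition and the chat_message_pb2 module are hypothetical; protoc generates that module from the definition, and the actual Kaizala message schema is not shown in this post.

```python
# Hypothetical message definition (chat_message.proto), compiled with
#   protoc --python_out=. chat_message.proto
# which produces the chat_message_pb2 module imported below:
#
#   syntax = "proto3";
#   message ChatMessage {
#     string message_id   = 1;
#     string sender_id    = 2;
#     string group_id     = 3;
#     int64  sent_at_unix = 4;
#     string body         = 5;
#   }

import json
import chat_message_pb2  # generated by protoc from the definition above

msg = chat_message_pb2.ChatMessage(
    message_id="m-1", sender_id="user-42", group_id="g-7",
    sent_at_unix=1601424000, body="status update",
)

binary = msg.SerializeToString()           # compact binary wire format
as_json = json.dumps({                     # equivalent JSON payload for comparison
    "message_id": "m-1", "sender_id": "user-42", "group_id": "g-7",
    "sent_at_unix": 1601424000, "body": "status update",
})
print(len(binary), len(as_json))           # the binary form is smaller and cheaper to parse

restored = chat_message_pb2.ChatMessage()
restored.ParseFromString(binary)           # deserialize back into a typed object
```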

Scaling compute

We re-evaluated our compute usage across multiple internal test and scale environments and judiciously reduced compute node SKUs to match actual needs and optimize cost of goods sold (COGS). Most of our traffic in an Azure region arrives during the daytime, and there is minimal load at night, which we leverage for heavy tasks such as redundant data cleanup, cache cleanup, GC, database re-indexing, and compliance jobs.

Scaling storage

With increasing scale, the load from receipts on the backend service became huge and consumed a lot of storage. While critical operations require highly consistent data, the requirement is weaker for non-critical operations. We moved receipts to highly available NoSQL storage, which costs about a tenth of SQL storage.
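
As one possible shape for this, the sketch below writes receipts to Azure Table Storage using the azure-data-tables SDK. The post does not name the exact NoSQL store, so the table name, entity layout, and connection string are assumptions.

```python
from azure.data.tables import TableClient

# Connection string and table/entity names are illustrative placeholders.
table = TableClient.from_connection_string(
    conn_str="<storage-account-connection-string>", table_name="MessageReceipts"
)

def record_receipt(group_id: str, message_id: str, user_id: str, state: str) -> None:
    """Store one delivery/read receipt as a cheap NoSQL entity.
    PartitionKey groups all receipts for a message; RowKey is the receiving user."""
    table.upsert_entity({
        "PartitionKey": f"{group_id}:{message_id}",
        "RowKey": user_id,
        "state": state,            # e.g. "delivered" or "read"
    })

record_receipt("g-7", "m-1", "user-42", "read")
```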

Queries for background operations were spread out lazily to reduce the overall peak load on SQL storage. Certain non-critical operations were moved from a strongly consistent to an eventually consistent model to flatten the peak storage load, creating more capacity for additional users.

Our future plans

As the COVID-19 situation continues to be grave, we are expecting an accelerated pace of Kaizala adoption from multiple customers. To keep up with the increase in messaging load and high customer usage, we are working on new enhancements and optimizations to ensure that we remain ahead of the curve, including:

  • Developing alternative messaging flows where users actively using the app can directly pull group messages even if the backend system is overloaded. Message delivery is prioritized for active users over passive users.
  • Aggressively working on distributed in-memory caching of data entities to enable fast user response and alternative designs to keep cache in sync while minimizing stale data.
  • Moving from the current virtual machine (VM)-based deployment model to a container-based model to bring more agility and reduce operational cost.
  • Exploring alternative storage mechanisms that scale well with massive write operations for large consumer groups, supporting batched data flushes over a single connection.
  • Actively exploring ideas around active-active service configuration to minimize downtime due to data center outages and minimize Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  • Exploring ideas around moving some of the non-core functionalities to passive scale units to utilize the standby compute/storage resources there.
  • Evaluating the dynamic scaling abilities of Azure Cloud Services, where we can automatically reduce the number of compute nodes during nighttime hours when our user load is less than a fifth of the peak load.