Scaling Services with Stateful Event Flow

Intent

Modern services are designed to be stateless: a server processes each request based only on the information it carries, without relying on information from earlier requests. That is the ideal scenario, but real-life processing flows do not always comply with the principle, for example when:


  • More than one processing server is available to process an event flow that relies on knowledge of previous executions;

  • Server instances are scaled up in the middle of processing a large volume of events;

  • A sharding architecture is in use.


All of this requires a stateful approach to routing, so that API calls and events carry sufficient data with each request to ensure correct routing and application behaviour.


This is required to maintain the lineage of the flow and to prevent:


  • accidental loss of information; or worse,

  • race conditions caused by events being processed in the wrong order by different instances of a service.


The following sections describe some approaches for mitigation.

Sticky Session

Load balancing is an essential mechanism for distributing incoming traffic across multiple servers while maintaining session information. A load balancer can easily distribute traffic when requests are stateless.


Event flow processing, however, is stateful by nature: each event mutates the results of previous ones, so traffic from the same origin must be processed consistently by the same server. Therefore, aside from the session cookie the load balancer issues to the caller, the processing server also generates its own session cookie, which overrides the one from the load balancer. This approach is referred to as sticky sessions, or session stickiness.


Load balancers maintain a stickiness timeout so that traffic can still be balanced across all available scaled instances. For example, suppose all events with ID 123 are being sent to a specific server, and then events with that ID pause for 30 seconds. The next time the load balancer sees events with ID 123, it may route them to a different server, on the assumption that no events with that ID are pending processing.


Figure 1. Event Flow Processing using a Specific Server
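
As a rough illustration of this behaviour, below is a minimal Python sketch of ID-based stickiness with a timeout. The server list, timeout value, and in-memory routing table are assumptions for illustration only, not the API of any real load balancer.

```python
import time
import zlib

# Illustrative stand-ins for real load balancer configuration.
SERVERS = ["server-a", "server-b", "server-c"]
STICKINESS_TIMEOUT = 30  # seconds without traffic before the mapping expires

_sticky_table: dict[str, tuple[str, float]] = {}  # flow ID -> (server, last seen)

def route(flow_id: str) -> str:
    """Return the server that should process the next event for flow_id."""
    now = time.time()
    entry = _sticky_table.get(flow_id)
    if entry is not None and now - entry[1] < STICKINESS_TIMEOUT:
        server = entry[0]  # sticky mapping still live: reuse the same server
    else:
        # Mapping expired (e.g. a 30-second pause in events with ID 123):
        # pick a server afresh, here by hashing the flow ID.
        server = SERVERS[zlib.crc32(flow_id.encode()) % len(SERVERS)]
    _sticky_table[flow_id] = (server, now)
    return server
```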


There are a number of benefits to sticky sessions:

  • The load balancer can use a combination of a customer identifier (typically a cookie or IP address) and session information to direct traffic to the specific server that has seen traffic from the same flow earlier;

  • No change is needed on the processing servers to implement sticky sessions;

  • The approach works well when scaling up processing units.


On the flip side:

  • Traffic distribution is uneven compared with round-robin load balancing, making it hard to achieve the theoretical full utilisation of the processing servers;

  • The tight coupling of requests to a specific server makes it susceptible to DDoS attacks;

  • The approach cannot cope with the accidental loss of a processing server, or with the loss of the session information.

Out-of-Process Session

To circumvent some of the issues with sticky sessions, such as the loss of a processing server, it is better to store and handle state outside the processing servers, in a shared service. That way, whichever server the load balancer forwards requests to after an instance failure can serve successive events by picking up where the last processing unit left off.


There are several key ways to optimise query speed:

  • Minimise the payload;

  • Use key-value storage;

  • Keep storage as close to in-memory as possible, or leverage an internal cache layer.


Depending on performance requirements and platform limitations, candidates include Redis, Couchbase, and DynamoDB.
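
As a minimal sketch of the pattern, assuming Redis via the redis-py client, the state for each flow could be kept as a small JSON value under a per-flow key. The key naming, TTL, and field names here are illustrative choices, not a prescribed schema.

```python
import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL = 3600  # drop idle flow state after an hour (illustrative)

def load_state(flow_id: str) -> dict:
    raw = r.get(f"flow:{flow_id}:state")
    return json.loads(raw) if raw else {}

def save_state(flow_id: str, state: dict) -> None:
    # A minimised JSON payload keeps reads and writes fast, per the list above.
    r.set(f"flow:{flow_id}:state", json.dumps(state), ex=SESSION_TTL)

# Whichever server instance receives the next event can resume the flow:
state = load_state("123")
state["last_seq"] = state.get("last_seq", 0) + 1
save_state("123", state)
```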


Even though this approach carries extra cost and a performance penalty, it is recommended for mission-critical applications and processes for its better consistency and scalability.

Distributed Lock

Introducing shared storage offloads the problem of race conditions to the storage side: it is not hard to imagine two processing servers accessing, and in particular writing to, the storage at the same time.


To solve this tricky problem, a mutual exclusion mechanism is needed: a process acquires an exclusive lock before accessing the shared resource and releases the lock after its operations have completed. Because it has to work across distributed systems, it is called a distributed lock.


The most important features of a distributed lock are:

  • It prevents multiple processes from changing the same piece of data, ensuring the correctness and consistency of the data;

  • It prevents concurrent processes from executing the same critical section at once.


The key challenge with a distributed lock when processing event flows across multiple scaled services is the granularity of the lock, because a distributed lock essentially serialises processing. The more coarse-grained a lock is, the fewer the benefits of distribution and scaling; the more fine-grained it is across scaled services, the greater the chance of race conditions.
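
To make the trade-off concrete, here is a minimal sketch of a per-flow lock on a single Redis node, again assuming the redis-py client. This is not the full Redlock algorithm; the key names, TTL, and helper function are illustrative.

```python
import uuid
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

# Release the lock only if we still hold it (the stored token matches),
# performed atomically inside Redis via a small Lua script.
RELEASE = r.register_script("""
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
""")

def with_flow_lock(flow_id: str, work, ttl_seconds: int = 10):
    """Run `work` while holding an exclusive per-flow lock (illustrative helper)."""
    key = f"lock:flow:{flow_id}"  # fine-grained: one lock per flow
    token = str(uuid.uuid4())     # identifies this holder on release
    # SET key token NX EX ttl: acquire only if no one else holds the lock;
    # the TTL guards against a crashed holder locking the flow forever.
    if not r.set(key, token, nx=True, ex=ttl_seconds):
        raise RuntimeError(f"flow {flow_id} is locked by another worker")
    try:
        return work()
    finally:
        RELEASE(keys=[key], args=[token])
```

Locking on the flow ID is the fine-grained choice: independent flows proceed in parallel while events within one flow are serialised. Widening the key (say, a single lock shared by all flows) would make the lock coarse-grained and serialise everything, which is exactly the loss of scaling benefits described above.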


All that being said, the distributed lock is a complex concept with its own unique challenges: the implementation can be genuinely difficult even with out-of-the-box solutions such as Redlock from Redis available, and it introduces another system to maintain, with the risk that it grows into another monolith.
