Data Consistency across Microservices with Outbox Pattern

April 13, 2022

Intent

When moving to microservices architecture, one of the biggest challenges is dealing with data consistency among a swarm of microservices. Microservices are loosely coupled and separated by bounded contexts, such as a transaction through the trade lifecycle, which needs to be broken into local atomic transactions that can be executed by individual microservices within their own domain (not the other way around, i.e. a microservice knowing the internal constructs of other microservices’ databases).

In order to achieve eventual consistency within a business transaction, the microservices need to be able to:

persist data in relation to their own domain, and
publish events to the event broker upon changes to asynchronously trigger the execution of other processing units.

These are two distinct operations, often referred to as dual write. Because they are two separate operations of an atomic transaction, it became a source of consistency issues, performance dips and scalability traps. The cost of dealing with it, even with a veteran enterprise software systems solution: two-phase commit (2PC), has become increasingly unbearable, as all processes need to report back as successful (phase 1) for commit (phase 2) to happen.

Figure 1. A successful execution of two-phase commit (2PC)

The issues can be simply put: a record is created in the database, but for a number of reasons that were described elaborately in Designing Data-Intensive Applications (Martin Kleppmann, 2017, p273), the event failed to be sent to the event broker, and vice versa. Neither scenario is desirable. In fact, a simple loss of data or event publication could result in cascading failures that lead to great business losses. And it could be worse: the operation has not been reversed to reinstate the correct status or triggered reprocessing (this will be discussed in Saga Pattern).

Figure 2. Service failed to publish event as part of the transaction

As the famous CAP theorem (Gilbert and Lynch, 2002) pointed out, you can only guarantee two out of the three: consistency, availability and partition tolerance in distributed data storage.

So how can we provide a reliable mechanism to guarantee the operations happen in an atomic transaction with acceptable compromises?

Outbox Architecture

Outbox is introduced to physically facilitate the split of two operations, by writing a change event to an “Outbox” table as part of the same transaction that persisted the data, and used to guarantee the “at-least-once” delivery afterwards.

Figure 3. Persist business data and Outbox data as an atomic transaction

Every time an entity is created or modified in the database, the updated entity will also be written to the outbox table. Instead of being a typical table or database that can only be seen and modified inside of the service boundary with a predefined schema, the outbox table has the following traits:

It is a generic table that records any changes to the database, which are meant to be sent out as events to the event broker;
It is outward facing so that an independent service can monitor it for publishing;
It needs to ensure the idempotency of the entries, and also indicate the type of event with appropriate payload to construct a meaningful event.
It supports local ACID transactions by cleanly rolling back local transactions if either of the writing fails.
It can be implemented agnostic of technology with any types of databases.

To spice things up, there are two different strategies on the publishing end.

Polling Publisher

Polling Publisher is an asynchronous process that monitors the outbox table, polls events from the Outbox table, and publishes them to the event broker at a set interval.

Figure 4. Use Polling Publisher to publish Outbox events

Transaction Log Trailing

Transaction Log Trailing is the approach that tails the database transaction logs, looks for new entries in the Outbox table, transforms the entries into events and publishes them to the event broker. This approach is also known as Change Data Capture (CDC).

Unlike its poll-based counterpart, CDC can be near-real-time, and is comparatively more efficient with lower overhead.

Figure 5. Use CDC to publish Outbox events

Advantages

The advantages of the solution are:

Achieve atomicity and avoid dual-write problem
ACID transaction in local data storage
Guarantee “at-least-once” delivery of the events
The order of events are preserved

Trade Offs

Duplicate Events

“At-least-once” delivery means there could be multiple same events being sent to the downstream event broker, while this is an inconvenience from time to time, it can be solved by enforcing idempotency (Pattern: Idempotency) on both ends.

Outbox Table

The use of an outbox table in the local database can be a significant bottleneck of the system.

On top of the performance toll to be paid with the pattern, the growing number of microservices means the demand for local databases goes up. It is true in general for the microservices that do require a database on its own; it is an additional accessory for those that do not. The reliability of the system depends on the reliability of the local databases.

For better performance, outbox data should be stored in the same database as business data.

To mitigate the risk with scaling, it is important to consider session stickiness (Pattern: Scaling Services with Stateful Event Flow). It would also help to have a plan on managing and maintaining the availability of databases, and ensuring a retention policy is set on outbox tables to remove the stale data as frequently as possible.

Meng Lin's Byte-Wise Words

Data Consistency across Microservices with Outbox Pattern

Outbox Architecture

Polling Publisher

Transaction Log Trailing

Advantages

Trade Offs

Duplicate Events

Outbox Table

Comments

Post a Comment

Popular posts from this blog

How to: Add Watermark to PDFs Programmatically using iTextSharp

A practical guide to Scala Traits

A Short Guide to AWK