The Architect’s Notebook

Ep #68: Eventual Consistency Implementation Guide: Quorum Reads, CRDTs, and Production Patterns | Part 2

Learn how to implement eventual consistency patterns in production: quorum reads, conflict-free replicated data types, monitoring strategies, and database-specific configurations.

Dec 18, 2025

Breaking Down Complex System Design Components

By Amit Raghuvanshi | The Architect’s Notebook
🗓️ Dec 18, 2025 · Deep Dive


Recap: The Consistency Spectrum

In Part 1, we explored why eventual consistency exists: the speed of light, CAP theorem, and real-world disasters from misunderstanding these tradeoffs.

Now, in Part 2, we’re diving into solutions. You’ll learn:

  • How to “dial” your consistency level with quorum reads

  • Conflict-free replicated data types (CRDTs) for collaborative systems

  • Production monitoring and chaos engineering techniques

  • Database-specific configuration examples

Let’s start with the most powerful tool in your arsenal: Quorum Reads.


The Philosophy of Conflict

In Part 1, we dealt with “Stale Reads.” Now we are dealing with “Concurrent Writes.”

When two people change the same data at the same time in a distributed system, who wins?

  • Does the last writer win? (Last Write Wins, or LWW)

  • Do we merge the data?

  • Do we reject the second write?

This brings us to the most dangerous misconception in distributed databases: “Last Write Wins” is not always safe.

If User A writes “Hello” and User B writes “World” at the same time, “Last Write Wins” means one of them effectively deletes the other’s work. In a Google Doc, that is unacceptable. In a banking ledger, that is illegal.
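
To make the data loss concrete, here is a minimal sketch assuming a toy model in which each write carries a wall-clock timestamp. The names `Versioned`, `lww_merge`, and `union_merge` are hypothetical and do not come from any particular database.

```python
from typing import NamedTuple


class Versioned(NamedTuple):
    timestamp: float  # wall-clock time assigned by the writer
    value: str


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last Write Wins: keep the write with the larger timestamp, drop the other."""
    return a if a.timestamp >= b.timestamp else b


def union_merge(a: set, b: set) -> set:
    """Merge alternative: keep both concurrent edits (a grow-only set)."""
    return a | b


# Two users write to different replicas at almost the same moment.
write_a = Versioned(timestamp=1000.001, value="Hello")
write_b = Versioned(timestamp=1000.002, value="World")

lww_result = lww_merge(write_a, write_b)
merged_result = union_merge({write_a.value}, {write_b.value})

print(lww_result.value)       # World  (User A's "Hello" is silently discarded)
print(sorted(merged_result))  # ['Hello', 'World']  (both edits survive)
```

Neither write was "wrong" here; LWW simply picks one by timestamp, which is why it suits overwritable values (a profile photo) far better than collaborative or financial data.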


The Danger Zones: When to NEVER Use Eventual Consistency

While “Availability” is great, there are specific domains where Eventual Consistency is dangerous.

1. Inventory Management (The “Overselling” Problem)

You have 1 iPad left in stock.

  • User A buys it. The Replica that processed the order now says “Stock: 0”.

  • User B (hitting a different Replica) sees “Stock: 1”. They buy it too.

Now you have sold 2 iPads, but you only have 1. You have to email User B and apologize. In high-volume retail (ticket sales, flash sales), this destroys trust.

Solution: You need a distributed lock or a single source of truth (Primary) for the “Checkout” button.

class InventoryManager:
    def __init__(self):
        # DistributedLock and primary_db are assumed to be provided elsewhere
        self.lock = DistributedLock()

    def purchase(self, product_id, quantity):
        # CRITICAL: Must use strong consistency
        with self.lock.acquire(f"inventory:{product_id}"):
            # Read from PRIMARY, not replica
            current_stock = primary_db.get_stock(product_id)

            if current_stock >= quantity:
                # Decrement while still holding the lock
                primary_db.decrement_stock(product_id, quantity)
                return {"success": True}
            else:
                return {"success": False, "error": "Out of stock"}
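
The same idea can be reproduced in memory. This is a sketch under stated assumptions: the `Inventory` class is hypothetical, and a `threading.Lock` stands in for the distributed lock (or the Primary's row lock). Fifty concurrent buyers race for the last iPad, and because the check and the decrement happen under one lock, exactly one of them succeeds.

```python
import threading


class Inventory:
    """Toy single source of truth; the lock stands in for the primary's row lock."""

    def __init__(self, stock: int):
        self._stock = stock
        self._lock = threading.Lock()

    def purchase(self, quantity: int = 1) -> bool:
        # Check and decrement happen under one lock,
        # so no buyer can act on a stale count.
        with self._lock:
            if self._stock >= quantity:
                self._stock -= quantity
                return True
            return False

    @property
    def stock(self) -> int:
        return self._stock


inventory = Inventory(stock=1)  # one iPad left
results = []


def buyer():
    results.append(inventory.purchase())


threads = [threading.Thread(target=buyer) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), inventory.stock)  # 1 0
```

The design choice is that the stock check is never separated from the write. A buyer who read “Stock: 1” from a replica a moment earlier still gets rejected at checkout.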

Real-World Impact: Daily-deal and flash-sale platforms are especially exposed. When a deal goes viral, an eventually consistent inventory count can let thousands of people buy an item that is already out of stock, leaving the seller to honor the orders at a loss or cancel them and lose customer trust.

2. Financial Transactions (The “Double Spend” Problem)

You have $100 in your account.

  • You transfer $100 to Mom.

  • Before the balance updates on all nodes, you withdraw $100 from an ATM.

If the ATM checks a stale Replica, it gives you the money. You just stole $100.

Solution: Money always requires Strong Consistency. Banks use strict transaction isolation levels (Serializable) and ACID-compliant databases. They would rather the ATM say “Service Unavailable” than give away free money.

-- Banks use SERIALIZABLE isolation (PostgreSQL syntax)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Check balance (locks the row)
SELECT balance FROM accounts WHERE user_id = 123 FOR UPDATE;

-- Deduct only if the balance is sufficient (0 rows updated means insufficient funds)
UPDATE accounts SET balance = balance - 100 WHERE user_id = 123 AND balance >= 100;

COMMIT;
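
Here is a runnable sketch of the same guard from application code. It uses Python's built-in `sqlite3` purely as a single-node stand-in (an assumption for illustration, not what a bank runs), and the `debit` helper is hypothetical. The key point is that the balance check and the deduction are one statement, so a second withdrawal against the same $100 updates zero rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.execute("INSERT INTO accounts VALUES (123, 100)")
conn.commit()


def debit(conn: sqlite3.Connection, user_id: int, amount: int) -> bool:
    """Deduct `amount` only if the balance covers it; check and write are one statement."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE user_id = ? AND balance >= ?",
            (amount, user_id, amount),
        )
        return cur.rowcount == 1


first = debit(conn, 123, 100)   # transfer to Mom
second = debit(conn, 123, 100)  # ATM withdrawal against the same balance
balance = conn.execute(
    "SELECT balance FROM accounts WHERE user_id = 123"
).fetchone()[0]

print(first, second, balance)  # True False 0
```

Checking `rowcount` is what turns “insufficient funds” into an explicit failure the caller can act on, rather than a negative balance discovered later.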

Real-World Disaster: In the 2016 Bangladesh Bank heist, attackers sent fraudulent payment instructions over the SWIFT network and exploited timing windows (a weekend and holidays across time zones) in an attempt to steal nearly $1 billion. While not an eventual consistency issue in itself, it highlighted how costly it is when financial systems act on state that nobody has verified yet.

3. Authentication & Authorization (The “Revoked Access” Problem)

  • Admin revokes User X’s access to the system at 10:00 AM.

  • User X’s session token is checked against a stale replica at 10:01 AM.

  • Replica says “Token valid” (it hasn’t synced yet).

  • User X accesses confidential data after being fired.

Solution: For security-critical decisions, always check the Primary or use short-lived tokens with distributed caches.

import json

def check_access(user_id, resource_id):
    # NEVER check stale replicas for permissions.
    # Option: front the primary with a short-TTL cache. `redis` is assumed to be
    # a redis-py client and `primary_db` a handle to the primary database.
    key = f"permissions:{user_id}"
    cached = redis.get(key)

    if cached is None:
        # Cache miss, read from primary and cache for 60 seconds
        permissions = primary_db.get_permissions(user_id)
        redis.setex(key, 60, json.dumps(list(permissions)))
    else:
        permissions = json.loads(cached)

    # A revocation can go unnoticed for up to the TTL unless the
    # revoke path also deletes this key.
    return resource_id in permissions
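
The “short-lived tokens” half of the solution can be sketched with the standard library alone. This is an illustration under assumptions, not a production auth scheme: `SECRET_KEY`, `TOKEN_TTL_SECONDS`, `issue_token`, and `verify_token` are hypothetical names, and the 5-minute lifetime is an arbitrary choice. The token carries its own expiry and an HMAC signature, so any node can reject it after the deadline without consulting a replica.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"demo-secret-change-me"  # hypothetical; load from a secret store
TOKEN_TTL_SECONDS = 300                # assumption: a 5-minute lifetime


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id: str, now: Optional[float] = None) -> str:
    """Return base64('user_id:expiry:signature')."""
    issued_at = time.time() if now is None else now
    payload = f"{user_id}:{int(issued_at) + TOKEN_TTL_SECONDS}"
    return base64.urlsafe_b64encode(f"{payload}:{_sign(payload)}".encode()).decode()


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user_id if the token is authentic and unexpired, else None."""
    current = time.time() if now is None else now
    try:
        decoded = base64.urlsafe_b64decode(token.encode()).decode()
        user_id, expires_at, signature = decoded.rsplit(":", 2)
        expiry = int(expires_at)
    except (ValueError, UnicodeDecodeError):
        return None  # malformed token
    expected = _sign(f"{user_id}:{expires_at}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None  # forged or tampered token
    if current >= expiry:
        return None  # expired: the user must re-authenticate against the primary
    return user_id


token = issue_token("user-x", now=1_000_000)
print(verify_token(token, now=1_000_060))  # user-x
print(verify_token(token, now=1_000_301))  # None
```

The trade-off is explicit: a revoked user keeps access for at most the token lifetime, so the TTL is effectively your maximum tolerated staleness for this decision.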

4. Auctions & Bidding (The “Phantom Bid” Problem)

  • Current highest bid: $100

  • User A bids $150 (writes to Primary)

  • User B sees a stale replica showing highest bid: $100

  • User B bids $120, thinking they’ll win

  • User B loses, complains the system “changed the rules”

Solution: Auction systems must show real-time data and validate every bid against the Primary. Use WebSockets fed from the Primary, or very short cache TTLs.
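
Whatever the screen shows, the bid itself can be checked against authoritative state at write time. A minimal sketch, assuming a hypothetical in-memory `Auction` class where a `threading.Lock` stands in for the Primary's concurrency control:

```python
import threading


class Auction:
    """Toy authoritative auction state; stands in for the primary."""

    def __init__(self, starting_bid: int):
        self.highest_bid = starting_bid
        self.highest_bidder = None
        self._lock = threading.Lock()

    def place_bid(self, bidder: str, amount: int) -> dict:
        # Validate against authoritative state, never against what the client saw.
        with self._lock:
            if amount <= self.highest_bid:
                return {"accepted": False, "current_highest": self.highest_bid}
            self.highest_bid = amount
            self.highest_bidder = bidder
            return {"accepted": True, "current_highest": amount}


auction = Auction(starting_bid=100)
result_a = auction.place_bid("user_a", 150)
result_b = auction.place_bid("user_b", 120)  # user_b was looking at a stale $100

print(result_a)  # {'accepted': True, 'current_highest': 150}
print(result_b)  # {'accepted': False, 'current_highest': 150}
```

Returning the current highest bid in the rejection lets the UI explain what happened immediately, instead of leaving User B to discover it later.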



5. Compliance & Audit Logs (The “Legal Evidence” Problem)

  • User deletes incriminating data at 3:00 PM

  • Investigators query a replica at 3:05 PM

  • Replica still shows the deleted data (hasn’t synced)

  • Legal team builds a case on data that “doesn’t exist” on the primary

Solution: Audit logs and compliance data must be written to strongly consistent, append-only storage with write-ahead logging.

import hashlib
import time

class AuditLogger:
    def log_action(self, user_id, action, data):
        # Write to strongly consistent, append-only log
        # NEVER allow deletes or updates
        log_entry = {
            'timestamp': time.time(),
            'user_id': user_id,
            'action': action,
            'data': data,
            # data is assumed to be a str; the checksum exposes later tampering
            'checksum': hashlib.sha256(data.encode()).hexdigest()
        }

        # Write to primary with SYNC replication (primary_db is assumed to exist)
        primary_db.append_log(log_entry, sync=True)

So, how do we solve this? How do we build systems that are fast most of the time, but strict some of the time?

The Decision Matrix


Conclusion: Embrace the Chaos

Eventual Consistency is not a bug. It is a fundamental law of the universe that applies as soon as you have two computers talking to each other.

As an architect, your job is not to fight it, but to manage it.

  1. Identify the Danger Zones: Billing and Inventory must use Quorums or Transactions.

  2. Relax the Rest: Likes, Comments, and Feeds should be Eventually Consistent to keep the app fast.

  3. Test for Failure: Use chaos engineering to ensure your UI doesn’t crash when data is stale.
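
The first two points can be written down as an explicit policy instead of tribal knowledge. The sketch below is an assumption-heavy illustration: the `POLICY` table and `choose_target` helper are hypothetical, and the domain names simply mirror the examples in this article. Unknown domains default to the strict path, on the theory that an unnecessary trip to the Primary is cheaper than an oversold iPad.

```python
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"      # read from the primary or use a transaction
    EVENTUAL = "eventual"  # any replica is fine


# Hypothetical policy table; domain names follow the article's examples.
POLICY = {
    "billing": Consistency.STRONG,
    "inventory": Consistency.STRONG,
    "permissions": Consistency.STRONG,
    "likes": Consistency.EVENTUAL,
    "comments": Consistency.EVENTUAL,
    "feed": Consistency.EVENTUAL,
}


def choose_target(domain: str) -> str:
    """Return 'primary' or 'replica'; unknown domains default to the safe choice."""
    level = POLICY.get(domain, Consistency.STRONG)
    return "primary" if level is Consistency.STRONG else "replica"


print(choose_target("inventory"))  # primary
print(choose_target("likes"))      # replica
print(choose_target("new-thing"))  # primary
```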

Don’t let the marketing term fool you. “Eventually” is a long time. Make sure your application can handle the wait.


🔒 Subscribe to Stop Data Corruption

We have identified the risks. Now let’s engineer the solutions.

This isn’t just theory. We are going to look at the mathematical tuning knobs of NoSQL databases (Cassandra/DynamoDB) and the advanced data structures that power collaborative tools like Google Docs.

In this implementation guide, we cover:

  • The Quorum Formula: How to tune N, R, and W to guarantee mathematical consistency.

  • CRDTs (Conflict-free Replicated Data Types): The data structures that merge conflicts automatically without data loss.

  • Chaos Engineering: A Python script to simulate “Network Partitions” and prove if your app survives.

  • The “Last Write Wins” Trap: Why it deletes data and how to use Vector Clocks instead.
