
Multi-region Architecture

Design geo-distributed systems: active-active vs active-passive, failover strategies, data consistency, and latency optimization.

🧠 ELI5 Explanation

Imagine a pizza chain with one store. If the building burns down, no one gets pizza. Now imagine stores in three cities (East, West, Central). Customers go to the nearest store. If the East store burns down, customers can still get pizza from West and Central. That's multi-region: replicate your system across geographic areas so a failure in one region doesn't break everything.

Why Multi-region?

Benefit | Explanation
Disaster Recovery | If the primary region fails (power outage, natural disaster), the secondary region takes over
Latency Reduction | Users are routed to the nearest region (~10ms vs 100ms+)
Compliance | Some data must stay in specific countries or regions (e.g., GDPR for personal data of people in the EU)
High Availability | 99.99%+ SLA achievable with multi-region redundancy

Multi-region Patterns

Pattern 1: Active-Passive (Failover)

Setup: The primary region handles all traffic. The secondary region is a warm standby (data is replicated to it, but it serves no traffic).

Pros: Simpler to manage, lower cost (secondary doesn't fully run)
Cons: Failover takes time (RTO not near-zero), secondary resources unused
When to use: Apps that can tolerate 5-15 minute failover, lower budget

Active-Passive: Primary handles traffic, Secondary on standby


        Users
          ↓
        Front Door (primary.region1.com)
          ↓
        PRIMARY REGION (East US)
        ├── App VMs (running)
        ├── Database (primary, active)
        └── Storage (active)
          ↓ data replication
        SECONDARY REGION (West US)
        ├── App VMs (powered off or minimal)
        ├── Database (replica, read-only)
        └── Storage (geo-redundant copy)

        Normal: All traffic East US
        Failure: Manual failover or auto after timeout
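The "auto after timeout" path can be sketched as a small monitor that fails over only after several consecutive failed health probes, so a single dropped probe does not trigger a region switch. This is a minimal illustration, not how any particular load balancer implements it: the class name, the region names, and the threshold of 3 are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides which region serves traffic, based on primary health probes."""

    primary: str = "east-us"        # hypothetical region names
    secondary: str = "west-us"
    failure_threshold: int = 3      # assumed: consecutive failures before failover
    consecutive_failures: int = 0
    failed_over: bool = False

    def record_probe(self, primary_healthy: bool) -> str:
        """Record one probe result and return the region that should serve traffic."""
        if self.failed_over:
            # Stay on the secondary; failing back is a separate, deliberate step.
            return self.secondary
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.failed_over = True
                return self.secondary
        return self.primary


monitor = FailoverMonitor()
probes = [True, False, False, True, False, False, False, True]
results = [monitor.record_probe(ok) for ok in probes]
# The first six probes keep traffic on east-us; the third consecutive
# failure (7th probe) switches to west-us, and it stays there afterwards.
```

The monitor deliberately does not fail back on its own when the primary recovers, since the secondary may have accepted writes that the old primary has not seen.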

Pattern 2: Active-Active (Always On)

Setup: Both regions handle traffic simultaneously. Users routed based on latency/geography.

Pros: No failover delay (both always on), utilizes resources efficiently, better user experience
Cons: Complex (data consistency, session management), higher cost (run 2x resources)
When to use: Mission-critical, must survive region failure seamlessly, budget allows

Active-Active: Both regions handle traffic


        Users
          ↓
        Front Door (geolocation routing)
        ├── East US users → Region 1 app/db
        └── West US users → Region 2 app/db

        Region 1: App + Primary DB (10M customers)
        Region 2: App + Primary DB (10M customers)

        Cross-region replication (eventual consistency)

        Failure: If Region 1 fails, Front Door auto-routes East users to Region 2
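The routing behaviour in the diagram (nearest region in normal operation, automatic fallback on failure) can be sketched as "pick the lowest-latency healthy region". The latency table, region names, and function name below are illustrative assumptions, not measurements or a real routing API.

```python
# Hypothetical round-trip latencies (ms) from each user geography to each region.
LATENCY_MS = {
    "east-us-users": {"region-1": 10, "region-2": 70},
    "west-us-users": {"region-1": 70, "region-2": 10},
}


def pick_region(user_geo: str, healthy: set[str]) -> str:
    """Return the lowest-latency region that is currently healthy."""
    candidates = {
        region: ms
        for region, ms in LATENCY_MS[user_geo].items()
        if region in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Normal operation: each geography is served by its nearest region.
normal = pick_region("east-us-users", {"region-1", "region-2"})

# Region 1 is marked unhealthy: East users are routed to Region 2.
degraded = pick_region("east-us-users", {"region-2"})
```

Because the health set is an input, the same function covers both normal routing and failover; the surviving region must have enough spare capacity to absorb the redirected users.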

Data Consistency Challenges

Challenge: Cross-region Data Replication

Problem: If you replicate data across regions, which is the source of truth? What if writes happen to both at once?

Solutions:

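Asynchronous (Eventual Consistency): Write to primary, acknowledge immediately, and replicate to secondary in the background. Secondary may be slightly behind, so the most recent writes can be lost on failover. Best for high-performance systems (e.g., social media likes)
Synchronous (Strong Consistency): Write to primary and wait for the secondary to confirm before acknowledging. Slower writes (each one pays a cross-region round trip), but both copies are always consistent. Best for financial/healthcare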
Multi-master: Both regions accept writes. Conflict resolution needed (usually last-write-wins or custom logic)
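As an illustration of the multi-master case, the sketch below resolves a conflict with last-write-wins: each write carries a timestamp, and the later write replaces the earlier one. The type and field names are hypothetical, and the approach assumes the regions' clocks are reasonably synchronized; the region name is used only as a deterministic tie-breaker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp_ms: int  # wall-clock time of the write (assumes synchronized clocks)
    region: str        # tie-breaker when two timestamps are equal


def last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Return the version both regions should converge on."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# The same record is updated in both regions before replication catches up.
east = VersionedValue("alice@old.example", timestamp_ms=1000, region="east-us")
west = VersionedValue("alice@new.example", timestamp_ms=1005, region="west-eu")

winner = last_write_wins(east, west)
# winner is the west-eu write, because it has the later timestamp.
```

Last-write-wins is simple and makes both regions converge on the same value, but it silently discards the losing write. That is usually fine for a profile field and not fine for something like an account balance, which is where custom merge logic comes in.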

Failover Scenarios

Scenario | Active-Passive | Active-Active
Primary fails | Manual/auto failover (5-15 min), secondary activated | Automatic routing (seconds), no manual intervention
Network partition | Primary isolated, failover triggered | Both regions isolated, risk of split-brain (2 primaries)
Partial failure | Keep primary online if partially working | Remove failed region from routing, other region continues

Real-world Example: Global SaaS Platform

Setup (Active-Active):
• Region 1 (East US): App tier + read-write database
• Region 2 (West EU): App tier + read-write database
• Front Door: Routes based on user geography (US users → Region 1, EU users → Region 2)
• Data: Cross-region replication with eventual consistency (low latency priority)

Normal Operation:
User in New York hits Region 1 (fast), user in London hits Region 2 (fast)

Scenario 1: East US datacenter fire
• Region 1 DB goes down
• Front Door detects region unhealthy
• East US users auto-routed to Region 2
• Noticeably higher latency (300ms vs 10ms), but service continues

Scenario 2: West EU network issue
• Region 2 temporarily unreachable
• EU users fall back to Region 1
• Worse latency, but no downtime

Result: 99.99%+ availability, survives single region failure, optimized latency per geography

Multi-region Design Decisions

Decision 1: Active-Active or Active-Passive?

Active-Active: Higher SLA target (99.99%+), tolerate complexity/cost
Active-Passive: 99.9% SLA acceptable, lower cost/complexity
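To make the two SLA targets concrete, the sketch below converts an availability percentage into a yearly downtime budget, and estimates the combined availability of several regions. The combined figure assumes regions fail independently and that routing and failover are perfect, which real systems only approximate; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime allowed by an availability target (0.999 means 99.9%)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independent failures."""
    return 1.0 - (1.0 - per_region) ** regions


budget_three_nines = round(downtime_minutes_per_year(0.999), 1)   # 525.6 min, about 8.8 hours
budget_four_nines = round(downtime_minutes_per_year(0.9999), 2)   # 52.56 min

# Two regions at 99.9% each, under the independence assumption above.
two_regions = combined_availability(0.999, 2)
```

Going from 99.9% to 99.99% cuts the yearly budget from roughly 8.8 hours to under an hour, which is why a 5-15 minute failover can consume a large share of a 99.99% budget in a single incident.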

Decision 2: How Many Regions?

2 regions: Survives 1 region failure, standard choice
3+ regions: Survives 2 region failures (very rare), expensive, use if critical

Decision 3: Data Consistency Model?

Eventual consistency: Fast writes, acceptable delay, non-transactional data
Strong consistency: Slow writes, always up-to-date, financial/healthcare data

Summary

Multi-region: Replicate system across geographic regions for disaster recovery & latency
Active-Passive: Lower cost, manual or automatic failover, slower RTO (5-15 min)
Active-Active: Higher SLA, near-instant failover (seconds), complex data consistency
Data consistency: Choose eventual (fast) vs strong (safe) based on use case
Front Door: Geo-aware routing, automatic failover detection

Interview Questions

Q: Explain active-active vs active-passive and when to use each.
A: Active-Passive: Primary handles all traffic, secondary stands by. Cheaper, simpler, slower failover (~5-15 min). Active-Active: Both regions handle traffic, so failover is a routing change that takes seconds. More complex/costly, used for mission-critical systems. Choose based on RTO needs and budget.

Q: Design a multi-region architecture for a game that needs 99.99% SLA and low latency (<50ms) globally.
A: Active-Active: Region 1 (US) + Region 2 (EU) + Region 3 (Asia). Each region has app + game state database. Front Door routes by geography. Use eventual consistency for game state (fast updates). Anti-cheat service can be centralized or distributed. Costs are high, but the design achieves the SLA and latency targets.

Q: What is the difference between RTO and RPO in multi-region setups?
A: RTO = Recovery Time Objective = the maximum acceptable time to restore service after a failure (active-active = near-zero, active-passive = 5-15 min). RPO = Recovery Point Objective = the maximum acceptable data loss, measured as a window of time (async replication = recent writes may be lost, sync = no acknowledged writes are lost).
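A rough way to reason about these two numbers: with asynchronous replication, the writes at risk are approximately the replication lag multiplied by the write rate, and a design meets its objectives only if the measured failover time and lag both fit within the targets. The function names and all figures below are illustrative assumptions.

```python
def writes_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Approximate number of writes not yet replicated when the primary fails."""
    return replication_lag_s * writes_per_s


def meets_objectives(
    failover_time_s: float,
    replication_lag_s: float,
    rto_s: float,
    rpo_s: float,
) -> bool:
    """True if measured failover time and replication lag fit the RTO and RPO."""
    return failover_time_s <= rto_s and replication_lag_s <= rpo_s


# Hypothetical active-passive setup: 5 s replication lag, 200 writes/s,
# 10-minute failover, against a 15-minute RTO and a 1-minute RPO.
at_risk = writes_at_risk(replication_lag_s=5, writes_per_s=200)
ok = meets_objectives(failover_time_s=600, replication_lag_s=5, rto_s=900, rpo_s=60)

# The same setup fails a stricter 1-minute RTO.
strict = meets_objectives(failover_time_s=600, replication_lag_s=5, rto_s=60, rpo_s=60)
```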