Multi-region Architecture
Design geo-distributed systems: active-active vs active-passive, failover strategies, data consistency, and latency optimization.
🧠 ELI5 Explanation
Imagine a pizza chain with one store. If the building burns down, no pizza for anyone. Now imagine stores in 3 cities (East, West, Central). Customers go to the nearest store. If the East store burns down, customers still get pizza from West and Central. That's multi-region: replicate your system across geographic areas so a failure in one region doesn't break everything.
Why Multi-region?
Multi-region Patterns
Pattern 1: Active-Passive (Failover)
Setup: Primary region handles all traffic. Secondary region is warm standby (data replicated, but not handling traffic).
- Pros: Simpler to manage, lower cost (the secondary runs at minimal capacity until needed)
- Cons: Failover takes time (RTO is minutes, not near-zero), secondary resources sit mostly idle
- When to use: Apps that can tolerate 5-15 minute failover, lower budget
Users
↓
Front Door (primary.region1.com)
↓
PRIMARY REGION (East US)
├── App VMs (running)
├── Database (primary, active)
└── Storage (active)
↓ data replication
SECONDARY REGION (West US)
├── App VMs (powered off or minimal)
├── Database (replica, read-only)
└── Storage (geo-redundant copy)
Normal: All traffic East US
Failure: Traffic fails over to West US, triggered manually or automatically after a health-check timeout
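The "auto after timeout" behavior can be sketched as a counter of consecutive failed health probes. This is a minimal illustration, not Front Door's actual algorithm: the class name, the region names, and the threshold of 3 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides which region serves traffic based on primary health probes."""
    primary: str = "east-us"        # hypothetical region names
    secondary: str = "west-us"
    failure_threshold: int = 3      # assumed: consecutive failures before failover
    consecutive_failures: int = 0
    failed_over: bool = False

    def record_probe(self, primary_healthy: bool) -> str:
        """Record one probe result and return the region that should serve traffic."""
        if self.failed_over:
            # Stay on the secondary; failing back is a separate, deliberate step.
            return self.secondary
        if primary_healthy:
            self.consecutive_failures = 0   # a single success resets the counter
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.failed_over = True
                return self.secondary
        return self.primary


monitor = FailoverMonitor()
probes = [True, False, True, False, False, False, True]
served_by = [monitor.record_probe(p) for p in probes]
print(served_by)
```

Requiring several consecutive failures avoids failing over on one dropped probe, at the cost of a longer detection time, which is part of why active-passive RTO is measured in minutes.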
Pattern 2: Active-Active (Always On)
Setup: Both regions handle traffic simultaneously. Users routed based on latency/geography.
- Pros: No failover delay (both always on), utilizes resources efficiently, better user experience
- Cons: Complex (data consistency, session management), higher cost (run 2x resources)
- When to use: Mission-critical, must survive region failure seamlessly, budget allows
Users
↓
Front Door (geolocation routing)
├── East US users → Region 1 app/db
└── West US users → Region 2 app/db
Region 1: App + Primary DB (10M customers)
Region 2: App + Primary DB (10M customers)
Cross-region replication (eventual consistency)
Failure: If Region 1 fails, Front Door auto-routes East users to Region 2
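As a rough sketch of the routing decision above, a router can send each user to the lowest-latency region that is still healthy. The function name and the latency figures are hypothetical and do not reflect how Front Door is implemented internally.

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Return the healthy region with the lowest measured latency for this user."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Assumed latencies (ms) measured from an East US user to each region.
east_user = {"region-1": 10.0, "region-2": 70.0}

print(pick_region(east_user, {"region-1", "region-2"}))  # both healthy
print(pick_region(east_user, {"region-2"}))              # Region 1 is down
```

With both regions healthy the East user lands on region-1; once region-1 drops out of the healthy set, the same call returns region-2 with no separate failover step.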
Data Consistency Challenges
Challenge: Cross-region Data Replication
Problem: If you replicate data across regions, which is the source of truth? What if writes happen to both at once?
Solutions:
- Asynchronous (Eventual Consistency): Write to primary, async replicate to secondary. Secondary may be slightly behind. Best for high-performance systems (e.g., social media likes)
- Synchronous (Strong Consistency): Write to primary, which acknowledges the write only after the secondary has confirmed it. Slower writes (each one waits on a cross-region round trip), but both copies always agree. Best for financial/healthcare
- Multi-master: Both regions accept writes. Conflict resolution needed (usually last-write-wins or custom logic)
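Last-write-wins, the most common multi-master conflict rule mentioned above, can be sketched in a few lines. The `Versioned` type and its field names are assumptions for this example; real databases attach their own version metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value plus the metadata needed to resolve conflicts (hypothetical shape)."""
    value: str
    timestamp_ms: int   # wall-clock time of the write, in milliseconds
    region: str         # region that accepted the write


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep the write with the later timestamp; break ties by region name
    so every replica picks the same winner regardless of argument order."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = Versioned("shipping: 12 Main St", timestamp_ms=1_000, region="east-us")
west = Versioned("shipping: 9 Oak Ave", timestamp_ms=1_250, region="west-us")

print(lww_merge(east, west).value)
print(lww_merge(west, east).value)  # same winner either way
```

The losing write is silently discarded, and the result depends on clocks being reasonably synchronized across regions. Those two limits are the usual reason to prefer custom merge logic for data where every write matters.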
Failover Scenarios
Real-world Example: Global SaaS Platform
Setup (Active-Active):
• Region 1 (East US): App tier + read-write database
• Region 2 (West EU): App tier + read-write database
• Front Door: Routes based on user geography (US users → Region 1, EU users → Region 2)
• Data: Cross-region replication with eventual consistency (low latency priority)
Normal Operation:
User in New York hits Region 1 (fast), user in London hits Region 2 (fast)
Scenario 1: East US datacenter fire
• Region 1 DB goes down
• Front Door detects region unhealthy
• East US users auto-routed to Region 2
• Latency rises noticeably (roughly 300ms vs 10ms), but service continues
Scenario 2: West EU network issue
• Region 2 temporarily unreachable
• EU users fall back to Region 1
• Worse latency, but no downtime
Result: 99.99%+ availability, survives single region failure, optimized latency per geography
Multi-region Design Decisions
Decision 1: Active-Active or Active-Passive?
- Active-Active: Higher SLA target (99.99%+), tolerate complexity/cost
- Active-Passive: 99.9% SLA acceptable, lower cost/complexity
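To make the SLA figures concrete, the arithmetic below converts an availability target into allowed downtime per year and estimates the combined availability of several regions. The combined figure assumes regions fail independently and that any one region can serve all traffic, which is an idealization; real outages are sometimes correlated.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability (0.999 means 99.9%)."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independent failures."""
    return 1 - (1 - per_region) ** regions


for target in (0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_minutes_per_year(target):.1f} min/year")

print(f"two 99.9% regions -> {combined_availability(0.999, 2):.4%}")
```

A 99.9% target allows about 8.8 hours of downtime per year, while 99.99% allows about 53 minutes, which leaves little room for a 5-15 minute failover.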
Decision 2: How Many Regions?
- 2 regions: Survives 1 region failure, standard choice
- 3+ regions: Can survive 2 simultaneous region failures (very rare), provided the remaining region can carry the full load; expensive, use only if critical
Decision 3: Data Consistency Model?
- Eventual consistency: Fast writes, acceptable delay, non-transactional data
- Strong consistency: Slow writes, always up-to-date, financial/healthcare data
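The "fast writes" versus "slow writes" trade-off comes down to whether the client waits for a cross-region round trip. The toy model below illustrates this; the 5 ms local commit and 80 ms round-trip time are assumed figures, not measurements.

```python
def write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float,
                     synchronous: bool) -> float:
    """Client-visible write latency under a simple model:
    synchronous replication waits for the remote acknowledgement,
    asynchronous replication acknowledges after the local commit."""
    remote_wait = cross_region_rtt_ms if synchronous else 0.0
    return local_commit_ms + remote_wait


# Assumed figures for illustration only.
LOCAL_COMMIT_MS = 5.0
CROSS_REGION_RTT_MS = 80.0

print("eventual:", write_latency_ms(LOCAL_COMMIT_MS, CROSS_REGION_RTT_MS, synchronous=False), "ms")
print("strong:  ", write_latency_ms(LOCAL_COMMIT_MS, CROSS_REGION_RTT_MS, synchronous=True), "ms")
```

Under these assumptions a strongly consistent write takes 85 ms against 5 ms for an eventually consistent one, and the gap grows with the distance between regions.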
Summary
- Multi-region: Replicate system across geographic regions for disaster recovery & latency
- Active-Passive: Lower cost, manual or timeout-based failover, longer RTO (minutes)
- Active-Active: Higher SLA, near-zero RTO, complex data consistency
- Data consistency: Choose eventual (fast) vs strong (safe) based on use case
- Front Door: Geo-aware routing, automatic failover detection
Interview Questions
A: Active-Passive: Primary handles all traffic, secondary stands by. Cheaper, simpler, slower failover (~5-15 min). Active-Active: Both regions handle traffic, instant failover. More complex/costly, used for mission-critical. Choose based on RTO needs.
A: Active-Active: Region 1 (US) + Region 2 (EU) + Region 3 (Asia). Each region has app + game state database. Front Door routes by geography. Use eventual consistency for game state (fast updates). Anti-cheat service centralized or distributed. Costs high, but achieves SLA & latency.
A: RTO = Recovery Time Objective = how quickly you can fail over (active-active = near-zero, active-passive = 5-15 min). RPO = Recovery Point Objective = how much data you can afford to lose (async replication = recent writes may be lost, sync = no acknowledged writes lost).
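A back-of-the-envelope estimate of both objectives might look like the sketch below. The breakdown of RTO into detection, promotion, and DNS propagation is a common simplification, and every input value is a hypothetical example rather than a vendor figure.

```python
def estimated_rto_seconds(probe_interval_s: float, failure_threshold: int,
                          promotion_s: float, dns_ttl_s: float) -> float:
    """Rough active-passive RTO: time to detect the outage, promote the
    secondary, and wait for clients to pick up the new endpoint."""
    detection_s = probe_interval_s * failure_threshold
    return detection_s + promotion_s + dns_ttl_s


def estimated_writes_lost(replication_lag_s: float, writes_per_second: float) -> float:
    """Rough RPO exposure under async replication: writes accepted by the
    primary but not yet shipped to the secondary when it fails."""
    return replication_lag_s * writes_per_second


# Hypothetical inputs: 30 s probes, 3 failures to trip, 5 min DB promotion, 60 s TTL.
rto = estimated_rto_seconds(probe_interval_s=30, failure_threshold=3,
                            promotion_s=300, dns_ttl_s=60)
print(f"RTO about {rto / 60:.1f} minutes")

# Hypothetical inputs: 5 s replication lag at 200 writes per second.
print(f"RPO exposure about {estimated_writes_lost(5, 200):.0f} writes")
```

With these inputs the RTO comes out at 7.5 minutes, inside the 5-15 minute range quoted for active-passive, and the RPO exposure is about 1,000 writes.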