High Availability Design
Design systems that survive failures: availability sets, availability zones, load balancing, and redundancy patterns.
🧠 ELI5 Explanation
Imagine a restaurant. With one chef, a sick chef means the restaurant closes (a single point of failure). With two chefs, one can be sick and the restaurant stays open. High availability works the same way: run multiple servers/instances so that if one fails, the others keep the system running. The load balancer is the host who directs customers to whichever chefs are available.
Core HA Concepts
Availability: The Metric
- 99%: Down 3.65 days/year (acceptable for internal apps)
- 99.9%: Down 8.8 hours/year
- 99.95%: Down 4.4 hours/year (common SLA target)
- 99.99%: Down 52 minutes/year (very high, pricey)
Calculation: Availability = (Uptime Hours / Total Hours) × 100%
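As a rough check on the figures above, the sketch below converts an availability percentage into expected downtime per year. It assumes a 365-day year (8,760 hours); the function names are illustrative and not part of any Azure tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def availability_percent(uptime_hours: float, total_hours: float) -> float:
    """Availability = (uptime hours / total hours) x 100."""
    return uptime_hours / total_hours * 100.0


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours per year a system may be down at the given availability."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.95, 99.99):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}%: {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Running the loop is a quick way to sanity-check an SLA target before committing to it.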
Failure Domains
Failure domain: A set of hardware that shares a single point of failure.
- Rack fault domain: All servers on one rack (they share a power supply and network switch, so one failure can take down the whole rack)
- Availability set: Spreads VMs across multiple fault domains (racks)
- Availability zone: Separate physical datacenters (power, networking, cooling independent)
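To make the idea of spreading across failure domains concrete, here is a minimal sketch that places VMs round-robin across a number of fault domains and lists which VMs survive when one domain fails. The round-robin placement and the function names are assumptions for illustration; the platform's real placement logic is not described in this guide.

```python
from collections import defaultdict


def spread_across_domains(vm_names, domain_count):
    """Assign VMs to fault domains 0..domain_count-1 in round-robin order."""
    placement = defaultdict(list)
    for index, vm in enumerate(vm_names):
        placement[index % domain_count].append(vm)
    return dict(placement)


def survivors(placement, failed_domain):
    """Return the VMs that are not in the failed fault domain."""
    return [
        vm
        for domain, vms in placement.items()
        if domain != failed_domain
        for vm in vms
    ]


if __name__ == "__main__":
    placement = spread_across_domains(["vm1", "vm2", "vm3", "vm4"], 2)
    print(placement)                 # which VMs landed in which domain
    print(survivors(placement, 0))   # VMs still running if domain 0 fails
```

If every VM lands in the same domain, `survivors` returns an empty list, which is exactly the single point of failure the section warns about.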
HA Options: Availability Sets vs Zones
Pattern 1: Availability Sets
Datacenter
├── Fault Domain 1 (Rack A)
│ ├── VM1 (Web Server)
│ ├── Power Supply A
│ └── Network Switch A
├── Fault Domain 2 (Rack B)
│ ├── VM2 (Web Server)
│ ├── Power Supply B
│ └── Network Switch B
└── Load Balancer → distributes requests to VM1 & VM2
If Fault Domain 1 fails: VM1 goes down, but VM2 (on Fault Domain 2) still serves traffic. SLA: 99.95% (requires two or more VMs in the set)
Pattern 2: Availability Zones
Region: East US
├── Zone 1 (Datacenter 1)
│ └── VM1 + LB (Power Grid A, Fiber A)
├── Zone 2 (Datacenter 2)
│ └── VM2 + LB (Power Grid B, Fiber B)
└── Zone 3 (Datacenter 3)
  └── VM3 + LB (Power Grid C, Fiber C)
Public IP + Zone-Redundant LB → distributes to all zones
If all of Zone 1 fails: VM1 and the Zone 1 LB go down, but VM2 and VM3 still serve traffic. SLA: 99.99% (requires VMs in two or more zones)
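Both patterns rely on the same arithmetic: redundant copies multiply the probability of failure down, while components in a chain multiply availability down. The sketch below shows that arithmetic under the simplifying assumption that failures are independent. The input values are hypothetical, and the published SLA figures are contractual commitments, not results of this formula.

```python
def parallel_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances where any one can serve traffic.

    Assumes failures are independent: the system is down only when
    every instance is down at the same time.
    """
    return 1.0 - (1.0 - instance_availability) ** instances


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Hypothetical: two VMs that are each available 99% of the time.
    print(parallel_availability(0.99, 2))
    # Hypothetical: a 99.99% compute tier in front of a 99.95% database.
    print(serial_availability(0.9999, 0.9995))
```

The serial case is a reminder that a highly available web tier does not raise the availability of a weaker dependency behind it.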
Load Balancing
Azure Load Balancer (Layer 4)
What: Routes TCP/UDP traffic across backends.
- Ultra-low latency, high performance
- Inbound (public IP to backend VMs), outbound (backend to internet)
- Health probes (only send traffic to healthy VMs)
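The interaction between health probes and traffic distribution can be sketched as a small simulation. This is not Azure's implementation: round-robin selection is used here for simplicity (Azure Load Balancer distributes flows by hash), and the class names and the `unhealthy_threshold` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0


class RoundRobinBalancer:
    """Toy balancer: skips backends that failed too many probes in a row."""

    def __init__(self, backends, unhealthy_threshold=2):
        self.backends = list(backends)
        self.unhealthy_threshold = unhealthy_threshold
        self._next = 0

    def record_probe(self, name, success):
        """Update a backend's health from one probe result."""
        for backend in self.backends:
            if backend.name == name:
                if success:
                    backend.consecutive_failures = 0
                    backend.healthy = True
                else:
                    backend.consecutive_failures += 1
                    if backend.consecutive_failures >= self.unhealthy_threshold:
                        backend.healthy = False
                return
        raise KeyError(name)

    def pick(self):
        """Return the next healthy backend name in round-robin order."""
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends")
        backend = healthy[self._next % len(healthy)]
        self._next += 1
        return backend.name


if __name__ == "__main__":
    lb = RoundRobinBalancer([Backend("vm1"), Backend("vm2"), Backend("vm3")])
    lb.record_probe("vm1", False)
    lb.record_probe("vm1", False)  # second failure marks vm1 unhealthy
    print([lb.pick() for _ in range(4)])
```

Requiring several consecutive failures before removing a backend avoids flapping on a single dropped probe, at the cost of slightly slower detection.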
Application Gateway (Layer 7)
What: HTTP/HTTPS application-level router.
- Route based on hostname, URL path, HTTP headers
- SSL/TLS termination
- Web Application Firewall (WAF) protection
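Layer 7 routing means the router can read the request before choosing a backend. The sketch below picks a backend pool from the hostname and URL path. The rule structure, the pool names, and the longest-prefix tie-break are assumptions made for this example and do not mirror Application Gateway's actual configuration model.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    host: Optional[str]   # None matches any hostname
    path_prefix: str      # e.g. "/api/"
    backend_pool: str     # hypothetical pool name


def route(rules, host, path, default_pool="default-pool"):
    """Pick the pool of the matching rule with the longest path prefix."""
    matches = [
        rule
        for rule in rules
        if (rule.host is None or rule.host == host.lower())
        and path.startswith(rule.path_prefix)
    ]
    if not matches:
        return default_pool
    return max(matches, key=lambda rule: len(rule.path_prefix)).backend_pool


if __name__ == "__main__":
    rules = [
        Rule("shop.example.com", "/api/", "api-pool"),
        Rule("shop.example.com", "/images/", "static-pool"),
        Rule(None, "/", "web-pool"),
    ]
    print(route(rules, "shop.example.com", "/api/cart"))
    print(route(rules, "shop.example.com", "/checkout"))
```

A Layer 4 balancer cannot make this decision, because it sees only addresses and ports, not hostnames or paths.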
Front Door (Global Layer 7)
What: Global load balancer (multi-region).
- Route across regions based on latency, geography
- DDoS protection
- Used for multi-region failover
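Latency-based routing with failover reduces to one rule: send the user to the lowest-latency region that is currently healthy. The following is a minimal sketch of that rule only; the region names and latency numbers are made up, and Front Door's real routing involves more than this.

```python
def pick_region(latency_ms, healthy_regions):
    """Return the healthy region with the lowest measured latency.

    latency_ms: mapping of region name -> latency in milliseconds
    healthy_regions: set of region names currently passing health checks
    """
    candidates = {
        region: ms for region, ms in latency_ms.items() if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    latencies = {"eastus": 20, "westeurope": 90}  # hypothetical measurements
    print(pick_region(latencies, {"eastus", "westeurope"}))
    print(pick_region(latencies, {"westeurope"}))  # eastus is down: fail over
```

Failover costs the user some latency but keeps the site reachable, which is the trade-off a global load balancer makes.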
Real-world Example: E-commerce HA Architecture
Setup:
• 3 Web VMs across 3 availability zones (99.99% SLA)
• Azure Load Balancer (Layer 4) distributes traffic
• Backend pool: all 3 VMs
• Health probe: check /health every 5 seconds
• Database: SQL Database with geo-replication (separate region)
• Storage: geo-redundant (replicated to a secondary region for failover)
Failure Scenarios:
• 1 VM fails → LB detects unhealthy, routes to other 2 VMs (no user impact)
• Zone 1 fails → only the Zone 1 VM goes down; LB routes to the VMs in Zones 2 and 3. (If all 3 VMs were in Zone 1, all would go down: bad!)
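• Region fails → fail over to the geo-replicated database and the secondary storage region (the web VMs are in one region and would need to be redeployed elsewhere)
Result: 99.99% SLA for compute, which survives a zone failure; the data tier also survives a region failure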
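Surviving a zone failure only helps if the remaining zones can carry the full load. The sketch below estimates how many VMs each zone needs so that peak traffic is still served after losing a zone. The throughput figures are hypothetical, and even distribution of traffic across zones is assumed.

```python
import math


def vms_per_zone_needed(peak_rps, rps_per_vm, zones, zones_lost=1):
    """VMs to run in each zone so the surviving zones can serve peak load."""
    surviving_zones = zones - zones_lost
    if surviving_zones <= 0:
        raise ValueError("at least one zone must survive")
    total_vms_needed = math.ceil(peak_rps / rps_per_vm)
    return math.ceil(total_vms_needed / surviving_zones)


if __name__ == "__main__":
    # Hypothetical: 900 requests/s at peak, each VM handles 300 requests/s.
    per_zone = vms_per_zone_needed(peak_rps=900, rps_per_vm=300, zones=3)
    print(f"{per_zone} VMs per zone, {per_zone * 3} VMs in total")
```

With one VM per zone, as in the setup above, the two surviving VMs must each be able to handle half of peak traffic on their own.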
Design Decisions
Decision 1: Availability Set or Zone-Redundant?
- Availability Set: Lower SLA (99.95%), but lower cost, lower latency
- Zones: Higher SLA (99.99%), survives larger failures, slight latency/cost increase
Decision 2: Load Balancer vs Application Gateway vs Front Door?
- LB: High performance, Layer 4, best for TCP/UDP
- AppGW: Application-aware routing, WAF, HTTPS termination
- Front Door: Multi-region, geolocation-based routing
Summary
- HA: System survives component failures with little or no downtime
- Availability Set: Multiple fault domains (same datacenter), 99.95% SLA
- Availability Zone: Multiple datacenters, 99.99% SLA, higher resilience
- Load Balancing: Route traffic away from failed instances
- Health probes: Detect failures and remove bad instances
Interview Questions
A: Sets protect from hardware failure in same datacenter (99.95% SLA). Zones protect from entire datacenter failure (99.99% SLA, multi-datacenter).
A: Deploy 3+ VMs across 3 zones behind a zone-redundant Azure Load Balancer. If one zone fails, the other 2 zones handle the load, so size each zone with that in mind. Zone-to-zone latency is a few milliseconds at most, so the total latency impact is minimal. Use active-active (all zones actively serve traffic).
A: Cost-sensitive or latency-sensitive apps that don't need zone-failure protection. Example: an internal app for which a 99.95% SLA is acceptable. Zones also incur inter-zone data transfer costs.