Intermediate · Reliability

High Availability Design

Design systems that survive failures: availability sets, availability zones, load balancing, and redundancy patterns.

🧠 ELI5 Explanation

Imagine a restaurant. With one chef, a sick chef closes the restaurant: that is a single point of failure. With 2 chefs, one can be sick and the restaurant stays open. High Availability = multiple servers/instances, so if one fails, the others keep the system running. Load Balancer = the host who directs customers to an available chef.

Core HA Concepts

Availability: The Metric

  • 99%: Down 3.6 days/year (acceptable for internal apps)
  • 99.9%: Down 8.7 hours/year
  • 99.95%: Down 4.4 hours/year (common SLA target)
  • 99.99%: Down 52 minutes/year (very high, pricey)

Calculation: Availability = (Uptime Hours / Total Hours) × 100%
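The downtime figures above follow directly from this formula. As a minimal sketch (the function names are illustrative, and a year is assumed to be 365 days), the arithmetic can be checked like this:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def measured_availability(uptime_hours: float, total_hours: float) -> float:
    """Availability (%) = uptime hours / total hours x 100."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return uptime_hours / total_hours * 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        hours = annual_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.0f} min)")
```

Running it for the four targets reproduces the list above to within rounding (for example, 99.9% works out to 8.76 hours per year).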

Failure Domains

Failure domain: A set of hardware that shares a single point of failure.

  • Rack fault domain: All servers on one rack (they share a power supply and network switch, so a single failure can take down the whole rack)
  • Availability set: Spreads VMs across multiple fault domains (racks) within one datacenter
  • Availability zone: Physically separate datacenters within a region (power, networking, cooling independent)

HA Options: Availability Sets vs Zones

| Feature | Availability Set | Availability Zone |
| --- | --- | --- |
| Failure Scope | Protects from rack/hardware failure | Protects from entire datacenter failure |
| Latency | <2ms (same datacenter) | ~5ms (different datacenters) |
| Cost | Free (VMs same price) | Data transfer between zones = $0.02/GB |
| Availability SLA | 99.95% (2+ VMs) | 99.99% (2+ VMs across 2+ zones) |
| When to Use | Lower availability needs, latency-sensitive apps | Mission-critical, must survive zone failure |

Pattern 1: Availability Sets

Availability Set: 2 VMs across 2 Fault Domains

Datacenter
├── Fault Domain 1 (Rack A)
│ ├── VM1 (Web Server)
│ ├── Power Supply A
│ └── Network Switch A
├── Fault Domain 2 (Rack B)
│ ├── VM2 (Web Server)
│ ├── Power Supply B
│ └── Network Switch B
└── Load Balancer → distributes requests to VM1 & VM2

If Fault Domain 1 fails: VM1 down, but VM2 (on Fault Domain 2) still serves traffic. SLA: 99.95%
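To make the fault-domain idea concrete, here is a small simulation. It is only a sketch: the platform decides real placement, and the round-robin assignment and the function names below are assumptions made for illustration.

```python
from collections import defaultdict


def place_round_robin(vm_names, fault_domain_count):
    """Assign VMs to fault domains 0..N-1 in round-robin order (assumed policy)."""
    if fault_domain_count < 1:
        raise ValueError("need at least one fault domain")
    placement = defaultdict(list)
    for index, vm in enumerate(vm_names):
        placement[index % fault_domain_count].append(vm)
    return dict(placement)


def survivors(placement, failed_domain):
    """Return the VMs still running after one fault domain fails."""
    return [
        vm
        for domain, vms in sorted(placement.items())
        if domain != failed_domain
        for vm in vms
    ]


if __name__ == "__main__":
    placement = place_round_robin(["VM1", "VM2"], fault_domain_count=2)
    print(placement)                # VM1 in domain 0, VM2 in domain 1
    print(survivors(placement, 0))  # VM2 keeps serving traffic
```

With both VMs in a single fault domain, the same failure would leave no survivors, which is the situation an availability set is meant to avoid.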

Pattern 2: Availability Zones

Availability Zones: 3 VMs across 3 Datacenters

Region: East US
├── Zone 1 (Datacenter 1)
│ └── VM1 + LB (Power Grid A, Fiber A)
├── Zone 2 (Datacenter 2)
│ └── VM2 + LB (Power Grid B, Fiber B)
└── Zone 3 (Datacenter 3)
  └── VM3 + LB (Power Grid C, Fiber C)

Public IP + Zone-Redundant LB → distributes to all zones

If entire Zone 1 fails: VM1 & Zone 1 LB down, but VM2 & VM3 still serve traffic. SLA: 99.99%
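Surviving a zone failure also depends on capacity: the remaining zones have to absorb the traffic of the lost one. The sizing rule below is a common rule of thumb rather than anything stated by the platform, and the function name and parameters are hypothetical.

```python
import math


def instances_per_zone(peak_instances_needed: int, zones: int, zones_lost: int = 1) -> int:
    """VMs to run in each zone so the surviving zones still cover peak load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances_needed / surviving)


if __name__ == "__main__":
    # If peak load needs 4 VMs and there are 3 zones, run 2 per zone (6 total)
    # so that any 2 surviving zones still provide the 4 VMs required.
    print(instances_per_zone(peak_instances_needed=4, zones=3))
```

In the three-VM diagram above, this means two VMs must be able to carry the full peak load on their own.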

Load Balancing

Azure Load Balancer (Layer 4)

What: Routes TCP/UDP traffic across backends.

  • Ultra-low latency, high performance
  • Inbound (public IP to backend VMs), outbound (backend to internet)
  • Health probes (only send traffic to healthy VMs)
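The health-probe behavior can be sketched in a few lines. This is an illustration, not Azure's implementation: Azure Load Balancer uses hash-based distribution by default, and round-robin is swapped in here for simplicity. The class names and the failure threshold are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Backend:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0


@dataclass
class BackendPool:
    backends: list
    unhealthy_threshold: int = 2  # hypothetical: failed probes before removal
    _counter: count = field(default_factory=count, repr=False)

    def record_probe(self, name: str, success: bool) -> None:
        """Update one backend's health from a single probe result."""
        backend = next(b for b in self.backends if b.name == name)
        if success:
            backend.consecutive_failures = 0
            backend.healthy = True
        else:
            backend.consecutive_failures += 1
            if backend.consecutive_failures >= self.unhealthy_threshold:
                backend.healthy = False

    def pick(self) -> str:
        """Round-robin over the backends that are currently healthy."""
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backends")
        return healthy[next(self._counter) % len(healthy)].name
```

A backend is taken out of rotation only after several consecutive failed probes, so one dropped probe does not remove a working VM, and a single successful probe puts it back.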

Application Gateway (Layer 7)

What: HTTP/HTTPS application-level router.

  • Route based on hostname, URL path, HTTP headers
  • SSL/TLS termination
  • Web Application Firewall (WAF) protection
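Hostname and URL-path routing can be illustrated with a small lookup. The rules, hostnames, and pool names are made up, and longest-prefix matching is an assumption chosen for the example; it is not a description of how Application Gateway evaluates its rules.

```python
from typing import Optional

# Hypothetical routing rules: (hostname, path prefix, backend pool name).
RULES = [
    ("shop.example.com", "/api/", "api-pool"),
    ("shop.example.com", "/images/", "static-pool"),
    ("shop.example.com", "/", "web-pool"),
]


def route(host: str, path: str, rules=RULES) -> Optional[str]:
    """Return the pool for the longest matching path prefix on this host."""
    matches = [
        (prefix, pool)
        for rule_host, prefix, pool in rules
        if rule_host == host.lower() and path.startswith(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: len(match[0]))[1]


if __name__ == "__main__":
    print(route("shop.example.com", "/api/cart"))  # api-pool
    print(route("shop.example.com", "/checkout"))  # web-pool
```

A Layer 4 load balancer cannot make this decision, because it never inspects the HTTP request.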

Front Door (Global Layer 7)

What: Global load balancer (multi-region).

  • Route across regions based on latency, geography
  • DDoS protection
  • Used for multi-region failover
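Latency-based routing with failover reduces to "pick the fastest healthy region". The sketch below assumes latency measurements and health status are already available; the region names and numbers are illustrative, and this is not Front Door's actual algorithm.

```python
def choose_region(latency_ms: dict, healthy: set) -> str:
    """Pick the lowest-latency region among those currently healthy."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    latency = {"eastus": 20, "westeurope": 90}  # hypothetical measurements
    print(choose_region(latency, {"eastus", "westeurope"}))  # eastus
    print(choose_region(latency, {"westeurope"}))            # failover: westeurope
```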

Real-world Example: E-commerce HA Architecture

Setup:
• 3 Web VMs across 3 availability zones (99.99% SLA)
• Azure Load Balancer (Layer 4) distributes traffic
• Backend pool: all 3 VMs
• Health probe: check /health every 5 seconds
• Database: SQL Database with geo-replication (separate region)
• Storage: geo-redundant (replicated to a secondary region, with failover to it)

Failure Scenarios:
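• 1 VM fails → LB detects it as unhealthy and routes to the other 2 VMs (no user impact)
• Zone 1 fails → Only the VM in Zone 1 is lost; the VMs in Zones 2 and 3 keep serving traffic. (Had all 3 VMs been placed in Zone 1, all of them would go down. Bad!)
• Region fails → Database geo-replication and blob storage failover kick in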

Result: 99.99% availability for compute, which survives a zone failure; the data tier (database and storage) can survive a region failure
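The 99.99% figure covers compute only. When a request depends on several components in series, end-to-end availability is the product of their availabilities. The sketch below assumes independent failures, and the 99.99% database value is a placeholder rather than a figure from this article.

```python
from math import prod


def serial_availability(*percents: float) -> float:
    """Availability (%) of a chain where every component must be up."""
    return prod(p / 100 for p in percents) * 100


def redundant_availability(percent: float, copies: int) -> float:
    """Availability (%) when any one of N independent copies is enough."""
    return (1 - (1 - percent / 100) ** copies) * 100


if __name__ == "__main__":
    # Compute at 99.99% in series with a database assumed to be 99.99%.
    print(f"{serial_availability(99.99, 99.99):.4f}%")
    # Two independent 99% instances behind a load balancer.
    print(f"{redundant_availability(99.0, 2):.2f}%")
```

Chaining components lowers availability and adding redundant copies raises it, which is why each tier in the example is made redundant on its own.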

Design Decisions

Decision 1: Availability Set or Zone-Redundant?

  • Availability Set: Lower SLA (99.95%), but lower cost, lower latency
  • Zones: Higher SLA (99.99%), survives larger failures, slight latency/cost increase

Decision 2: Load Balancer vs Application Gateway vs Front Door?

  • LB: High performance, Layer 4, best for TCP/UDP
  • AppGW: Application-aware routing, WAF, HTTPS termination
  • Front Door: Multi-region, geolocation-based routing

Summary

Interview Questions

Q: What's the difference between availability sets and availability zones?
A: Sets protect from hardware failure in same datacenter (99.95% SLA). Zones protect from entire datacenter failure (99.99% SLA, multi-datacenter).
Q: Design a highly available web app: must survive 1 zone failure with <100ms latency increase.
A: Deploy 3+ VMs across 3 zones behind an Azure LB that distributes traffic. If one zone fails, the other 2 zones handle the load, so size them to carry it. Zone-to-zone latency is ~5ms, so the added latency stays well under the 100ms budget. Use active-active (all zones actively serve traffic).
Q: When would you use an Availability Set instead of Zones?
A: For cost-sensitive or latency-sensitive apps that don't need zone-failure protection. Example: an internal app for which a 99.95% SLA is acceptable. Zones also incur inter-zone data transfer costs.