Intermediate | Reliability

High Availability Design

Design systems that survive failures: availability sets, availability zones, load balancing, and redundancy patterns.

🧠 ELI5 Explanation

Imagine a restaurant. With one chef, a sick chef means the restaurant closes: the chef is a single point of failure. With two chefs, one can be out sick and the restaurant stays open. High Availability = running multiple servers/instances so that if one fails, the others keep the system running. Load Balancer = the host who directs customers to whichever chefs are available.

Core HA Concepts

Availability: The Metric

99%: Down 3.65 days/year (acceptable for internal apps)
99.9%: Down 8.76 hours/year
99.95%: Down 4.38 hours/year (common SLA target)
99.99%: Down 52.6 minutes/year (very high, pricey)

Calculation: Availability (%) = (Uptime Hours / Total Hours) × 100
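The downtime figures above follow directly from this formula. A minimal sketch, assuming a 365-day year (8,760 hours); the function names are illustrative only:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def availability_pct(uptime_hours: float, total_hours: float) -> float:
    """Availability as a percentage of total time."""
    return uptime_hours / total_hours * 100


def downtime_hours_per_year(target_pct: float) -> float:
    """Maximum yearly downtime (hours) allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - target_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the higher targets cost so much more to meet.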

Failure Domains

Failure domain: A set of hardware that shares a single point of failure.

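Rack fault domain: All servers in one rack, which share a power supply and network switch (if either fails, every server in the rack goes down)
Availability set: Spreads VMs across multiple fault domains (racks) within one datacenter
Availability zone: Physically separate datacenters within a region, each with independent power, networking, and cooling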
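To make the idea of spreading VMs across fault domains concrete, here is a small model. It is only an illustration: on a real platform the placement is done for you, and the round-robin rule and function names below are assumptions, not the platform's actual algorithm.

```python
from collections import defaultdict
from typing import Dict, List


def spread_across_fault_domains(vm_names: List[str], fault_domain_count: int) -> Dict[int, List[str]]:
    """Assign VMs to fault domains round-robin (illustrative placement rule)."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for index, vm in enumerate(vm_names):
        placement[index % fault_domain_count].append(vm)
    return dict(placement)


def survivors(placement: Dict[int, List[str]], failed_domain: int) -> List[str]:
    """VMs still running after one fault domain is lost."""
    return [vm for domain, vms in placement.items() if domain != failed_domain for vm in vms]


if __name__ == "__main__":
    placement = spread_across_fault_domains(["vm1", "vm2", "vm3", "vm4"], 2)
    print(placement)
    print("Still running after fault domain 0 fails:", survivors(placement, 0))
```

Because no single fault domain holds every VM, losing one rack never takes the whole application down.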

HA Options: Availability Sets vs Zones

Feature | Availability Set | Availability Zone
Failure Scope | Protects from rack/hardware failure | Protects from entire datacenter failure
Latency | <2ms (same datacenter) | ~5ms (different datacenters)
Cost | Free (VMs same price) | Data transfer between zones = $0.02/GB
Availability SLA | 99.95% (2+ VMs) | 99.99% (2+ VMs across 2+ zones)
When to Use | Lower availability needs, latency-sensitive apps | Mission-critical, must survive zone failure

Pattern 1: Availability Sets

Availability Set: 2 VMs across 2 Fault Domains


      Datacenter
      ├── Fault Domain 1 (Rack A)
      │   ├── VM1 (Web Server)
      │   ├── Power Supply A
      │   └── Network Switch A
      ├── Fault Domain 2 (Rack B)
      │   ├── VM2 (Web Server)
      │   ├── Power Supply B
      │   └── Network Switch B
      └── Load Balancer → distributes requests to VM1 & VM2

If Fault Domain 1 fails: VM1 down, but VM2 (on Fault Domain 2) still serves traffic. SLA: 99.95%
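Why does a second VM help so much? A rough way to see it is the standard formula for redundant components: the group is down only when every instance is down at once. The sketch below assumes failures are independent and uses a hypothetical 99% per-instance availability; it is a simplified model, not how the published SLA figures are derived.

```python
def parallel_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances, assuming independent failures.

    The group is unavailable only if all instances fail together, so
    A_group = 1 - (1 - A_instance) ** N.
    """
    return 1 - (1 - instance_availability) ** instances


if __name__ == "__main__":
    single = 0.99  # hypothetical per-VM availability, for illustration only
    for count in (1, 2, 3):
        print(f"{count} instance(s): {parallel_availability(single, count):.6f}")
```

In practice failures are not fully independent (a shared rack, datacenter, or bad deployment can take out several instances together), which is exactly why instances are placed in separate fault domains or zones.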

Pattern 2: Availability Zones

Availability Zones: 3 VMs across 3 Datacenters


      Region: East US
      ├── Zone 1 (Datacenter 1)
      │   └── VM1 + LB (Power Grid A, Fiber A)
      ├── Zone 2 (Datacenter 2)
      │   └── VM2 + LB (Power Grid B, Fiber B)
      └── Zone 3 (Datacenter 3)
          └── VM3 + LB (Power Grid C, Fiber C)

      Public IP + Zone-Redundant LB → distributes to all zones

If entire Zone 1 fails: VM1 & Zone 1 LB down, but VM2 & VM3 still serve traffic. SLA: 99.99%

Load Balancing

Azure Load Balancer (Layer 4)

What: Routes TCP/UDP traffic across backends.

Ultra-low latency, high performance
Inbound (public IP to backend VMs), outbound (backend to internet)
Health probes (only send traffic to healthy VMs)
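The health-probe behavior can be sketched as a tiny load balancer that only hands out backends whose probe currently passes. This is a toy model: the class and names are hypothetical, and round-robin is used here only for simplicity; it is not claimed to be Azure Load Balancer's distribution algorithm.

```python
from typing import Callable, List


class ProbedRoundRobin:
    """Round-robin over backends that currently pass a health probe."""

    def __init__(self, backends: List[str], probe: Callable[[str], bool]) -> None:
        self.backends = backends
        self.probe = probe      # returns True when a backend is healthy
        self._next = 0          # counter used to rotate through healthy backends

    def healthy_backends(self) -> List[str]:
        return [b for b in self.backends if self.probe(b)]

    def pick(self) -> str:
        healthy = self.healthy_backends()
        if not healthy:
            raise RuntimeError("no healthy backends available")
        backend = healthy[self._next % len(healthy)]
        self._next += 1
        return backend


if __name__ == "__main__":
    status = {"vm1": True, "vm2": False, "vm3": True}  # vm2 is failing its probe
    lb = ProbedRoundRobin(["vm1", "vm2", "vm3"], lambda b: status[b])
    print([lb.pick() for _ in range(4)])
```

The failed VM simply never appears in the rotation, and it rejoins automatically once its probe passes again.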

Application Gateway (Layer 7)

What: HTTP/HTTPS application-level router.

Route based on hostname, URL path, HTTP headers
SSL/TLS termination
Web Application Firewall (WAF) protection
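Layer 7 routing means the router reads the HTTP request before choosing a backend. A minimal sketch of hostname and URL-path routing follows; the hostnames, pool names, and the longest-prefix matching rule are all assumptions made for illustration, not Application Gateway's documented rule evaluation.

```python
from typing import Dict, Tuple

# (hostname, path prefix) -> backend pool; all names are hypothetical
RULES: Dict[Tuple[str, str], str] = {
    ("shop.example.com", "/api/"): "api-pool",
    ("shop.example.com", "/images/"): "static-pool",
    ("shop.example.com", "/"): "web-pool",
}


def route(host: str, path: str, rules: Dict[Tuple[str, str], str], default_pool: str) -> str:
    """Pick the pool whose host matches and whose path prefix is the longest match."""
    best_prefix = ""
    best_pool = default_pool
    for (rule_host, prefix), pool in rules.items():
        if rule_host == host and path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_pool = prefix, pool
    return best_pool


if __name__ == "__main__":
    for path in ("/api/orders", "/images/logo.png", "/checkout"):
        print(path, "->", route("shop.example.com", path, RULES, "fallback-pool"))
```

A Layer 4 load balancer cannot make this decision because it never looks past the TCP/UDP headers.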

Front Door (Global Layer 7)

What: Global load balancer (multi-region).

Route across regions based on latency, geography
DDoS protection
Used for multi-region failover
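Latency-based routing with failover can be reduced to one rule: send the user to the closest region that is still healthy. The sketch below assumes measured latencies and health states are already known; the region names and numbers are hypothetical.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda region: candidates[region])


if __name__ == "__main__":
    latencies = {"eastus": 20.0, "westeurope": 90.0}  # hypothetical measurements
    print(pick_region(latencies, {"eastus", "westeurope"}))  # normal operation
    print(pick_region(latencies, {"westeurope"}))            # eastus is down
```

When the nearest region fails, users are served from the next-closest one at the price of higher latency rather than an outage.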

Real-world Example: E-commerce HA Architecture

Setup:
• 3 Web VMs across 3 availability zones (99.99% SLA)
• Azure Load Balancer (Layer 4) distributes traffic
• Backend pool: all 3 VMs
• Health probe: check /health every 5 seconds
• Database: SQL Database with geo-replication (separate region)
• Storage: geo-redundant (automatic failover)

Failure Scenarios:
• 1 VM fails → LB detects unhealthy, routes to other 2 VMs (no user impact)
• Zone 1 fails → only the VM in Zone 1 goes down; LB routes to the VMs in Zones 2 and 3 (had all 3 VMs been placed in Zone 1, all of them would go down)
• Region fails → Database geo-replication + blob storage failover kick in

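Result: 99.99% SLA for compute, and the web tier survives a single zone failure. A region failure is covered for data only (database geo-replication, geo-redundant storage); to keep serving traffic, the web tier would also need a deployment in a second region, for example behind Front Door.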
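"No user impact" when a VM fails depends on how fast the probe notices. With a probe every 5 seconds, a rough upper bound on the detection window is the interval times the number of consecutive failed probes required to mark a backend unhealthy. The threshold of 2 and the traffic figure below are assumptions for illustration, and probe timeouts are ignored.

```python
def detection_window_seconds(probe_interval_s: float, failures_to_mark_down: int) -> float:
    """Rough upper bound on how long a dead backend keeps receiving traffic."""
    return probe_interval_s * failures_to_mark_down


def requests_at_risk(total_rps: float, backend_count: int, window_s: float) -> float:
    """Requests sent to the dead backend during the window, assuming even distribution."""
    return total_rps / backend_count * window_s


if __name__ == "__main__":
    window = detection_window_seconds(probe_interval_s=5, failures_to_mark_down=2)
    at_risk = requests_at_risk(total_rps=300, backend_count=3, window_s=window)
    print(f"Detection window: {window:.0f}s, requests at risk: {at_risk:.0f}")
```

A shorter interval shrinks the window but adds probe traffic and makes the load balancer more sensitive to brief hiccups.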

Design Decisions

Decision 1: Availability Set or Zone-Redundant?

Availability Set: Lower SLA (99.95%), but lower cost, lower latency
Zones: Higher SLA (99.99%), survives larger failures, slight latency/cost increase

Decision 2: Load Balancer vs Application Gateway vs Front Door?

LB: High performance, Layer 4, best for TCP/UDP
AppGW: Application-aware routing, WAF, HTTPS termination
Front Door: Multi-region, geolocation-based routing

Summary

HA: System keeps serving users through component failures, with little or no downtime
Availability Set: Multiple fault domains (same datacenter), 99.95% SLA
Availability Zone: Multiple datacenters, 99.99% SLA, higher resilience
Load Balancing: Route traffic away from failed instances
Health probes: Detect failures and remove bad instances

Interview Questions

Q: What's the difference between availability sets and availability zones?
A: Sets protect from hardware failure in same datacenter (99.95% SLA). Zones protect from entire datacenter failure (99.99% SLA, multi-datacenter).

Q: Design a highly available web app: must survive 1 zone failure with <100ms latency increase.
A: Deploy 3+ VMs across 3 zones, with Azure LB distributing traffic. If one zone fails, the other 2 zones handle the load, provided they are sized with enough spare capacity to absorb it. Zone-to-zone latency is ~5ms, so the added latency stays well under the 100ms budget. Use active-active (all zones actively serve traffic).
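A follow-up worth being ready for is how many VMs each zone needs so that the surviving zones can carry the full peak load. A minimal sizing sketch, assuming load is spread evenly and using hypothetical traffic numbers:

```python
import math


def vms_per_zone(peak_rps: float, rps_per_vm: float, zones: int, zone_failures_tolerated: int = 1) -> int:
    """VMs to run in each zone so the surviving zones can serve peak load."""
    surviving_zones = zones - zone_failures_tolerated
    if surviving_zones < 1:
        raise ValueError("at least one zone must survive")
    total_vms_needed = math.ceil(peak_rps / rps_per_vm)
    return math.ceil(total_vms_needed / surviving_zones)


if __name__ == "__main__":
    # Hypothetical workload: 900 requests/s at peak, 100 requests/s per VM
    per_zone = vms_per_zone(peak_rps=900, rps_per_vm=100, zones=3)
    print(f"{per_zone} VMs per zone, {per_zone * 3} in total")
```

With these numbers, surviving one zone failure takes more VMs in total than the load alone requires; that extra capacity is the cost of the redundancy.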

Q: When would you use an Availability Set instead of Zones?
A: For cost-sensitive or latency-sensitive apps that don't need zone-failure protection. Example: an internal app for which a 99.95% SLA is acceptable. Zones also add inter-zone data transfer costs.