Advanced | Resilience

High Availability and Scaling

Build resilient systems using Multi-AZ deployment, Auto Scaling, and load balancing for fault tolerance under pressure.

What Is It? (ELI5)

High availability means your app stays alive even when one part fails. Scaling means your app can handle more users when traffic grows.

Why Do We Need It?

Downtime impacts revenue and trust.
Traffic patterns are rarely constant.
Production systems need both resilience and elasticity.

How It Works (Technical)

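Deploy app instances across at least two Availability Zones (AZs).
Place an Application Load Balancer (ALB) or Network Load Balancer (NLB) in front of the instances.
Attach Auto Scaling policies driven by CPU utilization, request count, or a schedule.
Enable health checks so unhealthy nodes are detected and replaced automatically.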

Users -> ALB -> Auto Scaling Group (AZ-a + AZ-b)
If AZ-a fails, the load balancer keeps sending traffic to the healthy instances in AZ-b.
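
The scaling-policy step above could be set up roughly as follows. This is a minimal sketch, assuming the `web-asg` group name used elsewhere in this lesson; the policy name `cpu-target-50` and the 50% CPU target are hypothetical values chosen for illustration. The function is only defined here, so nothing changes in your account until you call it with valid credentials.

```shell
# Hypothetical values for illustration; adjust to your environment.
ASG_NAME="web-asg"
POLICY_NAME="cpu-target-50"

# Target tracking configuration: keep average CPU across the group near 50%.
TARGET_TRACKING_CONFIG='{
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ASGAverageCPUUtilization"
  },
  "TargetValue": 50.0
}'

# Defined but not invoked: call create_cpu_policy once AWS credentials are configured.
create_cpu_policy() {
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$ASG_NAME" \
    --policy-name "$POLICY_NAME" \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration "$TARGET_TRACKING_CONFIG"
}
```

Swapping the predefined metric type for `ALBRequestCountPerTarget` would scale on request count instead, though that variant also needs a `ResourceLabel` identifying the load balancer and target group.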

Hands-on

# View auto scaling groups
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[].AutoScalingGroupName" --output table

# Set desired capacity (the value must fall between the group's min and max size)
aws autoscaling set-desired-capacity --auto-scaling-group-name web-asg --desired-capacity 4
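
To confirm the group really spans more than one AZ, a small helper can tally instances per zone. This is a sketch, assuming the `web-asg` group from the commands above; `list_asg_instances` and `count_by_az` are hypothetical helper names, not AWS CLI commands.

```shell
ASG_NAME="web-asg"  # group name from the commands above

# Print one "instance-id  az  health" row per instance (requires AWS credentials).
list_asg_instances() {
  aws autoscaling describe-auto-scaling-instances \
    --query "AutoScalingInstances[?AutoScalingGroupName=='${ASG_NAME}'].[InstanceId,AvailabilityZone,HealthStatus]" \
    --output text
}

# Tally rows by their second column (the AZ) and print "az count" per line.
count_by_az() {
  awk '{ count[$2]++ } END { for (az in count) print az, count[az] }' | sort
}

# Usage (not run here): list_asg_instances | count_by_az
```

If every instance lands in a single zone, the group is not yet protected against an AZ failure, whatever its desired capacity.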

Debugging Scenario

Problem

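The Auto Scaling group is not scaling out during peak traffic.

Check that the CloudWatch alarm is bound to the right metric and that its threshold is actually being crossed.
Verify the cooldown period is not so long that it blocks further scale-out.
Check whether the ASG is already at its max size, and look for launch template errors in the scaling activity history.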
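
The checks above map onto a few CLI calls. The sketch below assumes the `web-asg` group name used earlier; the function names are hypothetical wrappers, and `at_max_size` is a plain shell comparison rather than an AWS feature.

```shell
ASG_NAME="web-asg"

# Show min size, max size, desired capacity, and default cooldown (seconds).
show_asg_limits() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$ASG_NAME" \
    --query "AutoScalingGroups[0].[MinSize,MaxSize,DesiredCapacity,DefaultCooldown]" \
    --output text
}

# Show recent scaling activities; failed launches surface in StatusCode/StatusMessage.
show_recent_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$ASG_NAME" \
    --max-items 5 \
    --query "Activities[].[StartTime,StatusCode,StatusMessage]" \
    --output table
}

# List the scaling policies on the group and the CloudWatch alarms bound to them.
show_policies() {
  aws autoscaling describe-policies \
    --auto-scaling-group-name "$ASG_NAME" \
    --query "ScalingPolicies[].[PolicyName,PolicyType,Alarms[].AlarmName]" \
    --output json
}

# Return success (0) when desired capacity has already reached the max size.
at_max_size() {
  desired="$1"
  max="$2"
  [ "$desired" -ge "$max" ]
}
```

For instance, `at_max_size 4 4 && echo "raise the max size"` prints the hint, because a group already at its ceiling cannot scale out no matter what the alarm says.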

Interview Questions

Beginner: What is Multi-AZ deployment?
Deploying resources across multiple availability zones in one region.

Intermediate: ALB vs NLB?
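ALB works at Layer 7 and routes HTTP/HTTPS requests by host, path, or headers; NLB works at Layer 4 and forwards TCP/UDP connections with high throughput and low latency.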

Scenario: One AZ goes down during Black Friday. How should the system behave?
The ALB stops routing to the failed AZ and sends traffic to healthy targets, the ASG replaces the lost capacity in the remaining AZ, and user impact stays minimal.
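
Keeping user impact minimal depends on the surviving AZs having enough capacity. One common way to reason about it, shown here as a back-of-the-envelope sketch rather than an AWS formula, is to size each AZ so that the remaining zones can carry peak load on their own. The function name `instances_per_az` and the example numbers are hypothetical.

```shell
# instances_per_az PEAK AZS
# PEAK: instances needed to serve peak traffic; AZS: number of AZs (must be >= 2).
# Prints ceil(PEAK / (AZS - 1)), so losing one AZ still leaves PEAK instances.
instances_per_az() {
  peak="$1"
  azs="$2"
  if [ "$azs" -lt 2 ]; then
    echo "need at least 2 AZs" >&2
    return 1
  fi
  # Integer ceiling division: (a + b - 1) / b with b = azs - 1.
  echo $(( (peak + azs - 2) / (azs - 1) ))
}
```

With a peak of 4 instances, `instances_per_az 4 2` prints 4 (8 in total), while `instances_per_az 4 3` prints 2 (6 in total), which is one reason spreading across three AZs can cost less than two for the same failure tolerance.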

Real-world Usage

Retail platforms combine ALB, ASG, and Multi-AZ managed databases to keep checkout available during sale spikes.

Summary

HA keeps a single point of failure from causing an outage.
Scaling protects user experience during demand surges.
Health checks and automation make resilience practical.