High Availability and Scaling
Build resilient systems using Multi-AZ deployment, Auto Scaling, and load balancing for fault tolerance under pressure.
What Is It? (ELI5)
High availability means your app stays alive even when one part fails. Scaling means your app can handle more users when traffic grows.
Why Do We Need It?
- Downtime impacts revenue and trust.
- Traffic patterns are rarely constant.
- Production systems need both resilience and elasticity.
How It Works (Technical)
- Deploy app instances across at least two Availability Zones (AZs).
- Place a load balancer (ALB or NLB) in front of the instances.
- Drive Auto Scaling with policies based on CPU, request count, or schedules.
- Use health checks so that unhealthy nodes are replaced automatically.
Users -> ALB -> Auto Scaling Group (AZ-a + AZ-b)
If AZ-a fails, traffic continues to AZ-b.
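The scaling-policy step above can be sketched with the AWS CLI. This is a minimal sketch, assuming an existing Auto Scaling group (called web-asg here) and an average-CPU target of 50%. The function names, the policy name, and the target value are illustrative assumptions; only the `aws autoscaling put-scaling-policy` command and the `ASGAverageCPUUtilization` predefined metric come from AWS itself.

```shell
# Build the JSON body for a CPU-based target tracking policy.
# $1 = target average CPU utilization in percent (assumed value, e.g. 50).
target_tracking_config() {
  printf '{"TargetValue": %s, "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"}}\n' "$1"
}

# Attach the policy to an Auto Scaling group.
# Requires configured AWS credentials; not executed when the file is sourced.
# $1 = ASG name (hypothetical, e.g. web-asg), $2 = target CPU percent.
attach_cpu_policy() {
  local asg_name="$1" target_cpu="$2"
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$asg_name" \
    --policy-name "cpu-target-${target_cpu}" \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration "$(target_tracking_config "$target_cpu")"
}
```

Calling `attach_cpu_policy web-asg 50` would ask Auto Scaling to add or remove instances so that average CPU stays near 50%. Target tracking is only one of the policy types; step scaling or scheduled actions may fit better for known traffic patterns.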
Hands-on
# View auto scaling groups
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[].AutoScalingGroupName" --output table
# Set desired capacity
aws autoscaling set-desired-capacity --auto-scaling-group-name web-asg --desired-capacity 4
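To confirm that instances are actually spread across AZs and passing health checks, something like the following may help. It is a sketch under assumptions: the helper names are made up for illustration, and the group name passed in (for example web-asg) is whatever your environment uses.

```shell
# List instance ID, AZ, and health status for one Auto Scaling group.
# Requires configured AWS credentials. $1 = ASG name (e.g. web-asg).
list_asg_instances() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query "AutoScalingGroups[0].Instances[].[InstanceId,AvailabilityZone,HealthStatus]" \
    --output text
}

# Count healthy instances per AZ.
# Reads whitespace-separated "<instance-id> <az> <health>" lines on stdin.
healthy_per_az() {
  awk '$3 == "Healthy" { n[$2]++ } END { for (az in n) print az, n[az] }' | sort
}
```

Piping one into the other (`list_asg_instances web-asg | healthy_per_az`) prints one line per AZ with its healthy instance count. An AZ that is missing from the output has no healthy instances.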
Debugging Scenario
Problem
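The Auto Scaling group is not scaling out during peak traffic.
- Check that the CloudWatch alarm is bound to the right metric and threshold, and that it actually reaches the ALARM state.
- Verify that cooldown periods are not too long.
- Check that the ASG has not reached its max size and that launches are not failing because of launch template errors.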
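The checks above map onto a few AWS CLI calls. The helpers below are a rough sketch, not a complete diagnostic: the function names are hypothetical, and the group name is assumed to be passed in by the caller.

```shell
# How many more instances the group may launch before hitting max size.
# $1 = desired capacity, $2 = max size. Pure arithmetic, no AWS call.
scale_out_headroom() {
  local desired="$1" max_size="$2"
  echo $(( max_size - desired ))
}

# Print desired capacity, max size, and default cooldown (seconds) for a group.
# Requires configured AWS credentials. $1 = ASG name.
check_asg_limits() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query "AutoScalingGroups[0].[DesiredCapacity,MaxSize,DefaultCooldown]" \
    --output text
}

# Show the latest scaling activities; failed launches (for example from a
# broken launch template) appear with a Failed status and a status message.
recent_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --max-items 5 \
    --query "Activities[].[StartTime,StatusCode,StatusMessage]" \
    --output table
}

# List CloudWatch alarms currently in the ALARM state.
alarms_in_alarm() {
  aws cloudwatch describe-alarms \
    --state-value ALARM \
    --query "MetricAlarms[].[AlarmName,StateReason]" \
    --output table
}
```

A headroom of 0 means the group is pinned at its max size, so no policy can scale it out until the limit is raised. If `alarms_in_alarm` never lists the scaling alarm during peak traffic, the metric or threshold binding is the more likely cause.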
Interview Questions
Beginner: What is Multi-AZ deployment?
Deploying resources across multiple availability zones in one region.
Intermediate: ALB vs NLB?
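An ALB works at Layer 7 and routes HTTP/HTTPS requests by content such as host or path; an NLB works at Layer 4 and forwards TCP/UDP traffic with high throughput and low latency.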
Scenario: One AZ goes down during Black Friday. How should the system behave?
The ALB routes traffic only to the healthy AZ, the ASG rebalances capacity by launching replacement instances there, and user impact stays minimal.
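User impact stays minimal only if the surviving AZs can absorb peak load while replacements launch. The arithmetic can be sketched as below. This is a simplified model that assumes identical instances and evenly spread traffic; the function name and the example numbers are illustrative.

```shell
# Instances to keep in EACH AZ so that the remaining AZs can still carry
# peak load after one AZ is lost.
# $1 = instances needed at peak, $2 = number of AZs in use (must be >= 2).
per_az_capacity() {
  local peak_instances="$1" az_count="$2"
  if [ "$az_count" -lt 2 ]; then
    echo "need at least 2 AZs to survive an AZ failure" >&2
    return 1
  fi
  local surviving=$(( az_count - 1 ))
  # Ceiling division using integer arithmetic only.
  echo $(( (peak_instances + surviving - 1) / surviving ))
}
```

For a peak that needs 8 instances, `per_az_capacity 8 2` gives 8 per AZ (16 in total), while `per_az_capacity 8 3` gives 4 per AZ (12 in total). Spreading across more AZs lowers the spare capacity needed to ride out a single-AZ failure.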
Real-world Usage
Retail platforms combine ALB, ASG, and Multi-AZ managed databases to keep checkout available during sale spikes.
Summary
- HA keeps a single point of failure from causing an outage.
- Scaling protects user experience during demand surges.
- Health checks and automation make resilience practical.