Advanced Lesson 7

Chaos Engineering and Resilience Testing

ELI5 Explanation

Chaos engineering is fire-drill practice for software. You break small things on purpose to prove big things keep working.

Technical Explanation

A chaos experiment starts from a steady-state hypothesis tied to an SLO. You inject a controlled fault, such as pod termination, network latency, a dependency timeout, or a zone failure, then measure the impact against that SLO. If the blast radius grows beyond the agreed guardrails, abort the experiment and remove the injected fault.

Tip: Start with low-risk experiments in staging, then move to a limited production scope with strict abort conditions.
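
The abort decision described above can be reduced to a small, testable check. The sketch below is only an illustration under stated assumptions: the function name should_abort, its argument order, and the threshold values in the usage note are hypothetical, and in practice the observed numbers would come from whatever monitoring system tracks the SLO.

```shell
# should_abort ERROR_RATE_PCT P99_MS MAX_ERROR_RATE_PCT MAX_P99_MS
# Hypothetical helper: prints "abort" when either observed metric
# exceeds its guardrail, otherwise prints "continue".
should_abort() {
  awk -v err="$1" -v p99="$2" -v max_err="$3" -v max_p99="$4" 'BEGIN {
    # Adding 0 forces numeric comparison rather than string comparison.
    if (err + 0 > max_err + 0 || p99 + 0 > max_p99 + 0)
      print "abort"
    else
      print "continue"
  }'
}
```

For example, should_abort 0.2 310 1.0 500 prints continue, while should_abort 1.5 310 1.0 500 prints abort because the error rate exceeds the 1.0% guardrail. An experiment runner could call this on each polling interval and remove the fault as soon as it prints abort.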

Visual

Hypothesis → Inject Fault → Observe SLO → Improve Controls

Hands-on Commands

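Controlled failure test examples
# Terminate exactly one checkout pod.
# Deleting with only the label selector would remove every matching pod at once.
POD=$(kubectl get pods -n prod -l app=checkout -o jsonpath='{.items[0].metadata.name}')
# --grace-period=0 --force skips the graceful shutdown window to mimic an abrupt kill
kubectl delete pod "$POD" -n prod --grace-period=0 --force

# Observe the replacement pod and its readiness
kubectl get pods -n prod -l app=checkout -w
kubectl describe deploy checkout -n prod

# Validate that services still have ready endpoints during chaos
kubectl get svc -n prod
kubectl get endpoints -n prod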
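
Before terminating a pod, it can help to confirm that the deployment has spare capacity, so the experiment does not take out the only healthy replica. The sketch below is one possible precheck, not a standard tool: the function name safe_to_kill_one and the rule "at least two replicas, all ready" are assumptions chosen for illustration. The commented wiring reads the Deployment's spec.replicas and status.readyReplicas fields.

```shell
# safe_to_kill_one DESIRED_REPLICAS READY_REPLICAS
# Hypothetical precheck: prints "proceed" only when the deployment has at
# least two replicas and all of them are ready, otherwise prints "skip".
safe_to_kill_one() {
  desired_replicas=${1:-0}
  # status.readyReplicas is omitted when no pod is ready, so default to 0.
  ready_replicas=${2:-0}
  if [ "$desired_replicas" -ge 2 ] && [ "$ready_replicas" -eq "$desired_replicas" ]; then
    echo "proceed"
  else
    echo "skip"
  fi
}

# Example wiring (shown as comments so the sketch has no side effects):
#   desired=$(kubectl get deploy checkout -n prod -o jsonpath='{.spec.replicas}')
#   ready=$(kubectl get deploy checkout -n prod -o jsonpath='{.status.readyReplicas}')
#   safe_to_kill_one "$desired" "$ready"
```

With three desired and three ready replicas the function prints proceed. With a single replica, or with any replica not ready, it prints skip.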

Debugging Scenario

A chaos test injects 400 ms of latency on the payment dependency. Availability stays inside its SLO, but the latency SLO's error budget burns quickly. The team adds a circuit breaker, tunes timeouts, and adds a fallback workflow before expanding the experiment's scope.
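
One way to quantify "burns quickly" is a burn rate: the observed fraction of slow requests divided by the fraction the latency SLO allows. The sketch below assumes a latency SLO expressed as a percentage of requests that must finish under some threshold. The function name burn_rate and the 99% target used in the usage note are hypothetical, not values from the scenario.

```shell
# burn_rate SLOW_REQUESTS TOTAL_REQUESTS SLO_TARGET_PCT
# Hypothetical helper: prints how many times faster than allowed the
# latency error budget is being consumed (1.0 means exactly on budget).
burn_rate() {
  awk -v slow="$1" -v total="$2" -v target="$3" 'BEGIN {
    # Guard against division by zero: no traffic, or a 100% target.
    if (total + 0 <= 0 || target + 0 >= 100) { print "n/a"; exit }
    allowed = (100 - target) / 100   # fraction of requests allowed to be slow
    observed = slow / total          # fraction of requests that were slow
    printf "%.1f\n", observed / allowed
  }'
}
```

For example, burn_rate 50 1000 99 prints 5.0: 5% of requests were slow against a 1% allowance, so the budget is burning five times faster than the SLO permits. A value well above 1.0 during an experiment is a reasonable signal to pause and harden the dependency path before widening the scope.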

Beginner

  • What is chaos engineering?
  • Why is steady state important?
  • How is chaos different from random testing?
  • What is blast radius?
  • When should an experiment stop?

Intermediate

  • How do you choose safe chaos experiments?
  • What prechecks are needed before production chaos?
  • How do you tie chaos to error budgets?
  • How do you validate observability readiness for chaos?
  • How do you convince stakeholders to adopt chaos testing?

Scenario-based

  • A chaos test causes unexpected customer impact. What immediate actions do you take?
  • Experiments pass in staging but fail in production. Why?
  • Teams fear running chaos experiments during peak periods. How do you schedule them safely?
  • A single service is resilient, but the system as a whole still fails. What architectural issue likely exists?
  • An experiment uncovers a vendor dependency bottleneck. What resilience options exist?

Real-world Use Case

A travel booking company ran weekly chaos tests on its payment and inventory dependencies. Incidents caused by dependency outages became less frequent after the team introduced fallback paths and timeout budgets.
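
One common reading of a timeout budget is a total deadline for the whole request, from which each dependency call may spend only what remains. The sketch below illustrates that idea under this assumption. The function name call_timeout_ms, the per-call cap, and the millisecond values are hypothetical and are not taken from the company described above.

```shell
# call_timeout_ms TOTAL_BUDGET_MS ELAPSED_MS PER_CALL_CAP_MS
# Hypothetical helper: prints the timeout to use for the next dependency
# call, which is the smaller of the per-call cap and the remaining budget.
# Prints 0 when the budget is exhausted, meaning "skip the call and fall back".
call_timeout_ms() {
  remaining_ms=$(( $1 - $2 ))
  if [ "$remaining_ms" -le 0 ]; then
    echo 0
  elif [ "$remaining_ms" -lt "$3" ]; then
    echo "$remaining_ms"
  else
    echo "$3"
  fi
}
```

With a 1000 ms budget and a 500 ms cap, call_timeout_ms 1000 200 500 prints 500, call_timeout_ms 1000 700 500 prints 300, and call_timeout_ms 1000 1000 500 prints 0. A result of 0 is the point where a fallback path would take over instead of waiting on the slow dependency.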

Summary

Chaos engineering turns unknown failure modes into known, tested behavior. Next, you will apply all SRE concepts in realistic outage scenarios.