Hands-on Lesson 8

Real-world SRE Scenarios

ELI5 Explanation

This lesson is your reliability simulator. You respond to realistic outages like a production SRE team.

Technical Explanation

These scenarios train triage speed, prioritization, communication, mitigation choice, and follow-up actions. Use SLO burn rate, blast radius, and customer impact to guide decisions.
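
As a rough illustration of how SLO burn rate can guide a decision, the sketch below divides the observed error ratio by the error budget implied by the SLO target. The function names and the 30-day budget window are assumptions made for this example, not part of the lesson.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than planned the error budget is being spent.

    error_ratio: fraction of failed requests in the observed window (0.0 to 1.0).
    slo_target:  availability objective as a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    error_budget = 1.0 - slo_target  # allowed failure fraction
    return error_ratio / error_budget


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone if the burn rate holds.

    window_hours is the SLO window; 30 days is an assumed default.
    """
    if rate <= 0:
        return float("inf")
    return window_hours / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.01, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.0f} h")
```

A burn rate of 1 means the budget would last exactly one window. A higher value shortens that time proportionally, which is one way to justify paging now versus opening a ticket.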

Visual

Detect → Triage → Mitigate → Recover + Learn

Hands-on Scenarios

  1. Checkout latency jumps from 250 ms to 2.5 s after deployment.
  2. Database replica lag causes stale reads in cart service.
  3. Message queue backlog grows until workers crash-loop.
  4. Third-party payment API returns intermittent 502.
  5. Kubernetes node pressure evicts critical pods.
  6. Certificate expiration causes TLS handshake failures.
  7. Bad feature flag activates expensive query path.
  8. Region-wide network disruption triggers failover.
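
Incident triage command pack

# List pods in every namespace to spot crash loops and pending pods
kubectl get pods -A
# Show per-pod CPU and memory usage (requires metrics-server)
kubectl top pods -A
# Show cluster events in chronological order
kubectl get events -A --sort-by=.metadata.creationTimestamp
# Read the last 20 minutes of logs from the checkout deployment
kubectl logs -n prod deploy/checkout --since=20m
# List past revisions of the checkout deployment before deciding on a rollback
kubectl rollout history deploy/checkout -n prod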
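
For scenario 6 in the list above, a small check like the following can turn a certificate's expiry string into days remaining. It is a minimal sketch: the helper name `days_until_expiry` and the sample date are hypothetical, and it assumes the `notAfter` value is in the format that Python's `ssl` module reports for peer certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return days between `now` and a certificate's notAfter timestamp.

    not_after: expiry string such as "Jan  5 09:34:43 2018 GMT".
    now:       timezone-aware datetime to compare against.
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires_at - now.timestamp()) / 86400.0


if __name__ == "__main__":
    # Hypothetical expiry value used only for illustration.
    remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", datetime.now(timezone.utc))
    print(f"days remaining: {remaining:.1f}")
```

Running a check like this on a schedule and alerting well before the value reaches zero is one way to keep this scenario from becoming an incident at all.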

Debugging Pattern

Use an incident worksheet: impact, affected components, change timeline, hypothesis, mitigation options, risk of rollback, and communication cadence.
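
The worksheet fields above could be captured in a small data structure so that gaps are visible during an incident. This is only a sketch: the class name, field names, and the `missing_fields` helper are assumptions made for illustration, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentWorksheet:
    """One field per worksheet item named in the debugging pattern."""
    impact: str = ""
    affected_components: List[str] = field(default_factory=list)
    change_timeline: List[str] = field(default_factory=list)
    hypothesis: str = ""
    mitigation_options: List[str] = field(default_factory=list)
    rollback_risk: str = ""
    comms_cadence_minutes: int = 0  # 0 means no cadence agreed yet

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    ws = IncidentWorksheet(
        impact="Checkout latency rose from 250 ms to 2.5 s",
        affected_components=["checkout"],
    )
    print("still to fill in:", ws.missing_fields())
```

Listing the empty fields gives the incident lead a quick prompt for what to ask next, for example the rollback risk before a rollback is attempted.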

Beginner

  • What is the first step when an alert fires?
  • How do you classify severity?
  • When do you roll back?
  • What is customer-impact communication?
  • Why do timelines matter?

Intermediate

  • How do you decide between failover and degrade mode?
  • How do you detect noisy but harmless alerts?
  • How do you split incident response roles?
  • How do you reduce MTTR in repeated incidents?
  • How do you turn scenario drills into engineering backlog?

Scenario-based

  • Error rate spikes only in one geography. What do you inspect first?
  • All dashboards are green but customers complain. How do you investigate?
  • Rollback fails due to schema drift. What is Plan B?
  • Incident ends but root cause is unclear. How do you proceed?
  • You have two simultaneous major incidents. How do you prioritize?

Real-world Use Case

A retail platform runs monthly outage game days using scenario scorecards. The drills have improved incident coordination across SRE, application, and database engineers and reduced customer impact during peak seasons.

Summary

Scenario practice builds speed, confidence, and consistency under stress. Next, consolidate your learning with interview preparation focused on real SRE decision-making.