Hands-on · Lesson 15 of 16

Debugging Windows + IIS Production Incidents

Real incident drills using an L1/L2/L3 escalation approach, with evidence capture and communication hygiene.

🧒 Simple Explanation (ELI5)

Fix fast, then find why, then make sure it never happens again.

🔧 Why Do We Need It?

Protect uptime and revenue.
Reduce panic-driven fixes.
Improve handoffs between teams.
Create reusable runbooks.

🌍 Real-world Analogy

Like firefighting: contain the fire, investigate the cause, then fireproof the building.

⚙️ Technical Explanation

Incident handling should produce five things: a timeline, the blast radius, the mitigation applied, the root cause, and preventive actions. Evidence sources include IIS logs, Windows event logs, dump files, deployment records, and dependency metrics.
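As one way to turn IIS logs into timeline evidence, the sketch below counts responses per minute and status code from W3C-format log lines. The function name `Get-IisStatusSummary` and the sample lines are illustrative assumptions, not part of IIS. It reads the column order from the `#Fields:` header instead of assuming it, and W3C timestamps are in UTC by default.

```powershell
function Get-IisStatusSummary {
    param([string[]] $LogLines)

    $iDate = -1; $iTime = -1; $iStatus = -1; $fieldCount = 0
    $counts = @{}
    foreach ($line in $LogLines) {
        if ([string]::IsNullOrWhiteSpace($line)) { continue }
        if ($line.StartsWith('#Fields:')) {
            # The log declares its own column order; look the columns up by name.
            $fields     = $line.Substring(8).Trim() -split ' '
            $fieldCount = $fields.Count
            $iDate      = [array]::IndexOf($fields, 'date')
            $iTime      = [array]::IndexOf($fields, 'time')
            $iStatus    = [array]::IndexOf($fields, 'sc-status')
            continue
        }
        if ($line.StartsWith('#')) { continue }   # other header lines

        $values = $line -split ' '
        if ($iDate -lt 0 -or $iTime -lt 0 -or $iStatus -lt 0) { continue }
        if ($values.Count -ne $fieldCount) { continue }   # skip malformed rows

        # Bucket by minute (HH:mm) and status code.
        $key = '{0} {1} {2}' -f $values[$iDate], $values[$iTime].Substring(0, 5), $values[$iStatus]
        $counts[$key] = 1 + [int]$counts[$key]
    }

    $counts.GetEnumerator() | Sort-Object Key | ForEach-Object {
        $parts = $_.Key -split ' '
        [pscustomobject]@{
            Minute = '{0} {1}' -f $parts[0], $parts[1]
            Status = [int]$parts[2]
            Count  = $_.Value
        }
    }
}

# Hypothetical sample lines; real logs usually carry more fields.
$sample = @(
    '#Fields: date time cs-method cs-uri-stem sc-status time-taken'
    '2024-05-01 10:00:05 GET /api/pay 200 120'
    '2024-05-01 10:00:41 GET /api/pay 503 15'
    '2024-05-01 10:01:02 POST /api/pay 503 12'
    '2024-05-01 10:01:30 POST /api/pay 503 9'
)
Get-IisStatusSummary -LogLines $sample | Format-Table
```

Against a real log file, the same function could be fed with `Get-Content` on a file from the site's log folder.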

📊 Visual Representation

Incident Lifecycle

Detect: Alert / User report
→ Triage: Scope + Severity
→ Mitigate: Pool/node actions
→ Root Cause: Logs + Dumps
→ Prevent: Fix + guardrails

⌨️ Commands / Syntax

cmd/PowerShell

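# appcmd.exe lives in %windir%\System32\inetsrv and is not on PATH by default.
# Run from an elevated prompt in that folder, or call it by full path.

# Scope checks: sites, app pools, and running worker processes (w3wp PIDs)
appcmd list site
appcmd list apppool
appcmd list wp

# Fast mitigation: recycle only the affected pool ("critical-api" is an example name)
appcmd recycle apppool /apppool.name:"critical-api"

# Evidence: last 2 hours of Application events (PowerShell), then HTTP.sys request queues
Get-WinEvent -FilterHashtable @{LogName='Application'; StartTime=(Get-Date).AddHours(-2)} -MaxEvents 200
netsh http show servicestate view=requestq

# Node operations (if behind a load balancer)
# Drain the node from the LB before heavy actions such as dumps or restarts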
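Because a recycle wipes the state of the failing worker process, it can help to snapshot evidence first. The following is a minimal sketch that wraps the commands above into one function. `Save-IncidentEvidence`, the output folder name, and the file names are assumptions made for illustration. It needs an elevated PowerShell session on the IIS server.

```powershell
function Save-IncidentEvidence {
    param(
        # Hypothetical default location; point this at a share your team agrees on.
        [string] $OutDir = (Join-Path $env:TEMP ('incident-' + (Get-Date -Format 'yyyyMMdd-HHmmss'))),
        [int] $LookbackHours = 2
    )
    $appcmd = Join-Path $env:windir 'System32\inetsrv\appcmd.exe'
    New-Item -ItemType Directory -Path $OutDir -Force | Out-Null
    $since = (Get-Date).AddHours(-$LookbackHours)

    # Which worker process (PID) is serving which app pool right now.
    & $appcmd list wp | Out-File (Join-Path $OutDir 'worker-processes.txt')

    # HTTP.sys request queue state.
    netsh http show servicestate view=requestq | Out-File (Join-Path $OutDir 'http-requestq.txt')

    # Application and System events from the lookback window.
    # Get-WinEvent raises an error when nothing matches, so that case is silenced.
    foreach ($log in 'Application', 'System') {
        Get-WinEvent -FilterHashtable @{ LogName = $log; StartTime = $since } -MaxEvents 500 -ErrorAction SilentlyContinue |
            Select-Object TimeCreated, Id, LevelDisplayName, ProviderName, Message |
            Export-Csv (Join-Path $OutDir "$log-events.csv") -NoTypeInformation
    }

    $OutDir   # return the folder so it can be pasted into the incident timeline
}
```

Calling `Save-IncidentEvidence` before the `appcmd recycle apppool` step keeps a record of the pre-mitigation state. Memory dumps are not covered here and would need a separate tool.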

💼 Example (Real-world Use Case)

During a payment outage, the team drained one node from the load balancer, captured dumps, recycled only the failing app pool, and restored service while preserving forensic evidence.

🧪 Hands-on

Run a mock incident with 503 errors.
Assign roles: commander, comms, fixer, scribe.
Apply a targeted mitigation (for example, recycle one app pool) rather than a full IIS or server reset.
Capture artifacts and timeline.
Write 5-whys and prevention actions.
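For the scribe role in the drill above, a tiny helper can keep timeline entries in one consistent, UTC-stamped format. This is only a sketch: `New-TimelineEntry` and the entry layout are assumptions, not a standard.

```powershell
function New-TimelineEntry {
    param(
        [string] $Role,
        [string] $Note,
        [datetime] $At = (Get-Date)
    )
    # Normalise to UTC so entries from different people and servers line up
    # with IIS logs, which are UTC by default.
    '{0:yyyy-MM-dd HH:mm:ss}Z [{1}] {2}' -f $At.ToUniversalTime(), $Role, $Note
}

# Example entries for a mock 503 incident.
New-TimelineEntry -Role 'scribe' -Note '503s reported on critical-api'
New-TimelineEntry -Role 'fixer'  -Note 'Recycled app pool critical-api'
```

Piping the output to `Add-Content` on a shared file gives an append-only timeline that feeds directly into the 5-whys write-up.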

💡

Communication Rule

Status updates every 15 minutes with impact, actions, and ETA reduce chaos.
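The rule above can be sketched as a small template function that always carries impact, actions, ETA, and the time of the next update. `Format-StatusUpdate` and the message layout are illustrative assumptions; adapt the wording to your own comms channel.

```powershell
function Format-StatusUpdate {
    param(
        [string] $Impact,
        [string[]] $Actions,
        [string] $Eta,
        [datetime] $SentAtUtc = (Get-Date).ToUniversalTime(),
        [int] $IntervalMinutes = 15
    )
    $lines = @(
        ('STATUS {0:HH:mm} UTC' -f $SentAtUtc)
        "Impact: $Impact"
        ('Actions: ' + ($Actions -join '; '))
        "ETA: $Eta"
        # Committing to the next update time is what keeps the cadence honest.
        ('Next update: {0:HH:mm} UTC' -f $SentAtUtc.AddMinutes($IntervalMinutes))
    )
    $lines -join [Environment]::NewLine
}

Format-StatusUpdate -Impact 'Payments failing with 503 for some users' `
    -Actions 'Node drained from LB', 'Dumps captured', 'Recycling critical-api pool' `
    -Eta 'Unknown, next assessment at next update'
```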

🐛 Debugging Scenario

Failure: Random 500 spikes after deployment.

Compare old/new configs and binaries.
Use canary rollback on affected nodes.
Correlate with DB and cache latency.
Capture dump under load.
Patch and redeploy with guard checks.
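For the first step, comparing old and new binaries and configs, one hedged approach is to hash every file in the previous and current deployment folders and list what was added, removed, or changed. `Compare-DeploymentFolders` and the folder paths in the example are hypothetical.

```powershell
function Compare-DeploymentFolders {
    param([string] $OldPath, [string] $NewPath)

    # Build a map of relative path -> SHA256 hash for one folder tree.
    function Get-HashMap([string] $Root) {
        $map = @{}
        $rootFull = (Resolve-Path -LiteralPath $Root).Path.TrimEnd('\', '/')
        Get-ChildItem -LiteralPath $rootFull -Recurse -File | ForEach-Object {
            $relative = $_.FullName.Substring($rootFull.Length + 1)
            $map[$relative] = (Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256).Hash
        }
        $map
    }

    $old = Get-HashMap $OldPath
    $new = Get-HashMap $NewPath
    $names = @($old.Keys) + @($new.Keys) | Sort-Object -Unique

    foreach ($name in $names) {
        $state = $null
        if (-not $old.ContainsKey($name)) {
            $state = 'Added'
        } elseif (-not $new.ContainsKey($name)) {
            $state = 'Removed'
        } elseif ($old[$name] -ne $new[$name]) {
            $state = 'Changed'
        }
        # Unchanged files are not reported.
        if ($state) { [pscustomobject]@{ File = $name; State = $state } }
    }
}

# Example call (placeholder paths, adjust to where releases are kept):
# Compare-DeploymentFolders -OldPath 'D:\releases\previous' -NewPath 'D:\releases\current'
```

An unexpected `Changed` entry for a config file or a dependency DLL is often a quicker lead than reading the full deployment log.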

🎯 Interview Questions

Beginner

What is the first objective in an incident?

Restore customer service safely.

Why keep timeline notes?

Needed for accurate RCA and auditability.

What is blast radius?

The scope of impact: which users and services are affected, and how many.

Why avoid simultaneous unknown changes?

It destroys causality and increases risk.

What is a rollback trigger?

Sustained customer impact beyond agreed threshold.
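The word "sustained" in the rollback trigger can be made concrete. As a minimal sketch, assume the error rate is sampled once per minute and that the agreed threshold is 5% for 3 consecutive samples. Both numbers and the name `Test-RollbackTrigger` are assumptions for illustration only.

```powershell
function Test-RollbackTrigger {
    param(
        [double[]] $ErrorRates,          # one sample per interval, e.g. 0.08 = 8% errors
        [double] $Threshold = 0.05,
        [int] $SustainedSamples = 3
    )
    $streak = 0
    foreach ($rate in $ErrorRates) {
        # A single good sample resets the streak, so short spikes do not trigger.
        if ($rate -gt $Threshold) { $streak++ } else { $streak = 0 }
        if ($streak -ge $SustainedSamples) { return $true }
    }
    return $false
}

Test-RollbackTrigger -ErrorRates 0.01, 0.08, 0.02, 0.09   # spikes, but not sustained
Test-RollbackTrigger -ErrorRates 0.01, 0.06, 0.07, 0.09   # three in a row above 5%
```

Agreeing on these numbers before the deployment removes the debate from the bridge call.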

Intermediate

How do you preserve evidence while mitigating?

Capture logs/dumps first where feasible, then targeted recycle.

When should you escalate to L3?

When issues recur, the app pool is crash-looping, or there is no clear infrastructure cause.

What does a good postmortem include?

Timeline, root cause, contributing factors, fixes, owners, due dates.

How to reduce MTTR structurally?

Runbooks, dashboards, synthetic checks, rehearsal drills.

How to handle noisy alerts?

Tune thresholds, deduplicate, and define symptom-based alerts.

Scenario-based

Bridge call disagreement on cause.

Use evidence checkpoints and assign parallel owners by layer.

Outage during change freeze.

Apply emergency change process with clear approvals/logging.

Recurring weekly incident.

Look for scheduled job/recycle/dependency maintenance overlap.

Canary healthy, one region failing.

Investigate region-specific DNS/LB/firewall or dependency endpoint.

No repro in staging.

Capture production traffic profile and environment parity gaps.
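For the recurring weekly incident scenario above, one quick check is to bucket past incident start times by weekday and hour and see whether a single slot dominates. The sketch below assumes you already have the start times from tickets or alerts; `Get-IncidentRecurrence` and the sample dates are made up for illustration.

```powershell
function Get-IncidentRecurrence {
    param([datetime[]] $StartTimes)

    $StartTimes |
        # Slot looks like "Wednesday 02:00".
        Group-Object { '{0} {1:00}:00' -f $_.DayOfWeek, $_.Hour } |
        Sort-Object Count -Descending |
        Select-Object @{ Name = 'Slot'; Expression = { $_.Name } }, Count
}

# Hypothetical incident start times.
$times = @(
    [datetime]'2024-05-01T02:05:00'
    [datetime]'2024-05-08T02:11:00'
    [datetime]'2024-05-10T14:30:00'
    [datetime]'2024-05-15T02:02:00'
)
Get-IncidentRecurrence -StartTimes $times | Format-Table
```

A dominant slot can then be compared against scheduled tasks, app pool recycle schedules, and dependency maintenance windows.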

🌐 Real-world Usage

This operating model is used by SRE/NOC teams handling high-severity incidents.

📝 Summary

Great incident response is structured, evidence-based, and prevention-focused.
