L1: restore access quickly, identify impacted scope, capture evidence. L2: isolate failing pool/module/config and apply targeted mitigation. L3: dump analysis, code-level root cause, preventive engineering changes.
IIS Troubleshooting: 500 Errors, App Pool Crashes, and High CPU
Structured L1/L2/L3 troubleshooting workflow for IIS production incidents: fast triage, deep diagnostics, and durable remediation for recurring outages.
🧒 Simple Explanation (ELI5)
When IIS is sick, don't guess. Check the signs in this order: Is the server reachable? Is the site running? Is the app pool running? What error code appears? What do the logs say? Then fix exactly that layer.
🔧 Why Do We Need It?
- Reduce MTTR during outages.
- Avoid harmful "restart everything" patterns.
- Differentiate infra issues from code bugs quickly.
- Build repeatable runbooks for on-call teams.
🌍 Real-world Analogy
Like emergency medicine triage: stabilize first (restore service), diagnose second (find exact cause), prevent relapse third (permanent fix).
⚙️ Technical Explanation
Common outage patterns:
- 500.19: invalid/locked/unreadable config.
- 500.21: module/handler not installed.
- 502.5: ASP.NET Core startup failure via ANCM.
- 503: app pool stopped/disabled or queue overload.
- High CPU: runaway request loops, expensive queries, thread starvation, or external dependency delays causing retries.
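As a rough illustration, the status patterns above can be kept in a small lookup table so that a status and substatus pair from a log line resolves to a first hypothesis. This is a minimal sketch: the names `OUTAGE_PATTERNS` and `likely_cause` are hypothetical, and the table contains only the codes listed above.

```python
# Lookup table built only from the outage patterns listed in this section.
OUTAGE_PATTERNS = {
    "500.19": "invalid, locked, or unreadable config",
    "500.21": "module/handler not installed",
    "502.5": "ASP.NET Core startup failure via ANCM",
    "503": "app pool stopped/disabled or queue overload",
}


def likely_cause(status: int, substatus: int = 0) -> str:
    """Return a first hypothesis for an HTTP status/substatus pair.

    The exact "status.substatus" key is tried first, then the bare status.
    """
    for key in (f"{status}.{substatus}", str(status)):
        if key in OUTAGE_PATTERNS:
            return OUTAGE_PATTERNS[key]
    return "not in the table; check logs and Event Viewer"
```

A call such as `likely_cause(500, 19)` returns the config hypothesis, while an unlisted code falls through to the generic advice. The result is a starting point for triage, not a diagnosis.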
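Rapid-fail protection disables an app pool after repeated worker-process crashes within a time window (by default, 5 failures within 5 minutes); a disabled pool returns 503 until it is started again. To diagnose high CPU, map the busy w3wp.exe PID to its app pool and, if possible, capture a process dump before recycling, because a recycle destroys the in-memory evidence.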
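The rapid-fail behavior can be pictured as a sliding-window counter. The toy model below is a sketch, not IIS's actual implementation: the class name `RapidFailTracker` is hypothetical, the thresholds are plain parameters (set here to 5 failures in 5 minutes), and the exact boundary condition (disable on reaching the limit) is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta


class RapidFailTracker:
    """Toy model of rapid-fail protection for a single app pool."""

    def __init__(self, max_failures: int = 5,
                 window: timedelta = timedelta(minutes=5)):
        self.max_failures = max_failures
        self.window = window
        self.failures = deque()  # crash timestamps still inside the window
        self.disabled = False

    def record_crash(self, when: datetime) -> bool:
        """Record a crash and return True if the pool is now disabled."""
        self.failures.append(when)
        # Forget crashes that have fallen out of the time window.
        while self.failures and when - self.failures[0] > self.window:
            self.failures.popleft()
        # Assumption: reaching the limit (not exceeding it) disables the pool.
        if len(self.failures) >= self.max_failures:
            self.disabled = True
        return self.disabled
```

The model shows why a crash loop right after a deployment disables the pool within seconds, while the same number of crashes spread over an hour never trips the limit.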
📊 Visual Representation
⌨️ Commands / Syntax
# L1 quick checks
Test-NetConnection -ComputerName localhost -Port 80
Test-NetConnection -ComputerName localhost -Port 443
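# Use sc.exe explicitly: in Windows PowerShell, "sc" is an alias for Set-Content
sc.exe query w3svc
sc.exe query was
# cmd.exe syntax; in PowerShell, call & "$env:windir\system32\inetsrv\appcmd.exe" instead
%windir%\system32\inetsrv\appcmd list apppool
%windir%\system32\inetsrv\appcmd list wp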
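# Start stopped pool (full path, because appcmd is not on PATH by default)
%windir%\system32\inetsrv\appcmd start apppool /apppool.name:"MyPool"
# Recycle single pool (avoid full iisreset)
%windir%\system32\inetsrv\appcmd recycle apppool /apppool.name:"MyPool"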
# Collect evidence
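# WAS events such as 5010 are written to the System log; Application Error 1000 goes to the Application log
Get-WinEvent -FilterHashtable @{LogName='System','Application'; Id=5005,5010,1000} -MaxEvents 100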
netsh http show servicestate view=requestq
# High CPU mapping
tasklist /FI "IMAGENAME eq w3wp.exe" /V
%windir%\system32\inetsrv\appcmd list wp
# Dump process for analysis (Sysinternals ProcDump); <PID> is the w3wp.exe process ID found above
procdump -ma <PID> C:\dumps\w3wp_<PID>.dmp
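When the PID-to-pool mapping has to be scripted, the text output of `appcmd list wp` can be parsed. The sketch below assumes each line has the shape `WP "4321" (applicationPool:MyPool)`; the function name `parse_appcmd_wp` and the sample values are hypothetical, and lines that do not match are skipped.

```python
import re

# Assumed line shape: WP "4321" (applicationPool:MyPool)
WP_LINE = re.compile(r'^WP "(\d+)" \(applicationPool:(.+)\)$')


def parse_appcmd_wp(output: str) -> dict:
    """Map worker-process PID -> app pool name from `appcmd list wp` text."""
    pid_to_pool = {}
    for line in output.splitlines():
        match = WP_LINE.match(line.strip())
        if match:
            pid_to_pool[int(match.group(1))] = match.group(2)
    return pid_to_pool


if __name__ == "__main__":
    # Hypothetical captured output, pasted in for illustration.
    sample = 'WP "4321" (applicationPool:MyPool)\nWP "5678" (applicationPool:Other Pool)'
    print(parse_appcmd_wp(sample))
```

Saving this mapping alongside the dump file name makes it clear later which pool a given dump belongs to.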
💼 Example (Real-world Use Case)
At peak traffic, the API returned 503 intermittently. L1 confirmed that only one app pool was impacted. L2 found that rapid-fail protection had triggered, with Event 5010 logged after repeated crashes following one deployment. The L3 dump analysis showed a null-reference loop in the startup path that terminated the process immediately. The team hotfixed startup validation and added canary checks, which prevented recurrence.
🧪 Hands-on
- Simulate a bad web.config to trigger 500.19 in a lab and recover from backup.
- Stop a pool manually and observe 503, then start it and verify recovery.
- Enable FREB for 500 and compare with Event Viewer entries.
- Use appcmd list wp to map PID to app pool.
- Create a short incident report: symptom, impact, timeline, fix, prevention.
Use pool-level restart/recycle whenever possible. Full IISRESET impacts all sites and often hides root cause.
🐛 Debugging Scenario
Failure: CPU is at 95% on one IIS node, and that node is degraded within the farm.
- Identify hottest w3wp PID in Task Manager.
- Map PID to pool via appcmd list wp.
- Capture dump before recycle if SLA allows.
- Temporarily drain node from load balancer, recycle impacted pool.
- Analyze dump and application traces for hot path and blocking calls.
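Task Manager shows the hottest PID interactively; a scripted equivalent can compare two samples of cumulative CPU time per w3wp PID and pick the one that grew most. The sketch below assumes the samples and the PID-to-pool map have already been collected by some other means; `hottest_pid`, `triage_target`, and all input values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def hottest_pid(before: Dict[int, float], after: Dict[int, float]) -> Optional[int]:
    """Return the PID whose cumulative CPU seconds grew most between samples."""
    # Only PIDs present in both samples can be compared.
    deltas = {pid: after[pid] - before[pid] for pid in after if pid in before}
    if not deltas:
        return None
    return max(deltas, key=deltas.get)


def triage_target(before: Dict[int, float], after: Dict[int, float],
                  pid_to_pool: Dict[int, str]) -> Optional[Tuple[int, str]]:
    """Return (pid, pool) for the process to dump and then recycle."""
    pid = hottest_pid(before, after)
    if pid is None:
        return None
    return pid, pid_to_pool.get(pid, "unknown pool")
```

Comparing deltas rather than raw totals avoids blaming a long-running worker process that accumulated CPU time hours ago but is idle now.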
🎯 Interview Questions
Beginner
Confirm network reachability and whether W3SVC/WAS and app pool are running.
Service unavailable, often due to stopped/disabled pool or queue saturation.
It limits blast radius to one application and preserves availability for other sites.
IIS feature that disables a crashing pool after repeated failures within a time window.
Timestamped IIS log lines with matching Event Viewer IDs and deployment timeline.
Intermediate
Inspect config parse errors, locked sections, file permissions, and module availability; restore from known-good backup if needed.
Find high-CPU w3wp PID, map to pool with appcmd list wp, then inspect that app's dependencies and code path.
503 bursts with request queue growth in HTTP.sys and rising request latency.
Before recycle/restart when recurring crash/high CPU persists and evidence is needed for L3 root cause analysis.
Add health checks, canary deploys, synthetic monitoring, and enforce config/deployment validation gates.
Scenario-based
Rollback latest infra/config change, restore IIS backup, verify service state, then reintroduce changes incrementally.
Validate runtime/module compatibility, app startup logs, and pool CLR settings. Roll back update if required.
Likely resource contention or dependency timeout; profile thread pool, DB latency, and connection limits.
Set safe recycle thresholds as temporary mitigation; capture periodic dumps and fix leak in code path.
Check time-taken distributions, backend response times, and queue depth. Latency incidents can exist without 5xx.
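The last answer can be made concrete by summarizing `time-taken` from IIS W3C logs. The sketch below assumes the log includes a `#Fields:` header and that both `sc-status` and `time-taken` (milliseconds) are enabled; the function names are hypothetical and the percentile uses the simple nearest-rank method.

```python
import math


def parse_w3c(lines):
    """Yield one dict per request, using the #Fields header for column names."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip truncated lines
                yield dict(zip(fields, values))


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(lines):
    """Count requests and 5xx responses and report p50/p95 time-taken in ms."""
    rows = list(parse_w3c(lines))
    if not rows:
        return {"requests": 0, "5xx": 0, "p50_ms": None, "p95_ms": None}
    times = [int(row["time-taken"]) for row in rows]
    errors = sum(1 for row in rows if row["sc-status"].startswith("5"))
    return {
        "requests": len(rows),
        "5xx": errors,
        "p50_ms": percentile(times, 50),
        "p95_ms": percentile(times, 95),
    }
```

A summary with zero 5xx responses but a p95 of several seconds is exactly the kind of latency incident that status-code alerts miss.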
🌐 Real-world Usage
Reliable IIS operations depend on disciplined incident flow, not hero debugging. Strong runbooks transform outages into controlled, measurable responses.
📝 Summary
Use a layered L1/L2/L3 process: restore service fast, gather evidence, isolate failing component, then implement durable fixes. This is the core of professional IIS incident handling.