Advanced · Lesson 12 of 16

IIS Troubleshooting: 500 Errors, App Pool Crashes, and High CPU

Structured L1/L2/L3 troubleshooting workflow for IIS production incidents: fast triage, deep diagnostics, and durable remediation for recurring outages.

🧒 Simple Explanation (ELI5)

When IIS is sick, don't guess. Check the signs in this order: Is the server reachable? Is the site running? Is the app pool running? What error code appears? What do the logs say? Then fix exactly that layer.

🔧 Why Do We Need It?

🌍 Real-world Analogy

Like emergency medicine triage: stabilize first (restore service), diagnose second (find exact cause), prevent relapse third (permanent fix).

⚙️ Technical Explanation

Common outage patterns:

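Rapid-fail protection disables an app pool after repeated worker-process failures within a configured time window; requests to that pool then return 503 until the pool is started again. Diagnosing high CPU means mapping the busy w3wp.exe PID to its app pool and, if possible, capturing a process dump before recycling, because a recycle discards the in-memory evidence.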
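To make the rapid-fail idea concrete, here is a minimal sketch that checks whether a list of crash timestamps would trip a "N failures within an interval" rule. The function name `Test-RapidFailTrip` is hypothetical, the defaults of 5 failures in 5 minutes are the commonly cited IIS defaults and should be verified against the pool's actual failure settings, and the sliding-window logic is a simplified model rather than IIS's exact internal accounting.

```powershell
# Simplified model of rapid-fail protection (an illustration, not IIS's implementation).
function Test-RapidFailTrip {
    param(
        [Parameter(Mandatory)] [datetime[]] $CrashTimes,
        [int] $MaxCrashes = 5,                              # assumed default
        [timespan] $Interval = (New-TimeSpan -Minutes 5)    # assumed default
    )
    $sorted = @($CrashTimes | Sort-Object)
    # Slide a window of $MaxCrashes consecutive crashes over the sorted list.
    for ($i = 0; ($i + $MaxCrashes - 1) -lt $sorted.Count; $i++) {
        $span = $sorted[$i + $MaxCrashes - 1] - $sorted[$i]
        if ($span -le $Interval) { return $true }
    }
    return $false
}

# Five crashes 30 seconds apart fall inside one 5-minute window.
$t0 = [datetime]'2024-01-01T10:00:00'
$burst = 0..4 | ForEach-Object { $t0.AddSeconds(30 * $_) }
Test-RapidFailTrip -CrashTimes $burst
```

Feeding the function crash times pulled from Event Viewer gives a quick sanity check on whether a disabled pool is consistent with the configured thresholds.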

⚠️
L1/L2/L3 Escalation Model

L1: restore access quickly, identify impacted scope, capture evidence.
L2: isolate failing pool/module/config and apply targeted mitigation.
L3: dump analysis, code-level root cause, preventive engineering changes.

📊 Visual Representation

Incident Decision Flow

  1. Reachability: DNS/Port 80/443
  2. Service State: W3SVC/WAS/AppPool
  3. Error Signature: 500.x / 502.5 / 503
  4. Evidence: IIS logs + Event IDs + FREB
  5. Fix: Targeted remediation + prevention
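The decision flow can be sketched as a small function that returns the next step for a set of observations. This is only an illustration of the ordering; `Get-TriageStep` and its parameters are hypothetical names, and the returned text is a suggestion, not a fixed procedure.

```powershell
# Walk the layers in order and stop at the first one that looks unhealthy.
function Get-TriageStep {
    param(
        [bool] $PortReachable,
        [bool] $ServicesRunning,   # W3SVC and WAS both running
        [bool] $AppPoolStarted,
        [int]  $StatusCode = 0     # HTTP status seen by the client, 0 if unknown
    )
    if (-not $PortReachable)   { return 'Reachability: check DNS, firewall, and bindings on 80/443' }
    if (-not $ServicesRunning) { return 'Service state: start W3SVC/WAS' }
    if (-not $AppPoolStarted)  { return 'Service state: start the app pool and look for rapid-fail events' }
    if ($StatusCode -ge 500)   { return "Evidence: collect IIS logs, Event IDs, and FREB for HTTP $StatusCode" }
    return 'No fault at these layers: check latency and dependencies'
}

Get-TriageStep -PortReachable $true -ServicesRunning $true -AppPoolStarted $false
```

Encoding the order this way keeps a responder from jumping to log analysis before confirming that the service and pool are actually running.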

⌨️ Commands / Syntax

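PowerShell
# appcmd is not on PATH by default, so call it by full path
$appcmd = "$env:windir\system32\inetsrv\appcmd.exe"

# L1 quick checks
Test-NetConnection -ComputerName localhost -Port 80
Test-NetConnection -ComputerName localhost -Port 443
# Use sc.exe: in Windows PowerShell, "sc" alone is an alias for Set-Content
sc.exe query w3svc
sc.exe query was
& $appcmd list apppool
& $appcmd list wp

# Start stopped pool
& $appcmd start apppool /apppool.name:"MyPool"

# Recycle single pool (avoid full iisreset)
& $appcmd recycle apppool /apppool.name:"MyPool"

# Collect evidence
# WAS events (5002 = pool auto-disabled, 5010 = ping failure) are written to the System log;
# Event 1000 (Application Error) is written to the Application log, so query both.
Get-WinEvent -FilterHashtable @{LogName='System','Application'; Id=5002,5005,5010,1000} -MaxEvents 100
netsh http show servicestate view=requestq

# High CPU mapping: find the busy w3wp.exe PID, then map it to its pool
tasklist /FI "IMAGENAME eq w3wp.exe" /V
& $appcmd list wp

# Dump process for analysis (Sysinternals ProcDump)
# Replace <PID> with the worker process ID found above
procdump -ma <PID> C:\dumps\w3wp_<PID>.dmp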
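The PID-to-pool mapping step can be automated by parsing the text that `appcmd list wp` prints. The sketch below assumes the usual line shape `WP "<pid>" (applicationPool:<name>)`; the function name `ConvertFrom-AppCmdWp` is hypothetical, and the sample lines stand in for real command output.

```powershell
# Turn "appcmd list wp" lines into objects with ProcessId and AppPool properties.
function ConvertFrom-AppCmdWp {
    param([Parameter(ValueFromPipeline)] [string] $Line)
    process {
        # Lines that do not match the expected shape are skipped silently.
        if ($Line -match '^WP "(?<id>\d+)" \(applicationPool:(?<pool>.+)\)$') {
            [pscustomobject]@{
                ProcessId = [int]$Matches['id']
                AppPool   = $Matches['pool']
            }
        }
    }
}

# Sample text standing in for: & "$env:windir\system32\inetsrv\appcmd.exe" list wp
$sample = @(
    'WP "4892" (applicationPool:DefaultAppPool)',
    'WP "7120" (applicationPool:MyPool)'
)
$workers = $sample | ConvertFrom-AppCmdWp

# Which pool owns the high-CPU PID reported by tasklist?
$workers | Where-Object { $_.ProcessId -eq 7120 } | Select-Object -ExpandProperty AppPool   # MyPool
```

On a live server, piping the real `appcmd list wp` output into the function replaces the sample array.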

💼 Example (Real-world Use Case)

At peak traffic, an API intermittently returned 503. L1 confirmed that only one app pool was impacted. L2 found rapid-fail protection triggering, with Event 5010 logged after repeated crashes following one deployment. L3 dump analysis showed a null-reference loop in the startup path that terminated the process immediately. The team hotfixed startup validation and added canary checks to prevent recurrence.

🧪 Hands-on

  1. Simulate a bad web.config to trigger 500.19 in a lab and recover from backup.
  2. Stop a pool manually and observe 503, then start it and verify recovery.
  3. Enable FREB for 500 and compare with Event Viewer entries.
  4. Use appcmd list wp to map PID to app pool.
  5. Create a short incident report: symptom, impact, timeline, fix, prevention.
💡
Avoid Blanket IISRESET

Use pool-level restart/recycle whenever possible. Full IISRESET impacts all sites and often hides root cause.

🐛 Debugging Scenario

Failure: CPU at 95% with one IIS node in farm degraded.

🎯 Interview Questions

Beginner

What is the first check for an IIS outage?

Confirm network reachability and whether W3SVC/WAS and app pool are running.

What does 503 mean in IIS?

Service unavailable, often due to stopped/disabled pool or queue saturation.

Why prefer app pool recycle over iisreset?

It limits the blast radius to the applications in one pool and keeps other sites available.

What is rapid-fail protection?

IIS feature that disables a crashing pool after repeated failures within a time window.

What is a useful artifact during incident review?

Timestamped IIS log lines with matching Event Viewer IDs and deployment timeline.
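As an illustration of pulling timestamped error lines from an IIS log, the sketch below parses W3C-format lines using the `#Fields:` header and keeps only 5xx responses. The function name `ConvertFrom-IisLog` is hypothetical and the sample lines are invented; real logs carry more fields. W3C log timestamps are in UTC by default, so convert before matching them against local-time Event Viewer entries.

```powershell
# Parse W3C extended log lines into objects keyed by the names in the #Fields header.
function ConvertFrom-IisLog {
    param([string[]] $Lines)
    $fields = @()
    foreach ($line in $Lines) {
        if (-not $line) { continue }                      # skip blank lines
        if ($line.StartsWith('#Fields: ')) {
            $fields = $line.Substring(9).Split(' ')       # header defines the column order
        } elseif (-not $line.StartsWith('#') -and $fields.Count -gt 0) {
            $values = $line.Split(' ')
            $entry = [ordered]@{}
            for ($i = 0; $i -lt $fields.Count -and $i -lt $values.Count; $i++) {
                $entry[$fields[$i]] = $values[$i]
            }
            [pscustomobject]$entry
        }
    }
}

# Invented sample lines for illustration only.
$sample = @(
    '#Fields: date time cs-method cs-uri-stem sc-status sc-substatus time-taken',
    '2024-01-01 10:00:01 GET /api/orders 200 0 35',
    '2024-01-01 10:00:02 GET /api/orders 500 19 12',
    '2024-01-01 10:00:03 GET /health 503 0 2'
)
$errors = ConvertFrom-IisLog -Lines $sample | Where-Object { [int]$_.'sc-status' -ge 500 }
$errors | ForEach-Object {
    '{0} {1} {2} {3}.{4}' -f $_.date, $_.time, $_.'cs-uri-stem', $_.'sc-status', $_.'sc-substatus'
}
```

The status and sub-status pair (for example 500.19) is what links a log line to the error signature used during triage.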

Intermediate

How do you troubleshoot 500.19 quickly?

Inspect config parse errors, locked sections, file permissions, and module availability; restore from known-good backup if needed.

How do you identify CPU culprit in multi-site IIS?

Find high-CPU w3wp PID, map to pool with appcmd list wp, then inspect that app's dependencies and code path.

What indicates queue overload?

503 bursts with request queue growth in HTTP.sys and rising request latency.

When do you capture process dumps?

Before recycle/restart when recurring crash/high CPU persists and evidence is needed for L3 root cause analysis.

How do you prevent repeat outages?

Add health checks, canary deploys, synthetic monitoring, and enforce config/deployment validation gates.

Scenario-based

All sites down after change window. What now?

Roll back the latest infra/config change, restore the IIS backup, verify service state, then reintroduce changes incrementally.

Only one app fails after .NET update.

Validate runtime/module compatibility, app startup logs, and pool CLR settings. Roll back update if required.

Intermittent 500 only under load tests.

Likely resource contention or dependency timeout; profile thread pool, DB latency, and connection limits.

High CPU and memory leak over days.

Set safe recycle thresholds as temporary mitigation; capture periodic dumps and fix leak in code path.

Users report slowness but no errors.

Check time-taken distributions, backend response times, and queue depth. Latency incidents can exist without 5xx.
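A time-taken distribution is easier to reason about as percentiles than as an average. The sketch below uses the nearest-rank method on a list of `time-taken` values (milliseconds in IIS logs); `Get-Percentile` is a hypothetical helper and the sample numbers are invented.

```powershell
# Nearest-rank percentile: the smallest value with at least P% of samples at or below it.
function Get-Percentile {
    param(
        [Parameter(Mandatory)] [double[]] $Values,
        [ValidateRange(0, 100)] [double] $Percentile
    )
    $sorted = @($Values | Sort-Object)
    $rank   = [math]::Ceiling($Percentile * $sorted.Count / 100)
    $index  = [math]::Max([int]$rank - 1, 0)
    return $sorted[$index]
}

# Invented time-taken samples in milliseconds: mostly fast, with a slow tail.
$timeTakenMs = 20, 25, 30, 35, 40, 45, 50, 60, 900, 1500
Get-Percentile -Values $timeTakenMs -Percentile 50   # 40
Get-Percentile -Values $timeTakenMs -Percentile 95   # 1500
```

Here the median looks healthy while the 95th percentile exposes the slow tail, which is the pattern behind slowness reports that come with no 5xx errors.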

🌐 Real-world Usage

Reliable IIS operations depend on disciplined incident flow, not hero debugging. Strong runbooks transform outages into controlled, measurable responses.

📝 Summary

Use a layered L1/L2/L3 process: restore service fast, gather evidence, isolate failing component, then implement durable fixes. This is the core of professional IIS incident handling.