Advanced · Lesson 12 of 16

IIS Troubleshooting: 500 Errors, App Pool Crashes, and High CPU

Structured L1/L2/L3 troubleshooting workflow for IIS production incidents: fast triage, deep diagnostics, and durable remediation for recurring outages.

🧒 Simple Explanation (ELI5)

When IIS is sick, don't guess. Check the signs in this order: Is the server reachable? Is the site running? Is the app pool running? What error code appears? What do the logs say? Then fix exactly that layer.

🔧 Why Do We Need It?

Reduce MTTR during outages.
Avoid harmful "restart everything" patterns.
Differentiate infra issues from code bugs quickly.
Build repeatable runbooks for on-call teams.

🌍 Real-world Analogy

Like emergency medicine triage: stabilize first (restore service), diagnose second (find exact cause), prevent relapse third (permanent fix).

⚙️ Technical Explanation

Common outage patterns:

500.19: invalid/locked/unreadable config.
500.21: module/handler not installed.
502.5: ASP.NET Core out-of-process startup failure via ANCM (in-process hosting reports 500.30 instead).
503: app pool stopped/disabled or queue overload.
High CPU: runaway request loops, expensive queries, thread starvation, or external dependency delays causing retries.
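The patterns above can be kept as a small lookup for on-call use. The following is a minimal sketch, not an official IIS tool: the function name `Get-IisErrorHint` is hypothetical, the table covers only the codes named in this lesson, and keying 503 under substatus 0 is an assumption.

```powershell
# Hypothetical helper: maps an IIS status/substatus pair to the likely cause
# listed in this lesson. Anything else falls through to a generic hint.
function Get-IisErrorHint {
    param(
        [Parameter(Mandatory)] [int] $Status,
        [int] $SubStatus = 0
    )

    $hints = @{
        '500.19' = 'Invalid, locked, or unreadable configuration'
        '500.21' = 'Module or handler not installed'
        '502.5'  = 'ASP.NET Core startup failure via ANCM'
        '503.0'  = 'App pool stopped/disabled or request queue overload'  # substatus 0 assumed
    }

    $key = '{0}.{1}' -f $Status, $SubStatus
    if ($hints.ContainsKey($key)) { return $hints[$key] }
    return "No hint for $key; check IIS logs, Event Viewer, and FREB"
}

Get-IisErrorHint -Status 500 -SubStatus 19
```

The status and substatus values correspond to the sc-status and sc-substatus fields of an IIS log line.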

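Rapid-fail protection disables an app pool after repeated worker process failures within a set interval (by default, 5 failures in 5 minutes). Diagnosing high CPU requires mapping the w3wp PID to its app pool and, if possible, collecting a process dump before recycling.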
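To make the rapid-fail idea concrete, here is a simplified model of the check. It is a sketch under assumptions: `Test-RapidFailTrip` is a hypothetical name, the defaults mirror the IIS defaults of 5 failures in 5 minutes, and it treats reaching the threshold inside a sliding window as a trip, which may not match how WAS counts failures exactly.

```powershell
# Simplified model of rapid-fail protection: returns $true when at least
# $MaxCrashes crashes fall inside any window of $IntervalMinutes.
function Test-RapidFailTrip {
    param(
        [Parameter(Mandatory)] [datetime[]] $CrashTimes,
        [int] $MaxCrashes = 5,
        [int] $IntervalMinutes = 5
    )

    $sorted = @($CrashTimes | Sort-Object)
    for ($i = 0; $i + $MaxCrashes -le $sorted.Count; $i++) {
        # Time between the first and last crash in a run of $MaxCrashes crashes
        $span = $sorted[$i + $MaxCrashes - 1] - $sorted[$i]
        if ($span.TotalMinutes -le $IntervalMinutes) { return $true }
    }
    return $false
}

# Example: five crashes spaced 30 seconds apart (all within two minutes)
$start   = [datetime]'2024-01-01T10:00:00'
$crashes = 0..4 | ForEach-Object { $start.AddSeconds(30 * $_) }
Test-RapidFailTrip -CrashTimes $crashes
```

Feeding it crash timestamps pulled from Event Viewer shows whether a pool was likely disabled by rapid-fail rather than stopped by hand.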

⚠️ L1/L2/L3 Escalation Model

L1: restore access quickly, identify impacted scope, capture evidence.
L2: isolate failing pool/module/config and apply targeted mitigation.
L3: dump analysis, code-level root cause, preventive engineering changes.

📊 Visual Representation

Incident Decision Flow

Reachability: DNS/Port 80/443
Service State: W3SVC/WAS/AppPool
Error Signature: 500.x / 502.5 / 503
Evidence: IIS logs + Event IDs + FREB
Fix: Targeted remediation + prevention
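The ordering in the decision flow can be expressed as a short triage helper that stops at the first failing layer. This is an illustrative sketch only: `Get-FirstFailingLayer` and the state keys (`PortReachable`, `ServicesRunning`, `AppPoolRunning`) are hypothetical names, and the caller is assumed to fill them in from the L1 checks.

```powershell
# Hypothetical triage helper: takes observed facts and returns the first
# failing layer, checked in the same order as the decision flow.
function Get-FirstFailingLayer {
    param([Parameter(Mandatory)] [hashtable] $State)

    $checks = @(
        [pscustomobject]@{ Key = 'PortReachable';   Layer = 'Reachability: DNS or port 80/443' }
        [pscustomobject]@{ Key = 'ServicesRunning'; Layer = 'Service state: W3SVC/WAS' }
        [pscustomobject]@{ Key = 'AppPoolRunning';  Layer = 'Service state: app pool' }
    )

    foreach ($check in $checks) {
        # A missing or false value counts as a failed check
        if (-not $State[$check.Key]) { return $check.Layer }
    }
    return 'Lower layers healthy: continue with error signature and evidence'
}

Get-FirstFailingLayer -State @{ PortReachable = $true; ServicesRunning = $true; AppPoolRunning = $false }
```

Stopping at the first failure keeps the responder from chasing error codes while the service itself is down.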

⌨️ Commands / Syntax

PowerShell

# L1 quick checks
Test-NetConnection -ComputerName localhost -Port 80
Test-NetConnection -ComputerName localhost -Port 443
# Use sc.exe: in Windows PowerShell, "sc" is an alias for Set-Content
sc.exe query w3svc
sc.exe query was
# appcmd is not on PATH by default, so call it by full path
$appcmd = "$env:windir\system32\inetsrv\appcmd.exe"
& $appcmd list apppool
& $appcmd list wp

# Start stopped pool
& $appcmd start apppool /apppool.name:"MyPool"

# Recycle single pool (avoid full iisreset)
& $appcmd recycle apppool /apppool.name:"MyPool"

# Collect evidence (WAS events are written to the System log, application crashes to the Application log)
Get-WinEvent -FilterHashtable @{LogName='System','Application'; Id=5005,5010,1000} -MaxEvents 100
netsh http show servicestate view=requestq

# High CPU mapping
tasklist /FI "IMAGENAME eq w3wp.exe" /V
& $appcmd list wp

# Dump process for analysis (Sysinternals ProcDump); replace <PID> with the w3wp process ID
procdump -ma <PID> C:\dumps\w3wp_.dmp

💼 Example (Real-world Use Case)

At peak traffic, an API returned 503 intermittently. L1 confirmed that only one app pool was impacted. L2 found rapid-fail protection tripping, with Event 5010 logged after repeated crashes following one deployment. The L3 dump showed a null-reference loop in the startup path that terminated the process immediately. The team hotfixed startup validation and added canary checks to prevent recurrence.

🧪 Hands-on

Simulate a bad web.config to trigger 500.19 in a lab and recover from backup.
Stop a pool manually and observe 503, then start it and verify recovery.
Enable FREB for 500 and compare with Event Viewer entries.
Use appcmd list wp to map PID to app pool.
Create a short incident report: symptom, impact, timeline, fix, prevention.

💡 Avoid Blanket IISRESET

Use pool-level restart/recycle whenever possible. Full IISRESET impacts all sites and often hides root cause.

🐛 Debugging Scenario

Failure: CPU at 95% on one degraded IIS node in the farm.

Identify hottest w3wp PID in Task Manager.
Map PID to pool via appcmd list wp.
Capture dump before recycle if SLA allows.
Temporarily drain node from load balancer, recycle impacted pool.
Analyze dump and application traces for hot path and blocking calls.
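The PID-to-pool mapping step can be scripted by parsing the text that `appcmd list wp` prints. The sketch below assumes the usual output shape `WP "<pid>" (applicationPool:<name>)` and uses sample text instead of a live server; `ConvertFrom-AppCmdWp` is a hypothetical name.

```powershell
# Parses "appcmd list wp" output lines into objects with ProcessId and AppPool.
# Lines that do not match the expected shape are skipped.
function ConvertFrom-AppCmdWp {
    param([Parameter(Mandatory)] [string[]] $Lines)

    foreach ($line in $Lines) {
        if ($line -match '^WP "(?<id>\d+)" \(applicationPool:(?<pool>.+)\)$') {
            [pscustomobject]@{
                ProcessId = [int]$Matches['id']
                AppPool   = $Matches['pool']
            }
        }
    }
}

# Sample text standing in for: & "$env:windir\system32\inetsrv\appcmd.exe" list wp
$sample = @(
    'WP "4312" (applicationPool:MyPool)'
    'WP "5120" (applicationPool:DefaultAppPool)'
)

# Look up the pool for the hottest PID seen in Task Manager (4312 is made up)
ConvertFrom-AppCmdWp -Lines $sample | Where-Object ProcessId -eq 4312
```

On a real node, the sample array would be replaced by the live appcmd output, and the resulting pool name would feed the recycle step.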

🎯 Interview Questions

Beginner

What is the first check for an IIS outage?

Confirm network reachability and whether W3SVC/WAS and app pool are running.

What does 503 mean in IIS?

Service unavailable, often due to stopped/disabled pool or queue saturation.

Why prefer app pool recycle over iisreset?

It limits blast radius to one application and preserves availability for other sites.

What is rapid-fail protection?

IIS feature that disables a crashing pool after repeated failures within a time window.

What is a useful artifact during incident review?

Timestamped IIS log lines with matching Event Viewer IDs and deployment timeline.

Intermediate

How do you troubleshoot 500.19 quickly?

Inspect config parse errors, locked sections, file permissions, and module availability; restore from known-good backup if needed.

How do you identify CPU culprit in multi-site IIS?

Find high-CPU w3wp PID, map to pool with appcmd list wp, then inspect that app's dependencies and code path.

What indicates queue overload?

503 bursts with request queue growth in HTTP.sys and rising request latency.

When do you capture process dumps?

Before recycle/restart when recurring crash/high CPU persists and evidence is needed for L3 root cause analysis.

How do you prevent repeat outages?

Add health checks, canary deploys, synthetic monitoring, and enforce config/deployment validation gates.

Scenario-based

All sites down after change window. What now?

Roll back the latest infra/config change, restore the IIS backup, verify service state, then reintroduce changes incrementally.

Only one app fails after .NET update.

Validate runtime/module compatibility, app startup logs, and pool CLR settings. Roll back update if required.

Intermittent 500 only under load tests.

Likely resource contention or dependency timeout; profile thread pool, DB latency, and connection limits.

High CPU and memory leak over days.

Set safe recycle thresholds as temporary mitigation; capture periodic dumps and fix leak in code path.

Users report slowness but no errors.

Check time-taken distributions, backend response times, and queue depth. Latency incidents can exist without 5xx.
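A latency check of this kind can be sketched against W3C-format IIS log lines. The example below assumes the log includes the sc-status and time-taken fields (time-taken is in milliseconds), reads column positions from the log's own #Fields header, and uses a nearest-rank 95th percentile; `Get-IisLogSummary` and the sample lines are hypothetical.

```powershell
# Summarizes W3C log lines: request count, 5xx count, and p95 time-taken (ms).
function Get-IisLogSummary {
    param([string[]] $Lines)

    $statusIdx = -1
    $timeIdx   = -1
    $rows = foreach ($line in $Lines) {
        if ($line.StartsWith('#Fields:')) {
            # Column positions come from the header, not a hard-coded layout
            $fields    = @($line.Substring(8).Trim() -split '\s+')
            $statusIdx = [array]::IndexOf($fields, 'sc-status')
            $timeIdx   = [array]::IndexOf($fields, 'time-taken')
            continue
        }
        if ($line.StartsWith('#') -or -not $line.Trim()) { continue }
        if ($statusIdx -lt 0 -or $timeIdx -lt 0) { throw 'sc-status/time-taken not found in #Fields header' }
        $values = $line -split ' '
        [pscustomobject]@{
            Status    = [int]$values[$statusIdx]
            TimeTaken = [int]$values[$timeIdx]
        }
    }

    $times    = @($rows | ForEach-Object TimeTaken | Sort-Object)
    $p95Index = [int][math]::Ceiling(0.95 * $times.Count) - 1   # nearest-rank percentile
    [pscustomobject]@{
        Requests     = $times.Count
        ServerErrors = @($rows | Where-Object Status -ge 500).Count
        P95TimeTaken = $times[$p95Index]
    }
}

$sample = @(
    '#Fields: date time cs-method cs-uri-stem sc-status sc-substatus time-taken'
    '2024-01-01 10:00:00 GET /api/orders 200 0 120'
    '2024-01-01 10:00:01 GET /api/orders 200 0 4800'
    '2024-01-01 10:00:02 GET /api/orders 503 0 15'
)
Get-IisLogSummary -Lines $sample
```

A high P95TimeTaken with zero ServerErrors is the signature of a latency incident without 5xx responses.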

🌐 Real-world Usage

Reliable IIS operations depend on disciplined incident flow, not hero debugging. Strong runbooks transform outages into controlled, measurable responses.

📝 Summary

Use a layered L1/L2/L3 process: restore service fast, gather evidence, isolate failing component, then implement durable fixes. This is the core of professional IIS incident handling.
