Hands-on · Lesson 15 of 16

Troubleshooting Azure Core Services

A systematic guide to diagnosing and resolving the most common failures across Compute, Storage, and Database services — the scenarios you'll face in production and in interviews.

General Troubleshooting Mindset

Before diving into specifics, establish a consistent process:

  1. Check recent changes — what changed just before the issue started? (Activity Log)
  2. Check the service health — is Azure itself having an incident? (Azure Service Health dashboard)
  3. Check logs — resource logs, diagnostic settings, Application Insights.
  4. Check metrics — CPU, memory, connection counts, request rates, error rates.
  5. Narrow scope — reproduce the issue in isolation (staging environment, minimal repro).

Compute Troubleshooting

VM Won't Start

| Symptom | Cause | Fix |
| --- | --- | --- |
| Allocation failed | No capacity in the region/zone for the VM size | Resize to an available SKU, or try a different region |
| Boot stuck / black screen | OS corruption, bad kernel update, disk issue | Boot Diagnostics → Screenshot/Serial Console; attach disk to recovery VM |
| Disk full / boot failure | OS disk ran out of space during update | Attach disk to recovery VM; extend the partition or remove disk-bloating files |
| VM shows Stopped (Deallocated) unexpectedly | Auto-shutdown policy, Azure maintenance, subscription issue | Check Activity Log for stop event and actor; check auto-shutdown settings |

Azure CLI — Boot Diagnostics
# Enable boot diagnostics
az vm boot-diagnostics enable \
  --resource-group rg-compute \
  --name vm-web-01 \
  --storage https://mystorageacct.blob.core.windows.net

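# Retrieve the boot log (serial console output)
az vm boot-diagnostics get-boot-log \
  --resource-group rg-compute \
  --name vm-web-01

# Get the URIs of the boot screenshot and serial log
az vm boot-diagnostics get-boot-log-uris \
  --resource-group rg-compute \
  --name vm-web-01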

# Connect via Serial Console (interactive emergency access)
# Azure Portal → VM → Serial Console
# (also available from the CLI through the serial-console extension)

App Service Returns 503 / App Not Loading

| Symptom | Cause | Fix |
| --- | --- | --- |
| 503 Service Unavailable | App crashed; startup failed; plan out of resources | Check log stream; Kudu Process Explorer; restart app |
| Application error (500) | Runtime exception in code | Application Insights exceptions; log stream |
| App "cold" (slow first request) | App unloaded after an idle period, or app recycled | Enable "Always On" (Basic tier and above) so the app stays loaded |
| Deployment rolled out with bug | Bad code released to production slot | Swap back to previous slot immediately |

Azure CLI — App Service Logs
# Stream live logs
az webapp log tail \
  --resource-group rg-web \
  --name myapp-prod

# Enable application, web server, and container logging to the filesystem
az webapp log config \
  --resource-group rg-web \
  --name myapp-prod \
  --web-server-logging filesystem \
  --docker-container-logging filesystem \
  --application-logging filesystem \
  --level verbose

# Show the app's default hostname (the URL served by the production slot)
az webapp show \
  --resource-group rg-web \
  --name myapp-prod \
  --query defaultHostName
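
The "swap back to previous slot" fix from the table above can be sketched as a small shell helper. After a swap, the previous production build sits in the slot that was swapped in, so swapping again restores it. The function name `rollback_swap` and the default slot name `staging` are assumptions for illustration; substitute your own slot name.

```shell
# Swap the named slot back into production to undo a bad release.
# Arguments: resource group, app name, slot name.
# The slot defaults to "staging", which is an assumed name, not one from this lesson.
rollback_swap() {
  az webapp deployment slot swap \
    --resource-group "$1" \
    --name "$2" \
    --slot "${3:-staging}" \
    --target-slot production
}
```

For the app used in this lesson, that would be `rollback_swap rg-web myapp-prod`. Swapping is usually faster than redeploying because the previous build is already warmed up in the other slot.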

Azure Function Failing

Storage Troubleshooting

Blob Access Denied (403)

| Cause | Check | Fix |
| --- | --- | --- |
| SAS token expired | Inspect the `se` (expiry) parameter in the URL | Regenerate SAS with adequate expiry |
| Storage firewall enabled | Portal → Storage Account → Networking | Add client IP or VNet to allowed list |
| Missing RBAC role | Check IAM on storage account / container | Assign Storage Blob Data Reader/Contributor |
| Account key rotated | App using old connection string | Update connection string in App Settings / Key Vault |
| Private endpoint, no DNS | Does the account's .blob.core.windows.net name resolve to the private IP? | Configure the privatelink.blob.core.windows.net private DNS zone or custom DNS |
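
The SAS expiry check in the first row can be done from a terminal without opening the Portal. The following is a minimal sketch, assuming GNU `date` (for `-d`) and a SAS URL whose `se` value is an ISO 8601 UTC timestamp; the function names `sas_expiry` and `sas_expired` are hypothetical helpers, not part of the Azure CLI.

```shell
# Print the decoded expiry (se) value of a SAS URL, or nothing if it is absent.
sas_expiry() {
  printf '%s\n' "$1" |
    sed -n 's/.*[?&]se=\([^&]*\).*/\1/p' |
    sed 's/%3[Aa]/:/g'   # SAS URLs usually percent-encode the colons
}

# Return 0 if the SAS has expired, 1 if it is still valid,
# 2 if there is no se parameter or it cannot be parsed.
# $2 is an optional "now" in epoch seconds (defaults to the current time).
sas_expired() {
  expiry="$(sas_expiry "$1")"
  [ -n "$expiry" ] || return 2
  now="${2:-$(date -u +%s)}"
  expiry_epoch="$(date -u -d "$expiry" +%s)" || return 2   # GNU date assumed
  [ "$now" -ge "$expiry_epoch" ]
}
```

Calling `sas_expired "$url" && echo "SAS expired"` gives a quick yes/no answer before you move on to the firewall and RBAC checks.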

File Share Mount Failure (Linux)

Database Troubleshooting

Azure SQL Connection Failure

| Error | Cause | Fix |
| --- | --- | --- |
| Cannot open server 'x' requested by the login | Firewall not allowing client IP | Add IP to SQL Server firewall rules |
| Login failed for user | Wrong credentials / AAD auth misconfigured | Verify admin username; reset password if needed |
| SSL/TLS errors | Client driver too old; Encrypt=False | Use Encrypt=True;TrustServerCertificate=False; update driver |
| Connection timeout | DTU limit exceeded; max connections reached | Scale up DTU/vCore; check Elastic Pool limits |

Azure CLI — SQL Diagnostics
# List firewall rules
az sql server firewall-rule list \
  --resource-group rg-database \
  --server myapp-sql-server \
  --output table

# Check DTU consumption (Metrics in Portal, or via CLI)
az monitor metrics list \
  --resource /subscriptions/{sub}/resourceGroups/rg-database/providers/Microsoft.Sql/servers/myapp-sql-server/databases/app-db \
  --metric "dtu_consumption_percent" \
  --interval PT1M \
  --output table

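# Reset admin password
# (NEW_SQL_ADMIN_PASSWORD is a placeholder variable; set it first rather than
# typing the password into the command, where it would land in shell history)
az sql server update \
  --resource-group rg-database \
  --name myapp-sql-server \
  --admin-password "$NEW_SQL_ADMIN_PASSWORD"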
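
For the SSL/TLS row in the table above, it can help to check a connection string mechanically before blaming the driver. This is a rough sketch: the function name `conn_string_tls_ok` is hypothetical, and it only recognises the literal `Encrypt=True` and `TrustServerCertificate=False` settings named in the table, not other values a newer driver may accept.

```shell
# Return 0 if the connection string sets Encrypt=True and
# TrustServerCertificate=False, and 1 otherwise.
conn_string_tls_ok() {
  # Lowercase, strip spaces, and wrap in ';' so every key=value pair is delimited.
  normalized=";$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d ' ');"
  case "$normalized" in
    *";encrypt=true;"*) ;;
    *) return 1 ;;
  esac
  case "$normalized" in
    *";trustservercertificate=false;"*) ;;
    *) return 1 ;;
  esac
}
```

For example, `conn_string_tls_ok "$SQL_CONNECTION_STRING" || echo "TLS settings need review"`, where `SQL_CONNECTION_STRING` is whatever variable holds your connection string.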

Cosmos DB Query Performance Problems

Networking / Connectivity Troubleshooting

NSG Blocking Traffic

Azure CLI
# Check effective NSG rules for a NIC
az network nic list-effective-nsg \
  --resource-group rg-compute \
  --name vm-web-01VMNic \
  --output table

# Use Network Watcher — IP Flow Verify (test if NSG allows traffic)
az network watcher test-ip-flow \
  --direction Inbound \
  --protocol TCP \
  --local 10.0.0.4:80 \
  --remote "203.0.113.10:*" \
  --vm vm-web-01 \
  --resource-group rg-compute

DNS Not Resolving Private Endpoints
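
One quick way to test the "private endpoint, no DNS" case from the storage table is to resolve the account hostname from inside the VNet and see whether the answer is a private address. The sketch below assumes the VNet uses RFC 1918 address space (VNets can use other ranges) and that `getent` is available on the client. The function name `is_private_ip` is a hypothetical helper.

```shell
# Return 0 if the IPv4 address falls in an RFC 1918 private range, 1 otherwise.
is_private_ip() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use from a VM in the VNet (hostname taken from earlier in this lesson):
#   ip="$(getent hosts mystorageacct.blob.core.windows.net | awk '{ print $1; exit }')"
#   is_private_ip "$ip" || echo "Resolves to a public IP: check the private DNS zone"
```

A public IP in the answer suggests the client is not using the private DNS zone, so traffic bypasses the private endpoint and may be rejected by the storage firewall.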

Useful Diagnostic Commands Reference

Quick Reference
# Activity Log — recent operations on resource group
az monitor activity-log list \
  --resource-group rg-web \
  --start-time 2024-01-15T00:00:00Z \
  --output table

# List all resources in a resource group
az resource list --resource-group rg-web --output table

# Show resource properties and state (availability status lives in Resource Health in the Portal)
az resource show \
  --ids /subscriptions/{sub}/resourceGroups/rg-web/providers/Microsoft.Web/sites/myapp \
  --include-response-body

# Network connectivity test from CLI
az network watcher test-connectivity \
  --resource-group rg-compute \
  --source-resource vm-web-01 \
  --dest-address myapp-sql-server.database.windows.net \
  --dest-port 1433

Summary

Good troubleshooting is systematic: establish a timeline, check service health, logs, and metrics, then narrow to the root cause. For VMs, Boot Diagnostics and Serial Console are your emergency tools. For App Service, start with the log stream and Kudu. For storage, check the firewall, SAS tokens, and RBAC. For databases, check firewall rules and the connection string format. Azure Network Watcher helps diagnose connectivity between these services.