Troubleshooting Azure Core Services
A systematic guide to diagnosing and resolving the most common failures across Compute, Storage, and Database services — the scenarios you'll face in production and in interviews.
General Troubleshooting Mindset
Before diving into specifics, establish a consistent process:
- Check recent changes — what changed just before the issue started? (Activity Log)
- Check the service health — is Azure itself having an incident? (Azure Service Health dashboard)
- Check logs — resource logs, diagnostic settings, Application Insights.
- Check metrics — CPU, memory, connection counts, request rates, error rates.
- Narrow scope — reproduce the issue in isolation (staging environment, minimal repro).
Compute Troubleshooting
VM Won't Start
| Symptom | Cause | Fix |
|---|---|---|
| Allocation failed | No capacity in the region/zone for the VM size | Resize to available SKU, try a different region |
| Boot stuck / black screen | OS corruption, bad kernel update, disk issue | Boot Diagnostics → Screenshot/Serial Console; attach disk to recovery VM |
| Disk full / boot failure | OS disk ran out of space during update | Attach disk to recovery VM, extend partition or remove disk-bloating files |
| VM shows Stopped (Deallocated) unexpectedly | Auto-shutdown policy, Azure maintenance, subscription issue | Check Activity Log for stop event and actor; check auto-shutdown settings |
Azure CLI — Boot Diagnostics
# Enable boot diagnostics
az vm boot-diagnostics enable \
--resource-group rg-compute \
--name vm-web-01 \
--storage https://mystorageacct.blob.core.windows.net
# Get the boot (serial) log; the screenshot is shown under Portal → VM → Boot diagnostics
az vm boot-diagnostics get-boot-log \
--resource-group rg-compute \
--name vm-web-01
# Connect via Serial Console (interactive emergency access)
# Azure Portal → VM → Serial Console
App Service Returns 503 / App Not Loading
| Symptom | Cause | Fix |
|---|---|---|
| 503 Service Unavailable | App crashed; startup failed; plan out of resources | Check log stream; Kudu Process Explorer; restart app |
| Application error (500) | Runtime exception in code | Application Insights exceptions; log stream |
| App "cold": slow first request | App was unloaded after an idle period, or the app recycled | Enable "Always On" (Basic tier and above) to keep the app loaded |
| Deployment rolled out with bug | Bad code released to production slot | Swap back to previous slot immediately |
Azure CLI — App Service Logs
# Stream live logs
az webapp log tail \
--resource-group rg-web \
--name myapp-prod
# Enable detailed error logging
az webapp log config \
--resource-group rg-web \
--name myapp-prod \
--web-server-logging filesystem \
--docker-container-logging filesystem \
--application-logging filesystem \
--level verbose
# Show the app's default (production slot) hostname
az webapp show \
--resource-group rg-web \
--name myapp-prod \
--query defaultHostName
Azure Function Failing
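- 429 Too Many Requests: Consumption plan scaling limit hit; buffer work in a queue and add retry logic on the caller side.
- Function times out: the Consumption plan defaults to a 5-minute timeout and caps at 10 minutes. Move to the Premium plan (30-minute default, configurable beyond that) or split the work into smaller steps.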
- Cold start causing errors: Downstream service timeout before function warms up. Use Premium plan or keep function warm with a Timer trigger.
- Function not triggering: Check storage account for trigger queue/blob. Queue Storage trigger requires a valid storage connection string.
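The retry logic mentioned for 429 responses can be sketched as a small shell helper. This is a minimal illustration, not a prescribed implementation: the function name `retry_with_backoff` and its argument order are assumptions made for this example, and it retries on any non-zero exit status rather than on 429 specifically.

```shell
# Run a command, retrying with exponential backoff while it fails.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
# (hypothetical helper; runs in a subshell so its variables stay local)
retry_with_backoff() (
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && exit 0                      # success: stop retrying
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      exit 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))                # double the wait each round
    attempt=$((attempt + 1))
  done
)
```

A caller could wrap an HTTP request with it, for example `retry_with_backoff 5 1 curl --fail --silent "$FUNCTION_URL"`, where `FUNCTION_URL` is a placeholder and `curl --fail` turns HTTP errors such as 429 into a non-zero exit status.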
Storage Troubleshooting
Blob Access Denied (403)
| Cause | Check | Fix |
|---|---|---|
| SAS token expired | Inspect the se (expiry) param in the URL | Regenerate SAS with adequate expiry |
| Storage firewall enabled | Portal → Storage Account → Networking | Add client IP or VNet to allowed list |
| Missing RBAC role | Check IAM on storage account / container | Assign Storage Blob Data Reader/Contributor |
| Account key rotated | App using old connection string | Update connection string in App Settings / Key Vault |
| Private endpoint, no DNS | Can resolve .blob.core.windows.net? | Configure private DNS zone or custom DNS |
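To check the first row quickly, the expiry can be pulled out of the SAS URL on the command line. The sketch below is one possible approach; the helper names `sas_expiry` and `sas_is_expired` are made up for this example, and the comparison assumes both timestamps use the full `YYYY-MM-DDTHH:MM:SSZ` form.

```shell
# Print the value of the 'se' (expiry) parameter of a SAS URL,
# decoding URL-encoded colons. Hypothetical helper for illustration.
sas_expiry() (
  query=${1#*\?}          # keep only the query string
  IFS='&'                 # split pairs on '&' (scoped to this subshell)
  set -f                  # no filename globbing while splitting
  for pair in $query; do
    case $pair in
      se=*)
        printf '%s\n' "${pair#se=}" | sed 's/%3[Aa]/:/g'
        exit 0 ;;
    esac
  done
  exit 1                  # no 'se' parameter found
)

# Succeed if the SAS expiry is earlier than the reference time
# (second argument, defaulting to the current UTC time).
# Assumes both timestamps use the YYYY-MM-DDTHH:MM:SSZ form.
sas_is_expired() (
  expiry=$(sas_expiry "$1") || exit 2
  now=${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  digits() { printf '%s' "$1" | tr -cd '0-9'; }
  [ "$(digits "$expiry")" -lt "$(digits "$now")" ]
)
```

Running `sas_is_expired "$SAS_URL" && echo "SAS expired"` against the failing URL separates an expired token from the firewall and RBAC causes in the other rows.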
File Share Mount Failure (Linux)
- Port 445 blocked by ISP or corporate firewall. Solution: private endpoint + VPN/ExpressRoute, or use NFS instead of SMB.
- Wrong storage account key. Re-fetch key: az storage account keys list.
- Wrong SMB version: use vers=3.0 in mount options.
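As a rough illustration of where the SMB version option goes, the helper below prints (but does not run) a mount command. The function name, share name, mount point, and credentials file path are all assumptions for this example; the credentials file is expected to contain `username=<storage account>` and `password=<account key>` lines.

```shell
# Print (not run) a mount command for an Azure file share over SMB.
# Arguments: storage account name, share name, mount point, credentials file.
# Hypothetical helper; review the printed command before executing it.
azure_files_mount_cmd() {
  printf 'sudo mount -t cifs //%s.file.core.windows.net/%s %s -o vers=3.0,credentials=%s,serverino\n' \
    "$1" "$2" "$3" "$4"
}

azure_files_mount_cmd mystorageacct myshare /mnt/myshare /etc/smbcredentials/mystorageacct.cred
```

Keeping the key in a credentials file rather than on the command line avoids leaking it through shell history.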
Database Troubleshooting
Azure SQL Connection Failure
| Error | Cause | Fix |
|---|---|---|
| Cannot open server 'x' requested by the login | Firewall not allowing client IP | Add IP to SQL Server firewall rules |
| Login failed for user | Wrong credentials / AAD auth misconfigured | Verify admin username; reset password if needed |
| SSL/TLS errors | Client driver too old; Encrypt=False | Use Encrypt=True;TrustServerCertificate=False; update driver |
| Connection timeout | DTU limit exceeded; max connections reached | Scale up DTU/vCore; check Elastic Pool limits |
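The TLS row can be checked mechanically before a deployment. The following is a minimal sketch under stated assumptions: `check_sql_conn_string` is a hypothetical helper, and it only recognizes the literal `Encrypt=True` and `TrustServerCertificate=False` settings from the table, not other values a newer driver might accept.

```shell
# Warn if a SQL connection string lacks the TLS settings from the table.
# Hypothetical helper; matching is case-insensitive and ignores spaces.
check_sql_conn_string() (
  cs=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d ' ')
  status=0
  case ";$cs;" in
    *";encrypt=true;"*) ;;
    *) echo "WARN: Encrypt=True not set"; status=1 ;;
  esac
  case ";$cs;" in
    *";trustservercertificate=false;"*) ;;
    *) echo "WARN: TrustServerCertificate=False not set"; status=1 ;;
  esac
  exit "$status"
)
```

A non-zero exit status makes the check usable as a gate in a deployment script.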
Azure CLI — SQL Diagnostics
# List firewall rules
az sql server firewall-rule list \
--resource-group rg-database \
--server myapp-sql-server \
--output table
# Check DTU consumption (Metrics in Portal, or via CLI)
az monitor metrics list \
--resource /subscriptions/{sub}/resourceGroups/rg-database/providers/Microsoft.Sql/servers/myapp-sql-server/databases/app-db \
--metric "dtu_consumption_percent" \
--interval PT1M \
--output table
# Reset admin password
az sql server update \
--resource-group rg-database \
--name myapp-sql-server \
--admin-password "NewP@ssw0rd2024!"
Cosmos DB Query Performance Problems
- 429 Request Rate Too Large: Provisioned RU/s has been exceeded. Increase provisioned throughput or enable autoscale.
- Cross-partition queries are slow: Query doesn't include the partition key in the filter → full scan. Add partition key to WHERE clause.
- Hot partition: All traffic routed to one partition key value. Redesign partition key with higher cardinality.
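One way to confirm a suspected hot partition is to measure how skewed the partition key values are in a sample of requests. The sketch below assumes such a sample already exists as one key value per line (for instance extracted from application logs); the function name `hottest_partition` and the `tenant-*` keys are invented for the example.

```shell
# Read one partition key value per line on stdin and print the most
# frequent key with its share of the sample (integer percent).
# Hypothetical helper; ties between keys are resolved arbitrarily.
hottest_partition() {
  awk '
    { count[$0]++; total++ }
    END {
      if (total == 0) exit 1
      for (k in count) if (count[k] > max) { max = count[k]; key = k }
      printf "%s %d%%\n", key, (max * 100) / total
    }'
}

printf 'tenant-a\ntenant-a\ntenant-a\ntenant-b\n' | hottest_partition
```

If a single key carries most of the sample, that supports redesigning the partition key rather than only adding throughput.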
Networking / Connectivity Troubleshooting
NSG Blocking Traffic
Azure CLI
# Check effective NSG rules for a NIC
az network nic list-effective-nsg \
--resource-group rg-compute \
--name vm-web-01VMNic \
--output table
# Use Network Watcher IP Flow Verify (test if NSG allows traffic)
az network watcher test-ip-flow \
--direction Inbound \
--protocol TCP \
--local 10.0.0.4:80 \
--remote "203.0.113.10:*" \
--vm vm-web-01 \
--resource-group rg-compute
DNS Not Resolving Private Endpoints
- Private DNS zone (privatelink.blob.core.windows.net) must be linked to the VNet.
- Custom DNS servers must forward Azure DNS queries to 168.63.129.16.
- Test from inside the VNet: nslookup mystorageacct.blob.core.windows.net should return a private IP (10.x.x.x).
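The final check can be scripted by testing whether the resolved address falls in a private (RFC 1918) range. This is a small sketch, assuming the input is a well-formed IPv4 address; `is_private_ipv4` is a hypothetical helper name.

```shell
# Succeed if the given IPv4 address is in an RFC 1918 private range
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Hypothetical helper.
is_private_ipv4() (
  ip=$1
  case $ip in
    10.*|192.168.*) exit 0 ;;
    172.*)
      second=${ip#172.}           # drop the first octet
      second=${second%%.*}        # keep only the second octet
      [ "$second" -ge 16 ] && [ "$second" -le 31 ] && exit 0 ;;
  esac
  exit 1
)
```

On a Linux VM inside the VNet, the address could be obtained with something like `getent hosts mystorageacct.blob.core.windows.net | awk '{ print $1; exit }'` and passed to the helper; a public address means the private DNS zone is not being used.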
Useful Diagnostic Commands Reference
Quick Reference
# Activity Log — recent operations on resource group
az monitor activity-log list \
--resource-group rg-web \
--start-time 2024-01-15T00:00:00Z \
--output table
# List all resources in a resource group
az resource list --resource-group rg-web --output table
# Show full resource properties (availability status itself is under Resource Health in the Portal)
az resource show \
--ids /subscriptions/{sub}/resourceGroups/rg-web/providers/Microsoft.Web/sites/myapp \
--include-response-body
# Network connectivity test from CLI
az network watcher test-connectivity \
--resource-group rg-compute \
--source-resource vm-web-01 \
--dest-address myapp-sql-server.database.windows.net \
--dest-port 1433
Summary
Good troubleshooting is systematic: establish timeline, check logs, check metrics, narrow to root cause. For VMs — Boot Diagnostics and Serial Console are your emergency tools. For App Service — log stream and Kudu. For storage — firewall and SAS/RBAC. For databases — firewall rules and connection string format. Azure Network Watcher covers connectivity gaps between all services.