Monitoring & Logging
Gain full observability over your AKS clusters — from Container Insights and Log Analytics to Prometheus, Grafana dashboards, metric alerts, control plane diagnostics, and cost monitoring.
🧒 Simple Explanation (ELI5)
Monitoring your AKS cluster is like having a dashboard in your car. You need to see:
- Speed (CPU usage) — is the engine cruising at a healthy pace, or is it redlining?
- Fuel gauge (memory) — are pods running out of resources and crashing?
- Engine temperature (error rates) — is something overheating and about to fail?
- Warning lights (alerts) — automatic notifications when something goes wrong so you don't have to stare at the dashboard 24/7.
- Trip computer (logs) — a detailed record of everything that happened so you can figure out why something broke.
Without monitoring, you're driving blind — you only discover problems when the engine stalls (users complain). With proper observability, you catch issues before they affect anyone.
🔧 Technical Explanation
1. Azure Monitor for Containers (Container Insights)
Container Insights is Azure's native monitoring solution for AKS. It deploys the Azure Monitor Agent (the ama-logs DaemonSet, which replaced the older OMS agent) on every node to collect metrics, logs, and inventory data.
# Enable Container Insights on an existing cluster
az aks enable-addons -a monitoring -g myRG -n myAKS

# Or specify a particular Log Analytics workspace
WORKSPACE_ID=$(az monitor log-analytics workspace show -g myRG -n myWorkspace --query id -o tsv)
az aks enable-addons -a monitoring -g myRG -n myAKS --workspace-resource-id "$WORKSPACE_ID"

# Enable during cluster creation
az aks create -g myRG -n myAKS \
  --enable-addons monitoring \
  --workspace-resource-id "$WORKSPACE_ID" \
  --node-count 3 --generate-ssh-keys
What Container Insights collects:
| Data Type | Source | Table in Log Analytics |
|---|---|---|
| Container stdout/stderr logs | Node-level /var/log/containers | ContainerLogV2 |
| Node performance (CPU, memory, disk) | cAdvisor + kubelet | InsightsMetrics |
| Pod/container inventory | Kubernetes API | KubePodInventory |
| Node inventory | Kubernetes API | KubeNodeInventory |
| Kubernetes events | Kubernetes API | KubeEvents |
Container logs can be expensive at scale. Use the ConfigMap container-azm-ms-agentconfig to exclude namespaces (like kube-system), filter log levels, or reduce collection frequency. This can cut Log Analytics costs by 40-60%.
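As a sketch of that filtering, a trimmed container-azm-ms-agentconfig ConfigMap might look like the following (the excluded namespaces are example values; the full schema supports additional settings such as per-namespace stderr rules and collection intervals):

```yaml
# Applied in kube-system; the ama-logs agents pick it up and restart
# automatically within a few minutes.
kind: ConfigMap
apiVersion: v1
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        # Skip stdout logs from these namespaces (example list)
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]
```

Apply it with kubectl apply -f, then verify in Log Analytics that ContainerLogV2 ingestion from the excluded namespaces stops.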
2. Log Analytics & KQL Queries
Log Analytics is the underlying data platform. You query it using Kusto Query Language (KQL).
// Find all OOMKilled containers in the last 24 hours
KubeEvents
| where TimeGenerated > ago(24h)
| where Reason == "OOMKilling"
| project TimeGenerated, Namespace, Name, Message
| order by TimeGenerated desc

// Average CPU usage per node over the last hour
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Name == "cpuUsageNanoCores"
| summarize AvgCPU = avg(Val) by Computer, bin(TimeGenerated, 5m)
| render timechart

// Top 10 pods by memory consumption
InsightsMetrics
| where TimeGenerated > ago(30m)
| where Name == "memoryWorkingSetBytes"
| extend PodName = tostring(parse_json(Tags).pod)
| summarize AvgMemMB = avg(Val / 1024 / 1024) by PodName
| top 10 by AvgMemMB desc

// Container restart count by namespace
KubePodInventory
| where TimeGenerated > ago(24h)
| summarize Restarts = max(ContainerRestartCount) by Namespace, Name
| where Restarts > 0
| order by Restarts desc
3. Key Metrics to Monitor
| Metric | Healthy Threshold | Alert Threshold |
|---|---|---|
| Node CPU % | < 60% | > 80% for 5 min |
| Node Memory % | < 70% | > 85% for 5 min |
| Pod restart count | 0 | > 3 in 10 min |
| Disk IOPS / throughput | Within SKU limits | > 90% of limit |
| Network in/out per node | Within expected | Sudden spikes or drops |
| Failed pod count | 0 | > 0 for 5 min |
| Pending pods | 0 (except during scale-out) | > 0 for 10 min |
4. Pre-built Azure Portal Dashboards
Once Container Insights is enabled, navigate to Azure Portal → AKS Cluster → Monitoring → Insights to find:
- Cluster tab — overall CPU/memory utilization, node count, pod count
- Nodes tab — per-node metrics, drill down to individual containers
- Controllers tab — group metrics by Deployment, DaemonSet, StatefulSet
- Containers tab — search/filter containers, see live logs
- Reports (Workbooks) — pre-built Azure Monitor Workbooks for disk, network, GPU, and billing
5. Prometheus & Grafana
Azure offers a managed Prometheus service (an Azure Monitor workspace with Prometheus support) that removes the need to run your own Prometheus server.
# Create an Azure Monitor workspace (Prometheus-compatible)
az monitor account create -g myRG -n myPrometheusWorkspace -l eastus
# Link it to AKS
PROM_ID=$(az monitor account show -g myRG -n myPrometheusWorkspace --query id -o tsv)
az aks update -g myRG -n myAKS \
--enable-azure-monitor-metrics \
--azure-monitor-workspace-resource-id "$PROM_ID"
# Create a Grafana instance and link to Prometheus
az grafana create -g myRG -n myGrafana -l eastus
az grafana update -g myRG -n myGrafana \
--azure-monitor-workspace-integrations "[{azureMonitorWorkspaceResourceId:$PROM_ID}]"

For custom application metrics, annotate pods so managed Prometheus scrapes them (note that annotation-based scraping must first be enabled in the ama-metrics-settings-configmap ConfigMap in kube-system):
# Pod annotations for Prometheus scraping
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

6. Alerts & Action Groups
Azure Monitor alerts trigger when metrics or log conditions are met. Configure an action group to define who gets notified and how.
# Create an action group (email + Teams webhook)
az monitor action-group create \
  -g myRG -n aks-ops-team \
  --short-name AKSOps \
  --email-receiver name=OnCall email=oncall@company.com \
  --webhook-receiver name=Teams uri="https://company.webhook.office.com/webhookb2/..."

# Create a metric alert: node CPU > 80% for 5 minutes
AKS_RESOURCE_ID=$(az aks show -g myRG -n myAKS --query id -o tsv)
az monitor metrics alert create \
  -g myRG -n "aks-node-cpu-high" \
  --scopes "$AKS_RESOURCE_ID" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action aks-ops-team \
  --severity 2 \
  --description "AKS node CPU exceeds 80% for 5 minutes"
# Create a log alert: OOMKilled events
az monitor scheduled-query create \
  -g myRG -n "aks-oomkilled-alert" \
  --scopes "$WORKSPACE_ID" \
  --condition "count 'OOMEvents' > 0" \
  --condition-query OOMEvents="KubeEvents | where Reason == 'OOMKilling'" \
  --window-size 10m \
  --evaluation-frequency 5m \
  --action-groups aks-ops-team \
  --severity 1
Common alert types for AKS:
| Alert | Type | Condition | Severity |
|---|---|---|---|
| Node CPU high | Metric | CPU > 80% for 5 min | Warning (2) |
| Node memory high | Metric | Memory > 85% for 5 min | Warning (2) |
| OOMKilled pods | Log | OOMKilling event count > 0 | Error (1) |
| Node not ready | Metric | Ready node count < expected | Critical (0) |
| Persistent volume at capacity | Metric | PV usage > 90% | Warning (2) |
| Pod in failed state | Log | Failed pod count > 0 for 5 min | Error (1) |
7. Diagnostic Settings — Control Plane Logs
By default, AKS does not export control plane logs. You must enable diagnostic settings to send them to Log Analytics, a storage account, or Event Hub.
# Enable all control plane log categories
az monitor diagnostic-settings create \
--name aks-diagnostics \
--resource "$AKS_RESOURCE_ID" \
--workspace "$WORKSPACE_ID" \
--logs '[
{"category":"kube-apiserver","enabled":true},
{"category":"kube-controller-manager","enabled":true},
{"category":"kube-scheduler","enabled":true},
{"category":"kube-audit","enabled":true},
{"category":"kube-audit-admin","enabled":true},
{"category":"cluster-autoscaler","enabled":true},
{"category":"guard","enabled":true}
]'

| Log Category | What It Contains | When You Need It |
|---|---|---|
| kube-apiserver | API server request/response logs | Debugging API errors, RBAC denials |
| kube-audit | Full audit log (every API request) | Compliance, security investigations |
| kube-audit-admin | Write operations only (subset of kube-audit) | Lighter audit trail |
| kube-controller-manager | Controller reconciliation loops | Debugging stuck Deployments/ReplicaSets |
| kube-scheduler | Scheduling decisions | Debugging pending pods, node affinity |
| cluster-autoscaler | Scale-up/scale-down decisions | Understanding why nodes aren't scaling |
| guard | Azure AD auth webhook logs | Debugging Azure AD/RBAC login failures |
The kube-audit category generates massive data volumes on busy clusters. For cost control, consider using kube-audit-admin (write operations only) or set a retention policy on the Log Analytics table. A 100-node cluster can generate 10+ GB/day of audit logs.
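Before tuning, it helps to measure which tables actually drive ingestion. A sketch of a query against the built-in Usage table (Quantity is reported in MB, hence the division):

```kusto
// Daily billable ingestion per table over the last week
Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType, bin(TimeGenerated, 1d)
| order by IngestedGB desc
```

If AzureDiagnostics or ContainerLogV2 dominates the result, that is where filtering or a switch to kube-audit-admin pays off first.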
8. Application-Level Observability
- Application Insights — instrument your code with the Application Insights SDK (or auto-instrumentation for .NET, Java, Node.js, Python). Tracks request rates, response times, dependency calls, exceptions, and custom metrics.
- Distributed Tracing — correlates requests across microservices using W3C TraceContext headers. See end-to-end transaction flows in the Application Map.
- Live Metrics Stream — real-time view of incoming requests, outgoing dependencies, and exceptions with sub-second latency.
# Auto-instrument a Java app on AKS using Application Insights agent
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
      - name: api
        image: myacr.azurecr.io/api:v2
        env:
        - name: APPLICATIONINSIGHTS_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: appinsights-secret
              key: connectionString
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/opt/applicationinsights-agent.jar"
        volumeMounts:
        - name: ai-agent
          mountPath: /opt/applicationinsights-agent.jar
          subPath: applicationinsights-agent.jar
      initContainers:
      - name: ai-jar
        image: mcr.microsoft.com/applicationinsights/agent:3.4.0
        command: ['cp', '/agent/applicationinsights-agent.jar', '/opt/']
        volumeMounts:
        - name: ai-agent
          mountPath: /opt
      volumes:
      - name: ai-agent
        emptyDir: {}

9. Cost Monitoring
- AKS Cost Analysis (Preview) — in Azure Portal under the cluster's Cost Analysis blade, view costs broken down by namespace, label, or node pool.
- Right-sizing recommendations — Container Insights shows CPU/memory request vs actual usage, highlighting over-provisioned pods.
- Azure Advisor — provides recommendations like "reduce node pool size" or "switch to spot instances for non-critical workloads."
// Find pods requesting far more CPU than they use (right-sizing candidates)
InsightsMetrics
| where TimeGenerated > ago(7d)
| where Name == "cpuUsageNanoCores"
| extend PodName = tostring(parse_json(Tags).pod)
| extend Namespace = tostring(parse_json(Tags).namespace)
| summarize P95CPU_mCores = percentile(Val / 1000000, 95) by PodName, Namespace
| join kind=inner (
KubePodInventory
| where TimeGenerated > ago(1h)
| extend CpuRequest_mCores = toint(parse_json(ContainerResourceRequestCPU))
| project PodName = Name, CpuRequest_mCores
| distinct PodName, CpuRequest_mCores
) on PodName
| extend WasteRatio = round((CpuRequest_mCores - P95CPU_mCores) / CpuRequest_mCores * 100, 1)
| where WasteRatio > 50
| order by WasteRatio desc

10. kubectl Debugging Commands
While Azure Monitor provides long-term data, kubectl is your first tool for real-time debugging:
# Real-time resource usage per node
kubectl top nodes

# Real-time CPU/memory per pod in a namespace
kubectl top pods -n production --sort-by=memory

# Check recent events for a specific pod
kubectl describe pod myapp -n production | tail -30

# Stream live logs from a container
kubectl logs -f deployment/api-service -n production -c api

# View events cluster-wide, sorted by time
kubectl get events --sort-by='.lastTimestamp' -A | tail -20
AKS deploys Metrics Server by default. If kubectl top returns "Metrics API not available," check that the metrics-server deployment is running in kube-system: kubectl get deploy metrics-server -n kube-system.
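When scripting around kubectl top in CI or cron-based checks, its columnar output is easy to post-process with awk. A minimal sketch; the node names and numbers below are canned sample output standing in for `kubectl top nodes --no-headers`, not real cluster data:

```shell
# Flag nodes whose CPU% column exceeds a threshold.
# Real use: kubectl top nodes --no-headers | awk ...
sample_output='aks-nodepool1-vmss000000   1834m   92%   5120Mi   63%
aks-nodepool1-vmss000001   412m    21%   4012Mi   49%'

echo "$sample_output" | awk -v limit=80 '
  { gsub(/%/, "", $3) }                       # strip the % sign from the CPU% field
  $3 + 0 > limit { print $1 " CPU at " $3 "%" }'
# Prints: aks-nodepool1-vmss000000 CPU at 92%
```

The same pattern works for the memory% column (field 5) by swapping the field index.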
⌨️ Hands-on
Lab 1: Enable Container Insights & Run KQL Queries
# 1. Create a Log Analytics workspace (if you don't have one)
az monitor log-analytics workspace create -g myRG -n aks-logs -l eastus
WORKSPACE_ID=$(az monitor log-analytics workspace show -g myRG -n aks-logs --query id -o tsv)
# 2. Enable Container Insights
az aks enable-addons -a monitoring -g myRG -n myAKS --workspace-resource-id "$WORKSPACE_ID"
# 3. Verify OMS agent pods are running
kubectl get daemonset ama-logs -n kube-system
# 4. Generate some data — deploy a memory-hungry pod that will get OOMKilled
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      limits:
        memory: "50Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "100M", "--vm-hang", "1"]
EOF
# 5. Wait ~2 minutes, then check events
kubectl get events --field-selector involvedObject.name=oom-test
# 6. Go to Azure Portal → Log Analytics → Logs, run this KQL:
# KubeEvents | where Reason == "OOMKilling" | project TimeGenerated, Name, Message

Lab 2: Create a CPU Alert
# 1. Create an action group
az monitor action-group create \
-g myRG -n "aks-alerts" \
--short-name AKSAlert \
--email-receiver name=Admin email=admin@company.com
# 2. Create a metric alert for node CPU > 80%
AKS_ID=$(az aks show -g myRG -n myAKS --query id -o tsv)
az monitor metrics alert create \
-g myRG -n "node-cpu-high" \
--scopes "$AKS_ID" \
--condition "avg node_cpu_usage_percentage > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--action aks-alerts \
--severity 2
# 3. Generate CPU load to test the alert
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "2"
    command: ["stress"]
    args: ["--cpu", "4", "--timeout", "600"]
EOF
# 4. Watch node CPU rise
kubectl top nodes
# Wait 5+ minutes for the alert to fire, then check Azure Portal → Monitor → Alerts

Lab 3: Check kubectl top & Describe Events
# View node-level resource usage
kubectl top nodes

# View pod-level resource usage sorted by CPU
kubectl top pods -A --sort-by=cpu | head -20

# Describe a specific pod to see events, conditions, and resource usage
kubectl describe pod cpu-stress

# View all warning events in the cluster
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'

# Compare kubectl data with Azure Monitor:
# "kubectl top" shows real-time usage, Azure Monitor shows historical trends.
# Use kubectl for immediate debugging, Azure Monitor for pattern analysis.
🐛 Debugging Scenarios
Scenario 1: Container Insights Not Showing Data
Symptom: You navigate to Azure Portal → AKS → Insights, but the dashboards are empty — no metrics, no logs, no inventory.
# Step 1: Is the monitoring add-on enabled?
az aks show -g myRG -n myAKS --query "addonProfiles.omsagent.enabled"
# If null or false → az aks enable-addons -a monitoring ...

# Step 2: Check the Azure Monitor Agent (ama-logs) DaemonSet status
kubectl get daemonset ama-logs -n kube-system
# Desired=3, Current=3, Ready=3 → agents are running
# Ready=0 → agents are crashing

# Step 3: Check agent pod logs for errors
kubectl logs daemonset/ama-logs -n kube-system --tail=50
# Look for "Unauthorized", "workspace not found", or connectivity errors

# Step 4: Verify the Log Analytics workspace exists and is linked
az aks show -g myRG -n myAKS --query "addonProfiles.omsagent.config.logAnalyticsWorkspaceResourceID"
az monitor log-analytics workspace show --ids "<WORKSPACE_ID>" --query "provisioningState"

# Step 5: Check network connectivity (private cluster scenario)
# The agent needs outbound connectivity to:
#   *.ods.opinsights.azure.com
#   *.oms.opinsights.azure.com
#   *.monitoring.azure.com
# If using Private Link, ensure the Private Link Scope includes these endpoints

# Step 6: Data ingestion delay — wait 10-15 minutes after enabling
# Run a simple KQL query to check: Heartbeat | take 5

# Fix: Re-enable the add-on if agent pods aren't created, fix network
# rules if agents can't reach Azure, verify the workspace hasn't been deleted.
Scenario 2: Alerts Not Firing
Symptom: Node CPU has been at 90% for 30 minutes, but no alert email arrives.
# Step 1: Check if the alert rule exists and is enabled
az monitor metrics alert list -g myRG -o table
# Verify: state = "Enabled", not "Disabled"
# Step 2: Check alert scope — is it targeting the right AKS resource?
az monitor metrics alert show -g myRG -n "node-cpu-high" \
--query "{scopes:scopes, condition:criteria.allOf[0]}"
# Step 3: Verify the metric name is correct
# Common mistake: using "cpuUsagePercentage" instead of "node_cpu_usage_percentage"
az monitor metrics list-definitions --resource "$AKS_ID" -o table
# Step 4: Check the action group is configured correctly
az monitor action-group show -g myRG -n aks-alerts \
--query "{email:emailReceivers, webhook:webhookReceivers}"
# Step 5: Check action group test — send a test notification
az monitor action-group test-notifications create \
-g myRG --action-group-name aks-alerts \
--alert-type "budget" \
--email-receiver name=Admin email=admin@company.com
# Step 6: Check Azure Monitor alert history
# Azure Portal → Monitor → Alerts → check "Alert history"
# If alert fired but email not received → check spam folder, action group config
# Fix: Correct the metric name, enable the rule, verify action group
# receivers, check email/webhook endpoints are valid and reachable.

Scenario 3: Can't See Control Plane Logs
Symptom: You need to investigate API server errors or audit who deleted a namespace, but there are no control plane logs in Log Analytics.
# Step 1: Check if diagnostic settings exist for the AKS resource
az monitor diagnostic-settings list --resource "$AKS_RESOURCE_ID" -o table
# If empty → no diagnostic settings configured
# Step 2: Create diagnostic settings (see section 7 above)
az monitor diagnostic-settings create \
--name aks-diagnostics \
--resource "$AKS_RESOURCE_ID" \
--workspace "$WORKSPACE_ID" \
--logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-audit-admin","enabled":true},{"category":"cluster-autoscaler","enabled":true}]'
# Step 3: Wait 5-10 minutes for logs to start flowing
# Step 4: Verify data is arriving
# KQL: AzureDiagnostics | where Category == "kube-apiserver" | take 5
# Step 5: If diagnostic settings exist but no data appears:
# - Check the workspace is correct and active
# - Verify the categories are set to enabled: true
# - Check for Azure Policy restricting diagnostic settings
# Common misconception: Container Insights (monitoring add-on) does NOT
# collect control plane logs. You must configure diagnostic settings separately.
# Step 6: Query audit logs to find who deleted the namespace
# KQL:
# AzureDiagnostics
# | where Category == "kube-audit-admin"
# | where log_s contains "namespaces" and log_s contains "delete"
# | project TimeGenerated, log_s
# | order by TimeGenerated desc

🎯 Interview Questions
Beginner
Q: What is Container Insights and how do you enable it?
Container Insights is Azure Monitor's native monitoring solution for AKS. It deploys an agent (Azure Monitor Agent) as a DaemonSet on each node that collects container logs, node/pod performance metrics, and Kubernetes inventory data. Enable it with az aks enable-addons -a monitoring -g myRG -n myAKS. Data is sent to a Log Analytics workspace where you can query it with KQL, visualize it in built-in dashboards, and create alerts.
Q: How does Container Insights differ from Prometheus, and when would you use each?
Container Insights is Azure-native — it collects logs AND metrics, stores them in Log Analytics (KQL), and provides pre-built Azure Portal dashboards. Prometheus is a CNCF metrics-only system — it scrapes metrics endpoints and is queried with PromQL. Azure now offers managed Prometheus (Azure Monitor workspace) that removes the need to self-host. Use Container Insights for logs plus the Azure-native experience, Prometheus for custom application metrics plus Grafana dashboards. Many teams use both together.
Q: What is KQL and what is it used for in AKS monitoring?
KQL (Kusto Query Language) is the query language for Azure Log Analytics and Azure Data Explorer. It's used to query container logs, Kubernetes events, performance metrics, and audit logs stored in Log Analytics. KQL is pipe-based (like PowerShell) — you start with a table, filter with where, transform with extend/project, aggregate with summarize, and visualize with render. Example: KubeEvents | where Reason == "OOMKilling" | project TimeGenerated, Name.
Q: What's the difference between metric alerts and log alerts?
Metric alerts evaluate numeric platform metrics at regular intervals (e.g., "node CPU > 80% for 5 min"). They're fast (near-real-time, 1-minute evaluation). Log alerts run a KQL query against Log Analytics data (e.g., "count of OOMKilled events > 0 in the last 10 min"). They're more flexible but have higher latency (5-15 min) due to log ingestion delay. Use metric alerts for infrastructure thresholds, log alerts for event-based conditions.
Q: What does kubectl top show, and what does it require?
kubectl top nodes shows real-time CPU and memory usage per node. kubectl top pods shows per-pod resource usage. It requires the Metrics Server to be running in the cluster, which AKS deploys by default. Metrics Server provides the Metrics API that kubectl top queries. It shows current usage only (no history) — for historical data, use Container Insights or Prometheus.
Intermediate
Q: What are AKS diagnostic settings and which log categories do they cover?
Diagnostic settings export AKS control plane logs to Log Analytics, Storage, or Event Hub. Categories include: kube-apiserver (API request logs), kube-audit (full audit — every API call), kube-audit-admin (write operations only), kube-controller-manager, kube-scheduler, cluster-autoscaler, and guard (Azure AD auth). These are separate from Container Insights — you must enable them via az monitor diagnostic-settings create. The kube-audit category is essential for compliance but can be very high-volume.
Q: How would you design alert escalation for a production AKS cluster?
Create multiple action groups with escalating severity: (1) Warning alerts (CPU > 70%) → email the team Slack channel. (2) Critical alerts (CPU > 90% or OOMKilled) → page the on-call engineer via PagerDuty webhook. (3) Severity 0 alerts (node not ready, multiple pod failures) → SMS + phone call via PagerDuty, auto-create incident ticket via Logic App webhook. Use Azure Monitor's severity levels (0-4) to map alerts to action groups. Add suppression rules for maintenance windows.
Q: How do you control Log Analytics costs for a high-volume AKS environment?
Several strategies: (1) Configure the agent ConfigMap to exclude verbose namespaces (kube-system, monitoring). (2) Reduce log collection frequency from the default. (3) Use kube-audit-admin instead of kube-audit for diagnostic logs. (4) Set data retention policies (30 days instead of default 31-90). (5) Use Basic tier for Log Analytics tables that don't need full analytics. (6) Archive cold data to Storage accounts. (7) Use Data Collection Rules to filter at ingestion time. A well-tuned config can reduce costs 40-60%.
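To make the savings concrete, a back-of-the-envelope calculation helps; the ingestion volume and per-GB price below are hypothetical placeholders, since actual Log Analytics pricing varies by region and tier:

```shell
# Estimate monthly ingestion cost before and after a 50% log reduction.
GB_PER_DAY=20          # hypothetical current ingestion
PRICE_PER_GB=2.30      # placeholder pay-as-you-go price, USD
awk -v g="$GB_PER_DAY" -v p="$PRICE_PER_GB" 'BEGIN {
  before = g * p * 30           # 30-day month
  after  = before * 0.5         # assume tuning halves ingestion
  printf "before: $%.2f/month  after: $%.2f/month\n", before, after
}'
# Prints: before: $1380.00/month  after: $690.00/month
```

Run the same arithmetic against real numbers from the workspace's Usage table to size the opportunity before committing to a filtering project.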
Q: How does Azure Monitor managed Prometheus work with Grafana?
Azure Monitor managed service for Prometheus creates an Azure Monitor workspace that acts as a Prometheus-compatible remote-write endpoint. Enable it on AKS with --enable-azure-monitor-metrics, which deploys a collector that scrapes Prometheus endpoints on pods (using annotations) and node-level metrics. Create an Azure Managed Grafana instance and link it to the Prometheus workspace. Grafana can then use PromQL to query these metrics. This eliminates managing Prometheus servers, storage, and HA — Azure handles it all.
Q: What is right-sizing and how do you identify over-provisioned pods?
Right-sizing means setting CPU/memory requests and limits to match actual usage, avoiding waste. To identify over-provisioned pods: (1) Use Container Insights' "Controllers" view to compare requests vs actual usage. (2) Run a KQL query that joins InsightsMetrics (actual usage) with KubePodInventory (requested resources) and calculates the waste ratio. (3) Use Azure Advisor's AKS recommendations. A pod requesting 1 CPU but consistently using 100m is wasting 90% of its allocation, preventing other pods from scheduling on that node.
Scenario-Based
Q: An API service's latency increased after a new deployment. How do you investigate?
1. Check kubectl top pods to see if the new pods are consuming more CPU/memory. 2. In Container Insights, compare CPU/memory graphs before and after deployment. 3. If using Application Insights, check the Performance blade for slow dependencies or increased exception rates. 4. Use distributed tracing to find which downstream service call is slow. 5. Query container logs: ContainerLogV2 | where PodName contains "api-service" | where LogMessage contains "error" or LogMessage contains "timeout". 6. Check if the pod is hitting resource limits (throttled CPU). 7. Review Kubernetes events for OOMKilled or probe failures. Most likely causes: missing resource limits causing CPU throttling, a bad database query, or a dependency service issue.
Q: Your Log Analytics bill suddenly spiked. How do you find the cause and bring costs down?
1. Check data ingestion volume: Azure Portal → Log Analytics → Usage and estimated costs → Data volume by solution. 2. Identify which tables are largest — likely ContainerLogV2 (application logs) or AzureDiagnostics (audit logs). 3. If ContainerLogV2: check if a verbose application was deployed (debug logging in production), or if a crash loop is generating excessive logs. Apply ConfigMap filtering to exclude noisy namespaces. 4. If AzureDiagnostics: check if full kube-audit was enabled (switch to kube-audit-admin). 5. For immediate relief: set a daily cap on the workspace, enable Data Collection Rules to filter at source. 6. Long-term: implement Basic tier for cold tables, set retention policies, archive to storage.
Q: Pods are stuck Pending and the cluster isn't adding nodes. How do you troubleshoot?
1. Check current node pool limits: az aks nodepool show -g myRG --cluster-name myAKS -n nodepool1 --query "{min:minCount,max:maxCount,current:count}". 2. If max is reached, decide: increase the max count (az aks nodepool update --max-count 20) or optimize existing workloads. 3. Check if pending pods have resource requests that no available node can satisfy (GPU, specific VM size). 4. Check cluster-autoscaler logs in diagnostic settings: AzureDiagnostics | where Category == "cluster-autoscaler". 5. Verify subscription VM quotas aren't exhausted: az vm list-usage -l eastus. 6. Consider adding a second node pool with a different VM size for the pending workload's needs.
Q: An auditor asks who made changes to the cluster over the last 90 days. How do you respond?
1. Ensure kube-audit or kube-audit-admin diagnostic settings have been enabled (this should have been done preemptively). 2. Query: AzureDiagnostics | where Category == "kube-audit-admin" | where TimeGenerated > ago(90d) | extend User = tostring(parse_json(log_s).user.username) | extend Verb = tostring(parse_json(log_s).verb) | extend Resource = tostring(parse_json(log_s).objectRef.resource) | summarize Actions=count() by User, Verb, Resource | order by Actions desc. 3. For Azure-level operations (who scaled the cluster, changed config): use the Azure Activity Log. 4. Cross-reference Azure AD sign-in logs for login times. 5. Export results to CSV for the auditor. If audit logs weren't enabled, you can only provide Azure Activity Log data — a lesson to enable diagnostic settings from day one.
Q: Grafana dashboards show gaps in Prometheus metrics. How do you debug?
1. Check if the Prometheus collector pods are running: kubectl get pods -n kube-system -l app.kubernetes.io/name=ama-metrics. 2. If pods restarted, check logs for OOM or connectivity issues. 3. Verify the Azure Monitor workspace is healthy: Azure Portal → Monitor → Prometheus. 4. Check if the application pods were restarted or rescheduled (gaps during restarts are normal). 5. Verify pod annotations (prometheus.io/scrape: "true") were not removed during a deployment. 6. Check network policies — a new deny-all rule might be blocking the collector from scraping pod metrics endpoints. 7. Verify the scrape interval hasn't been changed to a very long value.
🌍 Real-World Use Case
Proactive Monitoring at a SaaS Platform Provider
A B2B SaaS company runs 300+ microservices across 5 AKS clusters serving 10 million daily API requests. Before implementing comprehensive monitoring, they experienced 4 unplanned outages per quarter — each costing ~$50K in SLA credits.
- Container Insights on all clusters with tuned ConfigMaps — excluded noisy debug logs from development namespaces, saving 45% on Log Analytics costs.
- Managed Prometheus + Grafana — custom dashboards per team showing request latency (p50/p95/p99), error rates, and queue depths. Teams own their dashboards.
- Three-tier alert escalation:
- Tier 1 (Warning): CPU > 70%, memory > 75% → Slack notification to owning team.
- Tier 2 (Error): Pod crash loops, OOMKilled, probe failures → PagerDuty page to on-call engineer.
- Tier 3 (Critical): Node not ready, multiple service failures → PagerDuty escalation + auto-incident in ServiceNow + SMS to engineering leadership.
- Diagnostic settings with kube-audit-admin — all write operations logged for compliance, with 90-day retention and archive to cold storage.
- Weekly right-sizing reviews — a KQL query identifies pods wasting > 50% of requested resources. Savings: ~$8K/month by reducing node pool sizes.
- Application Insights distributed tracing — end-to-end visibility across microservices. When a payment API slowed down, tracing pinpointed a database query taking 3 seconds in a downstream service within 5 minutes.
Result: outages dropped from 4/quarter to 0 in the first 6 months. MTTR (mean time to resolve) dropped from 45 minutes to 8 minutes. The monitoring investment paid for itself in 2 months through reduced SLA credits and faster incident response.
📝 Summary
- Container Insights is the foundation — deploy it first for logs, metrics, inventory, and pre-built dashboards.
- Log Analytics + KQL lets you query container logs, Kubernetes events, and performance data with powerful aggregations and visualizations.
- Monitor key metrics: node CPU/memory, pod restarts, pending pods, disk IOPS — set thresholds before problems hit users.
- Prometheus + Grafana (Azure-managed) adds custom application metrics and team-owned dashboards.
- Alerts with action groups automate incident response — metric alerts for thresholds, log alerts for events.
- Diagnostic settings export control plane logs (API server, audit, scheduler, autoscaler) — essential for compliance and debugging.
- Application Insights provides distributed tracing across microservices for end-to-end visibility.
- Cost monitoring and right-sizing prevent budget surprises — compare requested vs actual resource usage.
- kubectl top is your real-time first responder; Azure Monitor is your historical analyst.