Advanced Lesson 9 of 14

Monitoring & Logging

Gain full observability over your AKS clusters — from Container Insights and Log Analytics to Prometheus, Grafana dashboards, metric alerts, control plane diagnostics, and cost monitoring.

🧒 Simple Explanation (ELI5)

Monitoring your AKS cluster is like having a dashboard in your car. You need to see your speed (request throughput), fuel level (remaining capacity), engine temperature (CPU and memory pressure), and warning lights (alerts).

Without monitoring, you're driving blind — you only discover problems when the engine stalls (users complain). With proper observability, you catch issues before they affect anyone.

🔧 Technical Explanation

1. Azure Monitor for Containers (Container Insights)

Container Insights is Azure's native monitoring solution for AKS. It deploys an OMS agent (now called Azure Monitor Agent) as a DaemonSet on every node to collect metrics, logs, and inventory data.

bash
# Enable Container Insights on an existing cluster
az aks enable-addons -a monitoring -g myRG -n myAKS

# Or specify a particular Log Analytics workspace
WORKSPACE_ID=$(az monitor log-analytics workspace show -g myRG -n myWorkspace --query id -o tsv)
az aks enable-addons -a monitoring -g myRG -n myAKS --workspace-resource-id "$WORKSPACE_ID"

# Enable during cluster creation
az aks create -g myRG -n myAKS \
  --enable-addons monitoring \
  --workspace-resource-id "$WORKSPACE_ID" \
  --node-count 3 --generate-ssh-keys

What Container Insights collects:

| Data Type | Source | Table in Log Analytics |
| --- | --- | --- |
| Container stdout/stderr logs | Node-level /var/log/containers | ContainerLogV2 |
| Node performance (CPU, memory, disk) | cAdvisor + kubelet | InsightsMetrics |
| Pod/container inventory | Kubernetes API | KubePodInventory |
| Node inventory | Kubernetes API | KubeNodeInventory |
| Kubernetes events | Kubernetes API | KubeEvents |
💡
Cost Control

Container logs can be expensive at scale. Use the ConfigMap container-azm-ms-agentconfig to exclude namespaces (like kube-system), filter log levels, or reduce collection frequency. This can cut Log Analytics costs by 40-60%.
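As a sketch of that ConfigMap (the excluded namespaces here are illustrative — tune them per cluster, and check the agent documentation for the full settings schema):

```yaml
# Sketch of container-azm-ms-agentconfig — filters stdout/stderr collection.
# Namespaces listed are examples only; adjust to your cluster's noisy namespaces.
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.env_var]
        enabled = false
```

Apply it with kubectl apply -f; the ama-logs agent pods restart to pick up the new settings.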

2. Log Analytics & KQL Queries

Log Analytics is the underlying data platform. You query it using Kusto Query Language (KQL).

kusto
// Find all OOMKilled containers in the last 24 hours
KubeEvents
| where TimeGenerated > ago(24h)
| where Reason == "OOMKilling"
| project TimeGenerated, Namespace, Name, Message
| order by TimeGenerated desc

kusto
// Average CPU usage per node over the last 1 hour
InsightsMetrics
| where TimeGenerated > ago(1h)
| where Name == "cpuUsageNanoCores"
| summarize AvgCPU = avg(Val) by Computer, bin(TimeGenerated, 5m)
| render timechart

kusto
// Top 10 pods by memory consumption
InsightsMetrics
| where TimeGenerated > ago(30m)
| where Name == "memoryWorkingSetBytes"
| extend PodName = tostring(parse_json(Tags).pod)
| summarize AvgMemMB = avg(Val / 1024 / 1024) by PodName
| top 10 by AvgMemMB desc

kusto
// Container restart count by namespace
KubePodInventory
| where TimeGenerated > ago(24h)
| summarize Restarts = max(ContainerRestartCount) by Namespace, Name
| where Restarts > 0
| order by Restarts desc

3. Key Metrics to Monitor

| Metric | Healthy Threshold | Alert Threshold |
| --- | --- | --- |
| Node CPU % | < 60% | > 80% for 5 min |
| Node Memory % | < 70% | > 85% for 5 min |
| Pod restart count | 0 | > 3 in 10 min |
| Disk IOPS / throughput | Within SKU limits | > 90% of limit |
| Network in/out per node | Within expected baseline | Sudden spikes or drops |
| Failed pod count | 0 | > 0 for 5 min |
| Pending pods | 0 (except during scale-out) | > 0 for 10 min |

4. Pre-built Azure Portal Dashboards

Once Container Insights is enabled, navigate to Azure Portal → AKS Cluster → Monitoring → Insights to find pre-built views for the cluster, nodes, controllers, and containers, each with drill-down into live metrics and logs.

5. Prometheus & Grafana

Azure offers a managed Prometheus service (Azure Monitor Workspace for Prometheus) that eliminates running your own Prometheus server.

bash
# Create an Azure Monitor workspace (Prometheus-compatible)
az monitor account create -g myRG -n myPrometheusWorkspace -l eastus

# Link it to AKS
PROM_ID=$(az monitor account show -g myRG -n myPrometheusWorkspace --query id -o tsv)
az aks update -g myRG -n myAKS \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id "$PROM_ID"

# Create a Grafana instance and link to Prometheus
az grafana create -g myRG -n myGrafana -l eastus
az grafana update -g myRG -n myGrafana \
  --azure-monitor-workspace-integrations "[{azureMonitorWorkspaceResourceId:$PROM_ID}]"

For custom application metrics, annotate pods so the managed Prometheus scrapes them:

yaml
# Pod annotations for Prometheus scraping
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

Monitoring Data Flow:
AKS Nodes (kubelet, cAdvisor) → Azure Monitor Agent / Prometheus collector → Log Analytics / Azure Monitor Workspace → Dashboards & Alerts

6. Alerts & Action Groups

Azure Monitor alerts trigger when metrics or log conditions are met. Configure an action group to define who gets notified and how.

bash
# Create an action group (email + Teams webhook)
az monitor action-group create \
  -g myRG -n aks-ops-team \
  --short-name AKSOps \
  --email-receiver name=OnCall email=oncall@company.com \
  --webhook-receiver name=Teams uri="https://company.webhook.office.com/webhookb2/..."

# Create a metric alert: node CPU > 80% for 5 minutes
AKS_RESOURCE_ID=$(az aks show -g myRG -n myAKS --query id -o tsv)
az monitor metrics alert create \
  -g myRG -n "aks-node-cpu-high" \
  --scopes "$AKS_RESOURCE_ID" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action aks-ops-team \
  --severity 2 \
  --description "AKS node CPU exceeds 80% for 5 minutes"

bash
# Create a log alert: fire when any OOMKilling event appears in the window
az monitor scheduled-query create \
  -g myRG -n "aks-oomkilled-alert" \
  --scopes "$WORKSPACE_ID" \
  --condition "count placeholder_1 > 0" \
  --condition-query placeholder_1="KubeEvents | where Reason == 'OOMKilling'" \
  --window-size 10m \
  --evaluation-frequency 5m \
  --action-groups aks-ops-team \
  --severity 1

Common alert types for AKS:

| Alert | Type | Condition | Severity |
| --- | --- | --- | --- |
| Node CPU high | Metric | CPU > 80% for 5 min | Warning (2) |
| Node memory high | Metric | Memory > 85% for 5 min | Warning (2) |
| OOMKilled pods | Log | OOMKilling event count > 0 | Critical (1) |
| Node not ready | Metric | Ready node count < expected | Critical (1) |
| Persistent volume at capacity | Metric | PV usage > 90% | Warning (2) |
| Pod in failed state | Log | Failed pod count > 0 for 5 min | Error (1) |

7. Diagnostic Settings — Control Plane Logs

By default, AKS does not export control plane logs. You must enable diagnostic settings to send them to Log Analytics, a storage account, or Event Hub.

bash
# Enable all control plane log categories
az monitor diagnostic-settings create \
  --name aks-diagnostics \
  --resource "$AKS_RESOURCE_ID" \
  --workspace "$WORKSPACE_ID" \
  --logs '[
    {"category":"kube-apiserver","enabled":true},
    {"category":"kube-controller-manager","enabled":true},
    {"category":"kube-scheduler","enabled":true},
    {"category":"kube-audit","enabled":true},
    {"category":"kube-audit-admin","enabled":true},
    {"category":"cluster-autoscaler","enabled":true},
    {"category":"guard","enabled":true}
  ]'

| Log Category | What It Contains | When You Need It |
| --- | --- | --- |
| kube-apiserver | API server request/response logs | Debugging API errors, RBAC denials |
| kube-audit | Full audit log (every API request) | Compliance, security investigations |
| kube-audit-admin | Write operations only (subset of kube-audit) | Lighter audit trail |
| kube-controller-manager | Controller reconciliation loops | Debugging stuck deployments/replicasets |
| kube-scheduler | Scheduling decisions | Debugging pending pods, node affinity |
| cluster-autoscaler | Scale-up/scale-down decisions | Understanding why nodes aren't scaling |
| guard | Azure AD auth webhook logs | Debugging Azure AD/RBAC login failures |
⚠️
Audit Log Volume

The kube-audit category generates massive data volumes on busy clusters. For cost control, consider using kube-audit-admin (write operations only) or set a retention policy on the Log Analytics table. A 100-node cluster can generate 10+ GB/day of audit logs.
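To put that volume in dollar terms, a rough back-of-envelope estimate — the per-GB price below is an assumption, so verify it against current Log Analytics pay-as-you-go pricing for your region:

```shell
#!/usr/bin/env bash
# Rough monthly ingestion cost for audit logs at 10 GB/day.
# PRICE_PER_GB is an assumed pay-as-you-go rate — check current regional pricing.
GB_PER_DAY=10
PRICE_PER_GB=2.30
MONTHLY_COST=$(awk -v gb="$GB_PER_DAY" -v p="$PRICE_PER_GB" \
  'BEGIN { printf "%.0f", gb * 30 * p }')
echo "~\$${MONTHLY_COST}/month just for kube-audit ingestion"
```

At those rates, switching to kube-audit-admin on a busy cluster is often worth hundreds of dollars a month per cluster.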

8. Application-Level Observability

Container Insights covers infrastructure signals; request traces, dependency calls, and exceptions come from Application Insights. For Java workloads, the agent attaches with no code changes — an init container copies the agent JAR into a shared volume and JAVA_TOOL_OPTIONS loads it:
yaml
# Auto-instrument a Java app on AKS using Application Insights agent
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
      - name: api
        image: myacr.azurecr.io/api:v2
        env:
        - name: APPLICATIONINSIGHTS_CONNECTION_STRING
          valueFrom:
            secretKeyRef:
              name: appinsights-secret
              key: connectionString
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/opt/applicationinsights-agent.jar"
        volumeMounts:
        - name: ai-agent
          mountPath: /opt/applicationinsights-agent.jar
          subPath: applicationinsights-agent.jar
      initContainers:
      - name: ai-jar
        image: mcr.microsoft.com/applicationinsights/agent:3.4.0
        command: ['cp', '/agent/applicationinsights-agent.jar', '/opt/']
        volumeMounts:
        - name: ai-agent
          mountPath: /opt
      volumes:
      - name: ai-agent
        emptyDir: {}

9. Cost Monitoring

Right-sizing is the biggest cost lever: join actual usage (InsightsMetrics) against requested resources (KubePodInventory) to find pods whose requests far exceed what they actually use.
kusto
// Find pods requesting far more CPU than they use (right-sizing candidates)
InsightsMetrics
| where TimeGenerated > ago(7d)
| where Name == "cpuUsageNanoCores"
| extend PodName = tostring(parse_json(Tags).pod)
| extend Namespace = tostring(parse_json(Tags).namespace)
| summarize P95CPU_mCores = percentile(Val / 1000000, 95) by PodName, Namespace
| join kind=inner (
    KubePodInventory
    | where TimeGenerated > ago(1h)
    | extend CpuRequest_mCores = toint(parse_json(ContainerResourceRequestCPU))
    | project PodName = Name, CpuRequest_mCores
    | distinct PodName, CpuRequest_mCores
) on PodName
| extend WasteRatio = round((CpuRequest_mCores - P95CPU_mCores) / CpuRequest_mCores * 100, 1)
| where WasteRatio > 50
| order by WasteRatio desc

10. kubectl Debugging Commands

While Azure Monitor provides long-term data, kubectl is your first tool for real-time debugging:

bash
# Real-time resource usage per node
kubectl top nodes

# Real-time CPU/memory per pod in a namespace
kubectl top pods -n production --sort-by=memory

# Check recent events for a specific pod
kubectl describe pod myapp -n production | tail -30

# Stream live logs from a container
kubectl logs -f deployment/api-service -n production -c api

# View events cluster-wide, sorted by time
kubectl get events --sort-by='.lastTimestamp' -A | tail -20

💡
kubectl top requires Metrics Server

AKS deploys Metrics Server by default. If kubectl top returns "Metrics API not available," check that the metrics-server deployment is running in kube-system: kubectl get deploy metrics-server -n kube-system.

⌨️ Hands-on

Lab 1: Enable Container Insights & Run KQL Queries

bash
# 1. Create a Log Analytics workspace (if you don't have one)
az monitor log-analytics workspace create -g myRG -n aks-logs -l eastus
WORKSPACE_ID=$(az monitor log-analytics workspace show -g myRG -n aks-logs --query id -o tsv)

# 2. Enable Container Insights
az aks enable-addons -a monitoring -g myRG -n myAKS --workspace-resource-id "$WORKSPACE_ID"

# 3. Verify OMS agent pods are running
kubectl get daemonset ama-logs -n kube-system

# 4. Generate some data — deploy a memory-hungry pod that will get OOMKilled
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      limits:
        memory: "50Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "100M", "--vm-hang", "1"]
EOF

# 5. Wait ~2 minutes, then check events
kubectl get events --field-selector involvedObject.name=oom-test

# 6. Go to Azure Portal → Log Analytics → Logs, run this KQL:
#    KubeEvents | where Reason == "OOMKilling" | project TimeGenerated, Name, Message

Lab 2: Create a CPU Alert

bash
# 1. Create an action group
az monitor action-group create \
  -g myRG -n "aks-alerts" \
  --short-name AKSAlert \
  --email-receiver name=Admin email=admin@company.com

# 2. Create a metric alert for node CPU > 80%
AKS_ID=$(az aks show -g myRG -n myAKS --query id -o tsv)
az monitor metrics alert create \
  -g myRG -n "node-cpu-high" \
  --scopes "$AKS_ID" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action aks-alerts \
  --severity 2

# 3. Generate CPU load to test the alert
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "2"
    command: ["stress"]
    args: ["--cpu", "4", "--timeout", "600"]
EOF

# 4. Watch node CPU rise
kubectl top nodes
# Wait 5+ minutes for alert to fire, check Azure Portal → Monitor → Alerts

Lab 3: Check kubectl top & Describe Events

bash
# View node-level resource usage
kubectl top nodes

# View pod-level resource usage sorted by CPU
kubectl top pods -A --sort-by=cpu | head -20

# Describe a specific pod to see events, conditions, and resource usage
kubectl describe pod cpu-stress

# View all warning events in the cluster
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'

# Compare kubectl data with Azure Monitor
# The "kubectl top" shows real-time, Azure Monitor shows historical trends
# Use kubectl for immediate debugging, Azure Monitor for pattern analysis

🐛 Debugging Scenarios

Scenario 1: Container Insights Not Showing Data

Symptom: You navigate to Azure Portal → AKS → Insights, but the dashboards are empty — no metrics, no logs, no inventory.

bash
# Step 1: Is the monitoring add-on enabled?
az aks show -g myRG -n myAKS --query "addonProfiles.omsagent.enabled"
# If null or false → az aks enable-addons -a monitoring ...

# Step 2: Check OMS agent (ama-logs) DaemonSet status
kubectl get daemonset ama-logs -n kube-system
# Desired=3, Current=3, Ready=3 → agents are running
# Ready=0 → agents are crashing

# Step 3: Check agent pod logs for errors
kubectl logs daemonset/ama-logs -n kube-system --tail=50
# Look for "Unauthorized", "workspace not found", or connectivity errors

# Step 4: Verify the Log Analytics workspace exists and is linked
az aks show -g myRG -n myAKS --query "addonProfiles.omsagent.config.logAnalyticsWorkspaceResourceID"
az monitor log-analytics workspace show --ids "<WORKSPACE_ID>" --query "provisioningState"

# Step 5: Check network connectivity (private cluster scenario)
# OMS agent needs outbound connectivity to:
#   *.ods.opinsights.azure.com
#   *.oms.opinsights.azure.com
#   *.monitoring.azure.com
# If using private link, ensure the Private Link Scope includes these endpoints

# Step 6: Data ingestion delay — wait 10-15 minutes after enabling
# Run a simple KQL to check: Heartbeat | take 5

# Fix: Re-enable the add-on if agent pods aren't created, fix network
# rules if agents can't reach Azure, verify workspace hasn't been deleted.

Scenario 2: Alerts Not Firing

Symptom: Node CPU has been at 90% for 30 minutes, but no alert email arrives.

bash
# Step 1: Check if the alert rule exists and is enabled
az monitor metrics alert list -g myRG -o table
# Verify: state = "Enabled", not "Disabled"

# Step 2: Check alert scope — is it targeting the right AKS resource?
az monitor metrics alert show -g myRG -n "node-cpu-high" \
  --query "{scopes:scopes, condition:criteria.allOf[0]}"

# Step 3: Verify the metric name is correct
# Common mistake: using "cpuUsagePercentage" instead of "node_cpu_usage_percentage"
az monitor metrics list --resource "$AKS_ID" --metric-namespace "Insights.Container/nodes"

# Step 4: Check the action group is configured correctly
az monitor action-group show -g myRG -n aks-alerts \
  --query "{email:emailReceivers, webhook:webhookReceivers}"

# Step 5: Check action group test — send a test notification
az monitor action-group test-notifications create \
  -g myRG --action-group-name aks-alerts \
  --alert-type "budget" \
  --email-receiver name=Admin email=admin@company.com

# Step 6: Check Azure Monitor alert history
# Azure Portal → Monitor → Alerts → check "Alert history"
# If alert fired but email not received → check spam folder, action group config

# Fix: Correct the metric name, enable the rule, verify action group
# receivers, check email/webhook endpoints are valid and reachable.

Scenario 3: Can't See Control Plane Logs

Symptom: You need to investigate API server errors or audit who deleted a namespace, but there are no control plane logs in Log Analytics.

bash
# Step 1: Check if diagnostic settings exist for the AKS resource
az monitor diagnostic-settings list --resource "$AKS_RESOURCE_ID" -o table
# If empty → no diagnostic settings configured

# Step 2: Create diagnostic settings (see section 7 above)
az monitor diagnostic-settings create \
  --name aks-diagnostics \
  --resource "$AKS_RESOURCE_ID" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-audit-admin","enabled":true},{"category":"cluster-autoscaler","enabled":true}]'

# Step 3: Wait 5-10 minutes for logs to start flowing

# Step 4: Verify data is arriving
# KQL: AzureDiagnostics | where Category == "kube-apiserver" | take 5

# Step 5: If diagnostic settings exist but no data appears:
# - Check the workspace is correct and active
# - Verify the categories are set to enabled: true
# - Check for Azure Policy restricting diagnostic settings

# Common misconception: Container Insights (monitoring add-on) does NOT
# collect control plane logs. You must configure diagnostic settings separately.

# Step 6: Query audit logs to find who deleted the namespace
# KQL:
# AzureDiagnostics
# | where Category == "kube-audit-admin"
# | where log_s contains "namespaces" and log_s contains "delete"
# | project TimeGenerated, log_s
# | order by TimeGenerated desc

🎯 Interview Questions

Beginner

Q: What is Container Insights and how do you enable it on AKS?

Container Insights is Azure Monitor's native monitoring solution for AKS. It deploys an agent (Azure Monitor Agent) as a DaemonSet on each node that collects container logs, node/pod performance metrics, and Kubernetes inventory data. Enable it with az aks enable-addons -a monitoring -g myRG -n myAKS. Data is sent to a Log Analytics workspace where you can query it with KQL, visualize it in built-in dashboards, and create alerts.

Q: What is the difference between Container Insights and Prometheus on AKS?

Container Insights is Azure-native — it collects logs AND metrics, stores them in Log Analytics (KQL), and provides pre-built Azure Portal dashboards. Prometheus is a CNCF metrics-only system — it scrapes metrics endpoints using PromQL. Azure now offers managed Prometheus (Azure Monitor Workspace) that removes the need to self-host. Use Container Insights for logs + Azure-native experience, Prometheus for custom application metrics + Grafana dashboards. Many teams use both together.

Q: What is KQL and why is it used with AKS monitoring?

KQL (Kusto Query Language) is the query language for Azure Log Analytics and Azure Data Explorer. It's used to query container logs, Kubernetes events, performance metrics, and audit logs stored in Log Analytics. KQL is pipe-based (like PowerShell) — you start with a table, filter with where, transform with extend/project, aggregate with summarize, and visualize with render. Example: KubeEvents | where Reason == "OOMKilling" | project TimeGenerated, Name.

Q: How do metric alerts differ from log alerts in Azure Monitor?

Metric alerts evaluate numeric platform metrics at regular intervals (e.g., "node CPU > 80% for 5 min"). They're fast (near-real-time, 1-minute evaluation). Log alerts run a KQL query against Log Analytics data (e.g., "count of OOMKilled events > 0 in the last 10 min"). They're more flexible but have higher latency (5-15 min) due to log ingestion delay. Use metric alerts for infrastructure thresholds, log alerts for event-based conditions.

Q: What does "kubectl top" show and what is required for it to work?

kubectl top nodes shows real-time CPU and memory usage per node. kubectl top pods shows per-pod resource usage. It requires the Metrics Server to be running in the cluster, which AKS deploys by default. Metrics Server provides the Metrics API that kubectl top queries. It shows current usage only (no history) — for historical data, use Container Insights or Prometheus.

Intermediate

Q: What are AKS diagnostic settings and what log categories are available?

Diagnostic settings export AKS control plane logs to Log Analytics, Storage, or Event Hub. Categories include: kube-apiserver (API request logs), kube-audit (full audit — every API call), kube-audit-admin (write operations only), kube-controller-manager, kube-scheduler, cluster-autoscaler, and guard (Azure AD auth). These are separate from Container Insights — you must enable them via az monitor diagnostic-settings create. The kube-audit category is essential for compliance but can be very high-volume.

Q: How would you set up an alert escalation chain for an AKS cluster?

Create multiple action groups with escalating severity: (1) Warning alerts (CPU > 70%) → email the team Slack channel. (2) Critical alerts (CPU > 90% or OOMKilled) → page the on-call engineer via PagerDuty webhook. (3) Severity 0 alerts (node not ready, multiple pod failures) → SMS + phone call via PagerDuty, auto-create incident ticket via Logic App webhook. Use Azure Monitor's severity levels (0-4) to map alerts to action groups. Add suppression rules for maintenance windows.

Q: How do you reduce Container Insights costs on a large AKS cluster?

Several strategies: (1) Configure the agent ConfigMap to exclude verbose namespaces (kube-system, monitoring). (2) Reduce log collection frequency from the default. (3) Use kube-audit-admin instead of kube-audit for diagnostic logs. (4) Set data retention policies (30 days instead of default 31-90). (5) Use Basic tier for Log Analytics tables that don't need full analytics. (6) Archive cold data to Storage accounts. (7) Use Data Collection Rules to filter at ingestion time. A well-tuned config can reduce costs 40-60%.

Q: How does Azure managed Prometheus integrate with AKS and Grafana?

Azure Monitor managed service for Prometheus creates an Azure Monitor Workspace that acts as a Prometheus-compatible remote-write endpoint. Enable it on AKS with --enable-azure-monitor-metrics, which deploys a collector that scrapes Prometheus endpoints on pods (using annotations) and node-level metrics. Create an Azure Managed Grafana instance and link it to the Prometheus workspace. Grafana can then use PromQL to query these metrics. This eliminates managing Prometheus servers, storage, and HA — Azure handles it all.

Q: What is right-sizing and how do you identify over-provisioned pods?

Right-sizing means setting CPU/memory requests and limits to match actual usage, avoiding waste. To identify over-provisioned pods: (1) Use Container Insights' "Controllers" view to compare requests vs actual usage. (2) Run a KQL query that joins InsightsMetrics (actual usage) with KubePodInventory (requested resources) and calculates the waste ratio. (3) Use Azure Advisor's AKS recommendations. A pod requesting 1 CPU but consistently using 100m is wasting 90% of its allocation, preventing other pods from scheduling on that node.

Scenario-Based

Q: Your team deployed a new microservice version and response times increased 5x. How do you investigate using AKS monitoring tools?

1. Check kubectl top pods to see if the new pods are consuming more CPU/memory. 2. In Container Insights, compare CPU/memory graphs before and after deployment. 3. If using Application Insights, check the Performance blade for slow dependencies or increased exception rates. 4. Use distributed tracing to find which downstream service call is slow. 5. Query container logs: ContainerLogV2 | where PodName contains "api-service" | where LogMessage contains "error" or "timeout". 6. Check if the pod is hitting resource limits (throttled CPU). 7. Review Kubernetes events for OOMKilled or probe failures. Most likely causes: missing resource limits causing CPU throttling, a bad database query, or a dependency service issue.

Q: It's month-end and your Azure bill shows a 300% increase in Log Analytics costs attributed to the AKS cluster. What do you do?

1. Check data ingestion volume: Azure Portal → Log Analytics → Usage and estimated costs → Data volume by solution. 2. Identify which tables are largest — likely ContainerLogV2 (application logs) or AzureDiagnostics (audit logs). 3. If ContainerLogV2: check if a verbose application was deployed (debug logging in production), or if a crash loop is generating excessive logs. Apply ConfigMap filtering to exclude noisy namespaces. 4. If AzureDiagnostics: check if full kube-audit was enabled (switch to kube-audit-admin). 5. For immediate relief: set daily cap on the workspace, enable Data Collection Rules to filter at source. 6. Long-term: implement Basic tier for cold tables, set retention policies, archive to storage.

Q: The cluster-autoscaler log shows "ScaleUp: node group has reached its maximum size" but you still have pending pods. How do you approach this?

1. Check current node pool limits: az aks nodepool show -g myRG --cluster-name myAKS -n nodepool1 --query "{min:minCount,max:maxCount,current:count}". 2. If max is reached, decide: increase the max count (az aks nodepool update --max-count 20) or optimize existing workloads. 3. Check if pending pods have resource requests that no available node can satisfy (GPU, specific VM size). 4. Check cluster-autoscaler logs in diagnostic settings: AzureDiagnostics | where Category == "cluster-autoscaler". 5. Verify subscription VM quotas aren't exhausted: az vm list-usage -l eastus. 6. Consider adding a second node pool with a different VM size for the pending workload's needs.

Q: A compliance audit requires you to prove who accessed the cluster and what changes they made in the last 90 days. How do you provide this?

1. Ensure kube-audit or kube-audit-admin diagnostic settings have been enabled (this should have been done preemptively). 2. Query: AzureDiagnostics | where Category == "kube-audit-admin" | where TimeGenerated > ago(90d) | extend User = tostring(parse_json(log_s).user.username) | extend Verb = tostring(parse_json(log_s).verb) | extend Resource = tostring(parse_json(log_s).objectRef.resource) | summarize Actions=count() by User, Verb, Resource | order by Actions desc. 3. For Azure-level operations (who scaled the cluster, changed config): use Azure Activity Log. 4. Cross-reference Azure AD sign-in logs for login times. 5. Export results to CSV for the auditor. If audit logs weren't enabled, you can only provide Azure Activity Log data — a lesson to enable diagnostic settings from day one.

Q: Your Grafana dashboard suddenly shows gaps in Prometheus metrics. What do you investigate?

1. Check if the Prometheus collector pods are running: kubectl get pods -n kube-system -l app.kubernetes.io/name=ama-metrics. 2. If pods restarted, check logs for OOM or connectivity issues. 3. Verify the Azure Monitor Workspace is healthy: Azure Portal → Monitor → Prometheus. 4. Check if the application pods were restarted or rescheduled (gaps during restarts are normal). 5. Verify pod annotations (prometheus.io/scrape: "true") were not removed during a deployment. 6. Check network policies — a new deny-all rule might be blocking the collector from scraping pod metrics endpoints. 7. Verify the scrape interval hasn't been changed to a very long value.

🌍 Real-World Use Case

Proactive Monitoring at a SaaS Platform Provider

A B2B SaaS company runs 300+ microservices across 5 AKS clusters serving 10 million daily API requests. Before implementing comprehensive monitoring, they experienced 4 unplanned outages per quarter — each costing ~$50K in SLA credits.

Result: outages dropped from 4/quarter to 0 in the first 6 months. MTTR (mean time to resolve) dropped from 45 minutes to 8 minutes. The monitoring investment paid for itself in 2 months through reduced SLA credits and faster incident response.

📝 Summary

Container Insights collects container logs, performance metrics, and inventory into Log Analytics, where KQL powers queries, dashboards, and alerts wired to action groups. Control plane logs require separate diagnostic settings — kube-audit-admin is usually sufficient and far cheaper than full kube-audit. Managed Prometheus with Azure Managed Grafana handles custom application metrics, while kubectl top and events remain the first stop for real-time debugging. Tune the agent ConfigMap, retention policies, and audit categories to keep Log Analytics costs under control.