Intermediate · Lesson 7 of 16

Incident Summarization and Intelligent Alerts

Auto-generate structured incident summaries and context-rich alerts using LLMs — so on-call engineers get a clear situation brief in seconds instead of spending 20 minutes reading logs.

🧒 Simple Explanation (ELI5)

Imagine you're called at 2am about a problem at a factory. A traditional alert says: "Alarm 409 triggered in Zone 7." You have no idea what that means. An intelligent alert says: "The packaging machine in Zone 7 stopped working at 1:58am. This has happened twice before — both times a conveyor belt sensor caused it. Maintenance runbook: check sensor cable in panel B-3." You know exactly what happened, why, and what to do — all before you've got out of bed.

🔧 Why Incident Summarization Reduces MTTR

🌍 Real-world Analogy

A hospital Emergency Department uses a triage nurse who, within 3 minutes of a patient arriving, produces a structured brief for the attending physician: chief complaint, vital signs, relevant medical history, suspected condition, and immediate actions needed. The physician can make an informed decision in 60 seconds instead of starting from scratch. AI incident summarization is exactly this triage brief — for production systems.

⚙️ How AI Incident Summarization Works

Step 1: Context Aggregation

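Before calling the LLM, collect all available context: the top error log clusters, anomalous metric values, recent deployments and config changes, service dependency topology, and similar past incidents.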

Step 2: Structured Summarization Prompt

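Engineering the prompt to get consistent, structured output is critical: state the exact JSON schema you expect, forbid markdown and explanation around it, and keep the temperature low.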

Step 3: Alert Enrichment

Attach the AI summary to the incident ticket with:

📊 Visual: Intelligent Alert Architecture

From Raw Alert to Context-Rich Incident Brief
🚨 Alert Fires (anomaly detected) → 📊 Context Aggregator (logs + metrics + deploys) → 🤖 LLM Summary (Azure OpenAI) → 📝 Incident Brief (severity + cause + runbook) → 📱 On-Call Engineer (Slack / PagerDuty)

⚡ Kubernetes Integration Flow: Input → AI → Action

How AI incident summarization runs during a multi-service alert storm in AKS:

K8s Alert Storm → AI → Single Incident Brief
🚨 AlertManager (15 simultaneous alerts) → 📊 Context Aggregator (kubectl events + logs + Helm history) → 🤖 LLM (GPT-4) (groups → 1 root cause) → 📝 Incident Brief (severity + cause + runbook) → 📱 PagerDuty P1 (+ Slack thread)

bash
# Collect context from AKS before calling the LLM

# 1. Get recent Kubernetes events (deployment failures, pod restarts, OOMKills)
kubectl get events -n prod --sort-by='.lastTimestamp' --field-selector type=Warning \
  | tail -20 > /tmp/k8s_events.txt

# 2. Get Helm release history (was there a recent upgrade?)
helm history payment-api -n prod --max 3 > /tmp/helm_history.txt

# 3. Get current pod states
kubectl get pods -n prod -l app=payment-api -o wide > /tmp/pod_status.txt

# 4. Feed all context into the Python summarisation script
python3 generate_incident_brief.py \
  --service payment-api \
  --k8s-events /tmp/k8s_events.txt \
  --helm-history /tmp/helm_history.txt \
  --pod-status /tmp/pod_status.txt \
  | curl -X POST "$PAGERDUTY_EVENTS_URL" \
    -H 'Content-Type: application/json' \
    --data-binary @-
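
The pipeline above pipes the script output straight into the PagerDuty endpoint, so the script has to emit an event body that PagerDuty accepts rather than the raw brief. A minimal sketch of that wrapping step, assuming the endpoint is the PagerDuty Events API v2 and that the brief uses the critical/high/medium/low scale shown later in this lesson. The function name `to_pagerduty_event` and the severity mapping are illustrative assumptions, not part of the original script.

```python
# Assumed mapping from the brief's severity scale to the values accepted by
# the PagerDuty Events API v2 (critical, error, warning, info).
SEVERITY_MAP = {"critical": "critical", "high": "error", "medium": "warning", "low": "info"}


def to_pagerduty_event(brief: dict, routing_key: str) -> dict:
    """Wrap an incident brief in an Events API v2 'trigger' event (hypothetical helper)."""
    severity = SEVERITY_MAP.get(str(brief.get("severity", "")).lower(), "warning")
    service = brief.get("service", "unknown")
    summary = f"[{service}] {brief.get('summary', 'No summary available')}"
    return {
        "routing_key": routing_key,      # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary[:1024],   # the summary field is limited to 1024 characters
            "source": service,
            "severity": severity,
            "custom_details": brief,     # the full brief stays visible on the incident
        },
    }
```

Printing `json.dumps(to_pagerduty_event(brief, key))` as the last step of the script would let the `curl` call above send it unchanged.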

⌨️ AI Incident Summarization with Azure OpenAI

python
"""
AI incident summarization pipeline.
Aggregates multi-source context and generates a structured incident brief.
"""
import os
import json
import datetime
import requests

AZURE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
API_KEY        = os.getenv("AZURE_OPENAI_KEY")
DEPLOYMENT     = "gpt-4"  # name of your Azure OpenAI deployment, which may differ from the model name

def create_incident_brief(
    service: str,
    log_clusters: list[dict],
    anomalous_metrics: list[dict],
    recent_deployments: list[dict],
    similar_incidents: list[dict]
) -> dict:
    """
    Generate a structured incident brief from multi-source context.
    Returns validated JSON matching the IncidentBrief schema.
    """
    # Build context string (keep concise to save tokens)
    log_summary = "\n".join([
        f"- [{c['count']}x] {c['template'][:120]}" for c in log_clusters[:5]
    ])
    metric_summary = "\n".join([
        f"- {m['name']}: {m['current_value']} (baseline: {m['baseline_value']}, {m['deviation_pct']:+.0f}%)"
        for m in anomalous_metrics[:3]
    ])
    deploy_summary = "\n".join([
        f"- {d['service']} v{d['version']} deployed {d['minutes_ago']}min ago by {d['author']}"
        for d in recent_deployments[:2]
    ]) or "No deployments in the last 2 hours"
    past_incidents = "\n".join([
        f"- INC-{i['id']}: {i['title']} (resolved in {i['resolution_minutes']}min)"
        for i in similar_incidents[:2]
    ]) or "No similar past incidents found"

    prompt = f"""You are a senior SRE performing incident triage. Analyse the following data for service '{service}'.
Return ONLY valid JSON matching this exact schema — no markdown, no explanation:
{{
  "severity": "critical|high|medium|low",
  "summary": "2-3 sentence plain-English description of what is happening",
  "root_cause_hypothesis": "most likely root cause in one sentence",
  "affected_services": ["list of affected service names"],
  "user_impact": "description of what users are experiencing",
  "immediate_action": "the single most important immediate action",
  "runbook_hint": "which runbook or procedure is most relevant",
  "confidence": 0.0
}}

TOP ERROR LOG CLUSTERS:
{log_summary}

ANOMALOUS METRICS:
{metric_summary}

RECENT DEPLOYMENTS:
{deploy_summary}

SIMILAR PAST INCIDENTS:
{past_incidents}"""

    response = requests.post(
        f"{AZURE_ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions?api-version=2024-06-01",
        headers={"api-key": API_KEY, "Content-Type": "application/json"},
        json={
            "messages": [
                {"role": "system", "content": "You are a production incident triage bot. Output structured JSON only."},
                {"role": "user",   "content": prompt}
            ],
            "max_tokens": 400,
            "temperature": 0.1
        },
        timeout=30
    )
    response.raise_for_status()
    raw_output = response.json()["choices"][0]["message"]["content"]

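    # Parse the JSON output, tolerating markdown fences around it
    brief = None
    try:
        brief = json.loads(raw_output)
    except json.JSONDecodeError:
        # Fallback: extract the outermost {...} span if the LLM wrapped the JSON in markdown
        import re
        json_match = re.search(r'\{.*\}', raw_output, re.DOTALL)
        if json_match:
            try:
                brief = json.loads(json_match.group())
            except json.JSONDecodeError:
                pass  # extracted span is still not valid JSON
    if not isinstance(brief, dict):
        brief = {"error": "JSON parse failed", "raw": raw_output}

    # utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    brief["generated_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()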
    brief["service"] = service
    return brief

# ── Example usage ─────────────────────────────────────────────────────────────
sample_brief = create_incident_brief(
    service="payment-api",
    log_clusters=[
        {"count": 847, "template": "DB connection refused <IP>:5432 after timeout"},
        {"count": 423, "template": "Upstream payment-service returned <NUM> status code"},
        {"count": 112, "template": "Connection pool exhausted - max <NUM> connections"},
    ],
    anomalous_metrics=[
        {"name": "db_connection_time_ms", "current_value": 4200, "baseline_value": 45, "deviation_pct": 9233},
        {"name": "error_rate_pct",        "current_value": 23.4, "baseline_value": 0.3, "deviation_pct": 7700},
        {"name": "cpu_pct",               "current_value": 18,   "baseline_value": 44,  "deviation_pct": -59},
    ],
    recent_deployments=[
        {"service": "postgres-proxy", "version": "2.4.1", "minutes_ago": 18, "author": "terraform-ci"}
    ],
    similar_incidents=[
        {"id": "4521", "title": "DB pool exhaustion after proxy upgrade", "resolution_minutes": 23}
    ]
)
print(json.dumps(sample_brief, indent=2))
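
The docstring of `create_incident_brief()` promises JSON matching the IncidentBrief schema, but the parsing step only confirms that the output is valid JSON. A minimal sketch of a schema check that could run on the returned dict, assuming the eight fields listed in the prompt are all required. `validate_brief` and its constants are hypothetical names introduced here for illustration.

```python
# Fields and types taken from the schema in the prompt above.
REQUIRED_FIELDS = {
    "severity": str,
    "summary": str,
    "root_cause_hypothesis": str,
    "affected_services": list,
    "user_impact": str,
    "immediate_action": str,
    "runbook_hint": str,
    "confidence": (int, float),
}
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}


def validate_brief(brief: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the brief looks valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in brief:
            problems.append(f"missing field: {field}")
        elif not isinstance(brief[field], expected_type):
            problems.append(f"wrong type for {field}: {type(brief[field]).__name__}")
    if brief.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"invalid severity: {brief.get('severity')!r}")
    confidence = brief.get("confidence")
    if isinstance(confidence, (int, float)) and not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence}")
    return problems
```

A non-empty result is a reasonable trigger for falling back to a template-based alert instead of forwarding a malformed brief.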

🧪 Hands-on

  1. Set up Azure OpenAI credentials and run the create_incident_brief() function with the sample data above.
  2. Verify the output severity is "critical" and the root cause hypothesis mentions the database proxy deployment.
  3. Test the JSON parsing resilience: modify the prompt to occasionally return markdown-wrapped JSON (```json{...}```) and verify the regex fallback catches it.
  4. Integrate with a PagerDuty or Opsgenie API: when the brief is generated, automatically create a P1 incident if severity is "critical" with the brief as the incident description.
  5. Build a feedback mechanism: add a "Was this summary correct? 👍 👎" button to your Slack alert. Track the ratio over time to measure LLM quality.
💡 The Confidence Threshold Pattern

Always include a confidence field in AI-generated summaries (0.0–1.0). When confidence < 0.6, automatically add a note: "⚠️ Low confidence analysis — human review strongly recommended." When confidence ≥ 0.85, the summary can be used as-is to start triage. Never auto-escalate or auto-remediate based on low-confidence AI outputs.
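
The thresholds above can be sketched as a small post-processing step. This is an illustration under stated assumptions: the lesson does not say how to treat the band between 0.6 and 0.85, so the `verify_before_acting` mode is an assumption, as are the function name `apply_confidence_policy`, the output field names, and the choice to treat a missing confidence value as low.

```python
LOW_CONFIDENCE = 0.6
HIGH_CONFIDENCE = 0.85
LOW_CONFIDENCE_NOTE = "⚠️ Low confidence analysis: human review strongly recommended."


def apply_confidence_policy(brief: dict) -> dict:
    """Return a copy of the brief annotated with a triage mode based on confidence."""
    raw = brief.get("confidence")
    # Assumption: a missing or non-numeric confidence is treated as low confidence.
    confidence = float(raw) if isinstance(raw, (int, float)) else 0.0
    annotated = dict(brief)
    if confidence < LOW_CONFIDENCE:
        annotated["review_note"] = LOW_CONFIDENCE_NOTE
        annotated["triage_mode"] = "human_review_required"
    elif confidence >= HIGH_CONFIDENCE:
        annotated["triage_mode"] = "use_as_is"
    else:
        # Assumption: the middle band is a hypothesis to verify before acting.
        annotated["triage_mode"] = "verify_before_acting"
    # Never auto-escalate or auto-remediate on low-confidence output.
    annotated["block_auto_actions"] = confidence < LOW_CONFIDENCE
    return annotated
```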

🎮 Try It Yourself

🎮 Challenge: Build and Test a K8s Incident Brief Pipeline
  1. Assemble a context bag manually: Imagine the following K8s incident at 14:23 UTC: payment-api pods show OOMKilled, Helm history shows payment-api v2.4.1 installed 22 minutes ago, error rate jumped from 0.2% to 19%, 3 dependent services returning 503. Write the 4 context fields (log_clusters, anomalous_metrics, recent_deployments, similar_incidents) that you would feed into the create_incident_brief() function.
  2. Run the summarisation function with Azure OpenAI credentials. Verify: severity should be "critical", root cause should mention the Helm upgrade, affected_services should include payment-api and the 3 dependent services.
  3. Test confidence thresholding: Remove the recent_deployments data from the input (pass an empty list). Does the AI-generated confidence drop? Does the root cause hypothesis change? This simulates an incident with no deployment signal.
  4. Test the JSON fallback: Temporarily modify the prompt to ask the LLM to wrap its output in markdown code fences (```json...```). Verify the regex fallback in the except json.JSONDecodeError block correctly strips the fences and parses the JSON.
  5. Wire to Slack: Using the Slack Incoming Webhooks API, send the generated brief as a formatted message. Format: bold severity, plain-text summary, inline code for root_cause_hypothesis, bulleted list of affected_services. Post to a test channel and verify it renders correctly.
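
As a possible starting point for step 5, the brief can be turned into a Slack message body with a pure function and posted separately. This sketch assumes Slack's `mrkdwn` text formatting (asterisks for bold, backticks for inline code) and the simple `{"text": ...}` payload accepted by Incoming Webhooks. `format_slack_message` is a hypothetical helper name.

```python
def format_slack_message(brief: dict) -> dict:
    """Build an Incoming Webhook payload from an incident brief (hypothetical helper)."""
    severity = str(brief.get("severity", "unknown")).upper()
    services = "\n".join(f"• {name}" for name in brief.get("affected_services", []))
    text = (
        f"*Severity: {severity}*\n"                       # bold severity
        f"{brief.get('summary', '')}\n"                   # plain-text summary
        f"Root cause hypothesis: `{brief.get('root_cause_hypothesis', 'unknown')}`\n"
        f"Affected services:\n{services}"                 # bulleted list
    )
    return {"text": text}
```

The payload could then be sent with `requests.post(webhook_url, json=format_slack_message(brief), timeout=10)`, where `webhook_url` is the URL of your own test channel's webhook.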

🧠 Debugging Scenario

Problem: AI incident summary says "Root cause: DNS resolution failure" but the real incident was a database connection pool exhaustion. The summary is confidently wrong and engineers wasted 10 minutes investigating DNS.

🎯 Interview Questions

Beginner

What information should an AI-generated incident summary include?

At minimum: severity level, plain-English description of what is happening, root cause hypothesis, affected services and users, immediate action required, and a link to the relevant runbook. Optionally: similar past incidents with resolution times (helps estimate expected MTTR), confidence score, and list of evidence used (which logs/metrics drove the summary).

Why is temperature 0.1 used for incident summarization (not 0.7 or 1.0)?

Low temperature (0.1) produces more consistent, conservative outputs. For incident triage you want the model to stick to what the evidence shows, not be creative. High temperature introduces more randomness: the model might produce a different root cause hypothesis each time you run it for the same incident. Low temperature reduces this variation but does not guarantee identical output on every run, so consistency should still be checked by replaying test incidents. Repeatable summaries are what allow engineers to trust and benchmark the pipeline.

What is "context aggregation" in AI incident summarization?

Context aggregation is collecting all relevant data about an incident into one structured input before calling the LLM. This includes: top error log clusters (from log analysis), anomalous metric values, recent deployments and config changes, service dependency topology, and similar past incidents. The quality of the LLM output is directly proportional to the completeness and relevance of the aggregated context.

How does AI incident summarization help on-call handovers?

When an incident spans multiple shifts, the outgoing on-call engineer must brief the incoming one. AI summarization automatically generates a handover document from all incident activity: what happened, what was tried, current hypothesis, outstanding actions. This prevents the classic "starting from zero" problem when a new engineer takes over, and ensures nothing is lost in verbal handover at 3am.

What is the risk of engineers over-trusting AI incident summaries?

Automation bias: engineers follow AI-suggested actions without independently verifying the root cause. If the AI is wrong (hallucinated root cause), engineers waste time investigating the wrong area or take the wrong action (e.g., rolling back a deployment that isn't the problem). Mitigation: always display confidence scores, require a second signal to support the AI hypothesis, and treat AI summaries as starting hypotheses to validate, not ground truth.

Intermediate

How do you prevent LLM hallucinations in incident summaries?

1) Restrict the model to only reference evidence provided in the context: "Base your analysis ONLY on the data provided. Do not infer additional context." 2) Require citations: "For each claim, reference which log or metric supports it." 3) Add a confidence calibration: if the LLM can't find strong evidence, it should return low confidence rather than speculating. 4) Run golden tests: replay known past incidents through the pipeline and verify the AI matches the human-documented root cause. 5) Use structured output with JSON schema validation, and check every service name in the output against the services present in the input context. Schema validation alone only guarantees the shape of the output; the name check is what catches invented service names.
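
One inexpensive grounding check for point 5 is to compare the service names in the brief with the names that actually appeared in the aggregated context. A minimal sketch, assuming the caller can supply the set of known service names. `find_ungrounded_services` is a hypothetical helper.

```python
def find_ungrounded_services(brief: dict, known_services: set[str]) -> list[str]:
    """Return service names in the brief that never appeared in the input context."""
    known = {name.lower() for name in known_services}
    return [
        service
        for service in brief.get("affected_services", [])
        if str(service).lower() not in known
    ]


# Example: 'billing-gateway' was never part of the context, so it is flagged.
suspect = find_ungrounded_services(
    {"affected_services": ["payment-api", "billing-gateway"]},
    known_services={"payment-api", "postgres-proxy"},
)
print(suspect)
```

A non-empty result is a signal to lower trust in the brief or to strip the unknown names before the alert is sent.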

How would you implement runbook linking in an AI alert system?

Build a runbook index: convert all runbook documents into vector embeddings using an embedding model such as text-embedding-ada-002 (or similar). When an incident summary is generated, perform semantic search against the runbook index using the root cause hypothesis as the query. Return the top 2-3 runbook matches with similarity scores. If similarity > 0.8, include the runbook link directly in the alert. At or below 0.8, list the runbook as "potentially relevant" and don't auto-link it. Update the runbook index whenever runbooks are added or modified.
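
The matching step can be sketched with plain cosine similarity over precomputed vectors. The three-dimensional vectors below are toy values for illustration only; in practice both the runbooks and the query would be embedded by the same embedding model. The function names and the runbook names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_runbooks(query_vec, index, top_k=3, link_threshold=0.8):
    """Rank runbooks by similarity and label each match according to the threshold."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), name) for name, vec in index.items()),
        reverse=True,
    )
    return [
        {
            "runbook": name,
            "score": round(score, 3),
            "label": "linked" if score > link_threshold else "potentially relevant",
        }
        for score, name in scored[:top_k]
    ]


# Toy index: runbook name -> embedding vector (illustrative values, not real embeddings).
runbook_index = {
    "db-pool-exhaustion": [0.9, 0.1, 0.0],
    "dns-failure": [0.0, 0.2, 0.9],
    "proxy-rollback": [0.5, 0.8, 0.1],
}
matches = match_runbooks([1.0, 0.2, 0.0], runbook_index, top_k=2)
print(matches)
```

For an index of more than a few hundred runbooks, a vector database or an approximate nearest-neighbour library would likely replace the linear scan.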

How do you scope the LLM context window to prevent irrelevant data from misleading summaries?

1) Scope logs to the specific alerting service and its immediate dependencies (1 hop in dependency graph). 2) Cap log clusters to top 5 by error rate (not total count). 3) Cap metrics to only those that are currently anomalous. 4) Include deployments only from the last 2 hours — not last month. 5) Use time-bounded context: similar past incidents from the last 6 months, not 3 years ago. Tight scoping keeps context tokens low (<2000) and prevents the LLM from synthesising misleading correlations across unrelated services.
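
The first four scoping rules can be sketched as a filter that runs before the prompt is built. The field names `service`, `error_rate`, and `is_anomalous` are assumptions about the shape of the aggregated data, and `scope_context` is a hypothetical helper. The set of allowed services would come from the dependency graph (the alerting service plus its one-hop neighbours).

```python
def scope_context(log_clusters, metrics, deployments, allowed_services,
                  max_clusters=5, deploy_window_min=120):
    """Trim aggregated context to what is relevant for one alerting service."""
    # Rules 1 and 2: keep only in-scope services, then the top clusters by error rate.
    in_scope = [c for c in log_clusters if c["service"] in allowed_services]
    top_clusters = sorted(in_scope, key=lambda c: c["error_rate"], reverse=True)[:max_clusters]
    # Rule 3: keep only metrics that are currently anomalous.
    anomalous = [m for m in metrics if m.get("is_anomalous")]
    # Rule 4: keep only deployments from the last two hours.
    recent = [d for d in deployments if d["minutes_ago"] <= deploy_window_min]
    return {
        "log_clusters": top_clusters,
        "anomalous_metrics": anomalous,
        "recent_deployments": recent,
    }
```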

Scenario-based

Engineers report that the AI incident summary is helpful 80% of the time but causes confusion the other 20%. How do you improve it?

1) Instrument the 20%: track which incident types produce bad summaries. Add a "Was this helpful?" widget in Slack with an "explain why not" dropdown. 2) Classify failure modes: wrong root cause, wrong severity, missing key info, too verbose. 3) For each failure mode, create golden test cases and run regression tests after every prompt change. 4) Address root causes by type: hallucinations → add citation requirement; wrong severity → add severity anchors; missing info → check context aggregation coverage. Target: monthly cadence of prompt improvements with regression test validation.
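
The golden-test step can be sketched as a small regression harness that replays stored incidents through the summarizer and reports mismatches. Matching on expected keywords in the root cause hypothesis is a deliberately crude assumption, and the case format and the name `run_golden_tests` are hypothetical. The `summarize` argument stands in for whatever function wraps your prompt.

```python
def run_golden_tests(summarize, golden_cases):
    """Replay known incidents and return the cases whose brief does not match expectations."""
    failures = []
    for case in golden_cases:
        brief = summarize(case["context"])
        hypothesis = str(brief.get("root_cause_hypothesis", "")).lower()
        # A case passes only if every expected keyword appears in the hypothesis
        # and the severity matches the human-documented one.
        missing = [kw for kw in case["expected_keywords"] if kw.lower() not in hypothesis]
        if missing or brief.get("severity") != case["expected_severity"]:
            failures.append({
                "case": case["name"],
                "missing_keywords": missing,
                "severity": brief.get("severity"),
            })
    return failures
```

Running this after every prompt change, and failing the change when the list is non-empty, gives the regression signal described in step 3.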

During a major outage, your AI summarisation service itself becomes unavailable. What happens?

This is a critical operational risk — the AI layer must never block incident response. Design for graceful degradation: 1) AI summarisation runs asynchronously and is never on the critical alert delivery path. 2) Alerts fire with basic context first (metric values, log cluster counts) before the AI summary is ready. 3) If summarisation fails within 60s, fall back to a template-based summary with raw data. 4) The AI service must have its own SLO: <30s summary generation, 99.5% availability. Treat it like any other production service with dedicated SLO alerts.
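
Point 3 (the template fallback) can be sketched with a worker thread and a timeout. This covers only the fallback itself. Delivering the basic alert first, as in points 1 and 2, is assumed to happen elsewhere. `summarize_with_fallback` and `template_summary` are hypothetical names, and `summarize` stands in for the function that calls the LLM.

```python
import concurrent.futures


def template_summary(raw_context: dict) -> dict:
    """Minimal non-AI summary built only from the raw alert data."""
    service = raw_context.get("service", "unknown")
    return {
        "source": "template",
        "summary": f"Alert on {service}: AI summary unavailable, raw data attached.",
        "raw_context": raw_context,
    }


def summarize_with_fallback(summarize, raw_context: dict, timeout_s: float = 60.0) -> dict:
    """Return the LLM brief if it arrives in time, otherwise the template summary."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(summarize, raw_context)
    try:
        brief = dict(future.result(timeout=timeout_s))
        brief["source"] = "llm"
        return brief
    except Exception:
        # Covers both a timeout and any error raised inside the summarizer.
        return template_summary(raw_context)
    finally:
        # Do not block alert delivery waiting for a slow call to finish.
        executor.shutdown(wait=False)
```

One limitation of this design: a timed-out call keeps running in its thread until it returns, so the underlying HTTP request should carry its own timeout as well.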

How would you use AI to improve post-incident review reports?

After incident resolution: 1) Provide the AI with the full incident timeline (all AI-generated summaries, engineer actions, metric graphs). 2) Ask it to draft a structured post-incident report: impact, timeline, root cause, contributing factors, action items. 3) Review as a team and annotate: engineers provide the "what we learned" section. 4) Store the final human-reviewed report as a labeled example in the training corpus. 5) Mine historical post-incident reviews (PIRs) for known failure patterns to improve future triage, for example: "we've seen DB pool exhaustion 8 times, so let's make it auto-remediate."

🌐 Real-world Usage

PagerDuty's AIOps feature uses ML to group related alerts into a single incident and generates an automated triage note. Microsoft's Azure Monitor uses AI-powered "insight explanations" that describe in plain English why a smart detection alert fired. Atlassian's Jira Service Management uses AI to generate incident summaries from linked monitoring data. LinkedIn reduced average time-to-acknowledge by 40% after implementing automated incident context enrichment.

📝 Summary

AI incident summarization aggregates multi-source context (logs, metrics, deployments, past incidents), sends it to an LLM with a structured prompt, and produces a brief that on-call engineers can act on immediately. Key engineering requirements: structured JSON output, low temperature for consistent output, confidence scoring, graceful degradation, and a feedback loop to improve quality over time.