Building an AIOps Chatbot and Ops Assistant
Create an operations assistant that answers incident questions, summarizes live telemetry, and points responders to the right runbooks.
🧒 Simple Explanation (ELI5)
An AIOps assistant is like a smart teammate in chat who can answer: What is failing? Which service changed recently? What runbook should I use? It saves responders from hunting through five dashboards and ten docs tabs.
🤔 Why Do We Need It?
- Incident responders lose time gathering context from many tools.
- Runbook knowledge is often trapped in senior engineers' heads.
- Chat is already where teams coordinate during incidents.
- Natural language is faster than manually writing complex queries during pressure.
🌍 Real-world Analogy
Imagine a hospital triage assistant who listens to the doctor, reads the patient chart, pulls the right protocol, and answers simple follow-up questions quickly. It does not replace the doctor, but it removes a lot of lookup work.
⚙️ Technical Explanation
An ops assistant usually uses retrieval-augmented generation. It pulls current incident state, dashboards, alerts, deployment history, topology, and runbooks from trusted sources and gives grounded responses. Safety requirements are strict: narrow tool access, approval for actions, redaction of secrets, and clear separation between read-only questions and write actions.
📊 Visual: AIOps Ops Assistant Architecture
retrieval-augmented
⌨️ Commands / Syntax
{
"question": "Why is checkout latency high?",
"context": {
"active_alerts": ["HighLatency", "DbConnectionsSaturated"],
"recent_deployments": ["checkout-api 2026-04-20T13:58Z"],
"runbooks": ["db-pool-exhaustion", "rollback-checkout"]
}
}User: Summarize the current incident in one paragraph. Assistant: Checkout latency rose from 140ms to 980ms after the 13:58 UTC deployment. Error rate remains below 1%, but DB connections are saturated and queue depth is increasing. Most likely causes are migration-related connection leakage or insufficient pool size. Recommended next steps: compare pool metrics before and after deploy, consider rollback if saturation persists for 5 minutes.
🧪 Hands-on
- Collect trusted documents: runbooks, service ownership, dashboard URLs, and deployment metadata.
- Define 10 read-only questions the assistant must answer well.
- Add grounding so responses cite current alerts and recent changes.
- Block write actions by default; allow only suggestions first.
- Test during a controlled incident simulation and record wrong or hallucinated answers.
🧭 Example (Real-world Use Case)
During a production incident, responders ask the assistant: Which services depend on Redis? What changed in the last 30 minutes? Do we have a runbook for queue backlog? The assistant answers with grounded links and a concise incident summary, reducing context-building time during the first 10 minutes of response.
🛠️ Try It Yourself
- Define 10 read-only questions your ops assistant must answer correctly during an incident. Examples: "What changed in the last 30 minutes in the prod namespace?" / "Which runbook covers DB pool exhaustion?" / "Which pods are currently in CrashLoopBackOff?"
- Ground with kubectl output: Write a Python function that runs
kubectl get events -n prod --sort-by=lastTimestamp | tail -20, captures the output, and prepends it to an Azure OpenAI system message as context before answering a question. - Test grounding quality: Ask the assistant "What is the most recent warning event in prod?" (a) with the kubectl output grounded in the prompt, and (b) without grounding. Compare responses. Ungrounded version will hallucinate — document what it invents.
- Implement the "I don’t know" pattern: Add to your system prompt: "If you cannot find evidence in the provided context, reply: ‘I do not have enough current data to answer this. Run: <suggested kubectl command>’". Test with a question whose answer is NOT in the grounded data.
- Wire to Slack: Use the Slack Incoming Webhooks API to post the assistant’s answer to a
#ops-assistant-testchannel. Format the response with bold headers for: Summary, Evidence, Suggested Action, and Confidence (1-5 stars).
🐛 Debugging Scenarios
Hallucination: Assistant Suggests a Non-Existent Runbook
Signal: During a live incident, the assistant responds: "Follow the redis-memory-recovery.md runbook." The engineer searches for it — it does not exist in the runbook repository. Time wasted: 8 minutes.
- Root cause: The vector database embedding index was last updated 6 weeks ago. A new runbook was added but never indexed. The LLM, receiving no matching document from the retrieval step, generated a plausible-sounding runbook filename from its training data rather than returning "not found."
- Fix: 1) Add a CI/CD pipeline step that re-indexes the runbook vector DB on every merge to main. 2) Add retrieval confidence scoring: if cosine similarity < 0.75, do NOT pass the retrieved document to the LLM. Instead, respond: "No matching runbook found. Try searching: <suggested search terms>." 3) Add a post-response validator that checks any filename/URL mentioned in the response exists in the actual runbook index before posting to Slack.
- Verification: Create a test suite: 20 known questions with known correct answers. Run after every index update. Alert if any hallucinated resource names appear in responses.
Information Leak: A Secret Credential Appears in the Bot Response
Signal: An engineer asks "What is the connection string for the payment DB?" The bot responds with a real connection string including username and password pulled from a Kubernetes Secret that was accidentally included in the indexed runbooks.
- Root cause: A runbook was committed with
kubectl get secret -o yamloutput pasted inline. The indexing pipeline ingested it without redaction. The retrieval step matched on "connection string" and passed the full secret content to the LLM as grounding context. - Fix: Add a pre-indexing scrubber that scans all documents for patterns matching secrets (connection strings, API keys, base64-encoded tokens). Block any document containing these patterns from being indexed. Implement RBAC on the runbook store — the assistant’s service account should have read access only to docs explicitly marked as ops-viewable.
- Verification: Run a red-team test: deliberately ask for sensitive information in 5 different phrasings. Verify all return "I do not have access to credentials" rather than leaking any secret data.
🎯 Interview Questions
Beginner
It is a chat-based operations assistant that uses AI plus grounded operational data to answer incident and system questions.
Because responders already collaborate in chat during incidents, so the assistant fits naturally into existing workflows.
Start with read-only questions like current alert summary, recent deployments, known runbooks, and service ownership.
Grounding means the assistant bases its answer on trusted source data rather than making unsupported claims.
Because wrong actions in production are expensive. Teams usually start with retrieval and recommendations before automation.
Intermediate
Runbooks, alert feeds, topology maps, deployment history, dashboards, and ownership metadata are high-value sources.
Use retrieval grounding, source citations, confidence checks, narrow prompts, and explicit fallback behavior when evidence is weak.
Because the security and operational risk profiles are very different. Read-only access is much safer during early adoption.
I would evaluate answer accuracy, citation quality, time saved during incidents, and whether responders trust the assistant enough to use it.
Building a generic bot without clear operational scope, trusted sources, or safety controls.
Scenario-based
It should explain the suggested action, show evidence, and require a secure approval path rather than executing immediately.
The assistant should surface the disagreement explicitly and avoid pretending there is certainty where the evidence conflicts.
I would start with high-friction read-only lookups that save time but do not change systems, then prove usefulness with incident drills.
Redaction and retrieval boundaries failed. I would audit the source documents, indexing pipeline, and response filters immediately.
Responders use the assistant voluntarily, first-response context gathering is faster, and incorrect grounded answers are rare and well understood.
📝 Summary
An AIOps assistant is valuable when it is grounded, narrow in scope, and operationally safe. The point is not to sound smart in chat; it is to reduce cognitive load during real incidents.