Beginner · Lesson 3 of 16

Machine Learning Fundamentals for DevOps Engineers

Learn the essential ML concepts you need to evaluate, deploy, and debug AI systems in production — without becoming a data scientist.

🧒 Simple Explanation (ELI5)

Imagine you want to teach a friend to recognise spam emails. Instead of writing a list of rules ("if email contains 'win money' it's spam"), you show them 10,000 examples of spam and 10,000 normal emails. After seeing enough examples, they can spot new spam even if they've never seen those exact words before. That's machine learning — learning from examples instead of following explicit rules.

For DevOps, this means: instead of writing a static rule such as "if cpu > 80% then alert", you show the ML model thousands of normal CPU patterns and it learns to recognise when something is genuinely abnormal, including slow, gradual drifts that static rules miss.

🔧 Why DevOps Engineers Need ML Fundamentals

Evaluate AI tools intelligently: Dynatrace, Datadog, and Azure Monitor all use ML under the hood. Understanding the basics lets you tune thresholds and interpret confidence scores instead of blindly trusting outputs.
Debug wrong predictions: When the anomaly detector fires on every Monday morning deployment, you need to understand why (lack of training data for that pattern) to fix it.
Choose the right model type: Not every problem needs GPT-4. Sometimes Isolation Forest for anomaly detection is cheaper, faster, and more accurate.
Feature engineering: The difference between a useful model and a useless one is often the input features you provide, not the algorithm.
Compliance: In regulated industries, you need to explain why the AI made a decision. Understanding model types tells you which are explainable.

🌍 Real-world Analogy

An experienced nurse learns what "normal" patient vitals look like after seeing thousands of patients. She doesn't follow a rigid rule book — she has an intuition built from experience. If a patient's heart rate is 95 bpm, she knows whether that's concerning based on their age, medication, and whether they just walked upstairs. ML models work the same way: they build "intuition" from training data that captures context no static rule could encode.

⚙️ Core ML Concepts for AIOps

1. Supervised vs Unsupervised Learning

Type	Training Data	AIOps Use Case	Example
Supervised	Labeled examples (input + correct output)	Incident classification, severity scoring	10,000 logs labeled "normal" or "error"
Unsupervised	Raw data (no labels)	Anomaly detection, log clustering	Find unusual patterns in metric streams
Semi-supervised	Small labeled set + large unlabeled	Alert classification with few labeled examples	100 labeled incidents + 50,000 unlabeled logs
Reinforcement	Reward/penalty signals	Auto-remediation policy optimization	Reward system when remediation resolves incident

2. Key Model Types in AIOps

Classification: Assigns input to a category. Used for: "Is this log line an error? (yes/no)", "What severity is this incident? (P1/P2/P3)"
Regression: Predicts a continuous number. Used for: "How long will this incident take to resolve?", "What will CPU usage be in 30 minutes?"
Clustering: Groups similar things together without labels. Used for: "Group these 10,000 log patterns into 20 common types."
Anomaly Detection: Identifies unusual data points. Used for: "Is this metric reading abnormal given historical patterns?"
LLMs (Large Language Models): Understand and generate natural language. Used for: "Summarize this incident timeline", "Generate a runbook for this error pattern."
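
To make the first three model types concrete, here is a minimal scikit-learn sketch. Everything in it is an illustrative assumption: the two features, the labels, the resolution times, and the choice of LogisticRegression, LinearRegression, and KMeans as stand-ins for each model type.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical features per time window: [error_rate_percent, p99_latency_ms]
X = np.array([[0.1, 120], [0.2, 130], [0.1, 110],
              [5.0, 900], [6.5, 1100], [4.8, 950]])

# Classification: assign a category (0 = healthy, 1 = incident)
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression(max_iter=1000).fit(X, y_class)
print("class:", clf.predict([[5.5, 1000]]))

# Regression: predict a continuous number (made-up minutes to resolve)
y_minutes = np.array([5, 6, 4, 45, 60, 50])
reg = LinearRegression().fit(X, y_minutes)
print("minutes:", reg.predict([[5.5, 1000]]))

# Clustering: group the windows without using any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("clusters:", km.labels_)
```

The same feature matrix feeds all three models; only the target (a category, a number, or nothing at all) changes.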

3. The ML Pipeline

Every ML system in production follows this flow:

Data collection: Gather raw logs, metrics, traces
Feature engineering: Extract meaningful signals (error rate, request volume, latency percentiles)
Model training: Fit mathematical patterns to historical data
Evaluation: Measure accuracy, precision, recall on held-out data
Deployment: Serve predictions in real-time pipeline
Monitoring: Track model drift — when live data diverges from training data
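
The six stages above can be sketched end to end on simulated data. This is a toy under stated assumptions, not a production design: the request and error counts are random, the incident labels are invented (every 20th minute), and a decision tree is used only because it is quick to train.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# 1. Data collection: simulated per-minute request and error counts
requests = rng.integers(500, 1500, size=1000)
errors = rng.binomial(requests, 0.01)
incident = np.zeros(1000, dtype=int)
incident[::20] = 1                                      # hypothetical labels
errors[incident == 1] += requests[incident == 1] // 5   # incidents add ~20% errors

# 2. Feature engineering: turn raw counts into signals
features = np.column_stack([errors / requests, requests])

# 3. Model training on historical data
X_train, X_test, y_train, y_test = train_test_split(
    features, incident, test_size=0.25, stratify=incident, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 4. Evaluation on held-out data
pred = model.predict(X_test)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
print("precision:", precision, "recall:", recall)

# 5. Deployment: score a new observation (2% error rate at 1000 requests)
live_prediction = model.predict([[0.02, 1000]])
print("live prediction:", live_prediction)

# 6. Monitoring: compare the live feature mean with the training mean
live_error_rate = np.array([0.03, 0.04, 0.05])
shift = live_error_rate.mean() - X_train[:, 0].mean()
print("error-rate shift:", shift)
```

In a real system each stage would be a separate job or service; the point here is only the order of the steps and what passes between them.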

4. Evaluation Metrics You Must Understand

Metric	Meaning	AIOps Importance
Precision	Of all alerts fired, how many were real?	Low precision = alert fatigue
Recall	Of all real incidents, how many were caught?	Low recall = missed incidents
F1 Score	Harmonic mean of precision and recall	Balance between false positives and negatives
AUC-ROC	Overall classifier performance across thresholds	Tune sensitivity without retraining
Latency p99	99th percentile prediction time	Model must respond in <200ms for real-time triage
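
As a worked example of the first three metrics, the sketch below computes them from raw alert counts. The counts are made up for illustration.

```python
# Hypothetical week of alerting; the counts are invented for this example
true_positives = 40    # alerts that matched a real incident
false_positives = 160  # alerts with no real incident behind them
false_negatives = 10   # real incidents that produced no alert

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.20 recall=0.80 f1=0.32
```

This detector catches most incidents (recall 0.80) but four out of five alerts are noise (precision 0.20), which is the alert-fatigue case from the table.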

📊 Visual: ML Model Selection for AIOps

Choosing the Right Model Type

Step 1: Problem type?
Classify logs → Classification
Find outliers → Anomaly Detection
Group similar → Clustering
Predict number → Regression

Step 2: Do you have labels?
Yes → Supervised (sklearn)
No → Unsupervised (Isolation Forest)
Natural language → LLM (Azure OpenAI)

Step 3: Production considerations
Latency: <200ms real-time
Cost: batch vs streaming
Explainability: regulated?

⌨️ Your First Anomaly Detector: Isolation Forest

python

"""
Isolation Forest anomaly detector for CPU/memory metrics.
Isolation Forest works by: if a data point is isolated quickly
(few random splits needed), it's likely an outlier.
"""
import numpy as np
from sklearn.ensemble import IsolationForest

# Simulate 30 days of normal CPU metrics (5-min intervals)
np.random.seed(42)
normal_cpu = np.random.normal(loc=45, scale=10, size=8640)        # 45% avg CPU
normal_cpu = np.clip(normal_cpu, 5, 70)                            # cap at realistic range

# Add intentional anomalies at the end (simulate incident)
anomaly_cpu = np.array([92, 95, 91, 88, 94, 96])

all_cpu = np.concatenate([normal_cpu, anomaly_cpu]).reshape(-1, 1)

# Train on the normal data only
model = IsolationForest(
    contamination=0.01,    # expect 1% anomalies
    random_state=42,
    n_estimators=100
)
model.fit(normal_cpu.reshape(-1, 1))

# Score all data points
scores = model.decision_function(all_cpu)  # more negative = more anomalous
predictions = model.predict(all_cpu)        # -1 = anomaly, 1 = normal

# Report anomalies found
anomaly_indices = np.where(predictions == -1)[0]
print(f"Total data points: {len(all_cpu)}")
print(f"Anomalies detected: {len(anomaly_indices)}")
print(f"Anomaly CPU values: {all_cpu[anomaly_indices].flatten().tolist()}")

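# Output: "Total data points: 8646", followed by the anomaly count and values.
# Exact counts vary by scikit-learn version, so none are hard-coded here.
# contamination=0.01 sets the threshold at the 1st percentile of the training
# scores, so roughly 1% of the 8640 training points (about 86) are flagged too.
# Expect a total well above 6, and check that the injected values (88-96) appear.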

🧪 Hands-on

Install sklearn: pip install scikit-learn numpy
Run the Isolation Forest example above and examine which points it flags.
Change contamination=0.01 to 0.05 — observe how the number of anomalies detected changes. Understand the trade-off: more sensitivity = more false positives.
Replace the simulated CPU data with real metric exports from your Prometheus or Azure Monitor. Use a CSV with one column of metric values.
Calculate precision and recall: manually label which data points are real anomalies, then compare with model predictions.
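
Steps 3 and 5 can be combined in one loop. The sketch below reuses the simulated data from the example and treats only the six injected points as real anomalies; that labeling is an assumption standing in for the manual labels you would create for real metrics.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)
normal_cpu = np.clip(rng.normal(45, 10, size=8640), 5, 70)
injected = np.array([92, 95, 91, 88, 94, 96], dtype=float)
all_cpu = np.concatenate([normal_cpu, injected]).reshape(-1, 1)

# Ground truth: 1 = real anomaly (only the injected points), 0 = normal
y_true = np.concatenate([np.zeros(len(normal_cpu), dtype=int),
                         np.ones(len(injected), dtype=int)])

flagged_counts = {}
for contamination in (0.01, 0.05):
    model = IsolationForest(contamination=contamination, random_state=42)
    model.fit(normal_cpu.reshape(-1, 1))
    y_pred = (model.predict(all_cpu) == -1).astype(int)   # map -1/1 to 1/0
    flagged_counts[contamination] = int(y_pred.sum())
    print(
        f"contamination={contamination}: flagged={flagged_counts[contamination]} "
        f"precision={precision_score(y_true, y_pred, zero_division=0):.3f} "
        f"recall={recall_score(y_true, y_pred):.3f}"
    )
```

Raising contamination flags more points, so precision against the six true anomalies can only fall while recall stays the same or rises.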

💡 Key Gotcha: Class Imbalance

In production AIOps, anomalies are rare — maybe 0.1% of data points. If you train on imbalanced data without accounting for this, a model that always predicts normal achieves 99.9% accuracy. Always check precision and recall separately, never rely on accuracy alone for anomaly detection.
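
The accuracy trap is easy to reproduce. This sketch scores a "model" that always predicts normal on a made-up dataset with a 0.1% anomaly rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 10,000 points, 10 of them real anomalies (0.1%)
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A useless "model" that always predicts normal
y_pred = np.zeros(10_000, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
precision = precision_score(y_true, y_pred, zero_division=0)

print("accuracy:", accuracy)     # 0.999
print("recall:", recall)         # 0.0, every anomaly is missed
print("precision:", precision)   # 0.0, no alerts were fired at all
```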

🎮 Try It Yourself

Challenge: Tune Isolation Forest for Real AIOps Data

Run the Isolation Forest example from the section above. Check whether it flags the injected anomalies in the last six rows (indices 8640–8645), where CPU values of 88–96% were appended.
Now tune the contamination parameter. Change it to 0.001 (very strict) — how many anomalies does it miss? Change to 0.1 (lenient) — how many false positives appear? Record the trade-off.
Supervised scenario: Pretend you have labeled data. Create 10 rows labeled is_anomaly=True (CPU > 85) and 490 rows labeled False. Train a RandomForestClassifier on 80% and evaluate on 20% — calculate precision and recall separately.
Kubernetes context: Consider a pod with the following metrics in a 24-hour window: CPU% averages [45, 47, 92, 46, 44]. Would you use supervised (you have past labeled crashes) or unsupervised (first time deploying this service)? Justify your choice.

Key insight to internalize: In a K8s environment with 50+ services, you will never have enough labeled crash data for all of them. Unsupervised anomaly detection (Isolation Forest, z-score) is the pragmatic default; supervised models are added later for services with enough incident history.
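
For comparison with Isolation Forest, a z-score detector fits in a few lines. This is a minimal sketch: the function name, the threshold of 3 standard deviations, and the simulated CPU series are all assumptions chosen for illustration.

```python
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Return indices whose distance from the mean exceeds `threshold` standard deviations."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.array([], dtype=int)   # constant series: nothing stands out
    z = np.abs(values - mean) / std
    return np.where(z > threshold)[0]

rng = np.random.default_rng(7)
cpu = rng.normal(45, 3, size=288)   # simulated day of 5-minute CPU samples
cpu[100] = 92                       # injected spike
print(zscore_anomalies(cpu))
```

It needs no labels and no training step, which is why it works as a first detector for a service with no incident history. It also assumes one stable mean, so it will struggle with the seasonal patterns discussed later in this lesson.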

🧠 Debugging Scenario

Problem: Your Isolation Forest model correctly detects anomalies in test data but fires constantly in production — almost every hour is flagged as anomalous.

Root cause 1: Concept drift. Training data was from Q1, but production patterns changed in Q2 (new service launched, traffic patterns shifted). The model's idea of "normal" is outdated.
Root cause 2: Feature distribution mismatch. Training used CPU% 0-100, but production metrics come in as fractional (0.0-1.0). The model sees everything as an outlier.
Root cause 3: Contamination too high. contamination=0.1 places the threshold so that 10% of training data is treated as anomalous. If the true anomaly rate is closer to 0.1%, normal fluctuations are flagged constantly.
Fix: Retrain on recent data (rolling 30-day window), validate feature scales between train and production, and set contamination to match your actual expected anomaly rate.
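
A cheap guard against root causes 1 and 2 is to compare live inputs with the training data before scoring them. The sketch below is one possible check, not a standard API: the function name and the 50% tolerance are hypothetical.

```python
import numpy as np

def check_feature_scale(train_values, live_values, tolerance=0.5):
    """Flag a likely scale or drift problem when the live median is far from the training median.

    `tolerance` is a hypothetical relative threshold: 0.5 means the live median
    may differ from the training median by up to 50% before a flag is raised.
    """
    train_median = float(np.median(train_values))
    live_median = float(np.median(live_values))
    relative_change = abs(live_median - train_median) / max(abs(train_median), 1e-9)
    return relative_change > tolerance, relative_change

train_cpu = np.array([42.0, 45.0, 47.0, 44.0, 46.0])   # percent, 0-100
live_cpu = np.array([0.42, 0.45, 0.47, 0.44, 0.46])    # fraction, 0.0-1.0

flagged, change = check_feature_scale(train_cpu, live_cpu)
print(flagged, round(change, 2))   # True 0.99
```

Running a check like this at ingestion time turns a silent unit mismatch into an explicit failure instead of a flood of false alerts.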

🎯 Interview Questions

Beginner

What is the difference between supervised and unsupervised machine learning?

Supervised learning uses labeled training data — you provide input-output pairs and the model learns the mapping. Unsupervised learning finds patterns in data without labels. In AIOps, supervised models classify known incident types; unsupervised models detect novel anomalies you haven't labeled before.

What is overfitting and why does it matter in production AI systems?

Overfitting is when a model memorises training data too precisely — it performs well on training data but poorly on new data. In AIOps, an overfitted anomaly detector would learn the exact noise patterns in historical data and either miss new anomaly types or fire on normal production variations it hasn't seen.

What does precision vs recall mean for an alert system?

Precision: of all alerts fired, what fraction were real incidents? Low precision = alert fatigue. Recall: of all real incidents, what fraction did we catch? Low recall = missed incidents. In critical systems, recall is prioritised (never miss a P1 incident), even at the cost of some false positives.

Name three ML model types used in AIOps and their use cases.

1) Isolation Forest — anomaly detection on metric time series. 2) Random Forest classifier — incident severity scoring with labeled historical data. 3) LLMs (GPT-4) — natural language incident summarization and runbook generation.

What is model drift and how does it affect production AI systems?

Model drift occurs when the statistical distribution of input data changes after training (data drift) or when the relationship between inputs and outputs changes (concept drift). In AIOps, a model trained on Q1 traffic may perform poorly in Q4 when traffic patterns change seasonally. Detect drift by monitoring prediction distribution and model accuracy metrics over time.

Intermediate

How do you select features for an ML-based alert system?

Start with domain expertise: what signals correlate with incidents? (error rate, p99 latency, CPU delta, deployment recency). Use feature importance scores from tree models to prune irrelevant ones. Avoid high-cardinality categoricals (pod names) without encoding. Test that features are available in real-time, not just historical batch.

What is the trade-off when adjusting the contamination parameter in Isolation Forest?

The contamination parameter sets the expected fraction of anomalies in the training data, which sets the decision threshold. Too low: model misses real anomalies (low recall). Too high: model fires constantly (low precision). You should set it to match your empirically measured anomaly rate in production data, then validate with labeled golden datasets.

How would you monitor an ML model in production for health degradation?

Track: 1) Prediction distribution drift — is the model outputting unusual scores compared to baseline? 2) Input feature distribution shift — are the input values drifting from training distribution? 3) Ground truth accuracy — when incidents are resolved, did the model's classification match? 4) Latency p99 — is inference time creeping up? Set alerts on all four signals.

When should you use an LLM vs a traditional ML model for an AIOps task?

Use LLMs for: natural language output (summaries, runbooks), tasks with no labeled training data, novel incident types requiring contextual reasoning. Use traditional ML (sklearn, statsmodels) for: high-frequency prediction at low latency (<10ms), structured metric data, cost-sensitive pipelines. LLMs typically cost far more per prediction than a local sklearn model, often by orders of magnitude.

What is cross-validation and why is it important before deploying an ML model?

Cross-validation splits training data into k folds, training on k-1 and validating on the remaining one, repeated so that each fold serves once as validation. It gives a more reliable estimate of performance on unseen data than a single train/validation split. For AIOps time-series data, use time-based splits (train on past, validate on newer data) to prevent data leakage; never randomly shuffle temporal data.
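
A time-based split can be sketched with scikit-learn's TimeSeriesSplit. The 12 observations here are placeholders; the point is only the shape of the folds.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 hypothetical observations in time order (index 0 is the oldest)
X = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for fold, (train_idx, test_idx) in enumerate(splits):
    # Every training index is older than every validation index
    print(f"fold {fold}: train={train_idx.tolist()} validate={test_idx.tolist()}")

# fold 0: train=[0, 1, 2] validate=[3, 4, 5]
# fold 1: train=[0, 1, 2, 3, 4, 5] validate=[6, 7, 8]
# fold 2: train=[0, 1, 2, 3, 4, 5, 6, 7, 8] validate=[9, 10, 11]
```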

Scenario-based

Your anomaly detection model fires every Monday morning at 9am but there's no real incident. How do you diagnose and fix this?

Monday morning traffic spikes are a predictable pattern the model hasn't learned. Fix: 1) Add day-of-week and hour as features so the model understands cyclical patterns. 2) Use a model like Prophet or seasonal ARIMA that natively handles weekly seasonality. 3) Add a "business hours" suppression window with domain-specific context for Monday 9am deployments.
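
Fix 1 could look like the sketch below, which derives time features from a timestamp using only the standard library. The function name and the exact feature set are assumptions; the sine/cosine encoding is one common way to tell a model that 23:00 and 00:00 are neighbours.

```python
import math
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Encode hour and weekday so a model can learn recurring weekly patterns."""
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "day_of_week": ts.weekday(),        # Monday = 0 ... Sunday = 6
        "hour_sin": math.sin(hour_angle),   # sin/cos keeps 23:00 close to 00:00
        "hour_cos": math.cos(hour_angle),
        "is_monday_9am": int(ts.weekday() == 0 and ts.hour == 9),
    }

print(time_features(datetime(2024, 1, 8, 9, 0)))   # 8 Jan 2024 was a Monday
```

These columns would be appended to the metric features before training, so the Monday spike becomes part of what the model considers normal.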

A newly trained model has 99.5% accuracy but your team still experiences missed incidents. What's the problem?

High accuracy with class imbalance is misleading. If 99.5% of data points are normal, a model that always predicts "normal" achieves 99.5% accuracy while catching zero anomalies. Evaluate recall separately — a recall of 0% on the anomaly class means every incident is missed. Use F1 score or AUC-ROC for imbalanced classification. Always check confusion matrix, not just accuracy.

You're asked to explain why an ML model flagged a specific alert to a compliance team. What model types would you choose and why?

Prefer inherently interpretable models such as decision trees, or pair Random Forest/XGBoost with post-hoc explanation methods such as LIME or SHAP. These provide feature importance and per-prediction explanations. SHAP values attribute each prediction to the input features that contributed to it. Avoid black-box deep learning for compliance-sensitive decisions. Document model lineage: training data, features, version, and approval chain.

🌐 Real-world Usage

Dynatrace Davis uses a combination of clustering (for topology-aware grouping), regression (for baseline forecasting), and classification (for root cause attribution). Azure Monitor's Smart Detection uses historical metric patterns to learn per-resource baselines — different CPU baselines for a web server vs a batch processor. Google SRE teams use ML to predict SLO burn rate before breaches occur, reducing manual intervention.

📝 Summary

DevOps engineers don't need to build ML models from scratch, but they must understand the fundamentals: model types (classification, anomaly detection, regression), evaluation metrics (precision, recall over accuracy), and failure modes (class imbalance, concept drift, feature mismatch). These skills let you deploy, tune, debug, and confidently explain AI systems in production.
