Beginner · Lesson 3 of 16

Machine Learning Fundamentals for DevOps Engineers

Learn the essential ML concepts you need to evaluate, deploy, and debug AI systems in production — without becoming a data scientist.

🧒 Simple Explanation (ELI5)

Imagine you want to teach a friend to recognise spam emails. Instead of writing a list of rules ("if email contains 'win money' it's spam"), you show them 10,000 examples of spam and 10,000 normal emails. After seeing enough examples, they can spot new spam even if they've never seen those exact words before. That's machine learning — learning from examples instead of following explicit rules.

For DevOps, this means: instead of writing a static rule such as "if cpu > 80% then alert", you show the ML model thousands of normal CPU patterns and it learns to recognise when something is genuinely abnormal, including slow, gradual drifts that static rules miss.

🔧 Why DevOps Engineers Need ML Fundamentals

🌍 Real-world Analogy

An experienced nurse learns what "normal" patient vitals look like after seeing thousands of patients. She doesn't follow a rigid rule book — she has an intuition built from experience. If a patient's heart rate is 95 bpm, she knows whether that's concerning based on their age, medication, and whether they just walked upstairs. ML models work the same way: they build "intuition" from training data that captures context no static rule could encode.

⚙️ Core ML Concepts for AIOps

1. Supervised vs Unsupervised Learning

| Type | Training Data | AIOps Use Case | Example |
| --- | --- | --- | --- |
| Supervised | Labeled examples (input + correct output) | Incident classification, severity scoring | 10,000 logs labeled "normal" or "error" |
| Unsupervised | Raw data (no labels) | Anomaly detection, log clustering | Find unusual patterns in metric streams |
| Semi-supervised | Small labeled set + large unlabeled | Alert classification with few labeled examples | 100 labeled incidents + 50,000 unlabeled logs |
| Reinforcement | Reward/penalty signals | Auto-remediation policy optimization | Reward system when remediation resolves incident |

2. Key Model Types in AIOps

3. The ML Pipeline

Every ML system in production follows this flow:

  1. Data collection: Gather raw logs, metrics, traces
  2. Feature engineering: Extract meaningful signals (error rate, request volume, latency percentiles)
  3. Model training: Fit mathematical patterns to historical data
  4. Evaluation: Measure accuracy, precision, recall on held-out data
  5. Deployment: Serve predictions in real-time pipeline
  6. Monitoring: Track model drift — when live data diverges from training data
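
As a rough illustration of how these six stages connect, the sketch below runs a toy version of the flow using only the Python standard library. The record fields (`latency_ms`, `status`), the helper names, and the z-score "model" are all assumptions made for this example; a real pipeline would pull data from your observability stack and use a proper ML library.

```python
import math
from statistics import mean, stdev

# 1. Data collection: hypothetical request records for one 5-minute window
def make_window(n_requests, n_errors):
    return [
        {"latency_ms": 100 + (i % 10), "status": 500 if i < n_errors else 200}
        for i in range(n_requests)
    ]

# 2. Feature engineering: turn raw records into numeric signals
def extract_features(window):
    latencies = sorted(r["latency_ms"] for r in window)
    p99_index = max(0, math.ceil(0.99 * len(latencies)) - 1)
    errors = sum(1 for r in window if r["status"] >= 500)
    return {
        "request_volume": len(window),
        "error_rate": errors / len(window),
        "latency_p99": latencies[p99_index],
    }

# 3. Model training: the "model" here is only the mean and standard
#    deviation of the error rate over historical (assumed normal) windows
def train(feature_rows):
    rates = [row["error_rate"] for row in feature_rows]
    return {"mean": mean(rates), "std": stdev(rates)}

# 5. Deployment: score a new window; |z| above the threshold is anomalous
def predict(model, features, z_threshold=3.0):
    z = (features["error_rate"] - model["mean"]) / model["std"]
    return abs(z) > z_threshold

history = [extract_features(make_window(100, i % 4)) for i in range(20)]
model = train(history)

# 4. Evaluation: compare predictions with known labels on held-out windows
held_out = [(make_window(100, 2), False), (make_window(100, 30), True)]
results = [(predict(model, extract_features(w)), label) for w, label in held_out]
correct = sum(1 for predicted, label in results if predicted == label)
print(f"held-out windows classified correctly: {correct}/{len(results)}")

# 6. Monitoring: compare a live feature value with the training baseline
live_rate = extract_features(make_window(100, 3))["error_rate"]
print(f"live error rate {live_rate:.2f} vs training mean {model['mean']:.3f}")
```

The point of the sketch is the shape of the flow, not the model: swapping the z-score for an Isolation Forest changes only the `train` and `predict` steps.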

4. Evaluation Metrics You Must Understand

| Metric | Meaning | AIOps Importance |
| --- | --- | --- |
| Precision | Of all alerts fired, how many were real? | Low precision = alert fatigue |
| Recall | Of all real incidents, how many were caught? | Low recall = missed incidents |
| F1 Score | Harmonic mean of precision and recall | Balance between false positives and negatives |
| AUC-ROC | Overall classifier performance across thresholds | Tune sensitivity without retraining |
| Latency p99 | 99th percentile prediction time | Model must respond in <200ms for real-time triage |
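
To make precision, recall, and F1 concrete, here is a small worked example in plain Python. The counts are invented for illustration: 100 alerts fired in a week, 40 of them real, plus 10 real incidents that never produced an alert.

```python
def precision(tp, fp):
    """Of all alerts fired, the fraction that were real incidents."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of all real incidents, the fraction that produced an alert."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical weekly counts: true positives, false positives, false negatives
tp, fp, fn = 40, 60, 10

p = precision(tp, fp)   # 40 / (40 + 60)
r = recall(tp, fn)      # 40 / (40 + 10)
score = f1(p, r)

print(f"precision={p:.2f} recall={r:.2f} f1={score:.2f}")
```

With these numbers the detector catches most incidents (recall 0.80) but more than half of its alerts are noise (precision 0.40), which is the alert-fatigue pattern described in the table.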

📊 Visual: ML Model Selection for AIOps

Choosing the Right Model Type

Problem type?
  - Classify logs → Classification
  - Find outliers → Anomaly Detection
  - Group similar → Clustering
  - Predict number → Regression

Do you have labels?
  - Yes → Supervised (sklearn)
  - No → Unsupervised (Isolation Forest)
  - Natural language → LLM (Azure OpenAI)

Production considerations
  - Latency: <200ms real-time
  - Cost: batch vs streaming
  - Explainability: regulated?

⌨️ Your First Anomaly Detector: Isolation Forest

python
"""
Isolation Forest anomaly detector for CPU/memory metrics.
Isolation Forest works by: if a data point is isolated quickly
(few random splits needed), it's likely an outlier.
"""
import numpy as np
from sklearn.ensemble import IsolationForest

# Simulate 30 days of normal CPU metrics (5-min intervals)
np.random.seed(42)
normal_cpu = np.random.normal(loc=45, scale=10, size=8640)        # 45% avg CPU
normal_cpu = np.clip(normal_cpu, 5, 70)                            # cap at realistic range

# Add intentional anomalies at the end (simulate incident)
anomaly_cpu = np.array([92, 95, 91, 88, 94, 96])

all_cpu = np.concatenate([normal_cpu, anomaly_cpu]).reshape(-1, 1)

# Train on the normal data only
model = IsolationForest(
    contamination=0.01,    # expect 1% anomalies
    random_state=42,
    n_estimators=100
)
model.fit(normal_cpu.reshape(-1, 1))

# Score all data points
scores = model.decision_function(all_cpu)  # more negative = more anomalous
predictions = model.predict(all_cpu)        # -1 = anomaly, 1 = normal

# Report anomalies found
anomaly_indices = np.where(predictions == -1)[0]
print(f"Total data points: {len(all_cpu)}")
print(f"Anomalies detected: {len(anomaly_indices)}")
print(f"Anomaly CPU values: {all_cpu[anomaly_indices].flatten().tolist()}")

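# What to expect (exact values depend on your scikit-learn version):
# - Total data points: 8646
# - Anomalies detected: well above 6. contamination=0.01 puts the threshold
#   at the 1st percentile of the training scores, so up to about 1% of the
#   8,640 "normal" points (roughly 86) are flagged as well.
# - An out-of-range reading such as 92 follows the same tree paths as the
#   largest training value (70), so check whether 88-96 appear in the list.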

🧪 Hands-on

  1. Install sklearn: pip install scikit-learn numpy
  2. Run the Isolation Forest example above and reproduce the output.
  3. Change contamination=0.01 to 0.05 — observe how the number of anomalies detected changes. Understand the trade-off: more sensitivity = more false positives.
  4. Replace the simulated CPU data with real metric exports from your Prometheus or Azure Monitor. Use a CSV with one column of metric values.
  5. Calculate precision and recall: manually label which data points are real anomalies, then compare with model predictions.

💡 Key Gotcha: Class Imbalance

In production AIOps, anomalies are rare, often around 0.1% of data points. At that ratio, a model that always predicts "normal" achieves 99.9% accuracy while catching zero anomalies. Always check precision and recall separately; never rely on accuracy alone for anomaly detection.
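
The arithmetic behind this gotcha is easy to check. The sketch below uses made-up labels (10,000 points, 10 real anomalies) and a "model" that never alerts.

```python
# Hypothetical day of data: 10,000 points, 10 of them real anomalies (0.1%)
labels = [True] * 10 + [False] * 9990
always_normal = [False] * len(labels)   # a "model" that never fires an alert

def accuracy(y_true, y_pred):
    """Fraction of points where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of real anomalies that were predicted as anomalies."""
    caught = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    actual = sum(y_true)
    return caught / actual if actual else 0.0

acc = accuracy(labels, always_normal)
rec = recall(labels, always_normal)
print(f"accuracy={acc:.3f} recall={rec:.1f}")   # accuracy=0.999 recall=0.0
```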

🎮 Try It Yourself

🎮 Challenge: Tune Isolation Forest for Real AIOps Data
  1. Run the Isolation Forest example from the "Your First Anomaly Detector" section above. Check whether it flags the six injected readings at rows 8640 to 8645 (the last six points, CPU 88% to 96%).
  2. Now tune the contamination parameter. Change it to 0.001 (very strict) — how many anomalies does it miss? Change to 0.1 (lenient) — how many false positives appear? Record the trade-off.
  3. Supervised scenario: Pretend you have labeled data. Create 10 rows labeled is_anomaly=True (CPU > 85) and 490 rows labeled False. Train a RandomForestClassifier on 80% and evaluate on 20% — calculate precision and recall separately.
  4. Kubernetes context: Consider a pod with the following metrics in a 24-hour window: CPU% averages [45, 47, 92, 46, 44]. Would you use supervised (you have past labeled crashes) or unsupervised (first time deploying this service)? Justify your choice.
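
One possible starting point for step 3 is sketched below. The synthetic data, the random seeds, and the choice of a stratified split are assumptions for illustration; with only 10 positive rows, stratifying keeps 2 of them in the 20% test set so that recall can be measured at all.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# 490 "normal" CPU readings and 10 spikes (synthetic, illustration only)
normal = np.clip(rng.normal(45, 10, size=490), 5, 70)
spikes = rng.uniform(88, 99, size=10)
cpu = np.concatenate([normal, spikes]).reshape(-1, 1)
is_anomaly = (cpu.ravel() > 85).astype(int)   # label rule from the exercise

# 80/20 split; stratify keeps the anomaly ratio equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    cpu, is_anomaly, test_size=0.2, stratify=is_anomaly, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Report precision and recall separately, never accuracy alone
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"test rows={len(y_test)} anomalies in test={int(y_test.sum())}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the labels here come from a simple threshold on the only feature, this toy problem is far easier than real incident data; treat the scores as a check that the evaluation code works rather than as evidence about model quality.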

Key insight to internalize: In a K8s environment with 50+ services, you will never have enough labeled crash data for all of them. Unsupervised anomaly detection (Isolation Forest, z-score) is the pragmatic default; supervised models are added later for services with enough incident history.

🧠 Debugging Scenario

Problem: Your Isolation Forest model correctly detects anomalies in test data but fires constantly in production — almost every hour is flagged as anomalous.

🎯 Interview Questions

Beginner

What is the difference between supervised and unsupervised machine learning?

Supervised learning uses labeled training data — you provide input-output pairs and the model learns the mapping. Unsupervised learning finds patterns in data without labels. In AIOps, supervised models classify known incident types; unsupervised models detect novel anomalies you haven't labeled before.

What is overfitting and why does it matter in production AI systems?

Overfitting is when a model memorises training data too precisely — it performs well on training data but poorly on new data. In AIOps, an overfitted anomaly detector would learn the exact noise patterns in historical data and either miss new anomaly types or fire on normal production variations it hasn't seen.

What does precision vs recall mean for an alert system?

Precision: of all alerts fired, what fraction were real incidents? Low precision = alert fatigue. Recall: of all real incidents, what fraction did we catch? Low recall = missed incidents. In critical systems, recall is prioritised (never miss a P1 incident), even at the cost of some false positives.

Name three ML model types used in AIOps and their use cases.

1) Isolation Forest — anomaly detection on metric time series. 2) Random Forest classifier — incident severity scoring with labeled historical data. 3) LLMs (GPT-4) — natural language incident summarization and runbook generation.

What is model drift and how does it affect production AI systems?

Model drift occurs when the statistical distribution of input data changes after training (data drift) or when the relationship between inputs and outputs changes (concept drift). In AIOps, a model trained on Q1 traffic may perform poorly in Q4 when traffic patterns change seasonally. Detect drift by monitoring prediction distribution and model accuracy metrics over time.
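
One common heuristic for spotting data drift is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with the same feature in live traffic. The sketch below is a minimal standard-library version; the bin count, the `1e-6` floor for empty bins, and the synthetic CPU data are assumptions, and the frequently quoted cut-offs (below about 0.1 is stable, above about 0.25 is significant drift) are rules of thumb rather than guarantees.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp values outside the baseline range
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
train_cpu = [random.gauss(45, 10) for _ in range(5000)]     # training-time baseline
live_same = [random.gauss(45, 10) for _ in range(5000)]     # same distribution
live_shifted = [random.gauss(60, 10) for _ in range(5000)]  # mean moved by 15 points

psi_same = psi(train_cpu, live_same)
psi_shifted = psi(train_cpu, live_shifted)
print(f"PSI, same distribution: {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

PSI only detects data drift in one feature at a time. Concept drift still needs the ground-truth comparison mentioned above, because the inputs can look unchanged while their relationship to incidents shifts.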

Intermediate

How do you select features for an ML-based alert system?

Start with domain expertise: what signals correlate with incidents? (error rate, p99 latency, CPU delta, deployment recency). Use feature importance scores from tree models to prune irrelevant ones. Avoid high-cardinality categoricals (pod names) without encoding. Test that features are available in real-time, not just historical batch.

What is the trade-off when adjusting the contamination parameter in Isolation Forest?

The contamination parameter sets the expected fraction of anomalies in the training data, which sets the decision threshold. Too low: model misses real anomalies (low recall). Too high: model fires constantly (low precision). You should set it to match your empirically measured anomaly rate in production data, then validate with labeled golden datasets.
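
A quick way to see this trade-off is to refit the same forest with several contamination values and count how many training points get flagged. The sketch below uses synthetic CPU data (an assumption for illustration) and reads scikit-learn's fitted `offset_` attribute, which holds the score threshold derived from `contamination`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(loc=45, scale=10, size=(2000, 1))   # synthetic CPU %, illustration only

flagged = {}
for contamination in (0.001, 0.01, 0.05, 0.1):
    model = IsolationForest(
        contamination=contamination, n_estimators=100, random_state=42
    )
    model.fit(X)
    # predict() returns -1 for points whose score falls below the threshold
    flagged[contamination] = int((model.predict(X) == -1).sum())
    print(
        f"contamination={contamination}: offset_={model.offset_:.3f}, "
        f"flagged {flagged[contamination]} of {len(X)} training points"
    )
```

The trees are identical across runs because `random_state` is fixed; only the threshold moves, so a higher contamination value can only flag the same points or more.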

How would you monitor an ML model in production for health degradation?

Track: 1) Prediction distribution drift — is the model outputting unusual scores compared to baseline? 2) Input feature distribution shift — are the input values drifting from training distribution? 3) Ground truth accuracy — when incidents are resolved, did the model's classification match? 4) Latency p99 — is inference time creeping up? Set alerts on all four signals.

When should you use an LLM vs a traditional ML model for an AIOps task?

Use LLMs for: natural language output (summaries, runbooks), tasks with no labeled training data, novel incident types requiring contextual reasoning. Use traditional ML (sklearn, statsmodels) for: high-frequency prediction at low latency (<10ms), structured metric data, cost-sensitive pipelines. As a rough guide, LLMs cost on the order of 100x more per prediction than a local sklearn model, though the exact ratio depends on the model and pricing.

What is cross-validation and why is it important before deploying an ML model?

Cross-validation splits the training data into k folds, repeatedly training on k-1 folds and validating on the remaining one. Averaging across folds gives a more reliable performance estimate than a single train/validation split, though a final held-out test set is still worth keeping. For AIOps time-series data, use time-based splits (train on past, validate on newer data) to prevent data leakage; never randomly shuffle temporal data.
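
A time-based split can be sketched in a few lines. The helper below is a hypothetical expanding-window ("walk-forward") splitter written for this example: each fold trains on everything up to a cut-off and validates on the block that follows it.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices); training rows always come first in time."""
    fold_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        yield range(0, train_end), range(train_end, train_end + fold_size)

# 100 time-ordered rows, 4 folds, at least 20 rows of history before the first test block
splits = list(walk_forward_splits(n_samples=100, n_folds=4, min_train=20))
for train_idx, test_idx in splits:
    print(f"train rows 0-{train_idx[-1]}, validate rows {test_idx[0]}-{test_idx[-1]}")
```

scikit-learn ships a comparable utility, `sklearn.model_selection.TimeSeriesSplit`, which is usually the better choice once you are already working inside that library.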

Scenario-based

Your anomaly detection model fires every Monday morning at 9am but there's no real incident. How do you diagnose and fix this?

Monday morning traffic spikes are a predictable pattern the model hasn't learned. Fix: 1) Add day-of-week and hour as features so the model understands cyclical patterns. 2) Use a model like Prophet or seasonal ARIMA that natively handles weekly seasonality. 3) Add a "business hours" suppression window with domain-specific context for Monday 9am deployments.
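
The first fix can be illustrated without any ML library. The sketch below compares a single global baseline with a per-slot baseline keyed on (weekday, hour); the traffic shape, the jitter, and the z-score threshold of 3 are all assumptions chosen to keep the example small.

```python
from collections import defaultdict
from statistics import mean, stdev

MONDAY, TUESDAY = 0, 1

def typical_load(weekday, hour):
    # Assumed traffic shape: a predictable spike every Monday at 09:00
    return 500 if (weekday, hour) == (MONDAY, 9) else 100

# Four weeks of hourly request rates with a little deterministic jitter
history = []
for week in range(4):
    jitter = (week % 3) - 1
    for weekday in range(7):
        for hour in range(24):
            history.append((weekday, hour, typical_load(weekday, hour) + jitter))

all_values = [value for _, _, value in history]
global_mean, global_std = mean(all_values), stdev(all_values)

by_slot = defaultdict(list)
for weekday, hour, value in history:
    by_slot[(weekday, hour)].append(value)

def global_flag(value, z=3.0):
    """One baseline for all hours: blind to weekly patterns."""
    return abs(value - global_mean) / global_std > z

def seasonal_flag(weekday, hour, value, z=3.0):
    """Baseline per (weekday, hour) slot: knows Monday 09:00 is always busy."""
    slot = by_slot[(weekday, hour)]
    return abs(value - mean(slot)) / stdev(slot) > z

print("Monday 09:00, 502 req/min")
print("  global baseline flags it:  ", global_flag(502))
print("  seasonal baseline flags it:", seasonal_flag(MONDAY, 9, 502))
print("Tuesday 03:00, 400 req/min")
print("  seasonal baseline flags it:", seasonal_flag(TUESDAY, 3, 400))
```

Feeding day-of-week and hour to a model as features achieves the same effect in a learned form: the model can condition "normal" on the time slot instead of on one global average.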

A newly trained model has 99.5% accuracy but your team still experiences missed incidents. What's the problem?

High accuracy with class imbalance is misleading. If 99.5% of data points are normal, a model that always predicts "normal" achieves 99.5% accuracy while catching zero anomalies. Evaluate recall separately — a recall of 0% on the anomaly class means every incident is missed. Use F1 score or AUC-ROC for imbalanced classification. Always check confusion matrix, not just accuracy.

You're asked to explain why an ML model flagged a specific alert to a compliance team. What model types would you choose and why?

Prefer interpretable models such as Decision Trees, or pair Random Forest/XGBoost with post-hoc explanation tools such as SHAP or LIME. These provide feature importance and per-prediction explanations. SHAP values attribute each individual prediction to the input features that drove it. Avoid black-box deep learning for compliance-sensitive decisions. Document model lineage: training data, features, version, and approval chain.

🌐 Real-world Usage

Dynatrace Davis uses a combination of clustering (for topology-aware grouping), regression (for baseline forecasting), and classification (for root cause attribution). Azure Monitor's Smart Detection uses historical metric patterns to learn per-resource baselines — different CPU baselines for a web server vs a batch processor. Google SRE teams use ML to predict SLO burn rate before breaches occur, reducing manual intervention.

📝 Summary

DevOps engineers don't need to build ML models from scratch, but they must understand the fundamentals: model types (classification, anomaly detection, regression), evaluation metrics (precision, recall over accuracy), and failure modes (class imbalance, concept drift, feature mismatch). These skills let you deploy, tune, debug, and confidently explain AI systems in production.