Beginner | Lesson 1 of 11

Monitoring Fundamentals

Understand monitoring, observability, signals, and why operational visibility matters before you touch Prometheus.

Simple Explanation (ELI5)

Monitoring is like looking at your car dashboard while driving. It tells you speed, fuel, and engine temperature. Observability is broader: if the engine makes a strange noise, observability gives you enough clues to figure out why. Monitoring answers known questions. Observability helps investigate unknown problems.

Real-world Analogy

A hospital ICU has screens for heart rate, blood pressure, and oxygen levels. That is monitoring. When a patient suddenly gets worse, the doctor also checks tests, scans, and history to understand why. That extra investigation layer is observability.

Technical Explanation

Monitoring collects signals from systems and checks them against expected behavior. In modern platforms, the core telemetry signals are metrics, logs, and traces. Metrics are numeric, compact, and cheap to store and aggregate, so they are well suited to dashboards and alerts. Logs and traces carry the detail needed to explain what a metric only hints at. Observability combines telemetry, context, and tooling so engineers can debug complex distributed systems.

Metrics: Time-series numeric values like CPU, memory, request count.
Logs: Discrete events with rich text details.
Traces: Request flows across multiple services.
SLI: Service Level Indicator, such as request success rate.
SLO: Service Level Objective, a target for an SLI over a time window, such as 99.9% availability over a month.
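To make the SLI/SLO relationship concrete, here is a minimal sketch that derives a success-rate SLI from two counters and compares it with a target. The function names sli_success_rate and slo_status are hypothetical helpers invented for this illustration, and the sample numbers are made up.

```bash
# Hypothetical helper: success-rate SLI as a percentage.
# Arguments: total requests, failed requests.
sli_success_rate() {
  local total="$1" failed="$2"
  awk -v t="$total" -v f="$failed" 'BEGIN {
    if (t + 0 <= 0) { print "no-data"; exit }   # avoid dividing by zero
    printf "%.3f\n", (t - f) / t * 100
  }'
}

# Hypothetical helper: compare an SLI with an SLO target (both in percent).
# Prints "met" or "violated".
slo_status() {
  local sli="$1" target="$2"
  awk -v s="$sli" -v g="$target" 'BEGIN {
    if (s + 0 >= g + 0) print "met"; else print "violated"
  }'
}

# Usage with invented numbers: 1000 requests, 5 failures, 99.9% target.
sli=$(sli_success_rate 1000 5)
echo "SLI=${sli}% status=$(slo_status "$sli" 99.9)"
```

The SLI is the measurement and the SLO is the line it is judged against; keeping them as separate steps mirrors how the two terms are defined above.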

Visual Representation

Application → Telemetry (Metrics, Logs, Traces) → Dashboards, Alerts, Debugging

Commands / Syntax

bash

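# Example Linux checks you might monitor
uptime                                  # load averages and time since boot
free -m                                 # memory usage in MiB
df -h                                   # disk usage per filesystem
curl -I http://localhost:8080/health    # response headers from a health endpoint

# Example Kubernetes checks you eventually convert into metrics
kubectl top nodes                       # node CPU/memory; requires the Metrics Server
kubectl top pods -A                     # pod CPU/memory across all namespaces
kubectl get events -A --sort-by=.lastTimestamp   # cluster events, oldest first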
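
As a rough illustration of what "converting a check into a metric" can mean, the sketch below turns the HTTP status code of a health check into a single gauge line in the Prometheus text exposition format. The function name health_metric, the metric name app_health_up, and the target label are assumptions made up for this example; they do not come from the lesson.

```bash
# Hypothetical helper: map an HTTP status code to a 0/1 gauge line.
# Arguments: the URL that was checked, the status code it returned.
health_metric() {
  local url="$1" status="$2"
  local up=0
  case "$status" in
    2??) up=1 ;;   # any 2xx response counts as healthy
  esac
  printf 'app_health_up{target="%s"} %s\n' "$url" "$up"
}

# Usage sketch (needs a reachable endpoint, so it is left commented out):
# url=http://localhost:8080/health
# health_metric "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
```

A numeric 0/1 value can be graphed and alerted on over time, which a one-off manual check cannot.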

Example (Real-world Use Case)

An e-commerce team monitors checkout latency, payment failures, pod restarts, node CPU saturation, and database connection pool usage. When cart abandonment rises, the team correlates it with error rates and latency metrics to determine whether the cause is platform-related or a business fluctuation.

Hands-on Section

List three things you would measure for a web app: traffic, latency, and errors.
Map each to a metric name like http_requests_total, http_request_duration_seconds, and http_requests_errors_total.
Define a simple SLO: 95% of requests should complete within 300 ms.
Decide what alert should fire if the SLO is violated for 10 minutes.
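
The last two steps can be sketched as follows, assuming latencies arrive as one millisecond value per line and the SLI is evaluated once per minute. The function names pct_within and alert_state are hypothetical, and the sample values are invented. The second function only reports "firing" after the target has been missed for a given number of consecutive samples, which is one simple way to express "violated for 10 minutes".

```bash
# Hypothetical helper: percentage of latencies (ms, one per line on stdin)
# that are at or below the threshold given as the first argument.
pct_within() {
  local threshold_ms="$1"
  awk -v th="$threshold_ms" '
    NF  { n++; if ($1 + 0 <= th + 0) ok++ }
    END { if (n == 0) print "no-data"; else printf "%.1f\n", ok / n * 100 }'
}

# Hypothetical helper: reads one SLI percentage per minute from stdin
# (oldest first). Prints "firing" when the most recent for_minutes samples
# are all below the target, otherwise "ok".
alert_state() {
  local target="$1" for_minutes="$2"
  awk -v g="$target" -v need="$for_minutes" '
    NF  { if ($1 + 0 < g + 0) streak++; else streak = 0 }
    END { if (streak >= need + 0) print "firing"; else print "ok" }'
}

# Usage with invented samples: 3 of 4 requests finish within 300 ms.
printf '%s\n' 100 200 250 400 | pct_within 300
```

Requiring a sustained violation instead of a single bad sample is a common way to keep short blips from paging anyone.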

Try It Yourself

Pick one app you know and list five metrics that would actually help an on-call engineer.
Classify each of these issues as monitoring or observability: rising CPU, unknown timeout, sudden request spike.
Write one SLI and one SLO for a login endpoint.

Debugging Scenarios

Common Failure

Teams often collect lots of data but no useful signal. If dashboards do not map to user impact or SLOs, they become decoration, not operational tooling.

If alerts are noisy, retune thresholds around user impact rather than raw system movement.
If CPU is high but latency is fine, do not page immediately. CPU on its own lacks context.
If dashboards show green while customers complain, you are likely missing business-critical metrics.
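
One way to encode the guidance above is to let user-facing signals decide whether to page and treat CPU as supporting context. This is a minimal sketch: the function name page_decision and the thresholds (300 ms p95 latency, 1% errors, 90% CPU) are illustrative assumptions, not recommended values.

```bash
# Hypothetical helper: decide how to react to a set of readings.
# Arguments: CPU utilisation (%), p95 latency (ms), error rate (%).
# Prints "page", "ticket", or "ok".
page_decision() {
  local cpu_pct="$1" p95_ms="$2" error_pct="$3"
  awk -v cpu="$cpu_pct" -v lat="$p95_ms" -v err="$error_pct" 'BEGIN {
    # User impact comes first: slow or failing requests justify a page.
    user_impact = (lat + 0 > 300 || err + 0 > 1)
    if (user_impact)            print "page"
    else if (cpu + 0 > 90)      print "ticket"   # worth a look, not a wake-up call
    else                        print "ok"
  }'
}

# Usage with invented readings: high CPU, healthy latency and errors.
page_decision 95 120 0.2
```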

Interview Questions

Beginner

What is monitoring?

Monitoring is the continuous collection and evaluation of system signals so teams can see health, performance, and failures quickly.

What is observability?

Observability is the ability to understand internal system behavior using telemetry such as metrics, logs, and traces, especially for unknown problems.

Name the three core telemetry signals.

Metrics, logs, and traces.

What is an SLI?

An SLI is a measurable indicator of service behavior, for example request success rate or p95 latency.

What is an SLO?

An SLO is a target value for an SLI, such as 99.9% availability over a month.
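
A target like this implies an error budget: the amount of unavailability the SLO still tolerates within the window. The sketch below works that out in minutes, assuming a 30-day month; the function name error_budget_minutes is a hypothetical helper for this illustration.

```bash
# Hypothetical helper: allowed downtime in minutes for an availability SLO.
# Arguments: SLO target in percent, window length in days.
error_budget_minutes() {
  local slo_pct="$1" window_days="$2"
  awk -v s="$slo_pct" -v d="$window_days" 'BEGIN {
    # (100 - SLO)% of the window, converted from days to minutes
    printf "%.1f\n", (100 - s) / 100 * d * 24 * 60
  }'
}

# 99.9% over 30 days leaves 0.1% of 43,200 minutes.
error_budget_minutes 99.9 30   # prints 43.2
```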

Intermediate

Why are metrics usually preferred for alerting over logs?

Metrics are numeric, compact, and easy to aggregate over time, which makes them efficient and reliable for threshold-based alerting.

How do monitoring and observability complement each other?

Monitoring detects known failure patterns quickly, while observability helps investigate root cause when the failure pattern is not obvious.

What makes an alert actionable?

An alert is actionable when it represents meaningful user or business impact and provides enough context for the responder to start remediation.

Why should dashboards map to customer experience?

Because raw infrastructure data alone may not reflect whether users are affected. Good dashboards connect platform health to real service outcomes.

What is alert fatigue?

Alert fatigue happens when engineers receive too many low-value alerts, which leads to slower responses and missed incidents.

Scenario-based

CPU is spiking but users report no problem. Do you page the team?

Not automatically. First check correlated signals like latency, errors, queue depth, and saturation. High CPU without user impact may be acceptable for a short period.

Users say the app is slow but dashboards are green. What does that suggest?

The monitoring set is incomplete. You may be missing the right latency metric, business transaction metric, or client-side visibility.

How would you define first metrics for a login service?

I would start with request rate, success rate, latency, dependency errors, and authentication backend response time.

A new team wants 50 infrastructure alerts on day one. What do you recommend?

Start with a small set tied to service health and customer impact, then expand carefully after observing real incidents.

How do you explain observability to a non-technical manager?

Monitoring tells us something is wrong. Observability helps us quickly figure out why, so downtime is shorter and less expensive.

Summary

Monitoring gives teams visibility into known behaviors, while observability strengthens investigation of unknown failures. Before learning Prometheus, you need to understand metrics, SLOs, and what good operational signals look like.
