Monitoring Fundamentals
Understand monitoring, observability, signals, and why operational visibility matters before you touch Prometheus.
Simple Explanation (ELI5)
Monitoring is like looking at your car dashboard while driving. It tells you speed, fuel, and engine temperature. Observability is broader: if the engine makes a strange noise, observability gives you enough clues to figure out why. Monitoring answers known questions. Observability helps investigate unknown problems.
Real-world Analogy
A hospital ICU has screens for heart rate, blood pressure, and oxygen levels. That is monitoring. When a patient suddenly gets worse, the doctor also checks tests, scans, and history to understand why. That extra investigation layer is observability.
Technical Explanation
Monitoring collects signals from systems and checks them against expected behavior. In modern platforms, the core telemetry signals are metrics, logs, and traces. Metrics are numeric, compact, and cheap to store, which makes them well suited to dashboards and alerts. Observability combines telemetry, context, and tooling so engineers can debug complex distributed systems.
- Metrics: Time-series numeric values like CPU, memory, request count.
- Logs: Discrete events with rich text details.
- Traces: Request flows across multiple services.
- SLI: Service Level Indicator, such as request success rate.
- SLO: Service Level Objective, such as 99.9% uptime.
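To make the SLI and SLO definitions in the list above concrete, here is a minimal sketch that computes a request-success-rate SLI, compares it with a 99.9% objective, and works out how much downtime that objective tolerates. The function names and the request counts are hypothetical, chosen only for illustration.

```python
def success_rate_sli(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        # No traffic means no failures were observed; treat as healthy.
        return 1.0
    return (total_requests - failed_requests) / total_requests


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Return how many minutes of full downtime the SLO tolerates in the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


SLO_TARGET = 0.999  # 99.9%, the example objective from the list above

# Hypothetical counts for one day of traffic.
sli = success_rate_sli(total_requests=1_000_000, failed_requests=700)
print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}")
print(f"30-day error budget: {error_budget_minutes(SLO_TARGET, 30):.1f} minutes")
```

A 99.9% target over 30 days leaves roughly 43.2 minutes of error budget, which is why the choice of target and window matters as much as the indicator itself.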
Visual Representation
[Diagram: Metrics, Logs, Traces; Alerts; Debugging]
Commands / Syntax
# Example Linux checks you might monitor
uptime
free -m
df -h
curl -I http://localhost:8080/health

# Example Kubernetes checks you eventually convert into metrics
kubectl top nodes
kubectl top pods -A
kubectl get events -A --sort-by=.lastTimestamp
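The host checks above print text for a human to read. A monitoring agent turns the same facts into numbers it can store and alert on. Here is a rough sketch, using only the Python standard library, that reads disk usage (comparable to df -h) and, where the platform supports it, the one-minute load average (as reported by uptime). The metric names are made up for illustration and do not follow any particular exporter.

```python
import os
import shutil


def collect_host_metrics(path: str = "/") -> dict[str, float]:
    """Return a few host signals as numeric values keyed by metric name."""
    usage = shutil.disk_usage(path)
    metrics = {
        # Hypothetical metric names, not taken from a real exporter.
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
        "disk_free_bytes": float(usage.free),
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            load_1m, _, _ = os.getloadavg()
            metrics["load_average_1m"] = load_1m
        except OSError:
            pass  # load average unavailable; skip the metric
    return metrics


for name, value in collect_host_metrics().items():
    print(f"{name} {value}")
```

Once a check is a number with a name, it can be sampled on a schedule, graphed, and compared against a threshold, which is the step that separates ad hoc commands from monitoring.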
Example (Real-world Use Case)
An e-commerce team monitors checkout latency, payment failures, pod restarts, node CPU saturation, and database connection pool usage. When cart abandonment rises, the team correlates error rates and latency metrics to confirm the issue is platform-related rather than a business fluctuation.
Hands-on Section
- List three things you would measure for a web app: traffic, latency, and errors.
- Map each to a metric name like http_requests_total, http_request_duration_seconds, and http_requests_errors_total.
- Define a simple SLO: 95% of requests should complete within 300 ms.
- Decide what alert should fire if the SLO is violated for 10 minutes.
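The steps above can be sketched in code. The sample below assumes request durations have already been grouped into one list per minute; that layout, the function names, and the sample data are all hypothetical. In a Prometheus setup this logic would normally live in an alerting rule rather than in application code.

```python
SLO_THRESHOLD_SECONDS = 0.300   # requests should finish within 300 ms
SLO_TARGET_RATIO = 0.95         # ...for at least 95% of requests
VIOLATION_MINUTES = 10          # alert only after 10 consecutive bad minutes


def minute_meets_slo(durations: list[float]) -> bool:
    """True if at least 95% of the minute's requests were fast enough."""
    if not durations:
        return True  # no traffic, nothing violated
    fast = sum(1 for d in durations if d <= SLO_THRESHOLD_SECONDS)
    return fast / len(durations) >= SLO_TARGET_RATIO


def should_alert(per_minute_durations: list[list[float]]) -> bool:
    """True if the most recent VIOLATION_MINUTES minutes all violated the SLO."""
    recent = per_minute_durations[-VIOLATION_MINUTES:]
    if len(recent) < VIOLATION_MINUTES:
        return False  # not enough history yet
    return all(not minute_meets_slo(minute) for minute in recent)


# Hypothetical data: 10 healthy minutes followed by 10 slow minutes.
healthy_minute = [0.120] * 99 + [0.450]        # 99% of requests are fast
slow_minute = [0.120] * 80 + [0.450] * 20      # 80% of requests are fast
history = [healthy_minute] * 10 + [slow_minute] * 10

print(should_alert(history[:15]))  # only 5 slow minutes so far
print(should_alert(history))       # 10 consecutive slow minutes
```

Requiring the violation to persist for the full window is a deliberate trade-off: it delays the alert by up to 10 minutes but avoids paging on a single slow minute.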
Try It Yourself
- Pick one app you know and list five metrics that would actually help an on-call engineer.
- Classify each issue below as monitoring or observability: rising CPU, unknown timeout, sudden request spike.
- Write one SLI and one SLO for a login endpoint.
Debugging Scenarios
Teams often collect lots of data but no useful signal. If dashboards do not map to user impact or SLOs, they become decoration, not operational tooling.
- If alerts are noisy, retune thresholds so they track user impact rather than raw system movement.
- If CPU is high but latency is fine, do not page immediately. CPU on its own lacks context.
- If dashboards show green while customers complain, you are likely missing business-critical metrics.
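The rule about high CPU with healthy latency can be expressed as a small decision function. This is only a sketch: the Signals fields, the thresholds, and the three outcome labels are assumptions for illustration, not values from any standard.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    cpu_utilization: float       # 0.0 to 1.0
    p95_latency_seconds: float
    error_ratio: float           # failed requests / total requests


# Hypothetical thresholds; real values should come from your SLOs.
CPU_HIGH = 0.90
LATENCY_SLO_SECONDS = 0.300
ERROR_RATIO_LIMIT = 0.01


def alert_action(s: Signals) -> str:
    """Decide between paging, opening a ticket, or doing nothing."""
    user_impact = (
        s.p95_latency_seconds > LATENCY_SLO_SECONDS
        or s.error_ratio > ERROR_RATIO_LIMIT
    )
    if user_impact:
        return "page"      # users are affected, so someone should respond now
    if s.cpu_utilization > CPU_HIGH:
        return "ticket"    # worth a look, but not urgent
    return "none"


print(alert_action(Signals(0.95, 0.120, 0.001)))  # high CPU, healthy latency
print(alert_action(Signals(0.40, 0.800, 0.001)))  # low CPU, slow requests
```

The first call yields "ticket" and the second yields "page": user-facing symptoms decide urgency, and the infrastructure metric only adds context.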
Interview Questions
Beginner
Monitoring is the continuous collection and evaluation of system signals so teams can see health, performance, and failures quickly.
Observability is the ability to understand internal system behavior using telemetry such as metrics, logs, and traces, especially for unknown problems.
Metrics, logs, and traces.
An SLI is a measurable indicator of service behavior, for example request success rate or p95 latency.
An SLO is a target value for an SLI, such as 99.9% availability over a month.
Intermediate
Metrics are numeric, compact, and easy to aggregate over time, which makes them efficient and reliable for threshold-based alerting.
Monitoring detects known failure patterns quickly, while observability helps investigate root cause when the failure pattern is not obvious.
An alert is actionable when it represents meaningful user or business impact and provides enough context for the responder to start remediation.
Because raw infrastructure data alone may not reflect whether users are affected. Good dashboards connect platform health to real service outcomes.
Alert fatigue happens when engineers receive too many low-value alerts, causing slower response or ignored incidents.
Scenario-based
Not automatically. First check correlated signals like latency, errors, queue depth, and saturation. High CPU without user impact may be acceptable for a short period.
The monitoring set is incomplete. You may be missing the right latency metric, business transaction metric, or client-side visibility.
I would start with request rate, success rate, latency, dependency errors, and authentication backend response time.
Start with a small set tied to service health and customer impact, then expand carefully after observing real incidents.
Monitoring tells us something is wrong. Observability helps us figure out why quickly so downtime is shorter and less expensive.
Summary
Monitoring gives teams visibility into known behaviors, while observability strengthens investigation of unknown failures. Before learning Prometheus, you need to understand metrics, SLOs, and what good operational signals look like.