Metrics and Data Model
Learn Prometheus metric types, labels, time series, and the cardinality mistakes that hurt production monitoring.
Simple Explanation (ELI5)
A metric is a number that changes over time. Prometheus stores lots of these numbers. Each one has a name and labels that describe where it came from. Together they form a time series, like temperature readings tagged with city and room.
Real-world Analogy
Think of a spreadsheet where every row is a timestamp and every column is a measurement. Now add tags like region, service, and instance so you can filter the spreadsheet later. That is basically the Prometheus data model.
Technical Explanation
Prometheus stores each unique metric name plus label combination as a separate time series. Labels make querying powerful, but too many unique label combinations create high cardinality and memory pressure.
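To make the "one series per unique name plus label set" rule concrete, here is a minimal sketch in plain Go. It does not use the Prometheus client library; `seriesKey`, `countSeries`, and the sample label values are hypothetical names chosen for illustration. It builds a canonical key from a metric name and its sorted labels, then counts the distinct keys.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a series from its metric name
// and labels. Labels are sorted by name so that ordering does not matter.
// Illustrative helper only, not part of any Prometheus API.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// countSeries returns how many distinct series a set of label sets represents.
func countSeries(name string, labelSets []map[string]string) int {
	seen := make(map[string]struct{})
	for _, labels := range labelSets {
		seen[seriesKey(name, labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	labelSets := []map[string]string{
		{"method": "GET", "status": "200"},
		{"status": "200", "method": "GET"}, // same series, different order
		{"method": "POST", "status": "500"},
	}
	fmt.Println(seriesKey("http_requests_total", labelSets[0]))
	fmt.Println(countSeries("http_requests_total", labelSets)) // 2
}
```

Changing any single label value produces a new key, which is why every extra label value adds to the number of series Prometheus has to track.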
| Metric Type | Meaning | Example | Common Use |
|---|---|---|---|
| Counter | Monotonic increase | http_requests_total | Requests, errors, jobs completed |
| Gauge | Value goes up and down | memory_usage_bytes | Memory, queue depth, active sessions |
| Histogram | Bucketed observations | http_request_duration_seconds_bucket | Latency distributions, sizes |
| Summary | Precomputed quantiles | rpc_latency_seconds | Client-side latency, less common in Prometheus-heavy setups |
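The Histogram row is the least intuitive of the four, so the following toy sketch in plain Go shows how bucketed observations can be counted cumulatively, which is how `_bucket` series with an `le` label are exposed. The `histogram` type, its methods, and the bucket bounds are assumptions made for illustration and are not the client library's implementation.

```go
package main

import "fmt"

// histogram is a toy cumulative histogram. counts[i] holds the number of
// observations less than or equal to upperBounds[i]; the final element of
// counts stands for the implicit +Inf bucket.
type histogram struct {
	upperBounds []float64 // sorted ascending
	counts      []uint64  // len(upperBounds)+1 entries
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		counts:      make([]uint64, len(upperBounds)+1),
	}
}

// observe records one value in every bucket whose upper bound covers it.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.counts[len(h.upperBounds)]++ // +Inf bucket counts every observation
}

func main() {
	h := newHistogram([]float64{0.1, 0.3, 1})
	for _, seconds := range []float64{0.05, 0.2, 0.25, 0.7, 2.5} {
		h.observe(seconds)
	}
	fmt.Println(h.counts) // [1 3 4 5]
}
```

Because the counts are cumulative, each bucket always contains at least as many observations as the bucket before it.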
Visual Representation
Metric name: http_requests_total
Labels: method="GET", status="200"
Each unique label set on a metric name is its own time series.
Example time series:
http_requests_total{job="api",instance="10.0.0.12:8080",method="GET",status="200"}
Commands / Syntax
# Typical metrics output
http_requests_total{method="GET",status="200"} 1254
http_requests_total{method="POST",status="500"} 13
memory_usage_bytes{container="checkout"} 8.2470912e+08
http_request_duration_seconds_bucket{le="0.1"} 1023
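http_request_duration_seconds_bucket{le="0.3"} 1950

// Go instrumentation.
// Requires: import "github.com/prometheus/client_golang/prometheus"

// Counter partitioned by HTTP method and status code.
requestCounter := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests",
	},
	[]string{"method", "status"},
)

// Gauge for a value that can rise and fall.
queueDepthGauge := prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "worker_queue_depth",
	Help: "Current queue depth",
})
Example (Real-world Use Case)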
A Kubernetes checkout service exports request totals by status code, latency histogram buckets, pod memory usage, and queue depth. The SRE team uses counters for traffic, gauges for current resource state, and histograms for latency SLO calculations.
Hands-on Section
- Take a sample metric endpoint and identify counters, gauges, and histogram buckets.
- Write down which labels are stable, like service, namespace, and status.
- Mark labels that are dangerous, such as user_id, request_id, or full URL path with unbounded values.
- Explain why those dangerous labels increase cardinality.
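For the last step, a rough worst-case estimate helps: the number of series for one metric is bounded by the product of the distinct values of each label. The sketch below is plain Go; `estimateSeries` and all of the value counts are hypothetical numbers picked for illustration, not measurements.

```go
package main

import "fmt"

// estimateSeries returns the worst-case number of series for one metric:
// the product of the number of distinct values for each label.
func estimateSeries(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// Assumed value counts for stable, bounded labels.
	stable := map[string]int{"service": 20, "namespace": 5, "status": 6}

	// The same metric after adding an unbounded label.
	risky := map[string]int{"service": 20, "namespace": 5, "status": 6, "user_id": 100000}

	fmt.Println(estimateSeries(stable)) // 600
	fmt.Println(estimateSeries(risky))  // 60000000
}
```

Real workloads rarely hit the full product, but the estimate shows how a single unbounded label multiplies everything that came before it.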
Try It Yourself
- Convert these examples into the right type: CPU temperature, total processed jobs, API latency.
- Create a metric name for pod restart count.
- List three labels that are safe for an HTTP request metric.
Debugging Scenarios
If a developer adds user_id or session_id as a label, Prometheus can explode in memory use because every unique value becomes a new time series.
- If queries are slow, inspect whether the metric has too many label combinations.
- If memory usage grows rapidly after a release, compare newly introduced labels and exporters.
- If latency graphs are misleading, check whether you are computing quantiles from histogram buckets rather than averaging raw request durations, since an average hides tail latency.
Interview Questions
Beginner
A time series is a metric identified by its name and full label set, with values collected over time.
A counter only increases or resets to zero on restart. It is used for totals such as requests or errors.
A gauge goes up and down. It is used for current values like memory usage or queue size.
A histogram records observations into buckets and helps calculate latency or size distributions.
Labels are key-value pairs attached to a metric that allow filtering and grouping, such as method, status, or namespace.
Intermediate
Cardinality is the number of unique time series created by label combinations. High cardinality increases memory and query cost.
Values like request IDs or user IDs make poor labels because they create a new time series for almost every event, which overwhelms storage and query performance.
Use a histogram when you need distribution data such as latency buckets or payload sizes, not just a current value.
Counters are usually queried as a rate or increase because the raw total is less useful than the change per second or over time, especially across restarts.
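The restart point can be illustrated with a small sketch in plain Go. It sums the growth between consecutive counter samples and treats any drop as a reset to zero. `increase`, `perSecond`, and the sample values are hypothetical, and the sketch deliberately skips the window extrapolation that PromQL's `rate()` performs.

```go
package main

import "fmt"

// increase returns the total growth of a counter across consecutive samples.
// A drop between two samples is treated as a reset to zero, so the value
// after the reset counts as new growth.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] >= samples[i-1] {
			total += samples[i] - samples[i-1]
		} else {
			total += samples[i] // counter restarted from zero
		}
	}
	return total
}

// perSecond divides the increase by the length of the window in seconds.
func perSecond(samples []float64, windowSeconds float64) float64 {
	return increase(samples) / windowSeconds
}

func main() {
	// The process restarted between the third and fourth sample.
	samples := []float64{100, 160, 220, 30, 90}
	fmt.Println(increase(samples))      // 210
	fmt.Println(perSecond(samples, 60)) // 3.5
}
```

Subtracting the first raw value from the last would give a negative number here, which is why the raw total is a poor thing to graph.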
Use clear, unit-suffixed snake_case names like http_request_duration_seconds or memory_usage_bytes.
Scenario-based
I would reject a label that carries raw, unbounded URL paths and replace it with normalized route labels like /orders/:id or handler names to avoid cardinality explosions.
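One way to produce a normalized route label is sketched below in plain Go. `normalizePath` is a hypothetical helper that only replaces purely numeric path segments with `:id`; it does not handle UUIDs or slugs, and in practice the route template from the HTTP router is usually a better source for this label.

```go
package main

import (
	"fmt"
	"strings"
)

// isNumeric reports whether s is non-empty and contains only ASCII digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// normalizePath replaces numeric path segments with ":id" so that many
// concrete URLs collapse into one bounded label value.
func normalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if isNumeric(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(normalizePath("/orders/123"))         // /orders/:id
	fmt.Println(normalizePath("/orders/123/items/7")) // /orders/:id/items/:id
	fmt.Println(normalizePath("/healthz"))            // /healthz
}
```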
A new metric or label with high cardinality is a common cause. I would inspect new instrumentation first.
I would use histograms for request duration, then query bucket data to calculate p95 latency and latency thresholds.
No, queue depth should not be a counter. It is a current value that can rise and fall, so it should be a gauge.
Labels like cluster, namespace, pod, container, and workload let operators slice metrics by team, environment, and workload identity.
Summary
Prometheus metrics are built from names, labels, and timestamped values. Choosing the right metric type and keeping label cardinality under control are among the biggest differences between healthy and painful monitoring systems.