Beginner · Lesson 3 of 11

Metrics and Data Model

Learn Prometheus metric types, labels, time series, and the cardinality mistakes that hurt production monitoring.

Simple Explanation (ELI5)

A metric is a number that changes over time. Prometheus stores lots of these numbers. Each one has a name and labels that describe where it came from. Together they form a time series, like temperature readings tagged with city and room.

Real-world Analogy

Think of a spreadsheet where every row is a timestamp and every column is a measurement. Now add tags like region, service, and instance so you can filter the spreadsheet later. That is basically the Prometheus data model.

Technical Explanation

Prometheus stores each unique metric name plus label combination as a separate time series. Labels make querying powerful, but too many unique label combinations create high cardinality and memory pressure.

| Metric Type | Meaning | Example | Common Use |
| --- | --- | --- | --- |
| Counter | Monotonic increase | http_requests_total | Requests, errors, jobs completed |
| Gauge | Value goes up and down | memory_usage_bytes | Memory, queue depth, active sessions |
| Histogram | Bucketed observations | http_request_duration_seconds_bucket | Latency distributions, sizes |
| Summary | Precomputed quantiles | rpc_latency_seconds | Client-side latency, less common in Prometheus-heavy setups |

Visual Representation

Metric Name: http_requests_total

Labels: method="GET", status="200"

Series: Unique combination per label set

Example time series:

http_requests_total{job="api",instance="10.0.0.12:8080",method="GET",status="200"}

Commands / Syntax

```text
# Typical metrics output
http_requests_total{method="GET",status="200"} 1254
http_requests_total{method="POST",status="500"} 13
memory_usage_bytes{container="checkout"} 8.2470912e+08
http_request_duration_seconds_bucket{le="0.1"} 1023
http_request_duration_seconds_bucket{le="0.3"} 1950
```
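```go
package metrics // package name is illustrative

import "github.com/prometheus/client_golang/prometheus"

// requestCounter counts HTTP requests, partitioned by method and status code.
var requestCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests",
	},
	[]string{"method", "status"},
)

// queueDepthGauge tracks the current worker queue depth, which can rise and fall.
var queueDepthGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "worker_queue_depth",
	Help: "Current queue depth",
})
```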

Example (Real-world Use Case)

A Kubernetes checkout service exports request totals by status code, latency histogram buckets, pod memory usage, and queue depth. The SRE team uses counters for traffic, gauges for current resource state, and histograms for latency SLO calculations.

Hands-on Section

  1. Take a sample metric endpoint and identify counters, gauges, and histogram buckets.
  2. Write down which labels are stable, like service, namespace, and status.
  3. Mark labels that are dangerous, such as user_id, request_id, or full URL path with unbounded values.
  4. Explain why those dangerous labels increase cardinality.

Debugging Scenarios

Cardinality Trap

If a developer adds user_id or session_id as a label, Prometheus memory use can grow sharply, because every unique label value creates a new time series.

Interview Questions

Beginner

What is a time series in Prometheus?

A time series is a metric identified by its name and full label set, with values collected over time.

What is a counter?

A counter only increases or resets to zero on restart. It is used for totals such as requests or errors.

What is a gauge?

A gauge goes up and down. It is used for current values like memory usage or queue size.

What is a histogram?

A histogram records observations into buckets and helps calculate latency or size distributions.

What are labels?

Labels are key-value pairs attached to a metric that allow filtering and grouping, such as method, status, or namespace.

Intermediate

What is cardinality?

Cardinality is the number of unique time series created by label combinations. High cardinality increases memory and query cost.

Why are unbounded labels dangerous?

Because values like request IDs or user IDs create a new time series for almost every event, which overwhelms storage and query performance.

When would you use a histogram over a gauge?

Use a histogram when you need distribution data such as latency buckets or payload sizes, not just a current value.

Why are counters usually queried with rate-like functions?

Because the raw total is less useful than the change per second or over a time window. Functions such as rate() and increase() also compensate for counter resets when a process restarts.

What is a good naming convention for Prometheus metrics?

Use clear, unit-suffixed snake_case names like http_request_duration_seconds or memory_usage_bytes.

Scenario-based

A developer wants to label metrics with full URL including IDs. What do you say?

I would reject it and replace it with normalized route labels like /orders/:id or handler names to avoid cardinality explosions.

Your Prometheus memory doubled after a deployment. What is a likely cause?

A new metric or label with high cardinality is a common cause. I would inspect new instrumentation first.

How would you monitor API latency for SLOs?

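I would use histograms for request duration, then query the bucket series with histogram_quantile() to estimate p95 latency, and compare a bucket's count against the total to measure the share of requests under a latency threshold.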

A team stores queue depth as a counter. Is that correct?

No. Queue depth is a current value that can rise and fall, so it should be a gauge.

How do labels help in Kubernetes monitoring?

Labels like cluster, namespace, pod, container, and workload let operators slice metrics by team, environment, and workload identity.

Summary

Prometheus time series are built from a metric name, a set of labels, and timestamped samples. Choosing the right metric type and keeping label cardinality under control are among the biggest differences between healthy and painful monitoring systems.