Intermediate · Lesson 6 of 11

Querying (PromQL Basics)

Learn how to ask Prometheus useful questions with selectors, rates, aggregations, filtering, and basic vector matching.

Simple Explanation (ELI5)

PromQL is the language you use to ask Prometheus for answers. You can ask simple questions like “is this target up?” or more useful ones like “what is the 5-minute error rate for my checkout service in production?”

Real-world Analogy

PromQL is like a search plus calculator for operational data. You are not just looking something up; you are filtering, grouping, comparing, and doing math on live system signals.

Technical Explanation

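PromQL works with four value types: instant vectors, range vectors, scalars, and strings. Most operational queries evaluate to an instant vector and are built from label matchers, functions such as rate() (which takes a range vector as input), and aggregation operators such as sum and avg. Counters are generally wrapped in rate() or increase(), because their raw value only grows until the process restarts. Gauges can often be read directly or aggregated.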
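As a rough illustration of the counter versus gauge distinction, the queries below reuse metric names that appear later in this lesson. The job="api" label is an assumption about how the targets are labelled.

```promql
# Counter: the raw value only grows, so ask for its per-second rate over a window.
rate(http_requests_total{job="api"}[5m])

# Counter: total growth over the last hour (extrapolated, so it may not be a whole number).
increase(http_requests_total{job="api"}[1h])

# Gauge: the current value is meaningful on its own.
container_memory_working_set_bytes{container!=""}

# Gauge: smooth a noisy value by averaging its samples over the last 10 minutes.
avg_over_time(container_memory_working_set_bytes{container!=""}[10m])
```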

Query Pattern | Purpose | Example
Selector | Pick series | up{job="api"}
Range function | Look over time | rate(http_requests_total[5m])
Aggregation | Combine series | sum by (namespace)(rate(container_cpu_usage_seconds_total[5m]))
Filter | Include specific labels | status=~"5.."
Binary op | Do math between vectors | used / total
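Binary operations between two vectors depend on vector matching: by default, a left-hand series is paired with the right-hand series that has exactly the same label set. The sketch below shows the default case and one many-to-one case. It assumes node_exporter filesystem metrics are available and that http_requests_total carries service and status labels, neither of which is guaranteed in a given setup.

```promql
# One-to-one match: both metrics carry the same labels, so no modifier is needed.
# Result: fraction of each filesystem that is still available.
node_filesystem_avail_bytes / node_filesystem_size_bytes

# Many-to-one match: the left side has one series per (service, status),
# the right side has one series per service. ignoring(status) drops the extra
# label from the match, and group_left allows several left-hand series to pair
# with a single right-hand series.
sum by (service, status) (rate(http_requests_total{status=~"5.."}[5m]))
/ ignoring (status) group_left
sum by (service) (rate(http_requests_total[5m]))
```

The second query returns, for each service and 5xx status code, that code's share of the service's total request rate.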

Visual Representation

Step 1, Select: http_requests_total{job="api"}

Step 2, Transform: rate(...[5m])

Step 3, Aggregate: sum by (service)
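Put together, the three stages form a single expression that is read from the inside out. This is a minimal sketch that assumes the series carry a service label.

```promql
# Select the API's request counter, convert it to a per-second rate over
# 5 minutes, then add up the rates of all series that share a service label.
sum by (service) (
  rate(http_requests_total{job="api"}[5m])
)
```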

Commands / Syntax

promql
# Is target healthy?
up

# Request rate over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Pod CPU by namespace in Kubernetes
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m]))

# Memory usage by pod
sum by (pod) (container_memory_working_set_bytes{container!=""})

Example (Real-world Use Case)

During an incident, an SRE checks whether API traffic dropped, whether 5xx errors increased, and whether Kubernetes memory usage is concentrated in one namespace. PromQL answers each of those quickly with one query per signal.
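The three checks in this scenario might look like the queries below. The job="api" label, the one-hour comparison window, and the top-5 cutoff are illustrative assumptions rather than recommended values.

```promql
# Did traffic drop? Ratio of the current request rate to the rate one hour ago.
# A value well below 1 suggests a drop.
sum(rate(http_requests_total{job="api"}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m] offset 1h))

# Did 5xx errors increase? Per-second rate of 5xx responses.
sum(rate(http_requests_total{job="api", status=~"5.."}[5m]))

# Is memory concentrated in one namespace? The five largest namespaces by working set.
topk(5, sum by (namespace) (container_memory_working_set_bytes{container!=""}))
```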

Hands-on Section

  1. Run up and identify healthy targets.
  2. Run rate(http_requests_total[5m]) on a service metric.
  3. Filter only 500-level responses with status=~"5..".
  4. Aggregate pod CPU by namespace using a sum by query.

Interview Questions

Beginner

What is PromQL?

PromQL is the query language used to select, aggregate, and transform Prometheus metrics.

What does the up metric show?

It shows whether a target was successfully scraped, where 1 means up and 0 means down.

Why use rate() with counters?

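Because a counter's raw value only grows (and drops back to zero when the process restarts), the absolute number says little on its own. rate() converts that growth into an average per-second rate over the window and compensates for counter resets.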

What does sum by (namespace) do?

It aggregates matching series and groups the results by namespace label.

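How do you filter only 5xx errors?

Use a regex label matcher such as {status=~"5.."}, which matches any three-character status value that starts with 5. To match only status 500, use the equality matcher {status="500"}.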

Intermediate

What is the difference between instant vectors and range vectors?

An instant vector is a set of series with one sample each, all at the same evaluation timestamp. A range vector is a set of series where each series carries every sample from a time window, such as the last 5 minutes.
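The difference shows up directly in the syntax. In this short sketch the job="api" label is an assumption.

```promql
# Instant vector: one sample per matching series at evaluation time.
http_requests_total{job="api"}

# Range vector: every sample from the last 5 minutes for each matching series.
# It can be inspected in the expression browser's table view but cannot be graphed.
http_requests_total{job="api"}[5m]

# A function such as rate() consumes a range vector and returns an instant vector.
rate(http_requests_total{job="api"}[5m])
```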

When would you use increase() instead of rate()?

Use increase() when you want the total increase during a window, not the per-second rate.
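For example, to count how many 5xx responses were served in the last hour rather than how fast they arrive (a sketch that assumes a status label on the metric):

```promql
# Total number of 5xx responses over the last hour, per series.
increase(http_requests_total{status=~"5.."}[1h])

# The same quantity expressed through rate(): per-second rate times the window in seconds.
rate(http_requests_total{status=~"5.."}[1h]) * 3600
```

Both forms extrapolate to the edges of the window, so the result can be a non-integer even though the underlying counter only counts whole requests.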

Why can label filters break queries unexpectedly?

If label names or values change between versions or exporters, the selector may stop matching data entirely.

How would you calculate CPU usage for Kubernetes pods?

Usually with a rate on container_cpu_usage_seconds_total and an aggregation by pod, namespace, or workload.

Why are recording rules useful for PromQL?

They store frequently used or expensive expressions so dashboards and alerts run faster and more consistently.

Scenario-based

A query returns nothing after a deployment. What do you check first?

I check the raw metric in the expression browser first, then verify the metric name, the label filters, and scrape health. Only after those checks do I change the query itself.
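One possible order for those checks, written as queries. The job="api" label is an assumed value and should be replaced with the real job name.

```promql
# 1. Is the target still being scraped? 1 means the last scrape succeeded.
up{job="api"}

# 2. Does the metric still exist at all, and under which jobs?
count by (job) (http_requests_total)

# 3. Which values does the status label take now? A renamed or reformatted
#    label is a common reason for a selector to stop matching.
count by (status) (http_requests_total{job="api"})

# 4. absent() returns 1 when the selector matches nothing, and no result otherwise.
absent(http_requests_total{job="api"})
```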

How would you detect a CPU spike in production pods?

I would use a rate query on container CPU by namespace or workload and compare it to a normal baseline or to the configured resource limits.
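A sketch of both comparisons follows. It assumes kube-state-metrics is installed and exposes kube_pod_container_resource_limits with a resource="cpu" label. The 0.9 and 2x thresholds are arbitrary illustrations.

```promql
# Pods using more than 90% of their CPU limit.
# Both sides are aggregated to (namespace, pod), so they match one-to-one.
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
> 0.9

# Namespaces whose CPU usage is more than double what it was a day ago.
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m] offset 1d))
> 2
```

Pods without a CPU limit have no series on the right-hand side and therefore drop out of the first query's result.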

How do you compute error percentage for one service?

Divide the 5xx request rate by the total request rate for that service and multiply by 100.
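As a query, with service="checkout" standing in as a hypothetical label value:

```promql
# Percentage of checkout requests that returned a 5xx status over the last 5 minutes.
100 * (
  sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="checkout"}[5m]))
)
```

If there are no 5xx series in the window, the numerator is empty and the query returns no data rather than 0.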

Your memory chart shows one line per pod but management wants one line per namespace. What do you change?

I aggregate with sum by (namespace) and keep the namespace label instead of pod.

Why might a PromQL query be correct but still misleading?

Because the metric or labels may not represent the intended reality, for example averaging latency incorrectly or double-counting replicas.
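Latency is a common case. The sketch below uses a hypothetical http_request_duration_seconds metric, exposed as a summary in the first query and as a histogram in the second.

```promql
# Misleading: averaging quantiles that each instance computed on its own.
# The average of several 95th percentiles is not the overall 95th percentile.
avg(http_request_duration_seconds{quantile="0.95"})

# More defensible: sum the histogram bucket rates across instances first,
# keeping the le label, then estimate the quantile from the combined buckets.
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```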

Summary

PromQL turns Prometheus from a raw storage engine into an operational decision tool. Once you can select, filter, rate, and aggregate correctly, you can build dashboards, alerts, and incident analysis workflows with confidence.