Querying (PromQL Basics)
Learn how to ask Prometheus useful questions with selectors, rates, aggregations, filtering, and basic vector matching.
Simple Explanation (ELI5)
PromQL is the language you use to ask Prometheus for answers. You can ask simple questions like “is this target up?” or more useful ones like “what is the 5-minute error rate for my checkout service in production?”
Real-world Analogy
PromQL is like a search plus calculator for operational data. You are not just looking something up; you are filtering, grouping, comparing, and doing math on live system signals.
Technical Explanation
PromQL works with four data types: instant vectors, range vectors, scalars, and strings. Most operational queries combine instant vector selectors and label matchers with functions such as rate() and aggregation operators such as sum and avg. Counters are generally wrapped in rate() or increase(), both of which take a range vector such as http_requests_total[5m]. Gauges can often be read directly or aggregated.
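The difference between these types is easier to see side by side. The sketch below reuses metric names that appear elsewhere in this guide; the job="api" label value is only an example and may not match your setup.

```promql
# Instant vector: one sample per matching series at the evaluation time.
up{job="api"}

# Range vector: every sample from the last 5 minutes for each series.
# It cannot be graphed on its own; it is meant to be passed to a function.
http_requests_total{job="api"}[5m]

# Counter: rate() turns the range vector back into an instant vector
# holding the per-second average increase over the window.
rate(http_requests_total{job="api"}[5m])

# Gauge: read the current value directly, or aggregate across series.
container_memory_working_set_bytes{container!=""}
avg by (namespace) (container_memory_working_set_bytes{container!=""})
```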
| Query Pattern | Purpose | Example |
|---|---|---|
| Selector | Pick series | up{job="api"} |
| Range function | Look over time | rate(http_requests_total[5m]) |
| Aggregation | Combine series | sum by (namespace)(rate(container_cpu_usage_seconds_total[5m])) |
| Filter | Keep only series whose labels match | http_requests_total{status=~"5.."} |
| Binary op | Do math between vectors | used / total |
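For the binary operation row, it helps to see how Prometheus pairs series on each side of the operator. This is a minimal sketch that assumes http_requests_total carries service and status labels; adjust the label names to whatever your exporter actually exposes.

```promql
# One-to-one matching: both sides are aggregated down to the same label
# set (service), so each service on the left pairs with itself on the right.
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

# Many-to-one matching: the left side keeps one series per status code,
# the right side has one series per service. ignoring(status) drops the
# extra label when pairing, and group_left allows many left-hand series
# to match a single right-hand series.
sum by (service, status) (rate(http_requests_total{status=~"5.."}[5m]))
/ ignoring (status) group_left
sum by (service) (rate(http_requests_total[5m]))
```

The first query returns one error ratio per service. The second returns the share of traffic for each individual 5xx status code within a service.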
Visual Representation
http_requests_total{job="api"}
  -> rate(...[5m])
  -> sum by (service)
Commands / Syntax
# Is target healthy?
up
# Request rate over 5 minutes
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Pod CPU by namespace in Kubernetes
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m]))
# Memory usage by pod
sum by (pod) (container_memory_working_set_bytes{container!=""})
Example (Real-world Use Case)
During an incident, an SRE checks whether API traffic dropped, whether 5xx errors increased, and whether Kubernetes memory usage is concentrated in one namespace. PromQL answers each of those quickly with one query per signal.
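The three checks in that scenario could look roughly like this. The job="api" selector, the one-hour comparison window, and the choice of the top five namespaces are illustrative assumptions, not fixed conventions.

```promql
# 1. Did API traffic drop? Ratio of the current request rate to the rate
#    one hour earlier. A value well below 1 suggests a drop.
sum(rate(http_requests_total{job="api"}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m] offset 1h))

# 2. Did 5xx errors increase? Per-second rate of 5xx responses.
sum(rate(http_requests_total{job="api", status=~"5.."}[5m]))

# 3. Is memory concentrated in one namespace? The five namespaces with
#    the largest working set.
topk(5, sum by (namespace) (container_memory_working_set_bytes{container!=""}))
```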
Hands-on Section
- Run up and identify healthy targets.
- Run rate(http_requests_total[5m]) on a service metric.
- Filter only 500-level responses with status=~"5..".
- Aggregate pod CPU by namespace using a sum by query.
Try It Yourself
- Write a query for total restarts by pod.
- Write a memory usage query for one namespace.
- Explain why rate() is usually used with counters.
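If you get stuck, here is one possible set of answers to compare against. It assumes kube-state-metrics is installed (it provides kube_pod_container_status_restarts_total) and uses a hypothetical namespace called production.

```promql
# Total restarts by pod (cumulative since the counter started).
sum by (namespace, pod) (kube_pod_container_status_restarts_total)

# Restarts by pod during the last hour only.
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))

# Memory usage for one namespace ("production" is a placeholder).
sum(container_memory_working_set_bytes{namespace="production", container!=""})

# Why rate()? Compare the two in the graph view: the raw counter only
# climbs, while the rate shows how fast it is climbing right now.
http_requests_total
rate(http_requests_total[5m])
```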
Debugging Scenarios
- If a query returns no data, verify the metric name and label values from the expression browser.
- If a counter graph only ever climbs, with occasional drops to zero after restarts, you are probably graphing the raw total instead of a rate.
- If Kubernetes CPU looks duplicated, verify whether multiple scrape paths or container filters are double-counting.
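A few helper queries can speed up these checks. They are a sketch that assumes the job="api" label from the earlier examples; substitute your own selectors.

```promql
# No data? List the metric names that actually exist for this job.
count by (__name__) ({job="api"})

# Check which values a label really takes before filtering on it.
count by (status) (http_requests_total{job="api"})

# Confirm the targets are being scraped; this returns only failing targets.
up{job="api"} == 0

# Possible double-counting? Any result here means more than one CPU
# series exists for the same container, for example from two scrape paths.
count by (namespace, pod, container) (container_cpu_usage_seconds_total{container!=""}) > 1
```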
Interview Questions
Beginner
What is PromQL? PromQL is the query language used to select, aggregate, and transform Prometheus metrics.
What does the up metric show? It shows whether a target was successfully scraped, where 1 means up and 0 means down.
Why use rate() with counters? Because counters only increase (apart from resets), and rate() converts that growth into a per-second average over the window.
What does sum by (namespace) do? It aggregates matching series and groups the results by the namespace label.
How do you select only 5xx responses? Use a label matcher such as {status=~"5.."}.
Intermediate
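What is the difference between an instant vector and a range vector? An instant vector is a set of series with one sample each at a single point in time. A range vector is a set of series with all their samples over a time window, such as 5 minutes.
When would you use increase() instead of rate()? Use increase() when you want the total increase during a window, not the per-second rate.
Why might a query that used to work stop returning data? If label names or values change between versions or exporters, the selector may stop matching data entirely.
How do you measure container CPU usage? Usually with a rate on container_cpu_usage_seconds_total and an aggregation by pod, namespace, or workload.
What are recording rules for? They precompute frequently used or expensive expressions and store the results as new time series, so dashboards and alerts run faster and more consistently.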
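Two of the answers above are easier to remember with the queries in front of you. This sketch reuses the metric names from this guide; the job="api" label is an assumption.

```promql
# rate(): per-second average increase over the last 5 minutes.
rate(http_requests_total{job="api"}[5m])

# increase(): total growth over the same window. It is rate() multiplied
# by the window length in seconds, so the result can be fractional.
increase(http_requests_total{job="api"}[5m])

# The same value written out by hand (5 minutes = 300 seconds).
rate(http_requests_total{job="api"}[5m]) * 300

# Container CPU: the counter is in CPU-seconds, so its rate is the
# number of CPU cores in use, here grouped per pod.
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```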
Scenario-based
A query returns no data. What do you do? I check the raw metric in the expression browser, then verify the metric name, label filters, and scrape health before rewriting the query.
How would you investigate high CPU usage in a cluster? I would use a rate query on container CPU by namespace or workload and compare it to the normal baseline or to resource limits.
How do you calculate the error percentage for a service? Divide the 5xx request rate by the total request rate for that service and multiply by 100.
How do you report usage per namespace instead of per pod? I aggregate with sum by (namespace) and keep the namespace label instead of pod.
Why can a query that returns data still be misleading? Because the metric or labels may not represent the intended reality, for example averaging latency incorrectly or double-counting replicas.
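The last answer is worth a concrete illustration. The sketch below shows the same averaging pitfall using the error ratio from this guide rather than latency; it assumes the standard instance label is present on http_requests_total.

```promql
# Misleading: average of per-instance error ratios. An instance serving
# a handful of requests weighs as much as one serving thousands, and
# instances with no 5xx series at all drop out of the average entirely.
avg(
  sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum by (instance) (rate(http_requests_total[5m]))
)

# Usually closer to the intent: ratio of the summed rates, which weights
# every request equally regardless of which instance served it.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

Both queries are valid PromQL and both return a number, which is exactly why this kind of mistake is easy to miss.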
Summary
PromQL turns Prometheus from a raw storage engine into an operational decision tool. Once you can select, filter, rate, and aggregate correctly, you can build dashboards, alerts, and incident analysis workflows with confidence.