Prometheus Architecture
Understand how Prometheus server, TSDB, exporters, discovery, rules, and Alertmanager work together.
Simple Explanation (ELI5)
Prometheus is not one big magic box. It is a collector, a database, a query engine, and an alert evaluator working together. It visits targets, stores numbers, answers questions, and raises alerts.
Real-world Analogy
Think of a newsroom. Reporters gather facts, editors organize them, analysts interpret trends, and the chief editor decides when to publish a breaking alert. Prometheus does all of those jobs for metrics.
Technical Explanation
The Prometheus server periodically scrapes targets and writes samples into its local time-series database. Service discovery keeps the target list updated. Recording rules precompute expensive queries. Alerting rules evaluate conditions and push alert events to Alertmanager. Exporters bridge systems that do not expose native Prometheus metrics.
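As a minimal sketch of the two rule types described above, a rule file could look like the following. The metric name `http_requests_total`, its `status` label, the group name, and the 5% threshold are illustrative assumptions, not values from a real deployment.

```yaml
groups:
  - name: example-rules            # hypothetical group name
    interval: 30s                  # how often this group is evaluated
    rules:
      # Recording rule: precompute a per-job request rate so that
      # dashboards can read one cheap series instead of re-aggregating.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Alerting rule: fires when the 5xx ratio stays above 5% for 10 minutes.
      # Prometheus only evaluates the condition; Alertmanager handles delivery.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx ratio for job {{ $labels.job }}"
```

A file like this would be picked up through the `rule_files` glob in the main configuration.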
| Component | Role | Operational Note |
|---|---|---|
| Prometheus Server | Scrape, store, query, evaluate rules | Main control plane for metrics |
| TSDB | Local storage engine | Retention and disk planning matter |
| Exporters | Expose metrics for non-native systems | Node exporter and kube-state-metrics are common |
| Service Discovery | Finds changing targets | Critical in Kubernetes and cloud platforms |
| Alertmanager | Routes and deduplicates alerts | Keeps notification logic out of Prometheus |
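To make the exporter and service discovery rows concrete, here is a hedged sketch of a scrape configuration with one static exporter target and one Kubernetes-discovered job. The host name `node-exporter:9100`, the job names, and the `prometheus.io/scrape` annotation convention are assumptions chosen for illustration.

```yaml
global:
  scrape_interval: 30s       # how often targets are scraped
  evaluation_interval: 30s   # how often rules are evaluated

scrape_configs:
  # Static target: a node exporter at a fixed (assumed) address.
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]

  # Dynamic targets: pods discovered through the Kubernetes API.
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in through an annotation (a common convention,
      # assumed here; it is not built into Prometheus).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy discovery metadata into labels stored with each series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The `__meta_kubernetes_*` labels exist only during relabeling, so anything that should survive into the TSDB has to be copied into a regular label as shown.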
Visual Representation
[Diagram: Scrape + TSDB + Rules]
Commands / Syntax
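rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
storage:
  tsdb:
    # Retention is traditionally set with the --storage.tsdb.retention.time=15d
    # command-line flag; confirm that your Prometheus version accepts this key
    # in the config file before relying on it.
    retention.time: 15d

# Check readiness and config
curl http://localhost:9090/-/ready
curl http://localhost:9090/api/v1/status/config
curl http://localhost:9090/api/v1/rules
curl http://localhost:9090/api/v1/targets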
Example (Real-world Use Case)
A Kubernetes platform runs Prometheus with kube-state-metrics, node exporter, and application-specific metrics. Recording rules precompute CPU and error-rate aggregates for fast dashboards. Alertmanager routes production alerts to PagerDuty and lower-severity alerts to Slack.
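The routing split in this use case could be sketched in an Alertmanager configuration roughly as follows. The `severity` and `environment` label names, the receiver names, the channel, and both credentials are placeholders assumed for illustration.

```yaml
route:
  receiver: slack-default              # fallback for anything not matched below
  group_by: ["alertname", "namespace"]
  routes:
    # Critical production alerts page the on-call engineer.
    - matchers:
        - severity="critical"
        - environment="production"
      receiver: pagerduty-prod

receivers:
  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder, not a real key

  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"   # placeholder webhook
        channel: "#alerts"
```

Because this policy lives in Alertmanager, changing who gets notified does not require touching the alerting rules in Prometheus.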
Hands-on Section
- Draw the path from an instrumented app to Grafana and Alertmanager.
- List which component discovers targets, which stores data, and which notifies teams.
- Check the Prometheus /targets, /rules, and /config APIs.
- Identify one place where Kubernetes-specific metadata enters the architecture.
Try It Yourself
- Explain why Prometheus does not send alerts directly to Slack in most setups.
- Estimate what happens when disk fills on the Prometheus server.
- Name two exporters you would deploy for Linux and Kubernetes.
Debugging Scenarios
- If rules do not fire, check whether rule files are loaded and syntactically valid.
- If TSDB grows too quickly, inspect retention settings and high-cardinality metrics.
- If Kubernetes targets are missing, inspect service discovery output and RBAC.
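For the missing-targets scenario, it can help to compare the cluster's RBAC against a known-good baseline. The sketch below is a typical read-only ClusterRole for Kubernetes service discovery. The role name is hypothetical, and it assumes Prometheus runs under a ServiceAccount bound to this role through a ClusterRoleBinding (not shown).

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-discovery   # hypothetical name
rules:
  # Read access to the objects that kubernetes_sd_configs watches.
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  # Allow scraping the API server's own /metrics endpoint.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
```

If `list` or `watch` is missing for a resource, discovery for that role typically returns no targets rather than a visible scrape failure, which is why RBAC is worth checking early.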
Interview Questions
Beginner
The core components are the Prometheus server, its time-series database, exporters, service discovery, rules, and Alertmanager.
The Prometheus server scrapes targets, stores samples, runs queries, and evaluates recording and alerting rules.
TSDB stands for time-series database, the local storage engine used by Prometheus.
Exporters expose metrics for systems that do not natively expose the Prometheus format.
Alertmanager receives alerts from Prometheus and handles grouping, deduplication, silencing, and routing.
Intermediate
Local storage keeps scrape ingestion and query latency fast and reduces external dependencies for alert evaluation.
Recording rules precompute expensive PromQL expressions into new metrics for faster dashboards and alerts.
Kubernetes service discovery automatically tracks pods, services, endpoints, and other objects so Prometheus can scrape dynamic workloads.
When the disk fills, scrape ingestion and reliable storage are affected. Queries and alert accuracy degrade, so disk capacity is operationally critical.
Keeping Alertmanager separate lets Prometheus focus on signal generation while Alertmanager handles notification policy, silencing, and deduplication.
Scenario-based
I check whether the rule file is mounted, syntactically valid, and loaded by Prometheus, then confirm the expression returns data.
I review retention and cardinality first, then size persistent storage correctly and consider remote write for long-term needs.
I start with service discovery and scrape configuration: I check the ServiceMonitor, PodMonitor, labels, namespace selectors, and RBAC.
Dashboards may use recording rules, while ad hoc queries hit raw high-cardinality data and expensive computations.
Usually no. Keeping them separate improves failure isolation, scaling, and notification resilience.
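The scenario answer about ServiceMonitor, labels, and namespace selectors refers to Prometheus Operator resources. As a hedged sketch, assuming the Prometheus Operator is installed, a ServiceMonitor has roughly this shape. All names, namespaces, and labels here are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: prometheus      # assumed to match the Prometheus resource's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["production"]   # namespaces searched for Services
  selector:
    matchLabels:
      app: example-app           # must match labels on the Service itself
  endpoints:
    - port: metrics              # name of the Service port, not a port number
      interval: 30s
```

Each of the three selectors is a separate place where a mismatch silently produces zero targets, so they are worth checking one by one.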
Summary
Prometheus architecture is straightforward but opinionated: scrape locally, store locally, evaluate rules locally, and route alerts separately. Once this architecture is clear, scrape mechanics make much more sense.