Intermediate · Lesson 4 of 11

Prometheus Architecture

Understand how Prometheus server, TSDB, exporters, discovery, rules, and Alertmanager work together.

Simple Explanation (ELI5)

Prometheus is not one big magic box. It is a collector, a database, a query engine, and an alert evaluator working together. It visits targets, stores numbers, answers questions, and raises alerts.

Real-world Analogy

Think of a newsroom. Reporters gather facts, editors organize them, analysts interpret trends, and the chief editor decides when to publish a breaking alert. Prometheus does all of those jobs for metrics.

Technical Explanation

The Prometheus server periodically scrapes targets (it pulls metrics from them over HTTP) and writes the samples into its local time-series database (TSDB). Service discovery keeps the target list up to date as instances come and go. Recording rules precompute expensive queries and save the results as new time series. Alerting rules evaluate conditions on a schedule and send firing alerts to Alertmanager. Exporters bridge systems that do not expose native Prometheus metrics.
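To make recording and alerting rules concrete, here is a minimal sketch of a rule file. It assumes the file lives under /etc/prometheus/rules/ so a `rule_files` glob like the one in this lesson's config loads it. The metric `http_requests_total`, its `status` label, the file name and the rule names are hypothetical; substitute what your application actually exports.

```yaml
# /etc/prometheus/rules/example.yml (hypothetical file name)
groups:
  - name: example-http            # groups run on the global evaluation_interval unless they set their own
    rules:
      # Recording rule: compute the per-job 5xx error ratio once per evaluation,
      # so dashboards and alerts read a cheap, ready-made series.
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: fires when the recorded ratio stays above 5% for 10 minutes.
      # Prometheus then sends the firing alert to Alertmanager.
      - alert: HighErrorRatio
        expr: job:http_errors:ratio_rate5m > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} 5xx ratio above 5% for 10 minutes"
```

Running `promtool check rules` on the file before reloading Prometheus catches syntax errors early.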

| Component | Role | Operational Note |
|---|---|---|
| Prometheus Server | Scrape, store, query, evaluate rules | Main control plane for metrics |
| TSDB | Local storage engine | Retention and disk planning matter |
| Exporters | Expose metrics for non-native systems | Node exporter and kube-state-metrics are common |
| Service Discovery | Finds changing targets | Critical in Kubernetes and cloud platforms |
| Alertmanager | Routes and deduplicates alerts | Keeps notification logic out of Prometheus |
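The exporters row can be sketched as a scrape config that points Prometheus at two common exporters by static address. The hostnames `node-exporter` and `kube-state-metrics` are hypothetical; 9100 and 8080 are the usual default metrics ports for node exporter and kube-state-metrics, but check your deployment.

```yaml
# prometheus.yml (excerpt): scrape two exporters at fixed addresses
scrape_configs:
  - job_name: node                         # host metrics: CPU, memory, disk, network
    static_configs:
      - targets: ["node-exporter:9100"]    # hypothetical hostname, default node exporter port
  - job_name: kube-state-metrics           # state of Kubernetes objects (deployments, pods, ...)
    static_configs:
      - targets: ["kube-state-metrics:8080"]  # hypothetical service name, default metrics port
```

Static addresses keep the example short; in Kubernetes these targets are usually found through service discovery instead, because pod IPs change.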

Visual Representation

Targets / Exporters → Prometheus Server (Scrape + TSDB + Rules) → Alertmanager / Grafana

Commands / Syntax

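```yaml
# prometheus.yml (excerpt)
rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# Retention is typically set with a server flag rather than in this file, e.g.
#   prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.retention.time=15d
```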
```bash
# Check readiness, then inspect the loaded config, rules, and scrape targets
curl http://localhost:9090/-/ready
curl http://localhost:9090/api/v1/status/config
curl http://localhost:9090/api/v1/rules
curl http://localhost:9090/api/v1/targets
```
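The config excerpt above leaves out timing. A sketch of the `global` block that controls how often the server scrapes and evaluates rules is shown here; the 30s values are illustrative choices, not recommendations.

```yaml
# prometheus.yml (excerpt): scrape and rule-evaluation timing
global:
  scrape_interval: 30s       # how often each target is scraped (default 1m)
  scrape_timeout: 10s        # per-scrape timeout; must not exceed scrape_interval (default 10s)
  evaluation_interval: 30s   # how often recording and alerting rules run (default 1m)
```

Keeping `evaluation_interval` close to `scrape_interval` means rules see fresh samples on each run without evaluating far more often than new data arrives.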

Example (Real-world Use Case)

A Kubernetes platform runs Prometheus to scrape kube-state-metrics, node exporter, and application-specific metrics. Recording rules precompute CPU and error-rate aggregates for fast dashboards. Alertmanager routes production alerts to PagerDuty and lower-severity alerts to Slack.
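The PagerDuty/Slack split in this example could look roughly like the following Alertmanager routing sketch. The receiver names, the `env` and `cluster` labels, the Slack channel and both credentials are hypothetical placeholders; the `env` label is assumed to be attached by your alerting rules.

```yaml
# alertmanager.yml (sketch): critical production alerts page, everything else goes to Slack
route:
  receiver: slack-default              # fallback for alerts that match no child route
  group_by: ["alertname", "cluster"]   # batch related alerts into one notification
  group_wait: 30s                      # wait briefly to collect a group before the first send
  group_interval: 5m                   # delay between updates for an existing group
  repeat_interval: 4h                  # re-notify if the alert is still firing
  routes:
    - receiver: pagerduty-prod
      matchers:                        # all matchers must match (logical AND)
        - severity="critical"
        - env="production"             # hypothetical label set by alerting rules
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"  # placeholder webhook URL
        channel: "#alerts"                                       # hypothetical channel
  - name: pagerduty-prod
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_INTEGRATION_KEY"              # placeholder Events v2 key
```

Because routing lives here and not in Prometheus, changing who gets paged never requires touching the alert expressions.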

Hands-on Section

  1. Draw the path from an instrumented app to Grafana and Alertmanager.
  2. List which component discovers targets, which stores data, and which notifies teams.
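  3. Check the /targets, /rules, and /config pages in the Prometheus web UI, or the matching /api/v1/targets, /api/v1/rules, and /api/v1/status/config API endpoints.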
  4. Identify one place where Kubernetes-specific metadata enters the architecture.
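One common answer to step 4 is relabeling during Kubernetes service discovery. The sketch below assumes plain `kubernetes_sd_configs` (no Prometheus Operator) and the widely used, but not built-in, `prometheus.io/scrape` pod annotation convention; the job name is hypothetical.

```yaml
# prometheus.yml (excerpt): discover pods through the Kubernetes API
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                      # each declared container port becomes a target
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true" (a convention, not a default).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Kubernetes metadata enters here: copy discovery labels onto every scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The `__meta_kubernetes_*` labels exist only during relabeling and are dropped afterwards, so anything you want to query later has to be copied into a regular label as shown.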

Interview Questions

Beginner

What are the main parts of Prometheus architecture?

Prometheus server, time-series database, exporters, service discovery, rules, and Alertmanager.

What does the Prometheus server do?

It scrapes targets, stores samples, runs queries, and evaluates recording and alerting rules.

What is TSDB?

TSDB stands for time-series database, the local storage engine used by Prometheus.

Why are exporters needed?

Exporters expose metrics for systems that do not natively speak Prometheus format.

What is Alertmanager’s purpose?

It receives alerts from Prometheus and handles grouping, deduplication, silencing, and routing.

Intermediate

Why does Prometheus usually use local storage instead of a remote DB for core operation?

Local storage keeps scrape ingestion and query latency fast and reduces external dependencies for alert evaluation.

What are recording rules?

Recording rules precompute expensive PromQL expressions into new metrics for faster dashboards and alerts.

How does service discovery help in Kubernetes?

It automatically tracks pods, services, endpoints, and other objects so Prometheus can scrape dynamic workloads.

What happens if Prometheus cannot write to disk?

New samples can no longer be written to the write-ahead log or persisted as blocks, so scraped data is lost or the server fails. Queries and alert rules then run against gaps, which is why disk capacity is operationally critical.

Why decouple alerts from notification routing?

It keeps Prometheus focused on signal generation while Alertmanager handles notification policy, silencing, and deduplication cleanly.

Scenario-based

Rules file changed but new alerts never appear. What do you inspect?

I check whether the rule file is mounted, matches the rule_files glob, passes promtool check rules, and appears in /api/v1/rules after a config reload. Then I confirm the expression actually returns data.

Your Prometheus pod restarts due to storage pressure. What architectural fix do you consider?

I review retention and cardinality first, then size persistent storage correctly and consider remote write for long-term needs.
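For the remote write option, a minimal sketch is shown below. Local retention stays short while samples are also streamed to a long-term store. The endpoint URL and the `debug_.*` metric prefix are hypothetical.

```yaml
# prometheus.yml (excerpt): ship samples to long-term storage
remote_write:
  - url: "https://metrics-store.example.com/api/v1/write"  # hypothetical endpoint
    write_relabel_configs:
      # Optional: do not ship series you will never need long-term (hypothetical prefix).
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop
```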

A new app appears in Kubernetes but Prometheus does not scrape it. Which architecture area is responsible?

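Service discovery and scrape configuration. If the Prometheus Operator is in use, I check the ServiceMonitor or PodMonitor, its label selectors, the namespace selectors on both the monitor and the Prometheus resource, and the RBAC permissions Prometheus needs to list and watch the relevant objects.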
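For the Prometheus Operator case, a ServiceMonitor along these lines is what the checklist above would inspect. All names and labels are hypothetical. The `release: prometheus` label is only an example of a label that the Prometheus resource's `serviceMonitorSelector` might require.

```yaml
# ServiceMonitor (Prometheus Operator CRD): tells an Operator-managed Prometheus what to scrape
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # hypothetical
  namespace: my-app
  labels:
    release: prometheus         # hypothetical; must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app               # must match the labels on the target Service
  namespaceSelector:
    matchNames: ["my-app"]
  endpoints:
    - port: metrics             # the Service port *name*, not its number
      path: /metrics
      interval: 30s
```

A mismatch at any of the three selector layers (monitor label, Service labels, namespace) silently yields no targets, which matches the symptom in this scenario.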

Why might dashboards be fast but ad hoc PromQL queries slow?

Dashboards may use recording rules, while ad hoc queries hit raw high-cardinality data and expensive computations.

Would you colocate Alertmanager in the same pod as Prometheus?

Usually no. Keeping them separate improves failure isolation, scaling, and notification resilience.

Summary

Prometheus architecture is straightforward but opinionated: pull metrics from targets, store them locally, evaluate rules locally, and route alerts separately. Once this architecture is clear, scrape mechanics make much more sense.