Interview Preparation
Consolidate everything into interview-ready answers covering monitoring theory, Prometheus internals, PromQL, Alertmanager, and Kubernetes integration.
Simple Explanation (ELI5)
This lesson is your rehearsal room. You already learned the concepts. Now you practice answering clearly, with the right level of detail, and with production credibility.
Real-world Analogy
Learning Prometheus is training. Interview preparation is game day. You are proving you can use the tools under pressure and explain trade-offs to other engineers.
Technical Explanation
Strong Prometheus interview answers usually combine three layers: concept, implementation detail, and production trade-off. For example: “Prometheus uses pull-based scraping, which helps in Kubernetes because targets are dynamic; I would expose app metrics via ServiceMonitor, keep label cardinality under control, and alert on user-facing error rate rather than raw CPU.”
Visual Representation
- Concept: what the thing is
- Implementation detail: how you configure or query it
- Production trade-off: why you choose one pattern over another
Commands / Syntax
# Interview-friendly examples
rate(http_requests_total[5m])
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
increase(kube_pod_container_status_restarts_total[15m])
# Useful operational endpoints
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/rules
Example (Real-world Use Case)
When asked how to monitor a Kubernetes API service, a strong answer would include node-level metrics, pod health, request rate, error rate, latency histograms, alerting through Alertmanager, and ServiceMonitor-based discovery. It would also mention avoiding high-cardinality labels like request IDs.
Hands-on Section
- Practice a 60-second explanation of monitoring vs observability.
- Practice a 90-second explanation of Prometheus architecture.
- Write one PromQL query each for CPU, memory, and error rate.
- Practice explaining a troubleshooting workflow for missing metrics.
Try It Yourself
- Record yourself answering “Why Prometheus in Kubernetes?” in under two minutes.
- Summarize the difference between counter and gauge with one production example each.
- Write one alert rule and explain why it is actionable.
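For the counter-versus-gauge item, it can help to show what happens to a counter across a process restart. The sketch below is a simplified illustration, not Prometheus's actual implementation (PromQL's `increase()` also extrapolates to the edges of the query window). The function name `counter_increase` and all sample values are hypothetical.

```python
def counter_increase(samples):
    """Sum the increase across counter samples, treating any drop as a reset.

    Simplified illustration only: real PromQL increase() also extrapolates
    to the edges of the query window.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went backwards, so assume the process restarted
            # and the counter began again from zero.
            total += curr
    return total


# Hypothetical scrapes of http_requests_total; the process restarts after 130.
requests_total = [100, 120, 130, 5, 25]
print(counter_increase(requests_total))  # 55.0

# A gauge such as memory in use is read as-is; a drop is just a lower value.
memory_bytes = [512, 640, 600]
print(memory_bytes[-1])  # 600
```

The point to make in an interview is that a counter is only meaningful through functions such as `rate()` or `increase()`, while a gauge can be read directly.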
Interview Questions
Beginner
Monitoring tracks known health signals and known-failure conditions. Observability uses telemetry to investigate and explain unknown issues in complex systems.
Prometheus is popular because it is open source, cloud-native, and works well with Kubernetes, and because it has strong service discovery, a powerful query language, and a large ecosystem.
The four metric types are counter, gauge, histogram, and summary; counters, gauges, and histograms are the ones most commonly discussed and used.
An exporter exposes metrics in Prometheus format for a target system such as Linux, MySQL, Redis, or a black-box endpoint.
Alertmanager manages alert notifications by grouping, routing, deduplicating, and silencing alerts produced by Prometheus.
Intermediate
The pull model works well with dynamic targets, lets Prometheus observe target health directly, and integrates naturally with Kubernetes discovery and operator resources.
High cardinality means too many unique time series, caused by labels that take many distinct values. It increases memory, disk, and query cost and can destabilize Prometheus.
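A quick way to make the cardinality point concrete is to multiply the number of distinct values per label. The sketch below is only a worst-case estimate under the assumption that every label combination actually occurs; the label names and counts are hypothetical.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single request counter.
bounded = {"method": 5, "status": 6, "route": 20}
print(worst_case_series(bounded))  # 600

# Adding an unbounded label such as a request ID multiplies everything.
with_request_id = dict(bounded, request_id=100_000)
print(worst_case_series(with_request_id))  # 60000000
```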
To get an error percentage, I would divide the rate of 5xx requests by the total request rate over the same time window and multiply by 100.
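The error-rate calculation can be sketched as follows. The PromQL string shows one plausible form of the query; the `status` label is an assumption about how the application is instrumented, and the helper `error_rate_percent` is a hypothetical name used only to show the arithmetic.

```python
# One plausible PromQL form of the answer. The status label is an assumed
# label name; adjust it to match the real instrumentation.
ERROR_RATE_PERCENT_QUERY = (
    '100 * sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def error_rate_percent(errors_per_sec, total_per_sec):
    """Apply the same arithmetic to two already-computed per-second rates."""
    if total_per_sec == 0:
        # Design choice for this sketch: report 0 when there is no traffic.
        return 0.0
    return 100.0 * errors_per_sec / total_per_sec


# Hypothetical rates: 3 failing requests/s out of 150 requests/s.
print(error_rate_percent(3.0, 150.0))  # 2.0
```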
Histograms preserve bucketed latency distributions, so you can calculate percentiles and threshold-based latency SLOs more meaningfully.
Common Prometheus Operator CRDs include ServiceMonitor, PodMonitor, Prometheus, Alertmanager, and PrometheusRule.
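To explain how a percentile comes out of histogram buckets, a small worked estimate helps. The sketch below is a simplified version of the linear interpolation idea behind PromQL's `histogram_quantile()`, not the real implementation; the bucket bounds and counts are hypothetical, and in PromQL the input would typically be something like `sum by (le) (rate(<metric>_bucket[5m]))`.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative latency buckets in seconds: le=0.1, 0.5, 1.0, +Inf.
latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, latency), 3))  # 0.778
```

The estimate is only as precise as the bucket layout, which is why bucket boundaries should sit near the latency thresholds the SLO cares about.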
Scenario-based
I would collect request rate, errors, latency histograms, pod CPU and memory, restart counts, node saturation, and deployment health using app instrumentation plus kube-state-metrics and node exporter. I would discover the service using ServiceMonitor and alert on user-impact signals.
I check target discovery, target reachability, the raw metrics endpoint, metric presence in Prometheus, then PromQL and dashboards. In Kubernetes I also inspect ServiceMonitor selectors and port naming.
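The first two steps of the missing-metrics workflow (discovery and reachability) can be checked from the `/api/v1/targets` response. The sketch below filters that response for targets that are not healthy; the helper name `unhealthy_targets` and the sample payload are hypothetical, and the payload is trimmed to the fields the function reads.

```python
import json


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for active targets whose health is not 'up'."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", ""), t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


# Trimmed, hypothetical response body in the shape returned by /api/v1/targets.
sample = json.loads("""
{"status": "success", "data": {"activeTargets": [
  {"labels": {"job": "api"}, "scrapeUrl": "http://10.0.0.5:8080/metrics",
   "health": "up", "lastError": ""},
  {"labels": {"job": "api"}, "scrapeUrl": "http://10.0.0.6:8080/metrics",
   "health": "down", "lastError": "connection refused"}
]}}
""")
print(unhealthy_targets(sample))
# [('api', 'http://10.0.0.6:8080/metrics', 'connection refused')]
```

If the expected target does not appear in the response at all, the problem is most likely discovery (for example, ServiceMonitor selectors or port naming) rather than reachability.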
request_id as a label for all request metrics. What do you recommend?
I recommend against it because request IDs are unbounded and cause cardinality explosions. Use logs or traces for request-level debugging instead.
I combine memory pressure with restart or OOM signals, add a for window, and ensure the alert maps to real workload risk rather than harmless cache behavior.
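The effect of a `for` window is easier to explain with a small simulation. The sketch below is a simplified model of the inactive, pending, and firing states of a Prometheus alert, assuming one rule evaluation per interval and no `keep_firing_for`; the function name `alert_states` and the input series are hypothetical.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    for_evals is how many evaluation intervals the condition must keep
    holding after it first became true before the alert fires.
    """
    states = []
    active_since = None  # index of the evaluation where the condition first held
    for i, is_true in enumerate(condition_by_eval):
        if not is_true:
            # Any false evaluation resets the alert.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = i
        held = i - active_since
        states.append("firing" if held >= for_evals else "pending")
    return states


# Hypothetical: evaluated once a minute, with a `for` window of 3 minutes.
memory_high = [True, True, False, True, True, True, True]
print(alert_states(memory_high, 3))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'pending', 'firing']
```

The short blip in the first two evaluations never pages anyone, which is the behavior that keeps the alert tied to sustained risk.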
A Prometheus-based stack offers flexible open tooling, strong Kubernetes support, customizable dashboards, and a large ecosystem, though it also requires more operational ownership.
Mock Interview Drill
| Prompt | What a Strong Answer Includes |
|---|---|
| Explain Prometheus architecture | Scrape targets, TSDB, rules, Alertmanager, discovery, exporters |
| How do you monitor Kubernetes? | Operator, ServiceMonitor, node exporter, kube-state-metrics, app metrics |
| How do you detect a CPU spike? | PromQL rate, pod/workload aggregation, correlation with latency and errors |
| How do you debug missing metrics? | Discovery, connectivity, endpoint, raw metric, query, dashboard |
| How do you avoid alert fatigue? | Actionable rules, for windows, grouping, severity routing, symptom-based alerting |
Summary
Interview strength comes from clarity and production judgment. If you can explain Prometheus architecture, query core operational signals, troubleshoot missing metrics, and describe Kubernetes integration with trade-offs, you are in good shape for real DevOps and SRE interviews.