Real-world Scenarios
Apply full-stack Dynatrace knowledge to real production situations: deployment regressions, cascading failures, capacity incidents, and SLA compliance monitoring.
Simple Explanation (ELI5)
This lesson is about what it actually looks like to use Dynatrace when things go wrong in production. Each scenario is a realistic incident, worked through step by step — using APM, infrastructure monitoring, Davis AI, and distributed tracing together.
Scenario 1: Deployment-induced Performance Regression
Situation: 10 minutes after deploying version 3.4.2 of the order-service, Davis opens a problem: "Response time degradation — order-service."
STEP 1: Open Davis problem card
  Root cause identified: order-service response time 440% above baseline
  Contributing event: Deployment "order-service:v3.4.2" at 14:02 UTC
STEP 2: Check deployment comparison (Services → order-service → Events)
  Before v3.4.2: P95 = 210ms, failure rate = 0.2%
  After v3.4.2: P95 = 1,840ms, failure rate = 3.1%
STEP 3: Open PurePaths for slow requests (filter: duration > 1000ms, after 14:02)
  Waterfall reveals: new span "InventoryService.getStockLevels()"
  Duration: avg 1,620ms (was not present in previous version traces)
  Sub-spans: 47 × SELECT from inventory table (N+1 pattern)
STEP 4: Code Hotspot analysis
  Highest contributor: com.example.InventoryService.getStockLevels()
  47 DB calls per request; regression introduced by removing @Cacheable
STEP 5: Action
  Roll back v3.4.2 OR apply hotfix restoring @Cacheable annotation
  Confirm Davis problem closes as metrics return to baseline
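The N+1 pattern found in STEP 3 can also be spotted programmatically once span data has been exported. The following is a minimal sketch, assuming a hypothetical list of span dictionaries; the `trace_id`, `name`, and `statement` field names are illustrative and are not a Dynatrace export format.

```python
from collections import Counter

def find_n_plus_one(spans, threshold=10):
    """Return {(trace_id, statement): count} for SQL statements that repeat
    at least `threshold` times within a single trace."""
    counts = Counter(
        (s["trace_id"], s["statement"])
        for s in spans
        if s.get("statement")  # skip spans that carry no SQL statement
    )
    return {key: n for key, n in counts.items() if n >= threshold}

# Hypothetical data: one request issuing 47 identical SELECTs, one normal request.
spans = [{"trace_id": "t1", "name": "InventoryService.getStockLevels()", "statement": None}]
spans += [
    {"trace_id": "t1", "name": "db", "statement": "SELECT * FROM inventory WHERE sku = ?"}
    for _ in range(47)
]
spans.append({"trace_id": "t2", "name": "db", "statement": "SELECT * FROM orders WHERE id = ?"})

suspects = find_n_plus_one(spans)
for (trace_id, statement), count in suspects.items():
    print(f"{trace_id}: {count} x {statement}")
```

The threshold of 10 is an arbitrary assumption; a suitable value depends on how many repeated statements a healthy request normally issues.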
Scenario 2: Cascading Failure from Database Saturation
Situation: At 09:15, 12 different services all show high response times simultaneously. Traditional monitoring fires 85 separate alerts.
DAVIS PROBLEM CARD (single card):
  Root cause: orders-db, connection pool exhausted (100% utilisation)
  Impact: 12 downstream services degraded, ~8,400 users affected, 2 SLOs breached
STEP 1: Verify root cause
  Navigate to orders-db process group → metrics
  DB connections: 200/200 (at max pool size) since 09:12
  Query queue depth: 847 pending queries
STEP 2: Review Smartscape for blast radius
  orders-db is called by: order-service, cart-service, recommendation-service, fulfillment-service, returns-service, reporting-service...
  All 12 impacted services confirmed in Smartscape
STEP 3: Identify what saturated the pool
  Davis contributing events: "Automated reporting job started at 09:10"
  Navigate to reporting-service PurePaths at 09:10
  Find 240-second query: SELECT * FROM orders (full table scan, no date filter)
STEP 4: Immediate mitigation
  Kill the reporting job: kubectl delete pod reporting-job-xxxx
  DB connection pool releases within 30 seconds
  All 12 services return to baseline in under 2 minutes
STEP 5: Post-incident fix
  Add date range filter to reporting query
  Separate reporting-service DB user with connection limit of 5
  Add Kubernetes resource limits for reporting workloads
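The blast-radius check in STEP 2 amounts to asking which entity every degraded service depends on. The sketch below illustrates that idea on a hand-written dependency graph; the edges are hypothetical and only loosely modelled on the services named above, and this is not how Smartscape or Davis is implemented.

```python
def transitive_dependencies(graph, service):
    """All entities reachable from `service` by following 'calls' edges."""
    seen, stack = set(), [service]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def shared_dependencies(graph, degraded):
    """Entities that every degraded service depends on, directly or indirectly."""
    dep_sets = [transitive_dependencies(graph, s) for s in degraded]
    return set.intersection(*dep_sets) if dep_sets else set()

# Hypothetical call graph: service -> entities it calls.
graph = {
    "order-service": ["orders-db", "cart-service"],
    "cart-service": ["orders-db"],
    "reporting-service": ["orders-db"],
    "fulfillment-service": ["order-service"],
}
degraded = ["order-service", "cart-service", "reporting-service", "fulfillment-service"]

common = shared_dependencies(graph, degraded)
print(common)
```

A single shared dependency is a strong root-cause candidate; several shared dependencies would still need to be narrowed down with metrics, as in STEP 1.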
Scenario 3: Silent Failure — Error Rate Hidden by Retry Logic
Situation: Users are complaining about checkout failures but Dynatrace shows only 0.8% error rate — which is within threshold.
STEP 1: Check PurePaths for /checkout/submit with user_ids from support tickets
  Filtering PurePaths by user: user-78234
  Trace shows: POST /payment FAILED, retry 1 FAILED, retry 2 SUCCESS
  Total user-visible latency: 14.2 seconds (3 attempts)
STEP 2: What does APM show?
  Service failure rate: 0.8% (a request that succeeds after retries is counted as a success)
  The successful retry masks the earlier failed attempts in aggregate metrics
STEP 3: The real signal: check payment-service error rate WITHOUT retries
  Filter PurePaths for payment-service → first attempt only
  Actual first-attempt failure rate: 31% ← the real problem
STEP 4: Find the root cause
  All failed payment spans show error: "TLS handshake timeout" to gateway
  Infrastructure view: payment-service hosts → network metrics
  TCP connection failures to external IP: 12.4% packet loss since 09:45
STEP 5: Actions
  Alert network team to external gateway connectivity degradation
  Add a metric for first-attempt failure rate (new calculated metric in Dynatrace)
  Update the SLO to track first-attempt success rate, not aggregate
LESSON: Aggregate metrics hide retry-masked failures.
Always verify with PurePath-level analysis.
Scenario 4: SLA Compliance Monitoring and Breach Investigation
# Check all SLOs and their current compliance status
curl -s "https://your-env.live.dynatrace.com/api/v2/slo" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  | jq '.slo[] | {name: .name, target: .target, status: .status, errorBudget: .errorBudget}'
# Illustrative, simplified response (actual field shapes depend on the API version):
# {
#   "name": "Checkout Availability",
#   "target": 99.9,
#   "status": { "value": 99.72, "color": "RED" },   ← BREACHED
#   "errorBudget": { "value": -28, "burnRate": "FAST" }
# }
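# Get the SLO evaluated over the last 7 days, including its error budget burn rate
# (replace SLO-ID with the SLO's ID; timeFrame=GTF applies the from/to values)
curl -s "https://your-env.live.dynatrace.com/api/v2/slo/SLO-ID" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -G --data-urlencode "from=now-7d" --data-urlencode "timeFrame=GTF"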
# Export for compliance report
# (SERVICE-checkout is a placeholder; real entity IDs have the form SERVICE-<16 hex characters>)
curl -s "https://your-env.live.dynatrace.com/api/v2/metrics/query" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -G \
  --data-urlencode "metricSelector=((builtin:service.errors.total.successCount)/(builtin:service.requestCount.total))*100" \
  --data-urlencode "entitySelector=type(SERVICE),entityId(SERVICE-checkout)" \
  --data-urlencode "resolution=1h" \
  --data-urlencode "from=now-30d" \
  | jq '.result[0].data[0].values'
Scenario 5: Auto-scaling Failure Detection (Full-Stack Correlation)
Situation: A Black Friday traffic spike causes user-facing errors even though Kubernetes HPA is configured to auto-scale.
DAVIS PROBLEM: "High failure rate in checkout-service" + "availability drop"
STEP 1: Check infrastructure layer
  Kubernetes workload view: checkout-service replicas = 3 (desired: 10)
  Scaling event: HPA attempted scale-up at 14:30, but FAILED
STEP 2: Why did scaling fail?
  Navigate to K8s namespace view → events
  Event: "0/12 nodes are available: 12 Insufficient memory."
  Node pool is full: all nodes at 98% memory utilisation
STEP 3: Connect to Smartscape
  Node memory: 98%
  → checkout-service pods pending (can't schedule)
  → checkout-service has only 3 replicas under 10x normal load
  → overloaded pods → connection queue saturation → errors
STEP 4: Immediate action
  Add nodes to the Kubernetes cluster (cloud provider node pool scale-up)
  New nodes available in 4 minutes
  HPA successfully scales checkout-service to 10 replicas
  Errors reduce to baseline within 90 seconds of new pods becoming ready
STEP 5: Post-incident changes
  Enable the Kubernetes cluster autoscaler or node auto-provisioning (provision nodes automatically)
  Set Dynatrace alert: "Kubernetes node pool > 85% memory" → PagerDuty
  Pre-warm node pool capacity 30 minutes before planned traffic events
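The scheduling failure in STEP 2 can be reasoned about with simple arithmetic. The Kubernetes scheduler places pods based on memory requests rather than live utilisation, so the sketch below models requested memory per node. All node sizes and the 512 MiB pod request are hypothetical values chosen for illustration.

```python
def schedulable_pods(nodes, pod_request_mib):
    """Count how many more pods with the given memory request would fit,
    summing the whole-pod capacity left on each node."""
    total = 0
    for node in nodes:
        free = node["allocatable_mib"] - node["requested_mib"]
        total += max(free, 0) // pod_request_mib
    return total

# Hypothetical cluster: 12 nodes, each with 98% of allocatable memory requested.
nodes = [{"allocatable_mib": 16000, "requested_mib": 15680} for _ in range(12)]
fits_before = schedulable_pods(nodes, pod_request_mib=512)

# Same cluster after adding two empty nodes of the same size.
new_nodes = [{"allocatable_mib": 16000, "requested_mib": 0} for _ in range(2)]
fits_after = schedulable_pods(nodes + new_nodes, pod_request_mib=512)

print(fits_before, fits_after)
```

In this toy cluster each full node has 320 MiB free, less than one pod request, which is why the HPA's desired replica count cannot be met until nodes are added.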
Real-world Insight: Full-Stack Monitoring Value
These scenarios demonstrate the core value proposition of full-stack monitoring: the ability to connect a user-visible symptom (checkout errors) through the application layer (service failures) to the infrastructure root cause (node pool exhaustion) in a single investigation workflow — without switching tools, manually correlating data, or reading through logs. Dynatrace's Smartscape topology is the connective tissue that makes this possible.
Interview Questions
Beginner
Monitoring every layer from the user experience (RUM) through the application (APM/traces) down to infrastructure (host, container, Kubernetes) — connected in a single unified view so you can trace a user complaint all the way to its root cause.
When one service or infrastructure component fails and the failure propagates through dependent services — causing multiple downstream failures. Often looks like all services failing simultaneously when only one root cause exists.
Dynatrace injects deployment events from CI/CD pipelines (or via API). Davis AI correlates the timing of a metric anomaly with recent deployment events and lists the deployment as a contributing factor in the problem card.
The allowed amount of downtime or errors within the SLO target. If your SLO is 99.9% availability, you have a 0.1% error budget, about 43 minutes of downtime per 30-day month. When the budget is burning fast, the team should reduce risk (for example, by pausing risky releases) until the budget recovers.
How quickly the error budget is being consumed relative to the normal rate. A burn rate of 10 means you're consuming budget 10x faster than the target allows and will exhaust the budget in 1/10th of the SLO period.
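The error budget and burn rate figures in the two answers above follow from short calculations. The sketch below reproduces them under the assumption of a 30-day SLO period and a constant error rate; the function names are illustrative and do not correspond to any Dynatrace API.

```python
def error_budget_minutes(slo_target_pct, period_days=30):
    """Allowed downtime in minutes for an availability SLO over the period."""
    return (100.0 - slo_target_pct) / 100.0 * period_days * 24 * 60

def burn_rate(observed_error_pct, slo_target_pct):
    """How many times faster than allowed the budget is being consumed."""
    return observed_error_pct / (100.0 - slo_target_pct)

def days_to_exhaustion(rate, period_days=30):
    """Days until a full budget is gone at a constant burn rate."""
    return period_days / rate

budget = error_budget_minutes(99.9)  # 0.1% of a 30-day month
rate = burn_rate(observed_error_pct=1.0, slo_target_pct=99.9)
days_left = days_to_exhaustion(rate)

print(f"budget: {budget:.1f} min, burn rate: {rate:.1f}x, exhausted in {days_left:.1f} days")
```

With a 99.9% target, a sustained 1% error rate is a burn rate of about 10, so a 30-day budget lasts roughly 3 days.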
Intermediate
With retries, a failed call is attempted again. If a retry succeeds, the end-to-end request is recorded as a success, so the aggregate error rate hides the failed attempts. Analysing only the first attempt in each PurePath reveals the true failure rate, which users experience as added latency.
Navigate to the Kubernetes workload view — check desired vs actual replicas. Look at K8s events for scheduling failures ("Insufficient CPU/memory"). Correlate with node-level metrics to confirm whether node pool saturation prevented new pod scheduling.
An alert is a single notification for one metric breach. A problem card is a correlated, AI-analysed incident view containing all related alerts, identified root cause, impact blast radius, contributing events, and remediation guidance.
Export the Davis problem card, which includes the contributing events timeline showing the deployment event immediately before the anomaly. Include the metric comparison (before/after Apdex, P95, failure rate) as quantified evidence.
Separate the batch job's database connection pool with a strict max (e.g., 5 connections). Use a separate database user. Schedule batch jobs during off-peak hours. Add Dynatrace alert on DB connection utilisation >80%.
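The retry-masking effect described in the first answer of this group (and in Scenario 3) is easy to reproduce with sample data. The sketch below assumes each request is recorded as an ordered list of attempt outcomes; the data is made up and sized so the result matches the 0.8% aggregate and 31% first-attempt rates from Scenario 3.

```python
def failure_rates(request_attempts):
    """Return (end_to_end_rate, first_attempt_rate) as fractions.

    Each request is a list of attempt outcomes in order, True = success.
    A request fails end to end only if every attempt failed.
    """
    total = len(request_attempts)
    end_to_end_failures = sum(1 for attempts in request_attempts if not any(attempts))
    first_attempt_failures = sum(1 for attempts in request_attempts if not attempts[0])
    return end_to_end_failures / total, first_attempt_failures / total

# Hypothetical sample of 1,000 checkout requests.
checkout_requests = (
    [[True]] * 690                   # succeeded on the first attempt
    + [[False, True]] * 200          # needed one retry
    + [[False, False, True]] * 102   # needed two retries
    + [[False, False, False]] * 8    # failed after all retries
)

end_to_end, first_attempt = failure_rates(checkout_requests)
print(f"end-to-end: {end_to_end:.1%}, first attempt: {first_attempt:.1%}")
```

Tracking both numbers side by side makes the gap visible: a small end-to-end rate paired with a large first-attempt rate points at a dependency that only works after retries.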
Scenario-based
Open the Davis problem card — it correlates all 14 into one problem. Identify the single root cause entity listed. Use Smartscape to confirm that root cause entity is a dependency of all 14 services. Fix the root cause, not each service individually.
Create an SLO in Dynatrace targeting the relevant service metric (availability or success rate). Use the Metrics API to export hourly SLO values for the month. Calculate overall availability. Export the problem history to document any breaches and their duration.
1. HPA scaling lag — is it scaling fast enough? 2. Node pool capacity — can new pods actually schedule? 3. Pod startup time — are new pods taking too long to become ready? 4. DB connection pool — does the scaled service overwhelm the database? Check each in Dynatrace Kubernetes workload view.
Watch the JVM memory (heap used) metric trend for a process group over days. A persistent upward trend that never returns to baseline after GC cycles indicates a memory leak. Code Hotspot thread analysis can show heap allocations accumulating in specific objects.
1. Davis AI with auto-baselining (no manual thresholds). 2. Smart alerting via PagerDuty integration filtered to High/Critical severity only. 3. Monaco for configuration-as-code to remove manual dashboard/alert setup. 4. SLOs to replace ad-hoc alert threshold discussions with agreed objectives.
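The memory-leak check described above (a heap floor that keeps rising after GC cycles) can be approximated with a trend line over exported post-GC heap samples. This is a minimal sketch with made-up numbers and an arbitrary slope threshold; it is not how Davis detects anomalies.

```python
def slope(values):
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def looks_like_leak(post_gc_heap_mb, min_slope_mb=1.0):
    """Flag a sustained upward trend in post-GC heap usage.

    Requires at least 3 samples so a trend can be estimated at all."""
    return len(post_gc_heap_mb) >= 3 and slope(post_gc_heap_mb) >= min_slope_mb

# Hypothetical post-GC heap samples in MB, one per day.
healthy = [410, 395, 405, 400, 398, 402, 399, 404]   # returns to baseline
leaking = [400, 430, 455, 490, 520, 548, 580, 610]   # floor keeps rising

print(looks_like_leak(healthy), looks_like_leak(leaking))
```

Using post-GC values rather than raw heap usage matters: raw usage rises and falls with every collection cycle, while the post-GC floor only climbs when objects are being retained.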
Summary
Real-world Dynatrace use requires combining all tools together: Davis AI identifies and correlates the problem, APM provides service-level detail, distributed traces show the exact request path, and infrastructure monitoring confirms the root cause layer. Full-stack monitoring means every incident investigation follows the same structured path — from user impact to code and infrastructure root cause, in minutes.