Advanced · Lesson 7 of 9

Real-world Scenarios

Apply full-stack Dynatrace knowledge to real production situations: deployment regressions, cascading failures, capacity incidents, and SLA compliance monitoring.

Simple Explanation (ELI5)

This lesson is about what it actually looks like to use Dynatrace when things go wrong in production. Each scenario is a realistic incident, worked through step by step — using APM, infrastructure monitoring, Davis AI, and distributed tracing together.

Scenario 1: Deployment-induced Performance Regression

Situation: 10 minutes after deploying version 3.4.2 of the order-service, Davis opens a problem: "Response time degradation — order-service."

text — Investigation walkthrough
STEP 1: Open Davis problem card
  Root cause identified: order-service response time 440% above baseline
  Contributing event: Deployment "order-service:v3.4.2" at 14:02 UTC

STEP 2: Check deployment comparison (Services → order-service → Events)
  Before v3.4.2:  P95 = 210ms, failure rate = 0.2%
  After v3.4.2:   P95 = 1,840ms, failure rate = 3.1%

STEP 3: Open PurePaths for slow requests (filter: duration > 1000ms, after 14:02)
  Waterfall reveals: new span "InventoryService.getStockLevels()"
  Duration: avg 1,620ms (was not present in previous version traces)
  Sub-spans: 47 × SELECT from inventory table (N+1 pattern)

STEP 4: Code Hotspot analysis
  Highest contributor: com.example.InventoryService.getStockLevels()
  47 DB calls per request — regression introduced by removing @Cacheable

STEP 5: Action
  Roll back v3.4.2 OR apply hotfix restoring @Cacheable annotation
  Confirm Davis problem closes as metrics return to baseline
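
Davis can only list a deployment as a contributing event if the deployment was reported to Dynatrace. The sketch below shows one way a pipeline might report it through the Events API v2 ingest endpoint. `DT_ENV_URL` and `DT_API_TOKEN` are assumed environment variables, the function names are illustrative, and the payload builder assumes its arguments need no JSON escaping.

```shell
# Print the JSON body for a CUSTOM_DEPLOYMENT event.
# Usage: deployment_event_payload NAME VERSION ENTITY_SELECTOR
# Assumes the arguments contain no characters that need JSON escaping.
deployment_event_payload() {
  printf '{"eventType":"CUSTOM_DEPLOYMENT","title":"%s:v%s","entitySelector":"%s","properties":{"dt.event.deployment.name":"%s","dt.event.deployment.version":"%s"}}\n' \
    "$1" "$2" "$3" "$1" "$2"
}

# POST the event to Dynatrace.
# DT_ENV_URL and DT_API_TOKEN are assumed environment variables;
# the token needs permission to ingest events.
send_deployment_event() {
  deployment_event_payload "$@" |
    curl -s -X POST "$DT_ENV_URL/api/v2/events/ingest" \
      -H "Authorization: Api-Token $DT_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data-binary @-
}
```

A pipeline step could call `send_deployment_event order-service 3.4.2 'type(SERVICE),entityId(SERVICE-ID)'` right after the rollout, where `SERVICE-ID` is a placeholder for the real entity ID.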

Scenario 2: Cascading Failure from Database Saturation

Situation: At 09:15, 12 different services all show high response times simultaneously. Traditional monitoring fires 85 separate alerts.

text — Cascading failure triage
DAVIS PROBLEM CARD (single card):
  Root cause: orders-db, connection pool exhausted (100% utilisation)
  Impact: 12 downstream services degraded, ~8,400 users affected, 2 SLOs breached

STEP 1: Verify root cause
  Navigate to orders-db process group → metrics
  DB connections: 200/200 (at max pool size) since 09:12
  Query queue depth: 847 pending queries

STEP 2: Review Smartscape for blast radius
  orders-db is called by: order-service, cart-service, recommendation-service,
  fulfillment-service, returns-service, reporting-service...
  All 12 impacted services confirmed in Smartscape

STEP 3: Identify what saturated the pool
  Davis contributing events: "Automated reporting job started at 09:10"
  Navigate to reporting-service PurePaths at 09:10
  Find 240-second query: SELECT * FROM orders (full table scan, no date filter)

STEP 4: Immediate mitigation
  Kill the reporting job: kubectl delete pod reporting-job-xxxx
  DB connection pool releases within 30 seconds
  All 12 services return to baseline in under 2 minutes

STEP 5: Post-incident fix
  Add date range filter to reporting query
  Separate reporting-service DB user with connection limit of 5
  Add Kubernetes resource limits for reporting workloads
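
The blast-radius check in STEP 2 is a walk over the dependency graph: start at the failing component and collect everything that calls it, directly or indirectly. Smartscape does this for you. The sketch below only illustrates the idea on a hand-written edge list; the input format and the `blast_radius` function name are assumptions, not a Dynatrace export format.

```shell
# Print every service that depends, directly or transitively, on TARGET.
# Usage: blast_radius TARGET < edges
# Input: one "<caller> <callee>" pair per line (hypothetical format).
blast_radius() {
  awk -v target="$1" '
    { callers[$2] = callers[$2] " " $1 }   # callee -> space-separated callers
    END {
      # Breadth-first walk from the target up through its callers
      queue[1] = target; head = 1; tail = 1; seen[target] = 1
      while (head <= tail) {
        n = split(callers[queue[head++]], c, " ")
        for (i = 1; i <= n; i++)
          if (!(c[i] in seen)) { seen[c[i]] = 1; queue[++tail] = c[i]; print c[i] }
      }
    }' | sort
}
```

Fed the pairs `order-service orders-db`, `cart-service orders-db`, and a hypothetical `web-frontend cart-service`, the call `blast_radius orders-db` lists all three services, including the front end that never talks to the database directly.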

Scenario 3: Silent Failure — Error Rate Hidden by Retry Logic

Situation: Users are complaining about checkout failures but Dynatrace shows only 0.8% error rate — which is within threshold.

text — Silent failure investigation
STEP 1: Check PurePaths for /checkout/submit with user_ids from support tickets
  Filtering PurePaths by user: user-78234
  Trace shows: POST /payment FAILED, retry 1 FAILED, retry 2 SUCCESS
  Total user-visible latency: 14.2 seconds (3 attempts)

STEP 2: What does APM show?
  Service failure rate: 0.8% (a checkout that succeeds on a retry counts as a success)
  Successful retries mask the failed attempts in aggregate metrics

STEP 3: The real signal — check payment-service error rate WITHOUT retries
  Filter PurePaths for payment-service → first attempt only
  Actual first-attempt failure rate: 31% ← the real problem

STEP 4: Find the root cause
  All failed payment spans show error: "TLS handshake timeout" to gateway
  Infrastructure view: payment-service hosts → network metrics
  TCP connection failures to external IP: 12.4% packet loss since 09:45

STEP 5: Actions
  Alert network team to external gateway connectivity degradation
  Add a metric for first-attempt failure rate (new calculated metric in Dynatrace)
  SLO updated to track first-attempt success rate, not aggregate

LESSON: Aggregate metrics hide retry-masked failures.
         Always verify with PurePath-level analysis.
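
The gap between the two rates is easy to reproduce on a small sample. The sketch below assumes a hypothetical export with one line per payment attempt (request ID, attempt number, outcome), with attempts listed in order. It is not a Dynatrace format, and `retry_masked_rates` is an illustrative name.

```shell
# Compare the first-attempt failure rate with the final (post-retry) failure rate.
# Input: one line per attempt, "<request_id> <attempt_number> <OK|FAIL>",
# with the attempts for each request listed in order (hypothetical format).
retry_masked_rates() {
  awk '
    $2 == 1 { total++; if ($3 == "FAIL") first_fail++ }
    { last[$1] = $3 }                  # later attempts overwrite earlier ones
    END {
      if (total == 0) { print "no requests"; exit 1 }
      for (id in last) if (last[id] == "FAIL") final_fail++
      printf "first_attempt_failure=%.1f%% final_failure=%.1f%%\n",
             100 * first_fail / total, 100 * final_fail / total
    }'
}
```

For four requests where two fail on the first attempt but all eventually succeed, this prints `first_attempt_failure=50.0% final_failure=0.0%`: the aggregate view looks healthy while half the users waited through retries.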

Scenario 4: SLA Compliance Monitoring and Breach Investigation

bash — Query SLO status and burn rate via API
# Check all SLOs and their current compliance status
# (evaluate=true asks the API to calculate status and error budget for each SLO)
curl -s "https://your-env.live.dynatrace.com/api/v2/slo?evaluate=true" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  | jq '.slo[] | {name, target, evaluatedPercentage, status, errorBudget, errorBudgetBurnRate}'

# Response example (field names as in the SLO API v2; values illustrative):
# {
#   "name": "Checkout Availability",
#   "target": 99.9,
#   "evaluatedPercentage": 99.72,
#   "status": "FAILURE",                              ← BREACHED
#   "errorBudget": -0.18,                             (evaluatedPercentage minus target)
#   "errorBudgetBurnRate": { "burnRateType": "FAST" }
# }

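# Evaluate one SLO over the last 7 days, including its burn rate
# (timeFrame=GTF applies the from/to window instead of the SLO's own timeframe)
curl -s "https://your-env.live.dynatrace.com/api/v2/slo/SLO-ID" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -G --data-urlencode "from=now-7d" --data-urlencode "timeFrame=GTF"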

# Export for compliance report
curl -s "https://your-env.live.dynatrace.com/api/v2/metrics/query" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -G \
  --data-urlencode "metricSelector=((builtin:service.errors.total.successCount)/(builtin:service.requestCount.total))*100" \
  --data-urlencode "entitySelector=type(SERVICE),entityId(SERVICE-checkout)" \
  --data-urlencode "resolution=1h" \
  --data-urlencode "from=now-30d" \
  | jq '.result[0].data[0].values'
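
The figures in the example response (99.72% measured against a 99.9% target) can be sanity-checked by hand. The sketch below uses the common definitions: error budget is the allowed failure share of the period, and burn rate is the observed failure share divided by the allowed one. The function names are illustrative, and Dynatrace's own burn-rate calculation may use a different evaluation window.

```shell
# Minutes of allowed downtime for a given SLO target and period.
# Usage: error_budget_minutes TARGET_PERCENT PERIOD_DAYS
error_budget_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

# Observed failure share divided by the failure share the target allows.
# Usage: burn_rate OBSERVED_PERCENT TARGET_PERCENT
burn_rate() {
  awk -v observed="$1" -v target="$2" 'BEGIN {
    if (target >= 100) { print "target must be below 100"; exit 1 }
    printf "%.1f\n", (100 - observed) / (100 - target)
  }'
}
```

`error_budget_minutes 99.9 30` gives 43.2 minutes, and `burn_rate 99.72 99.9` gives 2.8. At that rate a 30-day budget would be gone in under 11 days.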

Scenario 5: Auto-scaling Failure Detection — Full-Stack Correlation

Situation: A Black Friday traffic spike causes user-facing errors even though Kubernetes HPA is configured to auto-scale.

text — Full-stack auto-scaling investigation
DAVIS PROBLEM: "High failure rate in checkout-service" + "availability drop"

STEP 1: Check infrastructure layer
  Kubernetes workload view: checkout-service replicas = 3 (desired: 10)
  Scaling event: HPA attempted scale-up at 14:30, but FAILED

STEP 2: Why did scaling fail?
  Navigate to K8s namespace view → events
  Event: "0/12 nodes available: 12 Insufficient memory"
  Node pool is full — all nodes at 98% memory utilisation

STEP 3: Connect to Smartscape
  Node memory: 98% → checkout-service pods pending (can't schedule)
  → checkout-service has only 3 replicas under 10x normal load
  → overloaded pods → connection queue saturation → errors

STEP 4: Immediate action
  Add nodes to the Kubernetes cluster (cloud provider node pool scale-up)
  New nodes available in 4 minutes
  HPA successfully scales checkout-service to 10 replicas
  Errors reduce to baseline within 90 seconds of new pods becoming ready

STEP 5: Post-incident changes
  Add a cluster autoscaler to Kubernetes (adds nodes automatically when pods cannot be scheduled)
  Set Dynatrace alert: "Kubernetes node pool > 85% memory" → PagerDuty
  Pre-warm node pool capacity 30 minutes before planned traffic events
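
The scheduling failure found in STEP 2 can also be confirmed from the command line. The sketch below is a small filter over Kubernetes event text that keeps only the lines where the scheduler reports insufficient CPU or memory. The `unschedulable_events` name is illustrative, and the pattern is an assumption based on the message quoted above.

```shell
# Read Kubernetes event lines on stdin and keep those where the scheduler
# reports that no node has enough CPU or memory for a pending pod.
unschedulable_events() {
  grep -E '[0-9]+/[0-9]+ nodes (are )?available: .*Insufficient (cpu|memory)' || true
}
```

A typical use would be `kubectl get events -n NAMESPACE --field-selector reason=FailedScheduling | unschedulable_events`, with `NAMESPACE` standing in for the namespace of the affected workload.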

Real-world Insight: Full-Stack Monitoring Value

These scenarios demonstrate the core value proposition of full-stack monitoring: the ability to connect a user-visible symptom (checkout errors) through the application layer (service failures) to the infrastructure root cause (node pool exhaustion) in a single investigation workflow — without switching tools, manually correlating data, or reading through logs. Dynatrace's Smartscape topology is the connective tissue that makes this possible.

Interview Questions

Beginner

What is full-stack monitoring?

Monitoring every layer from the user experience (RUM) through the application (APM/traces) down to infrastructure (host, container, Kubernetes) — connected in a single unified view so you can trace a user complaint all the way to its root cause.

What is a cascading failure?

When one service or infrastructure component fails and the failure propagates through dependent services — causing multiple downstream failures. Often looks like all services failing simultaneously when only one root cause exists.

How does Dynatrace detect deployment-caused regressions?

Dynatrace ingests deployment events sent by CI/CD pipelines through its Events API. Davis AI correlates the timing of a metric anomaly with recent deployment events and lists the deployment as a contributing event in the problem card.

What is an error budget?

The amount of downtime or errors that the SLO target allows. If your SLO is 99.9% availability, the error budget is 0.1%, which is about 43 minutes of downtime in a 30-day month. If the budget is burning fast, the team should take fewer risks (for example, pause non-essential deployments) until it recovers.

What is SLO burn rate?

How quickly the error budget is being consumed relative to the rate that would use it up exactly at the end of the SLO period. A burn rate of 1 spends the budget evenly over the period; a burn rate of 10 consumes it 10x faster and exhausts it in 1/10th of the period.

Intermediate

How do retry mechanisms hide real failure rates in APM?

When a call fails and the caller retries, each retry is a new request to the downstream service. If a retry eventually succeeds, the caller's own request completes successfully, so its aggregate failure rate never records the failed attempts. Analysing PurePaths for first attempts only reveals the true failure rate, which users experience as added latency.

How do you use Dynatrace to investigate a Kubernetes scaling failure?

Navigate to the Kubernetes workload view — check desired vs actual replicas. Look at K8s events for scheduling failures ("Insufficient CPU/memory"). Correlate with node-level metrics to confirm whether node pool saturation prevented new pod scheduling.

What is the difference between a problem card and an alert?

An alert is a single notification for one metric breach. A problem card is a correlated, AI-analysed incident view containing all related alerts, identified root cause, impact blast radius, contributing events, and remediation guidance.

How do you prove an incident was caused by a deployment in a post-mortem?

Export the Davis problem card, which includes the contributing-events timeline showing the deployment event immediately before the anomaly. Include the before/after metric comparison (Apdex, P95, failure rate) as quantified evidence.

How do you prevent a single batch job from saturating a shared database?

Separate the batch job's database connection pool with a strict max (e.g., 5 connections). Use a separate database user. Schedule batch jobs during off-peak hours. Add Dynatrace alert on DB connection utilisation >80%.

Scenario-based

14 services all go red at once. Where do you start?

Open the Davis problem card — it correlates all 14 into one problem. Identify the single root cause entity listed. Use Smartscape to confirm that root cause entity is a dependency of all 14 services. Fix the root cause, not each service individually.

A client's SLA says 99.9% uptime. How do you report monthly compliance?

Create an SLO in Dynatrace targeting the relevant service metric (availability or success rate). Use the Metrics API to export hourly SLO values for the month. Calculate overall availability. Export the problem history to document any breaches and their duration.
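
The "calculate overall availability" step can be sketched as a small filter over the exported hourly values. It assumes one success percentage per line, for instance the Metrics API result passed through `jq -r '.result[0].data[0].values[]'`, and skips hours with no data. The `sla_report` name is illustrative. An unweighted mean treats every hour equally, so a contract defined on request counts would need the values weighted by traffic instead.

```shell
# Average hourly success percentages and compare the mean with the SLA target.
# Usage: sla_report TARGET_PERCENT < values   (one percentage per line)
# Lines that are empty or "null" (hours without data) are skipped.
sla_report() {
  awk -v target="$1" '
    $1 != "" && $1 != "null" { sum += $1; n++ }
    END {
      if (n == 0) { print "no data"; exit 1 }
      mean = sum / n
      printf "%.3f %s\n", mean, (mean >= target ? "PASS" : "FAIL")
    }'
}
```

For the hourly values 100, 100, 99.2 and 100 against a 99.9 target, the report reads `99.800 FAIL`.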

Auto-scaling is configured but the service still goes down during traffic spikes. What do you check?

1. HPA scaling lag: is it scaling fast enough? 2. Node pool capacity: can new pods actually schedule? 3. Pod startup time: are new pods taking too long to become ready? 4. DB connection pool: does the scaled-out service overwhelm the database? Check the first three in the Dynatrace Kubernetes workload view and the fourth in the database's connection metrics.

How do you detect a memory leak in production using Dynatrace?

Watch the JVM heap-used metric for a process group over several days. A floor that keeps rising, with usage never returning to its earlier baseline after GC cycles, indicates a memory leak. Memory profiling (allocation hotspots and memory dumps) can then show which object types are accumulating.
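
"A persistent upward trend" can be made concrete with a least-squares slope over the post-GC heap readings. The sketch below assumes one reading per line, oldest first and evenly spaced, in a hypothetical export. The `heap_trend_slope` name is illustrative, and the slope is expressed in the input's unit per sample.

```shell
# Least-squares slope of evenly spaced samples read from stdin (oldest first).
# A slope that stays clearly positive over days of post-GC readings suggests a leak.
heap_trend_slope() {
  awk '
    { n++; sx += n; sy += $1; sxx += n * n; sxy += n * $1 }
    END {
      if (n < 2) { print "need at least 2 samples"; exit 1 }
      printf "%.2f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx)
    }'
}
```

Readings of 400, 420, 440 and 460 MB give a slope of 20.00 MB per sample, while a flat series gives 0.00.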

You are asked to reduce observability toil for your team. What Dynatrace features do you enable?

1. Davis AI with auto-baselining (no manual thresholds). 2. Smart alerting via PagerDuty integration filtered to High/Critical severity only. 3. Monaco for configuration-as-code to remove manual dashboard/alert setup. 4. SLOs to replace ad-hoc alert threshold discussions with agreed objectives.

Summary

Real-world Dynatrace use requires combining all tools together: Davis AI identifies and correlates the problem, APM provides service-level detail, distributed traces show the exact request path, and infrastructure monitoring confirms the root cause layer. Full-stack monitoring means every incident investigation follows the same structured path — from user impact to code and infrastructure root cause, in minutes.