Distributed Tracing
Master PurePath technology to trace every request end-to-end across microservices, pinpoint latency sources, and analyse service flow maps.
Simple Explanation (ELI5)
Imagine every user request is a parcel moving through a postal network. Distributed tracing puts a GPS tracker on the parcel — recording every sorting facility (service), every vehicle (network hop), and every delay (latency) along the way. When a delivery is late, you can replay the exact route and see exactly where the parcel got stuck.
What is Distributed Tracing?
In a microservices architecture, a single user action can trigger calls across dozens of services. Distributed tracing captures this entire request journey as a trace: a collection of linked spans, each representing one unit of work in one service. In Dynatrace, this is called PurePath, a code-level trace that OneAgent captures automatically, without manual instrumentation.
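To make the trace/span relationship concrete, here is a minimal sketch of a trace as a tree of linked spans. The `Span` class, its field names, and `build_trace_tree` are illustrative assumptions, not Dynatrace's or any tracing library's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work in one service (field names are hypothetical)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_trace_tree(spans: List[Span]) -> Span:
    """Link each span to its parent via parent_id and return the root span."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    root: Optional[Span] = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            by_id[span.parent_id].children.append(span)
    if root is None:
        raise ValueError("trace has no root span")
    return root


# Three spans from a checkout request, linked into one trace
example_spans = [
    Span("a", None, "frontend-service", "POST /checkout/submit", 0, 3420),
    Span("b", "a", "order-service", "HTTP POST /orders", 45, 3180),
    Span("c", "b", "orders-db", "SELECT orders", 50, 890),
]
root = build_trace_tree(example_spans)
```

Walking `root.children` recursively yields the same parent-child structure that a trace waterfall displays.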
PurePath vs Standard Distributed Tracing
| Feature | Standard Tracing (Jaeger/Zipkin/OTel) | Dynatrace PurePath |
|---|---|---|
| Instrumentation | SDK integration or a per-language auto-instrumentation agent | Automatic (OneAgent bytecode) |
| Granularity | Spans at service boundary | Method-level within each service |
| DB query capture | Manual span creation needed | Automatic — every DB call is a span |
| Sampling | Head- or tail-based (head-based may miss errors) | Adaptive: all errors captured |
| Correlation with infra | Manual linking | Automatic via Smartscape |
| Context propagation | W3C TraceContext (manual) | Automatic + supports W3C TraceContext |
PurePath Anatomy
POST /checkout/submit [total: 3,420ms]
[frontend-service] [0ms - 3420ms]
|
+-- HTTP POST /orders [45ms - 3180ms] (order-service)
|   |
|   +-- SELECT orders [50ms - 890ms] (orders-db: PostgreSQL)
|   |     Query: SELECT * FROM orders WHERE user_id = ?
|   |     Rows: 1,240 ← HIGH (N+1 risk)
|   |
|   +-- HTTP GET /inventory [950ms - 2100ms] (inventory-service)
|   |   |
|   |   +-- GET redis:item:* [955ms - 1640ms] (Redis)
|   |         MISS (cache cold) ← ROOT CAUSE of latency
|   |
|   +-- HTTP POST /payment [2150ms - 3100ms] (payment-service)
|         Duration: 950ms (within SLA)
|
+-- async: kafka:order.created [3180ms] (Kafka)

Total database calls: 47 ← N+1 pattern detected
Total external calls: 2
Slowest span: Redis cache miss chain (685ms)
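The timings in the waterfall above can be turned into per-span self time, meaning the time a span spends in its own code rather than waiting on children. This is a sketch under one stated assumption: the children of a span run one after another without overlapping, which holds for this example. The `WATERFALL` list and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (name, start_ms, end_ms, indexes of child spans), copied from the waterfall
Node = Tuple[str, int, int, List[int]]

WATERFALL: List[Node] = [
    ("frontend-service POST /checkout/submit", 0, 3420, [1, 5]),
    ("order-service HTTP POST /orders", 45, 3180, [2, 3, 4]),
    ("orders-db SELECT orders", 50, 890, []),
    ("inventory-service HTTP GET /inventory", 950, 2100, [6]),
    ("payment-service HTTP POST /payment", 2150, 3100, []),
    ("Kafka async order.created", 3180, 3180, []),
    ("Redis GET redis:item:*", 955, 1640, []),
]


def self_time_ms(nodes: List[Node], index: int) -> int:
    """Span duration minus the summed duration of its direct children."""
    _, start, end, children = nodes[index]
    child_total = sum(nodes[c][2] - nodes[c][1] for c in children)
    return (end - start) - child_total


def self_times(nodes: List[Node]) -> Dict[str, int]:
    """Map every span name to its self time in milliseconds."""
    return {node[0]: self_time_ms(nodes, i) for i, node in enumerate(nodes)}


if __name__ == "__main__":
    ranked = sorted(self_times(WATERFALL).items(), key=lambda kv: -kv[1])
    for name, ms in ranked:
        print(f"{ms:>5} ms  {name}")
```

Because the children do not overlap, the self times add up to the total request duration, which is a quick sanity check on the breakdown.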
Service Flow Maps
Dynatrace automatically builds Service Flow maps — visualising how a specific type of request (e.g., POST /checkout/submit) flows through the system. For each service in the flow it shows throughput, response time, and error rate. This is different from Smartscape (which shows all connections) — Service Flow shows only the path for the specific transaction type you're analysing.
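As a rough illustration of what a Service Flow aggregates, the sketch below groups spans for one request type by service and computes throughput, average response time, and error rate. The `SpanRecord` type and `service_flow` function are hypothetical, and this is not how Dynatrace computes the view internally.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class SpanRecord:
    request_type: str  # e.g. "POST /checkout/submit"
    service: str
    duration_ms: float
    failed: bool


def service_flow(
    spans: Iterable[SpanRecord], request_type: str, window_minutes: float
) -> Dict[str, Dict[str, float]]:
    """Per-service metrics for spans that belong to one request type."""
    grouped: Dict[str, List[SpanRecord]] = defaultdict(list)
    for span in spans:
        if span.request_type == request_type:
            grouped[span.service].append(span)

    flow: Dict[str, Dict[str, float]] = {}
    for service, items in grouped.items():
        flow[service] = {
            "throughput_per_min": len(items) / window_minutes,
            "avg_response_ms": sum(s.duration_ms for s in items) / len(items),
            "error_rate": sum(1 for s in items if s.failed) / len(items),
        }
    return flow
```

Filtering on `request_type` before grouping is what distinguishes this from a topology view of all connections: services that never take part in the selected transaction simply do not appear in the result.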
Context Propagation
# Dynatrace automatically injects these headers on outbound HTTP calls:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#   00                                = version
#   4bf92f3577b34da6a3ce929d0e0e4736  = trace-id (128-bit)
#   00f067aa0ba902b7                  = span-id (64-bit)
#   01                                = flags
tracestate: dt=fw4;8e80;1;0;0;0;2;0;0;fe;2h01;7h4bf92f35

# On the receiving service, OneAgent reads these headers and
# automatically continues the trace, linking the new span to the parent.
# No code changes required.
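For readers who want to inspect these headers by hand, here is a small sketch that splits a `traceparent` value into its four fields. It accepts only the four-field layout shown above and is not how OneAgent processes the header. The `TraceParent` type and `parse_traceparent` function are illustrative names.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent/span-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # W3C Trace Context forbids version "ff" and all-zero IDs
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit means "sampled"
    return TraceParent(version, trace_id, parent_id, sampled)


if __name__ == "__main__":
    print(parse_traceparent(
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ))
```

A downstream service that continues the trace keeps the same `trace_id` and records `parent_id` as the parent of its own new span.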
Querying PurePaths via Dynatrace API
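# Get the 10 slowest traces for the checkout endpoint in the last 30 minutes (minDuration filter applied).
# The endpoint path and payload fields are illustrative; check them against your environment's API reference.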
curl -s -X POST \
  "https://your-env.live.dynatrace.com/api/v2/traces/filter" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "timeFrame": "last30minutes",
    "filter": {
      "service": "checkout-service",
      "endpoint": "POST /checkout/submit",
      "minDuration": 2000,
      "status": "ANY"
    },
    "sorting": {"direction": "DESCENDING", "sortBy": "DURATION"},
    "pageSize": 10
  }' | jq '.traces[] | {traceId: .traceId, duration: .duration, status: .status}'
Analysing a Slow PurePath: Step-by-Step
- Open the service in Dynatrace UI and go to PurePaths.
- Filter for slow traces — sort by response time descending, open the slowest one.
- Expand the waterfall — look for the deepest span with the largest latency bar.
- Check DB spans — look for high row counts, missing indexes, or N+1 patterns.
- Check external call spans — identify slow third-party APIs or timeout patterns.
- Review method hotspots inside each service span for CPU-intensive methods.
- Confirm the path — if the slow span is on a specific dependency, validate with that team.
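The DB-span check in the steps above can be partly automated once the SQL statements of a trace have been exported. The sketch below normalises literals and counts repeated query patterns. The function names and the threshold of 10 are assumptions chosen for illustration, and the regex-based normalisation is deliberately simplistic.

```python
import re
from collections import Counter
from typing import Dict, List


def normalise_sql(statement: str) -> str:
    """Replace literals with '?' so repeated queries share one pattern."""
    text = re.sub(r"'[^']*'", "?", statement)  # quoted string literals
    text = re.sub(r"\b\d+\b", "?", text)       # standalone numeric literals
    return re.sub(r"\s+", " ", text).strip().lower()


def find_n_plus_one(statements: List[str], threshold: int = 10) -> Dict[str, int]:
    """Return query patterns that occur at least `threshold` times in one trace."""
    counts = Counter(normalise_sql(s) for s in statements)
    return {pattern: n for pattern, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # One parent query plus one query per returned row: the classic N+1 shape
    trace_statements = ["SELECT * FROM orders WHERE user_id = 7"] + [
        f"SELECT * FROM items WHERE order_id = {i}" for i in range(46)
    ]
    for pattern, n in find_n_plus_one(trace_statements).items():
        print(f"{n}x {pattern}")
```

A pattern that repeats dozens of times inside a single request is usually a candidate for a join or a batched `IN (...)` query.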
Adaptive Sampling in Dynatrace
Head-based sampling decides whether to keep a trace when the request starts, so it can discard requests that later turn out to be slow or failed. Dynatrace instead uses adaptive sampling: it captures all transactions that are slow, have errors, or are anomalous, and samples normal healthy traffic. Problematic requests therefore stay available for debugging, even in high-volume production environments.
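The idea of keeping every problematic trace while sampling healthy ones can be sketched as a simple keep/drop decision. This is a conceptual illustration only and not Dynatrace's actual algorithm. The 2000 ms threshold, the 10% sample rate, and all names are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompletedTrace:
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    slow_threshold_ms: float = 2000.0,   # hypothetical "slow" cut-off
    healthy_sample_rate: float = 0.1,    # hypothetical share of healthy traces kept
    rng: Callable[[], float] = random.random,
) -> bool:
    """Always keep slow or failed traces; keep healthy ones at the sample rate."""
    if trace.has_error or trace.duration_ms >= slow_threshold_ms:
        return True
    return rng() < healthy_sample_rate
```

The decision runs on a completed trace, which is what lets it see the duration and error status. A head-based sampler has neither piece of information when it decides.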
Debugging Scenarios
- Trace shows a 3-second gap with no spans: The gap is likely a thread wait — a lock, semaphore, or blocking I/O. Look at the thread dump for that service in Dynatrace's Thread Analysis view.
- External API calls show no sub-spans: The external service is not instrumented by Dynatrace (expected). The span boundary stops at the outbound HTTP call — you'll only see the total round-trip time.
- PurePaths missing for some requests: Sampling may be in effect for healthy requests. Also check if the service is missing OneAgent injection (especially common on containerised environments post-deployment).
- Trace IDs not propagating across async messaging (Kafka/SQS): Dynatrace supports automatic context propagation for major messaging systems, but it must be enabled in the messaging integration settings and the library version must be supported.
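For the first scenario, gaps can also be located programmatically once span timings are exported. The sketch below finds intervals inside a parent span that no child span covers. The interval representation and the `find_gaps` name are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms)


def find_gaps(
    parent: Interval, children: List[Interval], min_gap_ms: int = 1
) -> List[Interval]:
    """Return parts of the parent interval not covered by any child span."""
    gaps: List[Interval] = []
    cursor = parent[0]  # end of the time covered so far
    for start, end in sorted(children):
        if start - cursor >= min_gap_ms:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # max() handles overlapping children
    if parent[1] - cursor >= min_gap_ms:
        gaps.append((cursor, parent[1]))
    return gaps


if __name__ == "__main__":
    # Hypothetical 5-second request with a 3-second hole between two calls
    print(find_gaps((0, 5000), [(100, 900), (3900, 4800)], min_gap_ms=1000))
```

Each reported gap is a time window worth checking against thread dumps or lock contention data for the same service.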
Real-world Use Case
A fintech company's transaction API intermittently took 15+ seconds. The problem affected 0.3% of requests, too rare to reproduce in testing, and traditional log search found nothing. PurePath analysis of the 15-second traces immediately revealed the pattern: every slow trace contained a Redis GET call in the fraud-detection-service that occasionally took 14 seconds, while normal Redis calls took 2ms. The key pattern involved a hot key that was being evicted under memory pressure, a configuration issue invisible in aggregate metrics but immediately visible in individual PurePaths.
Interview Questions
Beginner
A distributed trace is an end-to-end record of a single request as it moves through multiple services, composed of linked spans. It enables latency attribution and error localisation across microservice boundaries.
A span is a named, timed unit of work within a trace, representing one operation in one service (e.g., HTTP call, database query, method invocation). Spans have parent-child relationships that form a trace tree.
PurePath is Dynatrace's name for a distributed trace. OneAgent captures it automatically, without code instrumentation, at method-level granularity that includes every database call and external request.
Context propagation means passing the trace ID and span ID in HTTP headers (W3C traceparent) so that downstream services can link their spans to the originating trace, enabling end-to-end correlation. Dynatrace does this automatically.
Adaptive sampling means capturing 100% of anomalous or slow traces while sampling normal healthy traffic, so that problematic requests are always available for debugging, even at high throughput.
Intermediate
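OTel tracing typically needs SDK integration or a per-language auto-instrumentation agent, and its spans usually sit at service boundaries. PurePath is captured automatically through OneAgent bytecode instrumentation, with method-level granularity inside services (not just at service boundaries) and without any code changes.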
Open the PurePath waterfall. If you see many repeated identical database spans (e.g., 50 SELECT queries with the same pattern), that's an N+1. Filter traces by high DB call count to surface these systematically.
Service Flow is a Dynatrace view showing the exact call path of a specific request type (e.g., POST /checkout) through all services, with throughput, latency, and error rate per hop. It differs from Smartscape, which shows all dependencies.
Dynatrace injects trace context into message headers for supported messaging libraries. The consuming service reads the context and continues the trace — enabling end-to-end visibility across async boundaries.
Service-boundary tracing tells you that service A was slow. Method-level tracing tells you which specific method inside service A was slow — e.g., a specific SQL query or third-party library call — reducing the debugging scope dramatically.
Scenario-based
Open the PurePath waterfall. Look for gaps between spans — time when no span is executing. This usually indicates a thread pool wait, a synchronisation lock, a retry delay, or time in a queue/message broker.
Tail latency issue — a small subset of requests are extremely slow while most are fast. Filter PurePaths for duration >5000ms. Look for a pattern in the slow traces (same DB query, same external call, same user segment) to isolate the cause.
Filter PurePaths for the payment endpoint with error status. Open 5-10 errored traces. Compare their span pattern with successful traces — look for a specific span that is absent or errored in failed traces but present in successful ones. That span identifies the failure point.
Open PurePaths for that specific endpoint before and after deployment. Compare the waterfalls: look for new spans introduced by the deployment (new DB queries, new service calls) or existing spans that became slower. The method hotspots view will show new methods introduced by the deploy.
Export P99 latency metric from Dynatrace Metrics API for 7 days pre-deployment and 7 days post-deployment. Show the timeline with the deployment marker. Also export DB call count per request — the decrease confirms the cache is being hit.
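Once latency samples for the two periods have been exported, the before/after comparison described above reduces to a few lines. This is a sketch that uses the nearest-rank percentile definition. The function names are hypothetical, and the export step itself is not shown.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_latency(
    before_ms: List[float], after_ms: List[float]
) -> Dict[str, Dict[str, float]]:
    """P50 and P99 for both periods, with the change (negative means faster)."""
    report: Dict[str, Dict[str, float]] = {}
    for name, pct in (("p50", 50), ("p99", 99)):
        b, a = percentile(before_ms, pct), percentile(after_ms, pct)
        report[name] = {"before_ms": b, "after_ms": a, "change_ms": a - b}
    return report
```

Reporting P50 next to P99 shows whether a change improved the tail only or shifted the whole distribution.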
Summary
Dynatrace PurePath provides automatic, method-level, always-on distributed tracing that is integrated with AI-driven root cause analysis. Knowing how to navigate PurePath waterfalls, Service Flow maps, and DB span patterns lets you narrow down most latency and error issues in production quickly.