Intermediate · Lesson 6 of 9

Distributed Tracing

Master PurePath technology to trace every request end-to-end across microservices, pinpoint latency sources, and analyse service flow maps.

Simple Explanation (ELI5)

Imagine every user request is a parcel moving through a postal network. Distributed tracing puts a GPS tracker on the parcel, recording every sorting facility (service), every vehicle (network hop), and every delay (latency) along the way. When a delivery is late, you can replay the route and see exactly where the parcel got stuck.

What is Distributed Tracing?

In a microservices architecture, a single user action can trigger calls across dozens of services. Distributed tracing captures this entire request journey as a trace: a collection of linked spans, each representing one unit of work in one service. In Dynatrace, a trace is called a PurePath. It is a code-level trace captured automatically by OneAgent once the agent is installed, with no per-service code changes required.
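
To make the trace and span vocabulary concrete, here is a minimal sketch of how spans link into a trace tree through parent IDs. The `Span` class, its field names, and `build_trace` are illustrative assumptions, not a Dynatrace or OpenTelemetry data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span record; real tracers carry many more attributes.
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_trace(spans: list[Span]) -> Span:
    """Link spans by parent_id and return the root of the trace tree."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


spans = [
    Span("a", None, "frontend-service", "POST /checkout/submit", 0, 3420),
    Span("b", "a", "order-service", "HTTP POST /orders", 45, 3180),
    Span("c", "b", "orders-db", "SELECT orders", 50, 890),
]
root = build_trace(spans)
print(root.name, root.duration_ms, [c.name for c in root.children])
```

Every span except the root names its parent, which is all that is needed to rebuild the full request journey from spans reported independently by each service.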

PurePath vs Standard Distributed Tracing

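Feature | Standard Tracing (Jaeger/Zipkin/OTel) | Dynatrace PurePath
--- | --- | ---
Instrumentation | SDK integration or per-language auto-instrumentation agents, set up per service | Automatic (OneAgent bytecode instrumentation)
Granularity | Spans mainly at service and library boundaries | Method-level within each service
DB query capture | Depends on the instrumentation library; otherwise manual span creation | Automatic: database calls are captured as spans
Sampling | Head- or tail-based (head-based may miss errors) | Adaptive: aims to retain errored and slow traces
Correlation with infra | Manual linking | Automatic via Smartscape
Context propagation | W3C Trace Context via configured propagators | Automatic; also supports W3C Trace Context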

PurePath Anatomy

text — PurePath waterfall structure
POST /checkout/submit     [total: 3,420ms]
  [frontend-service]      [0ms - 3420ms]
  |
  +-- HTTP POST /orders   [45ms - 3180ms]   (order-service)
  |     |
  |     +-- SELECT orders [50ms - 890ms]    (orders-db: PostgreSQL)
  |     |     Query: SELECT * FROM orders WHERE user_id = ?
  |     |     Rows: 1,240   ← HIGH (N+1 risk)
  |     |
  |     +-- HTTP GET /inventory  [950ms - 2100ms]  (inventory-service)
  |     |     |
  |     |     +-- GET redis:item:* [955ms - 1640ms] (Redis)
  |     |           MISS (cache cold) ← ROOT CAUSE of latency
  |     |
  |     +-- HTTP POST /payment  [2150ms - 3100ms] (payment-service)
  |           Duration: 950ms (within SLA)
  |
  +-- async: kafka:order.created  [3180ms]  (Kafka)

Total database calls: 47   ← N+1 pattern detected
Total external calls: 2
Flagged span: Redis cache miss chain (685ms, from 955ms to 1640ms)

Service Flow Maps

Dynatrace automatically builds Service Flow maps — visualising how a specific type of request (e.g., POST /checkout/submit) flows through the system. For each service in the flow it shows throughput, response time, and error rate. This is different from Smartscape (which shows all connections) — Service Flow shows only the path for the specific transaction type you're analysing.
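
As a rough illustration of the per-service figures a Service Flow presents, the sketch below aggregates call count, mean response time, and error rate from spans belonging to one request type. The record layout (`service`, `duration_ms`, `error`) and the sample values are assumptions made for this example; Dynatrace computes these figures internally.

```python
from collections import defaultdict


def service_flow_stats(spans):
    """Aggregate call count, mean duration, and error rate per service."""
    totals = defaultdict(lambda: {"calls": 0, "duration_ms": 0, "errors": 0})
    for span in spans:
        t = totals[span["service"]]
        t["calls"] += 1
        t["duration_ms"] += span["duration_ms"]
        t["errors"] += 1 if span["error"] else 0
    return {
        service: {
            "calls": t["calls"],
            "mean_ms": t["duration_ms"] / t["calls"],
            "error_rate": t["errors"] / t["calls"],
        }
        for service, t in totals.items()
    }


# Spans from two hypothetical traces of the same request type.
flow_spans = [
    {"service": "frontend-service", "duration_ms": 3420, "error": False},
    {"service": "order-service", "duration_ms": 200, "error": False},
    {"service": "frontend-service", "duration_ms": 400, "error": False},
    {"service": "order-service", "duration_ms": 100, "error": True},
]
stats = service_flow_stats(flow_spans)
for service, figures in stats.items():
    print(service, figures)
```

Because only spans from the chosen request type are fed in, the result describes that transaction's path rather than every connection in the environment.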

Context Propagation

http — W3C TraceContext headers (automatic in Dynatrace)
# Dynatrace automatically injects these headers on outbound HTTP calls:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
# Format: version - trace-id (128-bit, 32 hex chars) - parent span-id (64-bit, 16 hex chars) - trace-flags

tracestate: dt=fw4;8e80;1;0;0;0;2;0;0;fe;2h01;7h4bf92f35

# On the receiving service, OneAgent reads these headers and
# automatically continues the trace — linking the new span to the parent
# No code changes required
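
For readers who want to see what a receiver does with the header, here is a small sketch that splits a version `00` `traceparent` value into its four fields. `parse_traceparent` is a hypothetical helper written for this lesson; it covers only the version `00` layout and is not how OneAgent is implemented.

```python
import re

# Version 00 layout: 2 hex - 32 hex - 16 hex - 2 hex, all lowercase.
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting invalid values."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    parts = match.groupdict()
    if parts["version"] == "ff":
        raise ValueError("version ff is not allowed")
    if set(parts["trace_id"]) == {"0"} or set(parts["parent_id"]) == {"0"}:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    # Bit 0 of the flags byte is the 'sampled' flag.
    parts["sampled"] = bool(int(parts["flags"], 16) & 0x01)
    return parts


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(parsed)
```

The receiving service keeps the same trace-id and records the incoming parent-id as the parent of its own new span, which is what links the two services into one trace.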

Querying PurePaths via Dynatrace API

bash — Fetch slow PurePaths via Dynatrace Traces API
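# Get the slowest 10 traces for the checkout endpoint in the last 30 minutes.
# The endpoint path and request fields below are illustrative; confirm the
# trace query interface available in your environment in the Dynatrace API docs.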
curl -s -X POST \
  "https://your-env.live.dynatrace.com/api/v2/traces/filter" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "timeFrame": "last30minutes",
    "filter": {
      "service": "checkout-service",
      "endpoint": "POST /checkout/submit",
      "minDuration": 2000,
      "status": "ANY"
    },
    "sorting": {"direction": "DESCENDING", "sortBy": "DURATION"},
    "pageSize": 10
  }' | jq '.traces[] | {traceId: .traceId, duration: .duration, status: .status}'
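
The request above combines three ideas: a filter (endpoint and minimum duration), a descending sort by duration, and a page size. The sketch below applies the same selection to trace records that have already been exported. The function name and the record fields are assumptions for illustration and do not describe a Dynatrace response format.

```python
def slowest_traces(traces, endpoint, min_duration_ms, page_size):
    """Return up to page_size matching traces, slowest first."""
    matching = [
        t for t in traces
        if t["endpoint"] == endpoint and t["duration_ms"] >= min_duration_ms
    ]
    matching.sort(key=lambda t: t["duration_ms"], reverse=True)
    return matching[:page_size]


# Hypothetical exported records.
exported = [
    {"traceId": "t1", "endpoint": "POST /checkout/submit", "duration_ms": 3420},
    {"traceId": "t2", "endpoint": "POST /checkout/submit", "duration_ms": 180},
    {"traceId": "t3", "endpoint": "GET /cart", "duration_ms": 5000},
    {"traceId": "t4", "endpoint": "POST /checkout/submit", "duration_ms": 8100},
]
top = slowest_traces(exported, "POST /checkout/submit", 2000, 10)
print([t["traceId"] for t in top])
```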

Analysing a Slow PurePath — Step-by-Step

  1. Open the service in Dynatrace UI and go to PurePaths.
  2. Filter for slow traces — sort by response time descending, open the slowest one.
  3. Expand the waterfall — look for the deepest span with the largest latency bar.
  4. Check DB spans — look for high row counts, missing indexes, or N+1 patterns.
  5. Check external call spans — identify slow third-party APIs or timeout patterns.
  6. Review method hotspots inside each service span for CPU-intensive methods.
  7. Confirm the path — if the slow span is on a specific dependency, validate with that team.
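
Step 3 above can be approximated in code: starting at the root, keep following the child span with the longest duration until a leaf is reached. The sketch uses the span timings from the checkout waterfall shown earlier in this lesson; the nested-dictionary layout and `hottest_path` are assumptions for illustration. Treat the result as a starting point for investigation, since the longest child is not always the real cause.

```python
def duration(span):
    return span["end_ms"] - span["start_ms"]


def hottest_path(span):
    """Follow the longest-running child at each level down to a leaf."""
    path = [span]
    while span.get("children"):
        span = max(span["children"], key=duration)
        path.append(span)
    return path


# Timings copied from the checkout waterfall example.
trace = {
    "name": "POST /checkout/submit", "start_ms": 0, "end_ms": 3420,
    "children": [
        {"name": "HTTP POST /orders", "start_ms": 45, "end_ms": 3180,
         "children": [
             {"name": "SELECT orders", "start_ms": 50, "end_ms": 890},
             {"name": "HTTP GET /inventory", "start_ms": 950, "end_ms": 2100,
              "children": [
                  {"name": "GET redis:item:*", "start_ms": 955, "end_ms": 1640},
              ]},
             {"name": "HTTP POST /payment", "start_ms": 2150, "end_ms": 3100},
         ]},
        {"name": "kafka:order.created", "start_ms": 3180, "end_ms": 3180},
    ],
}

path = hottest_path(trace)
for depth, span in enumerate(path):
    print("  " * depth + f"{span['name']} ({duration(span)}ms)")
```

The walk ends at the Redis lookup inside inventory-service, which is the span the waterfall flags. Steps 4 to 6 then apply to the spans along this path.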

Adaptive Sampling in Dynatrace

Head-based sampling decides whether to keep a trace when the request starts, before its outcome is known, so slow or failing requests can be dropped by chance. Dynatrace instead uses adaptive sampling: it prioritises transactions that are slow, have errors, or are anomalous, and samples normal healthy traffic. This makes it much less likely that a problematic request goes unrecorded, even in high-volume production environments.

Tip: At 10,000 req/sec Dynatrace may not store every healthy trace, but it is designed to retain errored and slow traces, which are the ones you actually need for debugging.
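
The idea can be sketched as a keep-or-drop decision made once a request's outcome is known. This is a simplified illustration only, not Dynatrace's actual algorithm; `keep_trace`, the 2000 ms threshold, and the 10% healthy sample rate are all assumptions chosen for the example.

```python
import random


def keep_trace(trace, slow_threshold_ms=2000, healthy_sample_rate=0.1, rng=random):
    """Keep every errored or slow trace; keep healthy traces with a fixed probability."""
    if trace["error"] or trace["duration_ms"] >= slow_threshold_ms:
        return True
    return rng.random() < healthy_sample_rate


# Simulated traffic: mostly healthy, a few slow or failed requests.
seeded = random.Random(42)
traffic = (
    [{"error": False, "duration_ms": 120}] * 1000
    + [{"error": True, "duration_ms": 90}] * 5
    + [{"error": False, "duration_ms": 15000}] * 3
)
kept = [t for t in traffic if keep_trace(t, rng=seeded)]
problem_kept = [t for t in kept if t["error"] or t["duration_ms"] >= 2000]
print(len(kept), "kept in total;", len(problem_kept), "of 8 problem traces kept")
```

Under this policy the eight problem traces are always retained, while only a fraction of the thousand healthy ones are stored. A head-based sampler cannot make this distinction because it decides before the error or the latency exists.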

Debugging Scenarios

Real-world Use Case

A fintech company's transaction API intermittently took 15+ seconds, affecting 0.3% of requests, which was too rare to reproduce in testing. Traditional log search found nothing. PurePath analysis of the 15-second traces revealed the pattern: the slow traces all contained a Redis GET call in the fraud-detection-service that occasionally took 14 seconds, while normal Redis calls took 2ms. The slow calls involved a hot key that was being evicted under memory pressure, a configuration issue invisible in aggregate metrics but visible in individual PurePaths.

Interview Questions

Beginner

What is a distributed trace?

An end-to-end record of a single request as it moves through multiple services — composed of linked spans. Enables latency attribution and error localisation across microservice boundaries.

What is a span?

A named, timed unit of work within a trace — representing one operation in one service (e.g., HTTP call, database query, method invocation). Spans have a parent-child relationship forming a trace tree.

What is PurePath in Dynatrace?

Dynatrace's name for a distributed trace — captured automatically by OneAgent without code instrumentation, with method-level granularity including every database call and external request.

What is trace context propagation?

Passing the trace ID and span ID in HTTP headers (W3C traceparent) so downstream services can link their spans to the originating trace, enabling end-to-end correlation. Dynatrace does this automatically.

What is adaptive sampling?

Prioritising anomalous, errored, or slow traces while sampling normal healthy traffic, so that problematic requests remain available for debugging even at high throughput.

Intermediate

How does PurePath differ from standard OpenTelemetry tracing?

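OTel tracing relies on SDK instrumentation or language-specific auto-instrumentation agents that are added and configured per service, and it typically produces spans at framework and library boundaries. PurePath is captured automatically through OneAgent bytecode instrumentation, adding method-level detail inside services (not just at service boundaries) without any code changes.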

How do you find an N+1 database query using distributed traces?

Open the PurePath waterfall. If you see many repeated identical database spans (e.g., 50 SELECT queries with the same pattern), that's an N+1. Filter traces by high DB call count to surface these systematically.
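
The same check can be sketched programmatically: count the database spans in one trace by normalised statement and flag any statement that repeats many times. The span layout, the `order_items` query, and the threshold of 10 are assumptions for illustration.

```python
from collections import Counter


def find_repeated_queries(db_spans, threshold=10):
    """Return statements executed at least `threshold` times within one trace."""
    counts = Counter(span["statement"] for span in db_spans)
    return {stmt: n for stmt, n in counts.items() if n >= threshold}


# One parent query followed by one query per returned row: the N+1 shape.
db_spans = [{"statement": "SELECT * FROM orders WHERE user_id = ?"}] + [
    {"statement": "SELECT * FROM order_items WHERE order_id = ?"}
    for _ in range(46)
]
suspects = find_repeated_queries(db_spans)
print(len(db_spans), "DB calls;", suspects)
```

Grouping by the parameterised statement rather than the literal SQL matters here, because each repeated query differs only in its bound value.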

What is a Service Flow map?

A Dynatrace view showing the exact call path of a specific request type (e.g., POST /checkout) through all services — with throughput, latency, and error rate per hop. Different from Smartscape which shows all dependencies.

How does Dynatrace handle traces for async messaging (Kafka)?

Dynatrace injects trace context into message headers for supported messaging libraries. The consuming service reads the context and continues the trace — enabling end-to-end visibility across async boundaries.
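
A rough sketch of the mechanism, independent of any particular client library: the producer writes a `traceparent` value into the message headers, and the consumer reads it back before starting its own span. Modelling headers as a list of `(key, bytes)` pairs and the helper names below are assumptions for illustration; with OneAgent this happens without application code.

```python
import secrets


def child_traceparent(parent_header: str) -> str:
    """Keep the trace-id and flags, but identify the producer's own span as the new parent."""
    version, trace_id, _incoming_parent, flags = parent_header.split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


def inject(headers, traceparent):
    """Producer side: attach the trace context to the outgoing message headers."""
    return headers + [("traceparent", traceparent.encode("ascii"))]


def extract(headers):
    """Consumer side: read the trace context, or None if the message carries none."""
    for key, value in headers:
        if key == "traceparent":
            return value.decode("ascii")
    return None


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
message_headers = inject([], child_traceparent(incoming))
received = extract(message_headers)
print(received)
```

The consumer may process the message seconds or minutes later, yet its spans still join the original trace because the trace-id travels with the message.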

What is the advantage of method-level tracing over service-boundary tracing?

Service-boundary tracing tells you that service A was slow. Method-level tracing tells you which specific method inside service A was slow — e.g., a specific SQL query or third-party library call — reducing the debugging scope dramatically.

Scenario-based

A request takes 8 seconds total. Metrics show all individual services are fast. Where is the time going?

Open the PurePath waterfall. Look for gaps between spans — time when no span is executing. This usually indicates a thread pool wait, a synchronisation lock, a retry delay, or time in a queue/message broker.
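
The gap can be quantified as the part of a parent span that no child span covers. Below is a minimal sketch; `uncovered_time` is a hypothetical helper and the timings for the 8-second request are made up for the example.

```python
def uncovered_time(parent_start, parent_end, child_intervals):
    """Milliseconds inside the parent span that are not covered by any child span."""
    covered = 0
    cursor = parent_start  # everything before the cursor is already accounted for
    for start, end in sorted(child_intervals):
        start = max(start, cursor)   # ignore the part that overlaps earlier children
        end = min(end, parent_end)   # ignore anything past the parent's end
        if end > start:
            covered += end - start
            cursor = end
    return (parent_end - parent_start) - covered


# An 8-second request whose three downstream calls each take 300 ms.
gap_ms = uncovered_time(0, 8000, [(100, 400), (4500, 4800), (7600, 7900)])
print(gap_ms, "ms not attributed to any child span")

# For comparison, the order-service span from the checkout waterfall.
order_gap_ms = uncovered_time(45, 3180, [(50, 890), (950, 2100), (2150, 3100)])
print(order_gap_ms, "ms of self time in order-service")
```

In the first case every child is fast, which matches what the per-service metrics show, while most of the request is spent in the gaps between them.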

P99 latency is 12 seconds but P50 is 200ms. What does this tell you and how do you investigate?

Tail latency issue — a small subset of requests are extremely slow while most are fast. Filter PurePaths for duration >5000ms. Look for a pattern in the slow traces (same DB query, same external call, same user segment) to isolate the cause.
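
A small worked example shows how a few slow requests dominate P99 while leaving P50 untouched. It uses the nearest-rank percentile definition, and the latency values are invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 98 normal requests and 2 very slow ones, in milliseconds.
latencies_ms = [200] * 98 + [12000] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
slow = [v for v in latencies_ms if v > 5000]
print(f"P50={p50}ms P99={p99}ms slow traces to inspect: {len(slow)}")
```

Only 2% of requests are slow here, yet they set the P99. Those are the traces worth opening one by one.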

How would you use distributed tracing to investigate a payment failure that affects 0.5% of transactions?

Filter PurePaths for the payment endpoint with error status. Open 5-10 errored traces. Compare their span pattern with successful traces — look for a specific span that is absent or errored in failed traces but present in successful ones. That span identifies the failure point.
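
The comparison can be sketched with set operations over span names: find spans that appear in every failed trace but in no successful one, and spans that every successful trace has but failed ones lack. Representing each trace as a list of span names, and the names themselves, are assumptions for illustration.

```python
def distinguishing_spans(failed_traces, ok_traces):
    """Compare span names between failed and successful traces."""
    failed_sets = [set(t) for t in failed_traces]
    ok_sets = [set(t) for t in ok_traces]
    return {
        # Present in every failed trace, never seen in a successful one.
        "only_in_failed": set.intersection(*failed_sets) - set.union(*ok_sets),
        # Present in every successful trace, never seen in a failed one.
        "missing_in_failed": set.intersection(*ok_sets) - set.union(*failed_sets),
    }


ok_traces = [
    ["POST /payment", "GET /fraud-score", "INSERT payments"],
    ["POST /payment", "GET /fraud-score", "INSERT payments"],
]
failed_traces = [
    ["POST /payment", "GET /fraud-score", "GET /fraud-score (retry)"],
    ["POST /payment", "GET /fraud-score", "GET /fraud-score (retry)"],
]
diff = distinguishing_spans(failed_traces, ok_traces)
print(diff)
```

In this made-up data the failed traces all retry the fraud check and never reach the insert, which points the investigation at the fraud-score call.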

A new feature was deployed and now a specific API endpoint is 3x slower. No other endpoints are affected. What do you investigate?

Open PurePaths for that specific endpoint before and after deployment. Compare the waterfalls: look for new spans introduced by the deployment (new DB queries, new service calls) or existing spans that became slower. Method hotspots can then show which methods inside the service account for the added time.

You need to prove to management that adding a cache reduced P99 latency. How?

Export P99 latency metric from Dynatrace Metrics API for 7 days pre-deployment and 7 days post-deployment. Show the timeline with the deployment marker. Also export DB call count per request — the decrease confirms the cache is being hit.

Summary

Dynatrace PurePath provides automatic, method-level, always-on distributed tracing that is integrated with AI-driven root cause analysis. Knowing how to navigate PurePath waterfalls, Service Flow maps, and DB span patterns lets you narrow down most production latency and error issues quickly.