Intermediate · Lesson 5 of 9

AI-based Monitoring

Understand how Davis AI eliminates alert noise, automatically baselines behaviour, performs causal root cause analysis, and converts thousands of events into a single actionable problem card.

Simple Explanation (ELI5)

Traditional monitoring fires an alert for every threshold breach — and during a major incident you might get 500 alerts at once. Davis AI watches everything continuously, learns what "normal" looks like for each individual metric, and when something goes wrong it connects all the related alerts together, traces back to the one root cause, and tells you exactly what broke first — one problem card, not 500 pings.

What is Davis AI?

Davis AI is Dynatrace's deterministic causal AI engine — not a black-box machine learning model that "guesses." It uses the Smartscape topology graph to understand dependency relationships, then applies causal analysis: when anomalies appear across multiple entities, Davis traces them back to their origin through the dependency chain. The result is a Problem Card with a single root cause and a full impact blast radius.

Davis AI Architecture

All Entities (Metrics, Events) → Automatic Baselining → Anomaly Detection → Causal Analysis (Smartscape graph) → Problem Card (Root cause)

Key Davis AI Capabilities

Automatic Baselining

Davis learns the normal value range for every metric — response time, error rate, CPU, throughput — including time-of-day, day-of-week, and seasonal patterns. No static thresholds to configure.

Anomaly Detection

Fires when a metric deviates beyond its learned baseline. Because baselines are dynamic, Davis avoids false positives during planned load increases or nightly batch jobs.
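To make the idea of a time-aware baseline concrete, here is a minimal sketch in Python. It is not Davis's actual algorithm, which this lesson does not describe. It assumes a simple model: one mean and standard deviation per hour-of-week bucket, with a value flagged when it exceeds the bucket mean by `k` standard deviations. The function names, the `k` parameter, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

HOURS_PER_WEEK = 168


def learn_baseline(samples):
    """Learn (mean, stdev) per hour-of-week bucket.

    samples: iterable of (hour_counter, value), where hour_counter counts
    hours since some Monday 00:00 (an assumption made for this sketch).
    """
    buckets = defaultdict(list)
    for hour_counter, value in samples:
        buckets[hour_counter % HOURS_PER_WEEK].append(value)
    # stdev needs at least two observations per bucket.
    return {b: (mean(v), stdev(v)) for b, v in buckets.items() if len(v) >= 2}


def is_anomalous(baseline, hour_counter, value, k=3.0):
    """Flag a value more than k standard deviations above its bucket mean."""
    bucket = hour_counter % HOURS_PER_WEEK
    if bucket not in baseline:
        return False  # not enough history for this hour of the week
    mu, sigma = baseline[bucket]
    return value > mu + k * max(sigma, 1e-9)


# Four weeks of made-up response times (ms): Monday 09:00 is always busy.
history = []
for week in range(4):
    for hour in range(HOURS_PER_WEEK):
        typical = 500.0 if hour == 9 else 100.0
        history.append((week * HOURS_PER_WEEK + hour, typical + week))

baseline = learn_baseline(history)
next_monday_9am = 4 * HOURS_PER_WEEK + 9
next_monday_10am = 4 * HOURS_PER_WEEK + 10

# The same 504 ms reading is normal at 09:00 but unusual at 10:00.
peak_is_anomalous = is_anomalous(baseline, next_monday_9am, 504.0)
offpeak_is_anomalous = is_anomalous(baseline, next_monday_10am, 504.0)
```

The point of the example is that one static threshold cannot treat 504 ms as both acceptable and alarming, while a per-bucket baseline can.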

Root Cause Analysis

Uses Smartscape topology to trace the origin of an anomaly cascade. If 40 services are degraded, Davis identifies the single infrastructure or service event that started the chain.
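The tracing step can be illustrated with a small dependency graph. The sketch below is a simplification, not the Smartscape data model or Davis's real logic: it assumes a plain dictionary of "entity calls these entities" and treats an unhealthy entity as a root-cause candidate when none of its own dependencies are unhealthy. The entity names are borrowed from the problem card example later in this lesson; the function names are hypothetical.

```python
from collections import defaultdict


def find_root_causes(calls, anomalous):
    """Return anomalous entities whose dependencies are all healthy.

    calls: {entity: [entities it depends on]}
    anomalous: set of entities currently showing an anomaly
    """
    return sorted(
        entity
        for entity in anomalous
        if not any(dep in anomalous for dep in calls.get(entity, []))
    )


def blast_radius(calls, root):
    """Return every entity that depends on root, directly or transitively."""
    dependents = defaultdict(set)
    for entity, deps in calls.items():
        for dep in deps:
            dependents[dep].add(entity)

    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for upstream in dependents[node]:
            if upstream not in seen:
                seen.add(upstream)
                stack.append(upstream)
    return seen


calls = {
    "checkout-service": ["order-service"],
    "order-service": ["orders-db"],
    "orders-db": [],
}
anomalous = {"checkout-service", "order-service", "orders-db"}

roots = find_root_causes(calls, anomalous)
impacted = blast_radius(calls, roots[0])
```

Here `roots` contains only `orders-db`, and `impacted` contains the two services above it in the call chain, which mirrors the "one root cause plus blast radius" shape of a problem card.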

Problem Cards

A single consolidated view of an incident: root cause entity, impact blast radius (affected services, users, SLOs), timeline of events, and suggested remediation steps.

Smart Alerts

Davis only fires a problem when it has high confidence. A 10-second spike does not trigger a page. Persistent anomalies across multiple signals do — dramatically reducing alert fatigue.
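One common way to ignore short spikes is to require that a violation persists across several consecutive observation intervals. The sketch below shows that general technique under stated assumptions: a sliding window of the last `window` samples, raising only when at least `required` of them are violating. The class name and defaults are hypothetical and are not taken from Davis.

```python
from collections import deque


class PersistenceGate:
    """Signal only when `required` of the last `window` samples are violating."""

    def __init__(self, window=5, required=3):
        self.required = required
        self.recent = deque(maxlen=window)  # oldest sample drops off automatically

    def observe(self, violating):
        """Record one sample and report whether the gate is currently open."""
        self.recent.append(bool(violating))
        return sum(self.recent) >= self.required


# A single spike never opens the gate.
spike_gate = PersistenceGate()
spike_results = [spike_gate.observe(v) for v in [False, True, False, False, False]]

# A sustained violation opens it on the third violating sample.
sustained_gate = PersistenceGate()
sustained_results = [sustained_gate.observe(v) for v in [False, True, True, True]]
```

The trade-off is detection delay: a larger `required` value suppresses more noise but pages later.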

Davis Explanations

Every problem card includes a plain-English explanation: "The high response time of checkout-service is caused by slow database queries on orders-db, which is experiencing high I/O due to missing indexes."

Traditional Alerting vs Davis AI

| Aspect | Traditional (Threshold) Alerting | Davis AI |
| --- | --- | --- |
| Alert volume per incident | 100s of individual alerts | 1 problem card |
| Threshold configuration | Manual per metric per service | Fully automatic (learned) |
| Seasonal patterns | Static thresholds fire during peak | Baseline adapts to time patterns |
| Root cause | Engineer must correlate manually | Automatically identified |
| Alert storms after deploy | Common (all services alarm simultaneously) | Davis correlates to deployment event |
| MTTD | Minutes to hours (manual triage) | Seconds to minutes (automatic) |

Anatomy of a Davis Problem Card

text — Problem card structure
PROBLEM: Response time degradation in checkout-service
Severity: Performance (Red)
Status: OPEN
Duration: 14 minutes
Affected users: ~1,200 (estimated)

ROOT CAUSE:
  Entity: orders-db (PostgreSQL process group)
  Anomaly: Database query response time 8.2x above baseline
  First detected: 14:03:22 UTC

IMPACT CHAIN:
  orders-db (slow queries)
    └── order-service (high response time, 420% baseline)
          └── checkout-service (response time degradation)
                └── 1,200 users (Apdex 0.44)

CONTRIBUTING EVENTS:
  [14:01] Deployment: order-service v2.3.1 deployed
  [14:03] DB query time anomaly detected on orders-db
  [14:05] order-service response time anomaly detected
  [14:07] checkout-service Apdex degradation detected

DAVIS EXPLANATION:
  "Deployment of order-service v2.3.1 introduced a query that
  performs a full-table scan on the orders table. This caused
  orders-db response time to increase 8x, which propagated to
  order-service and checkout-service response time degradation."

SUGGESTED ACTION:
  Review recent deployment: order-service v2.3.1
  Analyse database queries changed in this release

Configuring Smart Alerts and Anomaly Detection

json — Custom anomaly detection threshold via API
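// POST /api/v2/settings/objects
// Schema: builtin:anomaly-detection.services
// Field names below are illustrative; confirm them against the schema
// (GET /api/v2/settings/schemas/builtin:anomaly-detection.services) before use.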
{
  "schemaId": "builtin:anomaly-detection.services",
  "scope": "SERVICE-XXXXXXXXXXXXXXXX",
  "value": {
    "responseTime": {
      "enabled": true,
      "detectionMode": "AUTO",
      "autoDetection": {
        "responseTimeAll": {
          "degradationMilliseconds": 100,
          "degradationPercent": 50,
          "slowestResponseTimeAll": {
            "degradationMilliseconds": 200,
            "degradationPercent": 100
          }
        }
      }
    },
    "failureRate": {
      "enabled": true,
      "detectionMode": "AUTO",
      "autoDetection": {
        "failingServiceCallPercentageIncreaseAbsoluteThreshold": {
          "enabled": true,
          "threshold": 5
        }
      }
    }
  }
}
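If you want to send a settings object like this from a script, the sketch below shows one way to prepare the HTTP request with Python's standard library. It only builds the request and does not send it. Assumptions to verify against your environment: the endpoint expects a JSON array of settings objects, and authentication uses an `Authorization: Api-Token ...` header. The base URL, token value, and function name are placeholders.

```python
import json
import urllib.request


def build_settings_request(base_url, api_token, settings_object):
    """Prepare (but do not send) a POST to the Settings API.

    The object is wrapped in a list because the endpoint is assumed to
    accept an array of settings objects.
    """
    body = json.dumps([settings_object]).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/settings/objects",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


# Placeholder values; replace with your environment URL and a real token.
settings_object = {
    "schemaId": "builtin:anomaly-detection.services",
    "scope": "SERVICE-XXXXXXXXXXXXXXXX",
    "value": {"responseTime": {"enabled": True, "detectionMode": "AUTO"}},
}
request = build_settings_request(
    "https://your-env.live.dynatrace.com/", "YOUR_API_TOKEN", settings_object
)

# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test before it touches a live environment.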

Integrating Davis Problems with PagerDuty / Slack

json — Dynatrace notification integration (webhook)
// Configure via Settings -> Integrations -> Problem notifications
// Example custom webhook payload (the template is user-defined), sent to your endpoint when a problem opens, updates, or resolves:
{
  "ProblemID": "P-12345",
  "ProblemTitle": "Response time degradation in checkout-service",
  "ProblemURL": "https://your-env.live.dynatrace.com/#problems/problemdetail;pid=P-12345",
  "State": "OPEN",
  "ProblemSeverity": "PERFORMANCE",
  "ImpactedEntities": [
    {"id": "SERVICE-AAAA", "name": "checkout-service"},
    {"id": "SERVICE-BBBB", "name": "order-service"}
  ],
  "RootCauseEntity": {
    "id": "PROCESS_GROUP_INSTANCE-CCCC",
    "name": "orders-db"
  },
  "Tags": ["environment:production", "team:checkout"]
}
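On the receiving side, a small handler usually reshapes this payload into whatever the chat or paging tool expects. The sketch below converts a payload with the field names shown above into a Slack-style `{"text": ...}` message body. It assumes the payload has exactly the keys from the example; the function name is hypothetical, and posting the result to a webhook URL is left out.

```python
def to_slack_message(problem):
    """Turn a problem notification payload into a Slack-style message body."""
    impacted = ", ".join(e["name"] for e in problem.get("ImpactedEntities", []))
    root = problem.get("RootCauseEntity") or {}
    lines = [
        f"[{problem['State']}] {problem['ProblemID']}: {problem['ProblemTitle']}",
        f"Severity: {problem['ProblemSeverity']}",
        f"Root cause: {root.get('name', 'not identified')}",
        f"Impacted: {impacted or 'none listed'}",
        problem["ProblemURL"],
    ]
    return {"text": "\n".join(lines)}


sample_payload = {
    "ProblemID": "P-12345",
    "ProblemTitle": "Response time degradation in checkout-service",
    "ProblemURL": "https://your-env.live.dynatrace.com/#problems/problemdetail;pid=P-12345",
    "State": "OPEN",
    "ProblemSeverity": "PERFORMANCE",
    "ImpactedEntities": [
        {"id": "SERVICE-AAAA", "name": "checkout-service"},
        {"id": "SERVICE-BBBB", "name": "order-service"},
    ],
    "RootCauseEntity": {"id": "PROCESS_GROUP_INSTANCE-CCCC", "name": "orders-db"},
    "Tags": ["environment:production", "team:checkout"],
}

message = to_slack_message(sample_payload)
```

The handler tolerates a missing root cause because not every problem has one identified at the moment the first notification is sent.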

SLOs and Error Budgets with Davis AI

Dynatrace SLOs (Service Level Objectives) integrate with Davis AI — a problem that burns through the error budget faster than expected automatically elevates the problem severity. This connects operational events to business-level commitments.

json — Create SLO via Dynatrace API
// POST /api/v2/slo
{
  "name": "Checkout Service Availability",
  "description": "99.9% availability SLO for checkout",
  "metricExpression": "100*(builtin:service.errors.total.successCount:splitBy())/(builtin:service.requestCount.total:splitBy())",
  "evaluationType": "AGGREGATE",
  "target": 99.9,
  "warning": 99.95,
  "timeframe": "-1w",
  "filter": "type(SERVICE),entityId(SERVICE-checkout)"
}
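The error budget behind a target like 99.9% is simple arithmetic, and it helps to see it worked through. The sketch below is a generic calculation, not a Dynatrace API: for a given target and request count it derives how many failures are allowed, how many remain, and what share of the budget is consumed. The request and failure counts are made-up numbers.

```python
def error_budget(target_pct, total_requests, failed_requests):
    """Compute error budget figures for a success-rate SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (100.0 - target_pct) / 100.0
    achieved_pct = 100.0 * (total_requests - failed_requests) / total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "achieved_pct": achieved_pct,
        "budget_consumed_pct": 100.0 * failed_requests / allowed_failures,
    }


# One week at a 99.9% target: 1,000,000 requests, 400 of them failed.
budget = error_budget(target_pct=99.9, total_requests=1_000_000, failed_requests=400)
```

With these inputs the budget allows about 1,000 failures, so 400 failures consume roughly 40% of it while the achieved success rate is still about 99.96%.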

Debugging Scenarios

Real-world Use Case

A major bank's production environment generated over 2,000 alerts per day across traditional monitoring tools. After deploying Dynatrace, Davis AI correlated those 2,000 daily events into an average of 12 meaningful problems per day, a 99.4% reduction in alert noise. More importantly, the mean time to resolution (MTTR) for critical incidents dropped from 47 minutes to 8 minutes, because engineers opened a problem card with the root cause already identified rather than an inbox of 200 alerts to correlate manually.

Interview Questions

Beginner

What is Davis AI?

Dynatrace's causal AI engine that automatically baselines every metric, detects anomalies when values deviate from their learned baseline, correlates related anomalies using Smartscape topology, and identifies the root cause in a single problem card.

What is automatic baselining?

Davis learns the normal value range for each metric over time — including time-of-day and weekly patterns. Anomalies are detected relative to this learned baseline rather than a static threshold.

What is a Dynatrace problem card?

A single consolidated incident view showing: root cause entity, impact blast radius (affected services and users), contributing events timeline, plain-English Davis explanation, and suggested action.

Why does Davis reduce alert noise?

Davis correlates all anomalies caused by a single root event into one problem card. A root cause affecting 40 services doesn't generate 40 alerts — it generates one problem with 40 impacted entities listed.

What is an SLO in Dynatrace?

A Service Level Objective is a user-defined target for a service metric (availability, response time). Dynatrace monitors SLO compliance, and error budget burn status is shown in real time.

Intermediate

How does Davis perform root cause analysis?

Davis uses the Smartscape dependency graph to trace anomalies backwards. If service B is degraded and it calls service A which is also degraded, and service A's host has a disk issue, Davis traces: disk problem → service A degradation → service B impact.

What is the difference between Davis AI and traditional threshold alerting?

Threshold alerting fires on every breach of a fixed value — producing alert storms. Davis uses dynamic baselines (no configuration), correlates alerts into single problems, and applies causal analysis to identify root cause.

How does Davis handle seasonal traffic patterns?

The automatic baseline learns weekly and daily patterns. A Monday morning traffic spike that consistently occurs doesn't trigger anomalies — Davis's baseline accounts for it. New unusual spikes are still detected.

How do you suppress Davis problems during maintenance?

Create a Maintenance Window in Settings → Maintenance Windows — specify the time range, affected entities, and whether to suppress problem creation, alerting, or both.

What is a Davis "event" vs a Davis "problem"?

An event is a single anomaly on a single entity (e.g., response time spike on service A). A problem is a correlated set of events across entities that Davis has grouped because they share a common root cause.
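The event-versus-problem distinction can be sketched as a grouping step. The code below is an illustration under simple assumptions, not Davis's correlation logic: two events are merged into the same problem when they occur on the same entity or on directly linked entities, and their timestamps are within `max_gap_s` seconds. The `search-service` event, the `max_gap_s` value, and the function name are hypothetical.

```python
def group_events(events, links, max_gap_s=600):
    """Group events into problems using topology links and time proximity.

    events: list of (timestamp_s, entity, description)
    links: iterable of (entity_a, entity_b) dependency pairs
    """
    linked = {frozenset(pair) for pair in links}
    parent = list(range(len(events)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (t1, e1, _) in enumerate(events):
        for j in range(i + 1, len(events)):
            t2, e2, _ = events[j]
            related = e1 == e2 or frozenset((e1, e2)) in linked
            if related and abs(t1 - t2) <= max_gap_s:
                parent[find(i)] = find(j)

    groups = {}
    for i, event in enumerate(events):
        groups.setdefault(find(i), []).append(event)
    # Order events inside each problem, then problems by their first event.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


links = [("order-service", "orders-db"), ("checkout-service", "order-service")]
events = [
    (0, "orders-db", "DB query time anomaly"),
    (60, "search-service", "CPU saturation"),  # unrelated, hypothetical entity
    (120, "order-service", "response time anomaly"),
    (240, "checkout-service", "Apdex degradation"),
]

problems = group_events(events, links)
```

Four events become two problems: the three linked anomalies form one, and the unrelated `search-service` event stays on its own.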

Scenario-based

Davis fires a problem at 3am. You're on-call. What information does the problem card give you immediately?

Root cause entity, the anomaly type (e.g., slow DB queries), impact blast radius (which services/users affected), timeline of events (was there a recent deployment?), Davis's plain-English explanation, and a direct link to the root cause entity's metrics.

A deployment happens and Davis opens 15 problems in 2 minutes. What is happening and how do you handle it?

The deployment likely introduced a regression. All 15 problems will list the same deployment event as a contributing factor. Dynatrace often consolidates these into fewer problems if they share a root cause. Roll back the deployment and confirm problems close automatically.

Your SRE team is evaluated on MTTD. How does Davis improve this metric?

Davis detects anomalies the moment they deviate from baseline, often before users report issues. The problem card delivers a pre-analysed root cause, eliminating the correlation triage phase. Detection time is then measured to Davis's first anomaly detection rather than to the point where an engineer has triaged the first alerts.

Davis identifies a slow database as root cause but the DBA says the DB is fine. Who is right?

Both could be correct for different definitions of "fine." Davis may detect queries running slower than the learned baseline even if the DB itself isn't in a critical state. Open the specific PurePaths Davis cited — look at the exact queries, their execution plans, and compare with the baseline period.

How do you use Davis AI insights in a post-incident review?

Export the problem card timeline (API: GET /api/v2/problems/{problemId}) for exact event sequences. Use it to document when Davis first detected the issue, what the root cause was, which services were impacted, and how long it took to resolve. This gives structured evidence for the incident timeline.
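Once a problem has been exported as JSON, a few lines of Python can turn it into the facts a post-incident review needs. The sketch below works on a hand-written sample rather than a live API response. It assumes the export contains `displayId`, `title`, `startTime` and `endTime` as epoch milliseconds (with `-1` meaning still open), and a `rootCauseEntity` with a `name`; check these field names against a real response. The date is invented for the example.

```python
from datetime import datetime, timezone


def summarize_problem(problem):
    """Extract detection time, duration, and root cause from a problem export."""
    start = datetime.fromtimestamp(problem["startTime"] / 1000, tz=timezone.utc)
    end_ms = problem.get("endTime", -1)
    if end_ms is None or end_ms < 0:
        duration_minutes = None  # problem is still open
    else:
        end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)
        duration_minutes = (end - start).total_seconds() / 60
    root = problem.get("rootCauseEntity") or {}
    return {
        "id": problem["displayId"],
        "title": problem["title"],
        "detected_utc": start.isoformat(),
        "duration_minutes": duration_minutes,
        "root_cause": root.get("name"),
    }


# Hand-written sample; the date is hypothetical.
detected = datetime(2024, 1, 15, 14, 3, 22, tzinfo=timezone.utc)
start_ms = int(detected.timestamp() * 1000)
sample_problem = {
    "displayId": "P-12345",
    "title": "Response time degradation in checkout-service",
    "startTime": start_ms,
    "endTime": start_ms + 14 * 60 * 1000,
    "rootCauseEntity": {"name": "orders-db"},
}

summary = summarize_problem(sample_problem)
```

The resulting dictionary can be pasted into the review document or serialized alongside the raw export for an audit trail.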

Summary

Davis AI transforms monitoring from a reactive, high-noise discipline into a proactive, low-noise operation. By learning baselines automatically, detecting anomalies continuously, and performing causal root cause analysis using Smartscape topology, Davis delivers the most operationally valuable output in monitoring: a single problem card telling you exactly what broke, what it impacted, and why.