AI-based Monitoring
Understand how Davis AI eliminates alert noise, automatically baselines behaviour, performs causal root cause analysis, and converts thousands of events into a single actionable problem card.
Simple Explanation (ELI5)
Traditional monitoring fires an alert for every threshold breach — and during a major incident you might get 500 alerts at once. Davis AI watches everything continuously, learns what "normal" looks like for each individual metric, and when something goes wrong it connects all the related alerts together, traces back to the one root cause, and tells you exactly what broke first — one problem card, not 500 pings.
What is Davis AI?
Davis AI is Dynatrace's deterministic causal AI engine — not a black-box machine learning model that "guesses." It uses the Smartscape topology graph to understand dependency relationships, then applies causal analysis: when anomalies appear across multiple entities, Davis traces them back to their origin through the dependency chain. The result is a Problem Card with a single root cause and a full impact blast radius.
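To make the idea of tracing anomalies back through the dependency chain concrete, here is a minimal sketch. It is a toy model, not Davis's actual algorithm: the dictionary `depends_on` stands in for the Smartscape graph, and `find_root_causes` and `blast_radius` are hypothetical helper names. The rule assumed here is that an anomalous entity is a root cause candidate when none of its own dependencies are anomalous.

```python
from collections import deque
from typing import Dict, List, Set


def find_root_causes(depends_on: Dict[str, List[str]], anomalous: Set[str]) -> Set[str]:
    """Return anomalous entities whose own dependencies are all healthy."""
    roots = set()
    for entity in anomalous:
        dependencies = depends_on.get(entity, [])
        if not any(dep in anomalous for dep in dependencies):
            roots.add(entity)
    return roots


def blast_radius(depends_on: Dict[str, List[str]], anomalous: Set[str], root: str) -> Set[str]:
    """Return anomalous entities that depend, directly or transitively, on root."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers: Dict[str, List[str]] = {}
    for entity, dependencies in depends_on.items():
        for dep in dependencies:
            callers.setdefault(dep, []).append(entity)

    impacted: Set[str] = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller in anomalous and caller != root and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Hypothetical topology: each entity maps to the entities it calls.
depends_on = {
    "checkout-service": ["order-service"],
    "order-service": ["orders-db"],
    "orders-db": [],
    "search-service": ["catalog-db"],
}
anomalous = {"checkout-service", "order-service", "orders-db"}

roots = find_root_causes(depends_on, anomalous)
impacted = blast_radius(depends_on, anomalous, "orders-db")
print(roots, impacted)
```

In this example the only anomalous entity with healthy dependencies is `orders-db`, and the two services that call it, directly or indirectly, form the impact set.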
Davis AI Architecture
[Diagram: Davis AI pipeline. Labels: (Metrics, Events), Baselining, Detection, (Smartscape graph), (Root cause)]
Key Davis AI Capabilities
Davis learns the normal value range for every metric — response time, error rate, CPU, throughput — including time-of-day, day-of-week, and seasonal patterns. No static thresholds to configure.
Anomaly detection fires when a metric deviates beyond its learned baseline. Because baselines are dynamic, Davis avoids false positives during planned load increases or nightly batch jobs.
Root cause analysis uses Smartscape topology to trace the origin of an anomaly cascade. If 40 services are degraded, Davis identifies the single infrastructure or service event that started the chain.
The problem card is a single consolidated view of an incident: root cause entity, impact blast radius (affected services, users, SLOs), timeline of events, and suggested remediation steps.
Davis only fires a problem when it has high confidence. A 10-second spike does not trigger a page. Persistent anomalies across multiple signals do — dramatically reducing alert fatigue.
Every problem card includes a plain-English explanation: "The high response time of checkout-service is caused by slow database queries on orders-db, which is experiencing high I/O due to missing indexes."
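The baselining and noise-reduction ideas above can be illustrated with a small sketch. This is a simplified stand-in, not Dynatrace's baselining method: it assumes one bucket per (weekday, hour), a mean plus a multiple of the standard deviation as the upper bound, and a fixed number of consecutive breaches before anything is raised. The class name `SeasonalBaseline` and its parameters are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


class SeasonalBaseline:
    """Toy time-of-week baseline with a persistence requirement."""

    def __init__(self, tolerance: float = 3.0, min_consecutive: int = 3):
        self.tolerance = tolerance              # allowed standard deviations above the mean
        self.min_consecutive = min_consecutive  # breaches required before raising a problem
        self.samples = defaultdict(list)        # (weekday, hour) -> observed values

    def learn(self, weekday: int, hour: int, value: float) -> None:
        self.samples[(weekday, hour)].append(value)

    def upper_bound(self, weekday: int, hour: int):
        values = self.samples.get((weekday, hour), [])
        if len(values) < 2:
            return None  # not enough history to judge
        return mean(values) + self.tolerance * pstdev(values)

    def is_problem(self, weekday: int, hour: int, recent_values) -> bool:
        bound = self.upper_bound(weekday, hour)
        if bound is None or len(recent_values) < self.min_consecutive:
            return False
        latest = recent_values[-self.min_consecutive:]
        return all(value > bound for value in latest)


baseline = SeasonalBaseline()
# Monday 09:00 is a busy hour; Monday 03:00 is quiet (response times in ms).
for value in [200, 210, 190, 205, 195]:
    baseline.learn(0, 9, value)
for value in [50, 55, 45, 52, 48]:
    baseline.learn(0, 3, value)

print(baseline.is_problem(0, 9, [205, 210, 208]))  # normal for the busy hour
print(baseline.is_problem(0, 3, [205, 210, 208]))  # same values, unusual at night
print(baseline.is_problem(0, 9, [200, 900, 205]))  # a single spike is ignored
```

The same response times are treated as normal in the busy hour and anomalous in the quiet hour, and a single spike does not satisfy the persistence requirement.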
Traditional Alerting vs Davis AI
| Aspect | Traditional (Threshold) Alerting | Davis AI |
|---|---|---|
| Alert volume per incident | 100s of individual alerts | 1 problem card |
| Threshold configuration | Manual per metric per service | Fully automatic (learned) |
| Seasonal patterns | Static thresholds fire during peak | Baseline adapts to time patterns |
| Root cause | Engineer must correlate manually | Automatically identified |
| Alert storms after deploy | Common (all services alarm simultaneously) | Davis correlates to deployment event |
| MTTD | Minutes to hours (manual triage) | Seconds to minutes (automatic) |
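The "alert volume" row can be pictured as a grouping step: anomalous entities that are connected in the topology are reported together. The sketch below is an illustration under that assumption only. It groups by plain connectivity, whereas the article describes Davis as applying causal analysis, and `group_into_problems` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple


def group_into_problems(edges: Iterable[Tuple[str, str]], anomalous: Set[str]) -> List[Set[str]]:
    """Group anomalous entities into connected components of the dependency graph."""
    # Build an undirected adjacency map restricted to anomalous entities.
    neighbours = {entity: set() for entity in anomalous}
    for a, b in edges:
        if a in anomalous and b in anomalous:
            neighbours[a].add(b)
            neighbours[b].add(a)

    problems: List[Set[str]] = []
    seen: Set[str] = set()
    for entity in sorted(anomalous):  # sorted for a stable output order
        if entity in seen:
            continue
        component: Set[str] = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        problems.append(component)
    return problems


# Hypothetical call relationships and four simultaneous alerts.
edges = [
    ("checkout-service", "order-service"),
    ("order-service", "orders-db"),
    ("search-service", "catalog-db"),
]
alerts = {"checkout-service", "order-service", "orders-db", "search-service"}

problems = group_into_problems(edges, alerts)
print(len(alerts), "alerts ->", len(problems), "problems")
```

Here four alerts collapse into two groups: the checkout chain and the unrelated `search-service` anomaly.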
Anatomy of a Davis Problem Card
PROBLEM: Response time degradation in checkout-service
Severity: Performance (Red)
Status: OPEN
Duration: 14 minutes
Affected users: ~1,200 (estimated)

ROOT CAUSE:
  Entity: orders-db (PostgreSQL process group)
  Anomaly: Database query response time 8.2x above baseline
  First detected: 14:03:22 UTC

IMPACT CHAIN:
  orders-db (slow queries)
  └── order-service (high response time, 420% baseline)
      └── checkout-service (response time degradation)
          └── 1,200 users (Apdex 0.44)

CONTRIBUTING EVENTS:
  [14:01] Deployment: order-service v2.3.1 deployed
  [14:03] DB query time anomaly detected on orders-db
  [14:05] order-service response time anomaly detected
  [14:07] checkout-service Apdex degradation detected

DAVIS EXPLANATION:
  "Deployment of order-service v2.3.1 introduced a query that
  performs a full-table scan on the orders table. This caused
  orders-db response time to increase 8x, which propagated to
  order-service and checkout-service response time degradation."

SUGGESTED ACTION:
  Review recent deployment: order-service v2.3.1
  Analyse database queries changed in this release
Configuring Smart Alerts and Anomaly Detection
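The settings object in this section pairs an absolute threshold (`degradationMilliseconds`) with a relative one (`degradationPercent`). The sketch below shows one way to read such a pair. It assumes, and this is an assumption rather than something stated here, that a degradation counts only when both margins are exceeded; check the schema documentation for the exact semantics. `is_degraded` is a hypothetical helper.

```python
def is_degraded(baseline_ms: float, observed_ms: float,
                degradation_ms: float, degradation_percent: float) -> bool:
    """Return True when the observed value exceeds the baseline by BOTH margins.

    Assumption: absolute and relative thresholds are combined with AND.
    """
    if baseline_ms <= 0:
        return False  # no meaningful baseline to compare against
    absolute_increase = observed_ms - baseline_ms
    relative_increase = 100.0 * absolute_increase / baseline_ms
    return absolute_increase > degradation_ms and relative_increase > degradation_percent


# Thresholds taken from the example payload: 100 ms and 50 %.
print(is_degraded(200, 350, 100, 50))    # +150 ms and +75 %: both exceeded
print(is_degraded(20, 40, 100, 50))      # +100 % but only +20 ms
print(is_degraded(1000, 1150, 100, 50))  # +150 ms but only +15 %
```

Requiring both margins keeps very fast services from alerting on tiny absolute changes and very slow services from alerting on small relative ones.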
// POST /api/v2/settings/objects
// Schema: builtin:anomaly-detection.services
{
  "schemaId": "builtin:anomaly-detection.services",
  "scope": "SERVICE-XXXXXXXXXXXXXXXX",
  "value": {
    "responseTime": {
      "enabled": true,
      "detectionMode": "AUTO",
      "autoDetection": {
        "responseTimeAll": {
          "degradationMilliseconds": 100,
          "degradationPercent": 50,
          "slowestResponseTimeAll": {
            "degradationMilliseconds": 200,
            "degradationPercent": 100
          }
        }
      }
    },
    "failureRate": {
      "enabled": true,
      "detectionMode": "AUTO",
      "autoDetection": {
        "failingServiceCallPercentageIncreaseAbsoluteThreshold": {
          "enabled": true,
          "threshold": 5
        }
      }
    }
  }
}
Integrating Davis Problems with PagerDuty / Slack
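On the receiving side, a small handler usually reshapes the notification into a chat message. The following is a minimal sketch, assuming the payload carries the field names used in this section's example (`ProblemID`, `ProblemTitle`, `State`, and so on). `to_slack_message` is a hypothetical helper, and the `{"text": ...}` body is the simple form accepted by Slack incoming webhooks. Sending the message is left out.

```python
import json


def to_slack_message(raw_body: str) -> dict:
    """Convert a problem notification (JSON string) into a Slack message body."""
    problem = json.loads(raw_body)
    impacted = ", ".join(e["name"] for e in problem.get("ImpactedEntities", []))
    root = problem.get("RootCauseEntity") or {}
    lines = [
        f"[{problem['State']}] {problem['ProblemTitle']} ({problem['ProblemID']})",
        f"Severity: {problem.get('ProblemSeverity', 'unknown')}",
        f"Root cause: {root.get('name', 'not identified')}",
        f"Impacted: {impacted or 'none listed'}",
        problem.get("ProblemURL", ""),
    ]
    # Drop empty lines (for example, a missing URL) before joining.
    return {"text": "\n".join(line for line in lines if line)}


# Sample body using the same field names as the payload in this section.
sample_body = json.dumps({
    "ProblemID": "P-12345",
    "ProblemTitle": "Response time degradation in checkout-service",
    "State": "OPEN",
    "ProblemSeverity": "PERFORMANCE",
    "ImpactedEntities": [
        {"id": "SERVICE-AAAA", "name": "checkout-service"},
        {"id": "SERVICE-BBBB", "name": "order-service"},
    ],
    "RootCauseEntity": {"id": "PROCESS_GROUP_INSTANCE-CCCC", "name": "orders-db"},
})

message = to_slack_message(sample_body)
print(message["text"])
```

Keeping the formatting in a pure function makes it easy to test without a live endpoint; the HTTP handler and the outgoing call can wrap it.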
// Configure via Settings -> Integrations -> Problem notifications
// Webhook payload sent to your endpoint on problem OPEN/UPDATE/CLOSED:
{
  "ProblemID": "P-12345",
  "ProblemTitle": "Response time degradation in checkout-service",
  "ProblemURL": "https://your-env.live.dynatrace.com/#problems/problemdetail;pid=P-12345",
  "State": "OPEN",
  "ProblemSeverity": "PERFORMANCE",
  "ImpactedEntities": [
    {"id": "SERVICE-AAAA", "name": "checkout-service"},
    {"id": "SERVICE-BBBB", "name": "order-service"}
  ],
  "RootCauseEntity": {
    "id": "PROCESS_GROUP_INSTANCE-CCCC",
    "name": "orders-db"
  },
  "Tags": ["environment:production", "team:checkout"]
}
SLOs and Error Budgets with Davis AI
Dynatrace SLOs (Service Level Objectives) integrate with Davis AI — a problem that burns through the error budget faster than expected automatically elevates the problem severity. This connects operational events to business-level commitments.
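The error budget arithmetic behind that statement is simple enough to sketch. The code below uses the common SRE convention (budget = 100% minus target, burn rate = observed failure share divided by budget); it does not describe Dynatrace's internal calculation, and both function names are hypothetical.

```python
def error_budget_minutes(target_percent: float, window_minutes: float) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return window_minutes * (100.0 - target_percent) / 100.0


def burn_rate(observed_success_percent: float, target_percent: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    budget_percent = 100.0 - target_percent
    failure_percent = 100.0 - observed_success_percent
    return failure_percent / budget_percent


WEEK_MINUTES = 7 * 24 * 60  # one-week evaluation window

budget = error_budget_minutes(99.9, WEEK_MINUTES)
rate = burn_rate(99.5, 99.9)
print(f"budget: {budget:.2f} min per week, burn rate: {rate:.1f}x")
```

With a 99.9% target over one week the budget is roughly ten minutes, and a period at 99.5% success consumes it about five times faster than the target allows.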
// POST /api/v2/slo
{
  "name": "Checkout Service Availability",
  "description": "99.9% availability SLO for checkout",
  "metricExpression": "100*(builtin:service.errors.total.successCount:splitBy())/(builtin:service.requestCount.total:splitBy())",
  "evaluationType": "AGGREGATE",
  "target": 99.9,
  "warning": 99.95,
  "timeframe": "-1w",
  "filter": "type(SERVICE),entityId(SERVICE-checkout)"
}
Debugging Scenarios
- Davis opens a problem on a known maintenance window: Configure maintenance windows in Dynatrace Settings to suppress problem generation during scheduled maintenance. Davis will re-establish baselines after the window ends.
- Problem card shows a deployment as root cause, but the deploy was unrelated: Review the actual anomaly timing. Dynatrace correlates based on timing proximity. If the cause is external (e.g., upstream API degradation), check the contributing events for external service call anomalies.
- Davis not firing on a known issue: The anomaly may not persist long enough to cross Davis's confidence threshold (short spikes are filtered). Check if a manual alert rule is needed for that specific metric.
- Too many noisy problems: Review anomaly detection sensitivity settings. For high-traffic services, Davis baselines adapt — but very noisy services may need sensitivity tuned down in the service-level settings.
Real-world Use Case
A major bank's production environment generated over 2,000 alerts per day across traditional monitoring tools. After deploying Dynatrace, Davis AI correlated those 2,000 daily events into an average of 12 meaningful problems per day, a 99.4% reduction in alert noise. More importantly, the mean time to resolution (MTTR) for critical incidents dropped from 47 minutes to 8 minutes because engineers started from a problem card with the root cause already identified, rather than from an inbox of 200 alerts to correlate manually.
Interview Questions
Beginner
Dynatrace's causal AI engine that automatically baselines every metric, detects anomalies when values deviate from their learned baseline, correlates related anomalies using Smartscape topology, and identifies the root cause in a single problem card.
Davis learns the normal value range for each metric over time — including time-of-day and weekly patterns. Anomalies are detected relative to this learned baseline rather than a static threshold.
A single consolidated incident view showing: root cause entity, impact blast radius (affected services and users), contributing events timeline, plain-English Davis explanation, and suggested action.
Davis correlates all anomalies caused by a single root event into one problem card. A root cause affecting 40 services doesn't generate 40 alerts — it generates one problem with 40 impacted entities listed.
A Service Level Objective: a user-defined target for a service metric (availability, response time). Davis AI monitors SLO compliance, and error budget burn is shown in real time.
Intermediate
Davis uses the Smartscape dependency graph to trace anomalies backwards. If service B is degraded and it calls service A which is also degraded, and service A's host has a disk issue, Davis traces: disk problem → service A degradation → service B impact.
Threshold alerting fires on every breach of a fixed value — producing alert storms. Davis uses dynamic baselines (no configuration), correlates alerts into single problems, and applies causal analysis to identify root cause.
The automatic baseline learns weekly and daily patterns. A Monday morning traffic spike that consistently occurs doesn't trigger anomalies — Davis's baseline accounts for it. New unusual spikes are still detected.
Create a Maintenance Window in Settings → Maintenance Windows — specify the time range, affected entities, and whether to suppress problem creation, alerting, or both.
An event is a single anomaly on a single entity (e.g., response time spike on service A). A problem is a correlated set of events across entities that Davis has grouped because they share a common root cause.
Scenario-based
Root cause entity, the anomaly type (e.g., slow DB queries), impact blast radius (which services/users affected), timeline of events (was there a recent deployment?), Davis's plain-English explanation, and a direct link to the root cause entity's metrics.
The deployment likely introduced a regression. All 15 problems will list the same deployment event as a contributing factor. Dynatrace often consolidates these into fewer problems if they share a root cause. Roll back the deployment and confirm problems close automatically.
Davis detects anomalies as soon as a metric deviates from its baseline, often before users report issues. The problem card delivers a pre-analysed root cause, eliminating the manual correlation and triage phase. MTTD shrinks to the time of Davis's first anomaly detection, rather than the time of the first user report or manually triaged alert.
Both could be correct for different definitions of "fine." Davis may detect queries running slower than the learned baseline even if the DB itself isn't in a critical state. Open the specific PurePaths Davis cited — look at the exact queries, their execution plans, and compare with the baseline period.
Export the problem card timeline (API: GET /api/v2/problems/P-xxxxx) for exact event sequences. Use it to document: when Davis first detected the issue, what the root cause was, which services were impacted, and how long it took to resolve — structured evidence for the incident timeline.
Summary
Davis AI transforms monitoring from a reactive, high-noise discipline into a proactive, low-noise operation. By learning baselines automatically, detecting anomalies continuously, and performing causal root cause analysis using Smartscape topology, Davis delivers the most operationally valuable output in monitoring: a single problem card telling you exactly what broke, what it impacted, and why.