Troubleshooting
Diagnose and resolve the most common Dynatrace operational issues — silent agents, missing metrics, misconfigured K8s injection, false-positive problems, and API errors.
Simple Explanation (ELI5)
Dynatrace is a sophisticated platform and sometimes things don't work as expected. This lesson gives you a systematic approach to every common failure category — with specific commands, log locations, and diagnostic steps used by Dynatrace engineers in production.
Troubleshooting Framework
Every Dynatrace issue follows the same data path: Host/Container → OneAgent → Network → ActiveGate → Dynatrace Cluster → UI. Identify which stage is broken by checking each step in sequence.
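Checking each stage in sequence can be scripted. The following is a minimal sketch, not an official Dynatrace tool: `first_broken_stage` is a hypothetical helper, and the check commands passed to it are placeholders that you would replace with real probes for your environment.

```shell
# Walk the data path in order and print the first stage whose check fails.
# Each argument has the form "stage-name=command". The helper name and the
# argument format are assumptions made for this sketch.
first_broken_stage() (
  for pair in "$@"; do
    stage=${pair%%=*}   # text before the first "="
    check=${pair#*=}    # text after the first "="
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "$stage"
      exit 1
    fi
  done
  echo "none"
)

# Example call (placeholder checks, adjust to your setup):
# first_broken_stage \
#   "OneAgent=systemctl is-active --quiet oneagent" \
#   "ActiveGate=systemctl is-active --quiet dynatracegateway"
```

Stopping at the first failure matters because later stages cannot be healthy if an earlier one is broken, so their results would only add noise.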
Issue 1: OneAgent Installed but Host Not Appearing in UI
# Step 1: Is the OneAgent process running?
sudo systemctl status oneagent
# OR for containerised:
ps aux | grep oneagent
# Step 2: Check OneAgent logs for connection errors
sudo tail -100 /var/log/dynatrace/oneagent/oneagent*.log | grep -i "error\|warning\|connect"
# Step 3: Test connectivity to Dynatrace endpoint
curl -v https://your-env.live.dynatrace.com/communication \
-H "Authorization: AgentTechnologyType=OneAgent"
# Expected: HTTP 200; any other code = network/firewall issue
# Step 4: Verify agent configuration
sudo cat /var/lib/dynatrace/oneagent/agent/config/ruxitagentproc.conf | grep server
# Step 5: Run OneAgent self-diagnostics
sudo /opt/dynatrace/oneagent/agent/lib64/oneagentwatchdog --check
# Step 6: Restart agent if configuration was fixed
sudo systemctl restart oneagent
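When the log is long, a per-keyword count shows quickly whether the agent is failing to connect or only logging warnings. This is a rough sketch that assumes the same three keywords as the grep in Step 2. `summarize_agent_log` is a hypothetical helper name, and the counts are of matching lines, not of distinct incidents.

```shell
# Count log lines that mention each keyword, case-insensitively.
# Reads log text on stdin and prints one "keyword=count" line per keyword.
summarize_agent_log() (
  log=$(cat)
  for word in error warning connect; do
    # grep -c exits non-zero on zero matches, so "|| true" keeps going
    count=$(printf '%s\n' "$log" | grep -ci "$word" || true)
    echo "$word=$count"
  done
)

# Example call:
# sudo tail -100 /var/log/dynatrace/oneagent/oneagent*.log | summarize_agent_log
```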
Issue 2: Services Not Detected by Dynatrace APM
# Step 1: Verify request traffic is flowing (OneAgent only instruments active services)
# Send test requests to the service, then check Dynatrace
# Step 2: Check technology type is supported
# Java, .NET, Node.js, PHP, Go, Python, Ruby (verify your exact runtime on the support page)
# Check: https://docs.dynatrace.com/docs/setup-and-configuration/technology-support
# Step 3: Verify OneAgent version supports the runtime version
# Example: Java 21 requires OneAgent >= 1.285
sudo /opt/dynatrace/oneagent/agent/tools/oneagentctl --version
# Step 4: Check process visibility
# In Dynatrace UI: Infrastructure → Hosts → [your host] → Processes
# If process appears but no service: check if traffic type is detected
# (HTTP/gRPC traffic must be observed for service to be created)
# Step 5: Look for instrumentation errors in OneAgent log
sudo grep -i "instrumentation\|agent.*error\|bytecode" \
/var/log/dynatrace/oneagent/oneagent*.log | tail -50
# Step 6: If Java, check that JVM startup options are not blocking instrumentation
# JVM should NOT have a third-party -javaagent: option (other agents may conflict)
# Check: ps aux | grep java | grep javaagent
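The Step 6 check can be made more precise by extracting the agent paths rather than eyeballing the full command line. A minimal sketch follows, assuming agent paths contain no spaces. `list_javaagents` is a hypothetical helper, and it lists every `-javaagent:` target without judging which ones actually conflict.

```shell
# Print the path of every -javaagent: option found in the process
# command lines supplied on stdin (one command line per input line).
list_javaagents() {
  tr ' ' '\n' | grep -- '^-javaagent:' | sed 's/^-javaagent://'
}

# Example call:
# ps -eo args | grep '[j]ava' | list_javaagents
```

An empty result means no Java process on the host was started with a `-javaagent:` option, which rules out this particular cause.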
Issue 3: Kubernetes Pod Injection Failing
# Step 1: Check DynaKube status
kubectl get dynakube -n dynatrace
kubectl describe dynakube dynakube -n dynatrace
# Look for: Status.Conditions; any "False" conditions indicate problems
# Step 2: Check Dynatrace Operator is running
kubectl get pods -n dynatrace
# Expected pods: dynakube-oneagent-*, dynakube-activegate-*, dynatrace-operator-*
# Step 3: Check Operator logs for injection errors
kubectl logs -n dynatrace deployment/dynatrace-operator --tail=100 | grep -i error
# Step 4: Verify namespace is configured for injection
kubectl get namespace your-namespace --show-labels
# Should include: dynatrace.com/inject=true OR rely on DynaKube namespaceSelector
# Step 5: Check webhook configuration
kubectl get mutatingwebhookconfiguration | grep dynatrace
kubectl describe mutatingwebhookconfiguration dynatrace-webhook
# Step 6: Check pod annotations for injection
kubectl get pod your-pod-xxxx -o yaml | grep -A5 annotations
# If disabled: remove annotation: instrumentation.dynatrace.com/inject=false
# Step 7: Force re-injection by restarting the pod
kubectl rollout restart deployment/your-service
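The Step 2 comparison against the expected pod list can be automated. The sketch below assumes the DynaKube resource is named `dynakube`, so pod names carry the prefixes listed in Step 2, and that input uses the default `kubectl get pods` column layout. `check_dynatrace_pods` is a hypothetical helper name.

```shell
# Read `kubectl get pods -n dynatrace` output on stdin and report, for each
# expected pod-name prefix, whether at least one matching pod is Running.
# Exits non-zero if any expected component has no Running pod.
check_dynatrace_pods() (
  pods=$(cat)
  status=0
  for prefix in dynatrace-operator dynakube-oneagent dynakube-activegate; do
    # Column 1 is NAME and column 3 is STATUS in the default output
    if printf '%s\n' "$pods" | awk -v p="$prefix" \
        'index($1, p) == 1 && $3 == "Running" { found = 1 } END { exit !found }'; then
      echo "$prefix: ok"
    else
      echo "$prefix: not running"
      status=1
    fi
  done
  exit "$status"
)

# Example call:
# kubectl get pods -n dynatrace | check_dynatrace_pods
```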
Issue 4: Davis AI Firing False-Positive Problems
COMMON CAUSES:
1. Baselining period too short (Davis needs 7+ days to learn patterns)
2. Major infrastructure change (new deployment, migration) confuses baseline
3. Planned high-traffic events (sales, launches) exceed baseline
4. Sensitivity set too high for noisy services
RESOLUTION FOR PLANNED EVENTS:
Settings → Maintenance Windows → New Maintenance Window
- Type: Scheduled maintenance
- Time range: exact event timeline
- Affected entities: specific services or environments
- Action: Disable problem detection + alerting
RESOLUTION FOR NOISY SERVICES:
Services → [Your Service] → Settings → Anomaly Detection
- Response time: change AUTO to CUSTOM
- Set higher thresholds (e.g., only alert if >3x baseline, not 1.5x)
- Or disable specific anomaly type entirely for that service
RESOLUTION FOR DEPLOYMENT NOISE:
POST /api/v2/events/ingest (push deployment event before deploy)
Davis will correlate anomalies with the deployment rather than treating them as independent problems, which produces cleaner problem cards
BASELINE RESET AFTER INFRASTRUCTURE CHANGE:
After a major change, Davis will re-learn over 7 days
Raise anomaly detection thresholds (that is, lower the sensitivity) temporarily during the re-learning period
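For the deployment-noise resolution, the request body for `POST /api/v2/events/ingest` might look like the output of the sketch below. `deployment_event_json` is a hypothetical helper, the `version` and `source` property keys are illustrative choices, and the function assumes its arguments contain no quotes or backslashes, since it does no JSON escaping.

```shell
# Build a CUSTOM_DEPLOYMENT event body for POST /api/v2/events/ingest.
# Arguments: 1 = service name, 2 = version, 3 = source label (all assumed
# to be free of quotes and backslashes; no JSON escaping is performed).
deployment_event_json() {
  printf '{"eventType":"CUSTOM_DEPLOYMENT","title":"Deploy %s %s","entitySelector":"type(SERVICE),entityName.equals(\\"%s\\")","properties":{"version":"%s","source":"%s"}}\n' \
    "$1" "$2" "$1" "$2" "$3"
}

# Example call (token needs the events.ingest scope):
# curl -s -X POST "https://your-env.live.dynatrace.com/api/v2/events/ingest" \
#   -H "Authorization: Api-Token YOUR_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d "$(deployment_event_json checkout 1.4.2 ci-pipeline)"
```

Sending the event from the pipeline step that starts the rollout keeps the event timestamp close to the moment the anomalies begin.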
Issue 5: Missing Metrics for Cloud Resources
# AWS Integration — validate via Dynatrace API
curl -s "https://your-env.live.dynatrace.com/api/v1/aws/iamExternalId" \
-H "Authorization: Api-Token YOUR_TOKEN"
# Check AWS integration status
curl -s "https://your-env.live.dynatrace.com/api/v1/aws" \
-H "Authorization: Api-Token YOUR_TOKEN" \
| jq '.[] | {id: .id, status: .status, label: .label}'
# If status = ERROR, check:
# 1. IAM role ARN is correct
# 2. IAM role has the required policies:
# - CloudWatch:GetMetricStatistics
# - EC2:DescribeInstances
# - tag:GetResources
# ActiveGate must be running for cloud integrations
kubectl get pods -n dynatrace | grep activegate
# OR on-prem:
sudo systemctl status dynatracegateway
# Azure — check service principal permissions
# The Dynatrace Azure app registration needs:
# - Monitoring Reader role on subscription
# - Reader role on subscription
# Force metric refresh via API
curl -s -X POST \
"https://your-env.live.dynatrace.com/api/v1/aws/YOUR_CONFIG_ID/refresh" \
-H "Authorization: Api-Token YOUR_TOKEN"
Issue 6: API Token Errors
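In a script or CI job, the status code of a token check has to be translated into an action. A minimal sketch, assuming the 200/401/403 meanings used in this section; `explain_token_status` is a hypothetical helper name.

```shell
# Map the HTTP status code of a token check to a short diagnosis.
explain_token_status() {
  case "$1" in
    200) echo "valid" ;;
    401) echo "invalid or expired token" ;;
    403) echo "insufficient scope" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Example call:
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   "https://your-env.live.dynatrace.com/api/v2/metrics?pageSize=1" \
#   -H "Authorization: Api-Token YOUR_TOKEN")
# explain_token_status "$code"
```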
# Test if API token is valid
curl -s "https://your-env.live.dynatrace.com/api/v2/metrics?pageSize=1" \
-H "Authorization: Api-Token YOUR_TOKEN" -o /dev/null -w "%{http_code}"
# 200 = valid, 401 = invalid/expired token, 403 = insufficient scope
# List required scopes for common operations:
# Metrics API v2: metrics.read
# Problems API v2: problems.read
# Entities API v2: entities.read
# Events ingest: events.ingest
# Settings write (Monaco): settings.write
# PaaS token (OneAgent): InstallerDownload
# Create new token via UI: Settings → Access Tokens → Generate new token
# OR via API (requires existing token with apiTokens.write scope):
curl -s -X POST \
"https://your-env.live.dynatrace.com/api/v2/apiTokens" \
-H "Authorization: Api-Token YOUR_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name":"ci-pipeline","scopes":["metrics.read","problems.read","events.ingest"]}'
Troubleshooting Quick Reference
| Symptom | First Check | Command / Location |
|---|---|---|
| Host not in UI | OneAgent running? | systemctl status oneagent |
| No service detected | Traffic flowing? Tech supported? | Dynatrace: Host → Processes |
| K8s pods not injected | DynaKube status? | kubectl describe dynakube |
| False positive problems | Maintenance window needed? | Settings → Maintenance Windows |
| No AWS/Azure metrics | ActiveGate running? IAM correct? | systemctl status dynatracegateway |
| API 401/403 error | Token valid? Correct scope? | curl ... -w "%{http_code}" |
| Baseline learning slow | Needs 7+ days of data | Temporarily increase thresholds |
| PurePaths missing | Sampling? Agent version? | Check adaptive sampling settings |
Debugging Scenarios
- OneAgent running but Java service shows no APM data: The JVM may have a security manager, or another APM agent attached via -javaagent (AppDynamics, New Relic), that blocks bytecode instrumentation. Remove conflicting agents and restart.
- Problems fire and close every 5 minutes repeatedly: The issue is real but brief and intermittent. Check whether a scheduled job or health check endpoint is causing the spike, and consider whether it is a genuine problem that needs fixing rather than suppressing.
- Dynatrace shows correct metrics but alerts don't fire: Check notification integration configuration (Settings → Integration → Problem notifications) — verify the integration is enabled, URL is correct, and test notification works.
- K8s node metrics missing but pod metrics work: The ActiveGate Kubernetes integration may not have the correct ClusterRole permissions to query node metrics. Check RBAC and compare with the required Dynatrace ClusterRole definition.
Real-world Use Case
A platform team deployed a new Kubernetes cluster and configured the DynaKube operator. In the Dynatrace UI, hosts and node metrics appeared correctly but no services were discovered. The issue: the new cluster used containerd with a custom runtime socket path (/run/k3s/containerd/containerd.sock) — OneAgent's default detection expected the standard socket at /run/containerd/containerd.sock. The fix was a DynaKube annotation specifying the custom socket path. After a pod rollout restart, all 68 services appeared in Dynatrace within 3 minutes.
Interview Questions
Beginner
/var/log/dynatrace/oneagent/ — contains oneagent0.log (current) and rotated archives. Search for "error" or "connection" keywords when diagnosing issues.
401 Unauthorized for an invalid or expired token. 403 Forbidden for a valid token with insufficient scope. Always verify the token has the required scopes for the API endpoint being called.
Run kubectl describe dynakube -n dynatrace and check Status.Conditions. Also verify injected pods have the Dynatrace init container: kubectl describe pod yourpod | grep dynatrace.
A scheduled time period during which Davis AI suppresses problem creation and/or alerting for specified entities — used for planned maintenance, deployments, or known high-traffic events that would otherwise trigger false problems.
A service is created when Dynatrace detects an entry point with request traffic (HTTP, gRPC, etc.). If the process is running but not receiving traffic yet — or uses an unsupported protocol — it appears as a process group without a service entity.
Intermediate
Navigate to Services → [service] → Settings → Anomaly Detection. Switch from AUTO to CUSTOM thresholds and lower the sensitivity: set higher deviation percentages or a longer minimum duration before a problem is opened.
The IAM role assumption is failing — check that the external ID is correct, the trust policy allows Dynatrace's AWS account to assume the role, and the role has the required CloudWatch read permissions.
Restart the monitored process (or pod) — OneAgent re-instruments on process startup. Some instrumentation changes (like adding a new technology) may also require an OneAgent update to the latest version.
A Kubernetes custom resource (CRD) that configures the Dynatrace Operator — specifying OneAgent deployment mode (fullStack, cloudNativeFullStack, applicationMonitoring), ActiveGate capabilities, and token references.
1. Go to Settings → Integrations → Problem notifications. 2. Click "Send test notification" — check if the test succeeds. 3. Verify the problem filter settings (severity level, tag filters). 4. Check if problems are actually opening (not just alerting configured incorrectly).
Scenario-based
1. Is OneAgent installed on the host/pod injected? 2. Is the service's technology supported and the runtime version compatible? 3. Is traffic actually reaching the service? 4. Any instrumentation errors in OneAgent logs? 5. Check Dynatrace UI: Host → Processes — do you see the process? 6. Is there a conflicting agent?
Create a recurring maintenance window for the batch job's scheduled run time — e.g., every night from 02:00–04:00. Davis will suppress problem creation during this window and re-establish baselines for the next day's activity.
Check: 1. Is OneAgent injecting into the pod (init container present)? 2. Is the Java version supported? 3. Is there a security manager blocking bytecode injection? 4. Is the service actually receiving HTTP traffic (generate load, then check)? 5. Check OneAgent logs inside the container for instrumentation errors.
1. Short term: Create a maintenance window matching your deployment window. 2. Long term: Push deployment events to Dynatrace from your CI/CD pipeline via the Events API (POST /api/v2/events/ingest) so Davis correlates anomalies to deployments, producing informational events rather than critical problems for known-good deployments.
The upgrade may have reset instrumentation configuration or cached bytecode. Restart the monitored services to force re-instrumentation. Also check the release notes for the new OneAgent version — it may have changed detection logic for specific frameworks.
Summary
Dynatrace troubleshooting always starts with the data path: agent → network → cluster → UI. OneAgent logs, systemctl status oneagent, and the DynaKube operator status are your first diagnostic steps. For Davis false positives, maintenance windows and custom sensitivity thresholds are the right tools. For API issues, always verify token scope before debugging the integration.