Hands-on · Lesson 8 of 9

Troubleshooting

Diagnose and resolve the most common Dynatrace operational issues — silent agents, missing metrics, misconfigured K8s injection, false-positive problems, and API errors.

Simple Explanation (ELI5)

Dynatrace is a sophisticated platform and sometimes things don't work as expected. This lesson gives you a systematic approach to every common failure category — with specific commands, log locations, and diagnostic steps used by Dynatrace engineers in production.

Troubleshooting Framework

Every Dynatrace issue follows the same data path: Host/Container → OneAgent → Network → ActiveGate → Dynatrace Cluster → UI. Identify which stage is broken by checking each step in sequence.
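
The stage-by-stage check can be sketched as a small helper that walks the data path in order and names the first stage that is not healthy. This is only an illustration: the function name, the stage labels, and the "ok"/"fail" status values are assumptions, and in practice each status would come from one of the checks described in the issues below.

```shell
#!/usr/bin/env bash
# Walk the data path in order and report the first stage that is not healthy.
# Input: "stage=status" pairs; any status other than "ok" counts as broken.
first_broken_stage() {
  local pair stage status
  for pair in "$@"; do
    stage="${pair%%=*}"    # text before the first "="
    status="${pair#*=}"    # text after the first "="
    if [ "$status" != "ok" ]; then
      echo "$stage"
      return 0
    fi
  done
  echo "none"
}

# Example with hypothetical check results:
first_broken_stage host=ok oneagent=ok network=fail activegate=ok cluster=ok ui=ok
```

Checking in path order matters because a failure early in the path (for example, the network) makes every later stage look broken as well.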

Issue 1: OneAgent Installed but Host Not Appearing in UI

bash — OneAgent connectivity diagnostics
# Step 1: Is OneAgent process running?
sudo systemctl status oneagent
# OR for containerised:
ps aux | grep oneagent

# Step 2: Check OneAgent logs for connection errors
sudo tail -100 /var/log/dynatrace/oneagent/oneagent*.log | grep -i "error\|warning\|connect"

# Step 3: Test connectivity to Dynatrace endpoint
curl -v https://your-env.live.dynatrace.com/communication \
  -H "Authorization: AgentTechnologyType=OneAgent"
# Expected: HTTP 200 — any other code = network/firewall issue

# Step 4: Verify agent configuration
sudo cat /var/lib/dynatrace/oneagent/agent/config/ruxitagentproc.conf | grep server

# Step 5: Run OneAgent self-diagnostics
sudo /opt/dynatrace/oneagent/agent/lib64/oneagentwatchdog --check

# Step 6: Restart agent if configuration was fixed
sudo systemctl restart oneagent
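
As a rough follow-up to Step 2, the log filter can be wrapped in a helper that counts suspicious lines, which makes it easy to compare a host before and after a fix. This is a sketch, not a Dynatrace tool: the function name is made up and the sample log lines are hypothetical, not real OneAgent output.

```shell
#!/usr/bin/env bash
# Count log lines that mention errors, warnings, or connection activity.
# Reads log text on stdin so it can be fed from tail, cat, or a saved file.
count_suspect_lines() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the
  # helper usable under "set -e" while still printing the count.
  grep -ciE 'error|warning|connect' || true
}

# Example with hypothetical log lines:
printf '%s\n' \
  'info    Agent started' \
  'WARNING Retrying communication' \
  'error   Unable to connect to server' | count_suspect_lines
```

On a real host the input would come from the log files named in Step 2, for example by piping `sudo cat /var/log/dynatrace/oneagent/oneagent*.log` into `count_suspect_lines`.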

Issue 2: Services Not Detected by Dynatrace APM

bash — Service detection troubleshooting
# Step 1: Verify request traffic is flowing (OneAgent only instruments active services)
# Send test requests to the service, then check Dynatrace

# Step 2: Check technology type is supported
# Java, .NET, Node.js, PHP, Go, Python, Ruby: all supported
# Check: https://docs.dynatrace.com/docs/setup-and-configuration/technology-support

# Step 3: Verify OneAgent version supports the runtime version
# Example: Java 21 requires OneAgent >= 1.285
sudo /opt/dynatrace/oneagent/agent/tools/oneagentctl --version

# Step 4: Check process visibility
# In Dynatrace UI: Infrastructure → Hosts → [your host] → Processes
# If process appears but no service: check if traffic type is detected
# (HTTP/gRPC traffic must be observed for service to be created)

# Step 5: Look for instrumentation errors in OneAgent log
sudo grep -i "instrumentation\|agent.*error\|bytecode" \
  /var/log/dynatrace/oneagent/oneagent*.log | tail -50

# Step 6: If Java, check JVM startup options are not blocking instrumentation
# The JVM should NOT load a third-party agent via -javaagent:<path> (it may conflict)
ps aux | grep java | grep -e '-javaagent'
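
The version requirement in Step 3 can be checked mechanically. The sketch below compares two dotted version strings with `sort -V` (version sort, available in GNU coreutils). The function name is an assumption, and the installed version shown is a made-up example; the real value would come from `oneagentctl` on the host.

```shell
#!/usr/bin/env bash
# Return success if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
  # After a version sort, the smaller version comes first; $1 >= $2
  # exactly when that smallest entry is $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: hypothetical installed version vs. the minimum from Step 3
if version_ge "1.291.86" "1.285"; then
  echo "agent is new enough"
else
  echo "upgrade OneAgent first"
fi
```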

Issue 3: Kubernetes Pod Injection Failing

bash — K8s OneAgent injection diagnostics
# Step 1: Check DynaKube status
kubectl get dynakube -n dynatrace
kubectl describe dynakube dynakube -n dynatrace
# Look for: Status.Conditions — any "False" conditions indicate problems

# Step 2: Check Dynatrace Operator is running
kubectl get pods -n dynatrace
# Expected pods: dynakube-oneagent-*, dynakube-activegate-*, dynatrace-operator-*

# Step 3: Check Operator logs for injection errors
kubectl logs -n dynatrace deployment/dynatrace-operator --tail=100 | grep -i error

# Step 4: Verify namespace is configured for injection
kubectl get namespace your-namespace --show-labels
# Should include: dynatrace.com/inject=true OR rely on DynaKube namespaceSelector

# Step 5: Check webhook configuration
kubectl get mutatingwebhookconfiguration | grep dynatrace
kubectl describe mutatingwebhookconfiguration dynatrace-webhook

# Step 6: Check pod annotations for injection
kubectl get pod your-pod-xxxx -o yaml | grep -A5 annotations
# If disabled: remove the pod annotation oneagent.dynatrace.com/inject: "false"

# Step 7: Force re-injection by restarting the pod
kubectl rollout restart deployment/your-service
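
For Step 2, a quick way to spot unhealthy operator components is to filter the `kubectl get pods` table for anything that is not Running. The helper below is a sketch under that assumption: the function name is invented and the pod names in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Print the names of pods whose STATUS column is not "Running".
# Expects the default "kubectl get pods" table on stdin (header row first).
# Note: pods in "Completed" state are listed too, since they are not Running.
pods_not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Example with hypothetical output from: kubectl get pods -n dynatrace
pods_not_running <<'EOF'
NAME                                 READY   STATUS             RESTARTS   AGE
dynatrace-operator-7c9f8d6b5-abcde   1/1     Running            0          2d
dynakube-activegate-0                0/1     CrashLoopBackOff   12         2d
dynakube-oneagent-x2x9k              1/1     Running            0          2d
EOF
```

Against a live cluster the same filter would be used as `kubectl get pods -n dynatrace | pods_not_running`.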

Issue 4: Davis AI Firing False-Positive Problems

text — False positive investigation and resolution
COMMON CAUSES:
  1. Baselining period too short (Davis needs 7+ days to learn patterns)
  2. Major infrastructure change (new deployment, migration) confuses baseline
  3. Planned high-traffic events (sales, launches) exceed baseline
  4. Sensitivity set too high for noisy services

RESOLUTION FOR PLANNED EVENTS:
  Settings → Maintenance Windows → New Maintenance Window
  - Type: Scheduled maintenance
  - Time range: exact event timeline
  - Affected entities: specific services or environments
  - Action: Disable problem detection + alerting

RESOLUTION FOR NOISY SERVICES:
  Services → [Your Service] → Settings → Anomaly Detection
  - Response time: change AUTO to CUSTOM
  - Set higher thresholds (e.g., only alert if >3x baseline, not 1.5x)
  - Or disable specific anomaly type entirely for that service

RESOLUTION FOR DEPLOYMENT NOISE:
  POST /api/v2/events/ingest (push deployment event before deploy)
  Davis will correlate anomalies with the deployment rather than
  treating them as independent problems; this produces cleaner problem cards

BASELINE RESET AFTER INFRASTRUCTURE CHANGE:
  After a major change, Davis will re-learn over 7 days
  Raise anomaly detection thresholds (that is, lower the sensitivity)
  temporarily during the re-learning period
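
To make the deployment-event resolution more concrete, here is a minimal sketch of building a request body for POST /api/v2/events/ingest. The function name, the example title, the tag-based entity selector, and the "version" property key are illustrative assumptions, not values from this lesson; check the Events API reference for the fields your environment expects.

```shell
#!/usr/bin/env bash
# Build a JSON body describing a deployment for POST /api/v2/events/ingest.
# Arguments: title, entity selector, version.
# This sketch does no JSON escaping, so the values must not contain
# double quotes or backslashes.
build_deployment_event() {
  local title="$1" selector="$2" version="$3"
  printf '{"eventType":"CUSTOM_DEPLOYMENT","title":"%s","entitySelector":"%s","properties":{"version":"%s"}}\n' \
    "$title" "$selector" "$version"
}

# Example with hypothetical service tag and version:
build_deployment_event "checkout rollout" "type(SERVICE),tag(app:checkout)" "2.4.1"
```

In a pipeline, the output could be piped to `curl -s -X POST "https://your-env.live.dynatrace.com/api/v2/events/ingest" -H "Authorization: Api-Token YOUR_TOKEN" -H "Content-Type: application/json" -d @-`, using a token that has the events.ingest scope.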

Issue 5: Missing Metrics for Cloud Resources

bash — Cloud integration diagnostics
# AWS Integration — validate via Dynatrace API
curl -s "https://your-env.live.dynatrace.com/api/config/v1/aws/iamExternalId" \
  -H "Authorization: Api-Token YOUR_TOKEN"

# Check AWS integration status
curl -s "https://your-env.live.dynatrace.com/api/v1/aws" \
  -H "Authorization: Api-Token YOUR_TOKEN" \
  | jq '.[] | {id: .id, status: .status, label: .label}'

# If status = ERROR, check:
# 1. IAM role ARN is correct
# 2. IAM role has the required policies:
#    - cloudwatch:GetMetricStatistics
#    - ec2:DescribeInstances
#    - tag:GetResources

# ActiveGate must be running for cloud integrations
kubectl get pods -n dynatrace | grep activegate
# OR on-prem:
sudo systemctl status dynatracegateway

# Azure — check service principal permissions
# The Dynatrace Azure app registration needs:
# - Monitoring Reader role on subscription
# - Reader role on subscription

# Force metric refresh via API
curl -s -X POST \
  "https://your-env.live.dynatrace.com/api/v1/aws/YOUR_CONFIG_ID/refresh" \
  -H "Authorization: Api-Token YOUR_TOKEN"
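
The IAM policy check can be partly automated. The sketch below takes the actions a role actually grants and prints which of the three required actions listed above are missing. The function and variable names are assumptions, and it only matches exact action names: wildcard grants such as `cloudwatch:*` are not expanded.

```shell
#!/usr/bin/env bash
# Required actions, taken from the checklist above.
REQUIRED_ACTIONS="cloudwatch:GetMetricStatistics ec2:DescribeInstances tag:GetResources"

# Print each required action that is absent from the granted actions
# passed as arguments. Comparison is case-insensitive, because IAM
# action names are not case-sensitive.
missing_iam_actions() {
  local granted required
  # Lower-case the granted list and pad with spaces for whole-word matching.
  granted=" $(printf '%s ' "$@" | tr '[:upper:]' '[:lower:]')"
  for required in $REQUIRED_ACTIONS; do
    case "$granted" in
      *" $(printf '%s' "$required" | tr '[:upper:]' '[:lower:]') "*) ;;
      *) echo "$required" ;;
    esac
  done
}

# Example: a hypothetical role that lacks the tagging permission
missing_iam_actions cloudwatch:GetMetricStatistics EC2:DescribeInstances
```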

Issue 6: API Token Errors

bash — API token troubleshooting
# Test if API token is valid
curl -s "https://your-env.live.dynatrace.com/api/v2/metrics?pageSize=1" \
  -H "Authorization: Api-Token YOUR_TOKEN" -o /dev/null -w "%{http_code}"
# 200 = valid, 401 = invalid/expired token, 403 = insufficient scope

# List required scopes for common operations:
# Metrics API v2:           metrics.read
# Problems API v2:          problems.read
# Entities API v2:          entities.read
# Events ingest:            events.ingest
# Settings write (Monaco):  settings.write
# PaaS token (OneAgent):    InstallerDownload

# Create new token via UI: Settings → Access Tokens → Generate new token
# OR via API (requires existing token with apiTokens.write scope):
curl -s -X POST \
  "https://your-env.live.dynatrace.com/api/v2/apiTokens" \
  -H "Authorization: Api-Token YOUR_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name":"ci-pipeline","scopes":["metrics.read","problems.read","events.ingest"]}'
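
The status-code check at the top of this section lends itself to a small helper for scripts and CI jobs. The sketch below maps the HTTP code to the diagnosis given above; the function name and the wording of the messages are assumptions.

```shell
#!/usr/bin/env bash
# Translate the HTTP status from a Dynatrace API call into a diagnosis.
explain_token_status() {
  case "$1" in
    200) echo "token valid" ;;
    401) echo "token invalid or expired" ;;
    403) echo "token valid but missing a required scope" ;;
    *)   echo "unexpected status $1: check URL and network path" ;;
  esac
}

# Example with a hypothetical response code:
explain_token_status 403
```

In practice the argument would be the output of the curl command shown above, which uses `-o /dev/null -w "%{http_code}"` to print only the status code.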

Troubleshooting Quick Reference

Symptom | First Check | Command / Location
Host not in UI | OneAgent running? | systemctl status oneagent
No service detected | Traffic flowing? Tech supported? | Dynatrace: Host → Processes
K8s pods not injected | DynaKube status? | kubectl describe dynakube
False positive problems | Maintenance window needed? | Settings → Maintenance Windows
No AWS/Azure metrics | ActiveGate running? IAM correct? | systemctl status dynatracegateway
API 401/403 error | Token valid? Correct scope? | curl ... -w "%{http_code}"
Baseline learning slow | Needs 7+ days of data | Temporarily increase thresholds
PurePaths missing | Sampling? Agent version? | Check adaptive sampling settings

Debugging Scenarios

Real-world Use Case

A platform team deployed a new Kubernetes cluster, installed the Dynatrace Operator, and applied a DynaKube resource. In the Dynatrace UI, hosts and node metrics appeared correctly, but no services were discovered. The cause: the new cluster used containerd with a custom runtime socket path (/run/k3s/containerd/containerd.sock), whereas OneAgent's default detection expected the standard socket at /run/containerd/containerd.sock. The fix was a DynaKube annotation specifying the custom socket path. After a pod rollout restart, all 68 services appeared in Dynatrace within 3 minutes.

Interview Questions

Beginner

Where are OneAgent logs stored on Linux?

/var/log/dynatrace/oneagent/ — contains oneagent0.log (current) and rotated archives. Search for "error" or "connection" keywords when diagnosing issues.

What HTTP code does the Dynatrace API return for an invalid token?

401 Unauthorized for an invalid or expired token. 403 Forbidden for a valid token with insufficient scope. Always verify the token has the required scopes for the API endpoint being called.

How do you check if Kubernetes pods are being injected with OneAgent?

Run kubectl describe dynakube -n dynatrace and check Status.Conditions. Also verify injected pods have the Dynatrace init container: kubectl describe pod yourpod | grep dynatrace.

What is a maintenance window in Dynatrace?

A scheduled time period during which Davis AI suppresses problem creation and/or alerting for specified entities — used for planned maintenance, deployments, or known high-traffic events that would otherwise trigger false problems.

Why would a service appear in Dynatrace as a 'process' but not as a 'service'?

A service is created when Dynatrace detects an entry point with request traffic (HTTP, gRPC, etc.). If the process is running but not receiving traffic yet — or uses an unsupported protocol — it appears as a process group without a service entity.

Intermediate

How do you reduce false-positive Davis problems for a noisy service?

Navigate to Services → [service] → Settings → Anomaly Detection. Switch from AUTO to CUSTOM thresholds and lower the sensitivity: set higher deviation percentages or a longer minimum duration before a problem is opened.

Why would AWS Cloud integration show no metrics even though ActiveGate is running?

The IAM role assumption is failing — check that the external ID is correct, the trust policy allows Dynatrace's AWS account to assume the role, and the role has the required CloudWatch read permissions.

How do you force OneAgent to re-detect a service after a code change?

Restart the monitored process (or pod); OneAgent re-instruments on process startup. Some instrumentation changes (like adding a new technology) may also require a OneAgent update to the latest version.

What is the DynaKube custom resource?

A Kubernetes custom resource (CRD) that configures the Dynatrace Operator — specifying OneAgent deployment mode (fullStack, cloudNativeFullStack, applicationMonitoring), ActiveGate capabilities, and token references.

How do you debug a notification integration that isn't firing?

1. Go to Settings → Integrations → Problem notifications.
2. Click "Send test notification" and check whether the test succeeds.
3. Verify the problem filter settings (severity level, tag filters).
4. Check that problems are actually opening (the cause may be missing problems rather than a misconfigured notification).

Scenario-based

A team says "Dynatrace isn't working" for their new service. Walk me through diagnosis.

1. Is OneAgent installed on the host, or injected into the pod?
2. Is the service's technology supported and the runtime version compatible?
3. Is traffic actually reaching the service?
4. Are there any instrumentation errors in the OneAgent logs?
5. In the Dynatrace UI, under Host → Processes, do you see the process?
6. Is there a conflicting agent?

Davis AI is alerting every 10 minutes during nightly batch jobs. How do you stop this?

Create a recurring maintenance window for the batch job's scheduled run time — e.g., every night from 02:00–04:00. Davis will suppress problem creation during this window and re-establish baselines for the next day's activity.

You just ran a successful Dynatrace POC but the team asks: "Why isn't the Java microservice being traced?" What do you investigate?

Check:
1. Is OneAgent injecting into the pod (init container present)?
2. Is the Java version supported?
3. Is there a security manager blocking bytecode injection?
4. Is the service actually receiving HTTP traffic (generate load, then check)?
5. Check OneAgent logs inside the container for instrumentation errors.

Dynatrace shows a problem every deployment — "response time degradation." The team says these are expected. What do you configure?

1. Short term: Create a maintenance window matching your deployment window.
2. Long term: Push deployment events to Dynatrace from the CI/CD pipeline (POST /api/v2/events/ingest) so Davis correlates anomalies with deployments, producing informational events rather than critical problems for known-good deployments.

After upgrading OneAgent, several services stopped appearing in APM. What happened?

The upgrade may have reset instrumentation configuration or cached bytecode. Restart the monitored services to force re-instrumentation. Also check the release notes for the new OneAgent version — it may have changed detection logic for specific frameworks.

Summary

Dynatrace troubleshooting always starts with the data path: agent → network → cluster → UI. OneAgent logs, systemctl status oneagent, and the DynaKube operator status are your first diagnostic steps. For Davis false positives, maintenance windows and custom sensitivity thresholds are the right tools. For API issues, always verify token scope before debugging the integration.