Infrastructure Monitoring
Monitor hosts, containers, Kubernetes workloads, and network flows with Dynatrace — automatically, at every layer of the stack.
Simple Explanation (ELI5)
Infrastructure monitoring is watching the physical and virtual machines your software runs on. Dynatrace tracks CPU, memory, disk, network, and container health for every host — and connects that data to the services running on it, so when a VM runs out of memory you can immediately see which service was impacted.
What Dynatrace Monitors at the Infrastructure Layer
Hosts: CPU usage, memory, disk I/O, network throughput, load average, swap, and file system capacity for every bare metal, VM, or cloud instance running OneAgent.
Process groups: Identical processes grouped across hosts (e.g., all instances of your payment-service). Aggregated metrics plus individual process-level visibility, including thread counts and memory usage.
Containers: Container-level CPU and memory limits vs usage, OOMKill events, container restarts, and per-container network traffic, automatically correlated to the workload each container belongs to.
Kubernetes: Cluster, node, namespace, workload, pod, and container metrics. Deployment events, scaling actions, and pod lifecycle events are all visible in context.
Network: OneAgent captures TCP-level connection metrics (data sent/received, retransmissions, connection failures), mapped per process and service to diagnose network-layer issues.
Cloud services: AWS EC2, RDS, ELB, Lambda, S3; Azure VMs, App Service, SQL; GCP Compute, GKE. These are monitored via ActiveGate cloud integrations that poll cloud provider APIs.
Smartscape: Full-Stack Topology
Smartscape is Dynatrace's automatically maintained dependency topology. It models five layers:
Data centers (environment)
Hosts (VMs / nodes)
Process groups
Services
Applications (users)
When an infrastructure issue occurs (e.g., a host runs out of CPU), Smartscape immediately shows which process groups and services on that host are impacted — enabling vertical root cause analysis from infrastructure to user experience.
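The vertical analysis described above can be pictured as a walk up a dependency graph. The following is a minimal sketch, not Dynatrace's implementation: the entity names and the `RUNS_ABOVE` mapping are hypothetical, and a real environment would obtain this topology from Dynatrace rather than from a hard-coded dictionary.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities that run directly on top of it.
RUNS_ABOVE = {
    "host:prod-01": ["process_group:payment-jvm", "process_group:nginx"],
    "process_group:payment-jvm": ["service:payment-service"],
    "process_group:nginx": ["service:web-frontend"],
    "service:payment-service": ["application:checkout"],
    "service:web-frontend": ["application:checkout"],
}


def impacted_entities(start, edges):
    """Return every entity reachable upward from `start`, in breadth-first order."""
    impacted = []
    visited = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in visited:
                visited.add(parent)
                impacted.append(parent)
                queue.append(parent)
    return impacted


# Everything that could be affected when host:prod-01 runs out of CPU.
for entity in impacted_entities("host:prod-01", RUNS_ABOVE):
    print(entity)
```

Breadth-first order lists the layers closest to the host first (process groups, then services, then applications), which mirrors reading the topology from infrastructure up to user experience.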
Kubernetes Monitoring — Full Detail
| Kubernetes Entity | Metrics Available | Key Signals to Watch |
|---|---|---|
| Cluster | Node count, resource utilisation, pod capacity | High node CPU/memory = scaling needed |
| Node | CPU, memory (requests/limits/actual), pod count | Node near memory limit = OOMKill risk |
| Namespace | Pod count, CPU/memory aggregated, error events | Quota exhaustion, unexpected pod counts |
| Workload (Deployment/DaemonSet) | Ready replicas, desired replicas, restart count | Restart loops, degraded replicas |
| Pod | Container CPU/memory, OOMKill, restart count | High restart count = crash loop |
| Container | CPU throttling %, memory usage vs limit | High CPU throttling = undersized limits |
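To make the container and pod signals in the table concrete, here is a small sketch that derives a throttling percentage from Linux cgroup `cpu.stat` counters and flags the conditions the table lists. It is an illustration only: the thresholds (50% throttling, 90% of the memory limit, more than 3 restarts) and the function names are assumptions, and this is not necessarily how Dynatrace computes its own metrics.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def throttling_percent(stats):
    """Share of CPU enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return 100.0 * stats.get("nr_throttled", 0) / periods


def container_signals(stats, memory_used, memory_limit, restarts):
    """Return the table's warning signals that apply (thresholds are assumed values)."""
    signals = []
    if throttling_percent(stats) > 50:
        signals.append("cpu_throttling")
    if memory_limit > 0 and memory_used / memory_limit > 0.90:
        signals.append("memory_near_limit")
    if restarts > 3:
        signals.append("restart_loop")
    return signals


# Sample counters in the format of a cgroup v2 cpu.stat file (values are made up).
SAMPLE_CPU_STAT = """
usage_usec 5000000
nr_periods 200
nr_throttled 120
throttled_usec 900000
"""

stats = parse_cpu_stat(SAMPLE_CPU_STAT)
print(throttling_percent(stats))
print(container_signals(stats, memory_used=950, memory_limit=1024, restarts=1))
```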
Host Monitoring — Key Metrics
# List all hosts with CPU usage above 85% in last 30 minutes
curl -s -X GET \
"https://your-env.live.dynatrace.com/api/v2/metrics/query" \
-H "Authorization: Api-Token YOUR_API_TOKEN" \
-G \
--data-urlencode "metricSelector=builtin:host.cpu.usage:filter(series(max,gt(85)))" \
--data-urlencode "resolution=5m" \
--data-urlencode "from=now-30m" \
--data-urlencode "entitySelector=type(HOST)"
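The same kind of query can be prepared from a script. The sketch below only builds the request with Python's standard library and does not send it; the helper name `build_metrics_query` is hypothetical, and the environment URL and token are the same placeholders used in the curl example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_metrics_query(base_url, token, metric_selector, entity_selector,
                        resolution="5m", time_from="now-30m"):
    """Build (but do not send) a GET request for the metrics query endpoint."""
    params = urlencode({
        "metricSelector": metric_selector,
        "entitySelector": entity_selector,
        "resolution": resolution,
        "from": time_from,
    })
    url = f"{base_url}/api/v2/metrics/query?{params}"
    return Request(url, headers={"Authorization": f"Api-Token {token}"})


# Placeholder environment and token, as in the curl example.
request = build_metrics_query(
    "https://your-env.live.dynatrace.com",
    "YOUR_API_TOKEN",
    "builtin:host.cpu.usage",
    "type(HOST)",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it against a real environment. `urlencode` handles the percent-encoding that `--data-urlencode` performs in curl.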
# Query memory available (bytes) for production hosts
curl -s -X GET \
"https://your-env.live.dynatrace.com/api/v2/metrics/query" \
-H "Authorization: Api-Token YOUR_API_TOKEN" \
-G \
--data-urlencode "metricSelector=builtin:host.mem.avail.bytes" \
--data-urlencode "entitySelector=type(HOST),tag(environment:production)" \
| jq '.result[0].data[] | {entity: .dimensionMap, values: .values}'
Kubernetes Monitoring via DQL (Dynatrace Query Language)
// Find pods with more than 3 restarts in last 1 hour
fetch dt.entity.container_group_instance, from: now()-1h
| filter container.restarts > 3
| fields entity.name, container.restarts, workload.name, namespace.name
| sort container.restarts desc

// CPU throttling per container across production namespace
fetch dt.entity.container_group_instance, from: now()-30m
| filter namespace.name == "production"
| summarize avg_throttle = avg(cpu.throttled.percent), by: {entity.name, workload.name}
| filter avg_throttle > 50
| sort avg_throttle desc
Infrastructure to Application Correlation
A key Dynatrace capability: automatically linking infrastructure events to service impact. When a host's disk fills up and a service shows increased failure rate, Davis AI correlates these two events into a single problem — "Disk space exhaustion on host-prod-04 is causing writes to fail in order-service" — without you needing to manually connect the dots.
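One way to picture this kind of correlation is to combine topology (which services run on the affected host) with timing (which events overlap). The sketch below is a deliberately simplified illustration under those two assumptions and is not Davis AI's actual algorithm; the `Event` type, the entity names, and the minute-based timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    entity: str
    title: str
    start: int  # minutes since an arbitrary reference point
    end: int


def overlaps(a, b):
    """True when the two events share at least one point in time."""
    return a.start <= b.end and b.start <= a.end


def correlate(host_event, service_events, services_on_host):
    """Keep service events that run on the affected host and overlap the host event."""
    return [
        event for event in service_events
        if event.entity in services_on_host and overlaps(host_event, event)
    ]


disk_full = Event("host-prod-04", "Disk space exhaustion", start=10, end=40)
service_events = [
    Event("order-service", "Failure rate increase", start=15, end=35),
    Event("billing-service", "Failure rate increase", start=16, end=30),
    Event("order-service", "Response time degradation", start=100, end=120),
]

# Only order-service runs on host-prod-04 in this made-up topology.
related = correlate(disk_full, service_events, services_on_host={"order-service"})
for event in related:
    print(f"{disk_full.title} on {disk_full.entity} -> {event.title} in {event.entity}")
```

The billing-service event is excluded because it does not run on the host, and the later order-service event is excluded because it does not overlap in time.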
Network Monitoring
# Per-process network metrics (captured automatically by OneAgent):
- Bytes sent/received per process per minute
- TCP connection open/close rates
- Connection retransmission count (indicates packet loss)
- Connection failure count (indicates unreachable endpoints)

# Cross-host traffic is mapped in Smartscape:
- payment-service (host-prod-01) --> postgres (host-db-02): 1.2 MB/min
- payment-service (host-prod-01) --> redis (host-cache-01): 450 KB/min
- payment-service (host-prod-01) --> external-payment-gateway: 80 KB/min

# High retransmissions = network congestion or misconfigured MTU
# Connection failures = firewall rules, DNS resolution, or service down
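The two rules of thumb above (retransmissions point to congestion or MTU problems, connection failures point to reachability problems) can be expressed as a small check. This is a sketch with made-up counters; the field names and the 2% retransmission threshold are assumptions, not Dynatrace defaults.

```python
def retransmission_percent(segments_sent, segments_retransmitted):
    """Retransmitted segments as a percentage of all segments sent."""
    if segments_sent == 0:
        return 0.0
    return 100.0 * segments_retransmitted / segments_sent


def network_hints(conn, max_retransmission_percent=2.0):
    """Map raw connection counters to the diagnostic hints described above."""
    hints = []
    pct = retransmission_percent(conn["segments_sent"], conn["segments_retransmitted"])
    if pct > max_retransmission_percent:
        hints.append("high retransmissions: check for congestion or a misconfigured MTU")
    if conn["connection_failures"] > 0:
        hints.append("connection failures: check firewall rules, DNS resolution, or service availability")
    return hints


# Hypothetical per-connection counters for the payment-service flows.
connections = {
    "payment-service -> postgres": {
        "segments_sent": 5000, "segments_retransmitted": 150, "connection_failures": 0},
    "payment-service -> redis": {
        "segments_sent": 8000, "segments_retransmitted": 8, "connection_failures": 0},
    "payment-service -> external-payment-gateway": {
        "segments_sent": 1200, "segments_retransmitted": 6, "connection_failures": 4},
}

for name, conn in connections.items():
    print(name, network_hints(conn))
```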
Debugging Scenarios
- Service latency spikes correlate with host CPU spikes: Check whether the process is CPU-constrained by looking at host CPU and CPU-ready time. Possible fix: scale horizontally or increase CPU limits for the container.
- Pods show OOMKilled events: The container memory limit is too low. Check actual usage vs limit in the K8s workload view, then increase resources.limits.memory in the Deployment manifest.
- No Kubernetes metrics in Dynatrace: The DynaKube may not have the kubernetes-monitoring capability enabled on its ActiveGate, or the Kubernetes API token permissions are insufficient.
- High CPU throttling on containers: The CPU limit is set too low, so the container is being throttled. Increase resources.limits.cpu or review whether the CPU limit is appropriate.
Real-world Use Case
A DevOps team received a Davis AI problem: "Response time degradation in order-service." The problem card showed the root cause: host prod-node-07 had 98% disk usage. Investigation in Dynatrace revealed that the application was writing verbose DEBUG logs to a local volume that had not been rotated in 3 weeks. The team configured log rotation via a cron job and cleared the disk, after which service response time normalised. The root cause (disk → service impact) was surfaced automatically by Dynatrace in the problem card hierarchy without any manual correlation.
Interview Questions
Beginner
CPU usage, memory (total/available/used), disk I/O (read/write bytes, IOPS), network (bytes in/out, errors), and load average — all collected automatically by OneAgent.
A logical grouping of identical processes running across multiple hosts (e.g., all instances of the payment-service JVM). Enables fleet-level metrics aggregated across the group.
When a container tries to use more CPU than its configured limit, the Linux cgroups scheduler throttles (pauses) it. High throttling causes latency — visible as long response times despite low node-level CPU usage.
Out Of Memory Killed — the Linux OOM killer terminated a container process because it exceeded its memory limit. Kubernetes then restarts the container, causing a pod restart event.
Dynatrace's automatically maintained, real-time topology map showing every entity (hosts, processes, services, applications) and every dependency relationship — updated continuously by OneAgent data.
Intermediate
Smartscape tracks which processes (and thus services) run on each host. When an infrastructure anomaly occurs, Davis AI traverses the Smartscape dependency graph to identify services that run on the affected host and correlates them into a single problem card.
Use an ActiveGate with the AWS cloud integration enabled. It polls CloudWatch metrics APIs and imports them into Dynatrace — enabling monitoring of managed services like RDS, Lambda, ELB, and DynamoDB without agents.
Cluster, nodes, namespaces, deployments, DaemonSets, StatefulSets, pods, and containers — including CPU/memory requests and limits, restart counts, and deployment events.
CPU usage measures how much CPU the process is actually consuming. CPU ready time measures how long the process was waiting for a CPU core to become available — high ready time causes latency even when overall host CPU appears low.
OneAgent captures TCP-level connection metrics per process. High retransmission rates indicate packet loss or network congestion. Connection failure spikes indicate unreachable endpoints. Both are visible in the host network analysis view.
Scenario-based
1. Check which pods are using the most memory on that node in Dynatrace. 2. Identify if any pod is near its memory limit (OOMKill risk). 3. Cordon the node to prevent new scheduling. 4. Drain non-critical pods to other nodes. 5. Investigate if a memory leak is causing growth.
Check network metrics — high TCP retransmissions or connection failures between service instances may be causing upstream timeouts that appear normal in the originating service. Also check if a load balancer or service mesh sidecar is adding latency.
Yes — even if services look fine now, sustained 95% CPU means there's no headroom for spikes. Check which process is consuming CPU, whether it's your application or a background process (e.g., log rotation, backup). Plan capacity action to prevent future impact.
Configure a Dynatrace anomaly detection custom threshold for builtin:host.disk.usedPct with a static threshold alert at 80% (warning) and 90% (critical). Add a notification integration to PagerDuty or Teams.
Dynatrace shows the pod restart event in the workload view and links to the container's logs (if log monitoring is configured). The problem card shows the restart pattern over time, and OneAgent may capture the last exception before the crash via APM.
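The static disk thresholds mentioned in the scenarios above (80% warning, 90% critical) amount to a simple classification. The sketch below shows that logic locally with Python's standard library; it is an illustration of the threshold idea, not a Dynatrace configuration, and the function names are hypothetical.

```python
import shutil


def classify_disk_usage(used_percent, warning=80.0, critical=90.0):
    """Map a disk usage percentage to a severity using static thresholds."""
    if used_percent >= critical:
        return "critical"
    if used_percent >= warning:
        return "warning"
    return "ok"


def local_used_percent(path="/"):
    """Percentage of the file system at `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# The 98% figure matches the disk usage from the real-world use case.
for sample in (42.0, 85.0, 98.0):
    print(sample, classify_disk_usage(sample))

# Classify the local root file system as a quick sanity check.
print(classify_disk_usage(local_used_percent("/")))
```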
Summary
Dynatrace infrastructure monitoring provides automatic, correlated visibility from bare metal or VMs down to individual containers and Kubernetes pods. The key differentiator is Smartscape — it continuously maps the relationship between infrastructure and services, enabling Davis AI to automatically determine when infrastructure issues cause application degradation without any manual correlation work.