Intermediate | Lesson 4 of 9

Infrastructure Monitoring

Monitor hosts, containers, Kubernetes workloads, and network flows with Dynatrace — automatically, at every layer of the stack.

Simple Explanation (ELI5)

Infrastructure monitoring is watching the physical and virtual machines your software runs on. Dynatrace tracks CPU, memory, disk, network, and container health for every host — and connects that data to the services running on it, so when a VM runs out of memory you can immediately see which service was impacted.

What Dynatrace Monitors at the Infrastructure Layer

Hosts

CPU usage, memory, disk I/O, network throughput, load average, swap, file system capacity — for every bare metal, VM, or cloud instance running OneAgent.

Process Groups

Identical processes grouped across hosts (e.g., all instances of your payment-service). Aggregated metrics + individual process-level visibility including thread counts and memory usage.

Containers

Container-level CPU and memory limits vs usage, OOMKill events, container restarts, and per-container network traffic — automatically correlated to the workload it belongs to.

Kubernetes

Cluster, node, namespace, workload, pod, and container metrics. Deployment events, scaling actions, and pod lifecycle events all visible in context.

Network Flows

OneAgent captures TCP-level connection metrics: data sent/received, retransmissions, connection failures — mapped per process and service to diagnose network-layer issues.

Cloud Resources

AWS EC2, RDS, ELB, Lambda, S3; Azure VMs, App Service, SQL; GCP Compute, GKE — via ActiveGate cloud integrations polling cloud provider APIs.

Smartscape: Full-Stack Topology

Smartscape is Dynatrace's automatically maintained dependency topology. It models five layers:

Datacenter / Environment
Hosts / VMs / Nodes
Process Groups
Services
Applications / Users

When an infrastructure issue occurs (e.g., a host runs out of CPU), Smartscape immediately shows which process groups and services on that host are impacted — enabling vertical root cause analysis from infrastructure to user experience.
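The vertical analysis described above can be pictured as a walk up a layered dependency graph. The sketch below is only an illustration of that idea, not Dynatrace's implementation: the `RUNS_ON_TOP` mapping, the entity names, and the `impacted_entities` helper are all hypothetical.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities that run on top of it.
# Names are illustrative and do not come from a real environment.
RUNS_ON_TOP = {
    "host:prod-04": ["process_group:order-jvm", "process_group:nginx"],
    "process_group:order-jvm": ["service:order-service"],
    "process_group:nginx": ["service:web-frontend"],
    "service:order-service": ["application:shop"],
    "service:web-frontend": ["application:shop"],
}


def impacted_entities(topology, start):
    """Breadth-first walk upward from `start`; returns every entity above it once."""
    seen = []
    queue = deque(topology.get(start, []))
    while queue:
        entity = queue.popleft()
        if entity in seen:
            continue  # already reached through another path
        seen.append(entity)
        queue.extend(topology.get(entity, []))
    return seen


impacted = impacted_entities(RUNS_ON_TOP, "host:prod-04")
```

Starting from the host, the walk reaches the process groups first, then the services, then the application, which mirrors the order of the Smartscape layers.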

Kubernetes Monitoring — Full Detail

| Kubernetes Entity | Metrics Available | Key Signals to Watch |
|---|---|---|
| Cluster | Node count, resource utilisation, pod capacity | High node CPU/memory = scaling needed |
| Node | CPU, memory (requests/limits/actual), pod count | Node near memory limit = OOMKill risk |
| Namespace | Pod count, CPU/memory aggregated, error events | Quota exhaustion, unexpected pod counts |
| Workload (Deployment/DaemonSet) | Ready replicas, desired replicas, restart count | Restart loops, degraded replicas |
| Pod | Container CPU/memory, OOMKill, restart count | High restart count = crash loop |
| Container | CPU throttling %, memory usage vs limit | High CPU throttling = undersized limits |
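The container and pod signals in the table above can be expressed as simple checks. This is a minimal sketch under stated assumptions: the `ContainerSample` type, the `signals` function, and the default thresholds (50% throttling, 90% of the memory limit, more than 3 restarts) are hypothetical and should be tuned per workload.

```python
from dataclasses import dataclass


@dataclass
class ContainerSample:
    name: str
    cpu_throttled_pct: float
    memory_used_bytes: int
    memory_limit_bytes: int
    restarts: int


def signals(sample, throttle_pct=50.0, memory_ratio=0.9, max_restarts=3):
    """Return the list of warning signals that apply to one container sample."""
    found = []
    if sample.cpu_throttled_pct > throttle_pct:
        found.append("cpu-limit-undersized")
    # Only compare against the limit when a limit is actually set.
    if sample.memory_limit_bytes > 0:
        if sample.memory_used_bytes / sample.memory_limit_bytes >= memory_ratio:
            found.append("oomkill-risk")
    if sample.restarts > max_restarts:
        found.append("possible-crash-loop")
    return found


# Illustrative values only.
noisy = ContainerSample("order-service", 72.0, 950, 1000, 5)
quiet = ContainerSample("search-service", 10.0, 100, 1000, 0)
```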

Host Monitoring — Key Metrics

bash — Query host CPU via Dynatrace API
# List all hosts with CPU usage above 85% in last 30 minutes
curl -s -X GET \
  "https://your-env.live.dynatrace.com/api/v2/metrics/query" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -G \
  --data-urlencode "metricSelector=builtin:host.cpu.usage:filter(series(max,gt(85)))" \
  --data-urlencode "resolution=5m" \
  --data-urlencode "from=now-30m" \
  --data-urlencode "entitySelector=type(HOST)"

# Query available memory (reported in bytes, not MB) for production hosts
curl -s -X GET \
  "https://your-env.live.dynatrace.com/api/v2/metrics/query" \
  -H "Authorization: Api-Token YOUR_API_TOKEN" \
  -G \
  --data-urlencode "metricSelector=builtin:host.mem.avail.bytes" \
  --data-urlencode "entitySelector=type(HOST),tag(environment:production)" \
  | jq '.result[0].data[] | {entity: .dimensionMap, values: .values}'
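Once the metrics response is saved as JSON, the same filtering can be done client-side. The sketch below assumes the response shape used by the `jq` filter above (`result[].data[]` with `dimensionMap` and `values`); the `dt.entity.host` dimension key, the host IDs, and the numbers in `SAMPLE` are illustrative assumptions, and `hosts_above` is a hypothetical helper.

```python
def hosts_above(response, threshold):
    """Return {host_id: peak_value} for every series whose peak exceeds threshold."""
    hot = {}
    for result in response.get("result", []):
        for series in result.get("data", []):
            # Data points can be null when no value was recorded for a time slot.
            values = [v for v in series.get("values", []) if v is not None]
            if not values:
                continue
            peak = max(values)
            if peak > threshold:
                host = series.get("dimensionMap", {}).get("dt.entity.host", "unknown")
                hot[host] = peak
    return hot


# Made-up payload in the assumed response shape.
SAMPLE = {
    "result": [
        {
            "metricId": "builtin:host.cpu.usage",
            "data": [
                {
                    "dimensionMap": {"dt.entity.host": "HOST-AAAA"},
                    "timestamps": [1700000000000, 1700000300000, 1700000600000],
                    "values": [42.0, None, 91.5],
                },
                {
                    "dimensionMap": {"dt.entity.host": "HOST-BBBB"},
                    "timestamps": [1700000000000, 1700000300000, 1700000600000],
                    "values": [12.0, 15.5, 20.0],
                },
            ],
        }
    ]
}

hot_hosts = hosts_above(SAMPLE, 85)
```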

Kubernetes Monitoring via DQL (Dynatrace Query Language)

dql — Pod restart count by workload (DQL)
// Find pods with more than 3 restarts in last 1 hour
fetch dt.entity.container_group_instance, from: now()-1h
| filter container.restarts > 3
| fields entity.name, container.restarts, workload.name, namespace.name
| sort container.restarts desc

// CPU throttling per container across production namespace
fetch dt.entity.container_group_instance, from: now()-30m
| filter namespace.name == "production"
| summarize avg_throttle = avg(cpu.throttled.percent), by:{entity.name, workload.name}
| filter avg_throttle > 50
| sort avg_throttle desc

Infrastructure to Application Correlation

A key Dynatrace capability: automatically linking infrastructure events to service impact. When a host's disk fills up and a service shows increased failure rate, Davis AI correlates these two events into a single problem — "Disk space exhaustion on host-prod-04 is causing writes to fail in order-service" — without you needing to manually connect the dots.

Tip: In the Dynatrace UI, every host page shows which services run on it. Every service page shows which host(s) it runs on. Smartscape always keeps this mapping current.
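To make the idea of correlation concrete, here is a deliberately simplified sketch: it links a service event to an infrastructure event when the service runs on the affected host and the two events overlap in time. It does not describe how Davis AI works internally; the `Event` type, the `correlate` function, and all sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    entity: str
    kind: str
    start: int  # epoch seconds
    end: int    # epoch seconds


def overlaps(a, b):
    """True when the two events share at least one instant."""
    return a.start <= b.end and b.start <= a.end


def correlate(infra_event, service_events, services_on_host):
    """Return service events on the affected host that overlap the infra event."""
    hosted = services_on_host.get(infra_event.entity, set())
    return [e for e in service_events if e.entity in hosted and overlaps(infra_event, e)]


# Illustrative data only.
disk_full = Event("host-prod-04", "disk-space-exhausted", 1000, 2000)
service_events = [
    Event("order-service", "failure-rate-increase", 1500, 2500),  # same host, overlapping
    Event("order-service", "failure-rate-increase", 3000, 3500),  # same host, later
    Event("search-service", "slowdown", 1200, 1800),              # different host
]
services_on_host = {"host-prod-04": {"order-service"}}

related = correlate(disk_full, service_events, services_on_host)
```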

Network Monitoring

text — OneAgent network visibility (what is captured automatically)
# Per-process network metrics (captured automatically by OneAgent):
- Bytes sent/received per process per minute
- TCP connection open/close rates
- Connection retransmission count (indicates packet loss)
- Connection failure count (indicates unreachable endpoints)

# Cross-host traffic is mapped in Smartscape:
- payment-service (host-prod-01) --> postgres (host-db-02): 1.2 MB/min
- payment-service (host-prod-01) --> redis (host-cache-01):  450 KB/min
- payment-service (host-prod-01) --> external-payment-gateway: 80 KB/min

# High retransmissions = network congestion or misconfigured MTU
# Connection failures = firewall rules, DNS resolution, or service down
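A retransmission count is easier to interpret as a rate. The sketch below assumes you have both a retransmitted count and a total sent count for each connection. The counts, the 2% threshold, and the function names are hypothetical; only the connection names are reused from the example above.

```python
def retransmission_rate(retransmitted, sent):
    """Retransmitted packets as a percentage of packets sent (0.0 if nothing was sent)."""
    if sent <= 0:
        return 0.0
    return 100.0 * retransmitted / sent


def flag_links(links, max_rate_pct=2.0):
    """Return the names of links whose retransmission rate exceeds max_rate_pct."""
    return [
        name
        for name, (retransmitted, sent) in links.items()
        if retransmission_rate(retransmitted, sent) > max_rate_pct
    ]


# Made-up counts: (retransmitted, sent).
LINKS = {
    "payment-service -> postgres": (30, 1000),
    "payment-service -> redis": (1, 1000),
    "payment-service -> external-payment-gateway": (0, 0),
}

suspect = flag_links(LINKS)
```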

Debugging Scenarios

Real-world Use Case

A DevOps team received a Davis AI problem: "Response time degradation in order-service." The problem card identified the root cause: host prod-node-07 was at 98% disk usage. Investigation in Dynatrace revealed that the application was writing verbose DEBUG logs to a local volume that had not been rotated in 3 weeks. The team cleared the disk and configured log rotation through a cron job, after which service response time returned to normal. Dynatrace surfaced the causal chain (disk to service impact) automatically in the problem card hierarchy, with no manual correlation.

Interview Questions

Beginner

What infrastructure metrics does Dynatrace collect per host?

CPU usage, memory (total/available/used), disk I/O (read/write bytes, IOPS), network (bytes in/out, errors), and load average — all collected automatically by OneAgent.

What is a Process Group in Dynatrace?

A logical grouping of identical processes running across multiple hosts (e.g., all instances of the payment-service JVM). Enables fleet-level metrics aggregated across the group.
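The grouping idea can be sketched as a simple aggregation over per-process records. This is an illustration only: the record fields, the `aggregate_by_group` helper, and the sample values are hypothetical, and Dynatrace's own process group detection uses its own rules.

```python
from collections import defaultdict


def aggregate_by_group(processes):
    """Roll per-process records up into one summary per process group."""
    groups = defaultdict(
        lambda: {"instances": 0, "hosts": set(), "memory_bytes": 0, "threads": 0}
    )
    for proc in processes:
        summary = groups[proc["group"]]
        summary["instances"] += 1
        summary["hosts"].add(proc["host"])
        summary["memory_bytes"] += proc["memory_bytes"]
        summary["threads"] += proc["threads"]
    return dict(groups)


# Illustrative records only.
PROCESSES = [
    {"group": "payment-service", "host": "host-prod-01", "memory_bytes": 512, "threads": 40},
    {"group": "payment-service", "host": "host-prod-02", "memory_bytes": 256, "threads": 35},
    {"group": "redis", "host": "host-cache-01", "memory_bytes": 128, "threads": 4},
]

summary = aggregate_by_group(PROCESSES)
```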

What is CPU throttling in Kubernetes?

When a container tries to use more CPU than its configured limit, the Linux CFS scheduler enforces the cgroup CPU quota by pausing the container until the next scheduling period. High throttling adds latency, which shows up as long response times even when node-level CPU usage is low.
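On Linux, the kernel exposes throttling counters in each cgroup's `cpu.stat` file: `nr_periods` counts enforcement periods and `nr_throttled` counts the periods in which the group was throttled. The sketch below derives a throttling percentage from those two counters. The sample text is made up, and this is not necessarily how Dynatrace computes its own throttling metric.

```python
# Made-up cpu.stat content in the cgroup v2 layout.
SAMPLE_CPU_STAT = """\
usage_usec 8000000
user_usec 6000000
system_usec 2000000
nr_periods 2000
nr_throttled 500
throttled_usec 1250000
"""


def parse_cpu_stat(text):
    """Parse 'key value' lines into a dict of integers, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_percent(stats):
    """Share of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return 100.0 * stats.get("nr_throttled", 0) / periods


throttle_pct = throttled_percent(parse_cpu_stat(SAMPLE_CPU_STAT))
```

With the sample counters, a quarter of the periods were throttled, even though such a container may look idle in an averaged CPU usage chart.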

What does OOMKilled mean?

Out Of Memory Killed — the Linux OOM killer terminated a container process because it exceeded its memory limit. Kubernetes then restarts the container, causing a pod restart event.

What is Smartscape?

Dynatrace's automatically maintained, real-time topology map showing every entity (hosts, processes, services, applications) and every dependency relationship — updated continuously by OneAgent data.

Intermediate

How does Dynatrace link an infrastructure problem to a service impact?

Smartscape tracks which processes (and thus services) run on each host. When an infrastructure anomaly occurs, Davis AI traverses the Smartscape dependency graph to identify services that run on the affected host and correlates them into a single problem card.

How do you monitor AWS services with Dynatrace without a OneAgent on every instance?

Use an ActiveGate with the AWS cloud integration enabled. It polls CloudWatch metrics APIs and imports them into Dynatrace — enabling monitoring of managed services like RDS, Lambda, ELB, and DynamoDB without agents.

What Kubernetes entities does Dynatrace monitor?

Cluster, nodes, namespaces, deployments, DaemonSets, StatefulSets, pods, and containers — including CPU/memory requests and limits, restart counts, and deployment events.

What is the difference between CPU usage and CPU ready time?

CPU usage measures how much CPU time a process or host actually consumes. CPU ready time is a virtualization metric: it measures how long a virtual machine's vCPU was ready to run but had to wait for the hypervisor to schedule it on a physical core. High ready time causes latency even when CPU usage inside the VM appears low.

How do you detect network issues between services using Dynatrace?

OneAgent captures TCP-level connection metrics per process. High retransmission rates indicate packet loss or network congestion. Connection failure spikes indicate unreachable endpoints. Both are visible in the host network analysis view.

Scenario-based

A node in your Kubernetes cluster is at 98% memory. What actions do you take?

1. Check which pods are using the most memory on that node in Dynatrace. 2. Identify if any pod is near its memory limit (OOMKill risk). 3. Cordon the node to prevent new scheduling. 4. Drain non-critical pods to other nodes. 5. Investigate if a memory leak is causing growth.
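Steps 1 and 2 amount to ranking the node's pods by memory and flagging those close to their limit. A minimal sketch, assuming per-pod usage and limit values have already been exported; the `triage` helper, the field names, and the 90% ratio are hypothetical.

```python
def triage(pods, near_limit_ratio=0.9):
    """Return (pod names ranked by memory usage, pod names near their memory limit)."""
    ranked = sorted(pods, key=lambda p: p["memory_bytes"], reverse=True)
    at_risk = [
        p["name"]
        for p in ranked
        # Pods without a limit cannot be "near" it, so they are skipped here.
        if p.get("limit_bytes") and p["memory_bytes"] / p["limit_bytes"] >= near_limit_ratio
    ]
    return [p["name"] for p in ranked], at_risk


# Illustrative values only.
PODS = [
    {"name": "cache-0", "memory_bytes": 950, "limit_bytes": 1000},
    {"name": "api-1", "memory_bytes": 1500, "limit_bytes": 4000},
    {"name": "batch-2", "memory_bytes": 300, "limit_bytes": None},
]

ranked_names, at_risk_names = triage(PODS)
```

Note that the biggest consumer (`api-1`) is not the pod at risk here: `cache-0` uses less memory but sits much closer to its limit.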

Service latency is high but app logs show no errors and APM shows normal response times. Where do you look?

Look outside the service's own measured execution time. Check network metrics first: high TCP retransmissions or connection failures between service instances can delay or time out calls for the caller while the called service itself still reports normal response times. Also check whether a load balancer or service mesh sidecar is adding latency.

Dynatrace shows a host with 95% CPU but the services on it look fine. Is there a problem?

Yes. Even if services look fine now, sustained 95% CPU leaves no headroom for spikes. Check which process is consuming the CPU and whether it is your application or a background process (e.g., log rotation, backup), then plan capacity changes to prevent future impact.

How would you set up alerting for disk space exhaustion before it causes an outage?

Create a custom alert on builtin:host.disk.usedPct with static thresholds, for example 80% as a warning and 90% as critical. Then route the resulting problems to PagerDuty or Teams through a notification integration.
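The threshold logic itself is simple, and a rough projection shows how much warning time a threshold buys. This sketch assumes linear growth, which real disks rarely follow exactly; the function names and the growth figure are hypothetical.

```python
def disk_severity(used_pct, warning=80.0, critical=90.0):
    """Map a disk usage percentage to a severity label."""
    if used_pct >= critical:
        return "critical"
    if used_pct >= warning:
        return "warning"
    return "ok"


def hours_until_full(used_pct, growth_pct_per_hour):
    """Linear estimate of hours until 100% usage; None if the disk is not growing."""
    if growth_pct_per_hour <= 0:
        return None
    return (100.0 - used_pct) / growth_pct_per_hour


severity = disk_severity(85.0)
remaining = hours_until_full(90.0, 2.0)
```

If the projected time at the warning threshold is shorter than your response time, lower the threshold rather than relying on the critical alert.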

A Kubernetes deployment is showing CrashLoopBackOff. How does Dynatrace help?

Dynatrace shows the pod restart event in the workload view and links to the container's logs (if log monitoring is configured). The problem card shows the restart pattern over time, and OneAgent may capture the last exception before the crash via APM.

Summary

Dynatrace infrastructure monitoring provides automatic, correlated visibility from bare metal or VMs down to individual containers and Kubernetes pods. The key differentiator is Smartscape — it continuously maps the relationship between infrastructure and services, enabling Davis AI to automatically determine when infrastructure issues cause application degradation without any manual correlation work.