Advanced · Lesson 12 of 16

⚙️ Production Operations, Health Checks and Resource Limits

Run containers safely in production — set memory and CPU limits, configure health checks, implement graceful shutdown, and handle logging properly.

🧒 Simple Explanation (ELI5)

Running a container in production without limits is like letting a tenant in your building use unlimited electricity and water with no circuit breakers. One runaway process can starve all other containers on the host. Resource limits are the fuses and breakers — they protect everyone by capping what any one container can consume.

⚠️
Always set resource limits in production

Without limits, a memory leak or runaway process can exhaust the host's memory and take down every container on it. In Kubernetes, pods that set no requests or limits run in the BestEffort QoS class and are among the first to be evicted under node pressure. Every production container should have memory and CPU limits.

🔧 Resource Limits

bash
# Memory limits
# --memory:             hard limit; the container is OOM-killed if it exceeds this
# --memory-reservation: soft limit; enforced when the host is under memory pressure
# --memory-swap:        memory + swap total; equal to --memory means no swap
docker run -d \
  --memory 512m \
  --memory-reservation 256m \
  --memory-swap 512m \
  myapp:1.0

# CPU limits
# --cpus:       at most 1.5 CPU cores
# --cpu-shares: relative weight under contention (default 1024)
docker run -d \
  --cpus 1.5 \
  --cpu-shares 512 \
  myapp:1.0

# Both together (typical production settings)
docker run -d \
  --name api \
  --memory 512m \
  --memory-reservation 256m \
  --cpus 1.0 \
  --restart unless-stopped \
  myapp:1.0

# Monitor resource usage
docker stats                    # live view of all containers
docker stats --no-stream api    # single snapshot
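The "relative weight" behind `--cpu-shares` can be made concrete with a little arithmetic. The sketch below assumes every listed container is busy and competing for the same cores; in that situation CPU time is divided roughly in proportion to the share values. The container names `api` and `worker` are hypothetical, and real scheduling is only approximately this tidy.

```javascript
// Approximate CPU split between busy containers competing for the same cores.
// Input: map of container name -> --cpu-shares value (names are hypothetical).
function cpuFractions(shares) {
  const total = Object.values(shares).reduce((sum, s) => sum + s, 0);
  const result = {};
  for (const [name, s] of Object.entries(shares)) {
    result[name] = s / total; // each container's fraction of contended CPU time
  }
  return result;
}

// One container started with --cpu-shares 512, one left at the default 1024.
const split = cpuFractions({ api: 512, worker: 1024 });
console.log(split); // api gets about one third, worker about two thirds
```

Shares only matter under contention: on an idle host a container with 512 shares can still use all available CPU unless `--cpus` caps it.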

🏥 Health Checks

dockerfile
# Dockerfile HEALTHCHECK instruction
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:8080/healthz || exit 1

# Only the last HEALTHCHECK in a Dockerfile takes effect, so pick ONE form.
# Alternative for images without curl:
# HEALTHCHECK CMD wget -qO- http://localhost:8080/healthz || exit 1

# For Node.js, use a lightweight custom check script:
# HEALTHCHECK CMD node healthcheck.js
bash
# Check container health status
docker inspect --format "{{.State.Health.Status}}" myapp
# Outputs: starting / healthy / unhealthy

# View the health check log
docker inspect --format "{{json .State.Health}}" myapp | python3 -m json.tool
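The Dockerfile above refers to a `healthcheck.js` script without showing one. A minimal sketch of what such a script might contain follows, using only the Node.js standard library. The function name `probe`, the 2-second default timeout, and the `/healthz` URL are assumptions for illustration, not part of any Docker or Node API.

```javascript
const http = require('http');

// Resolve to true only if the URL answers with HTTP 200 within timeoutMs.
// Any error, timeout, or other status resolves to false.
function probe(url, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body so the socket is released
      resolve(res.statusCode === 200);
    });
    // 'timeout' only signals inactivity; destroying the request triggers 'error'
    req.on('timeout', () => req.destroy());
    req.on('error', () => resolve(false));
  });
}

// When saved as healthcheck.js, finish the file with a line such as:
//   probe('http://localhost:8080/healthz').then((ok) => process.exit(ok ? 0 : 1));
```

Docker interprets exit code 0 as healthy and 1 as unhealthy, which is why the suggested last line maps the boolean onto those two codes.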

🚪 Graceful Shutdown

javascript
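// Node.js graceful shutdown: handle SIGTERM from docker stop / K8s
// Assumes `app` (e.g. an Express app) and `PORT` are defined earlier in the file.
const server = app.listen(PORT, () => console.log('Server started'));

process.on('SIGTERM', () => {
  console.log('SIGTERM received, shutting down gracefully');

  // Stop accepting new connections; the callback runs once in-flight requests finish
  server.close(() => {
    // Close DB connections, flush logs, etc.
    console.log('Server closed');
    process.exit(0);
  });

  // Force exit after 30s if server.close hangs.
  // docker stop sends SIGKILL after 10s by default, so start the container with
  // --stop-timeout above 30 (or lower this value) for the timer to matter.
  setTimeout(() => process.exit(1), 30000).unref();
});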

📝 Production Logging

bash
# Containers should log to stdout/stderr; Docker captures both streams
# Avoid writing log files inside the container

# Configure the log driver (default is json-file)
# max-size=10m: rotate at 10 MB; max-file=3: keep 3 rotated files
docker run -d \
  --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=3 \
  myapp:1.0

# Use syslog / fluentd / Azure Monitor for production log aggregation
docker run -d \
  --log-driver fluentd \
  --log-opt fluentd-address=localhost:24224 \
  myapp:1.0

# View logs (with remote drivers such as fluentd this relies on the
# dual-logging cache in Docker Engine 20.10+)
docker logs myapp
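On the application side, logging to stdout/stderr can be as simple as writing one JSON object per line, which most aggregators parse without extra configuration. The following is a minimal sketch; the helper names `formatLog` and `log` and the field names are assumptions, and a real service would more likely use an established logging library.

```javascript
// Build one JSON log line; `now` is injectable so output is reproducible.
function formatLog(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({ time: now.toISOString(), level, message, ...fields });
}

// Errors and warnings go to stderr, everything else to stdout.
// Docker captures both streams through the configured log driver.
function log(level, message, fields) {
  const line = formatLog(level, message, fields) + '\n';
  const stream = level === 'error' || level === 'warn' ? process.stderr : process.stdout;
  stream.write(line);
}

log('info', 'server started', { port: 8080 });
log('error', 'db connection failed', { retryInMs: 500 });
```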

🐛 Debugging Scenario

Problem: Container keeps restarting with exit code 137 — OOMKilled.

bash
# Step 1: confirm the OOM kill
docker inspect <id> | grep -i oom
# "OOMKilled": true

# Step 2: watch the actual memory usage before the next crash
docker stats --no-stream <id>

# Step 3: look at historical usage (if monitoring is set up)
# Azure Monitor / Grafana / Prometheus node_memory metrics

# Step 4: fix options
# a) Increase the memory limit
docker run -d --memory 1g myapp:1.0
# b) Fix the memory leak in the application (profiling needed)
# c) Cap the Node.js heap below the container limit, e.g.
#    NODE_OPTIONS=--max-old-space-size=512 (value in MB)
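To see how close a Node.js process is to its limit from inside the container, the application can read the limit from the cgroup filesystem and compare it with its own resident memory. This is a rough sketch under stated assumptions: the two paths are the usual cgroup v2 and v1 locations, the function names are hypothetical, and RSS is only an approximation of what the kernel charges to the container.

```javascript
const fs = require('fs');

// cgroup v2 path first, then cgroup v1.
const LIMIT_FILES = [
  '/sys/fs/cgroup/memory.max',
  '/sys/fs/cgroup/memory/memory.limit_in_bytes',
];

// Convert the file contents to bytes, or null when no limit is set.
function parseLimit(text) {
  const value = text.trim();
  if (value === 'max') return null; // cgroup v2 reports "max" when unlimited
  const bytes = Number(value);
  // cgroup v1 reports a huge number when unlimited; treat that as no limit
  if (!Number.isFinite(bytes) || bytes <= 0 || bytes > Number.MAX_SAFE_INTEGER) return null;
  return bytes;
}

function readMemoryLimit() {
  for (const file of LIMIT_FILES) {
    try {
      return parseLimit(fs.readFileSync(file, 'utf8'));
    } catch {
      // file not present on this system; try the next candidate
    }
  }
  return null;
}

function memoryReport(limitBytes, rssBytes = process.memoryUsage().rss) {
  const toMb = (b) => Math.round(b / 1024 / 1024);
  if (limitBytes === null) return `rss=${toMb(rssBytes)}MB limit=none`;
  const pct = Math.round((rssBytes / limitBytes) * 100);
  return `rss=${toMb(rssBytes)}MB limit=${toMb(limitBytes)}MB used=${pct}%`;
}

console.log(memoryReport(readMemoryLimit()));
```

Logging this line periodically gives a trail of memory growth leading up to an OOM kill, even when no external monitoring is in place.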

🎯 Interview Questions

What happens when a container exceeds its memory limit?

The Linux kernel's out-of-memory (OOM) killer terminates the container's process with SIGKILL, so the container exits with code 137 (128 + 9, where 9 is the signal number of SIGKILL). Docker records this as OOMKilled: true in the container state. Docker (and Kubernetes) may then restart the container, depending on the restart policy. To diagnose: docker inspect <id> | grep OOMKilled. Fix: increase the memory limit, fix the memory leak, or reduce heap usage.
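Exit code 137 follows the common convention that a process killed by a signal is reported as 128 plus the signal number. A small sketch of that decoding, using the signal table Node.js exposes in `os.constants.signals`; the function name `signalFromExitCode` is hypothetical:

```javascript
const os = require('os');

// Map an exit code above 128 back to the signal that ended the process.
// Returns the signal name, or null if the code does not encode a known signal.
function signalFromExitCode(code) {
  if (!Number.isInteger(code) || code <= 128) return null;
  const signalNumber = code - 128;
  const entry = Object.entries(os.constants.signals)
    .find(([, number]) => number === signalNumber);
  return entry ? entry[0] : null;
}

console.log(signalFromExitCode(137)); // SIGKILL (128 + 9)
console.log(signalFromExitCode(143)); // SIGTERM (128 + 15)
```

Note that 137 only says the process received SIGKILL; the OOMKilled flag from docker inspect is what confirms the kernel's OOM killer was the sender.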

What should a Docker health check endpoint return?

The health endpoint should return HTTP 200 when the application is ready to receive traffic. It should check only the minimum required dependencies (DB connection, config loaded), not deep downstream dependencies whose outages would mark a working container as unhealthy. A full dependency check belongs in a separate /ready endpoint. The check should be fast (under 2s, well inside the health check timeout) because it runs every 30s. When unhealthy, return an error status such as 503 so that the check command exits non-zero; Docker treats exit code 0 as healthy and 1 as unhealthy.
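A minimal sketch of such an endpoint using only the Node.js standard library is shown here. The `state` flags, the function names, and the response body shape are assumptions for illustration; a real service would update the flags from its own startup and connection logic.

```javascript
const http = require('http');

// Hypothetical application state, updated elsewhere by the service itself.
const state = { configLoaded: true, dbConnected: true };

// Decide the HTTP status from the minimum required dependencies only.
function healthStatus(s) {
  const healthy = s.configLoaded && s.dbConnected;
  return {
    code: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'unhealthy', ...s },
  };
}

// Build (but do not start) a server that answers GET /healthz.
function createHealthServer(s) {
  return http.createServer((req, res) => {
    if (req.url !== '/healthz') {
      res.statusCode = 404;
      res.end();
      return;
    }
    const { code, body } = healthStatus(s);
    res.writeHead(code, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(body));
  });
}

// Usage: createHealthServer(state).listen(8080);
```

Keeping the decision in a pure function makes the check cheap to run and easy to unit test without opening a port.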

Scenario: Your API container in production stops accepting requests but does NOT restart. The process is running but hanging. How do you fix it?

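This is a liveness failure scenario. Without a health check, Docker cannot detect a hung application, because the process is still running. Fix: 1) Add a HEALTHCHECK to the Dockerfile, or 2) configure --health-cmd, --health-interval, and --health-retries at run time. 3) In Kubernetes, add a livenessProbe; the kubelet restarts the container when the probe fails. 4) On a standalone Docker host, be aware that --restart unless-stopped reacts only to the container exiting, not to an unhealthy status. Swarm services replace unhealthy tasks automatically; on plain Docker you need either an external watchdog that restarts unhealthy containers or an application that exits when it detects it is stuck. 5) Add application-level watchdog timeouts for hung requests.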
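An application-level watchdog timeout can be as small as a wrapper that rejects when an operation takes too long, so a stuck database call or upstream request fails fast instead of hanging the handler. The helper name `withTimeout` and the labels are hypothetical; this is one possible sketch, not a prescribed pattern.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// The underlying operation is not cancelled; the caller just stops waiting.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage inside a request handler (queryDatabase is a hypothetical function):
//   const rows = await withTimeout(queryDatabase(), 5000, 'db query');
```

Node's HTTP server also has built-in `server.requestTimeout` and `server.headersTimeout` settings, which cover slow clients rather than slow application code.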

📋 Summary