Hands-on Lesson 13 of 14

Debugging Scenarios

Systematic debugging flows for the most common Helm failures: failed installs, wrong values, template errors, stuck releases, and more.

🧒 Simple Explanation (ELI5)

When something goes wrong with Helm, there's usually one of about 10 root causes. This page gives you a decision tree for each: see the symptom, follow the steps, find the root cause, apply the fix. Think of it as a Helm first-aid manual.

🔍 Master Debug Flowchart

Helm Debug Decision Tree

helm install/upgrade fails →
  Template error? → helm template . 2>&1
  Timeout? → kubectl get events, kubectl describe pod
  Stuck release? → helm rollback, or delete the pending release secret
  Values wrong? → helm get values, helm get manifest

🐛 Scenario 1: Template Rendering Failure

Symptom

text
Error: INSTALLATION FAILED: template: mychart/templates/deployment.yaml:25:20:
executing "mychart/templates/deployment.yaml" at <.Values.image.tag>:
nil pointer evaluating interface {}.tag

Debug Steps

bash
# Step 1: Reproduce locally
helm template test . 2>&1

# Step 2: Render with debug for more context
helm template test . --debug 2>&1

# Step 3: Render specific template
helm template test . -s templates/deployment.yaml 2>&1

# Step 4: Check values being passed
helm template test . --set image.tag=v1 2>&1
# If this works → the value is missing, not a syntax issue

Common Causes & Fixes

Cause → Fix
Nil pointer (.Values.missing.key) → Use a default: {{ .Values.image.tag | default .Chart.AppVersion }}
Unclosed if/range/with block → Ensure every {{ if }}, {{ range }}, and {{ with }} has a matching {{ end }}
Wrong indentation (nindent) → Check that nindent values match the surrounding YAML structure
Missing quotes on a string value → Pipe it through | quote, or quote the literal in the template
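
As a sketch of the nil-pointer fix, assuming the conventional image.repository / image.tag layout in values.yaml:

```yaml
# templates/deployment.yaml (fragment) — tolerate a missing image.tag
# Assumes the conventional image.repository / image.tag keys in values.yaml
spec:
  containers:
    - name: app
      image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
```

With this guard, an unset image.tag falls back to the chart's appVersion instead of failing the render with a nil-pointer error.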

🐛 Scenario 2: Values Not Applied

Symptom

You set --set replicaCount=3 but only 1 pod is running. Or you pass a values file but the config doesn't change.

Debug Steps

bash
# Step 1: Check what values Helm actually used
helm get values myapp -n production
helm get values myapp -n production --all   # Includes defaults

# Step 2: Check rendered manifest
helm get manifest myapp -n production | grep replicas

# Step 3: Compare with expected
helm template test . --set replicaCount=3 | grep replicas

# Step 4: Check value merge order
# CLI --set > -f values-prod.yaml > -f values.yaml > chart defaults

Common Causes & Fixes

Cause → Fix
Case mismatch (replicacount vs replicaCount) → YAML keys are case-sensitive; match the name exactly
Value nested wrong (--set auth.database vs --set postgresql.auth.database) → Subchart values need the parent (subchart) key prefix
Used --reuse-values → Old values override new chart defaults; pass explicit -f files instead
Wrong values file order → Later -f files override earlier ones; reorder them
Template doesn't use the value → Check the templates: grep -r "replicaCount" templates/
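
The subchart nesting rule can be sketched with a hypothetical postgresql dependency (names are illustrative):

```yaml
# values.yaml — subchart values live under the subchart's name
# "postgresql" is a hypothetical dependency declared in Chart.yaml
postgresql:
  auth:
    database: myapp      # the subchart sees this as .Values.auth.database
replicaCount: 3          # parent-chart values stay at the top level
```

Setting auth.database at the top level would be silently ignored by the subchart, which only sees values under its own key.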

🐛 Scenario 3: Upgrade Failed / Timed Out

Symptom

text
Error: UPGRADE FAILED: timed out waiting for the condition

Debug Steps

bash
# Step 1: Check pod status
kubectl get pods -n production
kubectl describe pod <failing-pod> -n production

# Step 2: Check events (recent issues)
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Step 3: Check container logs
kubectl logs <pod-name> -n production
kubectl logs <pod-name> -n production --previous  # Previous crash

# Step 4: Check release status
helm status myapp -n production
helm history myapp -n production

Common Causes & Fixes

Cause → Fix
ImagePullBackOff → Wrong tag or missing registry auth; fix the image/tag value or imagePullSecrets
CrashLoopBackOff → App is crashing; check logs and fix the app code or config
Readiness probe failing → Wrong port or path in the probe; fix values or template
Resource quota exceeded → Reduce resource requests or increase the quota
PVC pending → No matching StorageClass; check kubectl get pvc
Timeout too short → Increase it: --timeout 10m
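
For the probe-related timeouts, a values.yaml sketch (path, port, and timings are illustrative, not taken from a real chart):

```yaml
# values.yaml (fragment) — probe settings that often cause upgrade timeouts
# Path, port, and timings below are illustrative assumptions
readinessProbe:
  httpGet:
    path: /health            # must be a path the app actually serves
    port: 8080               # the containerPort, not the Service port
  initialDelaySeconds: 10    # raise this for slow-starting apps
  periodSeconds: 5
  failureThreshold: 6
```

If the probe targets the wrong port or fires before the app is listening, the pod never becomes Ready and helm upgrade --wait times out.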

🐛 Scenario 4: Stuck Release (pending-upgrade/pending-install)

Symptom

text
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

Debug Steps

bash
# Step 1: Check current status
helm list -n production
helm status myapp -n production

# Step 2: View history
helm history myapp -n production
# Look for STATUS: pending-upgrade or pending-install

# Step 3: Rollback to last good revision
helm rollback myapp <last-deployed-revision> -n production

# Step 4: If rollback fails, manually fix
# List release secrets
kubectl get secrets -n production -l owner=helm,name=myapp

# Delete the stuck pending secret
kubectl delete secret sh.helm.release.v1.myapp.v<N> -n production

# Step 5: Retry
helm upgrade --install myapp . -n production

Prevention

Use --atomic on every install/upgrade. It auto-rolls back on failure, preventing stuck states. Also use --wait --timeout 5m to set clear boundaries.

🐛 Scenario 5: Hook Failure Blocking Install

Symptom

text
Error: failed pre-install: job failed: BackoffLimitExceeded

Debug Steps

bash
# Step 1: Find the hook Job
kubectl get jobs -n production
kubectl describe job myapp-db-migrate -n production

# Step 2: Check hook pod logs
kubectl logs job/myapp-db-migrate -n production

# Step 3: Fix the hook (usually wrong command, missing env var, DB unreachable)
# Edit the template, then:

# Step 4: Clean up failed hook
kubectl delete job myapp-db-migrate -n production

# Step 5: Retry
helm upgrade --install myapp . -n production --atomic
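
One way to avoid the manual cleanup in Step 4 is a delete policy on the hook itself — a sketch of a hypothetical migration Job (names assumed, not from this chart):

```yaml
# templates/db-migrate-job.yaml (fragment) — hypothetical pre-upgrade hook
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "mychart.fullname" . }}-db-migrate
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    # Delete the previous hook Job before creating a new one, and clean up
    # on success, so a failed run does not block the retry with "already exists"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```

With before-hook-creation, each retry replaces the old failed Job automatically instead of colliding with it.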

🐛 Scenario 6: "rendered manifests contain a resource that already exists"

Symptom

text
Error: rendered manifests contain a resource that already exists.
Unable to continue with install: existing resource conflict: ... ServiceAccount

Debug Steps

bash
# Step 1: Find who owns the resource
kubectl get serviceaccount <name> -n production -o yaml | grep -A 3 "annotations"

# Step 2a: If owned by another Helm release → use a unique name
# Check _helpers.tpl fullname template

# Step 2b: If created manually → adopt it into Helm
kubectl annotate serviceaccount <name> -n production \
  meta.helm.sh/release-name=myapp \
  meta.helm.sh/release-namespace=production
kubectl label serviceaccount <name> -n production \
  app.kubernetes.io/managed-by=Helm

# Step 3: Retry
helm upgrade --install myapp . -n production

🐛 Scenario 7: Diff Shows Unexpected Changes

Debug Steps

bash
# Install helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff

# See what will change
helm diff upgrade myapp . -n production -f values-prod.yaml

# Compare revisions
helm diff revision myapp 5 6 -n production

# Common causes of unexpected diff:
# 1. --reuse-values merging old/new values unexpectedly
# 2. Chart upgrade changed default values
# 3. Random generator (password) regenerating each time
#    Fix: Use lookup function or external secret management
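
The lookup fix mentioned above can be sketched as a Secret template (the secret name and key are hypothetical):

```yaml
# templates/secret.yaml (sketch) — reuse an existing password across upgrades
# Secret name "myapp-secret" and key "password" are assumptions
{{- $existing := lookup "v1" "Secret" .Release.Namespace "myapp-secret" }}
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secret
type: Opaque
data:
  {{- if $existing }}
  password: {{ index $existing.data "password" }}    # keep the live value
  {{- else }}
  password: {{ randAlphaNum 16 | b64enc | quote }}   # generated on first install
  {{- end }}
```

Caveat: lookup returns an empty result during helm template and --dry-run (no cluster access), so those renders always take the generate branch; only a real install/upgrade preserves the value.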

🐛 Scenario 8: Network / Service Discovery Issues

Symptom

Pods are Running but can't communicate with each other, or Ingress doesn't route traffic.

Debug Steps

bash
# Step 1: Verify Service and Endpoints
kubectl get svc -n production
kubectl get endpoints myapp -n production
# If endpoints list is EMPTY → labels don't match

# Step 2: Compare Service selector with Pod labels
kubectl get svc myapp -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels
# Ensure selectors match pod labels EXACTLY

# Step 3: Test DNS resolution from inside the cluster
kubectl run dns-test --rm -it --image=busybox -n production -- nslookup myapp
kubectl run dns-test --rm -it --image=busybox -n production -- nslookup myapp.production.svc.cluster.local

# Step 4: Test connectivity to the service
kubectl run curl-test --rm -it --image=curlimages/curl -n production -- \
  curl -v http://myapp:80/health

# Step 5: Check NetworkPolicies blocking traffic
kubectl get networkpolicies -n production
kubectl describe networkpolicy -n production
# If NetworkPolicy exists, verify it allows the necessary ingress/egress

# Step 6: For Ingress issues
kubectl describe ingress myapp -n production
kubectl get events -n production | grep ingress
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller

🔗 K8s Connection

Helm creates Services using label selectors from _helpers.tpl. If you override nameOverride or fullnameOverride mid-release, selectors may break. Always check kubectl get endpoints — empty endpoints means the selector doesn't match any pods.
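
A minimal Service fragment shows why selector and pod labels stay in sync only when both come from the same helper (the helper name "mychart.selectorLabels" is the standard scaffold convention; yours may differ):

```yaml
# templates/service.yaml (fragment) — selector and pod labels share one helper
# "mychart.selectorLabels" is the scaffold-generated helper name; adjust to yours
spec:
  type: ClusterIP
  selector:
    {{- include "mychart.selectorLabels" . | nindent 4 }}
```

If the Deployment's pod template uses the same include, a rename can never desynchronize them; hand-written selectors are where empty-endpoints bugs come from.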

🐛 Scenario 9: Slow Rollout / Pod Scheduling Delays

Symptom

Helm upgrade hangs, pods stay in Pending for a long time, or rollout is much slower than expected.

Debug Steps

bash
# Step 1: Check pod status and scheduling
kubectl get pods -n production -o wide
kubectl describe pod <pending-pod> -n production
# Look at Events section for:
#   FailedScheduling: 0/5 nodes are available: insufficient cpu
#   FailedScheduling: 0/5 nodes are available: pod has unbound PersistentVolumeClaims

# Step 2: Check node resources
kubectl top nodes
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# Step 3: Check resource quotas on the namespace
kubectl describe resourcequota -n production
# If quota is exceeded, reduce resource requests or increase quota

# Step 4: Check PodDisruptionBudget blocking rollout
kubectl get pdb -n production
kubectl describe pdb myapp-pdb -n production
# If minAvailable is too high, new pods can't be scheduled during rolling update

# Step 5: Check Deployment rollout strategy
kubectl get deploy myapp -n production -o jsonpath='{.spec.strategy}'
# With RollingUpdate: maxUnavailable=0 + maxSurge=1 is slowest but safest

# Step 6: Monitor rollout progress
kubectl rollout status deployment/myapp -n production
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

Common Causes & Fixes

Cause → Fix
Insufficient cluster resources → Scale the cluster, reduce resource requests, or use priority classes
PDB blocking eviction → Temporarily lower minAvailable or scale up replicas first
Node affinity/taints mismatch → Check nodeSelector, tolerations, and affinity in values
Slow image pull → Pre-pull images, use imagePullPolicy: IfNotPresent, check registry speed
Slow readiness probe → Increase initialDelaySeconds, check probe endpoint performance
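
For the PDB cause from Step 4, a sketch of a budget that leaves room for disruptions (names are the scaffold conventions, assumed rather than taken from this chart):

```yaml
# templates/pdb.yaml (sketch) — a PDB that leaves room for voluntary evictions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "mychart.fullname" . }}-pdb
spec:
  # Keep this below the replica count: minAvailable == replicas blocks
  # every voluntary eviction and stalls node drains
  minAvailable: 1
  selector:
    matchLabels:
      {{- include "mychart.selectorLabels" . | nindent 6 }}
```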

🛠️ Essential Debug Toolkit

Command → When to Use
helm template . 2>&1 → Template won't render
helm lint . → Quick syntax check
helm get values <rel> -n <ns> → Values seem wrong
helm get manifest <rel> -n <ns> → See what was actually deployed
helm history <rel> -n <ns> → Check revision history
helm status <rel> -n <ns> → Current release state
helm diff upgrade ... → Preview changes before applying
kubectl get events --sort-by='.lastTimestamp' → Recent cluster events
kubectl describe pod <name> → Pod scheduling/startup issues
kubectl logs <pod> --previous → Logs from the previous (crashed) container

🎯 Interview Questions

Beginner

Q: How do you debug a Helm template error?

Run helm template <name> . 2>&1 to see the exact error with file and line number. Use --debug for more context. Render individual templates with -s templates/deployment.yaml. Also helm lint for quick checks.

Q: Your helm upgrade timed out. What do you check first?

1) kubectl get pods -n <ns> — are pods starting? 2) kubectl describe pod <name> — events section shows why pods aren't ready. 3) kubectl get events — recent cluster events. Common: image pull failures, crash loops, probe failures.

Q: How do you check what values are currently used by a release?

helm get values <release> -n <ns> shows user-supplied values. Add --all to include computed defaults. helm get manifest <release> -n <ns> shows the actual rendered YAML applied to the cluster.

Q: What is 'pending-upgrade' status?

The release got stuck during an upgrade — Helm started but didn't finish (timeout, crash, cancelled). Further operations are blocked. Fix: helm rollback <rel> <rev> to the last good revision. Prevention: always use --atomic.

Q: How do you see the actual Kubernetes manifests a release deployed?

helm get manifest <release> -n <ns>. This shows the rendered YAML that Helm applied to the cluster. Compare with helm template output to check if values are being applied correctly.

Intermediate

Q: How does Helm's 3-way merge affect debugging?

Helm 3 does a 3-way merge: old chart manifest, live state, new chart manifest. If someone manually edited a resource (kubectl edit), Helm detects it. This can cause unexpected diffs. Debug with helm diff to see what Helm thinks changed. Fix by ensuring all changes go through Helm, not kubectl.

Q: What is the difference between helm template and helm install --dry-run?

helm template is purely local — renders templates without cluster access. helm install --dry-run talks to the cluster API to validate resources (checks apiVersions, CRDs, quotas) but doesn't create them. Use template for quick checks, dry-run for full validation.

Q: How do you debug why a value isn't being applied to a subchart?

1) Confirm nesting: subchart values go under subchartName: key. 2) helm get values <rel> -n <ns> --all — check if value is in the output. 3) helm get manifest — search for the expected resource. 4) Use helm template with --debug to see the full values tree.

Q: How would you debug a random password regenerating on every upgrade?

Template uses randAlphaNum which generates a new value each render. Fix: use the lookup function to check if the Secret exists first: {{ $existing := lookup "v1" "Secret" .Release.Namespace "myapp-secret" }}. If it exists, reuse the value. Otherwise generate. Or use an external secret manager.

Q: Your release shows "deployed" but pods are in CrashLoopBackOff. How?

Helm marks success when resources are created (without --wait) or ready (with --wait). Without --wait, Helm doesn't check pod status. The Deployment exists (deployed) but pods crash. Fix: always use --wait --timeout. Check: kubectl logs <pod>, kubectl describe pod.

Scenario-Based

Q: Production upgrade failed at 2 AM, release is stuck in pending-upgrade. Walk through your response.

1) helm status myapp -n prod — confirm state. 2) helm history myapp -n prod — find last deployed revision. 3) helm rollback myapp <rev> -n prod — restore service. 4) kubectl get pods -n prod — verify healthy. 5) Next day: helm diff upgrade to review what changed, fix the issue, deploy to staging first, then prod with --atomic.

Q: A teammate manually kubectl edited a Deployment. Now helm upgrade shows changes you didn't make. What happened?

Helm 3's 3-way merge: it sees the live state differs from both the old and new chart manifests. Helm will try to reconcile. Use helm diff upgrade to see all changes. The manual edit will be reverted to match the chart. Team rule: never kubectl edit Helm-managed resources; always change through values.yaml and helm upgrade.

Q: helm upgrade succeeds but the new version has a bug. How do you recover?

helm rollback myapp <previous-revision> -n prod. This re-deploys the previous chart+values combination. Verify with kubectl get pods and helm test. Then fix the bug, deploy to staging, validate, and re-deploy to prod. Use helm history to identify the correct revision number.

Q: You see "no matches for kind Deployment in version apps/v1beta1" after a cluster upgrade. Fix?

The chart uses a deprecated apiVersion removed in the new K8s version. Fix: update the chart's templates to use current apiVersions (apps/v1). Use .Capabilities.APIVersions.Has for conditional apiVersion selection if supporting multiple K8s versions. Also: helm template | kubeconform to check compatibility.
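
The Capabilities-based selection mentioned in the answer can be sketched like this (fragment, assuming a chart that must support very old clusters):

```yaml
# templates/deployment.yaml (fragment) — choose apiVersion per cluster
{{- if .Capabilities.APIVersions.Has "apps/v1" }}
apiVersion: apps/v1
{{- else }}
apiVersion: apps/v1beta1
{{- end }}
kind: Deployment
```

In practice, apps/v1 has been available since Kubernetes 1.9, so most charts can simply hardcode it and drop the conditional.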

Q: Your Helm test passes in staging but fails in production. What's different?

Compare environments: 1) helm get values myapp -n staging --all vs prod. 2) Different network policies in prod? 3) Different resource limits causing OOM? 4) Different external dependencies (DB, API endpoints)? 5) DNS resolution differences. Run helm test --logs in both and compare. Check kubectl describe pod for the test pod in prod.

🌍 Real-World Debugging: The Friday 5PM Incident

A team pushed a Helm upgrade to production on Friday evening. Monitoring shows 503 errors spiking. Here's the full investigation:

bash
# 5:01 PM — Alert fires: 503 errors on /api/checkout

# Step 1: Is it Helm or K8s?
helm status payment-api -n prod
# STATUS: deployed — Helm thinks it's fine

# Step 2: Check pods
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api
# 2/3 pods in CrashLoopBackOff, 1 still running (old revision)

# Step 3: Why are they crashing?
kubectl logs payment-api-7f8b9-xk2mn -n prod
# Error: STRIPE_API_KEY environment variable not set

# Step 4: What changed?
helm diff revision payment-api 11 12 -n prod
# Reveals: env block restructured, STRIPE_API_KEY moved under a new key

# Step 5: Check values that were applied
helm get values payment-api -n prod
# Missing: payment.stripe.apiKeySecret

# Step 6: Immediate fix — rollback
helm rollback payment-api 11 -n prod
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api
# All 3 pods Running, 503s stop

# Step 7: Monday — Fix properly
# Update values-prod.yaml with the new key structure
# Test with: helm template test . -f values-prod.yaml | grep STRIPE
# Deploy to staging first, then prod with --atomic

Lesson Learned

The root cause was a chart refactor that moved env var keys under a new section, but the production values file wasn't updated to match. Always run helm diff upgrade before applying to production, and never deploy on Fridays without --atomic.
