Hands-on Lesson 13 of 14

Debugging Scenarios

Systematic debugging flows for the most common Helm failures: failed installs, wrong values, template errors, stuck releases, and more.

🧒 Simple Explanation (ELI5)

When something goes wrong with Helm, there's usually one of about 10 root causes. This page gives you a decision tree for each: see the symptom, follow the steps, find the root cause, apply the fix. Think of it as a Helm first-aid manual.

🔍 Master Debug Flowchart

Helm Debug Decision Tree

helm install/upgrade fails →
  Template error? → helm template . 2>&1
  Timeout? → kubectl get events, kubectl describe pod
  Stuck release? → helm rollback, or delete the pending release secret
  Values wrong? → helm get values, helm get manifest

🐛 Scenario 1: Template Rendering Failure

Symptom

text
Error: INSTALLATION FAILED: template: mychart/templates/deployment.yaml:25:20:
executing "mychart/templates/deployment.yaml" at <.Values.image.tag>:
nil pointer evaluating interface {}.tag

Debug Steps

bash
# Step 1: Reproduce locally
helm template test . 2>&1

# Step 2: Render with debug for more context
helm template test . --debug 2>&1

# Step 3: Render specific template
helm template test . -s templates/deployment.yaml 2>&1

# Step 4: Check values being passed
helm template test . --set image.tag=v1 2>&1
# If this works → the value is missing, not a syntax issue

Common Causes & Fixes

Cause → Fix
Nil pointer (.Values.missing.key) → Use a default: {{ .Values.image.tag | default .Chart.AppVersion }}
Unclosed if/range/with block → Ensure every {{ if }}, {{ range }}, and {{ with }} has a matching {{ end }}
Wrong indentation (nindent) → Check that nindent values match the surrounding YAML structure
Missing quotes on a string value → Pipe it through | quote, or quote the literal in the template
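
As a sketch of the nil-pointer fix, assuming the conventional image.repository / image.tag layout in values.yaml:

```yaml
# templates/deployment.yaml (fragment) — tolerate a missing image.tag
# Assumes the conventional image.repository / image.tag keys in values.yaml
spec:
  containers:
    - name: app
      image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
```

With this guard, an unset image.tag falls back to the chart's appVersion instead of failing the render with a nil-pointer error.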

🐛 Scenario 2: Values Not Applied

Symptom

You set --set replicaCount=3 but only 1 pod is running. Or you pass a values file but the config doesn't change.

Debug Steps

bash
# Step 1: Check what values Helm actually used
helm get values myapp -n production
helm get values myapp -n production --all   # Includes defaults

# Step 2: Check rendered manifest
helm get manifest myapp -n production | grep replicas

# Step 3: Compare with expected
helm template test . --set replicaCount=3 | grep replicas

# Step 4: Check value merge order
# CLI --set > -f values-prod.yaml > -f values.yaml > chart defaults

Common Causes & Fixes

Cause → Fix
Case mismatch (replicacount vs replicaCount) → YAML keys are case-sensitive; match the name exactly
Value nested wrong (--set auth.database vs --set postgresql.auth.database) → Subchart values need the parent (subchart) key prefix
Used --reuse-values → Old values override new chart defaults; pass explicit -f files instead
Wrong values file order → Later -f files override earlier ones; reorder them
Template doesn't use the value → Check the templates: grep -r "replicaCount" templates/
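
The subchart nesting rule can be sketched with a hypothetical postgresql dependency (names are illustrative):

```yaml
# values.yaml — subchart values live under the subchart's name
# "postgresql" is a hypothetical dependency declared in Chart.yaml
postgresql:
  auth:
    database: myapp      # the subchart sees this as .Values.auth.database
replicaCount: 3          # parent-chart values stay at the top level
```

Setting auth.database at the top level would be silently ignored by the subchart, which only sees values under its own key.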

🐛 Scenario 3: Upgrade Failed / Timed Out

Symptom

text
Error: UPGRADE FAILED: timed out waiting for the condition

Debug Steps

bash
# Step 1: Check pod status
kubectl get pods -n production
kubectl describe pod <failing-pod> -n production

# Step 2: Check events (recent issues)
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

# Step 3: Check container logs
kubectl logs <pod-name> -n production
kubectl logs <pod-name> -n production --previous  # Previous crash

# Step 4: Check release status
helm status myapp -n production
helm history myapp -n production

Common Causes & Fixes

Cause → Fix
ImagePullBackOff → Wrong tag or missing registry auth; fix the image/tag value or imagePullSecrets
CrashLoopBackOff → App is crashing; check logs and fix the app code or config
Readiness probe failing → Wrong port or path in the probe; fix values or template
Resource quota exceeded → Reduce resource requests or increase the quota
PVC pending → No matching StorageClass; check kubectl get pvc
Timeout too short → Increase it: --timeout 10m
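
For the probe-related timeouts, a values.yaml sketch (path, port, and timings are illustrative, not taken from a real chart):

```yaml
# values.yaml (fragment) — probe settings that often cause upgrade timeouts
# Path, port, and timings below are illustrative assumptions
readinessProbe:
  httpGet:
    path: /health            # must be a path the app actually serves
    port: 8080               # the containerPort, not the Service port
  initialDelaySeconds: 10    # raise this for slow-starting apps
  periodSeconds: 5
  failureThreshold: 6
```

If the probe targets the wrong port or fires before the app is listening, the pod never becomes Ready and helm upgrade --wait times out.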

🐛 Scenario 4: Stuck Release (pending-upgrade/pending-install)

Symptom

text
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

Debug Steps

bash
# Step 1: Check current status
helm list -n production
helm status myapp -n production

# Step 2: View history
helm history myapp -n production
# Look for STATUS: pending-upgrade or pending-install

# Step 3: Rollback to last good revision
helm rollback myapp <last-deployed-revision> -n production

# Step 4: If rollback fails, manually fix
# List release secrets
kubectl get secrets -n production -l owner=helm,name=myapp

# Delete the stuck pending secret
kubectl delete secret sh.helm.release.v1.myapp.v<N> -n production

# Step 5: Retry
helm upgrade --install myapp . -n production

Prevention

Use --atomic on every install/upgrade. It auto-rolls back on failure, preventing stuck states. Also use --wait --timeout 5m to set clear boundaries.

🐛 Scenario 5: Hook Failure Blocking Install

Symptom

text
Error: failed pre-install: job failed: BackoffLimitExceeded

Debug Steps

bash
# Step 1: Find the hook Job
kubectl get jobs -n production
kubectl describe job myapp-db-migrate -n production

# Step 2: Check hook pod logs
kubectl logs job/myapp-db-migrate -n production

# Step 3: Fix the hook (usually wrong command, missing env var, DB unreachable)
# Edit the template, then:

# Step 4: Clean up failed hook
kubectl delete job myapp-db-migrate -n production

# Step 5: Retry
helm upgrade --install myapp . -n production --atomic
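
One way to avoid the manual cleanup in Step 4 is a delete policy on the hook itself — a sketch of a hypothetical migration Job (names assumed, not from this chart):

```yaml
# templates/db-migrate-job.yaml (fragment) — hypothetical pre-upgrade hook
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "mychart.fullname" . }}-db-migrate
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    # Delete the previous hook Job before creating a new one, and clean up
    # on success, so a failed run does not block the retry with "already exists"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```

With before-hook-creation, each retry replaces the old failed Job automatically instead of colliding with it.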

🐛 Scenario 6: "rendered manifests contain a resource that already exists"

Symptom

text
Error: rendered manifests contain a resource that already exists.
Unable to continue with install: existing resource conflict: ... ServiceAccount

Debug Steps

bash
# Step 1: Find who owns the resource
kubectl get serviceaccount <name> -n production -o yaml | grep -A 3 "annotations"

# Step 2a: If owned by another Helm release → use a unique name
# Check _helpers.tpl fullname template

# Step 2b: If created manually → adopt it into Helm
kubectl annotate serviceaccount <name> -n production \
  meta.helm.sh/release-name=myapp \
  meta.helm.sh/release-namespace=production
kubectl label serviceaccount <name> -n production \
  app.kubernetes.io/managed-by=Helm

# Step 3: Retry
helm upgrade --install myapp . -n production

🐛 Scenario 7: Diff Shows Unexpected Changes

Debug Steps

bash
# Install helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff

# See what will change
helm diff upgrade myapp . -n production -f values-prod.yaml

# Compare revisions
helm diff revision myapp 5 6 -n production

# Common causes of unexpected diff:
# 1. --reuse-values merging old/new values unexpectedly
# 2. Chart upgrade changed default values
# 3. Random generator (password) regenerating each time
#    Fix: Use lookup function or external secret management
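
The lookup fix mentioned above can be sketched as a Secret template (the secret name and key are hypothetical):

```yaml
# templates/secret.yaml (sketch) — reuse an existing password across upgrades
# Secret name "myapp-secret" and key "password" are assumptions
{{- $existing := lookup "v1" "Secret" .Release.Namespace "myapp-secret" }}
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secret
type: Opaque
data:
  {{- if $existing }}
  password: {{ index $existing.data "password" }}    # keep the live value
  {{- else }}
  password: {{ randAlphaNum 16 | b64enc | quote }}   # generated on first install
  {{- end }}
```

Caveat: lookup returns an empty result during helm template and --dry-run (no cluster access), so those renders always take the generate branch; only a real install/upgrade preserves the value.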

🐛 Scenario 8: Network / Service Discovery Issues

Symptom

Pods are Running but can't communicate with each other, or Ingress doesn't route traffic.

Debug Steps

bash
# Step 1: Verify Service and Endpoints
kubectl get svc -n production
kubectl get endpoints myapp -n production
# If endpoints list is EMPTY → labels don't match

# Step 2: Compare Service selector with Pod labels
kubectl get svc myapp -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels
# Ensure selectors match pod labels EXACTLY

# Step 3: Test DNS resolution from inside the cluster
kubectl run dns-test --rm -it --image=busybox -n production -- nslookup myapp
kubectl run dns-test --rm -it --image=busybox -n production -- nslookup myapp.production.svc.cluster.local

# Step 4: Test connectivity to the service
kubectl run curl-test --rm -it --image=curlimages/curl -n production -- \
  curl -v http://myapp:80/health

# Step 5: Check NetworkPolicies blocking traffic
kubectl get networkpolicies -n production
kubectl describe networkpolicy -n production
# If NetworkPolicy exists, verify it allows the necessary ingress/egress

# Step 6: For Ingress issues
kubectl describe ingress myapp -n production
kubectl get events -n production | grep ingress
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller

🔗 K8s Connection

Helm creates Services using label selectors from _helpers.tpl. If you override nameOverride or fullnameOverride mid-release, selectors may break. Always check kubectl get endpoints — empty endpoints means the selector doesn't match any pods.
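
A minimal Service fragment shows why selector and pod labels stay in sync only when both come from the same helper (the helper name "mychart.selectorLabels" is the standard scaffold convention; yours may differ):

```yaml
# templates/service.yaml (fragment) — selector and pod labels share one helper
# "mychart.selectorLabels" is the scaffold-generated helper name; adjust to yours
spec:
  type: ClusterIP
  selector:
    {{- include "mychart.selectorLabels" . | nindent 4 }}
```

If the Deployment's pod template uses the same include, a rename can never desynchronize them; hand-written selectors are where empty-endpoints bugs come from.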

🐛 Scenario 9: Slow Rollout / Pod Scheduling Delays

Symptom

Helm upgrade hangs, pods stay in Pending for a long time, or rollout is much slower than expected.

Debug Steps

bash
# Step 1: Check pod status and scheduling
kubectl get pods -n production -o wide
kubectl describe pod <pending-pod> -n production
# Look at Events section for:
#   FailedScheduling: 0/5 nodes are available: insufficient cpu
#   FailedScheduling: 0/5 nodes are available: pod has unbound PersistentVolumeClaims

# Step 2: Check node resources
kubectl top nodes
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# Step 3: Check resource quotas on the namespace
kubectl describe resourcequota -n production
# If quota is exceeded, reduce resource requests or increase quota

# Step 4: Check PodDisruptionBudget blocking rollout
kubectl get pdb -n production
kubectl describe pdb myapp-pdb -n production
# If minAvailable is too high, new pods can't be scheduled during rolling update

# Step 5: Check Deployment rollout strategy
kubectl get deploy myapp -n production -o jsonpath='{.spec.strategy}'
# With RollingUpdate: maxUnavailable=0 + maxSurge=1 is slowest but safest

# Step 6: Monitor rollout progress
kubectl rollout status deployment/myapp -n production
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

Common Causes & Fixes

Cause → Fix
Insufficient cluster resources → Scale the cluster, reduce resource requests, or use priority classes
PDB blocking eviction → Temporarily lower minAvailable or scale up replicas first
Node affinity/taints mismatch → Check nodeSelector, tolerations, and affinity in values
Slow image pull → Pre-pull images, use imagePullPolicy: IfNotPresent, check registry speed
Slow readiness probe → Increase initialDelaySeconds, check probe endpoint performance
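
For the PDB cause from Step 4, a sketch of a budget that leaves room for disruptions (names are the scaffold conventions, assumed rather than taken from this chart):

```yaml
# templates/pdb.yaml (sketch) — a PDB that leaves room for voluntary evictions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "mychart.fullname" . }}-pdb
spec:
  # Keep this below the replica count: minAvailable == replicas blocks
  # every voluntary eviction and stalls node drains
  minAvailable: 1
  selector:
    matchLabels:
      {{- include "mychart.selectorLabels" . | nindent 6 }}
```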

🛠️ Essential Debug Toolkit

Command → When to Use
helm template . 2>&1 → Template won't render
helm lint . → Quick syntax check
helm get values <rel> -n <ns> → Values seem wrong
helm get manifest <rel> -n <ns> → See what was actually deployed
helm history <rel> -n <ns> → Check revision history
helm status <rel> -n <ns> → Current release state
helm diff upgrade ... → Preview changes before applying
kubectl get events --sort-by='.lastTimestamp' → Recent cluster events
kubectl describe pod <name> → Pod scheduling/startup issues
kubectl logs <pod> --previous → Logs from the previous (crashed) container

🎯 Interview Questions

Beginner

Q: How do you debug a Helm template error?

Run helm template <name> . 2>&1 to see the exact error with file and line number. Use --debug for more context. Render individual templates with -s templates/deployment.yaml. Also helm lint for quick checks.

Q: Your helm upgrade timed out. What do you check first?

1) kubectl get pods -n <ns> — are pods starting? 2) kubectl describe pod <name> — events section shows why pods aren't ready. 3) kubectl get events — recent cluster events. Common: image pull failures, crash loops, probe failures.

Q: How do you check what values are currently used by a release?

helm get values <release> -n <ns> shows user-supplied values. Add --all to include computed defaults. helm get manifest <release> -n <ns> shows the actual rendered YAML applied to the cluster.

Q: What is 'pending-upgrade' status?

The release got stuck during an upgrade — Helm started but didn't finish (timeout, crash, cancelled). Further operations are blocked. Fix: helm rollback <rel> <rev> to the last good revision. Prevention: always use --atomic.

Q: How do you see the actual Kubernetes manifests a release deployed?

helm get manifest <release> -n <ns>. This shows the rendered YAML that Helm applied to the cluster. Compare with helm template output to check if values are being applied correctly.

Intermediate

Q: How does Helm's 3-way merge affect debugging?

Helm 3 does a 3-way merge: old chart manifest, live state, new chart manifest. If someone manually edited a resource (kubectl edit), Helm detects it. This can cause unexpected diffs. Debug with helm diff to see what Helm thinks changed. Fix by ensuring all changes go through Helm, not kubectl.

Q: What is the difference between helm template and helm install --dry-run?

helm template is purely local — renders templates without cluster access. helm install --dry-run talks to the cluster API to validate resources (checks apiVersions, CRDs, quotas) but doesn't create them. Use template for quick checks, dry-run for full validation.

Q: How do you debug why a value isn't being applied to a subchart?

1) Confirm nesting: subchart values go under subchartName: key. 2) helm get values <rel> -n <ns> --all — check if value is in the output. 3) helm get manifest — search for the expected resource. 4) Use helm template with --debug to see the full values tree.

Q: How would you debug a random password regenerating on every upgrade?

Template uses randAlphaNum which generates a new value each render. Fix: use the lookup function to check if the Secret exists first: {{ $existing := lookup "v1" "Secret" .Release.Namespace "myapp-secret" }}. If it exists, reuse the value. Otherwise generate. Or use an external secret manager.

Q: Your release shows "deployed" but pods are in CrashLoopBackOff. How?

Helm marks success when resources are created (without --wait) or ready (with --wait). Without --wait, Helm doesn't check pod status. The Deployment exists (deployed) but pods crash. Fix: always use --wait --timeout. Check: kubectl logs <pod>, kubectl describe pod.

Scenario-Based

Q: Production upgrade failed at 2 AM, release is stuck in pending-upgrade. Walk through your response.

1) helm status myapp -n prod — confirm state. 2) helm history myapp -n prod — find last deployed revision. 3) helm rollback myapp <rev> -n prod — restore service. 4) kubectl get pods -n prod — verify healthy. 5) Next day: helm diff upgrade to review what changed, fix the issue, deploy to staging first, then prod with --atomic.

Q: A teammate manually kubectl edited a Deployment. Now helm upgrade shows changes you didn't make. What happened?

Helm 3's 3-way merge: it sees the live state differs from both the old and new chart manifests. Helm will try to reconcile. Use helm diff upgrade to see all changes. The manual edit will be reverted to match the chart. Team rule: never kubectl edit Helm-managed resources; always change through values.yaml and helm upgrade.

Q: helm upgrade succeeds but the new version has a bug. How do you recover?

helm rollback myapp <previous-revision> -n prod. This re-deploys the previous chart+values combination. Verify with kubectl get pods and helm test. Then fix the bug, deploy to staging, validate, and re-deploy to prod. Use helm history to identify the correct revision number.

Q: You see "no matches for kind Deployment in version apps/v1beta1" after a cluster upgrade. Fix?

The chart uses a deprecated apiVersion removed in the new K8s version. Fix: update the chart's templates to use current apiVersions (apps/v1). Use .Capabilities.APIVersions.Has for conditional apiVersion selection if supporting multiple K8s versions. Also: helm template | kubeconform to check compatibility.
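
The Capabilities-based selection mentioned in the answer can be sketched like this (fragment, assuming a chart that must support very old clusters):

```yaml
# templates/deployment.yaml (fragment) — choose apiVersion per cluster
{{- if .Capabilities.APIVersions.Has "apps/v1" }}
apiVersion: apps/v1
{{- else }}
apiVersion: apps/v1beta1
{{- end }}
kind: Deployment
```

In practice, apps/v1 has been available since Kubernetes 1.9, so most charts can simply hardcode it and drop the conditional.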

Q: Your Helm test passes in staging but fails in production. What's different?

Compare environments: 1) helm get values myapp -n staging --all vs prod. 2) Different network policies in prod? 3) Different resource limits causing OOM? 4) Different external dependencies (DB, API endpoints)? 5) DNS resolution differences. Run helm test --logs in both and compare. Check kubectl describe pod for the test pod in prod.

🌍 Real-World Debugging: The Friday 5PM Incident

A team pushed a Helm upgrade to production on Friday evening. Monitoring shows 503 errors spiking. Here's the full investigation:

bash
# 5:01 PM — Alert fires: 503 errors on /api/checkout

# Step 1: Is it Helm or K8s?
helm status payment-api -n prod
# STATUS: deployed — Helm thinks it's fine

# Step 2: Check pods
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api
# 2/3 pods in CrashLoopBackOff, 1 still running (old revision)

# Step 3: Why are they crashing?
kubectl logs payment-api-7f8b9-xk2mn -n prod
# Error: STRIPE_API_KEY environment variable not set

# Step 4: What changed?
helm diff revision payment-api 11 12 -n prod
# Reveals: env block restructured, STRIPE_API_KEY moved under a new key

# Step 5: Check values that were applied
helm get values payment-api -n prod
# Missing: payment.stripe.apiKeySecret

# Step 6: Immediate fix — rollback
helm rollback payment-api 11 -n prod
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api
# All 3 pods Running, 503s stop

# Step 7: Monday — Fix properly
# Update values-prod.yaml with the new key structure
# Test with: helm template test . -f values-prod.yaml | grep STRIPE
# Deploy to staging first, then prod with --atomic

Lesson Learned

The root cause was a chart refactor that moved env var keys under a new section, but the production values file wasn't updated to match. Always run helm diff upgrade before applying to production, and never deploy on Fridays without --atomic.
