💡 Tip: Use --atomic on every install and upgrade. It automatically rolls back on failure, preventing stuck releases. Pair it with --wait --timeout 5m to set clear time boundaries.
Debugging Scenarios
Systematic debugging flows for the most common Helm failures: failed installs, wrong values, template errors, stuck releases, and more.
🧒 Simple Explanation (ELI5)
When something goes wrong with Helm, there's usually one of about 10 root causes. This page gives you a decision tree for each: see the symptom, follow the steps, find the root cause, apply the fix. Think of it as a Helm first-aid manual.
🐛 Scenario 1: Template Rendering Failure
Symptom
Error: INSTALLATION FAILED: template: mychart/templates/deployment.yaml:25:20:
executing "mychart/templates/deployment.yaml" at <.Values.image.tag>:
nil pointer evaluating interface {}.tag
Debug Steps
# Step 1: Reproduce locally
helm template test . 2>&1
# Step 2: Render with debug for more context
helm template test . --debug 2>&1
# Step 3: Render a specific template
helm template test . -s templates/deployment.yaml 2>&1
# Step 4: Check the values being passed
helm template test . --set image.tag=v1 2>&1
# If this works → the value is missing, not a syntax issue
Common Causes & Fixes
| Cause | Fix |
|---|---|
| Nil pointer (`.Values.missing.key`) | Use a default: `{{ .Values.image.tag \| default .Chart.AppVersion }}` |
| Unclosed `if`/`range`/`with` block | Ensure every `{{ if }}` has a matching `{{ end }}` |
| Wrong indentation (`nindent`) | Check that `nindent` values match the YAML structure |
| Missing quotes on a string | Use `\| quote` or `{{ "value" }}` |
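The fixes in the table can be combined in one template excerpt. A minimal sketch for a hypothetical chart (the keys `image.repository`, `image.tag`, and `logLevel` are illustrative, not from the original):

```yaml
# templates/deployment.yaml (excerpt, hypothetical chart)
containers:
  - name: app
    # default prevents the nil-pointer error when .Values.image.tag is unset
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
    env:
      - name: LOG_LEVEL
        # quote keeps values like "true" or "1.0" from being parsed as non-strings
        value: {{ .Values.logLevel | default "info" | quote }}
```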
🐛 Scenario 2: Values Not Applied
Symptom
You set --set replicaCount=3 but only 1 pod is running. Or you pass a values file but the config doesn't change.
Debug Steps
# Step 1: Check what values Helm actually used
helm get values myapp -n production
helm get values myapp -n production --all   # Includes defaults
# Step 2: Check the rendered manifest
helm get manifest myapp -n production | grep replicas
# Step 3: Compare with expected
helm template test . --set replicaCount=3 | grep replicas
# Step 4: Check the value merge order
# CLI --set > -f values-prod.yaml > -f values.yaml > chart defaults
Common Causes & Fixes
| Cause | Fix |
|---|---|
| Case mismatch (`replicacount` vs `replicaCount`) | YAML is case-sensitive. Match the key exactly. |
| Value nested wrong (`--set auth.database` vs `--set postgresql.auth.database`) | Subchart values need the parent key prefix. |
| Used `--reuse-values` | Old values override new defaults. Pass explicit `-f` files instead. |
| Wrong values file order | Later `-f` files override earlier ones. |
| Template doesn't use the value | Check the template: `grep -r "replicaCount" templates/` |
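The subchart-prefix rule from the table looks like this in the parent chart's values file. A sketch assuming a hypothetical `postgresql` dependency and database name:

```yaml
# values.yaml of the parent chart: values for a subchart must sit
# under a key named after the dependency
replicaCount: 3        # top-level keys configure the parent chart itself
postgresql:            # everything below is passed to the postgresql subchart
  auth:
    database: myapp_db # equivalent CLI form: --set postgresql.auth.database=myapp_db
```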
🐛 Scenario 3: Upgrade Failed / Timed Out
Symptom
Error: UPGRADE FAILED: timed out waiting for the condition
Debug Steps
# Step 1: Check pod status
kubectl get pods -n production
kubectl describe pod <failing-pod> -n production
# Step 2: Check events (recent issues)
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
# Step 3: Check container logs
kubectl logs <pod-name> -n production
kubectl logs <pod-name> -n production --previous   # Previous crash
# Step 4: Check release status
helm status myapp -n production
helm history myapp -n production
Common Causes & Fixes
| Cause | Fix |
|---|---|
| ImagePullBackOff | Wrong tag, missing registry auth. Fix image/tag value. |
| CrashLoopBackOff | App crash. Check logs. Fix app code or config. |
| Readiness probe failing | Wrong port/path in probe. Fix values or template. |
| Resource quota exceeded | Reduce resource requests or increase quota. |
| PVC pending | No matching StorageClass. Check kubectl get pvc. |
| Timeout too short | Increase: --timeout 10m |
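One of the commonest rows above, a failing readiness probe, usually traces back to the probe settings in values. A sketch with hypothetical keys (the chart's template must wire them into the Deployment's `readinessProbe`):

```yaml
# values.yaml (hypothetical probe settings)
readinessProbe:
  httpGet:
    path: /health        # must be an endpoint the app actually serves
    port: 8080           # must match the containerPort, not the Service port
  initialDelaySeconds: 10  # give the app time to boot before the first check
  periodSeconds: 5
  failureThreshold: 3
```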
🐛 Scenario 4: Stuck Release (pending-upgrade/pending-install)
Symptom
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Debug Steps
# Step 1: Check current status
helm list -n production
helm status myapp -n production
# Step 2: View history
helm history myapp -n production
# Look for STATUS: pending-upgrade or pending-install
# Step 3: Rollback to the last good revision
helm rollback myapp <last-deployed-revision> -n production
# Step 4: If rollback fails, fix manually
# List the release secrets
kubectl get secrets -n production -l owner=helm,name=myapp
# Delete the stuck pending secret
kubectl delete secret sh.helm.release.v1.myapp.v<N> -n production
# Step 5: Retry
helm upgrade --install myapp . -n production
🐛 Scenario 5: Hook Failure Blocking Install
Symptom
Error: failed pre-install: job failed: BackoffLimitExceeded
Debug Steps
# Step 1: Find the hook Job
kubectl get jobs -n production
kubectl describe job myapp-db-migrate -n production
# Step 2: Check the hook pod logs
kubectl logs job/myapp-db-migrate -n production
# Step 3: Fix the hook (usually a wrong command, missing env var, or unreachable DB)
# Edit the template, then:
# Step 4: Clean up the failed hook
kubectl delete job myapp-db-migrate -n production
# Step 5: Retry
helm upgrade --install myapp . -n production --atomic
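The manual cleanup in Step 4 can be avoided with hook-delete-policy annotations. A sketch of the relevant metadata for a hypothetical migration Job (the name helper and backoff value are illustrative):

```yaml
# templates/db-migrate-job.yaml (excerpt)
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "myapp.fullname" . }}-db-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    # delete the previous Job before creating a new one, so retries
    # don't fail with "job already exists"; also clean up failed runs
    "helm.sh/hook-delete-policy": before-hook-creation,hook-failed
spec:
  backoffLimit: 3   # after 3 failed pod attempts → BackoffLimitExceeded
```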
🐛 Scenario 6: "rendered manifests contain a resource that already exists"
Symptom
Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: ... ServiceAccount
Debug Steps
# Step 1: Find who owns the resource
kubectl get serviceaccount <name> -n production -o yaml | grep -A 3 "annotations"
# Step 2a: If owned by another Helm release → use a unique name
# Check the fullname template in _helpers.tpl
# Step 2b: If created manually → adopt it into Helm
kubectl annotate serviceaccount <name> -n production \
  meta.helm.sh/release-name=myapp \
  meta.helm.sh/release-namespace=production
kubectl label serviceaccount <name> -n production \
  app.kubernetes.io/managed-by=Helm
# Step 3: Retry
helm upgrade --install myapp . -n production
🐛 Scenario 7: Diff Shows Unexpected Changes
Debug Steps
# Install the helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff
# See what will change
helm diff upgrade myapp . -n production -f values-prod.yaml
# Compare revisions
helm diff revision myapp 5 6 -n production
# Common causes of an unexpected diff:
# 1. --reuse-values merging old and new values unexpectedly
# 2. Chart upgrade changed default values
# 3. Random generator (password) regenerating on each render
# Fix: use the lookup function or external secret management
🐛 Scenario 8: Network / Service Discovery Issues
Symptom
Pods are Running but can't communicate with each other, or Ingress doesn't route traffic.
Debug Steps
# Step 1: Verify Service and Endpoints
kubectl get svc -n production
kubectl get endpoints myapp -n production
# If endpoints list is EMPTY → labels don't match
# Step 2: Compare Service selector with Pod labels
kubectl get svc myapp -n production -o jsonpath='{.spec.selector}'
kubectl get pods -n production --show-labels
# Ensure selectors match pod labels EXACTLY
# Step 3: Test DNS resolution from inside the cluster
kubectl run dns-test --rm -it --image=busybox -n production -- nslookup myapp
kubectl run dns-test --rm -it --image=busybox -n production -- nslookup myapp.production.svc.cluster.local
# Step 4: Test connectivity to the service
kubectl run curl-test --rm -it --image=curlimages/curl -n production -- \
curl -v http://myapp:80/health
# Step 5: Check NetworkPolicies blocking traffic
kubectl get networkpolicies -n production
kubectl describe networkpolicy -n production
# If NetworkPolicy exists, verify it allows the necessary ingress/egress
# Step 6: For Ingress issues
kubectl describe ingress myapp -n production
kubectl get events -n production | grep ingress
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
Helm creates Services using label selectors defined in _helpers.tpl. If you override nameOverride or fullnameOverride mid-release, those selectors can break. Always check `kubectl get endpoints`: an empty endpoints list means the selector doesn't match any pods.
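The selector-mismatch failure mode is easiest to avoid by defining the label set once and reusing it in both places. A sketch following the standard `helm create` scaffolding (the `myapp` helper names are assumptions):

```yaml
# _helpers.tpl: one label set feeds both the Service selector and the
# pod template labels; if they ever diverge, endpoints come up empty
{{- define "myapp.selectorLabels" -}}
app.kubernetes.io/name: {{ include "myapp.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

# templates/service.yaml (excerpt): reuse the helper, never hand-write labels
spec:
  selector:
    {{- include "myapp.selectorLabels" . | nindent 4 }}
```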
🐛 Scenario 9: Slow Rollout / Pod Scheduling Delays
Symptom
Helm upgrade hangs, pods stay in Pending for a long time, or rollout is much slower than expected.
Debug Steps
# Step 1: Check pod status and scheduling
kubectl get pods -n production -o wide
kubectl describe pod <pending-pod> -n production
# Look at Events section for:
# FailedScheduling: 0/5 nodes are available: insufficient cpu
# FailedScheduling: 0/5 nodes are available: pod has unbound PersistentVolumeClaims
# Step 2: Check node resources
kubectl top nodes
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
# Step 3: Check resource quotas on the namespace
kubectl describe resourcequota -n production
# If quota is exceeded, reduce resource requests or increase quota
# Step 4: Check PodDisruptionBudget blocking rollout
kubectl get pdb -n production
kubectl describe pdb myapp-pdb -n production
# If minAvailable is too high, new pods can't be scheduled during rolling update
# Step 5: Check Deployment rollout strategy
kubectl get deploy myapp -n production -o jsonpath='{.spec.strategy}'
# With RollingUpdate: maxUnavailable=0 + maxSurge=1 is slowest but safest
# Step 6: Monitor rollout progress
kubectl rollout status deployment/myapp -n production
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
Common Causes & Fixes
| Cause | Fix |
|---|---|
| Insufficient cluster resources | Scale cluster, reduce resource requests, or use priority classes |
| PDB blocking eviction | Temporarily lower minAvailable or increase replicas first |
| Node affinity/taints mismatch | Check nodeSelector, tolerations, affinity in values |
| Slow image pull | Pre-pull images, use imagePullPolicy: IfNotPresent, check registry speed |
| Slow readiness probe | Increase initialDelaySeconds, check probe endpoint performance |
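The PDB row above is worth seeing concretely. A sketch of a PodDisruptionBudget that leaves rollout headroom (names and numbers are illustrative):

```yaml
# With 3 replicas, minAvailable: 3 would forbid evicting any pod and can
# stall a rolling update indefinitely; minAvailable: 2 allows one pod
# down at a time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
```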
🛠️ Essential Debug Toolkit
| Command | When to Use |
|---|---|
| `helm template . 2>&1` | Template won't render |
| `helm lint .` | Quick syntax check |
| `helm get values <rel> -n <ns>` | Values seem wrong |
| `helm get manifest <rel> -n <ns>` | See what was deployed |
| `helm history <rel> -n <ns>` | Check revision history |
| `helm status <rel> -n <ns>` | Current release state |
| `helm diff upgrade ...` | Preview changes before applying |
| `kubectl get events --sort-by='.lastTimestamp'` | Recent cluster events |
| `kubectl describe pod <name>` | Pod scheduling/start issues |
| `kubectl logs <pod> --previous` | Crash logs |
🎯 Interview Questions
Beginner
**Q: How do you debug a template rendering error?**
Run helm template <name> . 2>&1 to see the exact error with file and line number. Use --debug for more context. Render individual templates with -s templates/deployment.yaml. Also run helm lint for a quick check.
**Q: An install succeeded but the app isn't working. What are your first checks?**
1) kubectl get pods -n <ns> — are pods starting? 2) kubectl describe pod <name> — the events section shows why pods aren't ready. 3) kubectl get events — recent cluster events. Common culprits: image pull failures, crash loops, probe failures.
**Q: How do you see which values a deployed release is actually using?**
helm get values <release> -n <ns> shows user-supplied values. Add --all to include computed defaults. helm get manifest <release> -n <ns> shows the actual rendered YAML applied to the cluster.
**Q: What does the pending-upgrade status mean, and how do you fix it?**
The release got stuck during an upgrade: Helm started but didn't finish (timeout, crash, cancellation). Further operations are blocked. Fix: helm rollback <rel> <rev> to the last good revision. Prevention: always use --atomic.
**Q: How do you see exactly what YAML Helm deployed to the cluster?**
helm get manifest <release> -n <ns>. This shows the rendered YAML that Helm applied to the cluster. Compare it with helm template output to check whether values are being applied correctly.
Intermediate
**Q: How does Helm 3's three-way merge affect debugging?**
Helm 3 does a 3-way merge between the old chart manifest, the live state, and the new chart manifest. If someone manually edited a resource (kubectl edit), Helm detects it, which can cause unexpected diffs. Debug with helm diff to see what Helm thinks changed. Fix by ensuring all changes go through Helm, not kubectl.
**Q: What's the difference between helm template and helm install --dry-run?**
helm template is purely local: it renders templates without cluster access. helm install --dry-run talks to the cluster API to validate resources (checks apiVersions, CRDs, quotas) but doesn't create them. Use template for quick checks, dry-run for full validation.
**Q: A subchart value isn't being applied. How do you debug it?**
1) Confirm nesting: subchart values go under the subchartName: key. 2) helm get values <rel> -n <ns> --all — check whether the value appears in the output. 3) helm get manifest — search for the expected resource. 4) Use helm template with --debug to see the full values tree.
**Q: A generated Secret changes on every upgrade. Why, and how do you fix it?**
The template uses randAlphaNum, which generates a new value on each render. Fix: use the lookup function to check whether the Secret already exists: {{ $existing := lookup "v1" "Secret" .Release.Namespace "myapp-secret" }}. If it exists, reuse its value; otherwise generate one. Or use an external secret manager.
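The lookup-or-generate pattern described above can be sketched as a template (the `myapp-secret` name and key are illustrative; note that lookup returns an empty object during `helm template` and dry runs, so this only behaves as intended against a live cluster):

```yaml
# templates/secret.yaml
{{- $existing := lookup "v1" "Secret" .Release.Namespace "myapp-secret" }}
apiVersion: v1
kind: Secret
metadata:
  name: myapp-secret
type: Opaque
data:
  {{- if $existing }}
  # reuse the already-base64-encoded value from the live Secret
  password: {{ index $existing.data "password" }}
  {{- else }}
  # first install: generate once
  password: {{ randAlphaNum 24 | b64enc }}
  {{- end }}
```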
**Q: Helm reports the release as deployed, but the pods are crashing. Why?**
Helm marks success when resources are created (without --wait) or ready (with --wait). Without --wait, Helm doesn't check pod status: the Deployment exists ("deployed") but the pods crash. Fix: always use --wait --timeout. Check with kubectl logs <pod> and kubectl describe pod.
Scenario-Based
**Q: Production is down right after an upgrade. Walk through your response.**
1) helm status myapp -n prod — confirm the state. 2) helm history myapp -n prod — find the last deployed revision. 3) helm rollback myapp <rev> -n prod — restore service. 4) kubectl get pods -n prod — verify health. 5) Next day: helm diff upgrade to review what changed, fix the issue, deploy to staging first, then prod with --atomic.
**Q: Someone manually edited a Helm-managed resource with kubectl. What happens on the next upgrade?**
Helm 3's 3-way merge sees that the live state differs from both the old and new chart manifests and will try to reconcile. Use helm diff upgrade to see all changes; the manual edit will be reverted to match the chart. Team rule: never kubectl edit Helm-managed resources; always change values.yaml and run helm upgrade.
**Q: A bug shipped in the latest release. How do you recover?**
helm rollback myapp <previous-revision> -n prod. This re-deploys the previous chart+values combination. Verify with kubectl get pods and helm test. Then fix the bug, deploy to staging, validate, and re-deploy to prod. Use helm history to identify the correct revision number.
**Q: An upgrade that worked before fails after a Kubernetes version upgrade. What's likely wrong?**
The chart uses a deprecated apiVersion that was removed in the new K8s version. Fix: update the chart's templates to current apiVersions (e.g., apps/v1). Use .Capabilities.APIVersions.Has for conditional apiVersion selection if you need to support multiple K8s versions. Also: helm template | kubeconform to check compatibility.
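The .Capabilities check mentioned above can be sketched like this (Ingress is used as the example resource; the template structure is an assumption, not from the original chart):

```yaml
# templates/ingress.yaml (excerpt): pick the apiVersion the cluster
# actually serves, so one chart supports both old and new Kubernetes
{{- if .Capabilities.APIVersions.Has "networking.k8s.io/v1/Ingress" }}
apiVersion: networking.k8s.io/v1
{{- else }}
apiVersion: networking.k8s.io/v1beta1
{{- end }}
kind: Ingress
```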
**Q: helm test passes in staging but fails in prod. How do you investigate?**
Compare environments: 1) helm get values myapp -n staging --all vs prod. 2) Different network policies in prod? 3) Different resource limits causing OOM? 4) Different external dependencies (DB, API endpoints)? 5) DNS resolution differences. Run helm test --logs in both and compare, and check kubectl describe pod for the test pod in prod.
🌍 Real-World Debugging: The Friday 5PM Incident
A team pushed a Helm upgrade to production on Friday evening. Monitoring shows 503 errors spiking. Here's the full investigation:
# 5:01 PM: Alert fires: 503 errors on /api/checkout
# Step 1: Is it Helm or K8s?
helm status payment-api -n prod
# STATUS: deployed (Helm thinks it's fine)
# Step 2: Check pods
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api
# 2/3 pods in CrashLoopBackOff, 1 still running (old revision)
# Step 3: Why are they crashing?
kubectl logs payment-api-7f8b9-xk2mn -n prod
# Error: STRIPE_API_KEY environment variable not set
# Step 4: What changed?
helm diff revision payment-api 11 12 -n prod
# Reveals: env block restructured, STRIPE_API_KEY moved under a new key
# Step 5: Check the values that were applied
helm get values payment-api -n prod
# Missing: payment.stripe.apiKeySecret
# Step 6: Immediate fix: rollback
helm rollback payment-api 11 -n prod
kubectl get pods -n prod -l app.kubernetes.io/name=payment-api
# All 3 pods Running, 503s stop
# Step 7: Monday: fix properly
# Update values-prod.yaml with the new key structure
# Test with: helm template test . -f values-prod.yaml | grep STRIPE
# Deploy to staging first, then prod with --atomic
The root cause was a chart refactor that moved env var keys under a new section, but the production values file wasn't updated to match. Always run helm diff upgrade before applying to production, and never deploy on Fridays without --atomic.
📝 Summary
- `helm template` + `helm lint` catch most template issues before install
- `helm get values` / `helm get manifest` reveal what's actually deployed
- `kubectl describe pod` + `kubectl get events` explain runtime failures
- `kubectl get endpoints` reveals Service → Pod selector mismatches
- `helm rollback` is your emergency recovery: always know the last good revision
- `--atomic --wait --timeout` prevents most stuck-state issues
- Run `helm diff upgrade` before every production deploy to preview changes