Hands-on Lesson 15 of 16

Hands-on: Debugging Terraform Failures

This lesson provides practical runbooks for three common real-world failures: state locks, drift, and applies that fail mid-run.

Triage Framework

| Symptom | Likely Area | First Check |
|---|---|---|
| init fails | Provider or backend | Version and backend config |
| plan fails | Auth or schema mismatch | Credentials and provider args |
| state lock error | Concurrent run or stale lock | Active jobs and lock metadata |
| unexpected replacement | Drift or refactor | State address and immutable fields |
| apply failed mid-run | Platform dependency or permission | Partial creation and cloud-side error details |

Useful Commands

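```bash
# Check configuration syntax and internal consistency
terraform validate
# Preview changes without applying them
terraform plan
# List every resource address tracked in state
terraform state list
# Show the recorded attributes of one resource
terraform state show azurerm_kubernetes_cluster.platform
# Release a stale lock (replace LOCK_ID with the ID from the lock error)
terraform force-unlock LOCK_ID
```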
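
When the commands above do not explain a failure, verbose logs are often the next step. The sketch below wraps any command with Terraform's `TF_LOG` and `TF_LOG_PATH` environment variables set. The function name `with_tf_debug` and the default log file name are illustrative assumptions, not part of the lesson.

```shell
# Run any command with Terraform debug logging enabled.
# TF_LOG and TF_LOG_PATH are standard Terraform environment variables;
# the function name and default log path are hypothetical choices.
with_tf_debug() {
  TF_LOG=DEBUG TF_LOG_PATH="${TF_LOG_PATH:-./terraform-debug.log}" "$@"
}
```

For example, `with_tf_debug terraform plan` would write debug output to `./terraform-debug.log` unless `TF_LOG_PATH` is already set.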

Runbook 1: State Lock

  1. Check if a pipeline job is still running.
  2. If none is running, inspect recent failed jobs for an interrupted apply.
  3. Only after confirming that no run holds the lock, run terraform force-unlock LOCK_ID, using the lock ID shown in the error message.
  4. Run terraform plan again to verify state consistency.
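
The lock ID needed in step 3 appears in the "Lock Info" section of the lock error. As a minimal sketch, the helper below pulls that ID out of captured Terraform output. The function name `extract_lock_id` is hypothetical, and the code assumes the error prints a line of the form `ID: <value>`; the exact layout may differ between Terraform versions.

```shell
# Read Terraform output on stdin and print the value that follows "ID:".
# Assumes the lock error contains a line such as "  ID:        <lock-id>".
# Scanning every field lets the match survive a leading frame character.
extract_lock_id() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }'
}
```

A possible use, once you have confirmed that no job holds the lock: `lock_id="$(terraform plan -no-color 2>&1 | extract_lock_id)"`, followed by `terraform force-unlock "$lock_id"`, which asks for confirmation before releasing the lock.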

Runbook 2: Drift Issue

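  1. Run terraform plan and identify resources with unexpected changes.
  2. Compare Terraform code, state, and live Azure config.
  3. Decide ownership: keep the manual change or revert to the code-defined intent.
  4. Reconcile by updating the code, applying to revert the manual change, or importing an unmanaged resource, then re-run terraform plan to confirm no unexpected changes remain.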
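
For steps 1 and 4, `terraform plan -detailed-exitcode` reports the result through its exit status: 0 for no changes, 1 for an error, 2 for changes present. The sketch below turns that status into a label a pipeline could act on. The function name and the labels are illustrative assumptions.

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
# 0 = plan succeeded with no changes, 1 = error, 2 = changes present.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;
    1) echo "error" ;;
    2) echo "changes-present" ;;
    *) echo "unknown" ;;
  esac
}
```

One way to use it is to run `terraform plan -detailed-exitcode -no-color > plan.txt` and then call `describe_plan_exit "$?"` on the next line. A `changes-present` result on a branch with no code changes is a hint that drift is worth investigating.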

Runbook 3: Failed Apply Mid-Run

  1. Do not rerun apply blindly.
  2. Identify which resources were created before the failure.
  3. Inspect the failing resource's error in both the Terraform output and the cloud-side details.
  4. Fix root cause (permission, quota, dependency), then run plan before apply.
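
Step 2 can be approximated by comparing the addresses you expected to be created against what `terraform state list` now reports. This is a rough sketch under stated assumptions: the function name `pending_resources` is hypothetical, and the expected-address file is something you would prepare yourself, for example from the plan you reviewed.

```shell
# Print addresses from the expected file ($1) that are absent from the
# `terraform state list` output supplied on stdin.
pending_resources() {
  pr_expected_file="$1"
  pr_state_file="$(mktemp)"
  cat > "$pr_state_file"
  # -F fixed strings, -x whole-line match, -v keep lines with no match
  grep -Fxv -f "$pr_state_file" "$pr_expected_file" || true
  rm -f "$pr_state_file"
}
```

For example, `terraform state list | pending_resources expected-addresses.txt` would list the addresses that have not reached state yet. A resource missing from state may still exist in the cloud, so check the platform side as well before deciding how to recover.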

Debugging Rule

Terraform errors are often platform-context errors presented through Terraform. Treat cloud dependencies and identity as first-class debugging dimensions.
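
Treating identity as a debugging dimension can start with a check of the credentials Terraform will use. The sketch below assumes service principal authentication through the `ARM_*` environment variables that the azurerm provider reads. Other methods, such as Azure CLI login or managed identity, would need a different check, and the function name is hypothetical.

```shell
# Print the name of each azurerm identity variable that is unset or empty.
# Assumes service principal auth via environment variables; the credential
# itself (secret, certificate, or OIDC token) depends on the auth method.
missing_arm_vars() {
  for mav_name in ARM_CLIENT_ID ARM_TENANT_ID ARM_SUBSCRIPTION_ID; do
    eval "mav_value=\${$mav_name:-}"
    [ -n "$mav_value" ] || echo "$mav_name"
  done
}
```

Running `missing_arm_vars` before `terraform plan` prints nothing when all three variables are set, and otherwise names the ones to fix.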

Guided Exercises

  1. Trigger a lock scenario in a safe lab and practice lock triage.
  2. Introduce drift by changing a tag manually in Azure and reconcile cleanly.
  3. Simulate a failed apply by using an invalid SKU or permission scope and recover safely.

Interview Questions

Scenario Practice

A state lock appears in production. What do you do first?

I verify whether another valid run is still active before considering force-unlock.

Why is immediate rerun after failed apply dangerous?

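Because some resources may already exist. Rerunning without first checking state can hit conflicts with those resources or repeat the same failure, which makes the partial deployment harder to untangle.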

How do you decide if drift should be accepted or reverted?

I identify the intended source of truth, then align code and state with that decision deliberately.

Summary

Reliable Terraform operations come from repeatable debugging runbooks, not guesswork. Lock, drift, and failed-apply handling are core production skills.