Hands-on Lesson 15 of 16

Hands-on: Debugging Terraform Failures

Use practical runbooks for the most common real-world failures: state locks, drift, and applies that fail mid-run.

Triage Framework

Symptom	Likely Area	First Check
init fails	Provider or backend	Version and backend config
plan fails	Auth or schema mismatch	Credentials and provider args
state lock error	Concurrent run or stale lock	Active jobs and lock metadata
unexpected replacement	Drift or refactor	State address and immutable fields
apply failed mid-run	Platform dependency or permission	Partial creation and cloud-side error details

Useful Commands

bash

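terraform validate                                         # check configuration syntax and internal consistency
terraform plan                                             # preview changes against the current state
terraform state list                                       # list resource addresses tracked in state
terraform state show azurerm_kubernetes_cluster.platform   # show the recorded attributes of one resource
terraform force-unlock LOCK_ID                             # release a stale lock; LOCK_ID comes from the lock error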

Runbook 1: State Lock

Check if a pipeline job is still running.
If none is running, inspect recent failed jobs for an interrupted apply.
Only after confirmation, run terraform force-unlock LOCK_ID.
Run terraform plan again to verify state consistency.
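
The lock triage above depends on reading the lock metadata correctly. The following is a minimal sketch, assuming the lock error text contains a "Lock Info:" block with an "ID:" field, which is how Terraform normally reports it. The helper name extract_lock_id and the file plan.log are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: pull the lock ID out of captured "Error acquiring the state lock" text.
# Assumption: the text contains a line with an "ID:" label followed by the lock ID.

extract_lock_id() {
  # Reads error text on stdin and prints the token that follows the first "ID:" label.
  # Scanning fields (not line starts) tolerates indentation or border characters.
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }'
}

# Illustrative usage (not executed here):
#   terraform plan -input=false 2>&1 | tee plan.log
#   lock_id="$(extract_lock_id < plan.log)"
#   # Confirm no job is active BEFORE running the next line.
#   terraform force-unlock "$lock_id"
```

The same Lock Info block usually also reports who took the lock and when, which helps decide whether the lock belongs to a job that is still running.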

Runbook 2: Drift Issue

Run terraform plan and identify resources with unexpected changes.
Compare the Terraform code, the state, and the live Azure configuration.
Decide ownership: keep the manual change or revert to the code-defined intent.
Reconcile with a code update or terraform import, then re-run terraform plan to confirm no unexpected changes remain.
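
The first and last steps of this runbook can be scripted with the -detailed-exitcode flag of terraform plan, which exits with 0 when there is nothing to change, 1 on error, and 2 when changes are present. The sketch below is one possible wrapper, not a prescribed method; the function names classify_plan_exit and check_drift are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: turn the exit status of `terraform plan -detailed-exitcode` into a triage label.

classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;     # plan succeeded, no changes
    1) echo "error" ;;     # plan itself failed
    2) echo "changes" ;;   # plan succeeded and found differences to reconcile
    *) echo "unknown" ;;
  esac
}

check_drift() {
  # Capture the exit status without aborting under `set -e`.
  local rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null || rc=$?
  classify_plan_exit "$rc"
}

# Illustrative usage (not executed here):
#   check_drift    # prints clean, error, or changes
```

A result of "changes" only says that code, state, and live infrastructure disagree. Deciding whether the manual change or the code wins is still a human decision.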

Runbook 3: Failed Apply Mid-Run

Do not rerun blindly.
Identify which resources were created before failure.
Inspect the failing resource's error in both the cloud platform and the Terraform output.
Fix root cause (permission, quota, dependency), then run plan before apply.
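
To identify what was created before the failure, one option is to compare the addresses you expected against what terraform state list reports. This is a rough sketch under stated assumptions: expected.txt is a hypothetical file you maintain with one resource address per line, and state.txt holds saved terraform state list output. The helper name missing_from_state is also hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list expected resource addresses that are absent from Terraform state.
# $1 = file of expected addresses, one per line (hypothetical, maintained by you)
# $2 = file holding saved `terraform state list` output

missing_from_state() {
  # -F fixed strings, -x whole-line match, -v invert:
  # keep the lines of $1 that do not appear in $2.
  grep -Fxv -f "$2" "$1" || true   # grep exits 1 when nothing is missing
}

# Illustrative usage after a failed apply (not executed here):
#   terraform state list > state.txt
#   missing_from_state expected.txt state.txt
```

An address that does appear in state is not necessarily healthy, so terraform state show on the failing resource is still worth a look before the next plan.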

Debugging Rule

Terraform errors are often platform-context errors presented through Terraform. Treat cloud dependencies and identity as first-class debugging dimensions.
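
One way to see the platform side of an error is to enable Terraform's own debug logging through the TF_LOG and TF_LOG_PATH environment variables. The wrapper below is a small sketch; the function name tf_debug and the default log path ./tf-debug.log are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run any Terraform subcommand with debug logging written to a file.
# Assumption: ./tf-debug.log is an acceptable scratch location (hypothetical path).

tf_debug() {
  # TF_LOG sets the log level; TF_LOG_PATH sends the log to a file instead of stderr.
  TF_LOG=DEBUG TF_LOG_PATH="${TF_LOG_PATH:-./tf-debug.log}" terraform "$@"
}

# Illustrative usage (not executed here):
#   tf_debug plan
#   grep -i "error" ./tf-debug.log
```

Debug logs can contain sensitive values, so treat the log file like a secret and delete it after the investigation.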

Guided Exercises

Trigger a lock scenario in a safe lab and practice lock triage.
Introduce drift by changing a tag manually in Azure and reconcile cleanly.
Simulate a failed apply by using an invalid SKU or permission scope and recover safely.

Interview Questions

Scenario Practice

A state lock appears in production. What do you do first?

I verify whether another valid run is still active before considering force-unlock.

Why is an immediate rerun after a failed apply dangerous?

Because some resources may already exist, and rerunning without knowing what was created and what is recorded in state can compound the failure instead of fixing it.

How do you decide if drift should be accepted or reverted?

I identify the intended source of truth, then align code and state with that decision deliberately.

Summary

Reliable Terraform operations come from repeatable debugging runbooks, not guesswork. Lock, drift, and failed-apply handling are core production skills.
