Hands-on · Lesson 14 of 16

Debugging Pipelines

Learn a systematic method to troubleshoot Azure DevOps pipelines: YAML errors, trigger problems, agent issues, permissions, secrets, task failures, and broken deployments.

🧒 Simple Explanation (ELI5)

When a pipeline fails, do not treat it like a mystery box. Treat it like a machine with checkpoints. Ask: did it start, did it get an agent, did it read variables, did the task run, did the target system accept the change, and did the application stay healthy afterward?

Debugging gets easier when you stop guessing and work from the exact failed layer outward.

🛠️ Debugging Toolkit

text
Debug order:
1. Trigger
2. YAML parse
3. Resource authorization
4. Agent allocation
5. Task execution
6. Target platform state
7. Application health

Symptom | Most Likely Layer | First Check
Run never appears | Trigger / pipeline definition | Branch filters, PR triggers, UI overrides
Run is queued forever | Agent pool | Pool capacity, demands, offline self-hosted agent
Resource authorization required | Permissions | Service connection, environment, variable group authorization
Task fails with 401 or 403 | Identity / permissions | Service connection scope, expired secret, RBAC assignment
Deploy step is green but release is broken | Target platform or app | Helm history, kubectl events, pod logs, probes

🔧 Common Failure Scenarios

Scenario 1: Pipeline Never Triggers

Scenario 2: No Agent Available

Scenario 3: Resource Authorization Required

Scenario 4: Secrets or Variables Resolve Incorrectly

Scenario 5: Deployment Task Succeeds but App Is Still Broken

Scenario 6: Self-Hosted Agent Is Online but Fails Jobs Immediately

Scenario 7: Docker Build or ACR Push Fails with Permission Errors

Scenario 8: Azure CLI or Helm Task Fails Only in Production

Debug Decision Flow

Did run start? -> Did job get agent? -> Did task fail? -> Is target healthy?

text
Quick decision tree

Pipeline did not start
  -> check trigger, branch, PR, UI overrides

Pipeline started but no job ran
  -> check resource authorization and agent availability

Job ran but a task failed
  -> inspect the exact task input, identity, and logs

Deployment task passed but app is unhealthy
  -> move to kubectl, Helm, ingress, config, and application logs

🛠️ Hands-on Break-and-Fix Labs

Lab 1: Broken Trigger

yaml
# Broken assumption: only main triggers, but code was pushed to develop
trigger:
- main

Fix by aligning trigger branches with the actual branch strategy or by validating PR builds separately.
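One possible fix is sketched below. It assumes the team really does want CI on both `main` and `develop`; the branch names are illustrative, not a recommendation. Note that the `pr` block only takes effect for GitHub and Bitbucket Cloud repositories. For Azure Repos Git, PR validation is configured through a branch policy rather than in YAML.

```yaml
# Sketch only: branch names are assumptions, adjust to your branch strategy.
trigger:
  branches:
    include:
      - main
      - develop

# Honored for GitHub and Bitbucket Cloud repositories.
# Azure Repos Git uses a build validation branch policy instead.
pr:
  branches:
    include:
      - main
```

If a run still never appears after this change, check whether the trigger was overridden in the pipeline's UI settings, since that override takes precedence over the YAML.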

Lab 2: Agent Capability Mismatch

yaml
pool:
  name: SelfHostedLinux
  demands:
    - kubectl
    - helm
    - docker

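If no agent in the pool advertises those capabilities, the job cannot be scheduled: it either fails at queue time with a "No agent found in pool ... which satisfies the specified demands" error or, when the only matching agents are offline or busy, waits in the queue. Either install the tools and make sure the agent advertises them as capabilities (a restart refreshes detected capabilities, and user-defined capabilities can be added in the pool settings), or route the job to the right pool.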
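A demand only checks that an agent advertises a capability, not that the tool behind it actually works. As a hedged sketch, a first step like the one below can confirm the binaries are on the PATH once a job does land on an agent. The check step is an illustrative addition, not part of the original lab, and it assumes a Linux agent where `script` runs in Bash.

```yaml
# Sketch: pool and demands come from the lab; the verification step is an assumption.
pool:
  name: SelfHostedLinux
  demands:
    - kubectl
    - helm
    - docker

steps:
  - script: |
      # Fail fast with a clear message if a required tool is missing from PATH
      for tool in kubectl helm docker; do
        if ! command -v "$tool" >/dev/null 2>&1; then
          echo "##vso[task.logissue type=error]$tool not found on agent $(Agent.Name)"
          exit 1
        fi
      done
    displayName: Verify required tools on the agent
```

This turns a confusing mid-pipeline failure into an early, named one.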

Lab 3: Helm Deploy Timeout

bash
kubectl get pods -n production
kubectl describe deployment webapp -n production
kubectl logs deployment/webapp -n production
helm history webapp -n production

These commands tell you whether the problem is image pull, config, readiness, or application startup.
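The same checks can run inside the pipeline so that a stalled rollout fails the run instead of leaving it green. The sketch below reuses the `webapp` deployment and `production` namespace from the commands above; the 180-second timeout is an arbitrary assumption, and it presumes the agent already has `kubectl` and cluster credentials configured.

```yaml
steps:
  - script: |
      set -euo pipefail
      # Wait for the rollout to finish; a non-zero exit code fails this step
      kubectl rollout status deployment/webapp -n production --timeout=180s
    displayName: Wait for webapp rollout

  - script: |
      # Recent events usually name the cause: image pull errors, failed probes, OOM kills
      kubectl get events -n production --sort-by=.lastTimestamp | tail -n 20
      kubectl get pods -n production -o wide
    displayName: Collect rollout diagnostics
    condition: failed()   # only runs when an earlier step in the job failed
```

Keeping the diagnostics in a separate `failed()` step means healthy runs stay quiet while broken ones carry their own evidence.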

Lab 4: Permission Error on an Azure Service Connection

text
Error: The pipeline is not authorized to access this resource.
Error: Failed to fetch access token for Azure subscription.

Fix path: authorize the pipeline for the service connection, verify the identity behind the connection is still valid, and confirm the role assignment covers the target subscription, resource group, ACR, or AKS cluster.
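Once the pipeline is authorized to use the connection, a small diagnostic step can show which identity and subscription it actually resolves to. This is a sketch under assumptions: `my-prod-connection` is a hypothetical service connection name, and `servicePrincipalId` is only populated for service principal style connections when `addSpnToEnvironment` is enabled. The role listing can itself fail if the identity lacks permission to read directory or authorization data, which is also useful evidence.

```yaml
steps:
  - task: AzureCLI@2
    displayName: Show identity behind the service connection
    inputs:
      azureSubscription: my-prod-connection   # hypothetical service connection name
      scriptType: bash
      scriptLocation: inlineScript
      addSpnToEnvironment: true
      inlineScript: |
        # Which subscription and tenant did the connection log in to?
        az account show \
          --query "{subscription:name, subscriptionId:id, tenant:tenantId}" -o table
        # Which roles does that identity hold, and at which scopes?
        az role assignment list --assignee "$servicePrincipalId" --all \
          --query "[].{role:roleDefinitionName, scope:scope}" -o table
```

Compare the listed scopes against the subscription, resource group, ACR, or AKS cluster the failing task targets.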

⚠️ Do Not Re-run Blindly

If you do not understand the first failure, re-running the pipeline usually just burns time and makes logs noisier. Inspect the exact failing layer first.

📋 Interview Questions

Beginner

What is the first thing you check when a pipeline fails?

I identify the exact stage, job, and step that failed, then read the surrounding logs before changing anything.

What does enabling system diagnostics do?

It turns on verbose logging (the same effect as setting the system.debug variable to true), so task execution, commands, and environment interactions become easier to inspect.
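For reference, the YAML equivalent is a single variable. This is a minimal sketch; where you place it (pipeline, stage, or job level) depends on how much of the run you want to be verbose.

```yaml
variables:
  system.debug: true   # verbose task logs; remove once the problem is understood
```

Verbose logs are large, so it is usually better to enable this for one diagnostic run than to leave it on permanently.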

Why is "pipeline succeeded but app failed" an important distinction?

Because orchestration success does not guarantee the deployed application is healthy or correctly configured.

What causes jobs to stay queued?

Usually lack of available agents, wrong demands, offline self-hosted agents, or pool misconfiguration.

Why do permission errors happen so often in Azure DevOps?

Because many resources such as service connections, environments, and variable groups require explicit authorization beyond simply having a valid YAML file.

Intermediate

How do you separate pipeline issues from platform issues?

I trace whether the failure occurred before deployment, during orchestration, or after the target accepted the change, then use platform-native tooling like kubectl or Azure CLI where appropriate.

What information would you capture during a deployment incident?

Pipeline run ID, commit SHA, image tag, Helm release revision, target namespace, failing step, and relevant application logs or events.
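Much of that context can be recorded by the pipeline itself. The step below is a sketch: the `Build.*` and `Agent.Name` values are predefined variables, while `imageTag` is a hypothetical variable that your pipeline would need to define, and `webapp` / `production` are the release and namespace names used in the labs.

```yaml
steps:
  - script: |
      echo "Run ID:     $(Build.BuildId)"
      echo "Run number: $(Build.BuildNumber)"
      echo "Commit SHA: $(Build.SourceVersion)"
      echo "Branch:     $(Build.SourceBranch)"
      echo "Agent:      $(Agent.Name)"
      echo "Image tag:  $(imageTag)"   # hypothetical variable, define it in your pipeline
      # Last few Helm revisions for the release
      helm history webapp -n production --max 5
    displayName: Record deployment context
    condition: always()   # capture context even when earlier steps failed
```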

How do you debug variable resolution issues safely?

I print non-secret metadata, validate the expansion syntax, inspect scope and authorization, and avoid dumping whole environment sets that may reveal sensitive data.
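As an illustration of that approach, the sketch below prints a non-secret value and reports only whether a secret arrived. Both `environmentName` and `apiToken` are hypothetical names. Keep in mind the three syntaxes behave differently: `${{ variables.name }}` is resolved when the YAML is compiled, `$(name)` just before the task runs, and `$[variables.name]` at runtime in expressions.

```yaml
variables:
  environmentName: staging          # hypothetical non-secret variable

steps:
  - script: |
      # Non-secret values can be printed directly
      echo "environmentName: $(environmentName)"
      # For secrets, report only whether a value arrived, never the value itself
      if [ -n "${API_TOKEN:-}" ]; then
        echo "API_TOKEN is set"
      else
        echo "API_TOKEN is empty or was not mapped"
      fi
    displayName: Inspect variable resolution safely
    env:
      API_TOKEN: $(apiToken)        # hypothetical secret; secrets must be mapped explicitly
```

An unresolved macro is passed through as the literal text `$(apiToken)`, so seeing that string in a log or command line usually points to a missing variable, a scope problem, or an unauthorized variable group.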

What is your process for debugging a stuck self-hosted agent?

I check agent service health, machine connectivity, capability registration, disk space, long-running processes, and whether the machine can still reach Azure DevOps and target systems.

What is a common debugging anti-pattern?

Changing multiple things at once without evidence. That destroys your ability to isolate the real cause.

Scenario-Based

An AzureCLI task fails only in production. Where do you look first?

I compare service connection scope, environment-specific variables, target subscription or resource group, and any production-only approvals or checks.

A pipeline is green, but the new version never appeared. What is your triage path?

I inspect the live deployment spec, Helm release values, image tag, and rollout history to confirm whether the intended version was actually applied and became active.
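One way to automate the first part of that triage is to compare the image the cluster is running with the tag the run intended to deploy. The sketch below makes several assumptions: `imageTag` is a hypothetical pipeline variable, the deployment is the `webapp` deployment in `production` from the labs, the application is the first container in the pod spec, and the agent has `kubectl` access.

```yaml
steps:
  - script: |
      set -euo pipefail
      expected="$(imageTag)"   # hypothetical variable holding the tag this run built
      # Read the image of the first container from the live deployment spec
      live=$(kubectl get deployment webapp -n production \
        -o jsonpath='{.spec.template.spec.containers[0].image}')
      echo "Live image: $live"
      case "$live" in
        *":$expected") echo "Live deployment runs the expected tag" ;;
        *) echo "##vso[task.logissue type=error]Expected tag $expected is not live"
           exit 1 ;;
      esac
    displayName: Confirm the intended image is live
```

A mismatch here points at the deploy step or Helm values; a match with a broken app points at rollout health or configuration instead.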

How would you train a junior engineer to debug pipelines well?

I would teach a fixed sequence: trigger, YAML parse, resource authorization, agent, task, target platform, application. That disciplined order reduces panic and random guessing.

What if the pipeline logs are too sparse to diagnose the issue?

I rerun with system diagnostics, add targeted non-secret debug output, and gather target-side logs rather than rewriting the whole pipeline immediately.

What is your rule for deciding between rollback and forward-fix?

I weigh blast radius, time to recovery, confidence in the fix, and whether the failure is configuration, code, or infrastructure related. Fast restore of service is usually the priority.

🌍 Real-World Usage

Production Azure DevOps teams debug across layers: source control, pipeline orchestration, identity, registry, cluster, and application runtime. The engineers who move fastest are not the ones who guess fastest. They are the ones who isolate fastest.

🧾 Summary

Debugging pipelines is mostly about disciplined narrowing. Start at the failure boundary, separate orchestration from runtime health, and use the right tool for each layer instead of changing YAML blindly.