Hands-on · Lesson 14 of 16

Debugging Pipelines

Learn a systematic method to troubleshoot Azure DevOps pipelines: YAML errors, trigger problems, agent issues, permissions, secrets, task failures, and broken deployments.

🧒 Simple Explanation (ELI5)

When a pipeline fails, do not treat it like a mystery box. Treat it like a machine with checkpoints. Ask: did it start, did it get an agent, did it read variables, did the task run, did the target system accept the change, and did the application stay healthy afterward?

Debugging gets easier when you stop guessing and work from the exact failed layer outward.

🛠️ Debugging Toolkit

Run logs for every task and step.
View raw logs to inspect exact commands and task output.
System diagnostics, enabled at queue time with the "Enable system diagnostics" option or by setting the variable system.debug to true, for verbose logs.
Agent pool view for queue state and capability problems.
Environment deployment history for target-side context.
kubectl and helm for AKS-target verification.

text

Debug order:
1. Trigger
2. YAML parse
3. Resource authorization
4. Agent allocation
5. Task execution
6. Target platform state
7. Application health

Symptom	Most Likely Layer	First Check
Run never appears	Trigger / pipeline definition	Branch filters, PR triggers, UI overrides
Run is queued forever	Agent pool	Pool capacity, demands, offline self-hosted agent
Resource authorization required	Permissions	Service connection, environment, variable group authorization
Task fails with 401 or 403	Identity / permissions	Service connection scope, expired secret, RBAC assignment
Deploy step is green but release is broken	Target platform or app	Helm history, kubectl events, pod logs, probes

🔧 Common Failure Scenarios

Scenario 1: Pipeline Never Triggers

Check branch filters and PR triggers. For Azure Repos Git, PR validation is configured through branch policies, not the YAML pr keyword.
Confirm the correct YAML file path and pipeline definition reference.
Verify UI-level trigger settings are not overriding YAML assumptions.

Scenario 2: No Agent Available

Review pool capacity and queued jobs.
Inspect agent capabilities against job demands.
Check if a self-hosted agent is offline or stuck.

Scenario 3: Resource Authorization Required

Authorize the pipeline for the service connection, variable group, feed, or environment.
Review permissions after pipeline cloning or project move operations.
Check whether the identity or group assignment recently changed.

Scenario 4: Secrets or Variables Resolve Incorrectly

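Validate exact variable names and expansion syntax: macro $(name), template expression ${{ variables.name }}, and runtime expression $[variables.name] are evaluated at different times.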
Check variable group linkage and permissions.
Inspect scripts for quoting or shell interpolation problems.
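The checks above can be sketched as a small guard that a script step runs before it uses its inputs. This is a minimal sketch, not a definitive implementation: it assumes the variables have already been mapped into the step's environment (secret variables are not exposed to scripts unless they are mapped explicitly), and the names `require_vars`, `DEPLOY_NAMESPACE`, and `REGISTRY_PASSWORD` are hypothetical. It reports only whether each variable is usable, never its value, and it flags values that still look like an unexpanded `$(name)` macro.

```shell
# require_vars NAME...
# Prints one status line per variable and never prints the value itself.
# Returns 1 if any variable is missing, empty, or still contains an
# unexpanded $(name) macro; returns 2 for an invalid variable name.
require_vars() {
  local name value rc=0
  for name in "$@"; do
    # Refuse anything that is not a plain identifier before using eval.
    case "$name" in
      ''|*[!A-Za-z0-9_]*) echo "invalid variable name: $name" >&2; return 2 ;;
    esac
    # Indirect lookup of the variable called $name; the "-" keeps it safe
    # under "set -u" when the variable is unset.
    eval "value=\${$name-}"
    case "$value" in
      '')     echo "$name: MISSING";    rc=1 ;;
      *'$('*) echo "$name: UNEXPANDED"; rc=1 ;;
      *)      echo "$name: set" ;;
    esac
  done
  return "$rc"
}

# Example (hypothetical names):
#   require_vars DEPLOY_NAMESPACE REGISTRY_PASSWORD || exit 1
```

A legitimate value that happens to contain `$(` would be reported as UNEXPANDED, so treat that status as a prompt to look closer rather than as proof of a bug.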

Scenario 5: Deployment Task Succeeds but App Is Still Broken

Inspect Helm release history and pod health.
Check readiness probes, ingress, config, and live logs.
Separate pipeline success from runtime correctness.
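One way to keep a green deploy step from hiding a broken release is to add an explicit verification step after it. The sketch below assumes `kubectl` is already authenticated against the target cluster; the function name `verify_rollout` and the 120-second default timeout are illustrative choices, not values from this lesson.

```shell
# verify_rollout DEPLOYMENT NAMESPACE [TIMEOUT]
# Waits for the deployment to finish rolling out. On failure it prints recent
# namespace events so the cause is visible in the pipeline log, then returns
# non-zero so the step fails instead of reporting a false green.
verify_rollout() {
  local deployment="$1" namespace="$2" timeout="${3:-120s}"
  if kubectl rollout status "deployment/$deployment" \
       -n "$namespace" --timeout="$timeout"; then
    echo "rollout of $deployment in $namespace is healthy"
    return 0
  fi
  echo "rollout of $deployment in $namespace did not complete" >&2
  kubectl get events -n "$namespace" --sort-by=.lastTimestamp >&2
  return 1
}

# Example:
#   verify_rollout webapp production 180s
```

A completed rollout only shows that the pods passed their readiness probes. Ingress, configuration, and application logs still need their own checks.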

Scenario 6: Self-Hosted Agent Is Online but Fails Jobs Immediately

Check whether the agent service is running under an account that still has filesystem and network access.
Inspect disk space, workspace cleanup, and stale tool installations on the agent machine.
Review whether a recent OS patch, proxy change, or certificate change broke connectivity back to Azure DevOps.
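The disk and connectivity checks above can be scripted so they run the same way every time. This is a rough sketch for a Linux agent with `df`, `awk`, and `curl` available; the function names, the 90 percent threshold, and the default URL are assumptions, and an organization behind a proxy may need a different URL or extra `curl` options.

```shell
# disk_used_pct PATH: print the used-space percentage (digits only) of the
# filesystem that holds PATH, using the portable "df -P" output format.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# agent_preflight WORKDIR [MAX_PCT] [URL]
# Fails if the work directory's disk is fuller than MAX_PCT, or if URL cannot
# be reached over HTTPS (proxy and certificate problems surface here).
agent_preflight() {
  local workdir="$1" max_pct="${2:-90}" url="${3:-https://dev.azure.com}"
  local used
  used=$(disk_used_pct "$workdir") || return 1
  if [ "$used" -gt "$max_pct" ]; then
    echo "disk: ${used}% used on $workdir (limit ${max_pct}%)" >&2
    return 1
  fi
  if ! curl --silent --show-error --head --max-time 10 \
         --output /dev/null "$url"; then
    echo "network: cannot reach $url" >&2
    return 1
  fi
  echo "agent preflight ok: disk ${used}% used, $url reachable"
}

# Example, run as the same account as the agent service:
#   agent_preflight /home/azagent/_work
```

Running it under the agent's service account matters, because a check that passes for an interactive admin user says little about the account the jobs actually run as. The path in the example is hypothetical.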

Scenario 7: Docker Build or ACR Push Fails with Permission Errors

Verify the container registry service connection is authorized for this pipeline.
Check whether the service principal still has push rights to ACR.
Confirm the repository name and registry login server are correct; authentication errors and naming errors often look similar in Docker task logs.
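Because naming and authentication errors look alike, it can help to split the image reference into its parts before touching credentials. The helper below is a sketch under simplifying assumptions: the reference has the shape `<login-server>/<repository>[:<tag>]`, no digest is used, and the name `split_image_ref` and the example registry are hypothetical.

```shell
# split_image_ref REF
# Prints registry, repository, and tag on separate lines. Returns non-zero if
# the reference has no registry host or the repository contains uppercase
# letters, which Docker does not allow in repository names.
split_image_ref() {
  local ref="$1" registry rest repository tag
  case "$ref" in
    */*) ;;
    *) echo "no registry host in '$ref'" >&2; return 1 ;;
  esac
  registry="${ref%%/*}"   # everything before the first slash
  rest="${ref#*/}"        # everything after the first slash
  case "$rest" in
    *:*) repository="${rest%:*}"; tag="${rest##*:}" ;;
    *)   repository="$rest";      tag="latest" ;;
  esac
  case "$repository" in
    *[[:upper:]]*) echo "repository '$repository' must be lowercase" >&2
                   return 1 ;;
  esac
  printf 'registry=%s\nrepository=%s\ntag=%s\n' "$registry" "$repository" "$tag"
}

# Example:
#   split_image_ref myregistry.azurecr.io/team/webapp:1.4.2
```

The printed registry value can then be compared with the login server that Azure reports, for example from `az acr show --name <registry> --query loginServer --output tsv`.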

Scenario 8: Azure CLI or Helm Task Fails Only in Production

Compare staging and production service connections instead of assuming the YAML is wrong.
Review production-only variable groups, namespace values, ingress hosts, and approval checks.
Confirm the production cluster and resource group names still match the live Azure resources.
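A side-by-side comparison is easier when each environment's settings are first exported to a flat file. The sketch below assumes two files of `KEY=VALUE` lines, one per environment; how those files are produced (from variable groups, Helm values, or anything else) is left open, and the name `env_diff` and the file names are hypothetical. Only key names are printed, since some values may be sensitive.

```shell
# env_diff STAGING_FILE PRODUCTION_FILE
# Prints keys that exist in only one file and keys whose values differ.
# Assumes the staging file is not empty and keys contain no "=".
env_diff() {
  local staging="$1" production="$2"
  awk -F= '
    NR == FNR { first[$1] = $0; next }   # remember every staging line
    {
      seen[$1] = 1
      if (!($1 in first))           print "only in production: " $1
      else if (first[$1] != $0)     print "differs: " $1
    }
    END {
      for (key in first) if (!(key in seen)) print "only in staging: " key
    }
  ' "$staging" "$production" | sort
}

# Example:
#   env_diff staging.env production.env
```

A key that appears only in production, or a value that differs unexpectedly, is usually a better first lead than another edit to the shared YAML.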

Debug Decision Flow

Did run start? → Did job get agent? → Did task fail? → Is target healthy?

text

Quick decision tree

Pipeline did not start
  -> check trigger, branch, PR, UI overrides

Pipeline started but no job ran
  -> check resource authorization and agent availability

Job ran but a task failed
  -> inspect the exact task input, identity, and logs

Deployment task passed but app is unhealthy
  -> move to kubectl, Helm, ingress, config, and application logs

🛠️ Hands-on Break-and-Fix Labs

Lab 1: Broken Trigger

yaml

# Broken assumption: only main triggers, but code was pushed to develop
trigger:
- main

Fix by aligning trigger branches with the actual branch strategy or by validating PR builds separately.

Lab 2: Agent Capability Mismatch

yaml

pool:
  name: SelfHostedLinux
  demands:
    - kubectl
    - helm
    - docker

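If no agent in the pool advertises those capabilities, the job cannot be scheduled: it either stays queued or fails with an error saying that no agent satisfies the specified demands. Either install the tools or route the job to the right pool.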
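Before changing the pool or the demands, it is worth confirming on the agent machine which tools are actually installed. The helper below is a minimal sketch with a hypothetical name, `missing_tools`. A tool being on PATH does not by itself guarantee that the agent advertises a capability with the same name, so the pool's capabilities view remains the source of truth.

```shell
# missing_tools TOOL...
# Prints each tool that is not found on PATH, one per line, and returns
# non-zero if at least one is missing.
missing_tools() {
  local tool rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example, run as the agent's service account:
#   missing_tools kubectl helm docker
```

An empty result with a zero exit code means every listed tool was found. Anything printed is a candidate for installation or for a user-defined capability on the agent.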

Lab 3: Helm Deploy Timeout

bash

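# Pod status: look for ImagePullBackOff, CrashLoopBackOff, or Pending
kubectl get pods -n production
# Deployment conditions and recent events
kubectl describe deployment webapp -n production
# Startup output from one of the deployment's pods
kubectl logs deployment/webapp -n production
# Release revisions and the status of each one
helm history webapp -n production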

These commands tell you whether the problem is image pull, config, readiness, or application startup.

Lab 4: Permission Error on an Azure Service Connection

text

Error: The pipeline is not authorized to access this resource.
Error: Failed to fetch access token for Azure subscription.

Fix path: authorize the pipeline for the service connection, verify the identity behind the connection is still valid, and confirm the role assignment covers the target subscription, resource group, ACR, or AKS cluster.
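The last step of that fix path, confirming the role assignment, can be checked from the Azure CLI. This is a sketch that assumes you are logged in with an account allowed to read role assignments; the function names are hypothetical, and `AcrPush` is used only as an example of a built-in role.

```shell
# show_role_assignments ASSIGNEE SCOPE
# Lists the role names granted to ASSIGNEE (the service principal's
# application or object ID) at SCOPE, including roles inherited from parent
# scopes such as the subscription or resource group.
show_role_assignments() {
  local assignee="$1" scope="$2"
  az role assignment list \
    --assignee "$assignee" \
    --scope "$scope" \
    --include-inherited \
    --query '[].roleDefinitionName' \
    --output tsv
}

# has_role ROLE ASSIGNEE SCOPE
# Succeeds only if ROLE appears as an exact line in the list above.
has_role() {
  local role="$1" assignee="$2" scope="$3"
  show_role_assignments "$assignee" "$scope" | grep -Fxq "$role"
}

# Example (IDs are placeholders):
#   has_role AcrPush "$SP_APP_ID" "$ACR_RESOURCE_ID" || echo "no push rights"
```

If the role is present and the task still fails, the expired-secret and pipeline-authorization checks are the next places to look.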

⚠️ Do Not Re-run Blindly

If you do not understand the first failure, re-running the pipeline usually just burns time and makes logs noisier. Inspect the exact failing layer first.

📋 Interview Questions

Beginner

What is the first thing you check when a pipeline fails?

I identify the exact stage, job, and step that failed, then read the surrounding logs before changing anything.

What does enabling system diagnostics do?

It increases logging detail so task execution, commands, and environment interactions become easier to inspect.

Why is "pipeline succeeded but app failed" an important distinction?

Because orchestration success does not guarantee the deployed application is healthy or correctly configured.

What causes jobs to stay queued?

Usually lack of available agents, wrong demands, offline self-hosted agents, or pool misconfiguration.

Why do permission errors happen so often in Azure DevOps?

Because many resources such as service connections, environments, and variable groups require explicit authorization beyond simply having a valid YAML file.

Intermediate

How do you separate pipeline issues from platform issues?

I trace whether the failure occurred before deployment, during orchestration, or after the target accepted the change, then use platform-native tooling like kubectl or Azure CLI where appropriate.

What information would you capture during a deployment incident?

Pipeline run ID, commit SHA, image tag, Helm release revision, target namespace, failing step, and relevant application logs or events.

How do you debug variable resolution issues safely?

I print non-secret metadata, validate the expansion syntax, inspect scope and authorization, and avoid dumping whole environment sets that may reveal sensitive data.

What is your process for debugging a stuck self-hosted agent?

I check agent service health, machine connectivity, capability registration, disk space, long-running processes, and whether the machine can still reach Azure DevOps and target systems.

What is a common debugging anti-pattern?

Changing multiple things at once without evidence. That destroys your ability to isolate the real cause.

Scenario-Based

An AzureCLI task fails only in production. Where do you look first?

I compare service connection scope, environment-specific variables, target subscription or resource group, and any production-only approvals or checks.

A pipeline is green, but the new version never appeared. What is your triage path?

I inspect the live deployment spec, Helm release values, image tag, and rollout history to confirm whether the intended version was actually applied and became active.

How would you train a junior engineer to debug pipelines well?

I would teach a fixed sequence: trigger, authorization, agent, task, target, application. That disciplined order reduces panic and random guessing.

What if the pipeline logs are too sparse to diagnose the issue?

I rerun with system diagnostics, add targeted non-secret debug output, and gather target-side logs rather than rewriting the whole pipeline immediately.

What is your rule for deciding between rollback and forward-fix?

I weigh blast radius, time to recovery, confidence in the fix, and whether the failure is configuration, code, or infrastructure related. Fast restore of service is usually the priority.

🌍 Real-World Usage

Production Azure DevOps teams debug across layers: source control, pipeline orchestration, identity, registry, cluster, and application runtime. The engineers who move fastest are not the ones who guess fastest. They are the ones who isolate fastest.

🧾 Summary

Debugging pipelines is mostly about disciplined narrowing. Start at the failure boundary, separate orchestration from runtime health, and use the right tool for each layer instead of changing YAML blindly.
