If you do not understand the first failure, re-running the pipeline usually just burns time and makes logs noisier. Inspect the exact failing layer first.
Debugging Pipelines
Learn a systematic method to troubleshoot Azure DevOps pipelines: YAML errors, trigger problems, agent issues, permissions, secrets, task failures, and broken deployments.
🧒 Simple Explanation (ELI5)
When a pipeline fails, do not treat it like a mystery box. Treat it like a machine with checkpoints. Ask: did it start, did it get an agent, did it read variables, did the task run, did the target system accept the change, and did the application stay healthy afterward?
Debugging gets easier when you stop guessing and work from the exact failed layer outward.
🛠️ Debugging Toolkit
- Run logs for every task and step.
- View raw logs to inspect exact commands and task output.
- System diagnostics enabled at queue time for verbose logs.
- Agent pool view for queue state and capability problems.
- Environment deployment history for target-side context.
- kubectl and helm for AKS-target verification.
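As a minimal sketch of the system diagnostics item above: besides ticking the diagnostics option when queuing a run, verbose logging can be switched on from YAML by setting the `system.debug` variable. Treat this as a temporary setting, since it makes every run's logs much larger.

```yaml
# Temporary: turn on verbose task logs for every run of this pipeline.
# Remove once the failure is understood.
variables:
  - name: system.debug
    value: true
```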
Debug order:
1. Trigger
2. YAML parse
3. Resource authorization
4. Agent allocation
5. Task execution
6. Target platform state
7. Application health
| Symptom | Most Likely Layer | First Check |
|---|---|---|
| Run never appears | Trigger / pipeline definition | Branch filters, PR triggers, UI overrides |
| Run is queued forever | Agent pool | Pool capacity, demands, offline self-hosted agent |
| Resource authorization required | Permissions | Service connection, environment, variable group authorization |
| Task fails with 401 or 403 | Identity / permissions | Service connection scope, expired secret, RBAC assignment |
| Deploy step is green but release is broken | Target platform or app | Helm history, kubectl events, pod logs, probes |
🔧 Common Failure Scenarios
Scenario 1: Pipeline Never Triggers
- Check branch filters and PR triggers.
- Confirm the correct YAML file path and pipeline definition reference.
- Verify UI-level trigger settings are not overriding YAML assumptions.
Scenario 2: No Agent Available
- Review pool capacity and queued jobs.
- Inspect agent capabilities against job demands.
- Check if a self-hosted agent is offline or stuck.
Scenario 3: Resource Authorization Required
- Authorize the pipeline for the service connection, variable group, feed, or environment.
- Review permissions after pipeline cloning or project move operations.
- Check whether the identity or group assignment recently changed.
Scenario 4: Secrets or Variables Resolve Incorrectly
- Validate exact variable names and expansion syntax.
- Check variable group linkage and permissions.
- Inspect scripts for quoting or shell interpolation problems.
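The checks above can be sketched as a single step that prints non-secret metadata and confirms a secret arrived without echoing it. The variable group name `webapp-prod` and the secret `dbPassword` are hypothetical. Two behaviours are worth keeping in mind: secret variables are not exposed to scripts unless mapped through `env`, and an undefined macro such as `$(imageTag)` is left in the output as literal text, which is a useful clue in the logs.

```yaml
variables:
  - group: webapp-prod            # hypothetical variable group
  - name: imageTag
    value: $(Build.BuildId)

steps:
  - bash: |
      # Non-secret value: safe to print. If this shows the literal text
      # "$(imageTag)", the variable was never defined in this scope.
      echo "Image tag: $(imageTag)"

      # Secret value: check that it is present, never print it.
      if [ -z "${DB_PASSWORD:-}" ]; then
        echo "DB_PASSWORD is empty or not mapped"
        exit 1
      fi
      echo "DB_PASSWORD is set"
    displayName: Check variable resolution
    env:
      DB_PASSWORD: $(dbPassword)  # hypothetical secret, mapped explicitly
```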
Scenario 5: Deployment Task Succeeds but App Is Still Broken
- Inspect Helm release history and pod health.
- Check readiness probes, ingress, config, and live logs.
- Separate pipeline success from runtime correctness.
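One way to keep a green deploy step from hiding a broken release is to add an explicit verification step after it. The sketch below assumes the agent already has a working kubectl context for the cluster, and reuses the `webapp` deployment and `production` namespace from the labs further down.

```yaml
steps:
  - bash: |
      set -euo pipefail
      # Fails the step if the rollout does not complete within the timeout.
      kubectl rollout status deployment/webapp -n production --timeout=120s
      kubectl get pods -n production
    displayName: Verify rollout after deploy
```

With this in place, the run goes red when the pods never become ready, rather than reporting success for a deployment that only reached the API server.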
Scenario 6: Self-Hosted Agent Is Online but Fails Jobs Immediately
- Check whether the agent service is running under an account that still has filesystem and network access.
- Inspect disk space, workspace cleanup, and stale tool installations on the agent machine.
- Review whether a recent OS patch, proxy change, or certificate change broke connectivity back to Azure DevOps.
Scenario 7: Docker Build or ACR Push Fails with Permission Errors
- Verify the container registry service connection is authorized for this pipeline.
- Check whether the service principal still has push rights to ACR.
- Confirm the repository name and registry login server are correct; authentication errors and naming errors often look similar in Docker task logs.
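For reference while checking names, here is a minimal sketch of a `Docker@2` build-and-push step. The service connection name `acr-service-connection` and repository `webapp` are hypothetical. The registry login server comes from the service connection, so `repository` normally holds only the repository path; if the login server is repeated there, the push can fail in a way that looks like an authentication problem.

```yaml
steps:
  - task: Docker@2
    displayName: Build and push image
    inputs:
      containerRegistry: acr-service-connection  # hypothetical service connection
      repository: webapp                         # repository path only, no login server
      command: buildAndPush
      Dockerfile: Dockerfile
      tags: |
        $(Build.BuildId)
```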
Scenario 8: Azure CLI or Helm Task Fails Only in Production
- Compare staging and production service connections instead of assuming the YAML is wrong.
- Review production-only variable groups, namespace values, ingress hosts, and approval checks.
- Confirm the production cluster and resource group names still match the live Azure resources.
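A quick way to compare environments is to have each stage print which subscription and cluster it actually targets. The sketch below assumes a service connection named `prod-azure-connection` and variables `resourceGroup` and `clusterName`; all three names are hypothetical. Running the same step in staging with its own connection gives two outputs that can be compared side by side.

```yaml
steps:
  - task: AzureCLI@2
    displayName: Show target subscription and cluster
    inputs:
      azureSubscription: prod-azure-connection   # hypothetical service connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        # Which subscription does this connection resolve to?
        az account show --query "{subscription:name, id:id}" -o table
        # Does the cluster named in the variables still exist?
        az aks show --resource-group "$(resourceGroup)" --name "$(clusterName)" \
          --query "{name:name, state:provisioningState}" -o table
```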
Quick decision tree
- Pipeline did not start -> check trigger, branch, PR, UI overrides
- Pipeline started but no job ran -> check resource authorization and agent availability
- Job ran but a task failed -> inspect the exact task input, identity, and logs
- Deployment task passed but app is unhealthy -> move to kubectl, Helm, ingress, config, and application logs
🛠️ Hands-on Break-and-Fix Labs
Lab 1: Broken Trigger
# Broken assumption: only main triggers, but code was pushed to develop
trigger:
  - main
Fix by aligning trigger branches with the actual branch strategy or by validating PR builds separately.
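One possible fix, assuming the team wants CI on both `main` and `develop` and PR validation for changes targeting `main`:

```yaml
# CI trigger: run on pushes to either branch.
trigger:
  branches:
    include:
      - main
      - develop

# PR trigger: validate pull requests that target main.
pr:
  branches:
    include:
      - main
```

The YAML `pr` block applies to GitHub and Bitbucket Cloud repositories. For Azure Repos Git, PR validation is configured through a branch policy rather than in YAML, so check there when PR builds never appear.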
Lab 2: Agent Capability Mismatch
pool:
  name: SelfHostedLinux
  demands:
    - kubectl
    - helm
    - docker
If no agent advertises those capabilities, the job never starts: it stays queued or fails with an error that no agent satisfies the specified demands. Either install the tools or route the job to the right pool.
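To see what an agent really has installed, one option is to temporarily drop the demands and run a small check step on the same pool. This is a sketch rather than a replacement for demands, and it assumes a Linux agent with bash available.

```yaml
pool:
  name: SelfHostedLinux

steps:
  - bash: |
      missing=0
      for tool in kubectl helm docker; do
        # command -v prints the resolved path, or fails if the tool is not on PATH
        command -v "$tool" || { echo "missing: $tool"; missing=1; }
      done
      exit "$missing"
    displayName: Check required tools on the agent
```

If the tools are present but the demands still do not match, the gap is in the capabilities registered for the agent, not in the machine itself.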
Lab 3: Helm Deploy Timeout
kubectl get pods -n production
kubectl describe deployment webapp -n production
kubectl logs deployment/webapp -n production
helm history webapp -n production
These commands tell you whether the problem is image pull, config, readiness, or application startup.
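The same commands can be collected automatically whenever the deploy fails, so the evidence sits in the run logs. The sketch below assumes a chart at the hypothetical path `./chart` and an agent with kubectl and helm already configured for the cluster.

```yaml
steps:
  - bash: |
      # --wait blocks until resources are ready or the timeout expires.
      helm upgrade --install webapp ./chart -n production --wait --timeout 5m
    displayName: Helm deploy

  - bash: |
      kubectl get pods -n production
      kubectl describe deployment webapp -n production
      kubectl get events -n production --sort-by=.lastTimestamp | tail -n 20
      kubectl logs deployment/webapp -n production --tail=100
      helm history webapp -n production
    displayName: Collect diagnostics after a failed deploy
    condition: failed()   # runs only when an earlier step in the job failed
```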
Lab 4: Permission Error on an Azure Service Connection
Error: The pipeline is not authorized to access this resource.
Error: Failed to fetch access token for Azure subscription.
Fix path: authorize the pipeline for the service connection, verify the identity behind the connection is still valid, and confirm the role assignment covers the target subscription, resource group, ACR, or AKS cluster.
📋 Interview Questions
Beginner
I identify the exact stage, job, and step that failed, then read the surrounding logs before changing anything.
It increases logging detail so task execution, commands, and environment interactions become easier to inspect.
Because orchestration success does not guarantee the deployed application is healthy or correctly configured.
Usually lack of available agents, wrong demands, offline self-hosted agents, or pool misconfiguration.
Because many resources such as service connections, environments, and variable groups require explicit authorization beyond simply having a valid YAML file.
Intermediate
I trace whether the failure occurred before deployment, during orchestration, or after the target accepted the change, then use platform-native tooling like kubectl or Azure CLI where appropriate.
Pipeline run ID, commit SHA, image tag, Helm release revision, target namespace, failing step, and relevant application logs or events.
I print non-secret metadata, validate the expansion syntax, inspect scope and authorization, and avoid dumping whole environment sets that may reveal sensitive data.
I check agent service health, machine connectivity, capability registration, disk space, long-running processes, and whether the machine can still reach Azure DevOps and target systems.
Changing multiple things at once without evidence. That destroys your ability to isolate the real cause.
Scenario-Based
I compare service connection scope, environment-specific variables, target subscription or resource group, and any production-only approvals or checks.
I inspect the live deployment spec, Helm release values, image tag, and rollout history to confirm whether the intended version was actually applied and became active.
I would teach a fixed sequence: trigger, authorization, agent, task, target, application. That disciplined order reduces panic and random guessing.
I rerun with system diagnostics, add targeted non-secret debug output, and gather target-side logs rather than rewriting the whole pipeline immediately.
I weigh blast radius, time to recovery, confidence in the fix, and whether the failure is configuration, code, or infrastructure related. Fast restore of service is usually the priority.
🌍 Real-World Usage
Production Azure DevOps teams debug across layers: source control, pipeline orchestration, identity, registry, cluster, and application runtime. The engineers who move fastest are not the ones who guess fastest. They are the ones who isolate fastest.
🧾 Summary
Debugging pipelines is mostly about disciplined narrowing. Start at the failure boundary, separate orchestration from runtime health, and use the right tool for each layer instead of changing YAML blindly.