Hands-on | Lesson 11 of 12

Troubleshooting & Debugging

Debug common GCP issues: connectivity, IAM, databases, deployments. Practice systematic debugging to resolve production incidents.

Debugging Framework

Step 1: Identify the symptom (app down, slow, permission denied).
Step 2: Check logs (Cloud Logging, Cloud Trace).
Step 3: Isolate the layer (compute, network, database, IAM).
Step 4: Inspect resources.
Step 5: Fix and test.
Step 6: Prevent recurrence (alert, monitoring, documentation).
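
Step 3 (isolating the layer) can be sketched as a small lookup from error text to the layer worth inspecting first. The function name `isolate_layer` and the patterns below are illustrative assumptions, not an exhaustive or official mapping.

```shell
# Map an error message to the layer to inspect first.
# The patterns are examples only; extend them with the errors you actually see.
isolate_layer() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"permission"*|*"403"*|*"forbidden"*)             echo "iam" ;;
    *"connection refused"*|*"timed out"*|*"timeout"*) echo "network" ;;
    *"seq scan"*|*"slow query"*|*"deadlock"*)         echo "database" ;;
    *"command not found"*|*"out of memory"*)          echo "compute" ;;
    *)                                                echo "unknown" ;;
  esac
}
```

For example, `isolate_layer "Connection refused"` prints `network`, which points at firewall rules and listeners before anything else.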

Runbook 1: VM Cannot Reach Database

Symptom: "Connection refused" or timeout when app connects to Cloud SQL.

bash
# Step 1: Check if Cloud SQL instance is running
gcloud sql instances describe web-app-db --format='value(state)'
# Expected: RUNNABLE

# Step 2: Check VM-to-database connectivity via Cloud SQL Auth Proxy
# On the VM (v2 syntax; the legacy v1 binary was cloud_sql_proxy -instances=...=tcp:5432):
cloud-sql-proxy --port 5432 PROJECT:us-central1:web-app-db &

# Step 3: Test connection
psql -h 127.0.0.1 -U app_user -d app_db
# If prompted for a password, enter the app_user password (keep it in a secret store, not in the runbook)

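# Step 4: If it still fails, check IAM: the VM's service account needs roles/cloudsql.client
gcloud projects get-iam-policy PROJECT \
  --flatten='bindings[].members' \
  --filter='bindings.role:roles/cloudsql.client' \
  --format='value(bindings.members)'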

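# Step 5: Cross-check with a direct connection that bypasses the proxy
# (gcloud temporarily allowlists your IP; the instance needs a public IP)
gcloud sql connect web-app-db --user=app_user
# Should prompt for password without timeout

# Step 6: Check recent admin operations (restarts, maintenance, failovers)
gcloud sql operations list --instance=web-app-db --limit=10
# The database logs themselves are in Cloud Logging:
gcloud logging read 'resource.type="cloudsql_database"' --limit=20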
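
The two symptoms in this runbook point at different layers: "Connection refused" usually means nothing is listening on the port (for example, the proxy is not running), while a timeout usually means packets are being dropped (firewall or routing). A minimal sketch for telling them apart, assuming bash and coreutils `timeout` are available on the VM; `probe_port` is a hypothetical helper name.

```shell
# Probe a TCP port and report "open", "timeout", or "refused".
# "refused" also covers other immediate failures such as name resolution errors.
probe_port() {
  local host=$1 port=$2 wait_s=${3:-3} rc=0
  timeout "$wait_s" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null || rc=$?
  case $rc in
    0)   echo "open" ;;
    124) echo "timeout" ;;   # coreutils timeout exits 124 when the limit is hit
    *)   echo "refused" ;;
  esac
}
```

Running `probe_port 127.0.0.1 5432` on the VM while the proxy is expected to be up gives a quick answer before digging into IAM or instance state.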

Common Fixes

Runbook 2: Permission Denied (403 Error)

Symptom: IAM error: "permission 'storage.objects.get' for 'projects/PROJECT/buckets/BUCKET' was denied."

bash
# Step 1: Identify the principal (user, service account, group)
# Check which user/SA made the API call
# (object reads are recorded only if Data Access audit logs are enabled for Cloud Storage)
gcloud logging read 'protoPayload.methodName="storage.objects.get"' \
  --limit=5 --format='value(protoPayload.authenticationInfo.principalEmail)'

# Step 2: Check what role the principal has
PRINCIPAL="user:alice@company.com"
gcloud projects get-iam-policy PROJECT \
  --flatten='bindings[].members' \
  --filter="bindings.members:${PRINCIPAL}"

# Step 3: Grant the missing role (use the narrowest role that contains the denied permission)
gcloud projects add-iam-policy-binding PROJECT \
  --member="${PRINCIPAL}" \
  --role=roles/storage.objectViewer

# Step 4: Verify (may take 1-2 minutes to propagate)
gcloud projects get-iam-policy PROJECT \
  --flatten='bindings[].members' \
  --filter="bindings.members:${PRINCIPAL}"

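# Step 5: Test access as the affected principal
# (ls exercises storage.objects.list; stat exercises storage.objects.get)
gsutil ls gs://BUCKET/
gsutil stat gs://BUCKET/OBJECT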
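
Because a new IAM binding can take a minute or two to propagate, a single failed test right after the grant is not conclusive. A minimal retry sketch follows; `retry_until_ok` is a hypothetical helper, and the attempt count and delay are assumptions to tune.

```shell
# Run a command up to N times, sleeping between attempts, until it succeeds.
# Usage: retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ok() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $i"
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "still failing after $attempts attempts" >&2
  return 1
}
```

For example, `retry_until_ok 12 10 gsutil ls gs://BUCKET/` polls for roughly two minutes before concluding that the grant did not fix the problem.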

Common Fixes

Runbook 3: VM Cannot SSH

Symptom: "Connection timed out" or "Permission denied (publickey)."

bash
# Step 1: Verify VM is running
gcloud compute instances describe INSTANCE_NAME \
  --format='value(status)'
# Expected: RUNNING

# Step 2: Check firewall allows SSH (port 22)
gcloud compute firewall-rules list \
  --filter='direction:INGRESS AND allowed:tcp=22'

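# Step 3: Check which SSH key mechanism applies
# enable-oslogin=TRUE: metadata keys are ignored and IAM roles decide access
# block-project-ssh-keys=TRUE: only instance-level keys are accepted
# (both can also be set in project metadata, so an empty result here is not conclusive)
gcloud compute instances describe INSTANCE_NAME --zone=us-central1-a \
  --format='value(metadata.items.enable-oslogin,metadata.items.block-project-ssh-keys)'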

# Step 4: Use gcloud SSH wrapper (handles keys automatically)
gcloud compute ssh INSTANCE_NAME --zone=us-central1-a

# Step 5: If it still fails and OS Login is enabled, check that the user has
# roles/compute.osLogin (or roles/compute.osAdminLogin for sudo access)
gcloud projects get-iam-policy PROJECT \
  --flatten='bindings[].members' \
  --filter='bindings.role:roles/compute.osLogin OR bindings.role:roles/compute.osAdminLogin' \
  --format='table(bindings.role,bindings.members)'

# Step 6: As a last resort, read the serial console output
# (requires the compute.instances.getSerialPortOutput permission)
gcloud compute instances get-serial-port-output INSTANCE_NAME --zone=us-central1-a
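
When the symptom is "Permission denied (publickey)" and you connect with plain `ssh` rather than the gcloud wrapper, one local cause is a private key file that is readable by group or others, which the SSH client refuses to use. A small check, assuming GNU `stat` (Linux); `key_perms_ok` is a hypothetical helper name.

```shell
# Return 0 if the private key file has owner-only permissions (600 or 400).
key_perms_ok() {
  local key=$1 mode
  [ -f "$key" ] || { echo "missing: $key" >&2; return 2; }
  mode=$(stat -c '%a' "$key")   # GNU stat; on macOS/BSD use: stat -f '%Lp'
  case "$mode" in
    600|400) return 0 ;;
    *) echo "$key has mode $mode; run: chmod 600 $key" >&2; return 1 ;;
  esac
}
```

`gcloud compute ssh` stores its key at `~/.ssh/google_compute_engine` by default, so `key_perms_ok ~/.ssh/google_compute_engine` is a reasonable first call.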

Common Fixes

Runbook 4: Load Balancer Health Check Failing

Symptom: Instances marked "unhealthy" by load balancer; all traffic drops.

bash
# Step 1: Check health check configuration
gcloud compute health-checks describe my-health-check \
  --format='yaml(type,httpHealthCheck,checkIntervalSec,timeoutSec,healthyThreshold,unhealthyThreshold)'
# Expected: httpHealthCheck.port=8080, httpHealthCheck.requestPath=/health

# Step 2: SSH to instance and test health endpoint manually
gcloud compute ssh web-instance-abc123 --zone=us-central1-a
# On remote instance:
curl -v http://localhost:8080/health
# Expected: HTTP 200 OK

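# Step 3: Check app is actually listening on port 8080
# On remote instance (use "sudo ss -tlnp" if netstat is not installed):
sudo netstat -tlnp | grep 8080
# Should show: tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
# A listener bound to 127.0.0.1:8080 only is not reachable by the health checker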

# Step 4: Make sure a firewall rule allows the health check ranges; create it if missing
# GCP health checks originate from 35.191.0.0/16 and 130.211.0.0/22
# (add --network=NETWORK if the backends are not on the default network)
gcloud compute firewall-rules create allow-health-checks \
  --direction=INGRESS \
  --allow=tcp:8080 \
  --source-ranges='35.191.0.0/16,130.211.0.0/22'

# Step 5: Check backend service health
gcloud compute backend-services get-health my-backend --global

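# Step 6: If the app is healthy but slow to answer, relax the probe timing
# (update needs the protocol; a threshold of 1 ejects a backend after a single failed probe)
gcloud compute health-checks update http my-health-check \
  --check-interval=10s --timeout=5s --unhealthy-threshold=3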
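
To see why the threshold values matter, here is a simplified model of how consecutive probe results turn into a backend state. It assumes a backend starts as UNHEALTHY, becomes HEALTHY after `healthy_n` consecutive successes, and becomes UNHEALTHY after `unhealthy_n` consecutive failures; it is a sketch for reasoning, not the load balancer's actual implementation.

```shell
# Replay probe results ("ok" or "fail") and print the final backend state.
# Usage: health_state HEALTHY_THRESHOLD UNHEALTHY_THRESHOLD RESULT...
health_state() {
  local healthy_n=$1 unhealthy_n=$2 state=UNHEALTHY ok=0 bad=0 r
  shift 2
  for r in "$@"; do
    if [ "$r" = ok ]; then
      ok=$((ok + 1)); bad=0
      if [ "$ok" -ge "$healthy_n" ]; then state=HEALTHY; fi
    else
      bad=$((bad + 1)); ok=0
      if [ "$bad" -ge "$unhealthy_n" ]; then state=UNHEALTHY; fi
    fi
  done
  echo "$state"
}
```

With an unhealthy threshold of 1, `health_state 2 1 ok ok fail` ends UNHEALTHY after one slow response, while `health_state 2 3 ok ok fail fail` stays HEALTHY, which is why a very low threshold can drop traffic on a brief hiccup.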

Common Fixes

Runbook 5: Slow Queries on Cloud SQL

Symptom: App works but queries take 30+ seconds. High CPU on database.

bash
# Step 1: Enable Query Insights
gcloud sql instances patch web-app-db \
  --insights-config-query-insights-enabled

# Step 2: View slowest queries (via Cloud Console or API)
# Console: Cloud SQL → web-app-db → Query Insights → Top 10 Queries

# Step 3: Identify missing indexes
# Run EXPLAIN ANALYZE on slow query:
cloud-sql-proxy --port 5432 PROJECT:us-central1:web-app-db &
psql -h 127.0.0.1 -U app_user -d app_db
# In psql:
EXPLAIN ANALYZE SELECT * FROM users WHERE email='test@example.com';
# Look for "Seq Scan" (full table scan) = missing index

# Step 4: Add index
CREATE INDEX idx_users_email ON users(email);

# Step 5: Re-run EXPLAIN ANALYZE
# Should show "Index Scan" (much faster)

# Step 6: Confirm Query Insights is still enabled, then keep watching the Query Insights dashboard
gcloud sql instances describe web-app-db --format='value(settings.insightsConfig)'
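
Spotting "Seq Scan" by eye works for one query; for several plans it can help to extract the scanned tables automatically. A minimal sketch that reads EXPLAIN output on stdin; `seq_scan_tables` is a hypothetical helper name, and it assumes the default text plan format.

```shell
# Print the distinct tables that appear in "Seq Scan on <table>" plan nodes.
seq_scan_tables() {
  grep -oE 'Seq Scan on [A-Za-z_][A-Za-z0-9_.]*' | awk '{print $4}' | sort -u
}
```

One way to use it: `psql -h 127.0.0.1 -U app_user -d app_db -c "EXPLAIN ANALYZE SELECT * FROM users WHERE email='test@example.com'" | seq_scan_tables`. A sequential scan on a small table is often fine, so treat the output as a list of candidates rather than a verdict.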

Common Fixes

Debugging Checklist

| Issue | Check | Command |
|---|---|---|
| VM down | Instance state | `gcloud compute instances describe INSTANCE --format='value(status)'` |
| No connectivity | Firewall rules | `gcloud compute firewall-rules list --filter='direction:INGRESS'` |
| App errors | Cloud Logging | `gcloud logging read 'resource.type=gce_instance' --limit=50` |
| Permission denied | IAM policy | `gcloud projects get-iam-policy PROJECT` |
| Slow performance | Cloud Trace / CPU metrics | `gcloud monitoring metrics-descriptors list` |
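
During an incident it can be useful to capture the read-only checks from this checklist in one pass, so the evidence survives later fixes. A sketch under stated assumptions: `collect_snapshot` and the output file names are hypothetical, and the function only runs describe/list/read commands.

```shell
# Save the output of the read-only checklist commands into a directory.
# Usage: collect_snapshot PROJECT INSTANCE ZONE OUTPUT_DIR
collect_snapshot() {
  local project=$1 instance=$2 zone=$3 out=$4
  mkdir -p "$out"
  # "|| true" keeps collecting even if one check fails; errors land in the file.
  gcloud compute instances describe "$instance" --zone="$zone" --project="$project" \
    --format='value(status)' > "$out/vm-status.txt" 2>&1 || true
  gcloud compute firewall-rules list --project="$project" \
    --filter='direction:INGRESS' > "$out/firewall.txt" 2>&1 || true
  gcloud logging read 'resource.type=gce_instance' --project="$project" \
    --limit=50 > "$out/logs.txt" 2>&1 || true
  gcloud projects get-iam-policy "$project" > "$out/iam.txt" 2>&1 || true
  echo "snapshot written to $out"
}
```

A call such as `collect_snapshot PROJECT INSTANCE us-central1-a "incident-$(date +%Y%m%d-%H%M%S)"` leaves four text files that can be attached to the postmortem.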

Interview Questions

Scenario-based Troubleshooting

User reports: "I get 403 Permission Denied when accessing the API." Debug.

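1. Check Cloud Audit Logs for the denied call: `gcloud logging read 'protoPayload.status.code=7' --limit=10` (code 7 is PERMISSION_DENIED).
2. Identify the service account or user that made the call (`protoPayload.authenticationInfo.principalEmail`).
3. Check the IAM policy: `gcloud projects get-iam-policy PROJECT`.
4. Grant the missing role: `gcloud projects add-iam-policy-binding PROJECT --member=... --role=...`.
5. Test again (allow 1-2 minutes for propagation).

Most common fix: the user or service account lacks a role that contains the denied permission (for example, roles/storage.objectViewer for bucket reads). Grant the narrowest role that works rather than a broad one such as roles/compute.admin.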

App works locally but crashes on GCP VMs. Logs show: "command not found: node." What's wrong?

The startup script didn't run, or Node.js was never installed on the VM.

1. SSH to the instance.
2. Check the startup script logs: `sudo journalctl -u google-startup-scripts.service`.
3. Verify Node.js is installed: `node --version`.
4. If it is missing, install it: `sudo apt-get update && sudo apt-get install -y nodejs npm`.
5. Restart the app: `pm2 start app.js` or `nohup node app.js > app.log 2>&1 &`.
6. Verify it's running: `curl http://localhost:8080`.
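
A startup script can fail fast with a clear message instead of crashing later with "command not found". A minimal guard, assuming a POSIX shell; `require_cmd` is a hypothetical helper name.

```shell
# Return 0 if the command exists on PATH; otherwise log to stderr and return 1.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing dependency: $1" >&2
  return 1
}
```

In a startup script this could read `require_cmd node || { apt-get update && apt-get install -y nodejs npm; }`, so the install step runs only when Node.js is actually absent.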

Production database is down. The SLA allows at most 4 hours of downtime per month. What do you do?

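1. Declare the incident (Slack #incidents).
2. Check Cloud SQL status: `gcloud sql instances describe web-app-db --format='value(state)'`, or the instance state on the Console dashboard.
3. If the instance is RUNNABLE but not responding, restart it: `gcloud sql instances restart web-app-db`.
4. If the restart fails, check billing (the account might be disabled).
5. If billing is OK, contact Google Cloud support (open the ticket within 1 hour for a production outage).
6. In parallel, fail over if you can: `gcloud sql instances failover web-app-db` for an HA instance, or `gcloud sql instances promote-replica REPLICA_NAME` for a read replica. If neither exists, list restorable backups: `gcloud sql backups list --instance=web-app-db`.
7. Document the incident for the postmortem.

Most common cause: billing account disabled or disk full (Cloud SQL can't write).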
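
It helps to know how much of the downtime budget an incident has already consumed. The arithmetic below assumes the SLA in the question means at most 4 hours of downtime per month and uses a 30-day month (720 hours); both function names are hypothetical.

```shell
# Availability (percent) implied by an allowed downtime.
# $1 = allowed downtime in hours, $2 = hours in the month (default 720 = 30 days)
availability_pct() {
  awk -v d="$1" -v h="${2:-720}" 'BEGIN { printf "%.2f\n", (1 - d / h) * 100 }'
}

# Minutes of downtime budget left.
# $1 = allowed downtime per month in minutes, $2 = downtime used so far in minutes
budget_left_min() {
  echo $(( $1 - $2 ))
}
```

Under these assumptions, `availability_pct 4` prints `99.44`, and `budget_left_min 240 95` prints `145`: an outage that has already lasted 95 minutes leaves 145 minutes for the rest of the month.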

Common GCP Issues & Fixes

| Issue | Cause | Fix |
|---|---|---|
| Deployment fails with "quota exceeded" | Project hit a resource quota (VMs, IPs, etc.) | Check project quotas with `gcloud compute project-info describe` and regional quotas with `gcloud compute regions describe REGION`. Request an increase via the Console. |
| "Operation timed out" | Slow network, heavy load, or an underprovisioned resource | Increase timeouts, scale up the instance size, check network bandwidth. |
| Billing surprises (bill 10x higher) | Runaway instances, data transfer, or excessive GCP API calls | Check Cloud Monitoring for resource usage spikes. Set budget alerts. |
| SSH key not working | OS Login enforced, or key permissions wrong (0600 required) | Use `gcloud compute ssh`, or grant roles/compute.osLogin (roles/compute.osAdminLogin for sudo). |
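
Quota problems are easier to catch before a deployment fails. The sketch below flags quotas whose usage is at or above a threshold; it reads `METRIC,USAGE,LIMIT` CSV lines on stdin, and `flag_quotas` is a hypothetical helper name.

```shell
# Print "METRIC PERCENT%" for each quota at or above the threshold.
# stdin: CSV lines "METRIC,USAGE,LIMIT"; $1 = threshold in percent (default 80)
flag_quotas() {
  awk -F, -v t="${1:-80}" '$3 > 0 && ($2 / $3) * 100 >= t {
    printf "%s %.0f%%\n", $1, ($2 / $3) * 100
  }'
}
```

One possible way to feed it, assuming the `quotas` list in the project description flattens as expected: `gcloud compute project-info describe --flatten='quotas[]' --format='csv[no-heading](quotas.metric,quotas.usage,quotas.limit)' | flag_quotas 80`.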

Summary

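Systematic debugging beats guessing: identify the symptom, check logs, isolate the layer, inspect resources, fix and test, then prevent recurrence.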