Troubleshooting & Debugging
Debug common GCP issues: connectivity, IAM, databases, deployments. Practice systematic debugging to resolve production incidents.
Debugging Framework
- Step 1: Identify the symptom (app down, slow, permission denied).
- Step 2: Check logs and traces (Cloud Logging, Cloud Trace).
- Step 3: Isolate the layer (compute, network, database, IAM).
- Step 4: Inspect resources.
- Step 5: Fix and test.
- Step 6: Prevent recurrence (alerts, monitoring, documentation).
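Step 3 (isolating the layer) can be sketched as a small triage helper. The function name `classify_symptom` and its patterns are illustrative assumptions, not an exhaustive mapping; the idea is only to turn an error string into a first guess at which layer to inspect.

```shell
#!/bin/sh
# Hypothetical helper: map an error message to the layer to inspect first.
# The patterns below are illustrative assumptions, not a complete list.
classify_symptom() {
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *permission*|*forbidden*|*403*)
      echo "iam" ;;
    *"too many connections"*|*"slow query"*|*deadlock*)
      echo "database" ;;
    *"connection refused"*|*"timed out"*|*timeout*|*unreachable*)
      echo "network" ;;
    *quota*)
      echo "compute" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, `classify_symptom "Connection refused"` points at the network layer, which is where Runbook 1 starts. Treat the result as a starting point only; the logs from Step 2 should confirm or overrule it.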
Runbook 1: VM Cannot Reach Database
Symptom: "Connection refused" or timeout when app connects to Cloud SQL.
# Step 1: Check that the Cloud SQL instance is running
gcloud sql instances describe web-app-db --format='value(state)'
# Expected: RUNNABLE
# Step 2: Start the Cloud SQL Auth Proxy on the VM (v2 command-line syntax)
cloud-sql-proxy --port 5432 PROJECT:us-central1:web-app-db &
# Step 3: Test the connection through the proxy
psql -h 127.0.0.1 -U app_user -d app_db
# If prompted, enter the app_user password (keep it in a secret store, not in the runbook)
# Step 4: If it still fails, check which principals hold roles/cloudsql.client
gcloud projects get-iam-policy PROJECT \
--flatten='bindings[].members' \
--filter='bindings.role=roles/cloudsql.client' \
--format='value(bindings.members)'
# Step 5: Test a direct connection with gcloud
# (temporarily allowlists your IP; requires a public IP on the instance)
gcloud sql connect web-app-db --user=app_user
# Should prompt for a password without timing out
# Step 6: Review recent operations on the instance
gcloud sql operations list --instance=web-app-db --limit=10
Common Fixes
- Payment issue: Billing account disabled. Check Console → Billing.
- IAM issue: Service account lacks cloudsql.client role. Grant: `gcloud projects add-iam-policy-binding PROJECT --member=serviceAccount:APP_SA@PROJECT.iam.gserviceaccount.com --role=roles/cloudsql.client`
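- Network issue: With private IP, the VM must be in the VPC network that is connected to the Cloud SQL instance. Different zones or regions add latency but do not block the connection; for best performance, place both in the same region.
- Firewall issue: Egress firewall blocks outbound to the Cloud SQL IP. Allow TCP:3306 (MySQL) or TCP:5432 (PostgreSQL) for direct connections, and TCP:3307 if you connect through the Cloud SQL Auth Proxy.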
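Step 1 of this runbook can be wrapped in a function so that scripts can branch on the result. This is a minimal sketch: the function name `sql_instance_runnable` and the return-code convention are assumptions, and it relies only on the `gcloud sql instances describe` call already shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the instance reports state RUNNABLE.
# Returns 0 if RUNNABLE, 1 for any other state, 2 if gcloud itself failed.
sql_instance_runnable() {
  instance=$1
  state=$(gcloud sql instances describe "$instance" --format='value(state)') || return 2
  [ "$state" = "RUNNABLE" ]
}
```

A possible use is `sql_instance_runnable web-app-db || echo "instance is not RUNNABLE; check billing and recent operations"`, which keeps the remaining steps from running against an instance that is stopped or suspended.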
Runbook 2: Permission Denied (403 Error)
Symptom: IAM: "permission 'storage.objects.get' for 'projects/PROJECT/buckets/BUCKET' was denied."
# Step 1: Identify the principal (user, service account, group)
# Check which user/SA made the API call
gcloud logging read 'protoPayload.methodName="storage.objects.get"' \
--limit=5 --format='value(protoPayload.authenticationInfo.principalEmail)'
# Step 2: Check what role the principal has
PRINCIPAL="user:alice@company.com"
gcloud projects get-iam-policy PROJECT \
--flatten='bindings[].members' \
--filter="bindings.members:${PRINCIPAL}"
# Step 3: Grant missing role
gcloud projects add-iam-policy-binding PROJECT \
--member="${PRINCIPAL}" \
--role=roles/storage.objectViewer
# Step 4: Verify (may take 1-2 minutes to propagate)
gcloud projects get-iam-policy PROJECT \
--flatten='bindings[].members' \
--filter="bindings.members:${PRINCIPAL}"
# Step 5: Test access
gsutil ls gs://my-bucket/
Common Fixes
- Wrong role: The principal's role lacks the needed permission (for example, Viewer is read-only). Prefer a specific predefined role (such as roles/storage.admin or roles/compute.admin) over the broad Editor role.
- Resource-level perms: Check bucket/instance IAM separately from the project policy. `gcloud storage buckets get-iam-policy gs://bucket`
- Propagation delay: IAM changes take 1-2 minutes. Wait, then retry.
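- Service account key invalid: User-managed keys do not expire by default, but a key can be deleted, disabled, or expired by an organization policy. If so, create a new one: `gcloud iam service-accounts keys create ~/new-key.json --iam-account=SA@PROJECT.iam.gserviceaccount.com`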
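The "wait, then retry" advice for propagation delay can be automated with a generic retry loop. The sketch below is an assumption about how one might do it, not part of any Google tooling; the function name `retry` and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

After granting a role, something like `retry 7 20 gsutil ls gs://my-bucket/` would keep testing access for about two minutes (six waits of 20 seconds) before reporting failure, which matches the 1-2 minute propagation window mentioned above.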
Runbook 3: VM Cannot SSH
Symptom: "Connection timed out" or "Permission denied (publickey)."
# Step 1: Verify the VM is running
gcloud compute instances describe INSTANCE_NAME --zone=us-central1-a \
--format='value(status)'
# Expected: RUNNING
# Step 2: Check that an ingress firewall rule allows SSH (look for tcp:22 in the ALLOW column)
gcloud compute firewall-rules list \
--filter='direction=INGRESS'
# Step 3: Check instance metadata for enable-oslogin and block-project-ssh-keys
gcloud compute instances describe INSTANCE_NAME --zone=us-central1-a \
--format='yaml(metadata.items)'
# Step 4: Use the gcloud SSH wrapper (handles keys automatically)
gcloud compute ssh INSTANCE_NAME --zone=us-central1-a
# Step 5: If OS Login is enabled, check who holds roles/compute.osLogin or roles/compute.osAdminLogin
gcloud projects get-iam-policy PROJECT \
--flatten='bindings[].members' \
--filter='bindings.role:(roles/compute.osLogin roles/compute.osAdminLogin)' \
--format='table(bindings.role, bindings.members)'
# Step 6: As a last resort, read the serial port output (requires the compute.instances.getSerialPortOutput permission)
gcloud compute instances get-serial-port-output INSTANCE_NAME --zone=us-central1-a
Common Fixes
- Firewall blocked: Allow SSH: `gcloud compute firewall-rules create allow-ssh --allow=tcp:22 --source-ranges=YOUR_IP/32`
- OS Login enforced: Your project requires OS Login (via Org Policy). Use `gcloud compute ssh` (not traditional SSH keys).
- VM preempted: Preemptible VMs can terminate anytime. Check: `gcloud compute instances describe INSTANCE_NAME --format='value(scheduling.preemptible)'`
- Disk full: VM might be running out of disk. Connect via Serial Console to investigate.
Runbook 4: Load Balancer Health Check Failing
Symptom: Instances marked "unhealthy" by load balancer; all traffic drops.
# Step 1: Check the health check configuration
gcloud compute health-checks describe my-health-check \
--format='yaml(type, httpHealthCheck)'
# Expected: port: 8080, requestPath: /health
# Step 2: SSH to an instance and test the health endpoint manually
gcloud compute ssh web-instance-abc123 --zone=us-central1-a
# On the remote instance:
curl -v http://localhost:8080/health
# Expected: HTTP 200 OK
# Step 3: Check that the app is actually listening on port 8080
# On the remote instance:
netstat -tlnp | grep 8080
# Should show: tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN
# Step 4: Allow GCP's health checker IPs through the firewall
# GCP health checks originate from 35.191.0.0/16 and 130.211.0.0/22
gcloud compute firewall-rules create allow-health-checks \
--allow=tcp:8080 \
--source-ranges='35.191.0.0/16,130.211.0.0/22'
# Step 5: Check backend service health
gcloud compute backend-services get-health my-backend --global
# Step 6: If probes time out, tune the interval and thresholds
# (this changes settings; it does not trigger a probe)
gcloud compute health-checks update http my-health-check \
--check-interval=10s --timeout=5s --unhealthy-threshold=3
Common Fixes
- App not listening: Health endpoint /health not implemented or app crashed. SSH to VM and test locally.
- Firewall blocks health checks: Allow source ranges 35.191.0.0/16 and 130.211.0.0/22 on port 8080.
- Timeout too short: Increase the timeout, keeping it no longer than the check interval: `gcloud compute health-checks update http my-health-check --check-interval=10s --timeout=10s`
- Wrong port: Verify health check port matches app port. Default is port 80; if app on 8080, health check must also use 8080.
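The manual `curl` test from Step 2 can be turned into a pass/fail probe for use in scripts. This is a sketch under assumptions: the function name `probe_health` and the 5-second limit are illustrative, and it treats only HTTP 200 as healthy, as the runbook expects.

```shell
#!/bin/sh
# Hypothetical helper: probe_health URL
# Prints the HTTP status and succeeds only when the endpoint returns 200.
probe_health() {
  url=$1
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # curl reports 000 when it gets no HTTP response (refused, timeout).
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  echo "$url -> HTTP $code"
  [ "$code" = "200" ]
}
```

Run on the instance, `probe_health http://localhost:8080/health` separates an application problem (non-200 or 000 locally) from a firewall problem (200 locally, yet the backend is still reported unhealthy).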
Runbook 5: Slow Queries on Cloud SQL
Symptom: App works but queries take 30+ seconds. High CPU on database.
# Step 1: Enable Query Insights
gcloud sql instances patch web-app-db \
--insights-config-query-insights-enabled
# Step 2: View the slowest queries (via Cloud Console or API)
# Console: Cloud SQL → web-app-db → Query Insights → Top 10 Queries
# Step 3: Identify missing indexes
# Run EXPLAIN ANALYZE on the slow query:
cloud-sql-proxy --port 5432 PROJECT:us-central1:web-app-db &
psql -h 127.0.0.1 -U app_user -d app_db
# In psql:
EXPLAIN ANALYZE SELECT * FROM users WHERE email='test@example.com';
# Look for "Seq Scan" (full table scan), which suggests a missing index
# Step 4: Add an index
CREATE INDEX idx_users_email ON users(email);
# Step 5: Re-run EXPLAIN ANALYZE
# Should show "Index Scan" (much faster)
# Step 6: Confirm the Query Insights configuration, then keep monitoring there
gcloud sql instances describe web-app-db --format='value(settings.insightsConfig)'
Common Fixes
- Missing indexes: Add indexes on columns used in WHERE clauses. `CREATE INDEX idx_name ON table(column);`
- Excessive data transfer: SELECT * is slow; select specific columns. `SELECT id, name FROM users WHERE active=true;`
- Connection pool exhausted: Too many concurrent queries. Increase max_connections or use connection pooling (PgBouncer).
- Instance too small: db-f1-micro is a shared-core tier and is often the bottleneck. Upgrade to a dedicated-core tier such as db-n1-standard-4 (4 vCPUs, 15 GB RAM); the actual gain depends on the workload.
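Spotting "Seq Scan" in a long plan is easy to automate. The filter below is a minimal sketch; the name `find_seq_scans` is hypothetical, and it assumes PostgreSQL's text-format plan output, where a sequential scan appears as `Seq Scan on <table>`.

```shell
#!/bin/sh
# Hypothetical helper: read an EXPLAIN plan on stdin and print each table
# that is read with a sequential scan (one name per line, de-duplicated).
find_seq_scans() {
  # Capture the first word after "Seq Scan on"; this also matches
  # "Parallel Seq Scan on <table>".
  sed -n 's/.*Seq Scan on \([^ ]*\).*/\1/p' | sort -u
}
```

One way to use it is `psql -h 127.0.0.1 -U app_user -d app_db -c "EXPLAIN SELECT * FROM users WHERE email='test@example.com'" | find_seq_scans`. An empty result means no sequential scans were found; a sequential scan on a small table is often fine, so treat the output as a list of candidates, not a verdict.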
Debugging Checklist
| Issue | Check | Command |
|---|---|---|
| VM down | Instance state | `gcloud compute instances describe INSTANCE --format='value(status)'` |
| No connectivity | Firewall rules | `gcloud compute firewall-rules list --filter='direction:INGRESS'` |
| App errors | Cloud Logging | `gcloud logging read 'resource.type=gce_instance' --limit=50` |
| Permission denied | IAM policy | `gcloud projects get-iam-policy PROJECT` |
| Slow performance | Cloud Trace / CPU metrics | Console: Monitoring → Metrics Explorer, metric `compute.googleapis.com/instance/cpu/utilization` |
Interview Questions
Scenario-based Troubleshooting
1. Check Cloud Audit Logs for the error: `gcloud logging read 'severity=ERROR AND protoPayload.methodName="Method"'`. 2. Identify the service account/user making the call. 3. Check the IAM policy: `gcloud projects get-iam-policy PROJECT`. 4. Grant the missing role: `gcloud projects add-iam-policy-binding PROJECT --member=... --role=...`. 5. Test again (allow 1-2 min for propagation). Most common fix: the user/SA lacks a required role (usually storage.objectViewer or compute.admin).
The startup script didn't run or npm wasn't installed. 1. SSH to instance. 2. Check startup script logs: `sudo journalctl -u google-startup-scripts.service`. 3. Verify Node.js is installed: `node --version`. 4. If missing, reinstall: `sudo apt-get update && sudo apt-get install -y nodejs npm`. 5. Restart app: `pm2 start app.js` or `nohup node app.js > app.log 2>&1 &`. 6. Verify it's running: `curl http://localhost:8080`.
1. Declare an incident (Slack #incidents). 2. Check Cloud SQL status: the dashboard should show the instance state. 3. If the instance is RUNNABLE but not responding, restart it: `gcloud sql instances restart web-app-db`. 4. If the restart fails, check billing (the account might be disabled). 5. If billing is OK, contact Google Cloud support (ticket within 1 hour for prod down). 6. In parallel, fail over if a standby or replica exists: for a high-availability instance, `gcloud sql instances failover web-app-db`; for a read replica, `gcloud sql instances promote-replica REPLICA_NAME` (promotion cannot be undone). 7. Document the incident for the postmortem. Most common cause: billing account disabled or disk full (Cloud SQL can't write).
Common GCP Issues & Fixes
| Issue | Cause | Fix |
|---|---|---|
| Deployment fails with "quota exceeded" | Project hit resource quota (VMs, IPs, etc.) | Check quotas: `gcloud compute project-info describe`. Request increase via Console. |
| "Operation timed out" | Slow network, heavy load, underprovisioned resource | Increase timeouts, scale up instance size, check network bandwidth. |
| Billing surprises (bill 10x higher) | Runaway instances, data transfer, GCP APIs called excessively | Check Cloud Monitoring for resource usage spikes. Set budget alerts. |
| SSH key not working | OS Login enforced, or key perms wrong (0600 required) | Use `gcloud compute ssh` or grant the roles/compute.osLogin role. |
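The key-permission cause in the last row can be checked before blaming IAM or the firewall. This is a minimal sketch with a hypothetical function name, `key_perms_ok`; it accepts mode 0600 as the table requires, and also 0400, which ssh accepts as well.

```shell
#!/bin/sh
# Hypothetical helper: key_perms_ok FILE
# Succeeds only if FILE is a regular file readable (and optionally writable)
# by its owner alone, i.e. mode 0600 or 0400.
key_perms_ok() {
  [ -f "$1" ] || return 1
  # The first 10 characters of `ls -ld` are the file type and permission bits.
  perms=$(ls -ld "$1" | cut -c1-10)
  case "$perms" in
    -rw-------|-r--------) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, `key_perms_ok ~/.ssh/google_compute_engine || chmod 600 ~/.ssh/google_compute_engine` repairs the key that `gcloud compute ssh` creates by default; substitute your own key path if you manage keys manually.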
Summary
Systematic debugging beats guessing:
- Identify symptom and layer (compute, network, identity, storage, database).
- Always check Cloud Logging, Cloud Trace, and metrics first.
- Use gcloud CLI to inspect resources (instances, databases, IAM policies).
- Common culprits: IAM permissions, firewall rules, missing indexes, resource quotas.
- Document runbooks for recurring issues. Automate alerts (set budgets, thresholds).