Interview Preparation: Complete Drill
Master 50+ curated interview questions. Practice behavioral, technical, and scenario-based answers. Land your GCP job.
Interview Strategy
Behavioral (30% of interview): "Tell me about a time you..." — Focus on impact, learning, ownership. Technical (50%): "How would you design...?" — Explain trade-offs, show your thinking. Scenario (20%): "What if...?" — Practical problem-solving; discuss diagnosing, fixing, preventing.
Behavioral Questions
Common Questions
STAR format: Situation: Database down, 5 minute downtime, 10k users affected. Task: Root cause and fix. Action: Checked Cloud Audit Logs → found disk full. Expanded disk capacity while running. Implemented monitoring to alert on disk > 80%. Result: Zero downtime after, prevented recurrence. Learning: Proactive monitoring beats reactive patching.
Situation: $5k/month bill shocking for startup. Task: Cut 30% without downtime. Action: 1. Analyzed spend by service. 2. Right-sized instances (many were over-provisioned). 3. Bought 1-year commitments (25% savings). 4. Moved old data to Archive storage (95% off). 5. Scheduled non-prod to scale down at night. Result: Bill dropped to $3k/month (40% savings). Learning: Disciplined cost culture matters.
Situation: Legacy Oracle database, 500 GB, no downtime allowed. Task: Migrate to Cloud SQL (Postgres) in 3 months. Action: 1. Planned migration in phases (test → staging → prod). 2. Set up replication (real-time data sync). 3. Ran dual-system for 2 weeks (both working). 4. Switched DNS during low-traffic window (2am). 5. Monitored closely for 1 week. Result: Zero downtime, clean cutover. Learning: Planning and communication beat heroics.
Situation: Team wanted to use Lambda for heavy batch processing (not suitable). Task: Advocate for Dataflow (better fit). Action: 1. Ran proof-of-concept (10 min processing cost: $50 Lambda vs $0.50 Dataflow). 2. Presented cost analysis to team. 3. Got buy-in. 4. Implemented Dataflow. Result: 100x cost savings, faster processing. Learning: Data > opinions. Come prepared with evidence.
Early in career, I struggled with Kubernetes complexity. Weakness: Avoided it, used simpler tools. Action: 1. Took a structured course (Linux Academy). 2. Deployed personal project on GKE. 3. Debugged issues hands-on. Result: Comfortable with K8s now, helped teams migrate to GKE. Learning: Growth mindset beats fear. Always be learning.
Technical Deep-dive Questions
Compute & Scaling
Architecture: Multi-region (us-central1, europe-west1, asia-southeast1). Each region: Cloud CDN → Cloud HTTP(S) LB → Cloud Run (auto-scales 0-1000 replicas) → Cloud SQL replica (read in each region, write in primary). Global load balancer routes users to closest region. Data: Replicate SQL to all regions, BigQuery for analytics (read-only). Cost: $50k-100k/month. Trade-offs: Multi-region complexity, data replication overhead, latency acceptable (100-200ms).
Cloud Run auto-scales to 1000 replicas instantly (seconds). Database becomes bottleneck: 1. Read replicas in each region. 2. Cache aggressively (Memorystore Redis; cache hit rate 80%+). 3. Queue heavy requests (Pub/Sub + batch processing). 4. Serve static content from Cloud CDN (avoid origin). 5. Circuit-break if database still overloaded (return 503, backoff). Result: App survives 100x surge. Cost: Higher bill that day, but sustainable.
Compute Engine: Full VM control, expensive ($50+/month), for complex apps. App Engine: Managed, language-specific (Go, Python), auto-scales, but vendor lock-in. Cloud Run: Container-native, scales to 0 (cheapest), simplest, best for early startups. Choice: Cloud Run for MVP (saves ops time), migrate to Compute Engine if special requirements (GPU, complex OS config).
Databases & Storage
Cloud SQL: Transactional (OLTP). Online store, user auth, real-time reads/writes. BigQuery: Analytical (OLAP). Historical data, aggregations, BI. Scan 10 TB in 30 seconds. Firestore: Real-time documents, mobile apps, nested data. Scales to millions of writes/sec. Choose based on workload: transactions → SQL, analytics → BigQuery, real-time docs → Firestore.
Cheapest path: 1. Store data on Cloud Storage (free ingress). 2. Use bq load (streaming, $6.25/TB scanned). 3. Use Dataflow (ETL, auto-scales, same cost). Faster path: Storage Transfer Service (bulk transfer). Most expensive: API streaming ($0.05 per 200 MB). For archive: Store as Parquet on GCS, query directly (cheaper than loading because you only scan columns you need).
1. Automated daily backups (Cloud SQL does this, 35 days retention). 2. Weekly full backup to Cloud Storage (long-term archive). 3. Point-in-time recovery (transaction logs retain 7 days). 4. Test recovery monthly (to ensure backups work). 5. RPO (Recovery Point Objective): 1 day (max data loss if disaster). RTO (Recovery Time Objective): 1 hour (restore to working state). Cost: ~$50/month. Trade-off: More frequent backups = higher cost, lower data loss risk.
Scenario-Based Questions
Fire-fighting Scenarios
1. Check if service is actually down: Ping endpoint (not web UI, could be frontend issue). 2. Check Cloud Logging for errors: `gcloud logging read 'severity=ERROR'`. 3. Check recent deployments (did we just deploy?). Rollback if yes. 4. Check resources (disk full? out of memory? quota hit?). 5. Restart service/pod. 6. If still down after 5 min, escalate + investigate during day. Document incident. RTO: 5-15 min for most issues. Key: Speed over perfection at 3am.
1. Check if table size changed: `SELECT table_name, size_bytes FROM information_schema.table_storage GROUP BY table_name ORDER BY size_bytes DESC;` 2. Check recent data loads (did we add 10x data?). 3. Check if indexes were dropped (rare but possible). 4. Query cost spike? New complex joins introduced. 5. Most common: Queries now scanning 10x more data because new data added. Fix: Use partitioning (scan only recent data) or clustering. Deploy fix and test.
1. Export billing data: `bq query --use_legacy_sql=false 'SELECT service.description, ROUND(SUM(cost), 2) FROM PROJECT.billing WHERE DATE(usage_start_time)=TODAY() GROUP BY service'` 2. Which service? If Compute Engine, someone left many VMs running. If data transfer, check egress (usually from big DataFlow job). 3. Check recent changes (new deployment? new data pipeline?). 4. Kill runaway resources immediately. 5. Long-term: Set budget alerts in GCP Console (alert at $1k/day, hard stop at $2k/day). Most common cause: Dev/test resources not cleaned up or data transfer between regions.
GCP Architecture Patterns
Pattern 1: Resilient, Auto-Scaling Web App
Layers: CDN (CloudFront alt: Cloud CDN) → Global HTTP(S) LB → Cloud Run + Cloud Tasks → Cloud SQL (read replicas) + BigQuery (analytics). Features: Multi-region, auto-scales, caching, background jobs, monitoring. Cost: $500-2k/month depending on traffic.
Pattern 2: Real-time Analytics Pipeline
Flow: Events → Cloud Pub/Sub → Dataflow (ETL) → BigQuery. Features: Low-latency ingestion (microseconds), real-time aggregations. Cost: $100-500/month depending on event volume.
Pattern 3: DevOps Best Practice CI/CD
Flow: GitHub → Cloud Build (compile) → Container Registry (store image) → Deploy to GKE or Cloud Run. Features: Automated testing, canary deployments, rollback. Cost: Build minutes (free tier sufficient for most), container storage ($0.10/GB/month).
Key Concepts to Memorize
| Concept | Definition | Interview Tip |
|---|---|---|
| SLA | Service Level Agreement (uptime commitment, e.g., 99.99%) | State it when talking about HA. "We target 99.9% uptime." |
| RPO | Recovery Point Objective (max data loss acceptable, e.g., 1 hour) | Backup strategy question. "RPO is 1 hour, backup every hour." |
| RTO | Recovery Time Objective (max time to restore, e.g., 4 hours) | Disaster recovery plan. "RTO is 4 hours, we've tested recovery." |
| Latency | Time for request to travel + process (ms) | Mention when talking about global systems. "Global LB keeps latency < 100ms." |
| Throughput | Requests per second (RPS) the system can handle | Scaling metric. "System handles 10k RPS at < 200ms latency." |
| CAP Theorem | Consistency, Availability, Partition tolerance; pick 2 | Database choice discussion. "Firestore chose CA; we accept eventual consistency." |
Practice Drill: Mock Interview
60-minute Mock Interview
- Intro (5 min): "Tell me about yourself. 1-minute pitch."
- Behavioral (10 min): "Tell me about a production incident. How'd you debug?"
- Technical (30 min): "Design a global web app for 100 million users. Trade-offs?"
- Coding/Scenario (10 min): "Write a gcloud command to deploy app to Cloud Run with auto-scaling."
- Questions for Us (5 min): "What's your on-call rotation? How do you handle incidents?"
Self-Scoring Checklist
- Did I explain trade-offs (cost vs complexity, fast vs cheap)?
- Did I ask clarifying questions (scale? budget? SLA requirements)?
- Did I show deeper thinking (considered multi-region, caching, backups)?
- Did I mention real costs (never say "free," say "optimized to $X/month")?
- Did I explain why I chose GCP over AWS/Azure?
Red Flags to Avoid
- Vague answers: Don't say "I use Cloud SQL." Say "I use Cloud SQL with read replicas + daily backups + point-in-time recovery for RPO=1 hour."
- Ignoring scale: Always ask "How many users/requests per second?" before designing.
- No cost awareness: Mention costs. "Multi-region adds complexity and $20k/month for our scale."
- No monitoring: Always include monitoring/alerting in design. "We set alerts at 70% CPU, 85% disk to auto-scale before hitting limits."
- Forgetting backup strategy: Every production system needs backups. "Daily backups, RPO 1 day, RTO 4 hours, tested monthly."
Summary: Interview Winning Checklist
- ✅ Research company before interview (tech stack, size, industry).
- ✅ Prepare 3-5 real stories (incidents, optimizations, migrations) with numbers.
- ✅ Know GCP service map (compute, storage, databases, networking).
- ✅ Practice gcloud CLI commands (deploy, scale, debug).
- ✅ Ask clarifying questions; think out loud; mention trade-offs.
- ✅ Always include monitoring, backups, and security in designs.
- ✅ End interview by asking questions; show genuine interest.
- ✅ Follow up with thank-you note within 24 hours.