Hands-on · Lesson 12 of 12

Interview Preparation: Complete Drill

Master 50+ curated interview questions. Practice behavioral, technical, and scenario-based answers. Land your GCP job.

Interview Strategy

Behavioral (30% of interview): "Tell me about a time you..." Focus on impact, learning, and ownership.
Technical (50%): "How would you design...?" Explain trade-offs and show your thinking.
Scenario (20%): "What if...?" Practical problem-solving: discuss diagnosing, fixing, and preventing.

Behavioral Questions

Common Questions

Tell me about your biggest production incident and how you resolved it.

STAR format:
Situation: Database down, 5 minutes of downtime, 10k users affected.
Task: Find the root cause and fix it.
Action: Checked Cloud Monitoring metrics and Cloud Logging and found the disk full. Expanded disk capacity while the instance was running. Added an alert for disk usage above 80%.
Result: No further downtime afterwards; recurrence prevented.
Learning: Proactive monitoring beats reactive patching.

Tell me about a time you optimized cloud costs.

Situation: A $5k/month bill, painful for a startup.
Task: Cut 30% without downtime.
Action: 1. Analyzed spend by service. 2. Right-sized instances (many were over-provisioned). 3. Bought 1-year committed use discounts (25% savings). 4. Moved old data to Archive storage (about 95% cheaper per GB than Standard). 5. Scheduled non-prod environments to scale down at night.
Result: Bill dropped to $3k/month (40% savings).
Learning: A disciplined cost culture matters.
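Interviewers often ask you to back up a percentage like this on the spot. A minimal sketch of the arithmetic, using only the figures from the answer above; the function name is illustrative.

```python
def savings_percent(before: float, after: float) -> float:
    """Return the percentage reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


# Figures from the STAR answer: $5k/month down to $3k/month.
monthly_before = 5_000
monthly_after = 3_000

pct = savings_percent(monthly_before, monthly_after)
annual_savings = (monthly_before - monthly_after) * 12
print(f"{pct:.0f}% saved, ${annual_savings:,} per year")
```

Quoting the annualized figure alongside the percentage tends to land better with non-technical interviewers.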

Describe a time you led a tech migration (e.g., on-prem to cloud).

Situation: Legacy Oracle database, 500 GB, no downtime allowed.
Task: Migrate to Cloud SQL (Postgres) in 3 months.
Action: 1. Planned the migration in phases (test → staging → prod). 2. Set up replication (real-time data sync). 3. Ran both systems in parallel for 2 weeks. 4. Switched DNS during a low-traffic window (2am). 5. Monitored closely for 1 week.
Result: Zero downtime, clean cutover.
Learning: Planning and communication beat heroics.

Tell me about a time you disagreed with a technical decision. How did you handle it?

Situation: Team wanted to use Cloud Functions (GCP's counterpart to AWS Lambda) for heavy batch processing, which is a poor fit.
Task: Advocate for Dataflow (better fit).
Action: 1. Ran a proof-of-concept (cost of 10 minutes of processing: $50 on functions vs $0.50 on Dataflow). 2. Presented the cost analysis to the team. 3. Got buy-in. 4. Implemented Dataflow.
Result: 100x cost savings, faster processing.
Learning: Data > opinions. Come prepared with evidence.

What's a weakness you have and how do you work on it?

Early in my career, I struggled with Kubernetes complexity.
Weakness: I avoided it and used simpler tools.
Action: 1. Took a structured course (Linux Academy). 2. Deployed a personal project on GKE. 3. Debugged issues hands-on.
Result: Comfortable with K8s now; helped teams migrate to GKE.
Learning: Growth mindset beats fear. Always be learning.

Technical Deep-dive Questions

Compute & Scaling

Design a global web app for 10 million users. Diagram it.

Architecture: Multi-region (us-central1, europe-west1, asia-southeast1). Each region: Cloud CDN → Cloud HTTP(S) LB → Cloud Run (auto-scales from 0 up to the configured maximum, e.g. 1000 instances) → Cloud SQL read replica (reads served in each region, writes go to the primary). The global load balancer routes users to the closest region.
Data: Replicate Cloud SQL to all regions; use BigQuery for analytics (read-only).
Cost: $50k-100k/month as a rough order of magnitude.
Trade-offs: Multi-region complexity, data replication overhead and replica lag, acceptable latency (100-200ms).

How do you handle 100x traffic spike (e.g., viral moment)?

Cloud Run scales out within seconds, up to the service's configured maximum (for example, 1000 instances). The database then becomes the bottleneck:
  1. Add read replicas in each region.
  2. Cache aggressively (Memorystore for Redis; aim for a cache hit rate of 80%+).
  3. Queue heavy requests (Pub/Sub plus batch processing).
  4. Serve static content from Cloud CDN so it never reaches the origin.
  5. Circuit-break if the database is still overloaded (return 503 and have clients back off).
Result: The app survives the 100x surge. Cost: A higher bill that day, but sustainable.
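The circuit-breaking step is the one candidates most often hand-wave. Below is a minimal sketch of the idea in plain Python, not a production library: the class name, thresholds, and response strings are all illustrative assumptions. After a run of failures the breaker "opens" and answers 503 immediately, without calling the database, until a cool-down has passed.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then allow a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open
        self.clock = clock              # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func):
        """Run func() and return (status, body); answer 503 while the circuit is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return 503, "database overloaded, retry later"
            # Cool-down is over: close the circuit for one trial call.
            self.opened_at = None
            self.failures = self.max_failures - 1  # a single failure reopens it
        try:
            body = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return 503, "database error"
        self.failures = 0
        return 200, body
```

In an interview, the point to make is that the open state protects the database from retry storms while it recovers; the 503 tells well-behaved clients to back off.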

Compare Compute Engine vs App Engine vs Cloud Run for a web app. Which to choose?

Compute Engine: Full VM control, but you pay for VMs while they run, even when idle (roughly $50+/month for a modest always-on instance); suited to complex apps.
App Engine: Managed, language-specific runtimes (e.g., Go, Python), auto-scales, but more vendor lock-in.
Cloud Run: Container-native, scales to 0 (often the cheapest for spiky or low traffic), simplest to operate, a good fit for early startups.
Choice: Cloud Run for an MVP (saves ops time); migrate to Compute Engine if special requirements appear (GPU, complex OS config).

Databases & Storage

When do you use BigQuery vs Cloud SQL vs Firestore?

Cloud SQL: Transactional (OLTP). Online store, user auth, real-time reads/writes.
BigQuery: Analytical (OLAP). Historical data, aggregations, BI. Scans terabytes in seconds to minutes, depending on the query and available slots.
Firestore: Real-time documents, mobile apps, nested data. Scales automatically with load, but avoid concentrating writes on a single document or a narrow key range.
Choose based on workload: transactions → Cloud SQL, analytics → BigQuery, real-time docs → Firestore.

You have 10 TB of data. How do you load it into BigQuery cheaply?

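Cheapest path:
  1. Stage the data in Cloud Storage (ingress is free).
  2. Run a batch load job with bq load. Batch loads on the shared slot pool are free; you pay only for the resulting storage. (The $6.25/TB figure is the on-demand price for querying data, not for loading it.)
  3. Use Dataflow only if the data needs transformation on the way in (ETL, auto-scales); you pay for worker time.
Getting the data into Cloud Storage: Storage Transfer Service handles bulk transfers from other clouds or on-prem.
Most expensive: Streaming ingestion. Legacy streaming inserts are billed at about $0.01 per 200 MB (roughly $0.05/GB), which comes to around $500 for 10 TB.
For archive: Store as Parquet on GCS and query it in place as an external table. You avoid BigQuery storage charges, and the columnar format means queries read only the columns they need.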
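The stage-then-batch-load path can be scripted. The sketch below only builds the `bq load` command line rather than calling any BigQuery client library; the dataset, table, and bucket names are hypothetical placeholders.

```python
import shlex


def bq_load_command(dataset: str, table: str, gcs_uri: str,
                    source_format: str = "PARQUET") -> list[str]:
    """Build the argv for a batch load from Cloud Storage into BigQuery."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI")
    return [
        "bq", "load",
        f"--source_format={source_format}",
        f"{dataset}.{table}",   # destination table
        gcs_uri,                # wildcard URIs load many files in one job
    ]


# Hypothetical dataset, table, and bucket names.
cmd = bq_load_command("analytics", "events", "gs://example-bucket/events/*.parquet")
print(shlex.join(cmd))  # shell-safe form; the wildcard is quoted for you
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` on a machine where the `bq` CLI is installed and authenticated.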

Design a backup strategy for a production database.

  1. Automated daily backups (Cloud SQL does this; it keeps the 7 most recent by default, and retention is configurable).
  2. Weekly full export to Cloud Storage (long-term archive).
  3. Point-in-time recovery (transaction logs retained for up to 7 days).
  4. Test recovery monthly to confirm the backups actually restore.
  5. State the objectives. RPO (Recovery Point Objective): 1 day of data loss at most with daily backups alone, or minutes when point-in-time recovery is enabled. RTO (Recovery Time Objective): 1 hour to restore to a working state.
Cost: ~$50/month. Trade-off: More frequent backups = higher cost, lower data loss risk.
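A quick way to sanity-check a backup plan against a stated RPO is to compute the worst-case data loss. The sketch below is a simplified model: the function names are illustrative, and the 5-minute log lag used when point-in-time recovery is enabled is an assumed figure, not a Cloud SQL guarantee.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         pitr_enabled: bool = False,
                         log_lag: timedelta = timedelta(minutes=5)) -> timedelta:
    """Upper bound on lost data if the primary fails just before the next backup.

    With point-in-time recovery, loss is bounded by how far transaction-log
    shipping lags (log_lag is an assumption); without it, by the backup interval.
    """
    return log_lag if pitr_enabled else backup_interval


def meets_rpo(rpo: timedelta, backup_interval: timedelta,
              pitr_enabled: bool = False) -> bool:
    """True if the plan's worst-case loss is within the stated RPO."""
    return worst_case_data_loss(backup_interval, pitr_enabled) <= rpo


daily = timedelta(days=1)
print(meets_rpo(timedelta(days=1), daily))                      # daily backups vs 1-day RPO
print(meets_rpo(timedelta(hours=1), daily))                     # daily backups vs 1-hour RPO
print(meets_rpo(timedelta(hours=1), daily, pitr_enabled=True))  # same, with PITR
```

The middle case is the one to call out in an interview: a 1-hour RPO cannot be met by daily backups alone.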

Scenario-Based Questions

Fire-fighting Scenarios

It's 3am, production is down. The app can't start. You have 15 minutes to work through the on-call runbook. What do you do?

  1. Check if the service is actually down: hit the endpoint directly (not the web UI, since it could be a frontend issue).
  2. Check Cloud Logging for errors: `gcloud logging read 'severity>=ERROR' --freshness=15m --limit=50`.
  3. Check recent deployments (did we just deploy?). Roll back if yes.
  4. Check resources (disk full? out of memory? quota hit?).
  5. Restart the service or pod.
  6. If it is still down after 5 minutes, escalate. Do the deeper investigation during the day and document the incident.
RTO: 5-15 min for most issues. Key: Speed over perfection at 3am.
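A runbook like this is an ordered list of checks where you stop at the first one that fails. The sketch below shows that shape in Python under stated assumptions: `endpoint_is_up` uses only the standard library, and the check names and stand-in lambdas in the example are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

Check = tuple[str, Callable[[], bool]]


def endpoint_is_up(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False  # refused, timed out, DNS failure, or an HTTP error status


def first_failing_check(checks: list[Check]) -> Optional[str]:
    """Run checks in runbook order; return the name of the first that fails."""
    for name, is_healthy in checks:
        try:
            if not is_healthy():
                return name
        except Exception:
            return name  # a check that crashes counts as a failure
    return None


# Stand-in checks; a real runbook would use e.g.
# lambda: endpoint_is_up("https://example.com/healthz")
checks = [
    ("endpoint responds", lambda: True),
    ("no deploy in the last hour", lambda: False),
    ("disk below 80%", lambda: True),
]
print(first_failing_check(checks))
```

Keeping the checks in the same order as the written runbook means the script's output points straight at the step to act on.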

Your data warehouse queries just got 10x slower. Diagnose in 5 minutes.

  1. Check if table sizes changed, replacing DATASET with your dataset name: `SELECT table_id, row_count, size_bytes FROM DATASET.__TABLES__ ORDER BY size_bytes DESC;`
  2. Check recent data loads (did we add 10x data?).
  3. BigQuery has no traditional indexes, so check instead whether partitioning or clustering was lost, for example because a table was recreated without them.
  4. Check for a spike in bytes scanned per query. Were new, complex joins introduced?
  5. Most common cause: Queries now scan 10x more data because new data was added.
Fix: Use partitioning (scan only recent data) or clustering. Deploy the fix and test.
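To explain why partitioning fixes this, it helps to show how a date filter shrinks the bytes scanned. The sketch below is a plain-Python model of partition pruning, not BigQuery code; the table layout (100 daily partitions of 10 GB each) is a made-up example.

```python
from datetime import date, timedelta


def bytes_scanned(partition_bytes: dict[date, int], start: date, end: date) -> int:
    """Sum the bytes in daily partitions whose date falls within [start, end]."""
    return sum(size for day, size in partition_bytes.items() if start <= day <= end)


# Hypothetical table: 100 daily partitions of 10 GB each.
GB = 10**9
first_day = date(2024, 1, 1)
last_day = first_day + timedelta(days=99)
table = {first_day + timedelta(days=i): 10 * GB for i in range(100)}

full_scan = bytes_scanned(table, first_day, last_day)
last_week = bytes_scanned(table, last_day - timedelta(days=6), last_day)

print(full_scan // GB, "GB scanned without a partition filter")
print(last_week // GB, "GB scanned with a 7-day partition filter")
```

With on-demand pricing, bytes scanned drive both cost and latency, so the same filter addresses the slowdown and the bill.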

Bill jumped 5x unexpectedly. Root cause analysis?

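  1. Query the billing export, replacing PROJECT.billing with your billing export table: `bq query --use_legacy_sql=false 'SELECT service.description AS service_name, ROUND(SUM(cost), 2) AS total_cost FROM PROJECT.billing WHERE DATE(usage_start_time) = CURRENT_DATE() GROUP BY service_name ORDER BY total_cost DESC'`
  2. Which service? If Compute Engine, someone probably left many VMs running. If data transfer, check egress (often from a big Dataflow job).
  3. Check recent changes (new deployment? new data pipeline?).
  4. Kill runaway resources immediately.
  5. Long-term: Set budget alerts in Cloud Billing. Budgets cover a month or longer, so express a limit like $1k/day as a threshold on the monthly budget. Budgets only notify; a hard stop (say at $2k/day) needs automation, for example a Pub/Sub-triggered function that disables billing on the project.
Most common cause: Dev/test resources not cleaned up, or data transfer between regions.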
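Once the billing rows are exported, the analysis is a group-by followed by a day-over-day comparison. The sketch below does both in plain Python; the row shape (`service`, `cost`) and the sample figures are invented for illustration and are not the billing export schema.

```python
from collections import defaultdict


def cost_by_service(rows):
    """Total cost per service, highest first (mirrors GROUP BY + ORDER BY)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["service"]] += row["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def biggest_increase(yesterday, today):
    """Return (service, delta) for the service whose cost grew the most."""
    before = dict(cost_by_service(yesterday))
    deltas = {svc: cost - before.get(svc, 0.0) for svc, cost in cost_by_service(today)}
    return max(deltas.items(), key=lambda item: item[1])


# Made-up sample rows.
yesterday = [
    {"service": "Compute Engine", "cost": 100.0},
    {"service": "BigQuery", "cost": 20.0},
]
today = [
    {"service": "Compute Engine", "cost": 300.0},
    {"service": "Compute Engine", "cost": 200.0},
    {"service": "BigQuery", "cost": 25.0},
]

print(cost_by_service(today))
print(biggest_increase(yesterday, today))
```

Ranking by increase rather than by absolute cost is the useful part: the largest line item is not always the one that changed.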

GCP Architecture Patterns

Pattern 1: Resilient, Auto-Scaling Web App

Layers: Cloud CDN (GCP's counterpart to AWS CloudFront) → Global HTTP(S) LB → Cloud Run + Cloud Tasks → Cloud SQL (read replicas) + BigQuery (analytics). Features: Multi-region, auto-scales, caching, background jobs, monitoring. Cost: $500-2k/month depending on traffic.

Pattern 2: Real-time Analytics Pipeline

Flow: Events → Cloud Pub/Sub → Dataflow (ETL) → BigQuery. Features: Low-latency ingestion (typically milliseconds, not microseconds), real-time aggregations. Cost: $100-500/month depending on event volume.
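If asked what "real-time aggregations" means in this pipeline, fixed (tumbling) windows are the simplest example. The sketch below is a plain-Python stand-in for the kind of grouping the Dataflow step performs; it does not use the Apache Beam API, and the event tuples are hypothetical.

```python
from collections import Counter


def count_per_window(events, window_seconds=60):
    """Count events per fixed (tumbling) window, keyed by window start time."""
    counts = Counter()
    for timestamp, _payload in events:
        # Integer-divide the timestamp to find the window it belongs to.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# (timestamp in seconds, payload); note the out-of-order event at t=61.
events = [(0, "a"), (59, "b"), (60, "c"), (125, "d"), (61, "e")]
print(count_per_window(events))
```

Because windows are keyed by event time, the late event at t=61 still lands in the 60-second window. A real streaming pipeline also has to decide how long to wait for such late data.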

Pattern 3: DevOps Best Practice CI/CD

Flow: GitHub → Cloud Build (build and test) → Artifact Registry (store image; it replaces the deprecated Container Registry) → Deploy to GKE or Cloud Run. Features: Automated testing, canary deployments, rollback. Cost: Build minutes (free tier sufficient for most), container storage ($0.10/GB/month).

Key Concepts to Memorize

| Concept | Definition | Interview Tip |
|---|---|---|
| SLA | Service Level Agreement (uptime commitment, e.g., 99.99%) | State it when talking about HA. "We target 99.9% uptime." |
| RPO | Recovery Point Objective (max data loss acceptable, e.g., 1 hour) | Backup strategy question. "RPO is 1 hour, backup every hour." |
| RTO | Recovery Time Objective (max time to restore, e.g., 4 hours) | Disaster recovery plan. "RTO is 4 hours, we've tested recovery." |
| Latency | Time for a request to travel and be processed (ms) | Mention when talking about global systems. "Global LB keeps latency < 100ms." |
| Throughput | Requests per second (RPS) the system can handle | Scaling metric. "System handles 10k RPS at < 200ms latency." |
| CAP Theorem | Consistency, Availability, Partition tolerance; during a network partition you must trade consistency against availability | Database choice discussion. "Firestore is strongly consistent; if we picked an AP store instead, we would accept eventual consistency." |
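Uptime percentages are easier to discuss once they are converted into a downtime budget. A minimal sketch of that conversion; the function name is illustrative and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_percent) / 100 * days * 24 * 60


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

So 99.9% leaves roughly 8.8 hours a year, while 99.99% leaves under an hour, which is why each extra nine changes the architecture and not just the number.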

Practice Drill: Mock Interview

60-minute Mock Interview

  1. Intro (5 min): "Tell me about yourself. 1-minute pitch."
  2. Behavioral (10 min): "Tell me about a production incident. How'd you debug?"
  3. Technical (30 min): "Design a global web app for 100 million users. Trade-offs?"
  4. Coding/Scenario (10 min): "Write a gcloud command to deploy app to Cloud Run with auto-scaling."
  5. Questions for Us (5 min): "What's your on-call rotation? How do you handle incidents?"
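For the coding/scenario step, one possible answer is sketched below. It builds a `gcloud run deploy` command line in Python; the service name, image path, and region are hypothetical, and the instance limits are example values you would tune for the workload.

```python
import shlex


def cloud_run_deploy_command(service: str, image: str, region: str,
                             min_instances: int = 0,
                             max_instances: int = 100) -> list[str]:
    """Build the argv for deploying a container to Cloud Run with scaling bounds."""
    if min_instances < 0 or min_instances > max_instances:
        raise ValueError("need 0 <= min_instances <= max_instances")
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--region={region}",
        f"--min-instances={min_instances}",  # 0 lets the service scale to zero
        f"--max-instances={max_instances}",  # caps scale-out (and the bill)
    ]


# Hypothetical service, image, and region.
cmd = cloud_run_deploy_command(
    "web-app",
    "us-central1-docker.pkg.dev/example-project/web/app:v1",
    "us-central1",
    max_instances=50,
)
print(shlex.join(cmd))
```

Cloud Run scales automatically between the two bounds, so when answering aloud, explain why you chose each limit rather than only reciting the flags.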

Self-Scoring Checklist

Red Flags to Avoid

Summary: Interview Winning Checklist

Congratulations! You've completed the GCP course.