Interview Preparation
Master architecture interview questions. Focus on design decisions, trade-offs, real-world complexities, and enterprise thinking.
Key Insight: Interview vs Implementation
Interview: The hiring manager wants to see your design thinking: Can you ask clarifying questions? Balance competing concerns? Explain decisions? It's not about a "perfect" answer; it's about the journey.
Format: Usually 45-60 minutes. Start broad (clarify scope), narrow (choose trade-offs), deep-dive (defend decisions).
Scenario 1: Design a Social Media Platform (Instagram-like)
Question: Design a social media platform with 100M users, photo uploads, real-time feeds, like counters. How would you architect this in Azure?
Ideal Answer Structure:
1. Clarify Requirements (Ask!):
• Read/write ratio? (feeds are read-heavy, which is crucial for the design)
• Consistency for likes? (must a "like" appear instantly, or is a short delay OK?)
• Peak load? (scale limits)
• Geo distribution required? (latency tolerance)
• Budget constraints?
2. Identify Components:
• Photo storage: Blob storage (immutable, cheap, geo-redundant)
• Metadata (user, caption): SQL DB
• Feeds: Cosmos DB (eventually consistent, low latency)
• Like counts: Redis cache (super fast counters)
• Real-time updates: Service Bus / SignalR (push to clients)
3. Handle Key Challenges:
• Hot spots: A celebrity followed by a large share of the 100M users. If every like hits the DB, it overloads. Solution: cache at CDN/Redis, update the DB asynchronously
• Feed generation: Showing a feed means pulling posts from 1000 followed users at read time, which is very slow. Solution: pre-generate feeds, cache them, update asynchronously
• Scale: 100M users generate massive data. Solution: partition the data in Cosmos DB, for example by user region
• Consistency vs latency: Must the like counter update immediately, or can it lag 5 seconds? Choose eventual consistency for speed
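The hot-spot fix above (absorb likes in a cache, update the database asynchronously) can be sketched as a small write-behind buffer. This is only an illustration: the `LikeBuffer` class is hypothetical, a plain dict stands in for the database, and an in-memory `Counter` stands in for Redis.

```python
from collections import Counter


class LikeBuffer:
    """Absorbs likes in memory and writes them to the store in batches."""

    def __init__(self, store):
        self.store = store        # stand-in for the database: post_id -> count
        self.pending = Counter()  # likes received but not yet written

    def like(self, post_id):
        # Hot path: a single in-memory increment, no database round trip.
        self.pending[post_id] += 1

    def count(self, post_id):
        # Readers see the stored count plus whatever is still buffered.
        return self.store.get(post_id, 0) + self.pending[post_id]

    def flush(self):
        # Background job: one write per post, however many likes arrived.
        writes = 0
        for post_id, delta in self.pending.items():
            self.store[post_id] = self.store.get(post_id, 0) + delta
            writes += 1
        self.pending.clear()
        return writes
```

With this shape, a thousand likes on one post cost a thousand cheap increments and one database write per flush. The price is the lag the interviewer should hear you name: the stored count trails the real one until the next flush.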
4. Architecture Sketch:
Users (distributed globally)
↓
CDN (photos, static content)
↓
API Gateway (rate limit, auth)
↓
Services (independent, scalable)
├── Photo Service (Blob + resizing)
├── Feed Service (Cosmos DB, Redis cache)
├── Like Service (Redis for hot counts, SQL for trending)
└── User Service (SQL DB, Azure AD)
↓
Data Lake (analytics on engagement)
5. Trade-offs Discussed:
• Consistency vs latency: Chose eventual (like counter may lag, but feed fast)
• Cost vs features: Multi-region? Only if required (cost high). Single region + CDN for latency
• Real-time updates: SignalR = expensive. Alternative: polling feeds every 5 sec (slower, cheaper)
Interviewer Happy Because: You asked clarifying questions, identified components, solved hot-spots problem (realistic!), explained trade-offs, showed thinking.
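The "pre-generate feeds" idea from the feed-generation challenge is often called fan-out on write. A minimal sketch follows, assuming in-memory dictionaries in place of Cosmos DB and Redis; `FEED_LENGTH`, `follow`, `publish`, and `read_feed` are hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque

FEED_LENGTH = 100  # hypothetical cap on how many post ids a cached feed keeps

followers = defaultdict(set)                             # author -> followers
feeds = defaultdict(lambda: deque(maxlen=FEED_LENGTH))   # user -> newest-first ids


def follow(user, author):
    followers[author].add(user)


def publish(author, post_id):
    # Write path: push the new post id into every follower's cached feed.
    for user in followers[author]:
        feeds[user].appendleft(post_id)


def read_feed(user, limit=10):
    # Read path: no joins across followed users, just a slice of the cache.
    return list(feeds[user])[:limit]
```

Reads become cheap, but each post now costs one write per follower. That is worth raising in the interview, since it is exactly where the celebrity hot spot comes back on the write side.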
Scenario 2: Migrate Legacy On-Premises Data Center to Azure (Enterprise)
Question: Your company has a 20-year-old on-premises data center with 200 applications, databases, and a mainframe. You have 2 years to migrate everything to Azure. How do you approach this?
Ideal Answer:
1. Scope (Most Important):
• Categorize apps: Quick wins (lift-shift VMs), expensive (databases), hard (mainframes, legacy)
• Define phases: Phase 1 (easy apps, month 1-6), Phase 2 (core systems, month 6-18), Phase 3 (mainframes, decommission on-prem, month 18-24)
2. Choice: Lift-and-Shift vs Refactor vs Retire?
• Lift-shift: 70% of apps (VM → VM, cheap, fast, low risk). Use Azure Migrate
• Refactor: 20% of apps (optimize for cloud, redo architecture). Legacy enterprise software
• Retire: 10% of apps (turn off, no longer needed). Legacy forgotten systems
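Turning the percentages above into app counts makes the plan concrete for the interviewer. The helper below is a hypothetical sketch; the 70/20/10 split and the 200-app total come from the scenario.

```python
def split_portfolio(total_apps, shares):
    """Return app counts per strategy from percentage shares summing to 100."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Integer division is exact for the scenario's numbers; other totals
    # may leave a remainder that has to be assigned by hand.
    return {name: total_apps * pct // 100 for name, pct in shares.items()}


plan = split_portfolio(200, {"lift_and_shift": 70, "refactor": 20, "retire": 10})
print(plan)
```

For 200 apps this gives 140 to lift and shift, 40 to refactor, and 20 to retire.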
3. Infrastructure Design:
• Hub-and-spoke landing zone (covered in Lesson 3)
• Connectivity: ExpressRoute (dedicated private link from on-prem to Azure, not over the public internet)
• Network peering: All Azure apps in spokes, centralized firewall in the hub
• Identity: Hybrid Azure AD (on-prem users keep working)
4. Database Strategy (Critical!):
• SQL Server on-prem? Options: SQL DB (managed) or SQL Server on an Azure VM (maximum compatibility)
• Legacy databases? Might need to keep as-is temporarily
• Plan: Migrate critical DBs first (others depend on them)
5. Risk Mitigation:
• Pilot program: Migrate 3-5 non-critical apps first (learn from failures)
• Testing: Full testing phase after each migration (apps may behave differently)
• Rollback plan: If a migration fails, roll back quickly to on-prem (keep on-prem running in parallel for 6 months)
• Runbooks: Document all procedures (repeatable for 200 apps)
6. Cost Considerations:
• Biggest cost: Parallel running (cloud + on-prem) during migration. Plan exit date
• License: Some software tied to on-prem. Negotiate cloud licenses
• Training: Teams need Azure skills (budget for ramp-up)
Success Metrics: 50% of apps migrated by 12 months (pace check). Incidents <5% (quality check). Cost on cloud ≤ on-prem (ROI check).
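The three success metrics can be written as a checkpoint function for the 12-month review. This is a sketch under assumptions: the source does not say what the 5% incident figure is measured against, so the code treats it as incidents per migrated app, and the function name and parameters are hypothetical.

```python
def check_migration(migrated, total, incident_count, cloud_cost, onprem_cost):
    """Evaluate the pace, quality, and ROI checks at the 12-month mark."""
    pace_ok = migrated / total >= 0.50
    # Assumption: incident rate = incidents per migrated app.
    quality_ok = migrated > 0 and incident_count / migrated < 0.05
    roi_ok = cloud_cost <= onprem_cost
    return {"pace": pace_ok, "quality": quality_ok, "roi": roi_ok}


print(check_migration(110, 200, 4, 90_000, 100_000))
```

A failed check is a prompt for a conversation (slow the pace, add testing, revisit sizing), not a verdict on the program.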
Scenario 3: Cost Exploded by 300% (Problem Solving)
Question: Your startup's Azure bill was $3k last month, now $12k this month (4x increase). Your team thinks there's a bug. How do you investigate and fix?
Ideal Answer:
1. Immediate Investigation (5 min):
• Azure Cost Management dashboard: Sort by resource (what's costing most?)
• Likely culprits: Data Transfer, Database, VM count, Storage
2. Deep Dive (assume data transfer = $9k spike):
• Check: Are we transferring data between regions? (costs $0.01-0.02/GB)
• Hypothesis: New feature reads blob from different region every request
• Action: Check logs (Application Insights) for egress patterns
• Root cause found: New analytics job transferring 1TB/day between regions (unoptimized)
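Before accepting a root cause, it helps to check that the numbers add up. The sketch below uses the scenario's figures ($0.01-0.02/GB, 1 TB/day, a $9k spike) and assumes decimal terabytes and a 30-day month; the function names are hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_transfer_cost(gb_per_day, usd_per_gb, days=30):
    """Monthly cost in USD of moving gb_per_day at a flat per-GB rate."""
    return gb_per_day * usd_per_gb * days


def gb_per_day_for_cost(monthly_usd, usd_per_gb, days=30):
    """Daily volume in GB that would produce monthly_usd at that rate."""
    return monthly_usd / (usd_per_gb * days)


low = monthly_transfer_cost(1 * GB_PER_TB, 0.01)
high = monthly_transfer_cost(1 * GB_PER_TB, 0.02)
needed = gb_per_day_for_cost(9000, 0.02)
print(f"1 TB/day costs about ${low:,.0f} to ${high:,.0f} per month")
print(f"A $9k spike at $0.02/GB needs about {needed / GB_PER_TB:.0f} TB/day")
```

At these rates 1 TB/day accounts for only a few hundred dollars a month, so a $9k spike implies a much larger volume, a pricier transfer path, or a second culprit. Saying so out loud is the kind of root-cause rigor the interviewer is looking for.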
3. Fixes:
• Option A: Cache blob locally (reduce transfers 90%)
• Option B: Replicate blob to same region (eliminate inter-region cost)
• Option C: Compress data before transfer (reduce volume 50%)
• Option D: Reduce frequency (analytics nightly instead of hourly)
4. Prevention:
• Set budget alert at $5k/month (warns before explosion)
• Add cost anomaly detector (auto-alerts if spike detected)
• Require architecture review for new features (catch expensive designs early)
• FinOps: Document why each cost exists, own it per team
Interviewer Sees: Systematic investigation, root-cause thinking (not just "bug"), creative solutions, governance mindset.
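The anomaly-alert idea from the prevention list can be illustrated with a simple trailing-average check on daily costs. This is a hand-rolled sketch, not how Azure's own anomaly detection works; the window and factor are arbitrary assumptions.

```python
from statistics import mean


def find_spikes(daily_costs, window=7, factor=2.0):
    """Return indexes of days costing more than factor x the trailing mean."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])
        if baseline > 0 and daily_costs[i] > factor * baseline:
            spikes.append(i)
    return spikes


costs = [100] * 7 + [105, 400, 110]
print(find_spikes(costs))
```

Here only the 400 day is flagged: it is more than double the average of the preceding week, while the days around it stay within the threshold.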
Scenario 4: Design for Regulatory Compliance (Advanced)
Question: Design a healthcare app for US hospitals that must comply with HIPAA and SOC 2 and have availability above 99.9%. Data must stay in the US. Budget is generous. How would you architect it?
Ideal Answer:
1. Security-First Approach:
• Encryption: At rest (storage, DB) + in transit (TLS)
• Access: RBAC (doctors see only their patients) + MFA
• Auditing: Every access logged (who, what, when, why)
• Incident response: 24/7 monitoring, alert on suspicious activity
2. Architecture for HIPAA:
• Network: Private VNet (no public internet exposure), ExpressRoute for on-prem connectivity
• Databases: Always encrypted, automatic backups (immutable), SQL DB with RA-GRS (read-access geo-redundant storage) for backups
• Storage: Patient data ALWAYS encrypted, audit logs encrypted, cannot be deleted (append-only)
• Managed Identity: Don't use keys in code, use Azure AD identities
3. High Availability (99.9% = ~43 min downtime/month, ~8.8 hours/year):
• Multi-zone deployment (3 zones) for a higher SLA than a single zone (99.99% for VMs spread across zones)
• Backup to separate region (multi-region SQL geo-replication)
• Auto-failover after 30 sec (patients can't wait hours)
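Availability targets are easier to defend when converted into a downtime budget. A short worked calculation, assuming a 30-day month and a 365-day year (the function name is hypothetical):

```python
def downtime_budget_minutes(sla_percent, period_days):
    """Minutes of downtime allowed over period_days at the given SLA."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    per_month = downtime_budget_minutes(sla, 30)
    per_year = downtime_budget_minutes(sla, 365)
    print(f"{sla}%: {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

99.9% allows roughly 43 minutes of downtime per month, or roughly 8.8 hours per year. A 30-second failover fits that budget comfortably, but a single multi-hour outage would consume most of the year's allowance.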
4. Compliance Audit Trail:
• Event logging: Every DB access logged to immutable storage
• Retention: 7 years (HIPAA itself requires 6 years for required documentation; state rules can demand longer)
• Access reviews: Quarterly, automated (should everyone who has access still have it?)
• Amendments: Patient requests correction? Log both (never delete medical records)
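One way to explain "immutable, append-only" in an interview is a hash-chained log, where each entry commits to the one before it, so edits and deletions become detectable. The class below is an illustrative in-memory sketch with hypothetical names; in practice a managed immutable store would normally do this job.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry stores the hash of the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, who, what, when, why):
        prev_hash = self._entries[-1]["hash"] if self._entries else "0" * 64
        record = {"who": who, "what": what, "when": when, "why": why,
                  "prev": prev_hash}
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._entries.append(record)
        return record["hash"]

    def verify(self):
        # Recompute every hash; any edited or removed entry breaks the chain.
        prev_hash = "0" * 64
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if entry["prev"] != prev_hash:
                return False
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True
```

An amendment is handled the same way as any other event: the correction is appended as a new entry, and the original stays in the chain.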
5. Cost (Not Highest Priority Here!):
• Premium services necessary (encryption, multi-zone, compliance) = higher cost OK
• Estimate: ~$20k/month (compliance + security > cost optimization)
Success: Passes HIPAA audit, achieves 99.9% SLA, responsive to patient requests, maintains 7-year audit trails.
Scenario 5: Choose Azure Service (Real Complexity)
Question: You need to run 1000 analytics jobs daily, each runs 1-5 hours, unpredictable. Which Azure compute service: VMs (on-demand), Batch, Kubernetes (AKS), or Container Instances? Evaluate pros/cons.
Ideal Answer (Not Simple!):
VMs (on-demand):
✓ Full control, any runtime
✗ Manual provisioning (slow), high cost (pay even if idle), not designed for this
Azure Batch:
✓ Built for batch jobs, auto-scale VMs, managed
✓ Spot instances available (cheap!)
✗ More setup overhead, not as flexible
Verdict: Good if pure batch (no interactive)
Azure Kubernetes (AKS):
✓ Powerful, flexible, scales instantly
✓ Can use spot node pools (70% cheaper)
✗ Complex (Kubernetes learning curve), overkill for simple jobs
Verdict: If jobs diverse (ML + analytics + streaming), worth it
Container Instances (ACI):
✓ Simplest, serverless (no VMs to manage)
✓ Cheap for short jobs (pay per second!)
✗ Slower startup (~10 sec), not for ultra-quick scaling
Verdict: Best for simple analytics, no cluster overhead
MY RECOMMENDATION:
ACI for the quickest path (1000 jobs = up to 1000 containers in parallel, one per job, within subscription quotas). With jobs running 1-5 hours, per-second billing is fine, and a 10-second startup is acceptable overhead. If you later need more control, migrate to AKS (the containers are already built).
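A quick back-of-the-envelope calculation supports the recommendation. The sketch below uses only the scenario's numbers (1000 jobs, 1-5 hours each, ~10 seconds startup); the function names are hypothetical and no prices are assumed.

```python
def daily_compute_hours(jobs, min_hours, max_hours):
    """Lower and upper bound on total compute hours per day."""
    return jobs * min_hours, jobs * max_hours


def startup_overhead(startup_seconds, job_hours):
    """Startup time as a fraction of one job's runtime."""
    return startup_seconds / (job_hours * 3600)


low, high = daily_compute_hours(1000, 1, 5)
print(f"Daily compute: {low} to {high} hours")
print(f"Startup overhead: {startup_overhead(10, 5):.2%} to {startup_overhead(10, 1):.2%}")
```

The workload is 1000 to 5000 compute hours a day, and startup adds well under 1% to any single job. That is why the 10-second start is a minor concern here, and why the per-hour price of each option (including spot discounts) matters far more than startup speed.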
Interview Tips
Do's
- ✓ Ask clarifying questions first (don't assume)
- ✓ State assumptions clearly ("I'm assuming 100 req/sec, is that right?")
- ✓ Discuss trade-offs (show you know nothing is perfect)
- ✓ Deep dive on hard parts (scaling, consistency, cost)
- ✓ Explain "why" not just "what" (design thinking, not memorization)
Don'ts
- ✗ Give an immediate answer without clarifying scope (a very common mistake)
- ✗ Over-engineer (don't use all 5 pillars if not needed)
- ✗ Ignore cost (real architects balance cost + performance + security)
- ✗ Leave a design choice unexplained ("I just put it there")
- ✗ Dismiss a problem ("That won't happen"; always assume it might)
Summary
- Interview structure: Clarify → Design → Handle complexity → Trade-offs → Defend
- Common themes: Consistency vs latency, scale challenges, cost awareness, security first, automation
- Enterprise thinking: Compliance matters. Migration is hard. Teams need training. Runbooks matter.
- Be honest: "I don't know" beats a wrong answer. "Let me think about that" is OK