
Interview Preparation

Master architecture interview questions. Focus on design decisions, trade-offs, real-world complexities, and enterprise thinking.

Key Insight: Interview vs Implementation

Interview: The hiring manager wants to see your design thinking. Can you ask clarifying questions, balance competing concerns, and explain your decisions? The goal is not a "perfect" answer; it is showing how you reason your way to one.

Format: Usually 45-60 minutes. Start broad (clarify scope), narrow (choose trade-offs), deep-dive (defend decisions).

Scenario 1: Design a Social Media Platform (Instagram-like)

Question: Design a social media platform with 100M users, photo uploads, real-time feeds, like counters. How would you architect this in Azure?

Ideal Answer Structure:

1. Clarify Requirements (Ask!):
• Read/write ratio? (feeds are read-heavy, which drives much of the design)
• Consistency for likes? (must a "like" appear instantly, or is a short delay OK?)
• Peak load? (scale limits)
• Geo distribution required? (latency tolerance)
• Budget constraints?

2. Identify Components:
• Photo storage: Blob storage (immutable, cheap, geo-redundant)
• Metadata (user, caption): SQL DB
• Feeds: Cosmos DB (eventually consistent, low latency)
• Like counts: Redis cache (super fast counters)
• Real-time updates: SignalR (push to clients), with Service Bus for messaging between backend services

3. Handle Key Challenges:
• Hot spots: Celebrity with 100M followers. If every like hits DB, overload. Solution: Cache at CDN/Redis, async update
• Feed generation: Showing feed = pulling posts from 1000 users. Very slow. Solution: Pre-generate feeds, cache, update async
• Scale: 100M users generate massive data. Solution: Partition data in Cosmos DB by a high-cardinality key such as user ID (so load spreads evenly), and replicate to the regions where users are
• Consistency vs latency: Like counter immediate? Or can lag 5 seconds? Choose eventual consistency for speed
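
The "pre-generate feeds" idea above is often called fan-out on write. The following is a minimal in-memory sketch of it, not a production design: `FeedStore`, `FEED_LIMIT`, and the dictionaries standing in for a cache such as Redis are all hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque

FEED_LIMIT = 100  # hypothetical cap on how many post ids each cached feed keeps


class FeedStore:
    """In-memory stand-in for a feed cache; illustration only."""

    def __init__(self):
        self.followers = defaultdict(set)  # author -> set of follower ids
        # user -> newest-first post ids, oldest entries dropped past FEED_LIMIT
        self.feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def publish(self, author, post_id):
        # Fan-out on write: push the post id into every follower's cached feed.
        # In a real system this loop would run asynchronously, off a queue.
        for follower in self.followers[author]:
            self.feeds[follower].appendleft(post_id)

    def read_feed(self, user, count=10):
        # Reading is one cache lookup instead of a query across all followees.
        return list(self.feeds[user])[:count]


store = FeedStore()
store.follow("alice", "bob")
store.follow("alice", "carol")
store.publish("bob", "post-1")
store.publish("carol", "post-2")
print(store.read_feed("alice"))  # ['post-2', 'post-1']
```

The trade-off worth naming in an interview: writes become more expensive as follower counts grow, so an account with millions of followers may be better served by pulling its posts at read time instead of fanning them out.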

4. Architecture Sketch:

    Users (distributed globally)
      ↓
    CDN (photos, static content)
      ↓
    API Gateway (rate limit, auth)
      ↓
    Services (independent, scalable)
    ├── Photo Service (Blob + resizing)
    ├── Feed Service (Cosmos DB, Redis cache)
    ├── Like Service (Redis for hot counts, SQL for trending)
    └── User Service (SQL DB, Azure AD)
      ↓
    Data Lake (analytics on engagement)

5. Trade-offs Discussed:
• Consistency vs latency: Chose eventual (like counter may lag, but feed fast)
• Cost vs features: Multi-region? Only if required (cost high). Single region + CDN for latency
• Real-time updates: SignalR = expensive. Alternative: polling feeds every 5 sec (slower, cheaper)
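
To make the eventual-consistency trade-off for like counters concrete, here is a rough sketch of buffering increments in a fast store and persisting them in batches. The `LikeCounter` class and its two `Counter` objects (standing in for Redis and the database) are assumptions for illustration, not a specific Azure API.

```python
from collections import Counter


class LikeCounter:
    """Buffers likes in memory and writes them to durable storage in batches."""

    def __init__(self):
        self.durable = Counter()  # stand-in for the database
        self.pending = Counter()  # stand-in for cached increments not yet persisted

    def like(self, post_id):
        # Hot path: a cheap in-memory increment, no database write.
        self.pending[post_id] += 1

    def read(self, post_id):
        # Readers of the cache see the latest count; readers that only look
        # at the database lag until the next flush.
        return self.durable[post_id] + self.pending[post_id]

    def flush(self):
        """Persist buffered increments; returns how many rows were written."""
        rows = len(self.pending)
        self.durable.update(self.pending)  # Counter.update adds the counts
        self.pending.clear()
        return rows


counter = LikeCounter()
for _ in range(1000):
    counter.like("celebrity-post")
rows_written = counter.flush()
print(rows_written, counter.durable["celebrity-post"])  # 1 1000
```

A thousand likes on one post collapse into a single database write per flush interval, which is the point of accepting a few seconds of lag.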

Interviewer Happy Because: You asked clarifying questions, identified components, solved hot-spots problem (realistic!), explained trade-offs, showed thinking.

Scenario 2: Migrate Legacy On-Premises Data Center to Azure (Enterprise)

Question: Your company has a 20-year-old on-premises data center with 200 applications, databases, mainframe. You have 2 years to migrate everything to Azure. How do you approach this?

Ideal Answer:

1. Scope (Most Important):
• Categorize apps: Quick wins (lift-shift VMs), expensive (databases), hard (mainframes, legacy)
• Define phases: Phase 1 (easy apps, month 1-6), Phase 2 (core systems, month 6-18), Phase 3 (mainframes, decommission on-prem, month 18-24)

2. Choice: Lift-and-Shift vs Refactor vs Retire?
• Lift-shift: 70% of apps (VM → VM, cheap, fast, low risk). Use Azure Migrate
• Refactor: 20% of apps (optimize for cloud, redo architecture). Legacy enterprise software
• Retire: 10% of apps (turn off, no longer needed). Legacy forgotten systems

3. Infrastructure Design:
• Hub-and-spoke landing zone (covered in Lesson 3)
• Connectivity: ExpressRoute (dedicated link on-prem to Azure, not internet)
• Network peering: All Azure apps in spokes, centralized firewall
• Identity: Hybrid Azure AD (on-prem users still work)

4. Database Strategy (Critical!):
• SQL Server on-prem? Options: SQL DB (managed), SQL on Azure VM (compat)
• Legacy databases? Might need to keep as-is temporarily
• Plan: Migrate critical DBs first (others depend on them)

5. Risk Mitigation:
• Pilot program: Migrate 3-5 non-critical apps first (learn failures)
• Testing: Full testing phase after each migration (apps may behave differently)
• Rollback plan: If migration fails, quick rollback to on-prem (keep running 6 months parallel)
• Runbooks: Document all procedures (repeatable for 200 apps)

6. Cost Considerations:
• Biggest cost: Parallel running (cloud + on-prem) during migration. Plan exit date
• License: Some software tied to on-prem. Negotiate cloud licenses
• Training: Teams need Azure skills (budget for ramp-up)

Success Metrics: 50% of apps migrated by month 12 (pace check). Fewer than 5% of migrations cause an incident (quality check). Cloud cost ≤ on-prem cost (ROI check).
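
The three checks can be written down as a small month-12 scorecard. This is only a sketch: the `MigrationStatus` fields, the reading of "incidents" as the share of migrated apps that caused an incident, and the sample cost figures are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MigrationStatus:
    apps_total: int
    apps_migrated: int
    migrations_with_incident: int
    monthly_cost_cloud: float
    monthly_cost_onprem: float


def check_metrics(s: MigrationStatus) -> dict:
    """Evaluate the pace, quality, and ROI checks at the month-12 checkpoint."""
    migrated_share = s.apps_migrated / s.apps_total
    incident_rate = (
        s.migrations_with_incident / s.apps_migrated if s.apps_migrated else 0.0
    )
    return {
        "pace_ok": migrated_share >= 0.50,      # 50% of apps migrated
        "quality_ok": incident_rate < 0.05,     # incidents under 5%
        "roi_ok": s.monthly_cost_cloud <= s.monthly_cost_onprem,
    }


# Illustrative numbers only, not taken from a real migration.
status = MigrationStatus(
    apps_total=200,
    apps_migrated=104,
    migrations_with_incident=4,
    monthly_cost_cloud=410_000,
    monthly_cost_onprem=450_000,
)
print(check_metrics(status))
```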

Scenario 3: Cost Exploded by 300% (Problem Solving)

Question: Your startup's Azure bill was $3k last month, now $12k this month (4x increase). Your team thinks there's a bug. How do you investigate and fix?

Ideal Answer:

1. Immediate Investigation (5 min):
• Azure Cost Management dashboard: Sort by resource (what's costing most?)
• Likely culprits: Data Transfer, Database, VM count, Storage

2. Deep Dive (assume data transfer = $9k spike):
• Check: Are we transferring data between regions? (costs $0.01-0.02/GB)
• Hypothesis: New feature reads blob from different region every request
• Action: Check logs (Application Insights) for egress patterns
• Root cause found: New analytics job transferring 1TB/day between regions (unoptimized)
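
Before settling on a root cause, it helps to check that the numbers add up. The sketch below uses only the figures quoted above ($0.01-0.02/GB, 1 TB/day, a $9k spike), treats 1 TB as 1000 GB, and assumes a 30-day month; the function names are made up for illustration.

```python
def monthly_transfer_cost(gb_per_day, usd_per_gb, days=30):
    """Monthly cost of moving gb_per_day between regions at a flat per-GB rate."""
    return gb_per_day * days * usd_per_gb


def gb_per_day_for_cost(target_usd, usd_per_gb, days=30):
    """Daily volume that would be needed to produce a given monthly charge."""
    return target_usd / (days * usd_per_gb)


low = monthly_transfer_cost(1000, 0.01)    # about $300/month
high = monthly_transfer_cost(1000, 0.02)   # about $600/month
needed = gb_per_day_for_cost(9000, 0.02)   # about 15,000 GB/day

print(f"1 TB/day costs ${low:,.0f}-${high:,.0f} per month")
print(f"A $9k spike at $0.02/GB implies ~{needed / 1000:.0f} TB/day")
```

At these rates, 1 TB/day accounts for a few hundred dollars a month, so a $9k spike would point to a much larger volume or a second cost driver. Doing this check out loud shows the interviewer you validate a hypothesis against the bill rather than stopping at the first plausible cause.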

3. Fixes:
• Option A: Cache blob locally (reduce transfers 90%)
• Option B: Replicate blob to same region (eliminate inter-region cost)
• Option C: Compress data before transfer (reduce volume 50%)
• Option D: Reduce frequency (analytics nightly instead of hourly)

4. Prevention:
• Set budget alert at $5k/month (warns before explosion)
• Add cost anomaly detector (auto-alerts if spike detected)
• Require architecture review for new features (catch expensive designs early)
• FinOps: Document why each cost exists, own it per team
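
Azure Cost Management offers budgets and anomaly alerts as built-in features, so in practice you would configure them rather than code them. The sketch below only illustrates the logic behind the two alerts; the threshold fractions, the z-score rule, and the sample daily spend are assumptions.

```python
from statistics import mean, stdev


def budget_alert(month_to_date, budget=5000.0, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that month-to-date spend has crossed."""
    return [t for t in thresholds if month_to_date >= t * budget]


def is_anomaly(history, today, z=3.0):
    """Flag today's spend if it is more than z standard deviations above
    the mean of the trailing daily history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > z


daily_spend = [100, 98, 105, 102, 99, 101, 97]  # roughly a $3k/month run rate

print(is_anomaly(daily_spend, 400))  # True: a 4x day stands out immediately
print(is_anomaly(daily_spend, 104))  # False: within normal variation
print(budget_alert(4200))            # [0.5, 0.8]
```

The anomaly check fires on the first abnormal day, while the budget alert only fires once cumulative spend crosses a threshold, which is why both are worth having.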

Interviewer Sees: Systematic investigation, root-cause thinking (not just "bug"), creative solutions, governance mindset.

Scenario 4: Design for Regulatory Compliance (Advanced)

Question: Design a healthcare app for US hospitals that must comply with HIPAA and SOC 2 and deliver availability above 99.9%. Data must stay in the US. Budget is generous. How would you architect it?

Ideal Answer:

1. Security-First Approach:
• Encryption: At rest (storage, DB) + in transit (TLS)
• Access: RBAC (doctors see only their patients) + MFA
• Auditing: Every access logged (who, what, when, why)
• Incident response: 24/7 monitoring, alert on suspicious activity
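
The access and auditing bullets combine naturally: every read is checked against the doctor's patient list and logged whether or not it succeeds. This is a minimal sketch under assumed names (`PatientRecords`, `AccessDenied`, the in-memory dictionaries); a real system would rely on the platform's identity and logging services.

```python
from datetime import datetime, timezone


class AccessDenied(Exception):
    pass


class PatientRecords:
    def __init__(self, records, assignments):
        self._records = records          # patient_id -> record
        self._assignments = assignments  # doctor_id -> set of patient_ids
        self.audit_log = []              # treated as append-only

    def read(self, doctor_id, patient_id, reason):
        allowed = patient_id in self._assignments.get(doctor_id, set())
        # Log who, what, when, why for every attempt, including denied ones.
        self.audit_log.append({
            "who": doctor_id,
            "what": patient_id,
            "when": datetime.now(timezone.utc).isoformat(),
            "why": reason,
            "allowed": allowed,
        })
        if not allowed:
            raise AccessDenied(f"{doctor_id} is not assigned to {patient_id}")
        return self._records[patient_id]


store = PatientRecords(
    records={"p1": {"name": "Patient One"}, "p2": {"name": "Patient Two"}},
    assignments={"dr_lee": {"p1"}},
)
print(store.read("dr_lee", "p1", "scheduled check-up"))
try:
    store.read("dr_lee", "p2", "no stated reason")
except AccessDenied as exc:
    print("denied:", exc)
print(len(store.audit_log))  # 2: both attempts were recorded
```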

2. Architecture for HIPAA:
• Network: Private VNet (no public internet exposure), ExpressRoute for on-prem connectivity
• Databases: Always encrypted, automatic immutable backups, SQL DB with RA-GRS backup storage (read-access geo-redundant)
• Storage: Patient data ALWAYS encrypted, audit logs encrypted, cannot be deleted (append-only)
• Managed Identity: Don't use keys in code, use Azure AD identities

3. High Availability (99.9% = ~43 min downtime/month, ~8.8 hours/year):
• Multi-zone deployment (3 zones) for a 99.95% or better SLA (the exact figure depends on the service)
• Backup to separate region (multi-region SQL geo-replication)
• Auto-failover after 30 sec (patients can't wait hours)
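
The downtime budget behind an SLA figure is simple arithmetic: 99.9% allows roughly 43.8 minutes per month, or about 8.76 hours per year. The sketch below computes that, plus the combined availability of components in series. The helper names are illustrative, and the three-component example is an assumption rather than a statement about any particular Azure service.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_minutes(sla_percent, period_hours):
    """Minutes of allowed downtime in a period for a given availability target."""
    return (1 - sla_percent / 100) * period_hours * 60


def composite_availability(*component_percents):
    """Availability of components in series: all of them must be up at once."""
    result = 1.0
    for pct in component_percents:
        result *= pct / 100
    return result * 100


for sla in (99.9, 99.95, 99.99):
    per_year = downtime_minutes(sla, HOURS_PER_YEAR)
    print(f"{sla}%: {per_year / 12:.1f} min/month, {per_year / 60:.2f} h/year")

# Three dependencies in series, each at 99.95% (hypothetical figures).
print(f"{composite_availability(99.95, 99.95, 99.95):.2f}%")
```

Three serial components at 99.95% each come out at about 99.85%, below the 99.9% target, which is why the target has to be checked end to end and not per service.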

4. Compliance Audit Trail:
• Event logging: Every DB access logged to immutable storage
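• Retention: 7 years minimum (HIPAA requires compliance documentation to be kept for 6 years; state record-retention rules can require longer)
• Access reviews: Quarterly, automated (does everyone who has access still need it?)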
• Amendments: Patient requests correction? Log both (never delete medical records)
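
One way to picture "log both, never delete" is an append-only log in which each entry carries a hash of the previous one, so that later tampering is detectable. This is a sketch under assumptions: `AmendmentLog` and its fields are hypothetical, and a real deployment would use immutable storage rather than a Python list.

```python
import hashlib
import json


def _digest(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AmendmentLog:
    """Append-only record of values; corrections are new entries, not edits."""

    def __init__(self):
        self._entries = []

    def append(self, patient_id, field, value, author):
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        body = {"patient_id": patient_id, "field": field, "value": value,
                "author": author, "prev_hash": prev_hash}
        self._entries.append({**body, "hash": _digest(body)})

    def history(self, patient_id, field):
        """All values ever recorded, oldest first (original plus amendments)."""
        return [e["value"] for e in self._entries
                if e["patient_id"] == patient_id and e["field"] == field]

    def current(self, patient_id, field):
        values = self.history(patient_id, field)
        return values[-1] if values else None

    def verify(self):
        """Recompute the hash chain; False means an entry was altered."""
        prev = ""
        for e in self._entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev or _digest(body) != e["hash"]:
                return False
            prev = e["hash"]
        return True


log = AmendmentLog()
log.append("p1", "allergy", "none recorded", author="dr_lee")
log.append("p1", "allergy", "penicillin", author="dr_lee")  # patient-requested correction
print(log.current("p1", "allergy"))   # penicillin
print(log.history("p1", "allergy"))   # ['none recorded', 'penicillin']
print(log.verify())                   # True
```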

5. Cost (Not Highest Priority Here!):
• Premium services necessary (encryption, multi-zone, compliance) = higher cost OK
• Estimate: ~$20k/month (compliance + security > cost optimization)

Success: Passes HIPAA audit, achieves 99.9% SLA, responsive to patient requests, maintains 7-year audit trails.

Scenario 5: Choose Azure Service (Real Complexity)

Question: You need to run 1000 analytics jobs daily, each runs 1-5 hours, unpredictable. Which Azure compute service: VMs (on-demand), Batch, Kubernetes (AKS), or Container Instances? Evaluate pros/cons.

Ideal Answer (Not Simple!):

VMs (on-demand):
✓ Full control, any runtime
✗ Manual provisioning (slow), high cost (pay even if idle), not designed for this

Azure Batch:
✓ Built for batch jobs, auto-scale VMs, managed
✓ Spot instances available (cheap!)
✗ More setup overhead, not as flexible
Verdict: Good if pure batch (no interactive)

Azure Kubernetes (AKS):
✓ Powerful, flexible, scales instantly
✓ Can use spot node pools (often around 70% cheaper or more, but nodes can be evicted mid-job, so long jobs need checkpointing or retries)
✗ Complex (Kubernetes learning curve), overkill for simple jobs
Verdict: If jobs diverse (ML + analytics + streaming), worth it

Container Instances (ACI):
✓ Simplest, serverless (no VMs to manage)
✓ Cheap for short jobs (pay per second!)
✗ Slower startup (~10 sec), not for ultra-quick scaling
Verdict: Best for simple analytics, no cluster overhead

MY RECOMMENDATION:
ACI for the quickest path: each job runs in its own container, so up to 1000 can run in parallel (subject to subscription quotas) with no cluster to manage. For jobs of 1-5 hours, per-second billing is fine and a 10-second startup is acceptable overhead. If you later need more control, migrate to AKS (the containers are already built).
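
A quick sizing check supports the recommendation. The sketch below uses the figures from the question (1000 jobs a day, 1-5 hours each, roughly 10 seconds of startup); the even spread of jobs across the day and `HYPOTHETICAL_RATE` are assumptions, and the rate is a placeholder, not a real Azure price.

```python
HYPOTHETICAL_RATE = 0.05  # USD per container-hour; placeholder, NOT an Azure price


def startup_overhead(startup_seconds, job_hours):
    """Startup time as a fraction of the job's runtime."""
    return startup_seconds / (job_hours * 3600)


def daily_container_hours(jobs, min_hours, max_hours):
    """Lower and upper bound on total container-hours per day."""
    return jobs * min_hours, jobs * max_hours


def average_concurrency(container_hours, hours_in_day=24):
    """Average containers running at once if jobs are spread evenly over the day."""
    return container_hours / hours_in_day


low_hours, high_hours = daily_container_hours(1000, 1, 5)
print(f"Startup overhead: {startup_overhead(10, 1):.2%} (1 h job), "
      f"{startup_overhead(10, 5):.3%} (5 h job)")
print(f"Container-hours/day: {low_hours}-{high_hours}")
print(f"Average concurrency: {average_concurrency(low_hours):.0f}-"
      f"{average_concurrency(high_hours):.0f} containers")
print(f"Daily cost at placeholder rate: ${low_hours * HYPOTHETICAL_RATE:,.0f}-"
      f"${high_hours * HYPOTHETICAL_RATE:,.0f}")
```

Startup is well under 1% of runtime even for the shortest job, so it does not drive the choice. The concurrency estimate matters more, because it is the number to compare against quota limits and it grows quickly if the jobs arrive in bursts.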

Interview Tips

Do's

✓ Ask clarifying questions first (don't assume)
✓ State assumptions clearly ("I'm assuming 100 req/sec, is that right?")
✓ Discuss trade-offs (show you know nothing is perfect)
✓ Deep dive on hard parts (scaling, consistency, cost)
✓ Explain "why" not just "what" (design thinking, not memorization)

Don'ts

✗ Give immediate answer without clarifying scope (everyone fails this)
✗ Over-engineer (don't use all 5 pillars if not needed)
✗ Ignore cost (real architects balance cost + performance + security)
✗ Fail to explain a design choice ("I just put it there")
✗ Dismiss a problem ("That won't happen" - always assume it might)

Summary

Interview structure: Clarify → Design → Handle complexity → Trade-offs → Defend
Common themes: Consistency vs latency, scale challenges, cost awareness, security first, automation
Enterprise thinking: Compliance matters. Migration is hard. Teams need training. Runbooks matter.
Be honest: "I don't know" beats a wrong answer. "Let me think about that" is OK