Hands-on Practical Design

Real-world Architectures

Design & deploy complete architectures for real scenarios: e-commerce platform, SaaS applications, data analytics, and enterprise systems.

Lab 1: E-commerce Platform (99.95% SLA)

Requirements

  • Global users (US, EU, Asia)
  • Millions of transactions daily
  • 99.95% availability SLA
  • Latency <100ms for 95% of users
  • Real-time analytics on transactions
  • Budget: Enterprise (cost not primary driver)
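
To make the 99.95% target concrete, the short calculation below converts an availability percentage into a downtime budget. It is only a sketch: the 30-day month and the function name are assumptions made for illustration, not part of the lab.

```python
def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period of `days` days at the given SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

# 99.95% over an assumed 30-day month
print(round(downtime_budget_minutes(99.95), 1))  # 21.6
```

Roughly 21.6 minutes per month is the whole budget, which is why the design below avoids any single-instance or single-zone dependency.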

Proposed Architecture

GLOBAL LAYER
└── Azure Front Door (geo-routing, DDoS protection)

REGION 1 (East US - Primary)
├── Public IP + Azure Firewall
├── App Service (3 instances, auto-scale)
├── App Insights (monitoring)
├── SQL DB (geo-replicated to EU)
├── Redis Cache (session store)
├── Service Bus (async messaging)
└── Cosmos DB (global replication)

REGION 2 (West EU - HA)
├── App Service (active, handles EU traffic)
├── SQL DB (geo-replica)
└── Redis Cache (replica)

DATA LAYER
├── SQL DB: Orders, products, inventory
├── Cosmos DB: Carts (eventual consistency)
├── Event Hub: Transaction stream
└── Data Lake: Analytics

Key Decisions

  • Front Door: Global load balancer, auto-routes based on latency, DDoS edge protection
  • App Service + zones: Spreads apps across zones (99.95% SLA)
  • SQL DB + geo-replication: Consistency for orders (strong), replicate to EU for read-local
  • Cosmos DB + multi-region: Shopping carts (eventual consistency acceptable), single-digit-millisecond reads from the nearest region
  • Auto-scale: CPU-based scaling for busy periods (Black Friday)
  • Cache (Redis): Session store, prevent DB overload
  • Service Bus: Decouple checkout from inventory (async), prevent timeouts
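
The decoupling idea behind the Service Bus decision can be sketched in a few lines. This is not the Service Bus SDK: a standard-library `queue.Queue` stands in for the queue, and the names `checkout`, `inventory_worker`, and the stock data are hypothetical. The point is only that checkout returns as soon as the message is enqueued, while inventory is updated by a separate worker.

```python
import queue
import threading

orders = queue.Queue()          # in-process stand-in for a Service Bus queue
inventory = {"sku-1": 5}        # hypothetical stock levels
processed = []

def checkout(order_id: str, sku: str, qty: int) -> str:
    """Accept the order and return immediately; inventory is updated later."""
    orders.put({"order_id": order_id, "sku": sku, "qty": qty})
    return "accepted"

def inventory_worker() -> None:
    """Drain the queue and apply stock changes independently of checkout."""
    while True:
        msg = orders.get()
        if msg is None:          # sentinel: stop the worker
            orders.task_done()
            break
        inventory[msg["sku"]] -= msg["qty"]
        processed.append(msg["order_id"])
        orders.task_done()

worker = threading.Thread(target=inventory_worker)
worker.start()
status = checkout("o-1001", "sku-1", 2)
checkout("o-1002", "sku-1", 1)
orders.put(None)                 # tell the worker to finish
worker.join()
print(status, inventory, processed)
```

If the inventory worker is slow or briefly down, checkout still succeeds; the messages simply wait in the queue.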

Failure Scenarios

  • App instance fails (1 of 3): LB routes to healthy instances, no downtime
  • Zone fails (entire zone): Across 3 zones, max 1/3 capacity lost. Auto-scale adds instances
  • Primary region fails (entire East US): Front Door routes all traffic to EU. The SQL geo-replica is promoted to primary. Data lag <5 seconds
  • Cosmos DB partition fails: Data is replicated across regions, so requests fall back to a replica in another region
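
The region-failover scenario can be illustrated with a simplified routing rule: send each request to the healthy origin with the lowest latency. This is a sketch of the idea, not Front Door's actual algorithm or API; the `Origin` class, the `pick_origin` function, and the latency figures are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Origin:
    name: str
    latency_ms: float
    healthy: bool = True

def pick_origin(origins: list[Origin]) -> Origin:
    """Return the healthy origin with the lowest measured latency."""
    candidates = [o for o in origins if o.healthy]
    if not candidates:
        raise RuntimeError("no healthy origin available")
    return min(candidates, key=lambda o: o.latency_ms)

east_us = Origin("east-us", latency_ms=20)   # hypothetical latencies for a US client
west_eu = Origin("west-eu", latency_ms=95)
origins = [east_us, west_eu]

before = pick_origin(origins).name   # east-us is closest while healthy
east_us.healthy = False              # simulate a full East US outage
after = pick_origin(origins).name    # traffic moves to west-eu
print(before, after)
```

Users stay online after the outage, but the US client now sees the higher West EU latency, which is the trade-off this failover accepts.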

Cost Estimate

  • App Service (auto-scale 3-10): ~$2k/month
  • SQL DB (Premium): ~$1.5k/month
  • Cosmos DB (400 RU): ~$800/month
  • Front Door: ~$500/month
  • Data egress (inter-region): ~$600/month
  • Total: ~$5.4k/month
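
As a quick check, the line items above can be summed and broken down by share. The figures are the lab's own rough estimates; only the dictionary layout is added here.

```python
monthly_costs_usd = {
    "App Service (auto-scale 3-10)": 2000,
    "SQL DB (Premium)": 1500,
    "Cosmos DB (400 RU)": 800,
    "Front Door": 500,
    "Data egress (inter-region)": 600,
}

total = sum(monthly_costs_usd.values())
# Percentage of the total bill per component, rounded to whole percent
share = {name: round(100 * cost / total) for name, cost in monthly_costs_usd.items()}

print(total)  # 5400
for name, pct in share.items():
    print(f"{name}: {pct}%")
```

Compute and the relational database together account for about two thirds of the estimate, so those are the first places to look when tuning cost.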

Lab 2: SaaS Analytics Platform

Requirements

  • Multi-tenant SaaS (1000s of customers)
  • Real-time dashboards & reports
  • Data isolation (customer A can't see B's data)
  • 99.9% SLA
  • Cost-sensitive (small and medium customers are price-sensitive)

Proposed Architecture

IDENTITY LAYER
└── Azure AD B2C (customer login, single sign-on)

API GATEWAY
└── API Management (rate limiting, auth, billing)

APPLICATION TIER
└── Container Instances (auto-scale, 5-50 replicas)

DATABASE TIER (Per-Tenant Isolation)
├── Option 1: Separate DB per customer (max isolation)
├── Option 2: Shared DB + row-level security (cost savings)
└── Choose mix: Enterprise = separate, SMB = shared

REAL-TIME ANALYTICS
├── Stream Analytics (ingests logs 24/7)
├── Event Hub (1000s events/sec)
├── Power BI Embedded (personalized dashboards)
└── Data Lake (query history)

COST OPTIMIZATION
├── Spot containers for non-critical tasks
├── Auto-scale down at night (off-peak)
└── Reserved capacity for baseline

Key Decisions

  • Multi-tenant isolation: Separate DBs for enterprise (high isolation/cost), shared for SMB
  • Row-level security (RLS): If shared DB, enforce RLS at database level (customer sees only their rows)
  • Container instances: Cheap, stateless, scales fast. Good for SaaS
  • API Management: Rate limiting per tenant (premium = higher limit), billing integration
  • Power BI Embedded: Personalized dashboards per customer (branding, custom metrics)
  • Spot containers: Run analytics/batch jobs on spot capacity (up to 70% cheaper); jobs must tolerate eviction
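
Per-tenant rate limiting can be sketched as a fixed-window counter keyed by tenant. In the lab this would be configured in API Management rather than hand-written; the class below only shows the logic, and the plan names and limits are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-minute limits by plan; real values would come from billing.
LIMITS_PER_MINUTE = {"premium": 5, "standard": 2}

class TenantRateLimiter:
    """Fixed-window counter: at most N calls per tenant per 60-second window."""

    def __init__(self, plans: dict[str, str]) -> None:
        self.plans = plans                  # tenant id -> plan name
        self.counts = defaultdict(int)      # (tenant, window) -> calls seen

    def allow(self, tenant: str, now_seconds: float) -> bool:
        window = int(now_seconds // 60)
        key = (tenant, window)
        if self.counts[key] >= LIMITS_PER_MINUTE[self.plans[tenant]]:
            return False
        self.counts[key] += 1
        return True

limiter = TenantRateLimiter({"acme": "premium", "smallco": "standard"})
smallco = [limiter.allow("smallco", t) for t in (0, 1, 2)]   # third call is rejected
acme = [limiter.allow("acme", t) for t in (0, 1, 2)]         # all within the premium limit
next_window = limiter.allow("smallco", 61)                   # counter resets each minute
print(smallco, acme, next_window)
```

Timestamps are passed in explicitly so the behaviour is easy to test; a real gateway would use its own clock.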

Scaling Pattern

  • Customer onboarding: New customer → provisioned in shared DB (or new isolated if enterprise) → dashboard created in Power BI
  • Load growth: As a customer's volume grows, it can be migrated to an isolated DB (zero downtime, using replication)
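
The onboarding and migration flow implies a lookup that maps each tenant to its database. A minimal sketch follows, assuming a hypothetical `TenantCatalog` class and made-up database names; the actual data copy and replication are out of scope here.

```python
SHARED_DB = "shared-db"   # hypothetical database name

class TenantCatalog:
    """Maps each tenant to the database that holds its data."""

    def __init__(self) -> None:
        self.placement: dict[str, str] = {}

    def onboard(self, tenant: str, enterprise: bool = False) -> str:
        """Place enterprise tenants in their own DB, everyone else in the shared DB."""
        db = f"db-{tenant}" if enterprise else SHARED_DB
        self.placement[tenant] = db
        return db

    def migrate_to_isolated(self, tenant: str) -> str:
        """Repoint the tenant once its data has been replicated (copy not shown)."""
        self.placement[tenant] = f"db-{tenant}"
        return self.placement[tenant]

    def resolve(self, tenant: str) -> str:
        return self.placement[tenant]

catalog = TenantCatalog()
catalog.onboard("smallco")
catalog.onboard("bigcorp", enterprise=True)
before = catalog.resolve("smallco")
catalog.migrate_to_isolated("smallco")
after = catalog.resolve("smallco")
print(before, after, catalog.resolve("bigcorp"))
```

Because the application always resolves the database through the catalog, moving a tenant is a single pointer change once replication has caught up.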

Lab 3: Enterprise Data Lake (Batch + Real-time)

Requirements

  • Ingest: Batch (daily uploads) + Real-time (logs, IoT, APIs)
  • Governance: Multi-team access, PII masking, audit trails
  • Analytics: SQL queries + Spark notebooks

Proposed Architecture

INGESTION
├── Batch: Data Factory (daily ETL jobs)
├── Real-time: Event Hub → Stream Analytics → ADLS
└── API: Logic Apps (webhook triggers)

DATA LAKE (Raw → Processed → Curated)
├── Raw layer: ADLS container (immutable, audit logged)
├── Processed layer: Delta tables (versioned, ACID)
└── Curated layer: Optimized for analytics

GOVERNANCE
├── Purview (data catalog, lineage, PII detection)
├── Synapse access control (who sees what)
└── Audit logging (track data access)

ANALYTICS
├── Synapse SQL (distributed SQL queries)
├── Spark (ML, complex ETL)
└── Power BI (dashboards)

Key Decisions

  • Data Factory: Schedules daily ETL pipelines, handles partial failures with retry
  • Delta Lake: ACID transactions on data lake (prevents corruption, enables time travel)
  • Purview: Automatic PII detection, masking rules per team (Finance sees customer IDs, Marketing sees masked)
  • Synapse: Distributed SQL (query TBs fast), + Spark for ML
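
The per-team masking rule (Finance sees customer IDs, Marketing sees them masked) can be expressed as a small policy table. This is not Purview's API; the policy dictionary, column names, and masking format are assumptions chosen to show the logic.

```python
# Hypothetical policy: which teams may see which PII columns unmasked.
UNMASKED_COLUMNS = {
    "finance": {"customer_id"},
    "marketing": set(),
}
PII_COLUMNS = {"customer_id", "email"}

def mask(value: str) -> str:
    """Keep the last two characters and replace the rest with '*'."""
    return "*" * max(len(value) - 2, 0) + value[-2:]

def view_for(team: str, row: dict[str, str]) -> dict[str, str]:
    """Return a copy of the row with PII masked unless the team may see it."""
    allowed = UNMASKED_COLUMNS.get(team, set())
    return {
        col: val if col not in PII_COLUMNS or col in allowed else mask(val)
        for col, val in row.items()
    }

row = {"customer_id": "C-10042", "email": "a@example.com", "region": "EU"}
print(view_for("finance", row))
print(view_for("marketing", row))
```

Unknown teams get the most restrictive view by default, since they have no entry in the policy table.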

Best Practices Summary

Design Patterns Found in Real Architectures

  • Layering: Public → Apps → Data (separation of concerns)
  • Async patterns: Use Service Bus/Event Hub to decouple components (prevent timeouts)
  • Caching layer: Redis for hot data (repeated reads are served from memory instead of hitting the database)
  • Multi-tenancy consideration: Decide between an isolated DB per customer and a shared DB + RLS (cost-isolation tradeoff)
  • Monitoring everywhere: Application Insights + Azure Monitor in every design (observability = debugging faster)
  • Cost consciousness: Auto-scale, spot VMs, tiered storage (real architects optimize cost)
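
The caching pattern is commonly implemented as cache-aside: read from the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses a plain dictionary as a stand-in for Redis, and the product data is hypothetical.

```python
db_reads = 0
database = {"product:1": {"name": "Widget", "price": 9.99}}  # hypothetical data
cache: dict[str, dict] = {}                                   # stand-in for Redis

def load_from_db(key: str) -> dict:
    """Simulate a database read and count how often it happens."""
    global db_reads
    db_reads += 1
    return database[key]

def get(key: str) -> dict:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    if key in cache:
        return cache[key]
    value = load_from_db(key)
    cache[key] = value
    return value

for _ in range(1000):
    get("product:1")
print(db_reads)  # 1
```

A thousand reads reach the database once. A real deployment would also need expiry and invalidation on writes, which are left out here.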

Summary

Your Turn: Design Challenge

Scenario: A logistics company needs to track shipments globally. Drivers upload location + package info every 30 seconds. Dashboard shows routes + delivery ETAs. 500,000 drivers. Must survive region failure.

Design this architecture on paper (30 min). Pick components for ingestion (hint: 500k devices call for Event Hub rather than a plain API), storage (a time-series DB?), real-time processing, and the dashboard. Then compare with classmates!
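
Before picking components, it helps to size the ingest load from the scenario's numbers. The calculation below assumes uploads are spread evenly over time, which real traffic will not quite match.

```python
DRIVERS = 500_000
INTERVAL_SECONDS = 30

events_per_second = DRIVERS / INTERVAL_SECONDS
events_per_day = DRIVERS * (24 * 60 * 60 // INTERVAL_SECONDS)

print(round(events_per_second), events_per_day)  # 16667 1440000000
```

Roughly 16,700 events per second, or about 1.44 billion per day, is the sustained rate your ingestion and storage choices need to handle.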