Cost Optimization

Master Azure cost strategies: right-sizing, reserved instances, spot VMs, auto-scaling, and financial architecture decisions.

🧠 ELI5 Explanation

Think of renting a car. Paying by the day costs $100/day. A monthly pass drops that to $60/day, but you commit to 30 days. Cheapest of all is carpooling on weekends and renting only on the days you need a car. Cloud pricing works the same way: pay-as-you-go is the most expensive, reserved instances are cheaper (commit to 1 or 3 years), and spot instances are the cheapest (use them when capacity is available).

Cost Structure in Azure

Common Cost Components

  • Compute: VMs, app service, container instances (usually largest cost)
  • Storage: Blobs, disks, databases (growing with data)
  • Networking: Bandwidth, VPN, ExpressRoute (small unless multi-region)
  • Services: Databases, backup, monitoring, AI/ML (varies)

Pricing Models

Model | Cost/Month (example) | Commitment | When to Use
------|----------------------|------------|------------
Pay-As-You-Go | $1000 | None (hourly) | Unpredictable workloads, testing, short-term
1-Year RI | $700 (30% discount) | 1 year prepay or monthly | Predictable, stable workloads (production)
3-Year RI | $550 (45% discount) | 3 year prepay | Long-term, high confidence, mission-critical
Spot VM | $100-300 (70-90% off) | None (can be evicted) | Batch jobs, non-critical workloads, tolerance for interruption
Savings Plan | $650 (35% discount) | 1 or 3 year | Flexibility across VM sizes/families, predictable compute spend
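To make the comparison concrete, here is a minimal sketch that applies the example discounts from the table to an on-demand monthly cost. The discount rates are the illustrative figures above, not Azure list prices, and the function and dictionary names are hypothetical.

```python
# Example discounts taken from the pricing table above (illustrative, not list prices).
EXAMPLE_DISCOUNTS = {
    "pay_as_you_go": 0.00,
    "reserved_1yr": 0.30,
    "reserved_3yr": 0.45,
    "savings_plan": 0.35,
}


def monthly_cost(on_demand_monthly: float, model: str) -> float:
    """Monthly cost after applying the example discount for a pricing model."""
    discount = EXAMPLE_DISCOUNTS[model]
    return round(on_demand_monthly * (1 - discount), 2)


def spot_cost_range(on_demand_monthly: float,
                    low_discount: float = 0.70,
                    high_discount: float = 0.90) -> tuple:
    """(best case, worst case) monthly cost for a spot VM at 70-90% off."""
    best = round(on_demand_monthly * (1 - high_discount), 2)
    worst = round(on_demand_monthly * (1 - low_discount), 2)
    return best, worst


if __name__ == "__main__":
    for model in EXAMPLE_DISCOUNTS:
        print(model, monthly_cost(1000, model))
    print("spot", spot_cost_range(1000))
```

With a $1000 on-demand baseline this reproduces the table's figures, which makes it easy to swap in your own baseline.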

Cost Optimization Strategies

Strategy 1: Right-Sizing (Quick Win)

Problem: Over-provisioned VMs (picked too large, using only 20% capacity).

Solution: Analyze usage, downsize VMs to appropriate tier.

  • Use Azure Advisor recommendations (identifies over-provisioned VMs)
  • Monitor CPU, memory, disk (30 days of metrics)
  • Example: D4 VM 20% utilized → downsize to B2 (50-70% cost reduction)
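The review step above can be sketched as a simple filter over exported CPU metrics. This is an assumption-laden illustration: the 30% average and 60% peak thresholds, the VM names, and the input shape are hypothetical, and real right-sizing should also look at memory and disk, as the list notes.

```python
from statistics import mean


def rightsizing_candidates(cpu_samples_by_vm: dict,
                           avg_threshold_pct: float = 30.0,
                           peak_threshold_pct: float = 60.0) -> list:
    """Return VM names whose CPU stays low on average AND at peak.

    cpu_samples_by_vm maps a VM name to a list of CPU-percent samples
    (for example, 30 days of hourly averages). Thresholds are assumptions.
    """
    candidates = []
    for vm_name, samples in cpu_samples_by_vm.items():
        if not samples:
            continue  # no data: do not recommend anything
        if mean(samples) < avg_threshold_pct and max(samples) < peak_threshold_pct:
            candidates.append(vm_name)
    return sorted(candidates)


if __name__ == "__main__":
    metrics = {
        "web-1": [20, 22, 18],    # consistently idle: candidate
        "db-1": [65, 70, 80],     # busy: keep
        "batch-1": [10, 15, 90],  # low average but high peak: keep
    }
    print(rightsizing_candidates(metrics))
```

Checking the peak as well as the average avoids downsizing a VM that is idle most of the time but needs its full size during bursts.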

Strategy 2: Reserved Instances (Most Common)

Problem: Paying full price for predictable workloads.

Solution: Buy reserved instances for expected capacity, save 30-45%.

  • Analyze 30-90 day trend: if stable utilization, buy RI
  • Upfront payment option (one payment for the whole term, less cash-flow flexibility)
  • Monthly payment option (easier to budget; Azure charges the same total as upfront)
  • Example: Production app needs 3 VMs always → buy 3-year RI, save $4,320/year
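Whether a reservation pays off depends on how much of the time the VM actually runs. A rough sketch of the break-even reasoning, assuming the reservation is paid for the full term regardless of use; the hourly rate in the example is hypothetical.

```python
HOURS_PER_YEAR = 8760


def breakeven_utilization(discount: float) -> float:
    """Fraction of the term a VM must run for the reservation to match pay-as-you-go."""
    return 1.0 - discount


def yearly_costs(hourly_rate: float, discount: float, utilization: float) -> tuple:
    """Return (pay_as_you_go_cost, reserved_cost) for one year.

    Pay-as-you-go is billed only for the hours used; the reservation is
    billed for every hour at the discounted rate.
    """
    pay_as_you_go = hourly_rate * HOURS_PER_YEAR * utilization
    reserved = hourly_rate * (1.0 - discount) * HOURS_PER_YEAR
    return pay_as_you_go, reserved


def reservation_is_cheaper(discount: float, utilization: float) -> bool:
    """True when expected utilization is above the break-even point."""
    return utilization > breakeven_utilization(discount)


if __name__ == "__main__":
    # Hypothetical $0.20/hour VM, 30% discount, running 24/7.
    print(yearly_costs(0.20, 0.30, 1.0))
    print(reservation_is_cheaper(0.30, 0.5))  # runs half the time
```

With a 30% discount the VM has to run more than about 70% of the time for the reservation to win, which is why the trend analysis in the first bullet matters.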

Strategy 3: Spot VMs (Aggressive)

Problem: Running batch jobs, analytics, or non-critical workloads at full price.

Solution: Use spot VMs (Azure excess capacity) at 70-90% discount.

  • Risk: Can be evicted when Azure needs capacity (receive 30-sec notice)
  • Best for: Fault-tolerant workloads (batch processing, Kubernetes workers, ML training)
  • Example: Batch job normally $1000/month on VMs → use spot for $100-300
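A fault-tolerant worker needs to notice the eviction warning and checkpoint before it is cut off. The sketch below only parses a Scheduled Events document and looks for a `Preempt` event that names this VM. The field names follow the Azure Instance Metadata Service Scheduled Events format as I understand it, so verify them against the current documentation. The sample payload and VM name are invented, and fetching the document from the metadata endpoint is deliberately left out.

```python
import json


def pending_preemptions(scheduled_events_json: str, vm_name: str) -> list:
    """Return the IDs of Preempt events that target vm_name.

    scheduled_events_json is the raw JSON text of a Scheduled Events
    document; an empty list means no eviction is currently announced.
    """
    document = json.loads(scheduled_events_json)
    event_ids = []
    for event in document.get("Events", []):
        is_preempt = event.get("EventType") == "Preempt"
        targets_this_vm = vm_name in event.get("Resources", [])
        if is_preempt and targets_this_vm:
            event_ids.append(event.get("EventId"))
    return event_ids


# Invented sample payload for illustration only.
SAMPLE_EVENTS = json.dumps({
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "example-event-id",
            "EventType": "Preempt",
            "ResourceType": "VirtualMachine",
            "Resources": ["batch-worker-0"],
            "EventStatus": "Scheduled",
        }
    ],
})

if __name__ == "__main__":
    if pending_preemptions(SAMPLE_EVENTS, "batch-worker-0"):
        print("Eviction announced: checkpoint and stop taking new work")
```

Because the notice is short, the worker should poll frequently and keep checkpoints small enough to write within that window.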

Strategy 4: Auto-Scaling (Efficiency)

Problem: Running full capacity 24/7 even during low-traffic periods.

Solution: Auto-scale down during off-peak, up during peak.

  • Scale on metrics: CPU >70% → add VMs, CPU <30% → remove VMs
  • Schedule-based: Reduce instances nights/weekends (predictable patterns)
  • Example: Web app runs 10 VMs by day and 2 VMs at night (80% fewer VM-hours during night hours)
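The metric rule and the day/night schedule above can be sketched as two small functions. The one-instance step size, the 2-10 instance bounds, and the 12-hour night are assumptions chosen for illustration; a real autoscale rule would also use cooldown periods.

```python
def desired_instances(current: int,
                      avg_cpu_pct: float,
                      min_instances: int = 2,
                      max_instances: int = 10,
                      scale_out_above: float = 70.0,
                      scale_in_below: float = 30.0) -> int:
    """Apply the CPU >70% add / CPU <30% remove rule, one instance at a time."""
    if avg_cpu_pct > scale_out_above:
        target = current + 1
    elif avg_cpu_pct < scale_in_below:
        target = current - 1
    else:
        target = current
    # Never go below the floor or above the ceiling.
    return max(min_instances, min(max_instances, target))


def daily_vm_hours(day_instances: int, night_instances: int, night_hours: int = 12) -> int:
    """VM-hours per day for a simple two-period schedule."""
    return day_instances * (24 - night_hours) + night_instances * night_hours


if __name__ == "__main__":
    print(desired_instances(4, 85.0))   # busy: scale out
    print(desired_instances(4, 20.0))   # idle: scale in
    scheduled = daily_vm_hours(10, 2)
    always_on = daily_vm_hours(10, 10)
    print(scheduled, always_on, 1 - scheduled / always_on)
```

For the 10-by-day, 2-by-night example with a 12-hour night, the schedule uses 144 VM-hours instead of 240, a 40% reduction over the whole day.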

Strategy 5: Storage Optimization

Problem: Not tiering data by access patterns or keeping unnecessary data.

Solution: Use storage tiers, delete unused data.

  • Hot: Frequently accessed, highest storage cost ($0.0184/GB/month; example prices, actual rates vary by region)
  • Cool: Infrequently accessed (about monthly), 50% cheaper ($0.0092/GB/month)
  • Archive: Rarely accessed (about yearly), about 90% cheaper ($0.00198/GB/month)
  • Lifecycle policy: Move data to cool after 30 days, to archive after 90 days
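As a rough illustration of the 30/90-day lifecycle rule and the per-GB prices listed above, the sketch below picks a tier from a blob's age and totals a monthly storage bill. The prices are the example figures from this section, and the model ignores access and retrieval charges, which matter for cool and archive data.

```python
# Example per-GB monthly prices from the list above (actual rates vary by region).
EXAMPLE_PRICE_PER_GB = {
    "hot": 0.0184,
    "cool": 0.0092,
    "archive": 0.00198,
}


def tier_for_age(days_since_last_change: int) -> str:
    """Tier under a rule of cool after 30 days, archive after 90 days."""
    if days_since_last_change > 90:
        return "archive"
    if days_since_last_change > 30:
        return "cool"
    return "hot"


def monthly_storage_cost(gb_by_tier: dict) -> float:
    """Storage-only monthly cost; excludes transaction and retrieval fees."""
    total = sum(EXAMPLE_PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())
    return round(total, 2)


if __name__ == "__main__":
    print(tier_for_age(10), tier_for_age(45), tier_for_age(120))
    print(monthly_storage_cost({"hot": 1000}))
    print(monthly_storage_cost({"hot": 100, "cool": 300, "archive": 600}))
```

Comparing the all-hot bill with the tiered one for the same 1000 GB shows where the savings come from: most of the data sits in the cheaper tiers.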

Cost Tracking & Governance

Tools

  • Azure Cost Management: Track spending by resource, subscription, tag
  • Azure Advisor: Automated recommendations (right-size, reservations, etc.)
  • Tags: Label resources with cost-center, project (enables cost allocation)
  • Budgets & Alerts: Set budget thresholds and email alerts when spending crosses them (a budget notifies; it does not cap spending by itself)
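Tag-based allocation and budget alerts both reduce to simple aggregation over cost records. A minimal sketch, assuming cost data has been exported as a list of records with a `cost` value and a `tags` dictionary; that record shape, the tag key, and the 50/80/100% thresholds are hypothetical choices, not a Cost Management API.

```python
from collections import defaultdict


def cost_by_tag(records: list, tag_key: str) -> dict:
    """Sum cost per value of tag_key; untagged resources are grouped together."""
    totals = defaultdict(float)
    for record in records:
        tag_value = record.get("tags", {}).get(tag_key, "untagged")
        totals[tag_value] += record["cost"]
    return dict(totals)


def triggered_alerts(spend: float, budget: float,
                     thresholds: tuple = (0.5, 0.8, 1.0)) -> list:
    """Return the budget thresholds (as fractions) that current spend has crossed."""
    return [t for t in thresholds if spend >= budget * t]


if __name__ == "__main__":
    records = [
        {"name": "vm-web", "cost": 400.0, "tags": {"cost-center": "web"}},
        {"name": "vm-api", "cost": 250.0, "tags": {"cost-center": "web"}},
        {"name": "sql-db", "cost": 150.0, "tags": {"cost-center": "data"}},
        {"name": "old-disk", "cost": 50.0, "tags": {}},
    ]
    print(cost_by_tag(records, "cost-center"))
    print(triggered_alerts(spend=850.0, budget=1000.0))
```

The "untagged" bucket is worth watching: a growing untagged total usually means tagging policy is not being enforced.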

FinOps Culture

Treat cloud cost like an engineering problem:

  • Automate cleanup (delete unused resources nightly)
  • Enable chargeback (teams see their cost impact)
  • Regular reviews (monthly cost audit)
  • Set guardrails (policies enforce cost discipline)
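The automated-cleanup item can start as a report rather than a delete job. The sketch below lists resources that have been idle longer than a cutoff and are not explicitly protected. The 14-day cutoff, the `keep` tag convention, and the inventory record shape are all assumptions for illustration.

```python
from datetime import date, timedelta


def cleanup_candidates(resources: list, today: date, max_idle_days: int = 14) -> list:
    """Return names of resources idle longer than max_idle_days and not tagged keep=true."""
    cutoff = today - timedelta(days=max_idle_days)
    names = []
    for resource in resources:
        protected = resource.get("tags", {}).get("keep") == "true"
        if resource["last_used"] < cutoff and not protected:
            names.append(resource["name"])
    return sorted(names)


if __name__ == "__main__":
    inventory = [
        {"name": "vm-old-test", "last_used": date(2024, 1, 1), "tags": {}},
        {"name": "vm-prod", "last_used": date(2024, 3, 1), "tags": {}},
        {"name": "snapshot-audit", "last_used": date(2023, 6, 1), "tags": {"keep": "true"}},
    ]
    print(cleanup_candidates(inventory, today=date(2024, 3, 5)))
```

Running this nightly and sending the list to resource owners before deleting anything is a safer first step than deleting automatically.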

Real-world Example: Cost Reduction Project

Baseline (Month 1):
• 20 VMs D4 (over-provisioned, avg 25% CPU) = $8,000/month
• No spot VMs for batch jobs = $2,000/month
• Storage all hot tier = $1,000/month
• Total = $11,000/month

Actions & Savings:
1. Right-size to 14 B2 VMs (30% fewer instances) = $4,200/month (47% savings)
2. Run batch on 6 spot VMs = $400/month (vs $1,200 on-demand, 67% savings)
3. Implement auto-scaling (scale down to 8 VMs at night) = $2,800/month on compute (about 33% savings)
4. Move data to cool/archive tiers = $300/month on storage (70% savings)
5. Buy 1-year RIs for the stable 8 VMs = $1,800/month (40% discount on 8 VMs)

Total After Optimization: $5,500/month
Monthly Savings: $5,500 (50% reduction!)
Annual Savings: $66,000
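The headline figures follow directly from the baseline and optimized totals. A small sketch of that arithmetic, using only the totals stated in this example; the function and key names are hypothetical.

```python
# Baseline components from Month 1 of the example.
BASELINE = {"vms": 8000, "batch": 2000, "storage": 1000}
OPTIMIZED_TOTAL = 5500  # total after optimization, as stated in the example


def savings_summary(baseline_total: float, optimized_total: float) -> dict:
    """Monthly savings, percentage reduction, and annualized savings."""
    monthly = baseline_total - optimized_total
    return {
        "monthly_savings": monthly,
        "percent_reduction": round(100 * monthly / baseline_total, 1),
        "annual_savings": monthly * 12,
    }


if __name__ == "__main__":
    baseline_total = sum(BASELINE.values())
    print(baseline_total)
    print(savings_summary(baseline_total, OPTIMIZED_TOTAL))
```

Keeping this calculation in a script makes it easy to rerun each month with actual billing totals instead of estimates.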

Tradeoff: RI commitment (1 year), spot VM eviction risk, monitoring overhead. Worth it.

Cost vs Availability Trade-off

Decision Framework

  • Mission-critical: Prioritize availability → multi-region, HA → higher cost
  • Standard: Balance cost & availability → reservation + auto-scale
  • Dev/Test: Prioritize cost → pay-as-you-go, minimal redundancy, auto-shutdown

Summary

Interview Questions

Q: Explain reserved instances. Why buy them if on-demand is more flexible?
A: RIs give a 30-45% discount in exchange for a 1- or 3-year commitment. If the workload is stable and predictable, the savings far outweigh the loss of flexibility. Example: 3 always-on production VMs on a 3-year RI, saving $4,320/year, come to roughly $12,960 over the term. Risk: if the workload disappears, the commitment becomes a sunk cost, but most enterprises use RIs for core stable workloads.
Q: Your team's cloud bill jumped 40% last month. How would you investigate?
A: 1) Check the Cost Management dashboard for cost drivers (which resources increased?). 2) Use Advisor (over-provisioned VMs?). 3) Check for zombie resources (unused VMs, old snapshots). 4) See whether new deployments were added. 5) Review for misconfiguration (data exfiltration, runaway scaling). 6) Check whether RIs expired and usage fell back to pay-as-you-go.
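The first step of that investigation, finding which resources increased, can be sketched as a month-over-month diff. The input is assumed to be two dictionaries of cost per resource (for example, built from exported billing data); the resource names and figures are invented.

```python
def top_cost_increases(previous: dict, current: dict, limit: int = 5) -> list:
    """Return (resource, increase) pairs with the largest month-over-month growth.

    Resources missing from a month are treated as costing 0 that month,
    so new deployments show up as pure increases.
    """
    names = set(previous) | set(current)
    increases = []
    for name in names:
        delta = current.get(name, 0) - previous.get(name, 0)
        if delta > 0:
            increases.append((name, delta))
    # Largest increase first; ties broken by name for stable output.
    increases.sort(key=lambda pair: (-pair[1], pair[0]))
    return increases[:limit]


if __name__ == "__main__":
    last_month = {"vm-web": 800, "sql-db": 300, "storage": 100}
    this_month = {"vm-web": 820, "sql-db": 300, "storage": 260, "vm-new-batch": 400}
    print(top_cost_increases(last_month, this_month))
```

Sorting by absolute increase rather than percentage keeps attention on the resources that explain most of the jump.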
Q: Design a cost-optimized architecture for a startup with $5k/month cloud budget and 99.9% SLA requirement.
A: Single region (no multi-region, to save cost). Use availability zones for the 99.9% SLA instead of active-active. Reserved instances for baseline capacity (predictable). Auto-scale for peak, using spot VMs for fault-tolerant batch work. Cool/archive storage. Result: a 99.9% SLA is achievable at $5k/month without breaking the bank.