Cost Optimization
Master Azure cost strategies: right-sizing, reserved instances, spot VMs, auto-scaling, and financial architecture decisions.
🧠 ELI5 Explanation
Imagine renting a car. Paying by the day costs $100/day. A monthly pass drops that to $60/day if you commit to 30 days. The smartest deal is to carpool on weekends and rent only on the days you actually need a car. Cloud is similar: pay-as-you-go is the most expensive, reserved instances are cheaper (commit for 1 or 3 years), and spot instances are cheapest (usable only when spare capacity is available).
Cost Structure in Azure
Common Cost Components
- Compute: VMs, App Service, Container Instances (usually the largest cost)
- Storage: Blobs, disks, databases (growing with data)
- Networking: Bandwidth, VPN, ExpressRoute (small unless multi-region)
- Services: Databases, backup, monitoring, AI/ML (varies)
Pricing Models
- Pay-as-you-go: No commitment, highest unit price
- Reserved instances: 1- or 3-year commitment, 30-45% discount
- Spot VMs: Evictable excess capacity, 70-90% discount
Cost Optimization Strategies
Strategy 1: Right-Sizing (Quick Win)
Problem: Over-provisioned VMs (picked too large, using only 20% capacity).
Solution: Analyze usage, downsize VMs to appropriate tier.
- Use Azure Advisor recommendations (identifies over-provisioned VMs)
- Monitor CPU, memory, disk (30 days of metrics)
- Example: D4 VM at 20% utilization → downsize to B2 (50-70% cost reduction); B-series VMs are burstable, so confirm the workload does not need sustained high CPU
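The analysis step above can be sketched in a few lines of Python. This is a minimal illustration, not Azure Advisor's actual logic: the function name, the input shape (a dict of CPU percentage samples per VM), and the 30%/60% thresholds are all assumptions chosen for the example.

```python
from statistics import mean


def rightsizing_candidates(cpu_by_vm, avg_threshold=30.0, peak_threshold=60.0):
    """Return names of VMs whose average AND peak CPU both stay low.

    cpu_by_vm: hypothetical mapping of VM name -> list of CPU % samples
    (for example, 30 days of hourly averages exported from monitoring).
    """
    candidates = []
    for name, samples in cpu_by_vm.items():
        if not samples:
            continue  # no data: do not guess
        if mean(samples) < avg_threshold and max(samples) < peak_threshold:
            candidates.append(name)
    return sorted(candidates)


metrics = {
    "web-01": [18, 22, 25, 20],  # low and flat: downsize candidate
    "web-02": [15, 10, 85, 8],   # low on average but spiky: keep
    "db-01": [65, 70, 72, 68],   # busy: keep
}
print(rightsizing_candidates(metrics))
```

Checking the peak as well as the average keeps spiky workloads off the list; memory and disk metrics would need the same treatment before an actual resize.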
Strategy 2: Reserved Instances (Most Common)
Problem: Paying full price for predictable workloads.
Solution: Buy reserved instances for expected capacity, save 30-45%.
- Analyze 30-90 day trend: if stable utilization, buy RI
- Upfront payment option (one payment for the whole term)
- Monthly payment option (easier to budget; Azure charges the same total for a reservation whether it is paid upfront or monthly)
- Example: Production app needs 3 VMs always → buy 3-year RI, save $4,320/year
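Whether a reservation pays off depends on how many hours the VM actually runs. The sketch below compares the two bills under simple assumptions: a flat hourly rate, 730 hours per month, and a reservation billed for every hour whether used or not. The rate and discount values are illustrative, not Azure prices.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def monthly_costs(hourly_rate, utilization, ri_discount):
    """Return (pay_as_you_go, reserved) monthly cost for one VM.

    utilization: fraction of the month the VM runs (0.0 to 1.0).
    ri_discount: reservation discount as a fraction (0.40 = 40%).
    """
    payg = hourly_rate * HOURS_PER_MONTH * utilization
    reserved = hourly_rate * HOURS_PER_MONTH * (1 - ri_discount)
    return payg, reserved


def break_even_utilization(ri_discount):
    """Utilization above which the reservation is the cheaper option."""
    return 1 - ri_discount


payg, reserved = monthly_costs(hourly_rate=0.20, utilization=1.0, ri_discount=0.40)
print(f"always-on: pay-as-you-go ${payg:.2f}, reserved ${reserved:.2f}")
print(f"break-even utilization: {break_even_utilization(0.40):.0%}")
```

With a 40% discount, the reservation only wins if the VM runs more than 60% of the time, which is why the 30-90 day utilization trend matters before buying.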
Strategy 3: Spot VMs (Aggressive)
Problem: Running batch jobs, analytics, or non-critical workloads at full price.
Solution: Use spot VMs (Azure excess capacity) at 70-90% discount.
- Risk: Can be evicted when Azure needs capacity (receive 30-sec notice)
- Best for: Fault-tolerant workloads (batch processing, Kubernetes workers, ML training)
- Example: Batch job normally $1000/month on VMs → use spot for $100-300
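A fault-tolerant worker needs to notice the eviction warning and checkpoint before the VM disappears. Azure publishes upcoming events to the VM through the Instance Metadata Service's Scheduled Events endpoint, where a spot eviction appears with the event type Preempt. The sketch below only parses such a payload; the sample document, its field values, and the function name are illustrative, and the exact payload shape should be checked against the Scheduled Events documentation.

```python
def pending_evictions(payload):
    """Return the eviction (Preempt) events found in a Scheduled Events payload.

    payload: dict parsed from the JSON document the metadata endpoint returns.
    """
    events = payload.get("Events", [])
    return [event for event in events if event.get("EventType") == "Preempt"]


# Hypothetical sample payload for illustration only.
sample = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "example-event-1",
            "EventType": "Preempt",
            "Resources": ["batch-worker-3"],
        },
        {
            "EventId": "example-event-2",
            "EventType": "Reboot",
            "Resources": ["batch-worker-7"],
        },
    ],
}

for event in pending_evictions(sample):
    # With about 30 seconds of notice, save state first and exit cleanly.
    print("checkpoint and drain:", event["Resources"])
```

In a real worker this check would run in a polling loop alongside the job, with the checkpoint step kept short enough to finish inside the notice window.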
Strategy 4: Auto-Scaling (Efficiency)
Problem: Running full capacity 24/7 even during low-traffic periods.
Solution: Auto-scale down during off-peak, up during peak.
- Scale on metrics: CPU >70% → add VMs, CPU <30% → remove VMs
- Schedule-based: Reduce instances nights/weekends (predictable patterns)
- Example: Web app runs 10 VMs by day and 2 VMs at night (80% fewer VM-hours overnight; roughly 40% lower compute cost overall if night is half the day)
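The metric rule and the day/night example can both be expressed as small functions. This is a sketch of the decision logic only, not the Azure autoscale API: the function names, the one-instance step size, the 2-10 instance bounds, and the 12-hour night are assumptions made for the example.

```python
def desired_instances(current, avg_cpu, minimum=2, maximum=10,
                      scale_out_at=70.0, scale_in_at=30.0):
    """Apply the CPU rule: above 70% add a VM, below 30% remove one."""
    if avg_cpu > scale_out_at:
        target = current + 1
    elif avg_cpu < scale_in_at:
        target = current - 1
    else:
        target = current
    # Never leave the configured bounds.
    return max(minimum, min(maximum, target))


def vm_hours_saved(day_vms, night_vms, night_hours=12):
    """Fraction of daily VM-hours saved versus running day_vms around the clock."""
    baseline = day_vms * 24
    actual = day_vms * (24 - night_hours) + night_vms * night_hours
    return 1 - actual / baseline


print(desired_instances(current=5, avg_cpu=85))  # scale out by one
print(desired_instances(current=2, avg_cpu=10))  # already at the minimum
print(f"{vm_hours_saved(day_vms=10, night_vms=2):.0%} fewer VM-hours per day")
```

The gap between the two thresholds is deliberate: a wide band between scale-in and scale-out keeps the instance count from flapping when load hovers near a single cutoff.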
Strategy 5: Storage Optimization
Problem: Not tiering data by access patterns or keeping unnecessary data.
Solution: Use storage tiers, delete unused data.
- Hot: Frequently accessed, highest cost ($0.0184/GB/month)
- Cool: Accessed monthly, 50% cheaper ($0.0092/GB/month)
- Archive: Accessed yearly, 90% cheaper ($0.00198/GB/month)
- Lifecycle policy: Move data to cool after 30 days and to archive after 90 days
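To see what tiering is worth, the per-GB prices above can be applied to a mix of data ages. The sketch below uses the listed prices and the 30/90-day rule; the function names and the sample data are hypothetical, and it deliberately ignores retrieval, transaction, and early-deletion charges, which can matter for cool and archive data.

```python
# Per-GB monthly prices as listed above (actual prices vary by region).
PRICE_PER_GB = {"hot": 0.0184, "cool": 0.0092, "archive": 0.00198}


def tier_for_age(age_days):
    """Pick a tier using the 30-day / 90-day lifecycle rule."""
    if age_days > 90:
        return "archive"
    if age_days > 30:
        return "cool"
    return "hot"


def monthly_cost(blobs):
    """blobs: iterable of (size_gb, age_days) pairs. Storage charges only."""
    return sum(size_gb * PRICE_PER_GB[tier_for_age(age)] for size_gb, age in blobs)


data = [(1000, 5), (1000, 45), (1000, 200)]  # 1 TB in each age bucket
all_hot = sum(size_gb for size_gb, _ in data) * PRICE_PER_GB["hot"]
print(f"all hot: ${all_hot:.2f}/month, tiered: ${monthly_cost(data):.2f}/month")
```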
Cost Tracking & Governance
Tools
- Azure Cost Management: Track spending by resource, subscription, tag
- Azure Advisor: Automated recommendations (right-size, reservations, etc.)
- Tags: Label resources with cost-center, project (enables cost allocation)
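- Budgets & Alerts: Set spending thresholds and get email alerts when they are approached or exceeded (a budget alerts on spend; it does not stop resources by itself)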
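Tag-based cost allocation comes down to grouping cost records by a tag value. The sketch below assumes a simplified record shape (a dict with "cost" and "tags" keys) standing in for an exported cost report; the field names and the sample values are hypothetical.

```python
from collections import defaultdict


def cost_by_tag(records, tag_key, untagged="(untagged)"):
    """Sum cost per value of one tag; resources missing the tag are grouped together."""
    totals = defaultdict(float)
    for record in records:
        label = record.get("tags", {}).get(tag_key, untagged)
        totals[label] += record["cost"]
    return dict(totals)


records = [
    {"name": "vm-1", "cost": 400.0, "tags": {"cost-center": "web"}},
    {"name": "vm-2", "cost": 350.0, "tags": {"cost-center": "web"}},
    {"name": "sql-1", "cost": 500.0, "tags": {"cost-center": "data"}},
    {"name": "disk-9", "cost": 40.0, "tags": {}},
]
print(cost_by_tag(records, "cost-center"))
```

Keeping an explicit bucket for untagged resources makes tagging gaps visible instead of silently dropping that spend from the report.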
FinOps Culture
Treat cloud cost like an engineering problem:
- Automate cleanup (delete unused resources nightly)
- Enable chargeback (teams see their cost impact)
- Regular reviews (monthly cost audit)
- Set guardrails (policies enforce cost discipline)
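Automated cleanup usually starts with a conservative filter over a resource inventory. The sketch below only selects candidates; it does not call any Azure API. The inventory shape, the 14-day idle window, and the "keep" opt-out tag are assumptions for illustration.

```python
from datetime import date, timedelta


def cleanup_candidates(resources, today, max_idle_days=14):
    """Return names of detached resources idle longer than max_idle_days.

    Resources tagged keep=true are always skipped, so owners can opt out.
    """
    cutoff = today - timedelta(days=max_idle_days)
    names = []
    for resource in resources:
        if resource["attached"]:
            continue
        if resource.get("tags", {}).get("keep") == "true":
            continue
        if resource["last_used"] < cutoff:
            names.append(resource["name"])
    return names


inventory = [
    {"name": "disk-old", "attached": False, "last_used": date(2024, 5, 1)},
    {"name": "disk-new", "attached": False, "last_used": date(2024, 6, 25)},
    {"name": "disk-live", "attached": True, "last_used": date(2024, 4, 1)},
    {"name": "disk-keep", "attached": False, "last_used": date(2024, 4, 1),
     "tags": {"keep": "true"}},
]
print(cleanup_candidates(inventory, today=date(2024, 6, 30)))
```

A nightly job would typically report or snapshot these candidates for a few runs before deleting anything, since a false positive here means data loss.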
Real-world Example: Cost Reduction Project
Baseline (Month 1):
• 20 VMs D4 (over-provisioned, avg 25% CPU) = $8,000/month
• No spot VMs for batch jobs = $2,000/month
• Storage all hot tier = $1,000/month
• Total = $11,000/month
Actions & Savings:
1. Right-size → 14 B2 VMs (30% fewer VMs) = $4,200/month (47% savings)
2. Add 6 spot VMs for batch = $400/month (vs $1,200 for the same 6 VMs on-demand, 67% savings)
3. Implement auto-scaling (scale to 8 VMs at night) = $2,800/month on compute (about 33% below the right-sized $4,200)
4. Move data cool/archive = $300/month on storage (70% savings)
5. Buy 1-year RIs on stable 8 VMs = $1,800/month (40% discount on 8 VMs)
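Total After Optimization: about $5,500/month (steps 1, 3, and 5 all act on the same compute bill, so the per-step figures are not additive)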
Monthly Savings: $5,500 (50% reduction!)
Annual Savings: $66,000
Tradeoff: RI commitment (1 year), spot VM eviction risk, monitoring overhead. Worth it.
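The headline numbers in this example can be checked with a few lines of arithmetic. The sketch below uses only figures stated above; it takes the $5,500 post-optimization total as reported rather than deriving it from the individual steps, and the helper name is made up for the example.

```python
def pct_saving(before, after):
    """Percentage reduction from before to after."""
    return 100 * (before - after) / before


baseline = {"compute": 8000, "batch": 2000, "storage": 1000}
baseline_total = sum(baseline.values())
reported_total_after = 5500  # taken as stated in the example

monthly_saving = baseline_total - reported_total_after
print(f"baseline ${baseline_total}, saving ${monthly_saving}/month, "
      f"${monthly_saving * 12}/year "
      f"({pct_saving(baseline_total, reported_total_after):.0f}%)")

# Per-step (before, after) monthly figures quoted in the actions list.
steps = {
    "right-sizing": (8000, 4200),
    "spot for batch": (1200, 400),
    "auto-scaling": (4200, 2800),
    "storage tiering": (1000, 300),
}
for label, (before, after) in steps.items():
    print(f"{label}: {pct_saving(before, after):.1f}% lower")
```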
Cost vs Availability Trade-off
Decision Framework
- Mission-critical: Prioritize availability → multi-region, HA → higher cost
- Standard: Balance cost & availability → reservation + auto-scale
- Dev/Test: Prioritize cost → pay-as-you-go, minimal redundancy, auto-shutdown
Summary
- Pricing models: Pay-as-you-go → 1-yr RI → 3-yr RI (increasing savings)
- Right-sizing: Match VM size to actual usage (quick 20-30% savings)
- Reserved instances: 30-45% discount for predictable workloads
- Spot VMs: 70-90% discount for evictable, fault-tolerant workloads
- Auto-scaling: Scale down off-peak, reduce cost 20-60%
- Storage tiers: Move old data to cool (about 50% cheaper than hot) or archive (about 90% cheaper)
- FinOps: Treat cost like code (automate, track, govern)
Interview Questions
A: RIs give a 30-45% discount in exchange for a 1- or 3-year commitment. If the workload is stable and predictable, the savings far outweigh the loss of flexibility. Example: 3 always-on VMs on a 3-year RI saving $4,320/year comes to about $12,960 over the term. Risk: if the workload disappears, the remaining commitment is a sunk cost, which is why most enterprises reserve only their core, stable workloads.
A: To trace an unexpected cost increase: 1) Check the Cost Management dashboard for cost drivers (which resources increased?). 2) Use Advisor to spot over-provisioned VMs. 3) Look for zombie resources (unused VMs, old snapshots). 4) Check whether new deployments were added. 5) Review for misconfiguration or abuse (unexpected data egress, runaway scaling). 6) Check whether RIs expired and usage fell back to pay-as-you-go.
A: Stay in a single region (multi-region adds cost). Use availability zones to reach a 99.9% SLA instead of running active-active. Buy reserved instances for the predictable baseline capacity. Auto-scale with spot VMs for peaks and fault-tolerant batch work. Keep older data in cool/archive storage. Result: a 99.9% SLA is achievable at about $5k/month without breaking the bank.