Disaster Recovery
Protect against catastrophic failures: RTO/RPO, backup strategies, failover automation, and Azure Site Recovery.
🧠 ELI5 Explanation
Imagine your house burns down. Insurance (your backup) pays for the rebuild. How long until you can move back in is your RTO: three months to rebuild is a high RTO. What you lost for good is your RPO: if last week's photos were never backed up, you lost a week of data, which is a high RPO. A better plan: back up photos every day (RPO = 1 day) and keep a rebuild kit ready (RTO = 1 week).
Core DR Concepts
RTO: Recovery Time Objective
Definition: Maximum acceptable time to restore service after failure.
- RTO = 1 hour: After failure detected, service must be back online in 1 hour max
- RTO = 24 hours: Can be offline for a day (less critical systems)
- RTO = 0 (near-zero): Instant failover (active-active multi-region)
RPO: Recovery Point Objective
Definition: Maximum acceptable data loss, measured as the span of time before the failure whose data you can afford to lose.
- RPO = 15 minutes: You can lose up to 15 minutes of data (backups every 15 min)
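- RPO = 1 hour: Hourly backups are acceptable (a failure just before the next backup loses almost a full hour of data)
- RPO = 0 (no data loss): Requires synchronous replication, where each write is committed at both sites before it is acknowledged; periodic backups alone cannot reach zero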
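The relationship between backup interval, RPO, and RTO can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the sample figures are assumptions, not part of any Azure API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses (almost) one full interval of data."""
    return backup_interval


def meets_objectives(downtime: timedelta, data_loss: timedelta,
                     rto: timedelta, rpo: timedelta) -> bool:
    """True when both the measured downtime and data loss are within target."""
    return downtime <= rto and data_loss <= rpo


rto = timedelta(hours=1)
rpo = timedelta(minutes=15)

# Hourly backups cannot satisfy a 15-minute RPO.
print(worst_case_data_loss(timedelta(hours=1)) <= rpo)

# A hypothetical incident: 40 minutes offline, 10 minutes of data lost.
print(meets_objectives(timedelta(minutes=40), timedelta(minutes=10), rto, rpo))
```

The first check shows why the backup interval has to be no longer than the RPO; the second compares one incident against both targets at once.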
DR Strategies
Strategy 1: Backup & Restore (Cheapest)
Setup: Regular backups to blob storage (daily or weekly).
- RTO: High (hours to restore from backup)
- RPO: High (lose up to 24 hours of data with daily backups)
- Cost: Low (backup storage only)
- When to use: Dev/test, low-criticality apps, budget-limited
Strategy 2: Standby Site (Warm Standby)
Setup: Secondary site with replicated data but no active traffic. Manual failover to secondary when needed.
- RTO: Medium (hours, manual intervention)
- RPO: Low (async replication, near-real-time)
- Cost: Medium (maintain idle secondary resources)
- When to use: Standard enterprise apps
Strategy 3: Hot Standby (Active-Passive)
Setup: Secondary site fully deployed, auto-failover on detection of primary failure.
- RTO: Low (<5 minutes, automated)
- RPO: Low (async replication)
- Cost: Higher (maintain hot standby)
- When to use: Mission-critical systems
Strategy 4: Active-Active (Highest Availability)
Setup: Both sites active, traffic distributed, instant failover.
- RTO: Near-zero (instant)
- RPO: Near-zero (sync replication)
- Cost: Highest (run 2x resources)
- When to use: Critical financial/healthcare systems
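Since the four strategies are ordered by cost, choosing one amounts to picking the cheapest whose typical RTO and RPO fit the requirement. The sketch below assumes rough numeric values for "hours", "low", and "near-zero"; those numbers are illustrative guesses and should be replaced with figures measured in your own environment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta
    typical_rpo: timedelta


# Ordered cheapest first. The numeric values are assumptions for illustration.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=8), timedelta(hours=24)),
    Strategy("Warm Standby", timedelta(hours=2), timedelta(minutes=5)),
    Strategy("Hot Standby", timedelta(minutes=5), timedelta(minutes=5)),
    Strategy("Active-Active", timedelta(seconds=30), timedelta(0)),
]


def cheapest_strategy(required_rto: timedelta,
                      required_rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both targets."""
    for strategy in STRATEGIES:
        if (strategy.typical_rto <= required_rto
                and strategy.typical_rpo <= required_rpo):
            return strategy
    return None  # No listed strategy is fast enough.


choice = cheapest_strategy(timedelta(hours=1), timedelta(minutes=15))
print(choice.name if choice else "no strategy fits")
```

With these assumed figures, a requirement of RTO 1 hour and RPO 15 minutes skips Backup & Restore and Warm Standby (both too slow to recover) and lands on Hot Standby.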
Azure Site Recovery (ASR)
What Is ASR?
Managed service that automates replication and failover.
- Replicate: On-prem VMs or Azure VMs to secondary region
- Orchestrate: Multi-VM failover (coordinated)
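- Test: Run test failovers without affecting production
- Automate: Script failover steps; ASR does not fail over on its own, so the failover is started by an operator or by external automation that reacts to failure detection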
On-Premises
├── Hyper-V/VMware Hosts
│   ├── App VM 1
│   ├── App VM 2
│   └── DB VM
└── Mobility Agent (replication)
        ↓
Recovery Services vault (Azure Site Recovery replication orchestration)
        ↓
Azure (Secondary Region)
├── App VM 1 (replica)
├── App VM 2 (replica)
└── DB VM (replica)
Normal: On-prem handles traffic
Disaster: Failover initiated → Azure VMs activated → users routed to Azure
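Coordinated multi-VM failover usually means starting machines in dependency order, for example the database before the app servers. The sketch below simulates that ordering in plain Python; it does not call Azure Site Recovery, and the group layout and the `start_vm` callback are hypothetical.

```python
from typing import Callable, List

# Hypothetical start order: the database first, then the app tier.
RECOVERY_GROUPS = [
    ["DB VM"],
    ["App VM 1", "App VM 2"],
]


def run_failover(groups: List[List[str]],
                 start_vm: Callable[[str], bool]) -> List[str]:
    """Start each group in order and return the VMs that came up.

    Stops after the first group in which any VM fails to start, so later
    groups never start against a missing dependency.
    """
    started: List[str] = []
    for group in groups:
        results = {vm: start_vm(vm) for vm in group}
        started.extend(vm for vm, ok in results.items() if ok)
        if not all(results.values()):
            break
    return started


# Simulated run in which every VM starts successfully.
print(run_failover(RECOVERY_GROUPS, lambda vm: True))
```

Halting on the first failed group is a design choice for this sketch: app servers that start without their database tend to fail in ways that are harder to diagnose during an incident.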
Backup Strategy: 3-2-1 Rule
Best practice for backup:
- 3 copies: Original + 2 backups
- 2 media types: E.g., local disk + cloud blob
- 1 offsite: At least one copy geographically distant (disaster protection)
# Example: 3-2-1 backup strategy
1. Production DB on VM (Primary)
2. Backup 1: Local snapshot (fast restore)
3. Backup 2: Geo-redundant blob storage (disaster protection)
Architecture:
- VM running in East US (prod)
- Daily snapshot to managed disk (local, fast)
- Nightly backup to geo-redundant storage (copies to West US)
If disaster in East: Restore from West US blob in <15 min
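A backup inventory can be checked against the 3-2-1 rule mechanically. The sketch below is a minimal illustration: the `Copy` type and its field values are assumptions, and the media labels are free-form strings rather than anything Azure defines.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    name: str
    media: str   # e.g. "managed disk", "blob storage"
    region: str  # where the copy physically lives


def satisfies_3_2_1(copies: List[Copy], primary_region: str) -> bool:
    """3 copies in total, on at least 2 media types, at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.region != primary_region for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    Copy("Production DB", "managed disk", "East US"),
    Copy("Local snapshot", "managed disk", "East US"),
    Copy("Geo-redundant blob backup", "blob storage", "West US"),
]

print(satisfies_3_2_1(copies, primary_region="East US"))
```

Dropping the geo-redundant copy from the list makes the check fail on all three conditions at once, which is why that copy carries most of the disaster protection in this layout.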
Real-world Example: Enterprise Critical System
Setup:
• Primary: SQL Server running in an on-premises datacenter
• Requirements: RTO <1 hour, RPO <15 minutes, SLA 99.99%
DR Strategy (Hot Standby):
1. Azure Site Recovery configured: On-prem DB → Azure hot standby
2. Replication frequency: Every 15 minutes (RPO = 15 min)
3. Failover orchestration: Multi-VM failover (app + DB coordinated)
4. Failover trigger: After 30 minutes of failed health checks, monitoring automation starts the ASR failover (ASR does not fail over on its own)
5. Backup: Nightly to geo-redundant blob (3-2-1 rule)
Failure Scenario:
• On-prem datacenter power loss (10:00 AM)
• Monitoring records 30 consecutive failed health checks (10:30 AM)
• Failover initiated: Azure VMs started, traffic routed to Azure
• Downtime: The 30-minute detection window plus the time the failover itself takes; the RTO is met as long as service is back before 11:00 AM
• Data loss: Up to 15 minutes (max time since last replication)
• RPO: 15 minutes ✓ (requirement met)
Result: Business-critical system back online within RTO, minimal data loss
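The timeline above can be checked against both the RTO and the yearly downtime budget implied by the SLA. In the sketch below, the 10-minute failover duration and the calendar date are assumptions added for illustration; the source does not state how long the failover itself takes.

```python
from datetime import datetime, timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Downtime allowed over `period` for a given availability percentage."""
    return period * (1 - sla_percent / 100)


rto = timedelta(hours=1)

outage_start = datetime(2024, 1, 1, 10, 0)  # power loss at 10:00 AM
detection_delay = timedelta(minutes=30)     # failover triggered at 10:30 AM
failover_duration = timedelta(minutes=10)   # ASSUMPTION: not given in the text

restored_at = outage_start + detection_delay + failover_duration
downtime = restored_at - outage_start

yearly_budget = downtime_budget(99.99, timedelta(days=365))

print("Downtime:", downtime)
print("Within RTO:", downtime <= rto)
print("Yearly budget at 99.99%:", yearly_budget)
print("Within yearly budget:", downtime <= yearly_budget)
```

A 99.99% SLA allows only about 52.6 minutes of downtime per year, so a single incident with a 30-minute detection window uses up most of that budget even when the RTO is met.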
DR Testing & Maintenance
Critical step often skipped: Test your DR plan regularly.
- Test failover: ASR simulates failover without affecting production
- Frequency: Quarterly if possible, semi-annually at minimum
- Document: Record results, issues, update runbooks
- Why: A backup is useless if the recovery fails under pressure
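Drill results are easier to act on when they are recorded and compared against targets. The sketch below is one hypothetical way to flag an overdue drill and an unrealistic RTO; the 91-day quarter and the sample dates are assumptions.

```python
from datetime import date, timedelta

QUARTER = timedelta(days=91)  # assumed length of one quarter


def drill_overdue(last_drill: date, today: date,
                  max_gap: timedelta = QUARTER) -> bool:
    """True when more than `max_gap` has passed since the last DR drill."""
    return today - last_drill > max_gap


def rto_is_realistic(measured_recovery: timedelta, rto: timedelta) -> bool:
    """True when the recovery time measured in a drill fits within the RTO."""
    return measured_recovery <= rto


# Hypothetical drill record: last test in January, checked in June.
print(drill_overdue(date(2024, 1, 15), date(2024, 6, 1)))

# Hypothetical measurement: the drill took 75 minutes against a 1-hour RTO.
print(rto_is_realistic(timedelta(minutes=75), timedelta(hours=1)))
```

A failed `rto_is_realistic` check means either the runbook needs to get faster or the stated RTO needs to be renegotiated; both are better discovered in a drill than in an outage.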
Summary
- RTO: How fast to restore (critical = <1 hour)
- RPO: How much data loss acceptable (critical = <15 minutes)
- Strategies: Backup → Warm standby → Hot standby → Active-active (cost increases)
- Azure Site Recovery: Automates replication, orchestration, failover
- 3-2-1 rule: Best practice (3 copies, 2 media types, 1 offsite)
- Test regularly: DR plan only works if tested
Interview Questions
A: RTO = max acceptable downtime. RPO = max acceptable data loss. Example: RTO 1 hour = must be back online in 1 hour. RPO 15 min = acceptable to lose 15 minutes of data. Different requirements need different strategies (more expensive for lower RTO/RPO).
A: Use an active-active multi-region setup: primary (East US) and secondary (West US) kept in sync through real-time replication, with Azure Front Door routing users. Run a data sync check every 5 minutes. If the primary fails, failover to the secondary is automatic and takes seconds, well under the 15-minute RTO. This is very expensive, but it meets strict requirements.
A: Test failover procedure (does it actually work?), restoration time (is our RTO realistic?), data integrity (no corruption after failover), communication plan (notify teams), runbook accuracy (steps valid?). Do quarterly or at major infrastructure changes.