Advanced · Resilience

Disaster Recovery

Protect against catastrophic failures: RTO/RPO, backup strategies, failover automation, and Azure Site Recovery.

🧠 ELI5 Explanation

Imagine your house burns down. Your insurance policy is the backup: it pays for you to rebuild. RTO is how fast you can move back in: three months to rebuild is a high RTO. RPO is how much you lose for good: if last week's photos were never backed up, that is a high RPO (lost data). A better strategy backs up photos every day (RPO = 1 day) and keeps a rebuild kit ready (RTO = 1 week).

Core DR Concepts

RTO: Recovery Time Objective

Definition: Maximum acceptable time to restore service after failure.

  • RTO = 1 hour: Once a failure is detected, service must be back online within 1 hour
  • RTO = 24 hours: Can be offline for a day (less critical systems)
  • RTO = 0 (near-zero): Instant failover (active-active multi-region)

RPO: Recovery Point Objective

Definition: Maximum acceptable data loss, measured as the span of time whose data you can afford to lose.

  • RPO = 15 minutes: You can lose up to 15 minutes of data (backups every 15 min)
  • RPO = 1 hour: Hourly backups are acceptable (you can lose up to 60 minutes of data written before the failure)
  • RPO = 0 (no data loss): Synchronous replication or real-time backups

Metric | Critical App  | Standard App | Dev/Test App
-------|---------------|--------------|---------------------------------
RTO    | <1 hour       | 4-24 hours   | 24-48 hours
RPO    | 15-60 minutes | 1-4 hours    | 24+ hours, or acceptable to lose
Cost   | $$$$ (high)   | $$ (medium)  | $ (low)
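
Both objectives can be checked mechanically after an incident or a drill. The following is a minimal sketch, not a definitive implementation: the names `Objectives`, `measure_recovery`, and `meets_objectives` are hypothetical, and the timestamps stand in for values you would pull from your own monitoring and backup logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def measure_recovery(failed_at, restored_at, last_recovery_point):
    """Return (downtime, data_loss_window) for one incident."""
    downtime = restored_at - failed_at
    data_loss = failed_at - last_recovery_point
    return downtime, data_loss


def meets_objectives(objectives, downtime, data_loss):
    """True only if both the RTO and the RPO were honored."""
    return downtime <= objectives.rto and data_loss <= objectives.rpo


# Illustrative incident: last backup 09:50, failure 10:00, restored 10:40.
standard_app = Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
downtime, data_loss = measure_recovery(
    failed_at=datetime(2024, 1, 1, 10, 0),
    restored_at=datetime(2024, 1, 1, 10, 40),
    last_recovery_point=datetime(2024, 1, 1, 9, 50),
)
print(downtime, data_loss, meets_objectives(standard_app, downtime, data_loss))
```

Here the downtime (40 minutes) is compared with the RTO and the gap between the last recovery point and the failure (10 minutes) with the RPO; the two checks are independent, so a system can pass one and fail the other.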

DR Strategies

Strategy 1: Backup & Restore (Cheapest)

Setup: Regular backups to blob storage (daily or weekly).

  • RTO: High (hours to restore from backup)
  • RPO: High (with daily backups you can lose up to the last 24 hours of data)
  • Cost: Low (backup storage only)
  • When to use: Dev/test, low-criticality apps, budget-limited

Strategy 2: Standby Site (Warm Standby)

Setup: Secondary site with replicated data but no active traffic. Manual failover to secondary when needed.

  • RTO: Medium (hours, manual intervention)
  • RPO: Low (async replication, near-real-time)
  • Cost: Medium (maintain idle secondary resources)
  • When to use: Standard enterprise apps

Strategy 3: Hot Standby (Active-Passive)

Setup: Secondary site fully deployed, auto-failover on detection of primary failure.

  • RTO: Low (<5 minutes, automated)
  • RPO: Low (async replication)
  • Cost: Higher (maintain hot standby)
  • When to use: Mission-critical systems

Strategy 4: Active-Active (Highest Availability)

Setup: Both sites active, traffic distributed, instant failover.

  • RTO: Near-zero (instant)
  • RPO: Near-zero (sync replication)
  • Cost: Highest (run 2x resources)
  • When to use: Critical financial/healthcare systems
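
Choosing among the four strategies amounts to picking the cheapest one whose achievable RTO and RPO fit the targets. A minimal sketch of that selection follows; the concrete numbers in `STRATEGIES` are illustrative assumptions loosely based on the ranges above, not measured values, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (name, achievable RTO, achievable RPO), ordered cheapest first.
# All figures are assumptions chosen for illustration.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Warm Standby", timedelta(hours=4), timedelta(minutes=15)),
    ("Hot Standby", timedelta(minutes=5), timedelta(minutes=15)),
    ("Active-Active", timedelta(seconds=30), timedelta(0)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the first (cheapest) strategy that satisfies both targets."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None  # no listed strategy is good enough


print(cheapest_strategy(timedelta(hours=48), timedelta(hours=24)))
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=15)))
print(cheapest_strategy(timedelta(minutes=15), timedelta(minutes=5)))
```

Because the list is ordered by cost, the loop never recommends a more expensive design than the targets require.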

Azure Site Recovery (ASR)

What Is ASR?

Managed service that automates replication and failover.

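  • Replicate: On-prem VMs to Azure, or Azure VMs to a secondary region
  • Orchestrate: Multi-VM failover (coordinated)
  • Test: Run test failovers (DR drills) without affecting production
  • Automate: Recovery plans script the failover sequence; the failover itself is started by an operator or by your own automation, because ASR does not fail over on its own when it sees a failure

Azure Site Recovery: On-prem to Azure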

On-Premises
├── Hyper-V/VMware Hosts
│ ├── App VM 1
│ ├── App VM 2
│ └── DB VM
└── Mobility Agent (replication)

Recovery Services vault (Azure Site Recovery replication and orchestration)

Azure (Secondary Region)
├── App VM 1 (replica)
├── App VM 2 (replica)
└── DB VM (replica)

Normal: On-prem handles traffic
Disaster: Failover initiated → Azure VMs activated → users routed to Azure
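
Coordinated multi-VM failover comes down to starting VMs in a defined order. The sketch below simulates that idea in plain Python and does not call any Azure API: `RECOVERY_PLAN`, `run_failover`, and the choice to start the database before the app tier are assumptions made for illustration.

```python
# Hypothetical recovery plan: groups are failed over in order,
# and the VMs inside a group are started one after another.
RECOVERY_PLAN = [
    ("Group 1: data tier", ["DB VM"]),
    ("Group 2: app tier", ["App VM 1", "App VM 2"]),
]


def run_failover(plan, start_vm):
    """Start every VM group by group; stop at the first failure.

    `start_vm` is a caller-supplied callable returning True on success.
    Returns the list of VMs started, in start order.
    """
    started = []
    for group_name, vms in plan:
        for vm in vms:
            if not start_vm(vm):
                raise RuntimeError(f"{group_name}: failed to start {vm}")
            started.append(vm)
    return started


# Dry run with a stub that always succeeds (a stand-in for a test failover).
order = run_failover(RECOVERY_PLAN, start_vm=lambda vm: True)
print(order)
```

Passing the start action in as a callable keeps the ordering logic testable without touching production, which mirrors the purpose of a test failover.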

Backup Strategy: 3-2-1 Rule

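Best practice for backup: keep 3 copies of your data, on 2 different types of storage, with 1 copy offsite (for example, in another region).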

# Example: 3-2-1 backup strategy
1. Production DB on VM (Primary)
2. Backup 1: Local snapshot (fast restore)
3. Backup 2: Geo-redundant blob storage (disaster protection)

Architecture:
- VM running in East US (prod)
- Daily snapshot to managed disk (local, fast)
- Nightly backup to geo-redundant storage (copies to West US)

If disaster hits East US: Restore from the West US blob copy (target: <15 min; actual restore time depends on database size)
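
A layout like this can be checked against the rule's three counts. This is a minimal sketch under stated assumptions: `Copy` and `satisfies_3_2_1` are hypothetical names, the medium labels are arbitrary strings, and "offsite" is interpreted as "in a region other than the primary".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str  # e.g. "vm-disk", "snapshot", "blob"
    region: str


def satisfies_3_2_1(copies, primary_region):
    """Check >= 3 copies, >= 2 distinct media, >= 1 copy outside the primary region."""
    media = {c.medium for c in copies}
    offsite = [c for c in copies if c.region != primary_region]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


copies = [
    Copy("production DB", medium="vm-disk", region="eastus"),
    Copy("daily snapshot", medium="snapshot", region="eastus"),
    Copy("nightly backup", medium="blob", region="westus"),
]
print(satisfies_3_2_1(copies, primary_region="eastus"))
```

Dropping the West US copy makes the check fail, which is the case the rule exists to catch: every copy lost in the same regional disaster.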

Real-world Example: Enterprise Critical System

Setup:
• Primary: SQL Server running in an on-premises datacenter
• Req: RTO <1 hour, RPO ≤15 minutes, SLA 99.99%

DR Strategy (Hot Standby):
1. Azure Site Recovery configured: On-prem DB → Azure hot standby
2. Replication frequency: Every 15 minutes (RPO = 15 min)
3. Failover orchestration: Multi-VM failover (app + DB coordinated)
4. Auto-failover policy: After 30 minutes of continuous downtime, monitoring automation triggers the ASR failover
5. Backup: Nightly to geo-redundant blob (3-2-1 rule)
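
The 30-minute trigger can be expressed as a counter of consecutive failed health checks. The sketch below assumes one probe per minute and a monitor that runs outside the primary site; `FailoverTrigger` is a hypothetical class, and what happens when it fires (for example, starting a recovery plan) is left to the caller.

```python
class FailoverTrigger:
    """Signal failover after `threshold` consecutive failed health checks."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one probe result; return True when failover should start."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


# One probe per minute with threshold 30: a 30-minute outage trips the trigger.
trigger = FailoverTrigger(threshold=30)
results = [trigger.record(healthy=False) for _ in range(30)]
print(results.index(True) + 1)  # number of the probe that fired the trigger
```

Resetting the counter on any successful probe avoids failing over on brief, intermittent blips; the trade-off is that a longer threshold adds directly to recovery time.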

Failure Scenario:
• On-prem datacenter power loss (10:00 AM)
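• Monitoring records 30 consecutive failed health checks, about one per minute (10:30 AM)
• Auto-failover initiated: Azure VMs started, traffic routed to Azure
• Recovery time: 30 minutes of detection plus the failover itself, so slightly over 30 minutes in total, within the 1-hour RTO ✓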
• Data loss: Up to 15 minutes (max time since last replication)
• RPO: 15 minutes ✓ (requirement met)

Result: Business-critical system back online within RTO, minimal data loss
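
The scenario's arithmetic can be worked through explicitly. In this sketch the 5-minute failover duration is an assumption (the scenario does not state it), the RPO requirement is treated as "at most 15 minutes", and the dates are placeholders.

```python
from datetime import datetime, timedelta

failure = datetime(2024, 1, 1, 10, 0)      # power loss
detection = datetime(2024, 1, 1, 10, 30)   # failover triggered
failover_duration = timedelta(minutes=5)   # assumption: not given in the scenario
replication_interval = timedelta(minutes=15)

rto_target = timedelta(hours=1)
rpo_target = timedelta(minutes=15)

# Total downtime = time to detect + time to fail over.
downtime = (detection - failure) + failover_duration
# Worst case: the failure happens just before the next replication cycle.
worst_case_data_loss = replication_interval

# Downtime allowed per year by a 99.99% availability target.
yearly_budget = timedelta(days=365) * (1 - 0.9999)

print(downtime, downtime <= rto_target)
print(worst_case_data_loss, worst_case_data_loss <= rpo_target)
print(round(yearly_budget.total_seconds() / 60, 2), "minutes per year")
```

Under these assumptions the outage fits the RTO and RPO, but a single 35-minute incident consumes roughly two thirds of the yearly downtime that a 99.99% target allows, so the detection window is the first thing to shorten.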

DR Testing & Maintenance

Critical step often skipped: Test your DR plan regularly.
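
One way to keep drills from being skipped is to compute when the next one is due. This is a minimal sketch assuming a roughly quarterly cadence (90 days) and an extra drill after any major infrastructure change; `DRILL_INTERVAL` and `drill_due` are hypothetical names.

```python
from datetime import date, timedelta

DRILL_INTERVAL = timedelta(days=90)  # assumption: roughly quarterly


def drill_due(last_drill, today, infra_changed_since_drill=False):
    """A drill is due once the interval has elapsed or after a major change."""
    return infra_changed_since_drill or (today - last_drill) >= DRILL_INTERVAL


print(drill_due(date(2024, 1, 10), date(2024, 3, 1)))
print(drill_due(date(2024, 1, 10), date(2024, 4, 15)))
print(drill_due(date(2024, 1, 10), date(2024, 2, 1), infra_changed_since_drill=True))
```

A check like this can run in a scheduled job and open a ticket when it returns True, so the drill becomes a tracked task instead of a good intention.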

Summary

Interview Questions

Q: What's the difference between RTO and RPO? Why does it matter?
A: RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss, measured in time. Example: RTO = 1 hour means service must be back online within 1 hour; RPO = 15 min means losing up to 15 minutes of data is acceptable. It matters because the targets drive the strategy: the lower the RTO/RPO, the more expensive the design.

Q: Design a DR plan for a financial trading system requiring 99.99% SLA, RTO <15 min, RPO <5 min.
A: Active-active multi-region setup: primary (East US) and secondary (West US), kept synchronized by replicating all data in real time. Azure Front Door routes users. Verify replication health at least every 5 minutes so that lag never approaches the RPO. If the primary fails, failover to the secondary is automatic and takes seconds, well under the 15-minute RTO. Very expensive, but it meets the strict requirements.

Q: What should you test in a DR plan?
A: Test the failover procedure (does it actually work?), restoration time (is our RTO realistic?), data integrity (no corruption after failover), the communication plan (are teams notified?), and runbook accuracy (are the steps still valid?). Do this quarterly and after major infrastructure changes.