
Disaster Recovery (DR)

disaster-recovery business-continuity resilience infrastructure
Plain English

Disaster recovery is your plan for when everything goes wrong: the server room floods, ransomware encrypts everything, or a cloud region goes offline. It is not just about having backups; it is about knowing exactly how to get your systems running again, who is responsible for what, and how long it should take. Organizations that practice disaster recovery can come back online in hours. Organizations that do not can lose days, data, or the entire business.

Technical Definition

Disaster Recovery (DR) is a structured approach to restoring IT systems, data, and operations after a disruptive event. It is a subset of business continuity planning (BCP), which covers the broader organizational response.

Key metrics:

  • RTO (Recovery Time Objective): maximum acceptable downtime. “We must be operational within 4 hours.”
  • RPO (Recovery Point Objective): maximum acceptable data loss measured in time. “We can lose up to 1 hour of data.”
  • MTTR (Mean Time to Recovery): average time to restore service after an incident.
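An RPO can be checked mechanically: if backups land in a directory, the newest one must never be older than the RPO. A minimal POSIX shell sketch (the directory layout and 1-hour threshold are illustrative assumptions, not from any specific tool):

```shell
# Sketch: verify that the newest file in a backup directory satisfies the RPO.
# Directory layout and threshold are illustrative assumptions.
check_rpo() {  # check_rpo BACKUP_DIR RPO_SECONDS
  dir=$1
  rpo_seconds=$2
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "RPO VIOLATION: no backups found in $dir"
    return 1
  fi
  now=$(date +%s)
  # GNU stat uses -c %Y; BSD/macOS stat uses -f %m
  mtime=$(stat -c %Y "$dir/$newest" 2>/dev/null || stat -f %m "$dir/$newest")
  age=$((now - mtime))
  if [ "$age" -gt "$rpo_seconds" ]; then
    echo "RPO VIOLATION: newest backup is ${age}s old (limit ${rpo_seconds}s)"
    return 1
  fi
  echo "RPO OK: newest backup is ${age}s old (limit ${rpo_seconds}s)"
}

# Example: enforce a 1-hour RPO
# check_rpo /var/backups/db 3600
```

Run from cron or a monitoring agent, a check like this turns the RPO from a slide-deck number into an alert.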

DR strategies (ordered by cost and RTO):

| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & restore | Hours to days | Hours | Low | Restore from backups to new infrastructure |
| Pilot light | 10-30 min | Minutes | Medium | Core systems always running at DR site; scale up on failover |
| Warm standby | Minutes | Seconds | Medium-High | Scaled-down replica running in DR site; scale up on failover |
| Active-active (multi-site) | Near zero | Near zero | High | Full production in multiple sites; traffic shifts on failure |
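At its simplest, the lowest tier is "archive state, re-extract it onto fresh infrastructure." A generic file-level sketch (real deployments would use database-native dumps and off-site, ideally immutable, storage; the function names and paths here are illustrative):

```shell
# Sketch of the backup & restore tier: archive a data directory, then
# restore it onto replacement infrastructure. Names/paths are illustrative.
backup() {   # backup SRC_DIR DEST_DIR -> prints the archive path
  archive="$2/backup-$(date +%Y%m%d%H%M%S).tar.gz"
  tar -czf "$archive" -C "$1" . && echo "$archive"
}

restore() {  # restore ARCHIVE TARGET_DIR
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

The RTO of this tier is dominated by provisioning the new infrastructure and the transfer time of the archive, which is why it sits at "hours to days."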

DR site options:

  • Same region, different AZ: protects against single data center failure (AWS multi-AZ)
  • Cross-region: protects against regional outages (AWS us-east-1 primary, us-west-2 DR)
  • Cross-cloud: protects against provider-level incidents (AWS primary, GCP DR)
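Cross-region failover is commonly driven by DNS. As one illustration, Route 53 supports health-check-based failover record pairs; the record names, targets, and health check ID below are placeholders:

```json
{
  "Comment": "Failover pair: primary answers while healthy, DR otherwise",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.bytesnation.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "placeholder-health-check-id",
        "ResourceRecords": [{"Value": "app-us-east-1.bytesnation.com"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.bytesnation.com",
        "Type": "CNAME",
        "SetIdentifier": "dr",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "app-us-west-2.bytesnation.com"}]
      }
    }
  ]
}
```

The short TTL matters: clients cache the answer for at most 60 seconds, which bounds how long stale traffic keeps hitting the failed region.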

DR plan components:

  1. Asset inventory (what systems exist and their criticality tiers)
  2. Recovery procedures (step-by-step runbooks for each system)
  3. Communication plan (who gets notified, escalation paths)
  4. Roles and responsibilities (who executes each recovery step)
  5. Testing schedule (tabletop exercises, partial failovers, full DR drills)
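Component 1 is often kept machine-readable so drills can be scoped by tier. A minimal sketch (the CSV columns and system names are invented for illustration):

```shell
# Sketch: filter an asset inventory by criticality tier.
# Inventory format (name,tier,rto_hours) is an illustrative convention.
systems_in_tier() {  # systems_in_tier INVENTORY_CSV TIER
  awk -F, -v tier="$2" '$2 == tier { print $1 }' "$1"
}
```

During a drill, tier-1 systems would be recovered first, in ascending RTO order, before anyone touches lower tiers.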

DR testing with database failover

# AWS RDS: promote read replica to primary (failover)
$ aws rds promote-read-replica \
  --db-instance-identifier bytesnation-db-replica
# Takes 1-5 minutes; replica becomes standalone primary

# Update application DNS to point to new primary
$ aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "db.bytesnation.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "bytesnation-db-replica.abc.us-west-2.rds.amazonaws.com"}]
      }
    }]
  }'

# Verify application connectivity to new primary
$ psql -h db.bytesnation.com -U app -c "SELECT pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 f
# "f" (false) confirms this is now the primary, not a replica
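
After a promotion like this, it helps to poll until the end-to-end check passes rather than eyeballing it once, since DNS propagation and connection pools take time to settle. A generic retry helper (the probe command, timeout, and interval are placeholders):

```shell
# Sketch: retry a health probe until it succeeds or a timeout expires.
# The probe is whatever verifies the failover end-to-end (e.g. the
# psql check above); the values used here are placeholders.
wait_for() {  # wait_for "PROBE_CMD" TIMEOUT_S INTERVAL_S
  elapsed=0
  until eval "$1" >/dev/null 2>&1; do
    elapsed=$((elapsed + $3))
    if [ "$elapsed" -ge "$2" ]; then
      echo "TIMEOUT: probe still failing after ${2}s"
      return 1
    fi
    sleep "$3"
  done
  echo "HEALTHY after ~${elapsed}s"
}

# Example:
# wait_for 'psql -h db.bytesnation.com -U app -c "SELECT 1;"' 300 10
```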

In the Wild

Disaster recovery separates resilient organizations from fragile ones. The 2017 AWS S3 outage in us-east-1 took down thousands of websites and services for hours; companies with multi-region DR failover stayed online. Ransomware attacks increasingly target backups specifically, making immutable off-site backups and tested DR procedures essential. Compliance frameworks (SOC 2, HIPAA, PCI-DSS) require documented and tested DR plans. The most common DR failure is untested plans: the runbook says “restore from backup,” but nobody has verified the restore process works end-to-end. Best practice is to run a full DR drill at least annually, simulating a complete loss of the primary site. For homelabs, DR might be as simple as a documented process for rebuilding Proxmox from ZFS replication snapshots stored on a second machine or cloud storage.