it · level 9

Backups & Recovery

3-2-1, restore drills, and why offline copies beat ransomware.

180 XP

Backups are the infrastructure layer that lets a company keep existing when something goes badly wrong — ransomware, accidental delete, a dropped laptop, a SaaS vendor's worst day. This level is about building a recovery plan you can actually prove works, not a plan that should work in theory.

The 3-2-1 rule

The canonical backup baseline, and the one every auditor will ask about, is 3-2-1:

3 total copies of the data (including the primary).
2 different media types (e.g. disk and cloud, or disk and tape).
1 copy stored off-site.

Modern updates push it to 3-2-1-1-0 — three copies, two media, one off-site, one immutable or offline, with zero errors on restore tests. The extra "1" is the ransomware lesson of the 2020s: attackers encrypt everything they can reach, so at least one copy must be on media they can't reach from the production network.
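As a minimal sketch, the rule can be written as a checklist over an inventory of copies. The `BackupCopy` record and its field names are hypothetical, invented here for illustration; they do not come from any backup product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    # Hypothetical inventory record; all field names are assumptions.
    name: str
    media: str                      # e.g. "disk", "cloud", "tape"
    offsite: bool
    immutable_or_offline: bool
    is_primary: bool = False
    last_restore_errors: Optional[int] = None  # None = never restore-tested


def check_3_2_1_1_0(copies: list[BackupCopy]) -> dict[str, bool]:
    """Return one pass/fail flag per digit of the 3-2-1-1-0 rule."""
    backups = [c for c in copies if not c.is_primary]
    return {
        # The primary counts toward the three copies.
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable_or_offline": any(c.immutable_or_offline for c in copies),
        # Every backup must have been restore-tested with zero errors.
        "0_restore_errors": bool(backups)
        and all(c.last_restore_errors == 0 for c in backups),
    }
```

A copy that has never been restore-tested fails the final "0" here on purpose: an untested copy gives no evidence either way.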

RPO vs RTO — the two budgets

Every backup plan is scoped by two numbers the business has to set:

RPO (Recovery Point Objective) — the worst-case amount of data you're willing to lose, measured in minutes. "We tolerate losing at most an hour of orders" → RPO = 60m.
RTO (Recovery Time Objective) — the worst-case time you're willing to be down, measured in minutes. "We need to be trading again within 4 hours" → RTO = 240m.

RPO drives snapshot cadence (the tier). RTO drives the destination: an onsite NAS restores in minutes, while recalling tape from an offline vault might take a day. A plan that hits RPO but misses RTO keeps data loss within budget but can't get the company back up in time; a plan that hits RTO but misses RPO comes back quickly with too much data missing.

Both numbers are a business call, not a purely technical one. Finance tells you what loss is bearable; engineering builds the tier that fits.
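The two budgets can be checked independently. The sketch below assumes the worst-case data loss equals one full snapshot interval (every snapshot succeeds, and the time a snapshot takes is ignored) and that the restore time has been measured or estimated. `Plan` and `fits` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    snapshot_interval_min: int  # time between snapshots (the tier)
    restore_time_min: int       # time to restore from the chosen destination


def fits(plan: Plan, rpo_min: int, rto_min: int) -> dict[str, bool]:
    """Check each budget separately; a plan must pass both."""
    return {
        # Assumption: worst-case loss is one full interval between snapshots.
        "rpo": plan.snapshot_interval_min <= rpo_min,
        "rto": plan.restore_time_min <= rto_min,
    }
```

With the numbers above (RPO = 60m, RTO = 240m), an hourly snapshot with an assumed 30-minute NAS restore passes both checks, while the same hourly cadence with an assumed day-long tape recall (`Plan(60, 1440)`) passes RPO and fails RTO.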

Backup categories by data shape

Different data lives in different systems and needs different tools:

Endpoint (laptops, desktops): user files, browser state, local photos. Tools like Datto Endpoint Backup, Druva inSync, or Backblaze for Business. Rule: snapshot at least hourly, destination = cloud. Don't use offline vaulting here; endpoints are mobile and the snapshot window has to be short.
M365 / Google Workspace: mail, OneDrive/Drive, SharePoint/Docs. Microsoft's retention isn't a backup; it's a soft-delete with a fixed expiry window. You need a third-party backup (Druva, Veeam Backup for Microsoft 365, AvePoint) to protect against account compromise, malicious delete, or retention-policy churn.
SaaS (Salesforce, HubSpot, Jira, etc.): vendor-hosted data, exposed through APIs. Vendors promise durability but rarely offer tenant-level restore. Tools like Rewind, Own (formerly OwnBackup), or Spanning snapshot SaaS data to an independent vault.
Server (on-prem or IaaS VMs, databases, file shares): the classic target. Veeam dominates on-prem; Rubrik, Cohesity, and cloud-native tools (AWS Backup, Azure Backup) handle hybrid. Servers can use any destination; the 3-2-1 rule is non-negotiable.

Continuous vs periodic

Continuous Data Protection (CDP) captures every write — RPO approaches zero at the cost of infrastructure (disk replication, storage fabric overhead, licensing). Worth it for order-entry databases, payment systems, anything where a minute of loss is six figures.

Periodic snapshots (hourly, daily, weekly) trade RPO for cost. Most workloads sit happily on a daily full backup with hourly incrementals. Don't over-engineer: a $200/mo hourly backup on a brochure website is money lit on fire.

Restore drill culture

A backup you've never restored is not a backup.

Every backup tool claims to work. Many don't — silent corruption, truncated snapshots, missing schema versions, restore paths that require licences nobody remembers purchasing. The only way you know the backup works is by restoring it.

Run quarterly restore drills at minimum: pick a random snapshot, restore it to a staging environment, verify the data is intact. Document the steps, time them (that's your real RTO), and rotate who runs the drill so knowledge doesn't sit with one engineer.
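The "verify the data is intact" step can be partly automated. The sketch below assumes a file-level restore and a checksum manifest captured from the source at snapshot time; `manifest` and `verify_restore` are hypothetical helper names. It reads each file fully into memory, which is fine for a demonstration but would need chunked hashing for large files.

```python
import hashlib
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(expected: dict[str, str], restored_root: Path) -> dict:
    """Compare a restored tree against the manifest taken at snapshot time."""
    actual = manifest(restored_root)
    missing = sorted(set(expected) - set(actual))
    corrupt = sorted(
        path for path in expected.keys() & actual.keys()
        if expected[path] != actual[path]
    )
    return {"missing": missing, "corrupt": corrupt, "ok": not missing and not corrupt}
```

A checksum match only shows the bytes came back. For databases, the drill should still start the engine against the restored files and run a sanity query.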

Ransomware-proof copies

The 2020s changed the threat model. Attackers now:

Breach the network.
Sit quietly for weeks, mapping systems and credentials.
Delete or encrypt everything reachable, including network-attached backups, because many shops give the backup server Domain Admin rights.
Demand ransom.

The only reliable mitigation is a copy on media the attacker cannot reach from the production network:

Offline / air-gapped tape (LTO cartridges in a safe or off-site vault).
Immutable cloud object storage (S3 Object Lock, Azure Blob immutability).
Isolated cloud account with its own credentials, not trusted from the production VPC.

These copies are cheap, slow to restore, and out of reach of an attacker who holds only production credentials. Most small and mid-sized businesses skip this tier because "it's overkill". Most of them also have a bad month eventually.
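For the immutable object storage option, here is a hedged sketch of what an upload with S3 Object Lock might look like. The helper `locked_put_args` is hypothetical; it only builds the keyword arguments, so it runs without AWS access. It assumes the target bucket was created with Object Lock enabled, and it picks COMPLIANCE mode, where the retention date cannot be shortened or removed before it expires.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def locked_put_args(
    bucket: str,
    key: str,
    body: bytes,
    retain_days: int,
    now: Optional[datetime] = None,
) -> dict:
    """Build keyword arguments for an S3 PutObject call with Object Lock retention."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # GOVERNANCE is the alternative mode; it can be bypassed by principals
        # holding the s3:BypassGovernanceRetention permission.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


# Usage with boto3 (not executed here; bucket and key names are placeholders):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_object(**locked_put_args("example-backup-vault", "db/dump.gz", data, 30))
```

The credentials used for this upload should ideally live in the isolated account and not in production, otherwise the lock protects existing objects but an attacker can still stop new backups from being written.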

Gotchas the helpdesk catches

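"M365 retention is a backup." It is not. Retention is a soft-delete timer with an admin override: Exchange Online keeps deleted items for 14 days by default (configurable up to 30), and the SharePoint/OneDrive recycle bin holds items for 93 days. A compromised admin can purge everything; a retention policy change silently shrinks the window. For regulated orgs, third-party backup is effectively mandatory.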
SaaS vendor data loss. Read your MSA carefully — most SaaS vendors disclaim liability for customer data. They replicate the service, not your tenant. If Salesforce nukes your org's data tomorrow, your contract says it's your problem.
Silent backup failures. A classic failure mode in SMB IT is "the backup job has been failing for six months but nobody reads the email alerts". Wire backup success/failure into the same alerting stack as the monitoring (PagerDuty, Opsgenie, a Slack channel someone watches); don't rely on nightly email digests.
Laptops on VPN only. Endpoint backup clients that only snapshot when on corporate VPN miss users who haven't connected in weeks. Cloud-direct tools (Druva, CrashPlan) avoid this; on-prem endpoint backup setups typically fail this test.
Restore permissions. The technician who runs the restore needs permission to read the backup AND write to the restore target. Get this wrong once at 3am during an outage and you'll never get it wrong again.
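The silent-failure gotcha is easiest to catch by alerting on the absence of a recent success, not on failure emails. A minimal sketch, assuming you can export each job's last successful run time from your backup tool; the `stale_jobs` function and the job names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def stale_jobs(
    last_success: dict[str, Optional[datetime]],
    max_age: timedelta,
    now: datetime,
) -> list[str]:
    """Return jobs with no success on record or none within max_age."""
    return sorted(
        job
        for job, finished in last_success.items()
        # None means the job has never succeeded, which is also an alert.
        if finished is None or now - finished > max_age
    )
```

Feeding the returned list into the paging or chat alert you already use means a job that stops reporting entirely is flagged the same way as one that fails loudly. A `max_age` slightly longer than the cadence (for example 26 hours for a daily job) avoids paging on a run that is merely late.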

Playground

Pick a scenario (endpoint loss, M365 purge, SaaS wipe, ransomware), then choose a tier (snapshot cadence) and destination (onsite/cloud/offline). The fit indicators update live — green when RPO, RTO, and cost all fit, red when any budget is blown. The score summarises the trade-off; reasons explain it in plain English.

Visualizer

The Backup Timeline panel lays snapshots out on a 14-day window. The orange wedge is the loss window between the last good snapshot and "now". The blue RPO band shows the cadence your tier guarantees; the lighter RTO band projects forward from "now" to show how long restoring from the chosen destination takes. Coarse tiers (weekly, quarterly) leave visible gaps; continuous makes the cadence vanish into a solid bar — you can see why CDP is worth it for critical systems, and why tape-vaulting an endpoint laptop doesn't make sense.
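The loss window the panel draws can be computed from a list of snapshot times. This is a sketch of the idea only, not the visualizer's actual code; `loss_window` is a hypothetical name, and each snapshot is assumed to carry a flag saying whether it is known to be restorable.

```python
from datetime import datetime, timedelta
from typing import Optional


def loss_window(
    snapshots: list[tuple[datetime, bool]],
    now: datetime,
) -> Optional[timedelta]:
    """Time between the most recent good snapshot and now.

    Each snapshot is (taken_at, is_good). Returns None when no good
    snapshot exists, meaning everything would be lost.
    """
    good = [taken_at for taken_at, is_good in snapshots if is_good and taken_at <= now]
    return now - max(good) if good else None
```

For daily midnight snapshots where the last two runs failed, the window at noon is two and a half days, well beyond what the daily cadence promises. That gap between the promised cadence and the real loss window is why failed jobs need alerting.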