Building a Multi-Site Backup Strategy with ZFS Snapshots and Encrypted Offsite Replication

Introduction

Data loss is not a question of if, but when. Hardware failures, ransomware attacks, accidental deletions, and datacenter outages are realities every infrastructure team must plan for. A multi-site backup strategy built on ZFS snapshots and encrypted offsite replication provides a resilient foundation that balances performance, storage efficiency, and security. This guide covers the full stack: snapshot fundamentals, incremental send/receive pipelines, native ZFS encryption, offsite replication over SSH tunnels, automated retention policies, and restore verification procedures.

ZFS Snapshot Fundamentals

ZFS snapshots are point-in-time, read-only copies of a dataset or pool. Unlike traditional backup tools that copy data block by block, ZFS snapshots are instantaneous and consume zero additional space at creation — they only consume space as the live dataset diverges from the snapshot state. This copy-on-write property makes snapshots extremely cheap to create and ideal as the base unit of a backup pipeline.

Creating a snapshot is a single command:

zfs snapshot rpool/data/vms@2026-04-03_02:00

Snapshots are identified by the @ separator. The dataset name precedes it, the snapshot name follows. List all snapshots on a pool with:

zfs list -t snapshot -o name,creation,used,refer -s creation

The used column shows the space consumed by data unique to that snapshot (blocks no longer referenced by the live dataset or by any other snapshot), a key metric for understanding retention cost. Rolling back to a snapshot destroys all data written after it was taken, and zfs rollback only accepts the most recent snapshot unless -r is given, which also destroys any intervening snapshots:

zfs rollback rpool/data/vms@2026-04-03_02:00
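Snapshots also enable recovery of individual files without any rollback: every mounted filesystem dataset exposes its snapshots read-only under a hidden .zfs/snapshot directory. A sketch, assuming the dataset is mounted at /rpool/data/vms and using a hypothetical file name:

```shell
# Copy one file out of the snapshot; nothing on the live dataset changes
cp /rpool/data/vms/.zfs/snapshot/2026-04-03_02:00/vm-100.conf /tmp/vm-100.conf.recovered
```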

Incremental Send and Receive

ZFS send/receive is the native mechanism for replicating datasets between pools, hosts, or storage backends. A full send serializes the entire dataset into a byte stream piped directly into zfs receive on a remote host.

# Full initial send
zfs send rpool/data/vms@2026-04-03_02:00 | ssh backup01 zfs receive tank/backups/vms

After the initial full send, incremental sends transmit only the blocks that changed between two snapshots:

# Incremental send between two snapshots
zfs send -i rpool/data/vms@2026-04-03_02:00 rpool/data/vms@2026-04-04_02:00 \
  | ssh backup01 zfs receive tank/backups/vms

The -I flag (capital I) sends all intermediate snapshots between two points, useful when the remote is multiple snapshots behind:

zfs send -I rpool/data/vms@2026-04-01 rpool/data/vms@2026-04-04 \
  | ssh backup01 zfs receive tank/backups/vms

For large datasets across slow links, pipe through mbuffer to smooth out I/O bursts and add progress visibility:

zfs send -I rpool/data/vms@2026-04-01 rpool/data/vms@2026-04-04 \
  | mbuffer -s 128k -m 1G \
  | ssh -c [email protected] backup01 \
    "mbuffer -s 128k -m 1G | zfs receive -F tank/backups/vms"
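On links that drop mid-transfer, a plain send restarts from zero. OpenZFS supports resumable replication: pass -s to zfs receive so partial state survives an interruption, then restart from the saved token. A sketch using the hosts and snapshots from the examples above:

```shell
# -s on the receiver keeps partial receive state across interruptions
zfs send -I rpool/data/vms@2026-04-01 rpool/data/vms@2026-04-04 \
  | ssh backup01 "zfs receive -s tank/backups/vms"

# After a failure, read the resume token from the receiving dataset...
TOKEN=$(ssh backup01 zfs get -H -o value receive_resume_token tank/backups/vms)

# ...and resume the stream from where it stopped
zfs send -t "$TOKEN" | ssh backup01 "zfs receive -s tank/backups/vms"
```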

Native ZFS Encryption

Since OpenZFS 0.8, native encryption is available at the dataset level. Unlike block-device encryption layers such as LUKS, which encrypt an entire disk beneath the filesystem, ZFS encryption operates per dataset and allows granular key management for each one. Creating an encrypted dataset with a passphrase:

zfs create \
  -o encryption=aes-256-gcm \
  -o keyformat=passphrase \
  -o keylocation=prompt \
  rpool/data/encrypted-vms

For automated operation, store the key as a raw file. keyformat=raw expects exactly 32 bytes of binary key material, so generate the key accordingly:

openssl rand -out /etc/zfs/keys/encrypted-vms.key 32
chmod 600 /etc/zfs/keys/encrypted-vms.key

zfs create \
  -o encryption=aes-256-gcm \
  -o keyformat=raw \
  -o keylocation=file:///etc/zfs/keys/encrypted-vms.key \
  rpool/data/encrypted-vms
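After a reboot, encrypted datasets come up with their keys unloaded and cannot be mounted until the key is loaded. With the file-based keylocation configured above:

```shell
# Load the key from the configured keylocation, then mount
zfs load-key rpool/data/encrypted-vms
zfs mount rpool/data/encrypted-vms

# Or load every key that has a reachable keylocation in one pass
zfs load-key -a
```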

When sending encrypted datasets offsite, use the -w (raw) flag to transmit data in its encrypted form. The receiving host never sees plaintext — a critical property for offsite replication to untrusted or third-party storage:

zfs send -w -I rpool/data/encrypted-vms@yesterday rpool/data/encrypted-vms@today \
  | ssh storage.example-corp.com zfs receive tank/offsite/encrypted-vms

Offsite Replication to Remote Storage Boxes

Many hosted storage providers offer SFTP/SSH access with large storage quotas. The challenge is that these targets do not run ZFS, so you cannot use zfs receive remotely. Instead, serialize the stream to a file. (gzip is shown below, but raw encrypted streams are high-entropy and compress poorly, so it can be dropped for -w sends.)

SNAP_DATE=$(date +%Y-%m-%d)
DATASET="rpool/data/encrypted-vms"
PREV_SNAP="${DATASET}@$(date -d yesterday +%Y-%m-%d)"
TODAY_SNAP="${DATASET}@${SNAP_DATE}"

zfs send -w -i "$PREV_SNAP" "$TODAY_SNAP" \
  | pv \
  | gzip \
  | ssh -i /root/.ssh/storage_box \
      -p 23 \
      [email protected] \
      "cat > /backups/vms/incremental_${SNAP_DATE}.zfs.gz"
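Restoring from the storage box reverses the pipeline: stream each archive back through gunzip into zfs receive, full stream first, then each incremental in chronological order. The full-stream filename below is hypothetical, and the parent dataset tank/restore is assumed to exist:

```shell
# Replay the initial full stream...
ssh -i /root/.ssh/storage_box -p 23 [email protected] \
  "cat /backups/vms/full_2026-04-01.zfs.gz" \
  | gunzip | zfs receive tank/restore/encrypted-vms

# ...then each incremental on top of its base snapshot, oldest first
ssh -i /root/.ssh/storage_box -p 23 [email protected] \
  "cat /backups/vms/incremental_2026-04-02.zfs.gz" \
  | gunzip | zfs receive tank/restore/encrypted-vms
```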

For ZFS-capable offsite targets — a second Proxmox node, a VPS with ZFS — create a dedicated replication user with only the necessary permissions:

# On the remote backup host
useradd -m -s /bin/bash zfsrepl
zfs allow zfsrepl receive,create,mount,rollback,destroy tank/backups

Restrict the SSH key on the receiving end using command= in authorized_keys to prevent interactive shell access:

command="zfs receive -F tank/backups/vms",no-port-forwarding,no-x11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... zfsrepl@proxmox01

Retention Policies and Automated Pruning

Unconstrained snapshots accumulate indefinitely and eventually exhaust pool space. A tiered retention policy balances granularity against storage cost. A common scheme for production infrastructure:

  • Hourly snapshots — retain 24 (covering the last day)
  • Daily snapshots — retain 7 (covering the last week)
  • Weekly snapshots — retain 4 (covering the last month)
  • Monthly snapshots — retain 12 (covering the last year)
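Before reaching for dedicated tooling, the tiers above can be driven straight from cron by encoding the tier into the snapshot name, which later lets a pruning job match each tier with a simple grep. A sketch (note that % must be escaped inside crontabs):

```shell
# /etc/cron.d/zfs-snapshot-tiers (illustrative)
0 * * * * root zfs snapshot "rpool/data/vms@hourly-$(date +\%Y-\%m-\%d_\%H:\%M)"
0 0 * * * root zfs snapshot "rpool/data/vms@daily-$(date +\%Y-\%m-\%d)"
0 0 * * 1 root zfs snapshot "rpool/data/vms@weekly-$(date +\%Y-\%m-\%d)"
0 0 1 * * root zfs snapshot "rpool/data/vms@monthly-$(date +\%Y-\%m)"
```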

The sanoid tool implements this policy declaratively. Define a policy in /etc/sanoid/sanoid.conf:

[rpool/data/vms]
  use_template = production
  recursive = yes

[template_production]
  frequently = 0
  hourly = 24
  daily = 7
  weekly = 4
  monthly = 12
  autosnap = yes
  autoprune = yes
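Sanoid only acts when invoked, so the policy needs a periodic trigger; distribution packages typically ship a cron entry or systemd timer along these lines (the binary path may differ):

```shell
# /etc/cron.d/sanoid: snapshot and prune per /etc/sanoid/sanoid.conf
*/15 * * * * root TZ=UTC /usr/sbin/sanoid --cron
```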

The companion tool syncoid handles the replication side, automatically finding the most recent common snapshot between source and target and supporting resumable transfers after network interruptions. For custom scripted pruning:

#!/bin/bash
DATASET="rpool/data/vms"
KEEP_DAILY=7

zfs list -t snapshot -H -o name -s creation -d 1 "$DATASET" \
  | grep "@daily-" \
  | head -n -${KEEP_DAILY} \
  | while read -r snap; do
      echo "Destroying old snapshot: $snap"
      zfs destroy "$snap"
    done
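For comparison, syncoid collapses the send/receive plumbing into a single command per dataset: --sendoptions=w passes -w through for raw encrypted streams, and --no-sync-snap replicates only the snapshots sanoid already created. Hostnames follow the earlier examples:

```shell
# Replicate rpool/data/vms to backup01 using raw (encrypted) streams
syncoid --sendoptions=w --no-sync-snap \
  rpool/data/vms zfsrepl@backup01:tank/backups/vms
```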

Verification: Restore Tests and Checksum Validation

A backup that has never been tested is not a backup — it is a hypothesis. Automated restore verification should run weekly at minimum. The verification process has two components: checksum validation and functional restore.

ZFS scrubs validate the integrity of all data on a pool against stored checksums:

zpool scrub rpool
zpool status rpool | grep -A3 "scan:"
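Scrubs are I/O-intensive and can run for hours on large pools, so schedule them off-peak rather than launching them ad hoc; a monthly cadence is common:

```shell
# /etc/cron.d/zpool-scrub: scrub on the 1st of each month at 03:00
0 3 1 * * root /sbin/zpool scrub rpool
```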

For offsite file backups, generate checksums at send time and store them alongside the archive. Note that the digest below covers the uncompressed stream, so verification must decompress before hashing:

zfs send -w "$TODAY_SNAP" \
  | tee >(sha256sum > /tmp/checksum.txt) \
  | gzip > /tmp/backup.zfs.gz
scp /tmp/checksum.txt [email protected]:/backups/vms/backup_${SNAP_DATE}.sha256
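Because the stored digest covers the uncompressed stream, verification has to gunzip before hashing. A small helper (the function name and structure are illustrative) that can be exercised locally with ordinary files:

```shell
#!/bin/bash
# verify_stream_checksum CHECKSUM_FILE ARCHIVE_GZ
# The stored digest covers the *uncompressed* stream, so decompress first.
verify_stream_checksum() {
  local expected actual
  expected=$(awk '{print $1}' "$1")
  actual=$(gunzip -c "$2" | sha256sum | awk '{print $1}')
  if [ "$expected" = "$actual" ]; then
    echo "checksum OK"
  else
    echo "checksum MISMATCH" >&2
    return 1
  fi
}
```

Run it against the downloaded archive and the .sha256 file stored beside it; wiring its non-zero exit into the alerting hook turns silent corruption into a page.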

For functional restore tests, clone the backup snapshot to an isolated dataset and verify filesystem integrity:

#!/bin/bash
RESTORE_SNAP="tank/backups/vms@$(date -d 'last sunday' +%Y-%m-%d)"
TEST_DATASET="tank/restore-test/vms"

zfs destroy -r "$TEST_DATASET" 2>/dev/null
zfs clone -o mountpoint=/mnt/restore-test "$RESTORE_SNAP" "$TEST_DATASET"

if mountpoint -q "/mnt/restore-test"; then
  echo "RESTORE OK: $(date)" | tee -a /var/log/backup-verify.log
else
  echo "RESTORE FAILED: $(date)" | tee -a /var/log/backup-verify.log
  /usr/local/bin/send_alert.sh "ZFS restore verification failed on $(hostname)"
fi

zfs destroy -r "$TEST_DATASET"

Cron Automation with Logging and Alerting

The full backup pipeline — snapshot, replicate, prune, verify — should be fully automated with structured logging and failure alerts. A production cron layout on a Proxmox host:

# /etc/cron.d/zfs-backup
0 * * * * root /usr/local/bin/zfs-snapshot-hourly.sh >> /var/log/zfs-backup/hourly.log 2>&1
0 2 * * * root /usr/local/bin/zfs-replicate-offsite.sh >> /var/log/zfs-backup/replicate.log 2>&1
0 3 * * * root /usr/local/bin/zfs-prune.sh >> /var/log/zfs-backup/prune.log 2>&1
0 4 * * 0 root /usr/local/bin/zfs-verify-restore.sh >> /var/log/zfs-backup/verify.log 2>&1

Each script should log a structured entry on success and send an alert on non-zero exit. A minimal alerting wrapper using curl to post to a webhook:

send_alert() {
  local message="$1"
  curl -s -X POST https://alerts.example-corp.com/webhook \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"BACKUP ALERT on $(hostname): ${message}\"}"
}

set -eE  # -E propagates the ERR trap into functions and subshells
trap 'send_alert "Script failed at line $LINENO"' ERR

Practical Example: Proxmox Host Backing Up VM and CT Datasets

A Proxmox VE host stores VM disks under rpool/data as zvols (vm-<id>-disk-<n>) and container root filesystems as subvol-<id>-disk-<n> datasets. The following is a complete replication script for a Proxmox environment replicating to a secondary node at backup01.example-corp.com:

#!/bin/bash
set -euo pipefail

DATASETS=("rpool/data/vm-100-disk-0" "rpool/data/vm-101-disk-0" "rpool/data/subvol-200-disk-0")
REMOTE="backup01.example-corp.com"
REMOTE_POOL="tank/backups"
SNAP_NAME="auto-$(date +%Y-%m-%dT%H:%M)"
LOG="/var/log/zfs-backup/replicate.log"

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }

for DATASET in "${DATASETS[@]}"; do
  SHORTNAME=$(basename "$DATASET")
  REMOTE_DS="${REMOTE_POOL}/${SHORTNAME}"

  log "Snapshotting ${DATASET}@${SNAP_NAME}"
  zfs snapshot "${DATASET}@${SNAP_NAME}"

  # Find the snapshot taken just before the new one; on a first run
  # there is no prior snapshot, so fall back to a full send
  LAST_LOCAL=$(zfs list -t snapshot -H -o name -s creation -d 1 "$DATASET" | tail -2 | head -1)
  LAST_SNAP=$(echo "$LAST_LOCAL" | cut -d@ -f2)

  if [ "$LAST_SNAP" = "$SNAP_NAME" ]; then
    log "No previous snapshot; sending full stream to ${REMOTE}:${REMOTE_DS}"
    zfs send -w "${DATASET}@${SNAP_NAME}" \
      | ssh -i /root/.ssh/zfsrepl -c [email protected] "$REMOTE" \
          "zfs receive -F ${REMOTE_DS}"
  else
    log "Sending incremental ${LAST_SNAP} -> ${SNAP_NAME} to ${REMOTE}:${REMOTE_DS}"
    zfs send -w -i "${DATASET}@${LAST_SNAP}" "${DATASET}@${SNAP_NAME}" \
      | ssh -i /root/.ssh/zfsrepl -c [email protected] "$REMOTE" \
          "zfs receive -F ${REMOTE_DS}"
  fi

  log "Replication complete for ${SHORTNAME}"
done

log "All datasets replicated successfully"

Summary

A ZFS-based multi-site backup strategy delivers snapshot efficiency, encrypted offsite transfer, and automated lifecycle management in a single coherent toolchain. The key principles are: take frequent local snapshots, replicate incrementally to a secondary site with -w (raw) mode to preserve encryption end-to-end, enforce tiered retention policies with automated pruning, and run scheduled restore verification so failures are discovered during a drill — not a disaster. Combined with structured logging and alerting, this approach provides the observability needed to maintain confidence in backup integrity across heterogeneous infrastructure.
