ZFS and Replication

So remember back in chapters 28 and 31 when I set up NFS exports using a USB thumbdrive? Obviously my crowning achievement as an infrastructure engineer.

After living with that setup for a bit, I finally got my hands on some SSDs. Not new ones, mind you – these are various drives I've accumulated over the years. Eight of them, to be precise:

  • 3x 120GB SSDs
  • 3x ~450GB SSDs
  • 2x 1TB SSDs

Time to do something more serious with storage.

The Storage Strategy

I spent way too much time researching distributed storage options. GlusterFS? Apparently dead. Lustre? Way overkill for a Pi cluster, and the complexity-to-benefit ratio is terrible. BeeGFS? Same story.

So I decided to split the drives across three different storage systems:

  • ZFS for the 3x 120GB drives – rock solid, great snapshot support, and I already know it
  • Ceph for the 3x 450GB drives – the gold standard for distributed block storage in Kubernetes
  • SeaweedFS for the 2x 1TB drives – interesting distributed object storage that's simpler than MinIO

Today we're tackling ZFS, because I actually have experience with it and it seemed like the easiest place to start.

The ZFS Setup

I created a role called goldentooth.setup_zfs to handle all of this. The basic idea is to set up ZFS on nodes that have SSDs attached, create datasets for shared storage, and then use Sanoid for snapshot management and Syncoid for replication between nodes.

First, let's install ZFS and configure it for the Pi's limited RAM:

- name: 'Install ZFS.'
  ansible.builtin.apt:
    name:
      - 'zfsutils-linux'
      - 'zfs-dkms'
      - 'zfs-zed'
      - 'sanoid'
    state: 'present'
    update_cache: true

- name: 'Configure ZFS Event Daemon.'
  ansible.builtin.lineinfile:
    path: '/etc/zfs/zed.d/zed.rc'
    regexp: '^#?ZED_EMAIL_ADDR='
    line: 'ZED_EMAIL_ADDR="{{ my.email }}"'
  notify: 'Restart ZFS-zed service.'

- name: 'Limit ZFS ARC to 1GB of RAM.'
  ansible.builtin.lineinfile:
    path: '/etc/modprobe.d/zfs.conf'
    line: 'options zfs zfs_arc_max=1073741824'
    create: true
  notify: 'Update initramfs.'

That ARC limit is important – by default ZFS will happily eat half your RAM for caching, which is not great when you only have 8GB to start with.
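
After updating the initramfs and rebooting, it's worth confirming the cap actually took. Nothing role-specific here, just the standard OpenZFS kernel interfaces:

# The module parameter as the kernel sees it, in bytes:
cat /sys/module/zfs/parameters/zfs_arc_max

# Current ARC size versus its ceiling ("size" and "c_max"):
awk '/^(size|c_max) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats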

Creating the Pool

The pool creation is straightforward. I'm not doing anything fancy like RAID-Z because I only have one SSD per node:

- name: 'Create ZFS pool.'
  ansible.builtin.command: |
    zpool create {{ zfs.pool.name }} {{ zfs.pool.device }}
  args:
    creates: "/{{ zfs.pool.name }}"
  when: ansible_hostname == 'allyrion'

Wait, why when: ansible_hostname == 'allyrion'? Well, it turns out I'm only creating the pool here on the primary node; the other nodes will receive the data via replication. (One wrinkle: zfs receive needs a pool to receive into, so the replica nodes each carry an rpool of their own as well, but everything inside it comes from allyrion.) This is a bit different from a typical ZFS setup where each node would manage its own pool independently, but it makes sense for my use case.
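
For concreteness, here's roughly what the pool (and the dataset the role creates next) boil down to on allyrion, using the names that show up later in this chapter (rpool, data, and a /data mountpoint) and a stand-in device of /dev/sda; the real device is whatever lsblk says the SSD is:

# Single-device pool, no redundancy; replication is the safety net here.
zpool create rpool /dev/sda

# The shared dataset the role creates, mounted where the later
# 'zfs list' output shows it (/data):
zfs create -o mountpoint=/data rpool/data

zpool status rpool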

Sanoid for Snapshots

Sanoid is a fantastic tool for managing ZFS snapshots. It handles creating snapshots on a schedule and pruning old ones according to a retention policy. The configuration is pretty simple:

# Primary dataset for source snapshots
[{{ zfs.pool.name }}/{{ zfs.datasets[0].name }}]
	use_template = production
	recursive = yes
	autosnap = yes
	autoprune = yes

[template_production]
	frequently = 0
	hourly = 36
	daily = 30
	monthly = 3
	yearly = 0
	autosnap = yes
	autoprune = yes

This keeps 36 hourly snapshots, 30 daily snapshots, and 3 monthly snapshots. No yearly snapshots because, let's be honest, this cluster probably won't last that long without me completely rebuilding it.
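
Before trusting the schedule, it's worth running Sanoid by hand once to make sure the policy parses and snapshots actually show up. These are stock sanoid flags, nothing specific to my setup:

# Take snapshots and prune expired ones in one go; --verbose shows
# which datasets were touched:
sanoid --cron --verbose

# Or exercise the two halves separately while tuning the policy:
sanoid --take-snapshots --verbose
sanoid --prune-snapshots --verbose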

Syncoid for Replication

Here's where it gets interesting. Syncoid is Sanoid's companion tool that handles ZFS replication. It's basically a smart wrapper around zfs send and zfs receive that handles all the complexity of incremental replication.
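
If you've never hand-rolled this, one replication pass looks roughly like the sketch below (using the snapshot names that appear later in this chapter, with gardener as the target). Syncoid automates exactly this, plus figuring out the last common snapshot, resuming interrupted sends, and cleaning up its own sync snapshots:

# Seed the replica with a full send of the first snapshot:
zfs send rpool/data@autosnap_2025-07-18_18:13:17_hourly \
  | ssh root@gardener zfs receive rpool/data

# After that, each pass is an incremental send from the newest snapshot
# the replica already has up to the newest one on the source:
zfs send -I rpool/data@autosnap_2025-07-18_18:13:17_hourly \
            rpool/data@autosnap_2025-07-18_19:00:03_hourly \
  | ssh root@gardener zfs receive rpool/data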

I set up systemd services and timers to handle the replication:

[Unit]
Description=Syncoid ZFS replication to %i
After=zfs-import.target
Requires=zfs-import.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid --no-privilege-elevation {{ zfs.pool.name }}/{{ zfs.datasets[0].name }} root@%i:{{ zfs.pool.name }}/{{ zfs.datasets[0].name }}
StandardOutput=journal
StandardError=journal

The %i is systemd template magic – it gets replaced with whatever comes after the @ in the service name. So syncoid@gardener.service would replicate to gardener.
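
A nice side effect of template units is that you can kick off a single target by hand without waiting for the timer. On allyrion, for example:

# Run one replication pass to gardener right now and see how it went:
systemctl start syncoid@gardener.service
journalctl -u syncoid@gardener.service --no-pager -n 20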

The timer runs every 15 minutes:

[Unit]
Description=Syncoid ZFS replication to %i timer
Requires=syncoid@%i.service

[Timer]
OnCalendar=*:0/15
RandomizedDelaySec=60
Persistent=true

[Install]
WantedBy=timers.target
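
Once the timers are enabled, systemd will happily tell you when each replication pass last ran and when the next one is due:

# One line per instantiated timer, with last and next activation times:
systemctl list-timers 'syncoid@*'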

SSH Configuration for Replication

Of course, Syncoid needs to SSH between nodes to do the replication. Initially, I tried to set this up with a separate SSH key for ZFS replication. That turned into such a mess that it actually motivated me to finally implement SSH certificates properly (see the previous chapter).

After setting up SSH certificates, I could simplify the configuration to just reference the certificates:

- name: 'Configure SSH config for ZFS replication using certificates.'
  ansible.builtin.blockinfile:
    path: '/root/.ssh/config'
    create: true
    mode: '0600'
    block: |
      # ZFS replication configuration using SSH certificates
      {% for host in groups['zfs'] %}
      {% if host != inventory_hostname %}
      Host {{ host }}
        HostName {{ hostvars[host]['ipv4_address'] }}
        User root
        CertificateFile /etc/step/certs/root_ssh_key-cert.pub
        IdentityFile /etc/step/certs/root_ssh_key
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null
      {% endif %}
      {% endfor %}

Much cleaner! No more key management, just point to the certificates that are already being automatically renewed. Sometimes a little pain is exactly what you need to motivate doing things the right way.
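
A quick way to prove the whole chain works is a non-interactive SSH hop from allyrion to a replica; BatchMode makes it fail loudly instead of prompting if the certificate isn't being picked up:

# Should print the replica's hostname, with no password and no host-key prompt:
goldentooth command_root allyrion 'ssh -o BatchMode=yes gardener hostname'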

The Topology

The way I set this up, only the first node in the zfs group (allyrion) actually creates datasets and takes snapshots. The other nodes just receive replicated data:

- name: 'Enable and start Syncoid timers for replication targets.'
  ansible.builtin.systemd:
    name: "syncoid@{{ item }}.timer"
    enabled: true
    state: 'started'
  loop: "{{ groups['zfs'] | reject('eq', inventory_hostname) | list }}"
  when:
    - groups['zfs'] | length > 1
    - inventory_hostname == groups['zfs'][0]  # Only run on first ZFS node (allyrion)

This creates a hub-and-spoke topology where allyrion is the primary and replicates to all other ZFS nodes. It's not the most resilient topology (if allyrion dies, no new snapshots), but it's simple and works for my needs.
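
In practice, for each replica that task is just the Ansible spelling of:

# Run on allyrion, once per replication target (gardener shown here):
systemctl enable --now syncoid@gardener.timer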

Does It Work?

Let's check using the goldentooth CLI:

$ goldentooth command allyrion 'zfs list'
allyrion | CHANGED | rc=0 >>
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool        546K   108G    24K  /rpool
rpool/data    53K   108G    25K  /data

Nice! The pool is there. Now let's look at snapshots:

$ goldentooth command allyrion 'zfs list -t snapshot'
allyrion | CHANGED | rc=0 >>
NAME                                                        USED  AVAIL  REFER  MOUNTPOINT
rpool/data@autosnap_2025-07-18_18:13:17_monthly               0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_daily                 0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_hourly                0B      -    24K  -
rpool/data@autosnap_2025-07-18_19:00:03_hourly                0B      -    24K  -
rpool/data@autosnap_2025-07-18_20:00:10_hourly                0B      -    24K  -
...
rpool/data@autosnap_2025-07-19_14:00:15_hourly                0B      -    24K  -
rpool/data@syncoid_allyrion_2025-07-19:10:45:32-GMT-04:00     0B      -    25K  -

Excellent! Sanoid is creating snapshots hourly, daily, and monthly. That last snapshot with the "syncoid" prefix shows that replication is happening too.

And on the replica nodes? Let me check gardener, one of the replication targets:

$ goldentooth command gardener 'zfs list'
gardener | CHANGED | rc=0 >>
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool        600K   108G    25K  /rpool
rpool/data    53K   108G    25K  /rpool/data

The replica has the same dataset structure. And the snapshots?

$ goldentooth command gardener 'zfs list -t snapshot | head -5'
gardener | CHANGED | rc=0 >>
NAME                                                        USED  AVAIL  REFER  MOUNTPOINT
rpool/data@autosnap_2025-07-18_18:13:17_monthly               0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_daily                 0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_hourly                0B      -    24K  -
rpool/data@autosnap_2025-07-18_19:00:03_hourly                0B      -    24K  -

Perfect! The snapshots are being replicated from allyrion to gardener. The replication is working.

Performance

How's the performance? Well... it's ZFS on a single SSD connected to a Raspberry Pi. It's not going to win any benchmarks:

$ goldentooth command_root allyrion 'dd if=/dev/zero of=/data/test bs=1M count=100'
allyrion | CHANGED | rc=0 >>
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.205277 s, 511 MB/s

511 MB/s writes! That's... suspiciously good for a Pi with a SATA SSD over USB3. The catch is that dd returns as soon as ZFS has absorbed the writes into memory, so this is a burst number rather than sustained throughput. Even so, it's plenty fast for shared configuration files, build artifacts, and other cluster data.
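
If I ever want a number the cache can't flatter, one extra dd flag forces a flush before the timing is reported:

# conv=fsync makes dd fsync the output file before printing its stats,
# so the elapsed time includes actually getting the data onto the SSD:
goldentooth command_root allyrion 'dd if=/dev/zero of=/data/test bs=1M count=100 conv=fsync'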