GPU Storage NFS Export

With the cluster expanded to include the Velaryon GPU node, a natural question emerged: how can the Raspberry Pi cluster nodes efficiently exchange data with the GPU node for machine learning workloads and other compute-intensive tasks?

The solution was to leverage Velaryon's secondary 1TB NVMe SSD and expose it to the entire cluster via NFS, creating a high-speed shared storage pool specifically for Pi-GPU data exchange.

The Challenge

Velaryon came with two storage devices:

  • Primary NVMe (nvme1n1): Linux system drive
  • Secondary NVMe (nvme0n1): 1TB drive with old NTFS partitions from previous Windows installation

The goal was to repurpose this secondary drive as shared storage while maintaining architectural separation: the GPU node should provide storage services without becoming a structural component of the Pi cluster.

Storage Architecture Decision

Rather than integrating Velaryon into the existing storage ecosystem (ZFS replication, Ceph distributed storage), I opted for a simpler approach:

  • Pure ext4: Single partition consuming the entire 1TB drive
  • NFS export: Simple, performant network filesystem
  • Subnet-wide access: Available to all 10.4.x.x nodes

This keeps the GPU node loosely coupled while providing the needed functionality.

Implementation

Drive Preparation

First, I cleared the old NTFS partitions and created a fresh GPT layout:

# Clear existing partition table
sudo wipefs -af /dev/nvme0n1

# Create new GPT partition table and single partition
sudo parted /dev/nvme0n1 mklabel gpt
sudo parted /dev/nvme0n1 mkpart primary ext4 0% 100%

# Format as ext4
sudo mkfs.ext4 -L gpu-storage /dev/nvme0n1p1

The resulting filesystem has UUID 5bc38d5b-a7a4-426e-acdb-e5caf0a809d9 and is mounted persistently at /mnt/gpu-storage.
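The UUID comes straight from blkid, and the persistent mount is an fstab entry keyed on it. A minimal sketch of the check and the resulting fstab line follows; the dump/pass fields and default options are assumptions, and the entry itself is written by the playbook shown later:

# Read the filesystem UUID and label of the new partition
sudo blkid /dev/nvme0n1p1

# Corresponding /etc/fstab entry (managed by ansible.builtin.mount below)
UUID=5bc38d5b-a7a4-426e-acdb-e5caf0a809d9  /mnt/gpu-storage  ext4  defaults  0  2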

NFS Server Configuration

Velaryon was configured as an NFS server with a single export:

# /etc/exports
/mnt/gpu-storage 10.4.0.0/20(rw,sync,no_root_squash,no_subtree_check)

This grants read-write access to the entire infrastructure subnet with synchronous writes for data integrity.
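On Velaryon the export can be activated and sanity-checked with the standard NFS utilities; the client-side check below assumes the hostname velaryon resolves on the cluster network:

# Re-read /etc/exports and list the active exports on the server
sudo exportfs -ra
sudo exportfs -v

# From any Pi node, confirm the export is visible
showmount -e velaryon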

Ansible Integration

Rather than manually configuring each node, I integrated the GPU storage into the existing Ansible automation:

Inventory Updates (inventory/hosts):

nfs_server:
  hosts:
    allyrion:    # Existing NFS server
    velaryon:    # New GPU storage server

Host Variables (inventory/host_vars/velaryon.yaml):

nfs_exports:
  - "/mnt/gpu-storage 10.4.0.0/20(rw,sync,no_root_squash,no_subtree_check)"

Global Configuration (group_vars/all/vars.yaml):

nfs:
  mounts:
    primary:      # Existing allyrion NFS share
      share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
      mount: '/mnt/nfs'
      safe_name: 'mnt-nfs'
      type: 'nfs'
      options: {}
    gpu_storage:  # New GPU storage share
      share: "{{ hostvars['velaryon'].ipv4_address }}:/mnt/gpu-storage"
      mount: '/mnt/gpu-storage'
      safe_name: 'mnt-gpu\x2dstorage'  # Systemd unit name escaping
      type: 'nfs'
      options: {}

Systemd Automount Configuration

The trickiest part was configuring the systemd automount units. Systemd derives unit names from mount paths by escaping special characters: the leading slash is dropped, path separators become dashes, and literal dashes are escaped as \x2d. The mount point /mnt/gpu-storage therefore corresponds to the unit name mnt-gpu\x2dstorage.
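The escaping is easy to get wrong by hand, so it is worth letting systemd compute it (expected output shown as a comment):

# Ask systemd for the unit name corresponding to the mount path
systemd-escape --path --suffix=mount /mnt/gpu-storage
# -> mnt-gpu\x2dstorage.mount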

Mount Unit Template (templates/mount.j2):

[Unit]
Description=Mount {{ item.key }}

[Mount]
What={{ item.value.share }}
Where={{ item.value.mount }}
Type={{ item.value.type }}
{% if item.value.options -%}
Options={{ item.value.options | join(',') }}
{% else -%}
Options=defaults
{% endif %}

[Install]
WantedBy=default.target

Automount Unit Template (templates/automount.j2):

[Unit]
Description=Automount {{ item.key }}
After=remote-fs-pre.target network-online.target network.target
Before=umount.target remote-fs.target

[Automount]
Where={{ item.value.mount }}
TimeoutIdleSec=60

[Install]
WantedBy=default.target
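Once the role has rendered and installed these templates, the resulting units can be inspected on any node; the unit names below assume the escaped form discussed above:

# Show the generated unit files and the automount state
systemctl cat 'mnt-gpu\x2dstorage.mount' 'mnt-gpu\x2dstorage.automount'
systemctl status 'mnt-gpu\x2dstorage.automount'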

Deployment Playbook

A new playbook setup_gpu_storage.yaml orchestrates the entire deployment:

---
# Setup GPU storage on Velaryon with NFS export
- name: 'Setup Velaryon GPU storage and NFS export'
  hosts: 'velaryon'
  become: true
  tasks:
    - name: 'Ensure GPU storage mount point exists'
      ansible.builtin.file:
        path: '/mnt/gpu-storage'
        state: 'directory'
        owner: 'root'
        group: 'root'
        mode: '0755'

    - name: 'Check if GPU storage is mounted'
      ansible.builtin.command:
        cmd: 'mountpoint -q /mnt/gpu-storage'
      register: gpu_storage_mounted
      failed_when: false
      changed_when: false

    - name: 'Mount GPU storage if not already mounted'
      ansible.builtin.mount:
        src: 'UUID=5bc38d5b-a7a4-426e-acdb-e5caf0a809d9'
        path: '/mnt/gpu-storage'
        fstype: 'ext4'
        opts: 'defaults'
        state: 'mounted'
      when: gpu_storage_mounted.rc != 0

- name: 'Configure NFS exports on Velaryon'
  hosts: 'velaryon'
  become: true
  roles:
    - 'geerlingguy.nfs'

- name: 'Setup NFS mounts on all nodes'
  hosts: 'all'
  become: true
  roles:
    - 'goldentooth.setup_nfs_mounts'
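The play depends on the third-party geerlingguy.nfs role, so that has to be installed before the first run. The direct invocation below is a sketch; the inventory and playbook paths are assumptions, and the goldentooth CLI wraps an equivalent call:

# Install the NFS server role from Ansible Galaxy
ansible-galaxy install geerlingguy.nfs

# Run the playbook directly, if not going through the goldentooth CLI
ansible-playbook -i inventory/hosts setup_gpu_storage.yaml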

Usage

The GPU storage is now seamlessly integrated into the goldentooth CLI:

# Deploy/update GPU storage configuration
goldentooth setup_gpu_storage

# Enable automount on specific nodes
goldentooth command allyrion 'sudo systemctl enable --now "mnt-gpu\x2dstorage.automount"'

# Verify access (automounts on first access)
goldentooth command cargyll,bettley 'ls /mnt/gpu-storage/'
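A quick way to confirm that the automount actually resolved to a live NFS mount, using the same goldentooth command wrapper:

# Check that the path is backed by an NFS mount rather than the local directory
goldentooth command cargyll 'findmnt /mnt/gpu-storage'
goldentooth command cargyll 'df -h /mnt/gpu-storage'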

Results

The implementation provides:

  • 1TB shared storage available cluster-wide at /mnt/gpu-storage
  • Automatic mounting via systemd automount on directory access
  • Full Ansible automation via the goldentooth CLI
  • Clean separation between Pi cluster and GPU node architectures

Data written from any node is immediately visible across the cluster, enabling seamless Pi-GPU workflows for machine learning datasets, model artifacts, and computational results.
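In practice that workflow is just writing to the shared path on one node and reading it on another; the dataset layout and training script below are purely illustrative:

# On a Pi node: stage a dataset into the shared storage
mkdir -p /mnt/gpu-storage/datasets
cp -r ~/datasets/cifar10 /mnt/gpu-storage/datasets/

# On Velaryon: train against the same path and write artifacts back
python train.py \
  --data /mnt/gpu-storage/datasets/cifar10 \
  --out /mnt/gpu-storage/models/cifar10-run1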