SeaweedFS: Distributed Object Storage (Take Two)

I needed distributed object storage for the cluster. Harbor needs an S3-compatible backend, and I'd rather not rely on external cloud storage for a local cluster. SeaweedFS is perfect for this – it's lightweight, designed for ARM, and provides S3-compatible APIs out of the box.

I've deployed SeaweedFS before (chapter 63), but that was ages ago, before the Talos migration. Time to do it properly with the operator pattern and USB SSD storage.

The Architecture

SeaweedFS has three main components:

Masters (3 replicas): Handle metadata and coordination using Raft consensus
Volume Servers (6 replicas): Store the actual data, each on a dedicated USB SSD
Filer (1 replica): Provides the S3-compatible API gateway

The plan: deploy everything via the SeaweedFS Operator, using local-path-usb-provisioner to provision storage on USB SSDs attached to 6 specific nodes.

The Deployment

I reorganized the SeaweedFS directory structure into operator and cluster subdirectories, then deployed:

# Master configuration (Raft HA)
master:
  replicas: 3
  config: |
    raftHashicorp = true
    defaultReplication = "001"  # 2 total copies

# Volume servers on USB SSDs
volume:
  replicas: 6
  nodeSelector:
    storage.seaweedfs/volume: "true"
  requests:
    storage: 100Gi
  storageClassName: local-path-usb

# Filer with S3 API
filer:
  replicas: 1
  s3:
    enabled: true

Flux picked it up, the operator deployed, masters came up, filer started... and then 4 of 6 volume servers were stuck Pending.

The Problem: Inchfield's Corrupt Partition Table

Checking the pending volumes:

$ kubectl get pods -n seaweedfs | grep volume
goldentooth-storage-volume-0   0/1     Pending   0          2m
goldentooth-storage-volume-1   1/1     Running   0          2m
goldentooth-storage-volume-2   1/1     Running   0          2m
goldentooth-storage-volume-3   1/1     Running   0          2m
goldentooth-storage-volume-4   0/1     Pending   0          2m
goldentooth-storage-volume-5   1/1     Running   0          2m

Both pending volumes were scheduled to inchfield. The PVCs were bound, but the helper pods that format the volumes were failing:

$ kubectl logs helper-pod-create-pvc-... -n local-path-usb-provisioner
mkdir: can't create directory '/var/mnt/usb/...': Read-only file system

Read-only filesystem? That's weird. The Talos volume manager should have mounted /var/mnt/usb from the USB disk. Let me check:

$ talosctl -n inchfield get volumestatuses
NAMESPACE   TYPE           ID      VERSION   PHASE   LOCATION
runtime     VolumeStatus   u-usb   3         failed  /dev/sda1

$ talosctl -n inchfield get volumestatuses u-usb -o yaml
spec:
  phase: failed
  error: "error probing disk: open /dev/sda1: no such file or directory"

The volume manager sees the partition in its discovery scan, but /dev/sda1 doesn't exist as a device node. That's a partition table problem.

Looking at the kernel's view:

$ talosctl -n inchfield read /proc/partitions | grep sda
   8        0  976762584 sda

No sda1 partition! Compare with a working node:

$ talosctl -n gardener read /proc/partitions | grep sda
   8        0  117220824 sda
   8        1  117219328 sda1

The kernel can't see any partitions on inchfield's disk. The GPT partition table is corrupt or missing.

The Fix: Nuke It From Orbit

Time to rebuild the disk. I created a privileged pod on inchfield to partition and format the disk:

apiVersion: v1
kind: Pod
metadata:
  name: disk-formatter-inchfield
  namespace: local-path-usb-provisioner
spec:
  nodeSelector:
    kubernetes.io/hostname: inchfield
  restartPolicy: Never
  hostNetwork: true
  hostPID: true
  securityContext:
    privileged: true
  containers:
  - name: formatter
    image: ubuntu:22.04
    command: ["/bin/bash", "-c"]
    args: ["apt-get update && apt-get install -y gdisk xfsprogs && sleep 3600"]
    volumeMounts:
    - name: dev
      mountPath: /dev
  volumes:
  - name: dev
    hostPath:
      path: /dev

But when I tried to format, Talos Volume Manager had the device locked. Even after wiping the partition table, the kernel kept using the GPT backup table at the end of the disk.

Attempt 1: Wipe just the beginning

$ kubectl exec disk-formatter-inchfield -- dd if=/dev/zero of=/dev/sda bs=1M count=100

Nope. Partition still there (restored from backup GPT).

Attempt 2: Properly zap both GPT tables

$ kubectl exec disk-formatter-inchfield -- sgdisk --zap-all /dev/sda
GPT data structures destroyed!
Warning: The kernel is still using the old partition table.

Closer, but the kernel and Talos still had locks on the device.

Attempt 3: Reboot the node

$ talosctl -n inchfield reboot

After the reboot, I checked the disk:

$ talosctl -n inchfield get discoveredvolumes | grep usb
inchfield   runtime     DiscoveredVolume   sda1   2   partition   1.0 TB   xfs   u-usb

Wait, what? It's already XFS with the u-usb label?

Turns out there was an old XFS filesystem on the disk from a previous setup. The corrupt GPT was just hiding it. The reboot cleared Talos' locks and allowed it to discover the filesystem properly.

The Second Problem: Permission Denied

With the disk working, the volume pods started... and immediately crashed:

$ kubectl logs goldentooth-storage-volume-0 -n seaweedfs
Folder /data0 Permission: -rwxr-xr-x
F1118 03:27:47 cannot generate uuid of dir /data0: failed to write uuid
  to /data0/vol_dir.uuid: open /data0/vol_dir.uuid: permission denied

The helper pod created the directory as root with 755 permissions, but SeaweedFS runs as uid 1000 (non-root). Checking a working volume:

$ kubectl exec goldentooth-storage-volume-1 -n seaweedfs -- ls -ld /data0
drwxrwxrwx    2 root     root           143 Nov 18 03:05 /data0

777 permissions on working volumes, 755 on inchfield's. The helper pod on inchfield must have created it with restrictive permissions.

Quick fix with another privileged pod:

$ kubectl run permission-fixer --rm -i --restart=Never \
  --overrides='{"spec":{"nodeSelector":{"kubernetes.io/hostname":"inchfield"},"hostNetwork":true,"containers":[{"name":"fixer","image":"busybox","command":["sh","-c","chmod -R 777 /var/mnt/usb"],"securityContext":{"privileged":true},"volumeMounts":[{"name":"usb","mountPath":"/var/mnt/usb"}]}],"volumes":[{"name":"usb","hostPath":{"path":"/var/mnt/usb"}}]}}' \
  --image=busybox -n local-path-usb-provisioner
Permissions fixed!

$ kubectl delete pods goldentooth-storage-volume-0 goldentooth-storage-volume-4 -n seaweedfs

Success

A minute later:

$ kubectl get pods -n seaweedfs
NAME                           READY   STATUS    RESTARTS       AGE
goldentooth-storage-filer-0    1/1     Running   0              36m
goldentooth-storage-master-0   1/1     Running   1 (36m ago)    36m
goldentooth-storage-master-1   1/1     Running   1 (36m ago)    36m
goldentooth-storage-master-2   1/1     Running   1 (36m ago)    36m
goldentooth-storage-volume-0   1/1     Running   1              2m
goldentooth-storage-volume-1   1/1     Running   1              29m
goldentooth-storage-volume-2   1/1     Running   1              29m
goldentooth-storage-volume-3   1/1     Running   1              29m
goldentooth-storage-volume-4   1/1     Running   1              2m
goldentooth-storage-volume-5   1/1     Running   1              29m

All green! The cluster now has:

600GB of distributed object storage across 6 nodes
S3-compatible API ready for Harbor
Automatic replication (2 copies of each object)
Fault tolerance via Raft consensus

I need to get Longhorn or smth else running so I can have an RWM volume and HA for the Filer. Probably. IDK.

Key Learnings

GPT has backup tables – Wiping just the beginning of a disk isn't enough. GPT keeps a backup partition table at the end, and the kernel will restore from it. Use sgdisk --zap-all.
Talos Volume Manager is persistent – Even after wiping partition data, Talos caches volume information. A reboot was needed to fully release locks.
local-path provisioner permission issues – Helper pods run as root and create directories with restrictive permissions. Applications running as non-root need 777 permissions on the mount point.
Partition table corruption is sneaky – Talos' DiscoveredVolumes controller scans for filesystem UUIDs directly, so it can "see" filesystems even when the partition table is corrupt. But without valid partition entries, the kernel won't create device nodes, preventing actual mounts.

Goldentooth