Expanding the Kubernetes Cluster
With the Goldentooth cluster continuing to evolve, it was time to bring two more nodes into the Kubernetes fold... Karstark and Lipps, two Raspberry Pi 4Bs (4GB) that were just kinda sitting around.
The Current State
Before the expansion, our Kubernetes cluster consisted of:
- Control Plane (3 nodes): bettley, cargyll, dalt
- Workers (7 nodes): erenford, fenn, gardener, harlton, inchfield, jast, velaryon
Karstark and Lipps were already fully integrated into the cluster infrastructure:
- Both were part of the Consul service mesh as clients
- Both were configured as Nomad clients for workload scheduling
- Both were included in other cluster services like Ray and Slurm
However, they weren't yet part of the Kubernetes cluster, which meant we were missing out on their compute capacity for containerized workloads.
Installing Kubernetes Packages
The first step was to ensure both nodes had the necessary Kubernetes packages installed. Using the goldentooth CLI:
ansible-playbook -i inventory/hosts playbooks/install_k8s_packages.yaml --limit karstark,lipps
This playbook handled:
- Installing and configuring containerd as the container runtime
- Installing kubeadm, kubectl, and kubelet packages
- Setting up proper systemd cgroup configuration
- Enabling and starting the kubelet service
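A quick sanity check after the playbook run is cheap and worth it. The sketch below (assuming the stock containerd config path and Debian-style service names) confirms the systemd cgroup driver is set and that both services came up on the new nodes:

# Confirm the systemd cgroup driver is enabled in containerd (expects: SystemdCgroup = true)
goldentooth command_root karstark,lipps "grep SystemdCgroup /etc/containerd/config.toml"
# Confirm both services are running on the new nodes
goldentooth command_root karstark,lipps "systemctl is-active containerd kubelet"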
Both nodes successfully installed Kubernetes v1.32.7, which was slightly newer than the existing cluster nodes running v1.32.3.
The Challenge: Certificate Issues
When attempting to use the standard `goldentooth bootstrap_k8s` command, we ran into certificate verification issues. The bootstrap process was timing out when trying to communicate with the Kubernetes API server.
The error manifested as:
tls: failed to verify certificate: x509: certificate signed by unknown authority
This is a common issue in clusters that have been running for a while (393 days in our case) and have undergone certificate rotations or updates.
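A useful first diagnostic here is to ask kubeadm directly about the control plane's certificate state (a sketch; `kubeadm certs check-expiration` only reports expiry dates, it doesn't change anything):

# Report expiration dates for all kubeadm-managed certificates on a control-plane node
goldentooth command_root bettley "kubeadm certs check-expiration"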
The Solution: Manual Join Process
Instead of relying on the automated bootstrap, I took a more direct approach:
- Generate a join token from the control plane:

  goldentooth command_root bettley "kubeadm token create --print-join-command"

- Execute the join command on both nodes:

  goldentooth command_root karstark,lipps "kubeadm join 10.4.0.10:6443 --token yi3zz8.qf0ziy9ce7nhnkjv --discovery-token-ca-cert-hash sha256:0d6c8981d10e407429e135db4350e6bb21382af57addd798daf6c3c5663ac964 --skip-phases=preflight"

The `--skip-phases=preflight` flag was key here, as it bypassed the problematic preflight checks while still allowing the nodes to join successfully.
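One caveat worth noting: kubeadm join tokens expire after 24 hours by default, so the token above was generated fresh for this join rather than reused. Tokens currently known to the cluster can be inspected like so:

# List valid bootstrap tokens and their TTLs (run on a control-plane node)
goldentooth command_root bettley "kubeadm token list"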
Verification
After the join process completed, both nodes appeared in the cluster:
goldentooth command_root bettley "kubectl get nodes"
NAME        STATUS   ROLES           AGE    VERSION
bettley     Ready    control-plane   393d   v1.32.3
cargyll     Ready    control-plane   393d   v1.32.3
dalt        Ready    control-plane   393d   v1.32.3
erenford    Ready    <none>          393d   v1.32.3
fenn        Ready    <none>          393d   v1.32.3
gardener    Ready    <none>          393d   v1.32.3
harlton     Ready    <none>          393d   v1.32.3
inchfield   Ready    <none>          393d   v1.32.3
jast        Ready    <none>          393d   v1.32.3
karstark    Ready    <none>          53s    v1.32.7
lipps       Ready    <none>          54s    v1.32.7
velaryon    Ready    <none>          52d    v1.32.5
Perfect! Both nodes transitioned from "NotReady" to "Ready" status within about a minute, indicating that the Calico CNI networking had successfully configured them.
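If you want to see the CNI side of that directly, checking for the per-node Calico pods works; the sketch below assumes Calico was installed from the manifest and runs as the calico-node DaemonSet in kube-system (operator-based installs use the calico-system namespace instead):

# Confirm a calico-node pod is Running on each new worker
goldentooth command_root bettley "kubectl get pods -n kube-system -o wide | grep -E 'karstark|lipps'"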
The New Topology
Our Kubernetes cluster now consists of:
- Control Plane (3 nodes): bettley, cargyll, dalt
- Workers (9 nodes): erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps, velaryon (GPU)
This brings us to a total of 12 nodes in the Kubernetes cluster, matching the full complement of our Raspberry Pi bramble plus the x86 GPU node.
GPU Node Configuration
Velaryon, my x86 GPU node, required special configuration to ensure GPU workloads are only scheduled intentionally:
Hardware Specifications
- GPU: NVIDIA GeForce RTX 2070 (8GB VRAM)
- CPU: 24 cores (x86_64)
- Memory: 32GB RAM
- Architecture: amd64
Kubernetes Configuration
The node is configured with:
- Label: `gpu=true` for workload targeting
- Taint: `gpu=true:NoSchedule` to prevent accidental scheduling
- Architecture: `arch=amd64` for x86-specific workloads
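For reference, labels and taints like these are applied with plain kubectl; a sketch, assuming they weren't already set by the Ansible roles (the `arch=amd64` label here is the custom one described above, distinct from the built-in `kubernetes.io/arch` label):

# Label the node for GPU and architecture targeting
kubectl label node velaryon gpu=true arch=amd64
# Taint it so nothing schedules there without an explicit toleration
kubectl taint node velaryon gpu=true:NoSchedule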
Scheduling Requirements
To schedule workloads on Velaryon, pods must include:
tolerations:
- key: gpu
  operator: Equal
  value: "true"
  effect: NoSchedule
nodeSelector:
  gpu: "true"
This ensures that only workloads explicitly designed for GPU execution can access the expensive GPU resources, following the same intentional scheduling pattern used with Nomad.
GPU Resource Detection Challenge
While the taint-based scheduling was working correctly, getting Kubernetes to actually detect and expose the GPU resources proved more challenging. The NVIDIA device plugin is responsible for discovering GPUs and advertising them as `nvidia.com/gpu` resources that pods can request.
Initial Problem
The device plugin was failing with the error:
E0719 16:20:41.050191 1 factory.go:115] Incompatible platform detected
E0719 16:20:41.050193 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
Despite having installed the NVIDIA Container Toolkit and configuring containerd, the device plugin couldn't detect the NVML library from within its container environment.
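For reference, the host-side half of that setup is normally handled by the toolkit's own CLI; a sketch, assuming the standard nvidia-container-toolkit packaging on velaryon:

# Rewrite containerd's config to add the NVIDIA runtime, then restart it
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd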
The Root Cause
The issue was that the device plugin container couldn't access:
- NVIDIA Management Library: `libnvidia-ml.so.1`, needed for GPU discovery
- Device files: `/dev/nvidia*`, required for direct GPU communication
- Proper privileges: needed to interact with kernel-level GPU drivers
The Solution
After several iterations, the working configuration required:
Library Access:
volumeMounts:
- name: nvidia-ml-lib
  mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
  readOnly: true
- name: nvidia-ml-actual
  mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
  readOnly: true
Device Access:
volumeMounts:
- name: dev
  mountPath: /dev
volumes:
- name: dev
  hostPath:
    path: /dev
Container Privileges:
securityContext:
  privileged: true
Verification
Once properly configured, the device plugin successfully reported:
I0719 16:56:06.462937 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0719 16:56:06.463631 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0719 16:56:06.465420 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
The GPU resource then appeared in the node's capacity:
kubectl get nodes -o json | jq '.items[] | select(.metadata.name=="velaryon") | .status.capacity'
{
  "cpu": "24",
  "ephemeral-storage": "102626232Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32803048Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}
Testing GPU Resource Allocation
To verify the system was working end-to-end, I created a test pod that:
- Requests GPU resources: `nvidia.com/gpu: 1`
- Includes proper tolerations: to bypass the `gpu=true:NoSchedule` taint
- Targets the GPU node: using the `gpu: "true"` node selector
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-workload
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  nodeSelector:
    gpu: "true"
  containers:
  - name: gpu-test
    image: busybox
    command: ["sleep", "60"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
The pod successfully scheduled and the node showed:
nvidia.com/gpu 1 1
This confirmed that GPU resource allocation tracking was working correctly.
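(For context, those two `1` values are the Requests and Limits columns from the node's allocated-resources summary; something like the following surfaces them:)

# Show the "Allocated resources" section for the GPU node
goldentooth command_root bettley "kubectl describe node velaryon | grep -A 8 'Allocated resources'"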
Final NVIDIA Device Plugin Configuration
For reference, here's the complete working NVIDIA device plugin DaemonSet configuration:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        gpu: "true"
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
        - name: nvidia-ml-lib
          mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
          readOnly: true
        - name: nvidia-ml-actual
          mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
          readOnly: true
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-ml-lib
        hostPath:
          path: /lib/x86_64-linux-gnu/libnvidia-ml.so.1
      - name: nvidia-ml-actual
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
Key aspects of this configuration:
- Targeted deployment: only runs on nodes with the `gpu: "true"` label
- Taint tolerance: can schedule on nodes with the `gpu=true:NoSchedule` taint
- Privileged access: required for kernel-level GPU driver interaction
- Library binding: specific mounts for the NVIDIA ML library files
- Device access: full `/dev` mount for GPU device communication
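Rolling this out (or updating it later) is the standard apply-and-watch flow; a sketch, with the manifest filename being just a local convention:

# Apply the DaemonSet and wait for it to settle on the GPU node
kubectl apply -f nvidia-device-plugin.yaml
kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset
# Tail the plugin logs to confirm it registered with the kubelet
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=20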