Introduction

Who am I?

Who I am

A portrait of the author in the form he will assume over the course of this project, having returned to our time to warn his present self against pursuing this course of action.

My name is Nathan Douglas. The best source of information about my electronic life is probably my GitHub profile. It almost certainly would not be my LinkedIn profile. I also have a blog about non-computer-related stuff here.

What Do I Do?

What I'll Be Doing The author in his eventual form advising the author in his present form not to do the thing, and why.

I've been trying to get computers to do what I want, with mixed success, since the early mid-nineties. I earned my Bachelor's in Computer Science from the University of Nevada at Las Vegas in 2011, and I've been working as a Software/DevOps engineer ever since, depending on gig.

I consider DevOps a methodology and a role, in that I try to work in whatever capacity I can to improve the product delivery lifecycle and shorten delivery lead time. I generally do the work that is referred to as "DevOps" or "platform engineering" or "site reliability engineering", but I try to emphasize the theoretical aspects, e.g. Lean Management, sytems thinking, etc. That's not to say that I'm an expert, but just that I try to keep the technical details grounded in the philosophical justifications, the big picture.

Update (2025-04-05): At present I consider myself more of a platform engineer. I'm trying to move into an MLOps space, though, and from there into High-Performance Computing. I also would like to eventually shift into deep tech research and possibly get my PhD in mathematics or computer science.

Background

VMWare Workstation

"What would you do if you had an AMD K6-2 333MHz and 96MB RAM?" "I'd run two copies of Windows 98, my dude."

At some point in the very early 00's, I believe, I first encountered VMWare and the idea that I could run a computer inside of another computer. That wasn't the first time I'd encountered a virtual machine -- I'd played with Java in the '90's, and played Zork and other Infocom and Inform games -- but it might've been the first time that I really understood the idea.

And I made use of it. For a long time – most of my twenties – I was occupied by a writing project. I maintained a virtual machine that ran a LAMP server and hosted various content management systems and related technologies: raw HTML pages, MediaWiki, DokuWiki, Drupal, etc, all to organize my thoughts on this and other projects. Along the way, I learned a whole lot about this sort of deployment: namely, that it was a pain in the ass.

I finally abandoned that writing project around the time Docker came out. I immediately understood what it was: a less tedious VM. (Admittedly, my understanding was not that sophisticated.) I built a decent set of skills with Docker and used it wherever I could. I thought Docker was about as good as it got.

At some point around 2016 or 2017, I became aware of Kubernetes. I immediately built a 4-node cluster with old PCs, doing a version of Kubernetes the Hard Way on bare metal, and then shifted to a custom system with four VMWare VMs that PXE booted, setup a CoreOS configuration with Ignition and what was then called Matchbox, and formed into a self-healing cluster with some neat toys like GlusterFS, etc. Eventually, though, I started neglecting the cluster and tore it down.

Around 2021, my teammates and I started considering a Kubernetes-based infrastructure for our applications, so I got back into it. I set up a rather complicated infrastructure on a three-node Proxmox VE cluster that would create three three-node Kubernetes clusters using LXC containers. From there I explored ArgoCD and GitOps and Helm and some other things that I hadn't really played with before. But again, my interest waned and the cluster didn't actually get much action.

A large part of this, I think, is that I didn't trust it to run high-FAF (Family Acceptance Factor) apps, like Plex, etc. After all, this was supposed to be a cluster I could tinker with, and tear down and destroy and rebuild at any time with a moment's notice. So in practice, this ended up being a toy cluster.

And while I'd gone through Kubernetes the Hard Way (twice!), I got the irritating feeling that I hadn't really learned all that much. I'd done Linux From Scratch, and had run Gentoo for several years, so I was no stranger to the idea of following a painfully manual process filled with shell commands and waiting for days for my computer to be useful again. And I did learn a lot from all three projects, but, for whatever reason, it didn't stick all that well.

Motivation

In late 2023, my team's contract concluded, and there was a possibility I might be laid off. My employer quickly offered me a position on another team, which I happily and gratefully accepted, but I had already applied to several other positions. I had some promising paths forward, but... not as many as I would like. It was an unnerving experience.

Not everyone is using Kubernetes, of course, but it's an increasingly essential skill in my field. There are other skills I have – Ansible, Terraform, Linux system administration, etc – but I'm not entirely comfortable with my knowledge of Kubernetes, so I'd like to deepen and broaden that as effectively as possible.

Goals

I want to get really good at Kubernetes. Not just administering it, but having a good understanding of what is going on under the hood at any point, and how best to inspect and troubleshoot and repair a cluster.

I want to have a fertile playground for experimenting; something that is not used for other purposes, not expected to be stable, ideally not even accessed by anyone else. Something I can do the DevOps equivalent of destroy with an axe, without consequences.

I want to document everything I've learned exhaustively. I don't want to take a command for granted, or copy and paste, or even copying and pasting after nodding thoughtfully at a wall of text. I want to embed things deeply into my thiccc skull.

Generally, I want to be beyond prepared for my CKA, CKAD, and CKS certification exams. I hate test anxiety. I hate feeling like there are gaps in my knowledge. I want to go in confident, and I want my employers and teammates to be confident of my abilities.

Approach

This is largely going to consist of me reading documentation and banging my head against the wall. I'll provide links to the relevant information, and type out the commands, but I also want to persist this in Infrastructure-as-Code. Consequently, I'll link to Ansible tasks/roles/playbooks for each task as well.

Cluster Hardware

I went with a PicoCluster 10H. I'm well aware that I could've cobbled something together and spent much less money; I have indeed done the thing with a bunch of Raspberry Pis screwed to a board and plugged into an Anker USB charger and a TP-Link switch.

I didn't want to do that again, though. For one, I've experienced problems with USB chargers seeming to lose power over time, and some small switches getting flaky when powered from USB. I liked the power supply of the PicoCluster and its cooling configuration. I liked that it did pretty much exactly what I wanted, and if I had problems I could yell at someone else about it rather than getting derailed by hardware rabbit holes.

I also purchased ten large heatsinks with fans, specifically these. There were others I liked a bit more, and these interfered with the standoffs that were used to build each stack of five Raspberry Pis, but these seemed as though they would likely be the most reliable in the long run.

I purchased SanDisk 128GB Extreme microSDXC cards for local storage. I've been using SanDisk cards for years with no significant issues or complaints.

The individual nodes are Raspberry Pi 4B/8GB. As of the time I'm writing this, Raspberry Pi 5s are out, and they offer very substantial benefits over the 4B. That said, they also have higher energy consumption, lower availability, and so forth. I'm opting for a lower likelihood of surprises because, again, I just don't want to spend much time dealing with hardware and I don't expect performance to hinder me.

Technical Specifications

Complete Node Inventory

The cluster consists of 13 nodes with specific roles and configurations:

Raspberry Pi Nodes (12 total):

  • allyrion (10.4.0.10) - NFS server, HAProxy load balancer, Docker host
  • bettley (10.4.0.11) - Kubernetes control plane, Consul server, Vault server
  • cargyll (10.4.0.12) - Kubernetes control plane, Consul server, Vault server
  • dalt (10.4.0.13) - Kubernetes control plane, Consul server, Vault server
  • erenford (10.4.0.14) - Kubernetes worker, Ray head node, ZFS storage
  • fenn (10.4.0.15) - Kubernetes worker, Ceph storage node
  • gardener (10.4.0.16) - Kubernetes worker, Grafana host, ZFS storage
  • harlton (10.4.0.17) - Kubernetes worker
  • inchfield (10.4.0.18) - Kubernetes worker, Loki log aggregation
  • jast (10.4.0.19) - Kubernetes worker, Step-CA certificate authority
  • karstark (10.4.0.20) - Kubernetes worker, Ceph storage node
  • lipps (10.4.0.21) - Kubernetes worker, Ceph storage node

x86 GPU Node:

  • velaryon (10.4.0.30) - AMD Ryzen 9 3900X, 32GB RAM, NVIDIA RTX 2070 Super

Hardware Architecture

Raspberry Pi 4B Specifications:

  • CPU: ARM Cortex-A72 quad-core @ 2.0GHz (overclocked from 1.5GHz)
  • RAM: 8GB LPDDR4
  • Storage: SanDisk 128GB Extreme microSDXC (UHS-I Class 10)
  • Network: Gigabit Ethernet (onboard)
  • GPIO: Used for fan control (pin 14) and hardware monitoring

Performance Optimizations:

arm_freq=2000
over_voltage=6

These overclocking settings provide approximately 33% performance increase while maintaining thermal stability with active cooling.

Network Infrastructure

Network Segmentation:

  • Infrastructure CIDR: 10.4.0.0/20 - Physical network backbone
  • Service CIDR: 172.16.0.0/20 - Kubernetes virtual services
  • Pod CIDR: 192.168.0.0/16 - Container networking
  • MetalLB Range: 10.4.11.0/24 - Load balancer IP allocation

MAC Address Registry: Each node has documented MAC addresses for network boot and management:

  • Raspberry Pi nodes: d8:3a:dd:* and dc:a6:32:* prefixes
  • x86 node: 2c:f0:5d:0f:ff:39 (velaryon)

Storage Architecture

Distributed Storage Strategy:

NFS Shared Storage:

  • Server: allyrion exports /mnt/usb1
  • Clients: All 13 nodes mount at /mnt/nfs
  • Use Cases: Configuration files, shared datasets, cluster coordination

ZFS Storage Pool:

  • Nodes: allyrion, erenford, gardener
  • Pool: rpool with rpool/data dataset
  • Features: Snapshots, replication, compression
  • Optimization: 128MB ARC limit for Raspberry Pi RAM constraints

Ceph Distributed Storage:

  • Nodes: fenn, karstark, lipps
  • Purpose: Highly available distributed block and object storage
  • Integration: Kubernetes persistent volumes

Thermal Management

Cooling Configuration:

  • Heatsinks: Large aluminum heatsinks with 40mm fans per node
  • Fan Control: GPIO-based temperature control at 60°C threshold
  • Airflow: PicoCluster chassis provides directed airflow path
  • Monitoring: Temperature sensors exposed via Prometheus metrics

Thermal Performance:

  • Idle: ~45-50°C ambient
  • Load: ~60-65°C under sustained workload
  • Throttling: No thermal throttling observed during normal operations

Power Architecture

Power Supply:

  • Input: Single AC connection to PicoCluster power distribution
  • Per Node: 5V/3A regulated power (avoiding USB charger degradation)
  • Efficiency: ~90% efficiency at typical load
  • Redundancy: Single point of failure by design (acceptable for lab environment)

Power Consumption:

  • Raspberry Pi: ~8W idle, ~15W peak per node
  • Total Pi Load: ~96W idle, ~180W peak (12 nodes)
  • x86 Node: ~150W idle, ~300W peak
  • Cluster Total: ~250W idle, ~480W peak

Hardware Monitoring

Metrics Collection:

  • Node Exporter: Hardware sensors, thermal data, power metrics
  • Prometheus: Centralized metrics aggregation
  • Grafana: Real-time dashboards with thermal and performance alerts

Monitored Parameters:

  • CPU temperature and frequency
  • Memory usage and availability
  • Storage I/O and capacity
  • Network interface statistics
  • Fan speed and cooling device status

Reliability Considerations

Hardware Resilience:

  • No RAID: Individual node failure acceptable (distributed applications)
  • Network Redundancy: Single switch (acceptable for lab)
  • Power Redundancy: Single PSU (lab environment limitation)
  • Cooling Redundancy: Individual fan failure affects single node only

Failure Recovery:

  • Kubernetes: Automatic pod rescheduling on node failure
  • Consul/Vault: Multi-node quorum survives single node loss
  • Storage: ZFS replication and Ceph redundancy provide data protection

Future Expansion

Planned Upgrades:

  • SSD Storage: USB 3.0 SSD expansion for high-IOPS workloads
  • Network Upgrades: Potential 10GbE expansion via USB adapters
  • Additional GPU: PCIe expansion for ML workloads

Frequently Asked Questions

So, how do you like the PicoCluster so far?

I have no complaints. Putting it together was straightforward; the documentation was great, everything was labeled correctly, etc. Cooling seems very adequate and performance and appearance are perfect.

The integrated power supply has been particularly reliable compared to previous experiences with USB charger-based setups. The structured cabling and chassis design make maintenance and monitoring much easier than ad-hoc Raspberry Pi clusters.

Have you considered adding SSDs for mass storage?

Yes, and I have some cables and spare SSDs for doing so. I'm not sure if I actually will. We'll see.

The current storage architecture with ZFS pools on USB-attached SSDs and distributed Ceph storage has proven adequate for most workloads. The microSD cards handle the OS and container storage well, while shared storage needs are met through the NFS and distributed storage layers.

Meet the Nodes

It's generally frowned upon nowadays to treat servers like "pets" as opposed to "cattle". And, indeed, I'm trying not to personify these little guys too much, but... you can have my custom MOTD, hostnames, and prompts when you pry them from my cold, dead fingers.

The nodes are identified with a letter A-J and labeled accordingly on the ethernet port so that if one needs to be replaced or repaired, that can be done with a minimum of confusion. Then, I gave each the name of a noble house from A Song of Ice and Fire and gave it a MOTD (based on the coat of arms) and a themed Bash prompt.

In my experience, when I'm working in multiple servers simultaneously, it's good for me to have a bright warning sign letting me know, as unambiguously as possible, what server I'm actually logged in on. (I've never blown up prod thinking it was staging, but if I'm shelled into prod I'm deeply concerned about that possibility)

This is just me being a bit over-the-top, I guess.

✋ Allyrion

  • Prompt: Link
  • MoTD: Link
  • Role: Load Balancer
  • MAC Address: d8:3a:dd:8a:7d:aa
  • IP Address: 10.4.0.10

🐞 Bettley

  • Prompt: Link
  • MoTD: Link
  • Role: Control Plane 1
  • MAC Address: d8:3a:dd:89:c1:0b
  • IP Address: 10.4.0.11

🦢 Cargyll

  • Prompt: Link
  • MoTD: Link
  • Role: Worker
  • MAC Address: d8:3a:dd:8a:7d:ef
  • IP Address: 10.4.0.12

🍋 Dalt

  • Prompt: Link
  • MoTD: Link
  • Role: Worker
  • MAC Address: d8:3a:dd:8a:7e:9a
  • IP Address: 10.4.0.13

🦩 Erenford

  • Prompt: Link
  • MoTD: Link
  • Role: Worker
  • MAC Address: d8:3a:dd:8a:80:3c
  • IP Address: 10.4.0.14

🌺 Fenn

  • Prompt: Link
  • MoTD: Link
  • Role: Control Plane 2
  • MAC Address: d8:3a:dd:89:ef:61
  • IP Address: 10.4.0.15

🧤 Gardener

  • Prompt: Link
  • MoTD: Link
  • Role: Control Plane 3
  • MAC Address: d8:3a:dd:89:aa:7d
  • IP Address: 10.4.0.16

🌳 Harlton

  • Prompt: Link
  • MoTD: Link
  • Role: Worker
  • MAC Address: d8:3a:dd:89:f9:23
  • IP Address: 10.4.0.17

🏁 Inchfield

  • Prompt: Link
  • MoTD: Link
  • Role: Worker
  • MAC Address: d8:3a:dd:89:fa:fc
  • IP Address: 10.4.0.18

🦁 Jast

  • Prompt: Link
  • MoTD: Link
  • Role: Worker
  • MAC Address: d8:3a:dd:89:f0:4b
  • IP Address: 10.4.0.19

Node Configuration

After physically installing and setting up the nodes, the next step is to perform basic configuration. You can see the Ansible playbook I use for this, which currently runs the following roles:

  • goldentooth.configure:
    • Set timezone; last thing I need to do when working with computers is having to perform arithmetic on times and dates.
    • Set keybord layout; this should be set already, but I want to be sure.
    • Enable overclocking; I've installed an adequate cooling system to support the Pis running full-throttle at their full spec clock.
    • Enable fan control; the heatsinks I've installed include fans to prevent CPU throttling under heavy load.
    • Enable and configure certain cgroups; this allows Kubernetes to manage and limit resources on the system.
      • cpuset: This is used to manage the assignment of individual CPUs (both physical and logical) and memory nodes to tasks running in a cgroup. It allows for pinning processes to specific CPUs and memory nodes, which can be very useful in a containerized environment for performance tuning and ensuring that certain processes have dedicated CPU time. Kubernetes can use cpuset to ensure that workloads (containers/pods) have dedicated processing resources. This is particularly important in multi-tenant environments or when running workloads that require guaranteed CPU cycles. By controlling CPU affinity and ensuring that processes are not competing for CPU time, Kubernetes can improve the predictability and efficiency of applications.
      • memory: This is used to limit the amount of memory that tasks in a cgroup can use. This includes both RAM and swap space. It provides mechanisms to monitor memory usage and enforce hard or soft limits on the memory available to processes. When a limit is reached, the cgroup can trigger OOM (Out of Memory) killer to select and kill processes exceeding their allocation. Kubernetes uses the memory cgroup to enforce memory limits specified for pods and containers, preventing a single workload from consuming all available memory, which could lead to system instability or affect other workloads. It allows for better resource isolation, efficient use of system resources, and ensures that applications adhere to their specified resource limits, promoting fairness and reliability.
      • hugetlb: This is used to manage huge pages, a feature of modern operating systems that allows the allocation of memory in larger blocks (huge pages) compared to standard page sizes. This can significantly improve performance for certain workloads by reducing the overhead of page translation and increasing TLB (Translation Lookaside Buffer) hits. Some applications, particularly those dealing with large datasets or high-performance computing tasks, can benefit significantly from using huge pages. Kubernetes can use it to allocate huge pages to these workloads, improving performance and efficiency. This is not going to be a concern for my use, but I'm enabling it anyway simply because it's recommended.
    • Disable swap. Kubernetes doesn't like swap by default, and although this can be worked around, I'd prefer to avoid swapping on SD cards. I don't really expect a high memory pressure condition anyway.
    • Set preferred editor; I like nano, although I can (after years of practice) safely and reliably exit vi.
    • Set certain kernel modules to load at boot:
      • overlay: This supports OverlayFS, a type of union filesystem. It allows one filesystem to be overlaid on top of another, combining their contents. In the context of containers, OverlayFS can be used to create a layered filesystem that combines multiple layers into a single view, making it efficient to manage container images and writable container layers.
      • br_netfilter: This allows bridged network traffic to be filtered by iptables and ip6tables. This is essential for implementing network policies, including those related to Network Address Translation (NAT), port forwarding, and traffic filtering. Kubernetes uses it to enforce network policies that control ingress and egress traffic to pods and between pods. This is crucial for maintaining the security and isolation of containerized applications. It also enables the necessary manipulation of traffic for services to direct traffic to pods, and for pods to communicate with each other and the outside world. This includes the implementation of services, load balancing, and NAT for pod networking. And by allowing iptables to filter bridged traffic, br_netfilter helps Kubernetes manage network traffic more efficiently, ensuring consistent network performance and reliability across the cluster.
    • Load above kernel modules on every boot.
    • Set some kernel parameters:
      • net.bridge.bridge-nf-call-iptables: This allows iptables to inspect and manipulate the traffic that passes through a Linux bridge. A bridge is a way to connect two network segments, acting somewhat like a virtual network switch. When enabled, it allows iptables rules to be applied to traffic coming in or going out of a bridge, effectively enabling network policies, NAT, and other iptables-based functionalities for bridged traffic. This is essential in Kubernetes for implementing network policies that control access to and from pods running on the same node, ensuring the necessary level of network isolation and security.
      • net.bridge.bridge-nf-call-ip6tables: As above, but for IPv6 traffic.
      • net.ipv4.ip_forward: This controls the ability of the Linux kernel to forward IP packets from one network interface to another, a fundamental capability for any router or gateway. Enabling IP forwarding is crucial for a node to route traffic between pods, across different nodes, or between pods and the external network. It allows the node to act as a forwarder or router, which is essential for the connectivity of pods across the cluster, service exposure, and for pods to access the internet or external resources when necessary.
    • Add SSH public key to root's authorized keys; this is already performed for my normal user by Raspberry Pi Imager.
  • goldentooth.set_hostname: Set the hostname of the node (including a line in /etc/hosts). This doesn't need to be a separate role, obviously. I just like the structure as I have it.
  • goldentooth.set_motd: Set the MotD, as described in the previous chapter.
  • goldentooth.set_bash_prompt: Set the Bash prompt, as described in the previous chapter.
  • goldentooth.setup_security: Some basic security configuration. Currently, this just uses Jeff Geerling's ansible-role-security to perform some basic tasks, like setting up unattended upgrades, etc, but I might expand this in the future.

Raspberry Pi Imager doesn't allow you to specify an SSH key for the root user, so I do this in goldentooth.configure. However, I also have Kubespray installed (for when I want things to Just Work™), and Kubespray expects the remote user to be root. As a result, I specify that the remote user is my normal user account in the configure_cluster playbook. This means a lot of become: true in the roles, but I would prefer eventually to ditch Kubespray and disallow root login via SSH.

Anyway, we need to rerun goldentooth.set_bash_prompt, but as the root user. This almost never matters, since I prefer to SSH as a normal user and use sudo, but I like my prompts and you can't take them away from me.

With the nodes configured, we can start talking about the different roles they will serve.

Cluster Roles and Responsibilities

Observations:

  • The cluster has a single power supply but two power distribution units (PDUs) and two network switches, so it seems reasonable to segment the cluster into left and right halves.
  • I want high availability, which requires a control plane capable of a quorum, so a minimum of three nodes in the control plane.
  • I want to use a dedicated external load balancer for the control plane rather than configure my existing Opnsense firewall/router. (I'll have to do that to enable MetalLB via BGP, sadly.)
  • So that would yield one load balancer, three control plane nodes, and six worker nodes.
  • With the left-right segmentation, I can locate one load balancer and one control plane node on the left side, two control plane nodes on the right side, and three worker nodes on each side.

This isn't really high-availability; the cluster has multiple single points of failure:

  • the load balancer node
  • whichever network switch is connected to the upstream
  • the power supply
  • the PDU powering the LB
  • the PDU powering the upstream switch
  • etc.

That said, I find those acceptable given the nature of this project.

Load Balancer

Allyrion, the first node alphabetically and the top node on the left side, will run a load balancer. I had a number of options here, but I ended up going with HAProxy. HAProxy was my introduction to load balancing, reverse proxying, and so forth, and I have kind of a soft spot for it.

I'd also considered Traefik, which I use elsewhere in my homelab, but I believe I'll use it as an ingress controller. Similarly, I think I prefer to use Nginx on a per-application level. I'm pursuing this project first and foremost to learn and to document my learning, and I'd prefer to cover as much ground as possible, and as clearly as possible, and I believe I can do this best if I don't have to worry about having to specify which installation of $proxy I'm referring to at any given time.

So:

  • HAProxy: Load balancer
  • Traefik: Ingress controller
  • Nginx: Miscellaneous

Control Plane

Bettley (the second node on the left side), Gardener, and Harlton (the first and second nodes on the right side) will be the control plane nodes.

It's common, in small home Kubernetes clusters, to remove the control plane taint (node-role.kubernetes.io/control-plane) to allow miscellaneous pods to be scheduled on the control plane nodes. I won't be doing that here; six worker nodes should be sufficient for my purposes, and I'll try (where possible and practical) to follow best practices. That said, I might find some random fun things to run on my control plane nodes, and I'll adjust their tolerations accordingly.

Workers

The remaining nodes (Cargyll, Dalt, and Erenford on the left, and Harlton, Inchfield, and Jast on the right) are dedicated workers. What sort of workloads will they run?

Well, probably nothing interesting. Not Plex, not torrent clients or *darrs. Mostly logging, metrics, and similar. I'll probably end up gathering a lot of data about data. And that's fine – these Raspberry Pis are running off SD cards; I don't really want them to be doing anything interesting anyway.

Network Topology

In case you don't quite have a picture of the infrastructure so far, it should look like this:

Network Topology

Frequently Asked Questions

Why didn't you make Etcd high-availability?

It seems like I'd need that cluster to have a quorum too, so we're talking about three nodes for the control plane, three nodes for Etcd, one for the load balancer, and, uh, three worker nodes. That's a bit more than I'd like to invest, and I'd like to avoid doubling up anywhere (although I'll probably add additional functionality to the load balancer). I'm interested in the etcd side of things, but not really enough to compromise elsewhere. I could be missing something obvious, though; if so, please let me know.

Why didn't you just do A=load balancer, B-D=control plane, and E-J=workers?

I could've and should've and still might. But because I'm a bit of a fool and wasn't really paying attention, I put A-E on the left and F-J on the right, rather than A,C,E,G,I on the left and B,D,F,H,J on the right, which would've been a bit cleaner. As it is, I need to think a second about which nodes are control nodes, since they aren't in a strict alphabetical order.

I might adjust this in the future; it should be easy to do so, after all, I just don't particularly want to take the cluster apart and rebuild it, especially since the standoffs were kind of messy as a consequence of the heatsinks.

Load Balancer

This cluster should have a high-availability control plane, and we can start laying the groundwork for that immediately.

This might sound complex, but all we're doing is:

  • creating a load balancer
  • configuring the load balancer to use all of the control plane nodes as a list of backends
  • telling anything that sends requests to a control plane node to send them to the load balancer instead

High-Availability for Dummies

As mentioned before, we're using HAProxy as a load balancer. First, though, I'll install rsyslog, a log processing system. It will gather logs from HAProxy and deposit them in a more ergonomic location.

$ sudo apt install -y rsyslog

At least at the time of writing (February 2024), rsyslog on Raspberry Pi OS includes a bit of configuration that relocates HAProxy logs:

# /etc/rsyslog.d/49-haproxy.conf

# Create an additional socket in haproxy's chroot in order to allow logging via
# /dev/log to chroot'ed HAProxy processes
$AddUnixListenSocket /var/lib/haproxy/dev/log

# Send HAProxy messages to a dedicated logfile
:programname, startswith, "haproxy" {
  /var/log/haproxy.log
  stop
}

In Raspberry Pi OS, installing and configuring HAProxy is a simple matter.

$ sudo apt install -y haproxy

Here is the configuration I'm working with for HAProxy at the time of writing (February 2024); I've done my best to comment it thoroughly. You can also see the Jinja2 template and the role that deploys the template to configure HAProxy.

# /etc/haproxy/haproxy.cfg

# This is the HAProxy configuration file for the load balancer in my Kubernetes
# cluster. It is used to load balance the API server traffic between the
# control plane nodes.

# Global parameters
global
  # Sets uid for haproxy process.
  user haproxy
  # Sets gid for haproxy process.
  group haproxy
  # Sets the maximum per-process number of concurrent connections.
  maxconn 4096
  # Configure logging.
  log /dev/log local0
  log /dev/log local1 notice

# Default parameters
defaults
  # Use global log configuration.
  log global

# Frontend configuration for the HAProxy stats page.
frontend stats-frontend
  # Listen on all IPv4 addresses on port 8404.
  bind *:8404
  # Use HTTP mode.
  mode http
  # Enable the stats page.
  stats enable
  # Set the URI to access the stats page.
  stats uri /stats
  # Set the refresh rate of the stats page.
  stats refresh 10s
  # Set the realm to access the stats page.
  stats realm HAProxy\ Statistics
  # Set the username and password to access the stats page.
  stats auth nathan:<redacted>
  # Hide HAProxy version to improve security.
  stats hide-version

# Kubernetes API server frontend configuration.
frontend k8s-api-server
  # Listen on the IPv4 address of the load balancer on port 6443.
  bind 10.4.0.10:6443
  # Use TCP mode, which means that the connection will be passed to the server
  # without TLS termination, etc.
  mode tcp
  # Enable logging of the client's IP address and port.
  option tcplog
  # Use the Kubernetes API server backend.
  default_backend k8s-api-server

# Kubernetes API server backend configuration.
backend k8s-api-server
  # Use TCP mode, not HTTPS.
  mode tcp
  # Sets the maximum time to wait for a connection attempt to a server to
  # succeed.
  timeout connect 10s
  # Sets the maximum inactivity time on the client side. I might reduce this at
  # some point.
  timeout client 86400s
  # Sets the maximum inactivity time on the server side. I might reduce this at
  # some point.
  timeout server 86400s
  # Sets the load balancing algorithm.
  # `roundrobin` means that each server is used in turns, according to their
  # weights.
  balance roundrobin
  # Enable health checks.
  option tcp-check
  # For each control plane node, add a server line with the node's hostname and
  # IP address.
  # The `check` parameter enables health checks.
  # The `fall` parameter sets the number of consecutive health check failures
  # after which the server is considered to be down.
  # The `rise` parameter sets the number of consecutive health check successes
  # after which the server is considered to be up.
  server bettley 10.4.0.11:6443 check fall 3 rise 2
  server fenn 10.4.0.15:6443 check fall 3 rise 2
  server gardener 10.4.0.16:6443 check fall 3 rise 2

This enables the HAProxy stats frontend, which allows us to gain some insight into the operation of the frontend in something like realtime.

HAProxy Stats

We see that our backends are unavailable, which is of course expected at this time. We can also read the logs, in /var/log/haproxy.log:

$ cat /var/log/haproxy.log
2024-02-21T07:03:16.603651-05:00 allyrion haproxy[1305383]: [NOTICE]   (1305383) : haproxy version is 2.6.12-1+deb12u1
2024-02-21T07:03:16.603906-05:00 allyrion haproxy[1305383]: [NOTICE]   (1305383) : path to executable is /usr/sbin/haproxy
2024-02-21T07:03:16.604085-05:00 allyrion haproxy[1305383]: [WARNING]  (1305383) : Exiting Master process...
2024-02-21T07:03:16.607180-05:00 allyrion haproxy[1305383]: [ALERT]    (1305383) : Current worker (1305385) exited with code 143 (Terminated)
2024-02-21T07:03:16.607558-05:00 allyrion haproxy[1305383]: [WARNING]  (1305383) : All workers exited. Exiting... (0)
2024-02-21T07:03:16.771133-05:00 allyrion haproxy[1305569]: [NOTICE]   (1305569) : New worker (1305572) forked
2024-02-21T07:03:16.772082-05:00 allyrion haproxy[1305569]: [NOTICE]   (1305569) : Loading success.
2024-02-21T07:03:16.775819-05:00 allyrion haproxy[1305572]: [WARNING]  (1305572) : Server k8s-api-server/bettley is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:16.776309-05:00 allyrion haproxy[1305572]: Server k8s-api-server/bettley is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:16.776584-05:00 allyrion haproxy[1305572]: Server k8s-api-server/bettley is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.423831-05:00 allyrion haproxy[1305572]: [WARNING]  (1305572) : Server k8s-api-server/fenn is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.424229-05:00 allyrion haproxy[1305572]: Server k8s-api-server/fenn is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.424446-05:00 allyrion haproxy[1305572]: Server k8s-api-server/fenn is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.653803-05:00 allyrion haproxy[1305572]: Connect from 10.0.2.162:53155 to 10.4.0.10:8404 (stats-frontend/HTTP)
2024-02-21T07:03:17.677482-05:00 allyrion haproxy[1305572]: Connect from 10.0.2.162:53156 to 10.4.0.10:8404 (stats-frontend/HTTP)
2024-02-21T07:03:18.114561-05:00 allyrion haproxy[1305572]: [WARNING]  (1305572) : Server k8s-api-server/gardener is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:18.115141-05:00 allyrion haproxy[1305572]: [ALERT]    (1305572) : backend 'k8s-api-server' has no server available!
2024-02-21T07:03:18.115560-05:00 allyrion haproxy[1305572]: Server k8s-api-server/gardener is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:18.116133-05:00 allyrion haproxy[1305572]: Server k8s-api-server/gardener is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:18.117560-05:00 allyrion haproxy[1305572]: backend k8s-api-server has no server available!
2024-02-21T07:03:18.118458-05:00 allyrion haproxy[1305572]: backend k8s-api-server has no server available!

This is fine and dandy, and will be addressed in future chapters.

Container Runtime

Kubernetes is a container orchestration platform and therefore requires some container runtime to be installed.

This is a simple step; containerd is well-supported, well-regarded, and I don't have any reason not to use it.

I used Jeff Geerling's Ansible role to install and configure containerd on my cluster; this is really the point at which some kind of IaC/configuration management system becomes something more than a polite suggestion 🙂

Configuration Details

The containerd installation and configuration is managed through several key components:

Ansible Role Configuration

The geerlingguy.containerd role is specified in my requirements.yml and configured with these critical variables in group_vars/all/vars.yaml:

# geerlingguy.containerd configuration
containerd_package: 'containerd.io'
containerd_package_state: 'present' 
containerd_service_state: 'started'
containerd_service_enabled: true
containerd_config_cgroup_driver_systemd: true  # Critical for Kubernetes integration

Runtime Integration with Kubernetes

The most important aspect of the containerd configuration is its integration with Kubernetes. The cluster explicitly configures the CRI socket path:

kubernetes:
  cri_socket_path: 'unix:///var/run/containerd/containerd.sock'

This socket path is used throughout the kubeadm initialization and join processes, ensuring Kubernetes can communicate with the container runtime.

Systemd Cgroup Driver

The configuration sets SystemdCgroup = true in the containerd configuration file (/etc/containerd/config.toml), which is essential because:

  1. Kubernetes 1.22+ requires systemd cgroup driver for kubelet
  2. Consistency: Both kubelet and containerd must use the same cgroup driver
  3. Resource Management: Enables proper CPU/memory limits enforcement

Generated Configuration

The Ansible role generates a complete containerd configuration with these key settings:

# Runtime configuration
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true  # Critical for Kubernetes cgroup management

# Socket configuration  
[grpc]
address = "/run/containerd/containerd.sock"

Installation Process

The Ansible role performs these steps:

  1. Repository Setup: Adds Docker CE repository (containerd.io package source)
  2. Package Installation: Installs containerd.io package
  3. Default Config Generation: Runs containerd config default to generate base config
  4. Systemd Cgroup Modification: Patches config to set SystemdCgroup = true
  5. Service Management: Enables and starts containerd service

Architecture Support

The configuration automatically handles ARM64 architecture for the Raspberry Pi nodes through architecture detection in the Ansible variables, ensuring proper package selection for both ARM64 (Pi nodes) and AMD64 (x86 nodes).

Troubleshooting Tools

The installation also provides crictl (Container Runtime Interface CLI) for debugging and inspecting containers directly at the runtime level, which proves invaluable when troubleshooting Kubernetes pod issues.

The container runtime installation is handled in my install_k8s_packages.yaml playbook, which is where we'll be spending some time in subsequent sections.

Networking

Kubernetes uses three different networks:

  • Infrastructure: The physical or virtual backbone connecting the machines hosting the nodes. The infrastructure network enables connectivity between the nodes; this is essential for the Kubernetes control plane components (like the kube-apiserver, etcd, scheduler, and controller-manager) and the worker nodes to communicate with each other. Although pods communicate with each other via the pod network (overlay network), the underlying infrastructure network supports this by facilitating the physical or virtual network paths between nodes.
  • Service: This is a purely virtual and internal network. It allows services to communicate with each other and with Pods seamlessly. This network layer abstracts the actual network details from the services, providing a consistent and simplified interface for inter-service communication. When a Service is created, it is automatically assigned a unique IP address from the service network's address space. This IP address is stable for the lifetime of the Service, even if the Pods that make up the Service change. This stable IP address makes it easier to configure DNS or other service discovery mechanisms.
  • Pod: This is a crucial component that allows for seamless communication between pods across the cluster, regardless of which node they are running on. This networking model is designed to ensure that each pod gets its own unique IP address, making it appear as though each pod is on a flat network where every pod can communicate with every other pod directly without NAT.

My infrastructure network is already up and running at 10.4.0.0/20. I'll configure my service network at 172.16.0.0/20 and my pod network at 192.168.0.0/16.

Network Architecture Implementation

CIDR Block Allocations

The goldentooth cluster uses a carefully planned network segmentation strategy:

  • Infrastructure Network: 10.4.0.0/20 - Physical network backbone
  • Service Network: 172.16.0.0/20 - Kubernetes virtual services
  • Pod Network: 192.168.0.0/16 - Container-to-container communication
  • MetalLB Range: 10.4.11.0/24 - Load balancer service IPs

Physical Network Topology

The cluster consists of:

Control Plane Nodes (High Availability):

  • bettley (10.4.0.11), cargyll (10.4.0.12), dalt (10.4.0.13)

Load Balancer and Services:

  • allyrion (10.4.0.10) - HAProxy load balancer, NFS server

Worker Nodes:

  • 8 Raspberry Pi ARM64 workers: erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps
  • 1 x86 GPU worker: velaryon (10.4.0.30)

CNI Implementation: Calico

The cluster uses Calico as the Container Network Interface (CNI) plugin. Calico is configured during the kubeadm initialization:

kubeadm init \
  --control-plane-endpoint="10.4.0.10:6443" \
  --service-cidr="172.16.0.0/20" \
  --pod-network-cidr="192.168.0.0/16" \
  --kubernetes-version="stable-1.32"

Calico provides:

  • Layer 3 networking with BGP routing
  • Network policies for microsegmentation
  • Cross-node pod communication without overlay networks
  • Integration with the existing BGP infrastructure

Load Balancer Architecture

HAProxy Configuration: The cluster uses HAProxy running on allyrion (10.4.0.10) to provide high availability for the Kubernetes API server:

  • Frontend: Listens on port 6443
  • Backend: Round-robin load balancing across all three control plane nodes
  • Health Checks: TCP-based health checks with fall=3, rise=2 configuration
  • Monitoring: Prometheus metrics endpoint on port 8405

This ensures the cluster remains accessible even if individual control plane nodes fail.

BGP Integration with MetalLB

The cluster implements BGP-based load balancing using MetalLB:

Router Configuration (OPNsense with FRR):

  • Router AS Number: 64500
  • Cluster AS Number: 64501
  • BGP Peer: Router at 10.4.0.1

MetalLB Configuration:

spec:
  myASN: 64501
  peerASN: 64500
  peerAddress: 10.4.0.1
  addressPool: '10.4.11.0 - 10.4.15.254'

This allows Kubernetes LoadBalancer services to receive real IP addresses that are automatically routed through the network infrastructure.

Static Route Management

The networking Ansible role automatically:

  1. Detects the primary network interface using ip route show 10.4.0.0/20
  2. Adds static routes for the MetalLB range: ip route add 10.4.11.0/24 dev <interface>
  3. Persists routes in /etc/network/interfaces.d/<interface>.cfg for boot persistence

Service Discovery and DNS

The cluster implements comprehensive service discovery:

  • Cluster Domain: goldentooth.net
  • Node Domain: nodes.goldentooth.net
  • Services Domain: services.goldentooth.net
  • External DNS: Automated DNS record management via external-dns operator

Network Security

Certificate-Based Security:

  • Step-CA: Provides automated certificate management for all services
  • TLS Everywhere: All inter-service communication is encrypted
  • SSH Certificates: Automated SSH certificate provisioning

Service Mesh Integration:

  • Consul: Provides service discovery and health checking across both Kubernetes and Nomad
  • Network Policies: Configured but not strictly enforced by default

Multi-Orchestrator Networking

The cluster supports both Kubernetes and HashiCorp Nomad workloads on the same physical network:

  • Kubernetes: Calico CNI with BGP routing
  • Nomad: Bridge networking with Consul Connect service mesh
  • Vault: Network-based authentication and secrets distribution

Monitoring Network Integration

Observability Stack:

  • Prometheus: Scrapes metrics across all network endpoints
  • Grafana: Centralized dashboards accessible via MetalLB LoadBalancer
  • Loki: Log aggregation with Vector log shipping across nodes
  • Node Exporter: Per-node metrics collection

With this network architecture decided and implemented, we can move forward to the next phase of cluster construction.

Configuring Packages

Rather than YOLOing binaries onto our nodes like heathens, we'll use Apt and Ansible.

I wrote the above line before a few hours or so of fighting with Apt, Ansible, the repository signing key, documentation on the greater internet, my emotions, etc.

The long and short of it is that apt-key add is deprecated in Debian and Ubuntu, and consequently ansible.builtin.apt_key should be deprecated, but cannot be at this time for backward compatibility with older versions of Debian and Ubuntu and other derivative distributions.

The reason for this deprecation, as I understand it, is that apt-key add adds a key to /etc/apt/trusted.gpg.d. Keys here can be used to sign any package, including a package downloaded from an official distro package repository. This weakens our defenses against supply-chain attacks.

The new recommendation is to add the key to /etc/apt/keyrings, where it will be used when appropriate but not, apparently, to sign for official distro package repositories.

A further complication is that the Kubernetes project has moved its package repositories a time or two and completely rewrote the repository structure.

As a result, if you Google™, you will find a number of ways of using Ansible or a shell command to configure the Kubernetes apt repository on Debian/Ubuntu/Raspberry Pi OS, but none of them are optimal.

The Desired End-State

Here are my expectations:

  • use the new deb822 format, not the old sources.list format
  • preserve idempotence
  • don't point to deprecated package repositories
  • actually work

Existing solutions failed at one or all of these.

For the record, what we're trying to create is:

  • a file located at /etc/apt/keyrings/kubernetes.asc containing the Kubernetes package repository signing key
  • a file located at /etc/apt/sources.list.d/kubernetes.sources containing information about the Kubernetes package repository.

The latter should look something like the following:

X-Repolib-Name: kubernetes
Types: deb
URIs: https://pkgs.k8s.io/core:/stable:/v1.29/deb/
Suites: /
Architectures: arm64
Signed-By: /etc/apt/keyrings/kubernetes.asc

The Solution

After quite some time and effort and suffering, I arrived at a solution.

You can review the original task file for changes, but I'm embedding it here because it was weirdly a nightmare to arrive at a working solution.

I've edited this only to substitute strings for the variables that point to them, so it should be a working solution more-or-less out-of-the-box.

---
- name: 'Install packages needed to use the Kubernetes Apt repository.'
  ansible.builtin.apt:
    name:
      - 'apt-transport-https'
      - 'ca-certificates'
      - 'curl'
      - 'gnupg'
      - 'python3-debian'
    state: 'present'

- name: 'Add Kubernetes repository.'
  ansible.builtin.deb822_repository:
    name: 'kubernetes'
    types:
      - 'deb'
    uris:
      - "https://pkgs.k8s.io/core:/stable:/v1.29/deb/"
    suites:
      - '/'
    architectures:
      - 'arm64'
    signed_by: "https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key"

After this, you will of course need to update your Apt cache and install the three Kubernetes tools we'll use shortly: kubeadm, kubectl, and kubelet.

Installing Packages

Now that we have functional access to the Kubernetes Apt package repository, we can install some important Kubernetes tools:

  • kubeadm provides a straightforward way to setup and configure a Kubernetes cluster (API server, Controller Manager, DNS, etc). Kubernetes the Hard Way basically does what kubeadm does. I use kubeadm because my goal is to go not necessarily deeper, but farther.
  • kubectl is a CLI tool for administering a Kubernetes cluster; you can deploy applications, inspect resources, view logs, etc. As I'm studying for my CKA, I want to use kubectl for as much as possible.
  • kubelet runs on each and every node in the cluster and ensures that pods are functioning as desired and takes steps to correct their behavior when it does not match the desired state.

Package Installation Implementation

Kubernetes Package Configuration

The package installation is managed through Ansible variables in group_vars/all/vars.yaml:

kubernetes_version: '1.32'

kubernetes:
  apt_packages:
    - 'kubeadm'
    - 'kubectl' 
    - 'kubelet'
  apt_repo_url: "https://pkgs.k8s.io/core:/stable:/v{{ kubernetes_version }}/deb/"

This configuration:

  • Version management: Uses Kubernetes 1.32 (latest stable at time of writing)
  • Repository pinning: Uses version-specific repository for consistency
  • Package selection: Core Kubernetes tools required for cluster operation

Installation Process

The installation is handled by the install_k8s_packages.yaml playbook, which performs these steps:

1. Container Runtime Setup:

- name: 'Setup `containerd`.'
  hosts: 'k8s_cluster'
  remote_user: 'root'
  roles:
    - { role: 'geerlingguy.containerd' }

This ensures containerd is installed and configured before Kubernetes packages.

2. Package Installation:

- name: 'Install Kubernetes packages.'
  ansible.builtin.apt:
    name: "{{ kubernetes.apt_packages }}"
    state: 'present'
  notify:
    - 'Hold Kubernetes packages.'
    - 'Enable and restart kubelet service.'

3. Package Hold Management:

- name: 'Hold Kubernetes packages.'
  ansible.builtin.dpkg_selections:
    name: "{{ package }}"
    selection: 'hold'
  loop: "{{ kubernetes.apt_packages }}"

This prevents accidental upgrades during regular system updates, ensuring cluster stability.

Service Configuration

kubelet Service Activation:

- name: 'Enable and restart kubelet service.'
  ansible.builtin.systemd_service:
    name: 'kubelet'
    state: 'restarted'
    enabled: true
    daemon_reload: true

Key features:

  • Auto-start: Enables kubelet to start automatically on boot
  • Service restart: Ensures kubelet starts with new configuration
  • Daemon reload: Refreshes systemd to recognize any unit file changes

Target Nodes

The installation targets the k8s_cluster inventory group, which includes:

  • Control plane nodes: bettley, cargyll, dalt (3 nodes)
  • Worker nodes: All remaining Raspberry Pi nodes + velaryon GPU node (10 nodes)

This ensures all cluster nodes have consistent Kubernetes tooling.

Version Management Strategy

Repository Strategy:

  • Version-pinned repositories: Uses v1.32 specific repository
  • Package holds: Prevents accidental upgrades via dpkg --set-selections
  • Coordinated updates: Cluster-wide version management through Ansible

Upgrade Process:

  1. Update kubernetes_version variable
  2. Run install_k8s_packages.yaml playbook
  3. Coordinate cluster upgrade using kubeadm upgrade
  4. Update containerd and other runtime components as needed

Integration with Container Runtime

The playbook ensures proper integration between Kubernetes and containerd:

Runtime Configuration:

  • CRI socket: /var/run/containerd/containerd.sock
  • Cgroup driver: systemd (required for Kubernetes 1.22+)
  • Image service: containerd handles container image management

Service Dependencies:

  • containerd must be running before kubelet starts
  • kubelet configured to use containerd as container runtime
  • Proper systemd service ordering ensures reliable startup

Command Line Integration

The installation integrates with the goldentooth CLI:

# Install Kubernetes packages across cluster
goldentooth install_k8s_packages

# Uninstall if needed (cleanup)
goldentooth uninstall_k8s_packages

Post-Installation Verification

After installation, you can verify the tools are properly installed:

# Check versions
goldentooth command k8s_cluster 'kubeadm version'
goldentooth command k8s_cluster 'kubectl version --client'
goldentooth command k8s_cluster 'kubelet --version'

# Verify package holds
goldentooth command k8s_cluster 'apt-mark showhold | grep kube'

Installing these tools is comparatively simple with the automated approach, just sudo apt-get install -y kubeadm kubectl kubelet, but the Ansible implementation adds important production considerations like version pinning, service management, and cluster-wide coordination that manual installation would miss.

kubeadm init

kubeadm does a wonderful job of simplifying Kubernetes cluster bootstrapping (if you don't believe me, just read Kubernetes the Hard Way), but there's still a decent amount of work involved. Since we're creating a high-availability cluster, we need to do some magic to convey secrets between the control plane nodes, generate join tokens for the worker nodes, etc.

So, we will:

  • run kubeadm on the first control plane node
  • copy some data around
  • run a different kubeadm command to join the rest of the control plane nodes to the cluster
  • copy some more data around
  • run a different kubeadm command to join the worker nodes to the cluster

and then we're done!

kubeadm init takes a number of command-line arguments.

You can look at the actual Ansible tasks bootstrapping my cluster, but this is what my command evaluates out to:

kubeadm init \
  --control-plane-endpoint="10.4.0.10:6443" \
  --kubernetes-version="stable-1.29" \
  --service-cidr="172.16.0.0/20" \
  --pod-network-cidr="192.168.0.0/16" \
  --cert-dir="/etc/kubernetes/pki" \
  --cri-socket="unix:///var/run/containerd/containerd.sock" \
  --upload-certs

I'll break that down line by line:

# Run through all of the phases of initializing a Kubernetes control plane.
kubeadm init \
  # Requests should target the load balancer, not this particular node.
  --control-plane-endpoint="10.4.0.10:6443" \
  # We don't need any more instability than we already have.
  # At time of writing, 1.29 is the current release.
  --kubernetes-version="stable-1.29" \
  # As described in the chapter on Networking, this is the CIDR from which
  # service IP addresses will be allocated.
  # This gives us 4,094 IP addresses to work with.
  --service-cidr="172.16.0.0/20" \
  # As described in the chapter on Networking, this is the CIDR from which
  # pod IP addresses will be allocated.
  # This gives us 65,534 IP addresses to work with.
  --pod-network-cidr="192.168.0.0/16"
  # This is the directory that will host TLS certificates, keys, etc for
  # cluster communication.
  --cert-dir="/etc/kubernetes/pki"
  # This is the URI of the container runtime interface socket, which allows
  # direct interaction with the container runtime.
  --cri-socket="unix:///var/run/containerd/containerd.sock"
  # Upload certificates into the appropriate secrets, rather than making us
  # do that manually.
  --upload-certs

Oh, you thought I was just going to blow right by this, didncha? No, this ain't Kubernetes the Hard Way, but I do want to make an effort to understand what's going on here. So here, courtesy of kubeadm init --help, is the list of phases that kubeadm runs through by default.

preflight                    Run pre-flight checks
certs                        Certificate generation
  /ca                          Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
  /apiserver                   Generate the certificate for serving the Kubernetes API
  /apiserver-kubelet-client    Generate the certificate for the API server to connect to kubelet
  /front-proxy-ca              Generate the self-signed CA to provision identities for front proxy
  /front-proxy-client          Generate the certificate for the front proxy client
  /etcd-ca                     Generate the self-signed CA to provision identities for etcd
  /etcd-server                 Generate the certificate for serving etcd
  /etcd-peer                   Generate the certificate for etcd nodes to communicate with each other
  /etcd-healthcheck-client     Generate the certificate for liveness probes to healthcheck etcd
  /apiserver-etcd-client       Generate the certificate the apiserver uses to access etcd
  /sa                          Generate a private key for signing service account tokens along with its public key
kubeconfig                   Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
  /admin                       Generate a kubeconfig file for the admin to use and for kubeadm itself
  /super-admin                 Generate a kubeconfig file for the super-admin
  /kubelet                     Generate a kubeconfig file for the kubelet to use *only* for cluster bootstrapping purposes
  /controller-manager          Generate a kubeconfig file for the controller manager to use
  /scheduler                   Generate a kubeconfig file for the scheduler to use
etcd                         Generate static Pod manifest file for local etcd
  /local                       Generate the static Pod manifest file for a local, single-node local etcd instance
control-plane                Generate all static Pod manifest files necessary to establish the control plane
  /apiserver                   Generates the kube-apiserver static Pod manifest
  /controller-manager          Generates the kube-controller-manager static Pod manifest
  /scheduler                   Generates the kube-scheduler static Pod manifest
kubelet-start                Write kubelet settings and (re)start the kubelet
upload-config                Upload the kubeadm and kubelet configuration to a ConfigMap
  /kubeadm                     Upload the kubeadm ClusterConfiguration to a ConfigMap
  /kubelet                     Upload the kubelet component config to a ConfigMap
upload-certs                 Upload certificates to kubeadm-certs
mark-control-plane           Mark a node as a control-plane
bootstrap-token              Generates bootstrap tokens used to join a node to a cluster
kubelet-finalize             Updates settings relevant to the kubelet after TLS bootstrap
  /experimental-cert-rotation  Enable kubelet client certificate rotation
addon                        Install required addons for passing conformance tests
  /coredns                     Install the CoreDNS addon to a Kubernetes cluster
  /kube-proxy                  Install the kube-proxy addon to a Kubernetes cluster
show-join-command            Show the join command for control-plane and worker node

So now I will go through each of these in turn to explain how the cluster is created.

kubeadm init phases

preflight

The preflight phase performs a number of checks of the environment to ensure it is suitable. These aren't, as far as I can tell, documented anywhere -- perhaps because documentation would inevitably drift out of sync with the code rather quickly. And, besides, we're engineers and this is an open-source project; if we care that much, we can just read the source code!

But I'll go through and mention a few of these checks, just for the sake of discussion and because there are some important concepts.

  • Networking: It checks that certain ports are available and firewall settings do not prevent communication.
  • Container Runtime: It requires a container runtime, since... Kubernetes is a container orchestration platform.
  • Swap: Historically, Kubernetes has balked at running on a system with swap enabled, for performance and stability reasons, but this has been lifted recently.
  • Uniqueness: It checks that each hostname is different in order to prevent networking conflicts.
  • Kernel Parameters: It checks for certain cgroups (see the Node configuration chapter for more information). It used to check for some networking parameters as well, to ensure traffic can flow properly, but it appears this might not be a thing anymore in 1.30.

certs

This phase generates important certificates for communication between cluster components.

/ca

This generates a self-signed certificate authority that will be used to provision identities for all of the other Kubernetes components, and lays the groundwork for the security and reliability of their communication by ensuring that all components are able to trust one another.

By generating its own root CA, a Kubernetes cluster can be self-sufficient in managing the lifecycle of the certificates it uses for TLS. This includes generating, distributing, rotating, and revoking certificates as needed. This autonomy simplifies the setup and ongoing management of the cluster, especially in environments where integrating with an external CA might be challenging.

It's worth mentioning that this includes client certificates as well as server certificates, since client certificates aren't currently as well-known and ubiquitous as server certificates. So just as the API server has a server certificate that allows clients making requests to verify its identity, so clients will have a client certificate that allows the server to verify their identity.

So these certificate relationships maintain CIA (Confidentiality, Integrity, and Authentication) by:

  • encrypting the data transmitted between the client and the server (Confidentiality)
  • preventing tampering with the data transmitted between the client and the server (Integrity)
  • verifying the identity of the server and the client (Authentication)

/apiserver

The Kubernetes API server is the central management entity of the cluster. The Kubernetes API allows users and internal and external processes and components to communicate and report and manage the state of the cluster. The API server accepts, validates, and executes REST operations, and is the only cluster component that interacts with etcd directly. etcd is the source of truth within the cluster, so it is essential that communication with the API server be secure.

/apiserver-kubelet-client

This is a client certificate for the API server, ensuring that it can authenticate itself to each kubelet and prove that it is a legitimate source of commands and requests.

/front-proxy-ca and front-proxy-client

The Front Proxy certificates seem to only be used in situations where kube-proxy is supporting an extension API server, and the API server/aggregator needs to connect to an extension API server respectively. This is beyond the scope of this project.

/etcd-ca

etcd can be configured to run "stacked" (deployed onto the control plane) or as an external cluster. For various reasons (security via isolation, access control, simplified rotation and management, etc), etcd is provided its own certificate authority.

/etcd-server

This is a server certificate for each etcd node, assuring the Kubernetes API server and etcd peers of its identity.

/etcd-peer

This is a client and server certificate, distributed to each etcd node, that enables them to communicate securely with one another.

/etcd-healthcheck-client

This is a client certificate that enables the caller to probe etcd. It permits broader access, in that multiple clients can use it, but the degree of that access is very restricted.

/apiserver-etcd-client

This is a client certificate permitting the API server to communicate with etcd.

/sa

This is a public and private key pair that is used for signing service account tokens.

Service accounts are used to provide an identity for processes that run in a Pod, permitting them to interact securely with the API server.

Service account tokens are JWTs (JSON Web Tokens). When a Pod accesses the Kubernetes API, it can present a service account token as a bearer token in the HTTP Authorization header. The API server then uses the public key to verify the signature on the token, and can then evaluate whether the claims are valid, etc.

kubeconfig

These phases write the necessary configuration files to secure and facilitate communication within the cluster and between administrator tools (like kubectl) and the cluster.

/admin

This is the kubeconfig file for the cluster administrator. It provides the admin user with full access to the cluster.

Now, per a change in 1.29, as Rory McCune explains, this admin credential is no longer a member of system:masters and instead has access granted via RBAC. This means that access can be revoked without having to manually rotate all of the cluster certificates.

/super-admin

This new credential also provides full access to the cluster, but via the system:masters group mechanism (read: irrevocable without rotating certificates). This also explains why, when watching my cluster spin up while using the admin.conf credentials, a time or two I saw access denied errors!

/kubelet

This credential is for use with the kubelet during cluster bootstrapping. It provides a baseline cluster-wide configuration for all kubelets in the cluster. It points to the client certificates that allow the kubelet to communicate with the API server so we can propagate cluster-level configuration to each kubelet.

/controller-manager

This credential is used by the Controller Manager. The Controller Manager is responsible for running controller processes, which watch the state of the cluster through the API server and make changes attempting to move the current state towards the desired state. This file contains credentials that allow the Controller Manager to communicate securely with the API server.

/scheduler

This credential is used by the Kubernetes Scheduler. The Scheduler is responsible for assigning work, in the form of Pods, to different nodes in the cluster. It makes these decisions based on resource availability, workload requirements, and other policies. This file contains the credentials needed for the Scheduler to interact with the API server.

etcd

This phase generates the static pod manifest file for local etcd.

Static pod manifests are files kept in (in our case) /etc/kubernetes/manifests; the kubelet observes this directory and will start/replace/delete pods accordingly. In the case of a "stacked" cluster, where we have critical control plane components like etcd and the API server running within pods, we need some method of creating and managing pods without those components. Static pod manifests provide this capability.

/local

This phase configures a local etcd instance to run on the same node as the other control plane components. This is what we'll be doing; later, when we join additional nodes to the control plane, the etcd cluster will expand.

For instance, the static pod manifest file for etcd on bettley, my first control plane node, has a spec.containers[0].command that looks like this:

....
  - command:
    - etcd
    - --advertise-client-urls=https://10.4.0.11:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://10.4.0.11:2380
    - --initial-cluster=bettley=https://10.4.0.11:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://10.4.0.11:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://10.4.0.11:2380
    - --name=bettley
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
....

whereas on fenn, the second control plane node, the corresponding static pod manifest file looks like this:

  - command:
    - etcd
    - --advertise-client-urls=https://10.4.0.15:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://10.4.0.15:2380
    - --initial-cluster=fenn=https://10.4.0.15:2380,gardener=https://10.4.0.16:2380,bettley=https://10.4.0.11:2380
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://10.4.0.15:2379
    - --listen-metrics-urls=http://127.0.0.1:2381
    - --listen-peer-urls=https://10.4.0.15:2380
    - --name=fenn
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

and correspondingly, we can see three pods:

$ kubectl -n kube-system get pods
NAME                                       READY   STATUS    RESTARTS   AGE
etcd-bettley                               1/1     Running   19         3h23m
etcd-fenn                                  1/1     Running   0          3h22m
etcd-gardener                              1/1     Running   0          3h23m

control-plane

This phase generates the static pod manifest files for the other (non-etcd) control plane components.

/apiserver

This generates the static pod manifest file for the API server, which we've already discussed quite a bit.

/controller-manager

This generates the static pod manifest file for the controller manager. The controller manager embeds the core control loops shipped with Kubernetes. A controller is a loop that watches the shared state of the cluster through the API server and makes changes attempting to move the current state towards the desired state. Examples of controllers that are part of the Controller Manager include the Replication Controller, Endpoints Controller, Namespace Controller, and ServiceAccounts Controller.

/scheduler

This phase generates the static pod manifest file for the scheduler. The scheduler is responsible for allocating pods to nodes in the cluster based on various scheduling principles, including resource availability, constraints, affinities, etc.

kubelet-start

Throughout this process, the kubelet has been in a crash loop because it hasn't had a valid configuration.

This phase generates a config which (at least on my system) is stored at /var/lib/kubelet/config.yaml, as well as a "bootstrap" configuration that allows the kubelet to connect to the control plane (and retrieve credentials for longterm use).

Then the kubelet is restarted and will bootstrap with the control plane.

upload-certs

This phase enables the secure distribution of the certificates we created above, in the certs phases.

Some certificates need to be shared across the cluster (or at least across the control plane) for secure communication. This includes the certificates for the API server, etcd, the front proxy, etc.

kubeadm generates an encryption key that is used to encrypt the certificates, so they're not actually exposed in plain text at any point. Then the encrypted certificates are uploaded to etcd, a distributed key-value store that Kubernetes uses for persisting cluster state. To facilitate future joins of control plane nodes without having to manually distribute certificates, these encrypted certificates are stored in a specific kubeadm-certs secret.

The encryption key is required to decrypt the certificates for use by joining nodes. This key is not uploaded to the cluster for security reasons. Instead, it must be manually shared with any future control plane nodes that join the cluster. kubeadm outputs this key upon completion of the upload-certs phase, and it's the administrator's responsibility to securely transfer this key when adding new control plane nodes.

This process allows for the secure addition of new control plane nodes to the cluster by ensuring they have access to the necessary certificates to communicate securely with the rest of the cluster. Without this phase, administrators would have to manually copy certificates to each new node, which can be error-prone and insecure.

By automating the distribution of these certificates and utilizing encryption for their transfer, kubeadm significantly simplifies the process of scaling the cluster's control plane, while maintaining high standards of security.

mark-control-plane

In this phase, kubeadm applies a specific label to control plane nodes: node-role.kubernetes.io/control-plane=""; this marks the node as part of the control plane. Additionally, the node receives a taint, node-role.kubernetes.io/control-plane=:NoSchedule, which will prevent normal workloads from being scheduled on it.

As noted previously, I see no reason to remove this taint, although I'll probably enable some tolerations for certain workloads (monitoring, etc).

bootstrap-token

This phase creates bootstrap tokens, which are used to authenticate new nodes joining the cluster. This is how we are able to easily scale the cluster dynamically without copying multiple certificates and private keys around.

The "TLS bootstrap" process allows a kubelet to automatically request a certificate from the Kubernetes API server. This certificate is then used for secure communication within the cluster. The process involves the use of a bootstrap token and a Certificate Signing Request (CSR) that the kubelet generates. Once approved, the kubelet receives a certificate and key that it uses for authenticated communication with the API server.

Bootstrap tokens are a simple bearer token. This token is composed of two parts: an ID and a secret, formatted as <id>.<secret>. The ID and secret are randomly generated strings that authenticate the joining nodes to the cluster.

The generated token is configured with specific permissions using RBAC policies. These permissions typically allow the token to create a certificate signing request (CSR) that the Kubernetes control plane can then approve, granting the joining node the necessary certificates to communicate securely within the cluster.

By default, bootstrap tokens are set to expire after a certain period (24 hours by default), ensuring that tokens cannot be reused indefinitely for joining new nodes to the cluster. This behavior enhances the security posture of the cluster by limiting the window during which a token is valid.

Once generated and configured, the bootstrap token is stored as a secret in the kube-system namespace.

kubelet-finalize

This phase ensures that the kubelet is fully configured with the necessary settings to securely and effectively participate in the cluster. It involves applying any final kubelet configurations that might depend on the completion of the TLS bootstrap process.

addon

This phase sets up essential add-ons required for the cluster to meet the Kubernetes Conformance Tests.

/coredns

CoreDNS provides DNS services for the internal cluster network, allowing pods to find each other by name and services to load-balance across a set of pods.

/kube-proxy

kube-proxy is responsible for managing network communication inside the cluster, implementing part of the Kubernetes Service concept by maintaining network rules on nodes. These rules allow network communication to pods from network sessions inside or outside the cluster.

kube-proxy ensures that the networking aspect of Kubernetes services is correctly handled, allowing for the routing of traffic to the appropriate destinations. It operates in the user space, and it can also run in iptables mode, where it manipulates rules to allow network traffic. This allows services to be exposed to the external network, load balances traffic to pods across the multiple instances, etc.

show-join-command

This phase simplifies the process of expanding a Kubernetes cluster by generating bootstrap tokens and providing the necessary command to join additional nodes, whether they are worker nodes or additional control plane nodes.

In the next section, we'll actually bootstrap the cluster.

Bootstrapping the First Control Plane Node

With a solid idea of what it is that kubeadm init actually does, we can return to our command:

kubeadm init \
  --control-plane-endpoint="10.4.0.10:6443" \
  --kubernetes-version="stable-1.29" \
  --service-cidr="172.16.0.0/20" \
  --pod-network-cidr="192.168.0.0/16" \
  --cert-dir="/etc/kubernetes/pki" \
  --cri-socket="unix:///var/run/containerd/containerd.sock" \
  --upload-certs

It's really pleasantly concise, given how much is going on under the hood.

The Ansible tasks also symlinks the /etc/kubernetes/admin.conf file to ~/.kube/config (so we can use kubectl without having to specify the config file).

Then it sets up my preferred Container Network Interface addon, Calico. I have in the past sometimes used Flannel, but Flannel doesn't support NetworkPolicy resources as it is a Layer 3 networking solution, whereas Calico operates at Layer 3 and Layer 4, which allows it fine-grained control over traffic based on ports, protocol types, sources and destinations, etc.

I want to play with NetworkPolicy resources, so Calico it is.

The next couple of steps create bootstrap tokens so we can join the cluster.

Joining the Rest of the Control Plane

The next phase of bootstrapping is to admit the rest of the control plane nodes to the control plane.

Certificate Key Extraction

Before joining additional control plane nodes, we need to extract the certificate key from the initial kubeadm init output:

- name: 'Set the kubeadm certificate key.'
  ansible.builtin.set_fact:
    k8s_certificate_key: "{{ line | regex_search('--certificate-key ([^ ]+)', '\\1') | first }}"
  loop: "{{ hostvars[kubernetes.first]['kubeadm_init'].stdout_lines | default([]) }}"
  when: '(line | trim) is match(".*--certificate-key.*")'

This certificate key is crucial for securely downloading control plane certificates during the join process. The --upload-certs flag from the initial kubeadm init uploaded these certificates to the cluster, encrypted with this key.

Dynamic Token Generation

Rather than using a static token, we generate a fresh token for the join process:

- name: 'Create kubeadm token for joining nodes.'
  ansible.builtin.command:
    cmd: "kubeadm --kubeconfig {{ kubernetes.admin_conf_path }} token create"
  delegate_to: "{{ kubernetes.first }}"
  register: 'temp_token'

- name: 'Set kubeadm token fact.'
  ansible.builtin.set_fact:
    kubeadm_token: "{{ temp_token.stdout }}"

This approach:

  • Security: Uses short-lived tokens (24-hour default expiry)
  • Automation: No need to manually specify or distribute tokens
  • Reliability: Fresh token for each bootstrap operation

JoinConfiguration Template

The JoinConfiguration manifest is generated from a Jinja2 template (kubeadm-controlplane.yaml.j2):

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: {{ haproxy.ipv4_address }}:6443
    token: {{ kubeadm_token }}
    unsafeSkipCAVerification: true
  timeout: 5m0s
  tlsBootstrapToken: {{ kubeadm_token }}
controlPlane:
  localAPIEndpoint:
    advertiseAddress: {{ ipv4_address }}
    bindPort: 6443
  certificateKey: {{ k8s_certificate_key }}
nodeRegistration:
  name: {{ inventory_hostname }}
  criSocket: {{ kubernetes.cri_socket_path }}
{% if inventory_hostname in kubernetes.rest %}
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
{% else %}
  taints: []
{% endif %}

Key Configuration Elements:

Discovery Configuration:

  • API Server Endpoint: Points to HAProxy load balancer (10.4.0.10:6443)
  • Bootstrap Token: Dynamically generated token for secure cluster discovery
  • CA Verification: Skipped for simplicity (acceptable in trusted network)
  • Timeout: 5-minute timeout for discovery operations

Control Plane Configuration:

  • Local API Endpoint: Each node advertises its own IP for API server communication
  • Certificate Key: Enables secure download of control plane certificates
  • Bind Port: Standard Kubernetes API server port (6443)

Node Registration:

  • CRI Socket: Uses containerd socket (unix:///var/run/containerd/containerd.sock)
  • Node Name: Uses Ansible inventory hostname for consistency
  • Taints: Control plane nodes get NoSchedule taint to prevent workload scheduling

Control Plane Join Process

The actual joining process involves several orchestrated steps:

1. Configuration Setup

- name: 'Ensure presence of Kubernetes directory.'
  ansible.builtin.file:
    path: '/etc/kubernetes'
    state: 'directory'
    mode: '0755'

- name: 'Create kubeadm control plane config.'
  ansible.builtin.template:
    src: 'kubeadm-controlplane.yaml.j2'
    dest: '/etc/kubernetes/kubeadm-controlplane.yaml'
    mode: '0640'
    backup: true

2. Readiness Verification

- name: 'Wait for the kube-apiserver to be ready.'
  ansible.builtin.wait_for:
    host: "{{ haproxy.ipv4_address }}"
    port: '6443'
    timeout: 180

This ensures the load balancer and initial control plane node are ready before attempting to join.

3. Clean State Reset

- name: 'Reset certificate directory.'
  ansible.builtin.shell:
    cmd: |
      if [ -f /etc/kubernetes/manifests/kube-apiserver.yaml ]; then
        kubeadm reset -f --cert-dir {{ kubernetes.pki_path }};
      fi

This conditional reset ensures a clean state if a node was previously part of a cluster.

4. Control Plane Join

- name: 'Join the control plane node to the cluster.'
  ansible.builtin.command:
    cmd: kubeadm join --config /etc/kubernetes/kubeadm-controlplane.yaml
  register: 'kubeadm_join'

5. Administrative Access Setup

- name: 'Ensure .kube directory exists.'
  ansible.builtin.file:
    path: '~/.kube'
    state: 'directory'
    mode: '0755'

- name: 'Symlink the kubectl admin.conf to ~/.kube/config.'
  ansible.builtin.file:
    src: '/etc/kubernetes/admin.conf'
    dest: '~/.kube/config'
    state: 'link'
    mode: '0600'

This sets up kubectl access for the root user on each control plane node.

Target Nodes

The control plane joining process targets nodes in the kubernetes.rest group:

  • bettley (10.4.0.11) - Second control plane node
  • cargyll (10.4.0.12) - Third control plane node

This gives us a 3-node control plane for high availability, capable of surviving the failure of any single node.

High Availability Considerations

Load Balancer Integration:

  • All control plane nodes use the HAProxy endpoint for cluster communication
  • Even control plane nodes connect through the load balancer for API server access
  • This ensures consistent behavior whether accessing from inside or outside the cluster

Certificate Management:

  • Control plane certificates are automatically distributed via the certificate key mechanism
  • Each node gets its own API server certificate with appropriate SANs
  • Certificate rotation is handled through normal Kubernetes certificate management

Etcd Clustering:

  • kubeadm automatically configures etcd clustering across all control plane nodes
  • Stacked topology (etcd on same nodes as API server) for simplicity
  • Quorum maintained with 3 nodes (can survive 1 node failure)

After these steps complete, a simple kubeadm join --config /etc/kubernetes/kubeadm-controlplane.yaml on each remaining control plane node is sufficient to complete the highly available control plane setup.

Admitting the Worker Nodes

After establishing a highly available control plane, the final phase of cluster bootstrapping involves admitting worker nodes. While conceptually simple, this process involves several important considerations for security, automation, and cluster topology.

Worker Node Join Command Generation

The process begins by generating a fresh join command from the first control plane node:

- name: 'Get a kubeadm join command for worker nodes.'
  ansible.builtin.command:
    cmd: 'kubeadm token create --print-join-command'
  changed_when: false
  when: 'ansible_hostname == kubernetes.first'
  register: 'kubeadm_join_command'

This command:

  • Dynamic tokens: Creates a new bootstrap token with 24-hour expiry
  • Complete command: Returns fully formed join command with discovery information
  • Security: Each bootstrap operation gets a fresh token to minimize exposure

Join Command Structure

The generated join command typically looks like:

kubeadm join 10.4.0.10:6443 \
  --token abc123.defghijklmnopqrs \
  --discovery-token-ca-cert-hash sha256:1234567890abcdef...

Key components:

  • API Server Endpoint: HAProxy load balancer address (10.4.0.10:6443)
  • Bootstrap Token: Temporary authentication token for initial cluster access
  • CA Certificate Hash: SHA256 hash of cluster CA certificate for secure discovery

Ansible Automation

The join command is distributed and executed across all worker nodes:

- name: 'Set the kubeadm join command fact.'
  ansible.builtin.set_fact:
    kubeadm_join_command: |
      {{ hostvars[kubernetes.first]['kubeadm_join_command'].stdout }} --ignore-preflight-errors=all

- name: 'Join node to Kubernetes control plane.'
  ansible.builtin.command:
    cmd: "{{ kubeadm_join_command }}"
  when: "clean_hostname in groups['k8s_worker']"
  changed_when: false

Automation features:

  • Fact distribution: Join command shared across all hosts via Ansible facts
  • Selective execution: Only runs on nodes in the k8s_worker inventory group
  • Preflight error handling: --ignore-preflight-errors=all allows join despite minor configuration warnings

Worker Node Inventory

The worker nodes are organized in the Ansible inventory under k8s_worker:

Raspberry Pi Workers (8 nodes):

  • erenford (10.4.0.14) - Ray head node, ZFS storage
  • fenn (10.4.0.15) - Ceph storage node
  • gardener (10.4.0.16) - Grafana host, ZFS storage
  • harlton (10.4.0.17) - General purpose worker
  • inchfield (10.4.0.18) - Loki host, Seaweed storage
  • jast (10.4.0.19) - Step-CA host, Seaweed storage
  • karstark (10.4.0.20) - Ceph storage node
  • lipps (10.4.0.21) - Ceph storage node

GPU Worker (1 node):

  • velaryon (10.4.1.10) - x86 node with GPU acceleration

This topology provides:

  • Heterogeneous compute: Mix of ARM64 (Pi) and x86_64 (velaryon) architectures
  • Specialized workloads: GPU node for ML/AI workloads
  • Storage diversity: Nodes optimized for different storage backends (ZFS, Ceph, Seaweed)

Node Registration Process

When a worker node joins the cluster, several automated processes occur:

1. TLS Bootstrap

# kubelet initiates TLS bootstrapping
kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
        --kubeconfig=/etc/kubernetes/kubelet.conf

This process:

  • Uses bootstrap token for initial authentication
  • Generates node-specific key pair
  • Requests certificate signing from cluster CA
  • Receives permanent kubeconfig upon approval

2. Node Labels and Taints

Automatic labels applied:

  • kubernetes.io/arch=arm64 (Pi nodes) or kubernetes.io/arch=amd64 (velaryon)
  • kubernetes.io/os=linux
  • node.kubernetes.io/instance-type= (based on node hardware)

No default taints: Worker nodes accept all workloads by default, unlike control plane nodes with NoSchedule taints.

3. Container Runtime Integration

Each worker node connects to the local containerd socket:

# kubelet configuration
criSocket: unix:///var/run/containerd/containerd.sock

This ensures:

  • Container lifecycle: kubelet manages pod containers via containerd
  • Image management: containerd handles container image pulls and storage
  • Runtime security: Proper cgroup and namespace isolation

Cluster Topology Verification

After worker node admission, the cluster achieves the desired topology:

Control Plane (3 nodes)

  • High availability: Survives single node failure
  • Load balanced: All API requests go through HAProxy
  • Etcd quorum: 3-node etcd cluster for data consistency

Worker Pool (9 nodes)

  • Compute capacity: 8x Raspberry Pi 4B + 1x x86 GPU node
  • Workload distribution: Scheduler can place pods across heterogeneous hardware
  • Fault tolerance: Workloads can survive multiple worker node failures

Networking Integration

  • Pod networking: Calico CNI provides cluster-wide pod connectivity
  • Service networking: kube-proxy configures service load balancing
  • External access: MetalLB provides LoadBalancer service implementation

Verification Commands

After worker node admission, verify cluster health:

# Check all nodes are Ready
kubectl get nodes -o wide

# Verify kubelet health across cluster
goldentooth command k8s_cluster 'systemctl status kubelet'

# Check pod networking
kubectl get pods -n kube-system -o wide

# Verify resource availability
kubectl top nodes

And voilà! We have a functioning cluster.

Voilà

We can also see that the cluster is functioning well from HAProxy's perspective:

HAProxy Stats 2

Implementation Details

The complete worker node admission process is automated in the bootstrap_k8s.yaml playbook, which orchestrates:

  1. Control plane initialization on the first node
  2. Control plane expansion to remaining master nodes
  3. Worker node admission across the entire worker pool
  4. Network configuration with Calico CNI
  5. Service mesh preparation for later HashiCorp Consul integration

This systematic approach ensures consistent cluster topology and provides a solid foundation for deploying containerized applications and platform services.

Where Do We Go From Here?

We have a functioning cluster now, which is to say that I've spent many hours of my life that I'm not going to get back just doing the same thing that the official documentation manages to convey in just a few lines.

Or that Jeff Geerling's geerlingguy.kubernetes has already managed to do.

And it's not a tenth of a percent as much as Kubespray can do.

Not much to be proud of, but again, this is a personal learning journey. I'm just trying to build a cluster thoughtfully, limiting the black boxes and the magic as much as practical.

The Foundation is Set

What we've accomplished so far represents the essential foundation of any production Kubernetes cluster:

Core Infrastructure ✅

  • High availability control plane with 3 nodes and etcd quorum
  • Load balanced API access through HAProxy for reliability
  • Container runtime (containerd) with proper CRI integration
  • Pod networking with Calico CNI providing cluster-wide connectivity
  • Worker node pool with heterogeneous hardware (ARM64 + x86_64)

Automation and Reproducibility ✅

  • Infrastructure as Code with comprehensive Ansible automation
  • Idempotent operations ensuring consistent cluster state
  • Version-pinned packages preventing unexpected upgrades
  • Goldentooth CLI providing unified cluster management interface

But a bare Kubernetes cluster, while functional, is just the beginning. Real production workloads require additional platform services and operational capabilities.

The Platform Journey Ahead

The following phases will transform our basic cluster into a comprehensive container platform:

Phase 1: Application Platform Services

The next immediate priorities focus on making the cluster useful for application deployment:

GitOps and Application Management:

  • Helm package management for standardized application packaging
  • Argo CD for GitOps-based continuous deployment
  • ApplicationSets for managing applications across environments
  • Sealed Secrets for secure secret management in Git repositories

Ingress and Load Balancing:

  • MetalLB for LoadBalancer service implementation
  • BGP configuration for dynamic route advertisement
  • External DNS for automatic DNS record management
  • TLS certificate automation with cert-manager

Phase 2: Observability and Operations

Production clusters require comprehensive observability:

Metrics and Monitoring:

  • Prometheus for metrics collection and alerting
  • Grafana for visualization and dashboards
  • Node exporters for hardware and OS metrics
  • Custom metrics for application-specific monitoring

Logging and Troubleshooting:

  • Loki for centralized log aggregation
  • Vector for log collection and routing
  • Distributed tracing for complex application debugging
  • Alert routing for operational incident response

Phase 3: Storage and Data Management

Stateful applications require sophisticated storage solutions:

Distributed Storage:

  • NFS exports for shared storage across the cluster
  • Ceph cluster for distributed block and object storage
  • ZFS replication for data durability and snapshots
  • SeaweedFS for scalable object storage

Backup and Recovery:

  • Velero for cluster backup and disaster recovery
  • Database backup automation for stateful workloads
  • Cross-datacenter replication for business continuity

Phase 4: Security and Compliance

Enterprise-grade security requires multiple layers:

PKI and Certificate Management:

  • Step-CA for internal certificate authority
  • Automatic certificate rotation for all cluster services
  • SSH certificate authentication for secure node access
  • mTLS everywhere for service-to-service communication

Secrets and Access Control:

  • HashiCorp Vault for enterprise secret management
  • AWS KMS integration for encryption key management
  • RBAC policies for fine-grained access control
  • Pod security standards for workload isolation

Phase 5: Multi-Orchestrator Hybrid Cloud

The final phase explores advanced orchestration patterns:

Service Mesh and Discovery:

  • Consul service mesh for advanced networking and security
  • Cross-platform service discovery between Kubernetes and Nomad
  • Traffic management and circuit breaking patterns

Workload Distribution:

  • Nomad integration for specialized workloads and batch jobs
  • Ray cluster for distributed machine learning workloads
  • GPU acceleration for AI/ML and scientific computing

Learning Philosophy

This journey prioritizes understanding over convenience:

Transparency Over Magic:

  • Each component is manually configured to understand its purpose
  • Ansible automation makes every configuration decision explicit
  • Documentation captures the reasoning behind each choice

Production Patterns from Day One:

  • High availability configurations even in the homelab
  • Security-first approach with proper PKI and encryption
  • Monitoring and observability built into every service

Platform Engineering Mindset:

  • Reusable patterns that could scale to enterprise environments
  • GitOps workflows that support team collaboration
  • Self-service capabilities for application developers

The Road Ahead

The following chapters will implement these platform services systematically, building up the cluster's capabilities layer by layer. Each addition will:

  1. Solve a real operational problem (not just add complexity)
  2. Follow production best practices (high availability, security, monitoring)
  3. Integrate with existing services (leveraging our PKI, service discovery, etc.)
  4. Document the implementation (including failure modes and troubleshooting)

This methodical approach ensures that when we're done, we'll have not just a working cluster, but a deep understanding of how modern container platforms are built and operated.

In the following sections, I'll add more functionality.

Installing Helm

I have a lot of ambitions for this cluster, but after some deliberation, the thing I most want to do right now is deploy something to Kubernetes.

So I'll be starting out by installing Argo CD, and I'll do that... soon. In the next chapter. I decided to install Argo CD via Helm, since I expect that Helm will be useful in other situations as well, e.g. trying out applications before I commit (no pun intended) to bringing them into GitOps.

So I created a playbook and role to cover installing Helm.

Installation Implementation

Package Repository Approach

Rather than downloading binaries manually, I chose to use the official Helm APT repository for a more maintainable installation. The Ansible role adds the repository using the modern deb822_repository format:

- name: 'Add Helm package repository.'
  ansible.builtin.deb822_repository:
    name: 'helm'
    types: ['deb']
    uris: ['https://baltocdn.com/helm/stable/debian/']
    suites: ['all']
    components: ['main']
    architectures: ['arm64']
    signed_by: 'https://baltocdn.com/helm/signing.asc'

This approach provides several benefits:

  • Automatic updates: Using state: 'latest' ensures we get the most recent Helm version
  • Security: Uses the official Helm signing key for package verification
  • Architecture support: Properly configured for ARM64 architecture on Raspberry Pi nodes
  • Maintainability: Standard package management simplifies updates and removes manual binary management

Installation Scope

Helm is installed only on the Kubernetes control plane nodes (k8s_control_plane group). This is sufficient because:

  1. Post-Tiller Architecture: Modern Helm (v3+) doesn't require a server-side component
  2. Client-only Tool: Helm operates entirely as a client-side tool that communicates with the Kubernetes API
  3. Administrative Access: Control plane nodes are where cluster administration typically occurs
  4. Resource Efficiency: No need to install on every worker node

Integration with Cluster Architecture

Kubernetes Integration: The installation leverages the existing kubernetes.core Ansible collection, ensuring proper integration with the cluster's Kubernetes components. The role depends on:

  • Existing cluster RBAC configurations
  • Kubernetes API server access from control plane nodes
  • Standard kubeconfig files for authentication

GitOps Integration: Helm serves as a crucial component for the GitOps workflow, particularly for Argo CD installation. The integration follows this pattern:

- name: 'Add Argo Helm chart repository.'
  kubernetes.core.helm_repository:
    name: 'argo'
    repo_url: "{{ argo_cd.chart_repo_url }}"

- name: 'Install Argo CD from Helm chart.'
  kubernetes.core.helm:
    atomic: false
    chart_ref: 'argo/argo-cd'
    chart_version: "{{ argo_cd.chart_version }}"
    create_namespace: true
    release_name: 'argocd'
    release_namespace: "{{ argo_cd.namespace }}"

Security Considerations

The installation follows security best practices:

  • Signed Packages: Uses official Helm signing key for package verification
  • Trusted Repository: Sources packages directly from Helm's CDN
  • No Custom RBAC: Relies on existing Kubernetes cluster RBAC rather than creating additional permissions
  • System-level Installation: Installed as root for proper system integration

Command Line Integration

The installation integrates seamlessly with the goldentooth CLI:

goldentooth install_helm

This command maps directly to the Ansible playbook execution, maintaining consistency with the cluster's unified management interface.

Version Management Strategy

The configuration uses a state: 'latest' strategy, which means:

  • Automatic Updates: Each playbook run ensures the latest Helm version is installed
  • Application-level Pinning: Specific chart versions are managed at the application level (e.g., Argo CD chart version 7.1.5)
  • Simplified Maintenance: No need to manually track Helm version updates

High Availability Considerations

By installing Helm on all control plane nodes, the configuration provides:

  • Redundancy: Any control plane node can perform Helm operations
  • Administrative Flexibility: Cluster administrators can use any control plane node
  • Disaster Recovery: Helm operations can continue even if individual control plane nodes fail

Fortunately, this is fairly simple to install and trivial to configure, which is not something I can say for Argo CD 🙂

Installing Argo CD

GitOps is a methodology based around treating IaC stored in Git as a source of truth for the desired state of the infrastructure. Put simply, whatever you push to main becomes the desired state and your IaC systems, whether they be Terraform, Ansible, etc, will be invoked to bring the actual state into alignment.

Argo CD is a popular system for implementing GitOps with Kubernetes. It can observe a Git repository for changes and react to those changes accordingly, creating/destroying/replacing resources as needed within the cluster.

Argo CD is a large, complicated application in its own right; its Helm chart is thousands of lines long. I'm not trying to learn it all right now, and fortunately, I have a fairly simple structure in mind.

I'll install Argo CD via a new Ansible playbook and role that use Helm, which we set up in the last section.

None of this is particularly complex, but I'll document some of my values overrides here:

# I've seen a mix of `argocd` and `argo-cd` scattered around. I preferred
# `argocd`, but I will shift to `argo-cd` where possible to improve
# consistency.
#
# EDIT: The `argocd` CLI tool appears to be broken and does not allow me to
# override the names of certain components when port forwarding.
# See https://github.com/argoproj/argo-cd/issues/16266 for details.
# As a result, I've gone through and reverted my changes to standardize as much
# as possible on `argocd`. FML.
nameOverride: 'argocd'
global:
  # This evaluates to `argocd.goldentooth.hellholt.net`.
  domain: "{{ argocd_domain }}"
  # Add Prometheus scrape annotations to all metrics services. This can
  # be used as an alternative to the ServiceMonitors.
  addPrometheusAnnotations: true
  # Default network policy rules used by all components.
  networkPolicy:
    # Create NetworkPolicy objects for all components; this is currently false
    # but I think I'd like to create these later.
    create: false
    # Default deny all ingress traffic; I want to improve security, so I hope
    # to enable this later.
    defaultDenyIngress: false
configs:
  secret:
    createSecret: true
    # Specify a password. I store an "easy" password, which is in my muscle
    # memory, so I'll use that for right now.
    argocdServerAdminPassword: "{{ vault.easy_password | password_hash('bcrypt') }}"
  # Refer to the repositories that host our applications.
  repositories:
    # This is the main (and likely only) one.
    gitops:
      type: 'git'
      name: 'gitops'
      # This turns out to be https://github.com/goldentooth/gitops.git
      url: "{{ argocd_app_repo_url }}"

redis-ha:
  # Enable Redis high availability.
  enabled: true

controller:
  # The HA configuration keeps this at one, and I don't see a reason to change.
  replicas: 1

server:
  # Enable
  autoscaling:
    enabled: true
    # This immediately scaled up to 3 replicas.
    minReplicas: 2
  # I'll make this more secure _soon_.
  extraArgs:
    - '--insecure'
  # I don't have load balancing set up yet.
  service:
    type: 'ClusterIP'

repoServer:
  autoscaling:
    enabled: true
    minReplicas: 2

applicationSet:
  replicas: 2

Pods in the Argo CD namespace

Installation Architecture

The Argo CD installation uses a sophisticated Helm-based approach with the following components:

  • Chart Version: 7.1.5 from the official Argo repository (https://argoproj.github.io/argo-helm)
  • CLI Installation: ARM64-specific Argo CD CLI installed to /usr/local/bin/argocd
  • Namespace: Dedicated argocd namespace with proper resource isolation
  • Deployment Scope: Runs once on control plane nodes for efficient resource usage

High Availability Configuration

The installation implements enterprise-grade high availability:

Redis High Availability:

redis-ha:
  enabled: true

Component Scaling:

  • Server: Autoscaling enabled with minimum 2 replicas for redundancy
  • Repo Server: Autoscaling enabled with minimum 2 replicas for Git repository operations
  • Application Set Controller: 2 replicas for ApplicationSet management
  • Controller: 1 replica (following HA recommendations for the core controller)

This configuration ensures that Argo CD remains operational even during node failures or maintenance.

Security and Authentication

Admin Authentication: The cluster uses bcrypt-hashed passwords stored in the encrypted Ansible vault:

argocdServerAdminPassword: "{{ secret_vault.easy_password | password_hash('bcrypt') }}"

GitHub Integration: For private repository access, the installation creates a Kubernetes secret:

apiVersion: v1
kind: Secret
metadata:
  name: github-token
  namespace: argocd
data:
  token: "{{ secret_vault.github_token | b64encode }}"

Current Security Posture:

  • Server configured with --insecure flag (temporary for initial setup)
  • Network policies prepared but not yet enforced
  • RBAC relies on default admin access patterns

Service and Network Integration

LoadBalancer Configuration: Unlike the basic ClusterIP shown in the values, the actual deployment uses:

service:
  type: LoadBalancer
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "argocd.{{ cluster.domain }}"
    external-dns.alpha.kubernetes.io/ttl: "60"

This integration provides:

  • MetalLB Integration: Automatic IP address assignment from the 10.4.11.0/24 pool
  • External DNS: Automatic DNS record creation for argocd.goldentooth.net
  • Public Access: Direct access from the broader network infrastructure

GitOps Implementation: App of Apps Pattern

The cluster implements the sophisticated "Application of Applications" pattern for managing GitOps workflows:

AppProject Configuration:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: gitops-repo
spec:
  sourceRepos:
    - '*'  # Lab environment - all repositories allowed
  destinations:
    - namespace: '*'
       server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'

ApplicationSet Generator: The cluster uses GitHub SCM Provider generator to automatically discover and deploy applications:

generators:
- scmProvider:
    github:
      organization: goldentooth
      labelSelector:
        matchLabels:
          gitops-repo: "true"

This pattern automatically creates Argo CD Applications for any repository in the goldentooth organization with the gitops-repo label.

Application Standards and Sync Policies

Standardized Sync Configuration: All applications follow consistent sync policies:

syncPolicy:
  automated:
    prune: true      # Remove resources not in Git
    selfHeal: true   # Automatically fix configuration drift
  syncOptions:
    - Validate=true
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
    - RespectIgnoreDifferences=true
    - ApplyOutOfSyncOnly=true

Wave-based Deployment: Applications use argocd.argoproj.io/wave annotations for ordered deployment, ensuring dependencies are deployed before dependent services.

Monitoring Integration

Prometheus Integration:

global:
  addPrometheusAnnotations: true

This configuration ensures all Argo CD components expose metrics for the cluster's Prometheus monitoring stack, providing visibility into GitOps operations and performance.

Current Application Portfolio

The GitOps system currently manages:

  • MetalLB: Load balancer implementation
  • External Secrets: Integration with HashiCorp Vault
  • Prometheus Node Exporter: Node-level monitoring
  • Additional applications: Automatically discovered via the ApplicationSet pattern

Command Line Integration

The installation provides seamless CLI integration:

# Install Argo CD
goldentooth install_argo_cd

# Install managed applications
goldentooth install_argo_cd_apps

Access Methods

Web Interface Access:

  • Production: Direct access via https://argocd.goldentooth.net (LoadBalancer + External DNS)
  • Development: Port forwarding via kubectl -n argocd port-forward service/argocd-server 8081:443 --address 0.0.0.0

After running the port-forward command on one of my control plane nodes, I'm able to view the web interface and log in. With the App of Apps pattern configured, the interface shows automatically discovered applications and their sync status.

The GitOps foundation is now established, enabling declarative application management across the entire cluster infrastructure.

The "Incubator" GitOps Application

Previously, we discussed GitOps and how Argo CD provides a platform for implementing GitOps for Kubernetes.

As mentioned, the general idea is to have some Git repository somewhere that defines an application. We create a corresponding resource in Argo CD to represent that application, and Argo CD will henceforth watch the repository and make changes to the running application as needed.

What does the repository actually include? Well, it might be a Helm chart, or a kustomization, or raw manifests, etc. Pretty much anything that could be done in Kubernetes.

Of course, setting this up involves some manual work; you need to actually create the application within Argo CD and, if you want it to hang around, you need to presumably commit that resource to some version control system somewhere. We of course want to be careful who has access to that repository, though, and we might not want engineers to have access to Argo CD itself. So suddenly there's a rather uncomfortable amount of work and coupling in all of this.

GitOps Deployment Patterns

Traditional Application Management Challenges

Manual application creation:

  • Platform engineers must create Argo CD Application resources manually
  • Direct access to Argo CD UI required for application management
  • Configuration drift between different environments
  • Difficulty managing permissions and access control at scale

Repository proliferation:

  • Each application requires its own repository or subdirectory
  • Inconsistent structure and standards across teams
  • Complex permission management across multiple repositories
  • Operational overhead for maintaining repository access

The App-of-Apps Pattern

A common pattern in Argo CD is the "app-of-apps" pattern. This is simply an Argo CD application pointing to a repository that contains other Argo CD applications. Thus you can have a single application created for you by the principal platform engineer, and you can turn it into fifty or a hundred finely grained pieces of infrastructure that said principal engineer doesn't have to know about 🙂

(If they haven't configured the security settings carefully, it can all just be your little secret 😉)

App-of-Apps Architecture:

Root Application (managed by platform team)
├── Application 1 (e.g., monitoring stack)
├── Application 2 (e.g., ingress controllers)
├── Application 3 (e.g., security tools)
└── Application N (e.g., developer applications)

Benefits of App-of-Apps:

  • Single entry point: Platform team manages one root application
  • Delegated management: Development teams control their applications
  • Hierarchical organization: Logical grouping of related applications
  • Simplified bootstrapping: New environments start with root application

Limitations of App-of-Apps:

  • Resource proliferation: Each application creates an Application resource
  • Dependency management: Complex inter-application dependencies
  • Scaling challenges: Manual management of hundreds of applications
  • Limited templating: Difficult to apply consistent patterns

ApplicationSet Pattern (Modern Approach)

A (relatively) new construct in Argo CD is the ApplicationSet construct, which seeks to more clearly define how applications are created and fix the problems with the "app-of-apps" approach. That's the approach we will take in this cluster for mature applications.

ApplicationSet Architecture:

ApplicationSet (template-driven)
├── Generator (Git directories, clusters, pull requests)
├── Template (Application template with parameters)
└── Applications (dynamically created from template)

ApplicationSet Generators:

  • Git Generator: Scans Git repositories for directories or files
  • Cluster Generator: Creates applications across multiple clusters
  • List Generator: Creates applications from predefined lists
  • Matrix Generator: Combines multiple generators for complex scenarios

Example ApplicationSet Configuration:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: gitops-repo
  namespace: argocd
spec:
  generators:
  - scmProvider:
      github:
        organization: goldentooth
        allBranches: false
        labelSelector:
          matchLabels:
            gitops-repo: "true"
  template:
    metadata:
      name: '{{repository}}'
    spec:
      project: gitops-repo
      source:
        repoURL: '{{url}}'
        targetRevision: '{{branch}}'
        path: .
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{repository}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true

The Incubator Project Strategy

Given that we're operating in a lab environment, we can use the "app-of-apps" approach for the Incubator, which is where we can try out new configurations. We can give it fairly unrestricted access while we work on getting things to deploy correctly, and then lock things down as we zero in on a stable configuration.

Development vs Production Patterns

Incubator (Development):

  • App-of-Apps pattern: Manual application management for experimentation
  • Permissive security: Broad access for rapid prototyping
  • Flexible structure: Accommodate diverse application types
  • Quick iteration: Fast deployment and testing cycles

Production (Mature Applications):

  • ApplicationSet pattern: Template-driven automation at scale
  • Restrictive security: Principle of least privilege
  • Standardized structure: Consistent patterns and practices
  • Controlled deployment: Change management and approval processes

But meanwhile, we'll create an AppProject manifest for the Incubator:

---
apiVersion: 'argoproj.io/v1alpha1'
kind: 'AppProject'
metadata:
  name: 'incubator'
  # Argo CD resources need to deploy into the Argo CD namespace.
  namespace: 'argocd'
  finalizers:
    - 'resources-finalizer.argocd.argoproj.io'
spec:
  description: 'Goldentooth incubator project'
  # Allow manifests to deploy from any Git repository.
  # This is an acceptable security risk because this is a lab environment
  # and I am the only user.
  sourceRepos:
    - '*'
  destinations:
    # Prevent any resources from deploying into the kube-system namespace.
    - namespace: '!kube-system'
      server: '*'
    # Allow resources to deploy into any other namespace.
    - namespace: '*'
      server: '*'
  clusterResourceWhitelist:
    # Allow any cluster resources to deploy.
    - group: '*'
      kind: '*'

As mentioned before, this is very permissive. It only slightly differs from the default project by preventing resources from deploying into the kube-system namespace.

We'll also create an Application manifest:

apiVersion: 'argoproj.io/v1alpha1'
kind: 'Application'
metadata:
  name: 'incubator'
  namespace: 'argocd'
  labels:
    name: 'incubator'
    managed-by: 'argocd'
spec:
  project: 'incubator'
  source:
    repoURL: "https://github.com/goldentooth/incubator.git"
    path: './'
    targetRevision: 'HEAD'
  destination:
    server: 'https://kubernetes.default.svc'
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - Validate=true
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - RespectIgnoreDifferences=true
      - ApplyOutOfSyncOnly=true

That's sufficient to get it to pop up in the Applications view in Argo CD.

Argo CD Incubator

AppProject Configuration Deep Dive

Security Boundary Configuration

The AppProject resource provides security boundaries and policy enforcement:

spec:
  description: 'Goldentooth incubator project'
  sourceRepos:
    - '*'  # Allow any Git repository (lab environment only)
  destinations:
    - namespace: '!kube-system'  # Explicit exclusion
      server: '*'
    - namespace: '*'            # Allow all other namespaces
      server: '*'
  clusterResourceWhitelist:
    - group: '*'               # Allow any cluster-scoped resources
      kind: '*'

Security Trade-offs:

  • Permissive source repos: Allows rapid experimentation with external charts
  • Namespace protection: Prevents accidental modification of system namespaces
  • Cluster resource access: Enables testing of operators and custom resources
  • Lab environment justification: Security relaxed for learning and development

Production AppProject Example

For comparison, a production AppProject would be much more restrictive:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production-apps
  namespace: argocd
spec:
  description: 'Production applications with strict controls'
  sourceRepos:
    - 'https://github.com/goldentooth/helm-charts'
    - 'https://charts.bitnami.com/bitnami'
  destinations:
    - namespace: 'production-*'
      server: 'https://kubernetes.default.svc'
  clusterResourceWhitelist:
    - group: ''
      kind: 'Namespace'
    - group: 'rbac.authorization.k8s.io'
      kind: 'ClusterRole'
  namespaceResourceWhitelist:
    - group: 'apps'
      kind: 'Deployment'
    - group: ''
      kind: 'Service'
  roles:
    - name: 'developers'
      policies:
        - 'p, proj:production-apps:developers, applications, get, production-apps/*, allow'
        - 'p, proj:production-apps:developers, applications, sync, production-apps/*, allow'

Application Configuration Patterns

Sync Policy Configuration

The Application's sync policy defines automated behavior:

syncPolicy:
  automated:
    prune: true      # Remove resources deleted from Git
    selfHeal: true   # Automatically fix configuration drift
  syncOptions:
    - Validate=true                    # Validate resources before applying
    - CreateNamespace=true             # Auto-create target namespaces
    - PrunePropagationPolicy=foreground # Wait for dependent resources
    - PruneLast=true                   # Delete applications last
    - RespectIgnoreDifferences=true    # Honor ignoreDifferences rules
    - ApplyOutOfSyncOnly=true         # Only apply changed resources

Sync Policy Implications:

  • Prune: Ensures Git repository is single source of truth
  • Self-heal: Prevents manual changes from persisting
  • Validation: Catches configuration errors before deployment
  • Namespace creation: Reduces manual setup for new applications

Repository Structure for App-of-Apps

The incubator repository structure supports the app-of-apps pattern:

incubator/
├── README.md
├── applications/
│   ├── monitoring/
│   │   ├── prometheus.yaml
│   │   ├── grafana.yaml
│   │   └── alertmanager.yaml
│   ├── networking/
│   │   ├── metallb.yaml
│   │   ├── external-dns.yaml
│   │   └── cert-manager.yaml
│   └── storage/
│       ├── nfs-provisioner.yaml
│       ├── ceph-operator.yaml
│       └── seaweedfs.yaml
└── environments/
    ├── dev/
    ├── staging/
    └── production/

Directory Organization Benefits:

  • Logical grouping: Applications organized by functional area
  • Environment separation: Different configurations per environment
  • Clear ownership: Teams can own specific directories
  • Selective deployment: Enable/disable application groups easily

Integration with ApplicationSets

Migration Path from App-of-Apps

As applications mature, they can be migrated from the incubator to ApplicationSet management:

Migration Steps:

  1. Stabilize configuration: Test thoroughly in incubator environment
  2. Create Helm chart: Package application as reusable chart
  3. Add to gitops-repo: Tag repository for ApplicationSet discovery
  4. Remove from incubator: Delete Application from incubator repository
  5. Verify automation: Confirm ApplicationSet creates new Application

Example Migration:

# Before: Manual Application in incubator
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring-stack
  namespace: argocd
spec:
  project: incubator
  source:
    repoURL: 'https://github.com/goldentooth/monitoring'
    path: './manifests'

# After: Automatically generated by ApplicationSet
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
  ownerReferences:
  - apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    name: gitops-repo
spec:
  project: gitops-repo
  source:
    repoURL: 'https://github.com/goldentooth/monitoring'
    path: '.'

ApplicationSet Template Advantages

Consistent Configuration:

  • All applications get same sync policy
  • Standardized labeling and annotations
  • Uniform security settings across applications
  • Reduced configuration drift between applications

Template Parameters:

template:
  metadata:
    name: '{{repository}}'
    labels:
      environment: '{{environment}}'
      team: '{{team}}'
      gitops-managed: 'true'
  spec:
    project: '{{project}}'
    source:
      repoURL: '{{url}}'
      targetRevision: '{{branch}}'
      helm:
        valueFiles:
        - 'values-{{environment}}.yaml'

Operational Workflows

Development Workflow

Incubator Development Process:

  1. Create feature branch: Develop new application in isolated branch
  2. Add Application manifest: Define application in incubator repository
  3. Test deployment: Verify application deploys correctly
  4. Iterate configuration: Refine settings based on testing
  5. Merge to main: Deploy to shared incubator environment
  6. Monitor and debug: Observe application behavior and logs

Production Promotion

Graduation from Incubator:

  1. Create dedicated repository: Move application to own repository
  2. Package as Helm chart: Standardize configuration management
  3. Add gitops-repo label: Enable ApplicationSet discovery
  4. Configure environments: Set up dev/staging/production values
  5. Test automation: Verify ApplicationSet creates Application
  6. Remove from incubator: Clean up experimental Application

Monitoring and Observability

Application Health Monitoring:

# Check application sync status
kubectl -n argocd get applications

# View application details
argocd app get incubator

# Monitor sync operations
argocd app sync incubator --dry-run

# Check for configuration drift
argocd app diff incubator

Common Issues and Troubleshooting:

  • Sync failures: Check resource validation and RBAC permissions
  • Resource conflicts: Verify namespace isolation and resource naming
  • Git access issues: Confirm repository permissions and authentication
  • Health check failures: Review application health status and events

Best Practices for GitOps

Repository Management

Separation of Concerns:

  • Application code: Business logic and container images
  • Configuration: Kubernetes manifests and Helm values
  • Infrastructure: Cluster setup and platform services
  • Policies: Security rules and governance configurations

Version Control Strategy:

main branch    → Production environment
staging branch → Staging environment  
dev branch     → Development environment
feature/*      → Feature testing

Security Considerations

Credential Management:

  • Use Argo CD's credential templates for repository access
  • Implement least-privilege access for Git repositories
  • Rotate credentials regularly and audit access
  • Consider using Git over SSH for enhanced security

Resource Isolation:

  • Separate AppProjects for different security domains
  • Use namespace-based isolation for applications
  • Implement RBAC policies aligned with organizational structure
  • Monitor cross-namespace resource access

This incubator approach provides a safe environment for experimenting with GitOps patterns while establishing the foundation for scalable, automated application management through ApplicationSets as the platform matures.

Prometheus Node Exporter

Sure, I could just jump straight into kube-prometheus, but where's the fun (and, more importantly, the learning) in that?

I'm going to try to build a system from the ground up, tweaking each component as I go.

Prometheus Node Exporter seems like a reasonable place to begin, as it will give me per-node statistics that I can look at immediately. Or almost immediately.

The first order of business is to modify our incubator repository to refer to the Prometheus Node Exporter Helm chart.

By adding the following in the incubator repo:

# templates/prometheus_node_exporter.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus-node-exporter
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-node-exporter
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: prometheus-node-exporter
    server: 'https://kubernetes.default.svc'
  project: incubator
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus-node-exporter
    targetRevision: 4.31.0
    helm:
      releaseName: prometheus-node-exporter

We'll soon see the resources created:

Prometheus Node Exporter project running in Argo CD

And we can curl a metric butt-ton of information:

$ curl localhost:9100/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.21.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 829976
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 829976
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.445756e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 704
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.909376e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 829976
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 1.458176e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.310144e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 8628
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.458176e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.76832e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 9332
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 1200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 37968
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 48888
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 795876
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 425984
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 425984
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 9.4098e+06
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 6
# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1.706835386e+09
# HELP node_context_switches_total Total number of context switches.
# TYPE node_context_switches_total counter
node_context_switches_total 1.8612307682e+10
# HELP node_cooling_device_cur_state Current throttle state of the cooling device
# TYPE node_cooling_device_cur_state gauge
node_cooling_device_cur_state{name="0",type="gpio-fan"} 1
# HELP node_cooling_device_max_state Maximum throttle state of the cooling device
# TYPE node_cooling_device_max_state gauge
node_cooling_device_max_state{name="0",type="gpio-fan"} 1
# HELP node_cpu_frequency_max_hertz Maximum CPU thread frequency in hertz.
# TYPE node_cpu_frequency_max_hertz gauge
node_cpu_frequency_max_hertz{cpu="0"} 2e+09
node_cpu_frequency_max_hertz{cpu="1"} 2e+09
node_cpu_frequency_max_hertz{cpu="2"} 2e+09
node_cpu_frequency_max_hertz{cpu="3"} 2e+09
# HELP node_cpu_frequency_min_hertz Minimum CPU thread frequency in hertz.
# TYPE node_cpu_frequency_min_hertz gauge
node_cpu_frequency_min_hertz{cpu="0"} 6e+08
node_cpu_frequency_min_hertz{cpu="1"} 6e+08
node_cpu_frequency_min_hertz{cpu="2"} 6e+08
node_cpu_frequency_min_hertz{cpu="3"} 6e+08
# HELP node_cpu_guest_seconds_total Seconds the CPUs spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="2",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="2",mode="user"} 0
node_cpu_guest_seconds_total{cpu="3",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="3",mode="user"} 0
# HELP node_cpu_scaling_frequency_hertz Current scaled CPU thread frequency in hertz.
# TYPE node_cpu_scaling_frequency_hertz gauge
node_cpu_scaling_frequency_hertz{cpu="0"} 2e+09
node_cpu_scaling_frequency_hertz{cpu="1"} 2e+09
node_cpu_scaling_frequency_hertz{cpu="2"} 2e+09
node_cpu_scaling_frequency_hertz{cpu="3"} 7e+08
# HELP node_cpu_scaling_frequency_max_hertz Maximum scaled CPU thread frequency in hertz.
# TYPE node_cpu_scaling_frequency_max_hertz gauge
node_cpu_scaling_frequency_max_hertz{cpu="0"} 2e+09
node_cpu_scaling_frequency_max_hertz{cpu="1"} 2e+09
node_cpu_scaling_frequency_max_hertz{cpu="2"} 2e+09
node_cpu_scaling_frequency_max_hertz{cpu="3"} 2e+09
# HELP node_cpu_scaling_frequency_min_hertz Minimum scaled CPU thread frequency in hertz.
# TYPE node_cpu_scaling_frequency_min_hertz gauge
node_cpu_scaling_frequency_min_hertz{cpu="0"} 6e+08
node_cpu_scaling_frequency_min_hertz{cpu="1"} 6e+08
node_cpu_scaling_frequency_min_hertz{cpu="2"} 6e+08
node_cpu_scaling_frequency_min_hertz{cpu="3"} 6e+08
# HELP node_cpu_scaling_governor Current enabled CPU frequency governor.
# TYPE node_cpu_scaling_governor gauge
node_cpu_scaling_governor{cpu="0",governor="conservative"} 0
node_cpu_scaling_governor{cpu="0",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="0",governor="performance"} 0
node_cpu_scaling_governor{cpu="0",governor="powersave"} 0
node_cpu_scaling_governor{cpu="0",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="0",governor="userspace"} 0
node_cpu_scaling_governor{cpu="1",governor="conservative"} 0
node_cpu_scaling_governor{cpu="1",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="1",governor="performance"} 0
node_cpu_scaling_governor{cpu="1",governor="powersave"} 0
node_cpu_scaling_governor{cpu="1",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="1",governor="userspace"} 0
node_cpu_scaling_governor{cpu="2",governor="conservative"} 0
node_cpu_scaling_governor{cpu="2",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="2",governor="performance"} 0
node_cpu_scaling_governor{cpu="2",governor="powersave"} 0
node_cpu_scaling_governor{cpu="2",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="2",governor="userspace"} 0
node_cpu_scaling_governor{cpu="3",governor="conservative"} 0
node_cpu_scaling_governor{cpu="3",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="3",governor="performance"} 0
node_cpu_scaling_governor{cpu="3",governor="powersave"} 0
node_cpu_scaling_governor{cpu="3",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="3",governor="userspace"} 0
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 2.68818165e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 8376.2
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 64.64
node_cpu_seconds_total{cpu="0",mode="softirq"} 17095.42
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 69354.3
node_cpu_seconds_total{cpu="0",mode="user"} 100985.22
node_cpu_seconds_total{cpu="1",mode="idle"} 2.70092994e+06
node_cpu_seconds_total{cpu="1",mode="iowait"} 10578.32
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 61.07
node_cpu_seconds_total{cpu="1",mode="softirq"} 3442.94
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 72718.57
node_cpu_seconds_total{cpu="1",mode="user"} 112849.28
node_cpu_seconds_total{cpu="2",mode="idle"} 2.70036651e+06
node_cpu_seconds_total{cpu="2",mode="iowait"} 10596.56
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 44.05
node_cpu_seconds_total{cpu="2",mode="softirq"} 3462.77
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 73257.94
node_cpu_seconds_total{cpu="2",mode="user"} 112932.46
node_cpu_seconds_total{cpu="3",mode="idle"} 2.7039725e+06
node_cpu_seconds_total{cpu="3",mode="iowait"} 10525.98
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 56.42
node_cpu_seconds_total{cpu="3",mode="softirq"} 3434.8
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 71924.93
node_cpu_seconds_total{cpu="3",mode="user"} 111615.13
# HELP node_disk_discard_time_seconds_total This is the total number of seconds spent by all discards.
# TYPE node_disk_discard_time_seconds_total counter
node_disk_discard_time_seconds_total{device="mmcblk0"} 6.008
node_disk_discard_time_seconds_total{device="mmcblk0p1"} 0.11800000000000001
node_disk_discard_time_seconds_total{device="mmcblk0p2"} 5.889
# HELP node_disk_discarded_sectors_total The total number of sectors discarded successfully.
# TYPE node_disk_discarded_sectors_total counter
node_disk_discarded_sectors_total{device="mmcblk0"} 2.7187894e+08
node_disk_discarded_sectors_total{device="mmcblk0p1"} 4.57802e+06
node_disk_discarded_sectors_total{device="mmcblk0p2"} 2.6730092e+08
# HELP node_disk_discards_completed_total The total number of discards completed successfully.
# TYPE node_disk_discards_completed_total counter
node_disk_discards_completed_total{device="mmcblk0"} 1330
node_disk_discards_completed_total{device="mmcblk0p1"} 20
node_disk_discards_completed_total{device="mmcblk0p2"} 1310
# HELP node_disk_discards_merged_total The total number of discards merged.
# TYPE node_disk_discards_merged_total counter
node_disk_discards_merged_total{device="mmcblk0"} 306
node_disk_discards_merged_total{device="mmcblk0p1"} 20
node_disk_discards_merged_total{device="mmcblk0p2"} 286
# HELP node_disk_filesystem_info Info about disk filesystem.
# TYPE node_disk_filesystem_info gauge
node_disk_filesystem_info{device="mmcblk0p1",type="vfat",usage="filesystem",uuid="5DF9-E225",version="FAT32"} 1
node_disk_filesystem_info{device="mmcblk0p2",type="ext4",usage="filesystem",uuid="3b614a3f-4a65-4480-876a-8a998e01ac9b",version="1.0"} 1
# HELP node_disk_flush_requests_time_seconds_total This is the total number of seconds spent by all flush requests.
# TYPE node_disk_flush_requests_time_seconds_total counter
node_disk_flush_requests_time_seconds_total{device="mmcblk0"} 4597.003
node_disk_flush_requests_time_seconds_total{device="mmcblk0p1"} 0
node_disk_flush_requests_time_seconds_total{device="mmcblk0p2"} 0
# HELP node_disk_flush_requests_total The total number of flush requests completed successfully
# TYPE node_disk_flush_requests_total counter
node_disk_flush_requests_total{device="mmcblk0"} 2.0808855e+07
node_disk_flush_requests_total{device="mmcblk0p1"} 0
node_disk_flush_requests_total{device="mmcblk0p2"} 0
# HELP node_disk_info Info of /sys/block/<block_device>.
# TYPE node_disk_info gauge
node_disk_info{device="mmcblk0",major="179",minor="0",model="",path="platform-fe340000.mmc",revision="",serial="",wwn=""} 1
node_disk_info{device="mmcblk0p1",major="179",minor="1",model="",path="platform-fe340000.mmc",revision="",serial="",wwn=""} 1
node_disk_info{device="mmcblk0p2",major="179",minor="2",model="",path="platform-fe340000.mmc",revision="",serial="",wwn=""} 1
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="mmcblk0"} 0
node_disk_io_now{device="mmcblk0p1"} 0
node_disk_io_now{device="mmcblk0p2"} 0
# HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="mmcblk0"} 109481.804
node_disk_io_time_seconds_total{device="mmcblk0p1"} 4.172
node_disk_io_time_seconds_total{device="mmcblk0p2"} 109479.144
# HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
# TYPE node_disk_io_time_weighted_seconds_total counter
node_disk_io_time_weighted_seconds_total{device="mmcblk0"} 254357.374
node_disk_io_time_weighted_seconds_total{device="mmcblk0p1"} 168.897
node_disk_io_time_weighted_seconds_total{device="mmcblk0p2"} 249591.36000000002
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="mmcblk0"} 1.142326272e+09
node_disk_read_bytes_total{device="mmcblk0p1"} 8.704e+06
node_disk_read_bytes_total{device="mmcblk0p2"} 1.132397568e+09
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="mmcblk0"} 72.763
node_disk_read_time_seconds_total{device="mmcblk0p1"} 0.8140000000000001
node_disk_read_time_seconds_total{device="mmcblk0p2"} 71.888
# HELP node_disk_reads_completed_total The total number of reads completed successfully.
# TYPE node_disk_reads_completed_total counter
node_disk_reads_completed_total{device="mmcblk0"} 26194
node_disk_reads_completed_total{device="mmcblk0p1"} 234
node_disk_reads_completed_total{device="mmcblk0p2"} 25885
# HELP node_disk_reads_merged_total The total number of reads merged.
# TYPE node_disk_reads_merged_total counter
node_disk_reads_merged_total{device="mmcblk0"} 4740
node_disk_reads_merged_total{device="mmcblk0p1"} 1119
node_disk_reads_merged_total{device="mmcblk0p2"} 3621
# HELP node_disk_write_time_seconds_total This is the total number of seconds spent by all writes.
# TYPE node_disk_write_time_seconds_total counter
node_disk_write_time_seconds_total{device="mmcblk0"} 249681.59900000002
node_disk_write_time_seconds_total{device="mmcblk0p1"} 167.964
node_disk_write_time_seconds_total{device="mmcblk0p2"} 249513.581
# HELP node_disk_writes_completed_total The total number of writes completed successfully.
# TYPE node_disk_writes_completed_total counter
node_disk_writes_completed_total{device="mmcblk0"} 6.356576e+07
node_disk_writes_completed_total{device="mmcblk0p1"} 749
node_disk_writes_completed_total{device="mmcblk0p2"} 6.3564908e+07
# HELP node_disk_writes_merged_total The number of writes merged.
# TYPE node_disk_writes_merged_total counter
node_disk_writes_merged_total{device="mmcblk0"} 9.074629e+06
node_disk_writes_merged_total{device="mmcblk0p1"} 1554
node_disk_writes_merged_total{device="mmcblk0p2"} 9.073075e+06
# HELP node_disk_written_bytes_total The total number of bytes written successfully.
# TYPE node_disk_written_bytes_total counter
node_disk_written_bytes_total{device="mmcblk0"} 2.61909222912e+11
node_disk_written_bytes_total{device="mmcblk0p1"} 8.3293696e+07
node_disk_written_bytes_total{device="mmcblk0p2"} 2.61825929216e+11
# HELP node_entropy_available_bits Bits of available entropy.
# TYPE node_entropy_available_bits gauge
node_entropy_available_bits 256
# HELP node_entropy_pool_size_bits Bits of entropy pool.
# TYPE node_entropy_pool_size_bits gauge
node_entropy_pool_size_bits 256
# HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which node_exporter was built, and the goos and goarch for the build.
# TYPE node_exporter_build_info gauge
node_exporter_build_info{branch="HEAD",goarch="arm64",goos="linux",goversion="go1.21.4",revision="7333465abf9efba81876303bb57e6fadb946041b",tags="netgo osusergo static_build",version="1.7.0"} 1
# HELP node_filefd_allocated File descriptor statistics: allocated.
# TYPE node_filefd_allocated gauge
node_filefd_allocated 2080
# HELP node_filefd_maximum File descriptor statistics: maximum.
# TYPE node_filefd_maximum gauge
node_filefd_maximum 9.223372036854776e+18
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 4.6998528e+08
node_filesystem_avail_bytes{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 1.12564281344e+11
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 8.16193536e+08
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.226496e+06
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 8.19064832e+08
# HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
# TYPE node_filesystem_device_error gauge
node_filesystem_device_error{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_device_error{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/0e619376-d2fe-4f79-bc74-64fe5b3c8232/volumes/kubernetes.io~projected/kube-api-access-2f5p9"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/61bde493-1e9f-4d4d-b77f-1df095f775c4/volumes/kubernetes.io~projected/kube-api-access-rdrm2"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/878c2007-167d-4437-b654-43ef9cc0a5f0/volumes/kubernetes.io~projected/kube-api-access-j5fzh"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/9660f563-0f88-41aa-9d38-654911a04158/volumes/kubernetes.io~projected/kube-api-access-n494p"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/c008ef0e-9212-42ce-9a34-6ccaf6b087d1/volumes/kubernetes.io~projected/kube-api-access-9c8sx"} 1
# HELP node_filesystem_files Filesystem total file nodes.
# TYPE node_filesystem_files gauge
node_filesystem_files{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_files{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 7.500896e+06
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 999839
node_filesystem_files{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 999839
node_filesystem_files{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 999839
node_filesystem_files{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 199967
# HELP node_filesystem_files_free Filesystem total free file nodes.
# TYPE node_filesystem_files_free gauge
node_filesystem_files_free{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_files_free{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 7.421624e+06
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 999838
node_filesystem_files_free{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 998519
node_filesystem_files_free{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 999833
node_filesystem_files_free{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 199947
# HELP node_filesystem_free_bytes Filesystem free space in bytes.
# TYPE node_filesystem_free_bytes gauge
node_filesystem_free_bytes{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 4.6998528e+08
node_filesystem_free_bytes{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 1.18947086336e+11
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 8.16193536e+08
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.226496e+06
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 8.19064832e+08
# HELP node_filesystem_readonly Filesystem read-only status.
# TYPE node_filesystem_readonly gauge
node_filesystem_readonly{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_readonly{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/0e619376-d2fe-4f79-bc74-64fe5b3c8232/volumes/kubernetes.io~projected/kube-api-access-2f5p9"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/61bde493-1e9f-4d4d-b77f-1df095f775c4/volumes/kubernetes.io~projected/kube-api-access-rdrm2"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/878c2007-167d-4437-b654-43ef9cc0a5f0/volumes/kubernetes.io~projected/kube-api-access-j5fzh"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/9660f563-0f88-41aa-9d38-654911a04158/volumes/kubernetes.io~projected/kube-api-access-n494p"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/c008ef0e-9212-42ce-9a34-6ccaf6b087d1/volumes/kubernetes.io~projected/kube-api-access-9c8sx"} 0
# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 5.34765568e+08
node_filesystem_size_bytes{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 1.25321166848e+11
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 8.19068928e+08
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.24288e+06
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 8.19064832e+08
# HELP node_forks_total Total number of forks.
# TYPE node_forks_total counter
node_forks_total 1.9002994e+07
# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="platform_gpio_fan_0",chip_name="gpio_fan"} 1
node_hwmon_chip_names{chip="soc:firmware_raspberrypi_hwmon",chip_name="rpi_volt"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1
# HELP node_hwmon_fan_max_rpm Hardware monitor for fan revolutions per minute (max)
# TYPE node_hwmon_fan_max_rpm gauge
node_hwmon_fan_max_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 5000
# HELP node_hwmon_fan_min_rpm Hardware monitor for fan revolutions per minute (min)
# TYPE node_hwmon_fan_min_rpm gauge
node_hwmon_fan_min_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 0
# HELP node_hwmon_fan_rpm Hardware monitor for fan revolutions per minute (input)
# TYPE node_hwmon_fan_rpm gauge
node_hwmon_fan_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 5000
# HELP node_hwmon_fan_target_rpm Hardware monitor for fan revolutions per minute (target)
# TYPE node_hwmon_fan_target_rpm gauge
node_hwmon_fan_target_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 5000
# HELP node_hwmon_in_lcrit_alarm_volts Hardware monitor for voltage (lcrit_alarm)
# TYPE node_hwmon_in_lcrit_alarm_volts gauge
node_hwmon_in_lcrit_alarm_volts{chip="soc:firmware_raspberrypi_hwmon",sensor="in0"} 0
# HELP node_hwmon_pwm Hardware monitor pwm element
# TYPE node_hwmon_pwm gauge
node_hwmon_pwm{chip="platform_gpio_fan_0",sensor="pwm1"} 255
# HELP node_hwmon_pwm_enable Hardware monitor pwm element enable
# TYPE node_hwmon_pwm_enable gauge
node_hwmon_pwm_enable{chip="platform_gpio_fan_0",sensor="pwm1"} 1
# HELP node_hwmon_pwm_mode Hardware monitor pwm element mode
# TYPE node_hwmon_pwm_mode gauge
node_hwmon_pwm_mode{chip="platform_gpio_fan_0",sensor="pwm1"} 0
# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp0"} 27.745
node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"} 27.745
# HELP node_hwmon_temp_crit_celsius Hardware monitor for temperature (crit)
# TYPE node_hwmon_temp_crit_celsius gauge
node_hwmon_temp_crit_celsius{chip="thermal_thermal_zone0",sensor="temp1"} 110
# HELP node_intr_total Total number of interrupts serviced.
# TYPE node_intr_total counter
node_intr_total 1.0312668562e+10
# HELP node_ipvs_connections_total The total number of connections made.
# TYPE node_ipvs_connections_total counter
node_ipvs_connections_total 2907
# HELP node_ipvs_incoming_bytes_total The total amount of incoming data.
# TYPE node_ipvs_incoming_bytes_total counter
node_ipvs_incoming_bytes_total 2.77474522e+08
# HELP node_ipvs_incoming_packets_total The total number of incoming packets.
# TYPE node_ipvs_incoming_packets_total counter
node_ipvs_incoming_packets_total 3.761541e+06
# HELP node_ipvs_outgoing_bytes_total The total amount of outgoing data.
# TYPE node_ipvs_outgoing_bytes_total counter
node_ipvs_outgoing_bytes_total 7.406631703e+09
# HELP node_ipvs_outgoing_packets_total The total number of outgoing packets.
# TYPE node_ipvs_outgoing_packets_total counter
node_ipvs_outgoing_packets_total 4.224817e+06
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.87
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.63
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.58
# HELP node_memory_Active_anon_bytes Memory information field Active_anon_bytes.
# TYPE node_memory_Active_anon_bytes gauge
node_memory_Active_anon_bytes 1.043009536e+09
# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 1.62168832e+09
# HELP node_memory_Active_file_bytes Memory information field Active_file_bytes.
# TYPE node_memory_Active_file_bytes gauge
node_memory_Active_file_bytes 5.78678784e+08
# HELP node_memory_AnonPages_bytes Memory information field AnonPages_bytes.
# TYPE node_memory_AnonPages_bytes gauge
node_memory_AnonPages_bytes 1.043357696e+09
# HELP node_memory_Bounce_bytes Memory information field Bounce_bytes.
# TYPE node_memory_Bounce_bytes gauge
node_memory_Bounce_bytes 0
# HELP node_memory_Buffers_bytes Memory information field Buffers_bytes.
# TYPE node_memory_Buffers_bytes gauge
node_memory_Buffers_bytes 1.36790016e+08
# HELP node_memory_Cached_bytes Memory information field Cached_bytes.
# TYPE node_memory_Cached_bytes gauge
node_memory_Cached_bytes 4.609712128e+09
# HELP node_memory_CmaFree_bytes Memory information field CmaFree_bytes.
# TYPE node_memory_CmaFree_bytes gauge
node_memory_CmaFree_bytes 5.25586432e+08
# HELP node_memory_CmaTotal_bytes Memory information field CmaTotal_bytes.
# TYPE node_memory_CmaTotal_bytes gauge
node_memory_CmaTotal_bytes 5.36870912e+08
# HELP node_memory_CommitLimit_bytes Memory information field CommitLimit_bytes.
# TYPE node_memory_CommitLimit_bytes gauge
node_memory_CommitLimit_bytes 4.095340544e+09
# HELP node_memory_Committed_AS_bytes Memory information field Committed_AS_bytes.
# TYPE node_memory_Committed_AS_bytes gauge
node_memory_Committed_AS_bytes 3.449647104e+09
# HELP node_memory_Dirty_bytes Memory information field Dirty_bytes.
# TYPE node_memory_Dirty_bytes gauge
node_memory_Dirty_bytes 65536
# HELP node_memory_Inactive_anon_bytes Memory information field Inactive_anon_bytes.
# TYPE node_memory_Inactive_anon_bytes gauge
node_memory_Inactive_anon_bytes 3.25632e+06
# HELP node_memory_Inactive_bytes Memory information field Inactive_bytes.
# TYPE node_memory_Inactive_bytes gauge
node_memory_Inactive_bytes 4.168126464e+09
# HELP node_memory_Inactive_file_bytes Memory information field Inactive_file_bytes.
# TYPE node_memory_Inactive_file_bytes gauge
node_memory_Inactive_file_bytes 4.164870144e+09
# HELP node_memory_KReclaimable_bytes Memory information field KReclaimable_bytes.
# TYPE node_memory_KReclaimable_bytes gauge
node_memory_KReclaimable_bytes 4.01215488e+08
# HELP node_memory_KernelStack_bytes Memory information field KernelStack_bytes.
# TYPE node_memory_KernelStack_bytes gauge
node_memory_KernelStack_bytes 8.667136e+06
# HELP node_memory_Mapped_bytes Memory information field Mapped_bytes.
# TYPE node_memory_Mapped_bytes gauge
node_memory_Mapped_bytes 6.4243712e+08
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 6.829756416e+09
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1.837809664e+09
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.190685184e+09
# HELP node_memory_Mlocked_bytes Memory information field Mlocked_bytes.
# TYPE node_memory_Mlocked_bytes gauge
node_memory_Mlocked_bytes 0
# HELP node_memory_NFS_Unstable_bytes Memory information field NFS_Unstable_bytes.
# TYPE node_memory_NFS_Unstable_bytes gauge
node_memory_NFS_Unstable_bytes 0
# HELP node_memory_PageTables_bytes Memory information field PageTables_bytes.
# TYPE node_memory_PageTables_bytes gauge
node_memory_PageTables_bytes 1.128448e+07
# HELP node_memory_Percpu_bytes Memory information field Percpu_bytes.
# TYPE node_memory_Percpu_bytes gauge
node_memory_Percpu_bytes 3.52256e+06
# HELP node_memory_SReclaimable_bytes Memory information field SReclaimable_bytes.
# TYPE node_memory_SReclaimable_bytes gauge
node_memory_SReclaimable_bytes 4.01215488e+08
# HELP node_memory_SUnreclaim_bytes Memory information field SUnreclaim_bytes.
# TYPE node_memory_SUnreclaim_bytes gauge
node_memory_SUnreclaim_bytes 8.0576512e+07
# HELP node_memory_SecPageTables_bytes Memory information field SecPageTables_bytes.
# TYPE node_memory_SecPageTables_bytes gauge
node_memory_SecPageTables_bytes 0
# HELP node_memory_Shmem_bytes Memory information field Shmem_bytes.
# TYPE node_memory_Shmem_bytes gauge
node_memory_Shmem_bytes 2.953216e+06
# HELP node_memory_Slab_bytes Memory information field Slab_bytes.
# TYPE node_memory_Slab_bytes gauge
node_memory_Slab_bytes 4.81792e+08
# HELP node_memory_SwapCached_bytes Memory information field SwapCached_bytes.
# TYPE node_memory_SwapCached_bytes gauge
node_memory_SwapCached_bytes 0
# HELP node_memory_SwapFree_bytes Memory information field SwapFree_bytes.
# TYPE node_memory_SwapFree_bytes gauge
node_memory_SwapFree_bytes 0
# HELP node_memory_SwapTotal_bytes Memory information field SwapTotal_bytes.
# TYPE node_memory_SwapTotal_bytes gauge
node_memory_SwapTotal_bytes 0
# HELP node_memory_Unevictable_bytes Memory information field Unevictable_bytes.
# TYPE node_memory_Unevictable_bytes gauge
node_memory_Unevictable_bytes 0
# HELP node_memory_VmallocChunk_bytes Memory information field VmallocChunk_bytes.
# TYPE node_memory_VmallocChunk_bytes gauge
node_memory_VmallocChunk_bytes 0
# HELP node_memory_VmallocTotal_bytes Memory information field VmallocTotal_bytes.
# TYPE node_memory_VmallocTotal_bytes gauge
node_memory_VmallocTotal_bytes 2.65885319168e+11
# HELP node_memory_VmallocUsed_bytes Memory information field VmallocUsed_bytes.
# TYPE node_memory_VmallocUsed_bytes gauge
node_memory_VmallocUsed_bytes 2.3687168e+07
# HELP node_memory_WritebackTmp_bytes Memory information field WritebackTmp_bytes.
# TYPE node_memory_WritebackTmp_bytes gauge
node_memory_WritebackTmp_bytes 0
# HELP node_memory_Writeback_bytes Memory information field Writeback_bytes.
# TYPE node_memory_Writeback_bytes gauge
node_memory_Writeback_bytes 0
# HELP node_memory_Zswap_bytes Memory information field Zswap_bytes.
# TYPE node_memory_Zswap_bytes gauge
node_memory_Zswap_bytes 0
# HELP node_memory_Zswapped_bytes Memory information field Zswapped_bytes.
# TYPE node_memory_Zswapped_bytes gauge
node_memory_Zswapped_bytes 0
# HELP node_netstat_Icmp6_InErrors Statistic Icmp6InErrors.
# TYPE node_netstat_Icmp6_InErrors untyped
node_netstat_Icmp6_InErrors 0
# HELP node_netstat_Icmp6_InMsgs Statistic Icmp6InMsgs.
# TYPE node_netstat_Icmp6_InMsgs untyped
node_netstat_Icmp6_InMsgs 2
# HELP node_netstat_Icmp6_OutMsgs Statistic Icmp6OutMsgs.
# TYPE node_netstat_Icmp6_OutMsgs untyped
node_netstat_Icmp6_OutMsgs 1601
# HELP node_netstat_Icmp_InErrors Statistic IcmpInErrors.
# TYPE node_netstat_Icmp_InErrors untyped
node_netstat_Icmp_InErrors 1
# HELP node_netstat_Icmp_InMsgs Statistic IcmpInMsgs.
# TYPE node_netstat_Icmp_InMsgs untyped
node_netstat_Icmp_InMsgs 17
# HELP node_netstat_Icmp_OutMsgs Statistic IcmpOutMsgs.
# TYPE node_netstat_Icmp_OutMsgs untyped
node_netstat_Icmp_OutMsgs 14
# HELP node_netstat_Ip6_InOctets Statistic Ip6InOctets.
# TYPE node_netstat_Ip6_InOctets untyped
node_netstat_Ip6_InOctets 3.997070725e+09
# HELP node_netstat_Ip6_OutOctets Statistic Ip6OutOctets.
# TYPE node_netstat_Ip6_OutOctets untyped
node_netstat_Ip6_OutOctets 3.997073515e+09
# HELP node_netstat_IpExt_InOctets Statistic IpExtInOctets.
# TYPE node_netstat_IpExt_InOctets untyped
node_netstat_IpExt_InOctets 1.08144717251e+11
# HELP node_netstat_IpExt_OutOctets Statistic IpExtOutOctets.
# TYPE node_netstat_IpExt_OutOctets untyped
node_netstat_IpExt_OutOctets 1.56294035787e+11
# HELP node_netstat_Ip_Forwarding Statistic IpForwarding.
# TYPE node_netstat_Ip_Forwarding untyped
node_netstat_Ip_Forwarding 1
# HELP node_netstat_TcpExt_ListenDrops Statistic TcpExtListenDrops.
# TYPE node_netstat_TcpExt_ListenDrops untyped
node_netstat_TcpExt_ListenDrops 0
# HELP node_netstat_TcpExt_ListenOverflows Statistic TcpExtListenOverflows.
# TYPE node_netstat_TcpExt_ListenOverflows untyped
node_netstat_TcpExt_ListenOverflows 0
# HELP node_netstat_TcpExt_SyncookiesFailed Statistic TcpExtSyncookiesFailed.
# TYPE node_netstat_TcpExt_SyncookiesFailed untyped
node_netstat_TcpExt_SyncookiesFailed 0
# HELP node_netstat_TcpExt_SyncookiesRecv Statistic TcpExtSyncookiesRecv.
# TYPE node_netstat_TcpExt_SyncookiesRecv untyped
node_netstat_TcpExt_SyncookiesRecv 0
# HELP node_netstat_TcpExt_SyncookiesSent Statistic TcpExtSyncookiesSent.
# TYPE node_netstat_TcpExt_SyncookiesSent untyped
node_netstat_TcpExt_SyncookiesSent 0
# HELP node_netstat_TcpExt_TCPSynRetrans Statistic TcpExtTCPSynRetrans.
# TYPE node_netstat_TcpExt_TCPSynRetrans untyped
node_netstat_TcpExt_TCPSynRetrans 342
# HELP node_netstat_TcpExt_TCPTimeouts Statistic TcpExtTCPTimeouts.
# TYPE node_netstat_TcpExt_TCPTimeouts untyped
node_netstat_TcpExt_TCPTimeouts 513
# HELP node_netstat_Tcp_ActiveOpens Statistic TcpActiveOpens.
# TYPE node_netstat_Tcp_ActiveOpens untyped
node_netstat_Tcp_ActiveOpens 7.121624e+06
# HELP node_netstat_Tcp_CurrEstab Statistic TcpCurrEstab.
# TYPE node_netstat_Tcp_CurrEstab untyped
node_netstat_Tcp_CurrEstab 236
# HELP node_netstat_Tcp_InErrs Statistic TcpInErrs.
# TYPE node_netstat_Tcp_InErrs untyped
node_netstat_Tcp_InErrs 0
# HELP node_netstat_Tcp_InSegs Statistic TcpInSegs.
# TYPE node_netstat_Tcp_InSegs untyped
node_netstat_Tcp_InSegs 5.82648533e+08
# HELP node_netstat_Tcp_OutRsts Statistic TcpOutRsts.
# TYPE node_netstat_Tcp_OutRsts untyped
node_netstat_Tcp_OutRsts 5.798397e+06
# HELP node_netstat_Tcp_OutSegs Statistic TcpOutSegs.
# TYPE node_netstat_Tcp_OutSegs untyped
node_netstat_Tcp_OutSegs 6.13524809e+08
# HELP node_netstat_Tcp_PassiveOpens Statistic TcpPassiveOpens.
# TYPE node_netstat_Tcp_PassiveOpens untyped
node_netstat_Tcp_PassiveOpens 6.751246e+06
# HELP node_netstat_Tcp_RetransSegs Statistic TcpRetransSegs.
# TYPE node_netstat_Tcp_RetransSegs untyped
node_netstat_Tcp_RetransSegs 173853
# HELP node_netstat_Udp6_InDatagrams Statistic Udp6InDatagrams.
# TYPE node_netstat_Udp6_InDatagrams untyped
node_netstat_Udp6_InDatagrams 279
# HELP node_netstat_Udp6_InErrors Statistic Udp6InErrors.
# TYPE node_netstat_Udp6_InErrors untyped
node_netstat_Udp6_InErrors 0
# HELP node_netstat_Udp6_NoPorts Statistic Udp6NoPorts.
# TYPE node_netstat_Udp6_NoPorts untyped
node_netstat_Udp6_NoPorts 0
# HELP node_netstat_Udp6_OutDatagrams Statistic Udp6OutDatagrams.
# TYPE node_netstat_Udp6_OutDatagrams untyped
node_netstat_Udp6_OutDatagrams 236
# HELP node_netstat_Udp6_RcvbufErrors Statistic Udp6RcvbufErrors.
# TYPE node_netstat_Udp6_RcvbufErrors untyped
node_netstat_Udp6_RcvbufErrors 0
# HELP node_netstat_Udp6_SndbufErrors Statistic Udp6SndbufErrors.
# TYPE node_netstat_Udp6_SndbufErrors untyped
node_netstat_Udp6_SndbufErrors 0
# HELP node_netstat_UdpLite6_InErrors Statistic UdpLite6InErrors.
# TYPE node_netstat_UdpLite6_InErrors untyped
node_netstat_UdpLite6_InErrors 0
# HELP node_netstat_UdpLite_InErrors Statistic UdpLiteInErrors.
# TYPE node_netstat_UdpLite_InErrors untyped
node_netstat_UdpLite_InErrors 0
# HELP node_netstat_Udp_InDatagrams Statistic UdpInDatagrams.
# TYPE node_netstat_Udp_InDatagrams untyped
node_netstat_Udp_InDatagrams 6.547468e+06
# HELP node_netstat_Udp_InErrors Statistic UdpInErrors.
# TYPE node_netstat_Udp_InErrors untyped
node_netstat_Udp_InErrors 0
# HELP node_netstat_Udp_NoPorts Statistic UdpNoPorts.
# TYPE node_netstat_Udp_NoPorts untyped
node_netstat_Udp_NoPorts 9
# HELP node_netstat_Udp_OutDatagrams Statistic UdpOutDatagrams.
# TYPE node_netstat_Udp_OutDatagrams untyped
node_netstat_Udp_OutDatagrams 3.213419e+06
# HELP node_netstat_Udp_RcvbufErrors Statistic UdpRcvbufErrors.
# TYPE node_netstat_Udp_RcvbufErrors untyped
node_netstat_Udp_RcvbufErrors 0
# HELP node_netstat_Udp_SndbufErrors Statistic UdpSndbufErrors.
# TYPE node_netstat_Udp_SndbufErrors untyped
node_netstat_Udp_SndbufErrors 0
# HELP node_network_address_assign_type Network device property: address_assign_type
# TYPE node_network_address_assign_type gauge
node_network_address_assign_type{device="cali60e575ce8db"} 3
node_network_address_assign_type{device="cali85a56337055"} 3
node_network_address_assign_type{device="cali8c459f6702e"} 3
node_network_address_assign_type{device="eth0"} 0
node_network_address_assign_type{device="lo"} 0
node_network_address_assign_type{device="tunl0"} 0
node_network_address_assign_type{device="wlan0"} 0
# HELP node_network_carrier Network device property: carrier
# TYPE node_network_carrier gauge
node_network_carrier{device="cali60e575ce8db"} 1
node_network_carrier{device="cali85a56337055"} 1
node_network_carrier{device="cali8c459f6702e"} 1
node_network_carrier{device="eth0"} 1
node_network_carrier{device="lo"} 1
node_network_carrier{device="tunl0"} 1
node_network_carrier{device="wlan0"} 0
# HELP node_network_carrier_changes_total Network device property: carrier_changes_total
# TYPE node_network_carrier_changes_total counter
node_network_carrier_changes_total{device="cali60e575ce8db"} 4
node_network_carrier_changes_total{device="cali85a56337055"} 4
node_network_carrier_changes_total{device="cali8c459f6702e"} 4
node_network_carrier_changes_total{device="eth0"} 1
node_network_carrier_changes_total{device="lo"} 0
node_network_carrier_changes_total{device="tunl0"} 0
node_network_carrier_changes_total{device="wlan0"} 1
# HELP node_network_carrier_down_changes_total Network device property: carrier_down_changes_total
# TYPE node_network_carrier_down_changes_total counter
node_network_carrier_down_changes_total{device="cali60e575ce8db"} 2
node_network_carrier_down_changes_total{device="cali85a56337055"} 2
node_network_carrier_down_changes_total{device="cali8c459f6702e"} 2
node_network_carrier_down_changes_total{device="eth0"} 0
node_network_carrier_down_changes_total{device="lo"} 0
node_network_carrier_down_changes_total{device="tunl0"} 0
node_network_carrier_down_changes_total{device="wlan0"} 1
# HELP node_network_carrier_up_changes_total Network device property: carrier_up_changes_total
# TYPE node_network_carrier_up_changes_total counter
node_network_carrier_up_changes_total{device="cali60e575ce8db"} 2
node_network_carrier_up_changes_total{device="cali85a56337055"} 2
node_network_carrier_up_changes_total{device="cali8c459f6702e"} 2
node_network_carrier_up_changes_total{device="eth0"} 1
node_network_carrier_up_changes_total{device="lo"} 0
node_network_carrier_up_changes_total{device="tunl0"} 0
node_network_carrier_up_changes_total{device="wlan0"} 0
# HELP node_network_device_id Network device property: device_id
# TYPE node_network_device_id gauge
node_network_device_id{device="cali60e575ce8db"} 0
node_network_device_id{device="cali85a56337055"} 0
node_network_device_id{device="cali8c459f6702e"} 0
node_network_device_id{device="eth0"} 0
node_network_device_id{device="lo"} 0
node_network_device_id{device="tunl0"} 0
node_network_device_id{device="wlan0"} 0
# HELP node_network_dormant Network device property: dormant
# TYPE node_network_dormant gauge
node_network_dormant{device="cali60e575ce8db"} 0
node_network_dormant{device="cali85a56337055"} 0
node_network_dormant{device="cali8c459f6702e"} 0
node_network_dormant{device="eth0"} 0
node_network_dormant{device="lo"} 0
node_network_dormant{device="tunl0"} 0
node_network_dormant{device="wlan0"} 0
# HELP node_network_flags Network device property: flags
# TYPE node_network_flags gauge
node_network_flags{device="cali60e575ce8db"} 4099
node_network_flags{device="cali85a56337055"} 4099
node_network_flags{device="cali8c459f6702e"} 4099
node_network_flags{device="eth0"} 4099
node_network_flags{device="lo"} 9
node_network_flags{device="tunl0"} 129
node_network_flags{device="wlan0"} 4099
# HELP node_network_iface_id Network device property: iface_id
# TYPE node_network_iface_id gauge
node_network_iface_id{device="cali60e575ce8db"} 73
node_network_iface_id{device="cali85a56337055"} 74
node_network_iface_id{device="cali8c459f6702e"} 70
node_network_iface_id{device="eth0"} 2
node_network_iface_id{device="lo"} 1
node_network_iface_id{device="tunl0"} 18
node_network_iface_id{device="wlan0"} 3
# HELP node_network_iface_link Network device property: iface_link
# TYPE node_network_iface_link gauge
node_network_iface_link{device="cali60e575ce8db"} 4
node_network_iface_link{device="cali85a56337055"} 4
node_network_iface_link{device="cali8c459f6702e"} 4
node_network_iface_link{device="eth0"} 2
node_network_iface_link{device="lo"} 1
node_network_iface_link{device="tunl0"} 0
node_network_iface_link{device="wlan0"} 3
# HELP node_network_iface_link_mode Network device property: iface_link_mode
# TYPE node_network_iface_link_mode gauge
node_network_iface_link_mode{device="cali60e575ce8db"} 0
node_network_iface_link_mode{device="cali85a56337055"} 0
node_network_iface_link_mode{device="cali8c459f6702e"} 0
node_network_iface_link_mode{device="eth0"} 0
node_network_iface_link_mode{device="lo"} 0
node_network_iface_link_mode{device="tunl0"} 0
node_network_iface_link_mode{device="wlan0"} 1
# HELP node_network_info Non-numeric data from /sys/class/net/<iface>, value is always 1.
# TYPE node_network_info gauge
node_network_info{address="00:00:00:00",adminstate="up",broadcast="00:00:00:00",device="tunl0",duplex="",ifalias="",operstate="unknown"} 1
node_network_info{address="00:00:00:00:00:00",adminstate="up",broadcast="00:00:00:00:00:00",device="lo",duplex="",ifalias="",operstate="unknown"} 1
node_network_info{address="d8:3a:dd:89:c1:0b",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="eth0",duplex="full",ifalias="",operstate="up"} 1
node_network_info{address="d8:3a:dd:89:c1:0c",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="wlan0",duplex="",ifalias="",operstate="down"} 1
node_network_info{address="ee:ee:ee:ee:ee:ee",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="cali60e575ce8db",duplex="full",ifalias="",operstate="up"} 1
node_network_info{address="ee:ee:ee:ee:ee:ee",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="cali85a56337055",duplex="full",ifalias="",operstate="up"} 1
node_network_info{address="ee:ee:ee:ee:ee:ee",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="cali8c459f6702e",duplex="full",ifalias="",operstate="up"} 1
# HELP node_network_mtu_bytes Network device property: mtu_bytes
# TYPE node_network_mtu_bytes gauge
node_network_mtu_bytes{device="cali60e575ce8db"} 1480
node_network_mtu_bytes{device="cali85a56337055"} 1480
node_network_mtu_bytes{device="cali8c459f6702e"} 1480
node_network_mtu_bytes{device="eth0"} 1500
node_network_mtu_bytes{device="lo"} 65536
node_network_mtu_bytes{device="tunl0"} 1480
node_network_mtu_bytes{device="wlan0"} 1500
# HELP node_network_name_assign_type Network device property: name_assign_type
# TYPE node_network_name_assign_type gauge
node_network_name_assign_type{device="cali60e575ce8db"} 3
node_network_name_assign_type{device="cali85a56337055"} 3
node_network_name_assign_type{device="cali8c459f6702e"} 3
node_network_name_assign_type{device="eth0"} 1
node_network_name_assign_type{device="lo"} 2
# HELP node_network_net_dev_group Network device property: net_dev_group
# TYPE node_network_net_dev_group gauge
node_network_net_dev_group{device="cali60e575ce8db"} 0
node_network_net_dev_group{device="cali85a56337055"} 0
node_network_net_dev_group{device="cali8c459f6702e"} 0
node_network_net_dev_group{device="eth0"} 0
node_network_net_dev_group{device="lo"} 0
node_network_net_dev_group{device="tunl0"} 0
node_network_net_dev_group{device="wlan0"} 0
# HELP node_network_protocol_type Network device property: protocol_type
# TYPE node_network_protocol_type gauge
node_network_protocol_type{device="cali60e575ce8db"} 1
node_network_protocol_type{device="cali85a56337055"} 1
node_network_protocol_type{device="cali8c459f6702e"} 1
node_network_protocol_type{device="eth0"} 1
node_network_protocol_type{device="lo"} 772
node_network_protocol_type{device="tunl0"} 768
node_network_protocol_type{device="wlan0"} 1
# HELP node_network_receive_bytes_total Network device statistic receive_bytes.
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="cali60e575ce8db"} 6.800154e+07
node_network_receive_bytes_total{device="cali85a56337055"} 6.6751833e+07
node_network_receive_bytes_total{device="cali8c459f6702e"} 5.9727975e+07
node_network_receive_bytes_total{device="eth0"} 5.6372248596e+10
node_network_receive_bytes_total{device="lo"} 6.0342387372e+10
node_network_receive_bytes_total{device="tunl0"} 3.599596e+06
node_network_receive_bytes_total{device="wlan0"} 0
# HELP node_network_receive_compressed_total Network device statistic receive_compressed.
# TYPE node_network_receive_compressed_total counter
node_network_receive_compressed_total{device="cali60e575ce8db"} 0
node_network_receive_compressed_total{device="cali85a56337055"} 0
node_network_receive_compressed_total{device="cali8c459f6702e"} 0
node_network_receive_compressed_total{device="eth0"} 0
node_network_receive_compressed_total{device="lo"} 0
node_network_receive_compressed_total{device="tunl0"} 0
node_network_receive_compressed_total{device="wlan0"} 0
# HELP node_network_receive_drop_total Network device statistic receive_drop.
# TYPE node_network_receive_drop_total counter
node_network_receive_drop_total{device="cali60e575ce8db"} 1
node_network_receive_drop_total{device="cali85a56337055"} 1
node_network_receive_drop_total{device="cali8c459f6702e"} 1
node_network_receive_drop_total{device="eth0"} 0
node_network_receive_drop_total{device="lo"} 0
node_network_receive_drop_total{device="tunl0"} 0
node_network_receive_drop_total{device="wlan0"} 0
# HELP node_network_receive_errs_total Network device statistic receive_errs.
# TYPE node_network_receive_errs_total counter
node_network_receive_errs_total{device="cali60e575ce8db"} 0
node_network_receive_errs_total{device="cali85a56337055"} 0
node_network_receive_errs_total{device="cali8c459f6702e"} 0
node_network_receive_errs_total{device="eth0"} 0
node_network_receive_errs_total{device="lo"} 0
node_network_receive_errs_total{device="tunl0"} 0
node_network_receive_errs_total{device="wlan0"} 0
# HELP node_network_receive_fifo_total Network device statistic receive_fifo.
# TYPE node_network_receive_fifo_total counter
node_network_receive_fifo_total{device="cali60e575ce8db"} 0
node_network_receive_fifo_total{device="cali85a56337055"} 0
node_network_receive_fifo_total{device="cali8c459f6702e"} 0
node_network_receive_fifo_total{device="eth0"} 0
node_network_receive_fifo_total{device="lo"} 0
node_network_receive_fifo_total{device="tunl0"} 0
node_network_receive_fifo_total{device="wlan0"} 0
# HELP node_network_receive_frame_total Network device statistic receive_frame.
# TYPE node_network_receive_frame_total counter
node_network_receive_frame_total{device="cali60e575ce8db"} 0
node_network_receive_frame_total{device="cali85a56337055"} 0
node_network_receive_frame_total{device="cali8c459f6702e"} 0
node_network_receive_frame_total{device="eth0"} 0
node_network_receive_frame_total{device="lo"} 0
node_network_receive_frame_total{device="tunl0"} 0
node_network_receive_frame_total{device="wlan0"} 0
# HELP node_network_receive_multicast_total Network device statistic receive_multicast.
# TYPE node_network_receive_multicast_total counter
node_network_receive_multicast_total{device="cali60e575ce8db"} 0
node_network_receive_multicast_total{device="cali85a56337055"} 0
node_network_receive_multicast_total{device="cali8c459f6702e"} 0
node_network_receive_multicast_total{device="eth0"} 3.336362e+06
node_network_receive_multicast_total{device="lo"} 0
node_network_receive_multicast_total{device="tunl0"} 0
node_network_receive_multicast_total{device="wlan0"} 0
# HELP node_network_receive_nohandler_total Network device statistic receive_nohandler.
# TYPE node_network_receive_nohandler_total counter
node_network_receive_nohandler_total{device="cali60e575ce8db"} 0
node_network_receive_nohandler_total{device="cali85a56337055"} 0
node_network_receive_nohandler_total{device="cali8c459f6702e"} 0
node_network_receive_nohandler_total{device="eth0"} 0
node_network_receive_nohandler_total{device="lo"} 0
node_network_receive_nohandler_total{device="tunl0"} 0
node_network_receive_nohandler_total{device="wlan0"} 0
# HELP node_network_receive_packets_total Network device statistic receive_packets.
# TYPE node_network_receive_packets_total counter
node_network_receive_packets_total{device="cali60e575ce8db"} 800641
node_network_receive_packets_total{device="cali85a56337055"} 781891
node_network_receive_packets_total{device="cali8c459f6702e"} 680023
node_network_receive_packets_total{device="eth0"} 3.3310639e+08
node_network_receive_packets_total{device="lo"} 2.57029971e+08
node_network_receive_packets_total{device="tunl0"} 39699
node_network_receive_packets_total{device="wlan0"} 0
# HELP node_network_speed_bytes Network device property: speed_bytes
# TYPE node_network_speed_bytes gauge
node_network_speed_bytes{device="cali60e575ce8db"} 1.25e+09
node_network_speed_bytes{device="cali85a56337055"} 1.25e+09
node_network_speed_bytes{device="cali8c459f6702e"} 1.25e+09
node_network_speed_bytes{device="eth0"} 1.25e+08
# HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
# TYPE node_network_transmit_bytes_total counter
node_network_transmit_bytes_total{device="cali60e575ce8db"} 5.2804647e+07
node_network_transmit_bytes_total{device="cali85a56337055"} 5.4239763e+07
node_network_transmit_bytes_total{device="cali8c459f6702e"} 1.115901473e+09
node_network_transmit_bytes_total{device="eth0"} 1.02987658518e+11
node_network_transmit_bytes_total{device="lo"} 6.0342387372e+10
node_network_transmit_bytes_total{device="tunl0"} 8.407628e+06
node_network_transmit_bytes_total{device="wlan0"} 0
# HELP node_network_transmit_carrier_total Network device statistic transmit_carrier.
# TYPE node_network_transmit_carrier_total counter
node_network_transmit_carrier_total{device="cali60e575ce8db"} 0
node_network_transmit_carrier_total{device="cali85a56337055"} 0
node_network_transmit_carrier_total{device="cali8c459f6702e"} 0
node_network_transmit_carrier_total{device="eth0"} 0
node_network_transmit_carrier_total{device="lo"} 0
node_network_transmit_carrier_total{device="tunl0"} 0
node_network_transmit_carrier_total{device="wlan0"} 0
# HELP node_network_transmit_colls_total Network device statistic transmit_colls.
# TYPE node_network_transmit_colls_total counter
node_network_transmit_colls_total{device="cali60e575ce8db"} 0
node_network_transmit_colls_total{device="cali85a56337055"} 0
node_network_transmit_colls_total{device="cali8c459f6702e"} 0
node_network_transmit_colls_total{device="eth0"} 0
node_network_transmit_colls_total{device="lo"} 0
node_network_transmit_colls_total{device="tunl0"} 0
node_network_transmit_colls_total{device="wlan0"} 0
# HELP node_network_transmit_compressed_total Network device statistic transmit_compressed.
# TYPE node_network_transmit_compressed_total counter
node_network_transmit_compressed_total{device="cali60e575ce8db"} 0
node_network_transmit_compressed_total{device="cali85a56337055"} 0
node_network_transmit_compressed_total{device="cali8c459f6702e"} 0
node_network_transmit_compressed_total{device="eth0"} 0
node_network_transmit_compressed_total{device="lo"} 0
node_network_transmit_compressed_total{device="tunl0"} 0
node_network_transmit_compressed_total{device="wlan0"} 0
# HELP node_network_transmit_drop_total Network device statistic transmit_drop.
# TYPE node_network_transmit_drop_total counter
node_network_transmit_drop_total{device="cali60e575ce8db"} 0
node_network_transmit_drop_total{device="cali85a56337055"} 0
node_network_transmit_drop_total{device="cali8c459f6702e"} 0
node_network_transmit_drop_total{device="eth0"} 0
node_network_transmit_drop_total{device="lo"} 0
node_network_transmit_drop_total{device="tunl0"} 0
node_network_transmit_drop_total{device="wlan0"} 0
# HELP node_network_transmit_errs_total Network device statistic transmit_errs.
# TYPE node_network_transmit_errs_total counter
node_network_transmit_errs_total{device="cali60e575ce8db"} 0
node_network_transmit_errs_total{device="cali85a56337055"} 0
node_network_transmit_errs_total{device="cali8c459f6702e"} 0
node_network_transmit_errs_total{device="eth0"} 0
node_network_transmit_errs_total{device="lo"} 0
node_network_transmit_errs_total{device="tunl0"} 0
node_network_transmit_errs_total{device="wlan0"} 0
# HELP node_network_transmit_fifo_total Network device statistic transmit_fifo.
# TYPE node_network_transmit_fifo_total counter
node_network_transmit_fifo_total{device="cali60e575ce8db"} 0
node_network_transmit_fifo_total{device="cali85a56337055"} 0
node_network_transmit_fifo_total{device="cali8c459f6702e"} 0
node_network_transmit_fifo_total{device="eth0"} 0
node_network_transmit_fifo_total{device="lo"} 0
node_network_transmit_fifo_total{device="tunl0"} 0
node_network_transmit_fifo_total{device="wlan0"} 0
# HELP node_network_transmit_packets_total Network device statistic transmit_packets.
# TYPE node_network_transmit_packets_total counter
node_network_transmit_packets_total{device="cali60e575ce8db"} 560412
node_network_transmit_packets_total{device="cali85a56337055"} 582260
node_network_transmit_packets_total{device="cali8c459f6702e"} 733054
node_network_transmit_packets_total{device="eth0"} 3.54151866e+08
node_network_transmit_packets_total{device="lo"} 2.57029971e+08
node_network_transmit_packets_total{device="tunl0"} 39617
node_network_transmit_packets_total{device="wlan0"} 0
# HELP node_network_transmit_queue_length Network device property: transmit_queue_length
# TYPE node_network_transmit_queue_length gauge
node_network_transmit_queue_length{device="cali60e575ce8db"} 0
node_network_transmit_queue_length{device="cali85a56337055"} 0
node_network_transmit_queue_length{device="cali8c459f6702e"} 0
node_network_transmit_queue_length{device="eth0"} 1000
node_network_transmit_queue_length{device="lo"} 1000
node_network_transmit_queue_length{device="tunl0"} 1000
node_network_transmit_queue_length{device="wlan0"} 1000
# HELP node_network_up Value is 1 if operstate is 'up', 0 otherwise.
# TYPE node_network_up gauge
node_network_up{device="cali60e575ce8db"} 1
node_network_up{device="cali85a56337055"} 1
node_network_up{device="cali8c459f6702e"} 1
node_network_up{device="eth0"} 1
node_network_up{device="lo"} 0
node_network_up{device="tunl0"} 0
node_network_up{device="wlan0"} 0
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 474
# HELP node_nf_conntrack_entries_limit Maximum size of connection tracking table.
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 131072
# HELP node_nfs_connections_total Total number of NFSd TCP connections.
# TYPE node_nfs_connections_total counter
node_nfs_connections_total 0
# HELP node_nfs_packets_total Total NFSd network packets (sent+received) by protocol type.
# TYPE node_nfs_packets_total counter
node_nfs_packets_total{protocol="tcp"} 0
node_nfs_packets_total{protocol="udp"} 0
# HELP node_nfs_requests_total Number of NFS procedures invoked.
# TYPE node_nfs_requests_total counter
node_nfs_requests_total{method="Access",proto="3"} 0
node_nfs_requests_total{method="Access",proto="4"} 0
node_nfs_requests_total{method="Allocate",proto="4"} 0
node_nfs_requests_total{method="BindConnToSession",proto="4"} 0
node_nfs_requests_total{method="Clone",proto="4"} 0
node_nfs_requests_total{method="Close",proto="4"} 0
node_nfs_requests_total{method="Commit",proto="3"} 0
node_nfs_requests_total{method="Commit",proto="4"} 0
node_nfs_requests_total{method="Create",proto="2"} 0
node_nfs_requests_total{method="Create",proto="3"} 0
node_nfs_requests_total{method="Create",proto="4"} 0
node_nfs_requests_total{method="CreateSession",proto="4"} 0
node_nfs_requests_total{method="DeAllocate",proto="4"} 0
node_nfs_requests_total{method="DelegReturn",proto="4"} 0
node_nfs_requests_total{method="DestroyClientID",proto="4"} 0
node_nfs_requests_total{method="DestroySession",proto="4"} 0
node_nfs_requests_total{method="ExchangeID",proto="4"} 0
node_nfs_requests_total{method="FreeStateID",proto="4"} 0
node_nfs_requests_total{method="FsInfo",proto="3"} 0
node_nfs_requests_total{method="FsInfo",proto="4"} 0
node_nfs_requests_total{method="FsLocations",proto="4"} 0
node_nfs_requests_total{method="FsStat",proto="2"} 0
node_nfs_requests_total{method="FsStat",proto="3"} 0
node_nfs_requests_total{method="FsidPresent",proto="4"} 0
node_nfs_requests_total{method="GetACL",proto="4"} 0
node_nfs_requests_total{method="GetAttr",proto="2"} 0
node_nfs_requests_total{method="GetAttr",proto="3"} 0
node_nfs_requests_total{method="GetDeviceInfo",proto="4"} 0
node_nfs_requests_total{method="GetDeviceList",proto="4"} 0
node_nfs_requests_total{method="GetLeaseTime",proto="4"} 0
node_nfs_requests_total{method="Getattr",proto="4"} 0
node_nfs_requests_total{method="LayoutCommit",proto="4"} 0
node_nfs_requests_total{method="LayoutGet",proto="4"} 0
node_nfs_requests_total{method="LayoutReturn",proto="4"} 0
node_nfs_requests_total{method="LayoutStats",proto="4"} 0
node_nfs_requests_total{method="Link",proto="2"} 0
node_nfs_requests_total{method="Link",proto="3"} 0
node_nfs_requests_total{method="Link",proto="4"} 0
node_nfs_requests_total{method="Lock",proto="4"} 0
node_nfs_requests_total{method="Lockt",proto="4"} 0
node_nfs_requests_total{method="Locku",proto="4"} 0
node_nfs_requests_total{method="Lookup",proto="2"} 0
node_nfs_requests_total{method="Lookup",proto="3"} 0
node_nfs_requests_total{method="Lookup",proto="4"} 0
node_nfs_requests_total{method="LookupRoot",proto="4"} 0
node_nfs_requests_total{method="MkDir",proto="2"} 0
node_nfs_requests_total{method="MkDir",proto="3"} 0
node_nfs_requests_total{method="MkNod",proto="3"} 0
node_nfs_requests_total{method="Null",proto="2"} 0
node_nfs_requests_total{method="Null",proto="3"} 0
node_nfs_requests_total{method="Null",proto="4"} 0
node_nfs_requests_total{method="Open",proto="4"} 0
node_nfs_requests_total{method="OpenConfirm",proto="4"} 0
node_nfs_requests_total{method="OpenDowngrade",proto="4"} 0
node_nfs_requests_total{method="OpenNoattr",proto="4"} 0
node_nfs_requests_total{method="PathConf",proto="3"} 0
node_nfs_requests_total{method="Pathconf",proto="4"} 0
node_nfs_requests_total{method="Read",proto="2"} 0
node_nfs_requests_total{method="Read",proto="3"} 0
node_nfs_requests_total{method="Read",proto="4"} 0
node_nfs_requests_total{method="ReadDir",proto="2"} 0
node_nfs_requests_total{method="ReadDir",proto="3"} 0
node_nfs_requests_total{method="ReadDir",proto="4"} 0
node_nfs_requests_total{method="ReadDirPlus",proto="3"} 0
node_nfs_requests_total{method="ReadLink",proto="2"} 0
node_nfs_requests_total{method="ReadLink",proto="3"} 0
node_nfs_requests_total{method="ReadLink",proto="4"} 0
node_nfs_requests_total{method="ReclaimComplete",proto="4"} 0
node_nfs_requests_total{method="ReleaseLockowner",proto="4"} 0
node_nfs_requests_total{method="Remove",proto="2"} 0
node_nfs_requests_total{method="Remove",proto="3"} 0
node_nfs_requests_total{method="Remove",proto="4"} 0
node_nfs_requests_total{method="Rename",proto="2"} 0
node_nfs_requests_total{method="Rename",proto="3"} 0
node_nfs_requests_total{method="Rename",proto="4"} 0
node_nfs_requests_total{method="Renew",proto="4"} 0
node_nfs_requests_total{method="RmDir",proto="2"} 0
node_nfs_requests_total{method="RmDir",proto="3"} 0
node_nfs_requests_total{method="Root",proto="2"} 0
node_nfs_requests_total{method="Secinfo",proto="4"} 0
node_nfs_requests_total{method="SecinfoNoName",proto="4"} 0
node_nfs_requests_total{method="Seek",proto="4"} 0
node_nfs_requests_total{method="Sequence",proto="4"} 0
node_nfs_requests_total{method="ServerCaps",proto="4"} 0
node_nfs_requests_total{method="SetACL",proto="4"} 0
node_nfs_requests_total{method="SetAttr",proto="2"} 0
node_nfs_requests_total{method="SetAttr",proto="3"} 0
node_nfs_requests_total{method="SetClientID",proto="4"} 0
node_nfs_requests_total{method="SetClientIDConfirm",proto="4"} 0
node_nfs_requests_total{method="Setattr",proto="4"} 0
node_nfs_requests_total{method="StatFs",proto="4"} 0
node_nfs_requests_total{method="SymLink",proto="2"} 0
node_nfs_requests_total{method="SymLink",proto="3"} 0
node_nfs_requests_total{method="Symlink",proto="4"} 0
node_nfs_requests_total{method="TestStateID",proto="4"} 0
node_nfs_requests_total{method="WrCache",proto="2"} 0
node_nfs_requests_total{method="Write",proto="2"} 0
node_nfs_requests_total{method="Write",proto="3"} 0
node_nfs_requests_total{method="Write",proto="4"} 0
# HELP node_nfs_rpc_authentication_refreshes_total Number of RPC authentication refreshes performed.
# TYPE node_nfs_rpc_authentication_refreshes_total counter
node_nfs_rpc_authentication_refreshes_total 0
# HELP node_nfs_rpc_retransmissions_total Number of RPC transmissions performed.
# TYPE node_nfs_rpc_retransmissions_total counter
node_nfs_rpc_retransmissions_total 0
# HELP node_nfs_rpcs_total Total number of RPCs performed.
# TYPE node_nfs_rpcs_total counter
node_nfs_rpcs_total 0
# HELP node_os_info A metric with a constant '1' value labeled by build_id, id, id_like, image_id, image_version, name, pretty_name, variant, variant_id, version, version_codename, version_id.
# TYPE node_os_info gauge
node_os_info{build_id="",id="debian",id_like="",image_id="",image_version="",name="Debian GNU/Linux",pretty_name="Debian GNU/Linux 12 (bookworm)",variant="",variant_id="",version="12 (bookworm)",version_codename="bookworm",version_id="12"} 1
# HELP node_os_version Metric containing the major.minor part of the OS version.
# TYPE node_os_version gauge
node_os_version{id="debian",id_like="",name="Debian GNU/Linux"} 12
# HELP node_procs_blocked Number of processes blocked waiting for I/O to complete.
# TYPE node_procs_blocked gauge
node_procs_blocked 0
# HELP node_procs_running Number of processes in runnable state.
# TYPE node_procs_running gauge
node_procs_running 2
# HELP node_schedstat_running_seconds_total Number of seconds CPU spent running a process.
# TYPE node_schedstat_running_seconds_total counter
node_schedstat_running_seconds_total{cpu="0"} 193905.40964483
node_schedstat_running_seconds_total{cpu="1"} 201807.778053838
node_schedstat_running_seconds_total{cpu="2"} 202480.951626566
node_schedstat_running_seconds_total{cpu="3"} 199368.582085578
# HELP node_schedstat_timeslices_total Number of timeslices executed by CPU.
# TYPE node_schedstat_timeslices_total counter
node_schedstat_timeslices_total{cpu="0"} 2.671310666e+09
node_schedstat_timeslices_total{cpu="1"} 2.839935261e+09
node_schedstat_timeslices_total{cpu="2"} 2.840250945e+09
node_schedstat_timeslices_total{cpu="3"} 2.791566809e+09
# HELP node_schedstat_waiting_seconds_total Number of seconds spent by processing waiting for this CPU.
# TYPE node_schedstat_waiting_seconds_total counter
node_schedstat_waiting_seconds_total{cpu="0"} 146993.907550125
node_schedstat_waiting_seconds_total{cpu="1"} 148954.872956911
node_schedstat_waiting_seconds_total{cpu="2"} 149496.824640957
node_schedstat_waiting_seconds_total{cpu="3"} 148325.351612478
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="arp"} 0.000472051
node_scrape_collector_duration_seconds{collector="bcache"} 9.7776e-05
node_scrape_collector_duration_seconds{collector="bonding"} 0.00025022
node_scrape_collector_duration_seconds{collector="btrfs"} 0.018567631
node_scrape_collector_duration_seconds{collector="conntrack"} 0.014180114
node_scrape_collector_duration_seconds{collector="cpu"} 0.004748662
node_scrape_collector_duration_seconds{collector="cpufreq"} 0.049445245
node_scrape_collector_duration_seconds{collector="diskstats"} 0.001468727
node_scrape_collector_duration_seconds{collector="dmi"} 1.093e-06
node_scrape_collector_duration_seconds{collector="edac"} 7.6574e-05
node_scrape_collector_duration_seconds{collector="entropy"} 0.000781326
node_scrape_collector_duration_seconds{collector="fibrechannel"} 3.0574e-05
node_scrape_collector_duration_seconds{collector="filefd"} 0.000214998
node_scrape_collector_duration_seconds{collector="filesystem"} 0.041031802
node_scrape_collector_duration_seconds{collector="hwmon"} 0.007842633
node_scrape_collector_duration_seconds{collector="infiniband"} 4.1777e-05
node_scrape_collector_duration_seconds{collector="ipvs"} 0.000964547
node_scrape_collector_duration_seconds{collector="loadavg"} 0.000368979
node_scrape_collector_duration_seconds{collector="mdadm"} 7.6555e-05
node_scrape_collector_duration_seconds{collector="meminfo"} 0.001052527
node_scrape_collector_duration_seconds{collector="netclass"} 0.036469213
node_scrape_collector_duration_seconds{collector="netdev"} 0.002758901
node_scrape_collector_duration_seconds{collector="netstat"} 0.002033075
node_scrape_collector_duration_seconds{collector="nfs"} 0.000542699
node_scrape_collector_duration_seconds{collector="nfsd"} 0.000331331
node_scrape_collector_duration_seconds{collector="nvme"} 0.000140017
node_scrape_collector_duration_seconds{collector="os"} 0.000326923
node_scrape_collector_duration_seconds{collector="powersupplyclass"} 0.000183962
node_scrape_collector_duration_seconds{collector="pressure"} 6.4647e-05
node_scrape_collector_duration_seconds{collector="rapl"} 0.000149461
node_scrape_collector_duration_seconds{collector="schedstat"} 0.000511218
node_scrape_collector_duration_seconds{collector="selinux"} 0.000327182
node_scrape_collector_duration_seconds{collector="sockstat"} 0.001023898
node_scrape_collector_duration_seconds{collector="softnet"} 0.000578402
node_scrape_collector_duration_seconds{collector="stat"} 0.013851062
node_scrape_collector_duration_seconds{collector="tapestats"} 0.000176499
node_scrape_collector_duration_seconds{collector="textfile"} 5.7296e-05
node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.017899137
node_scrape_collector_duration_seconds{collector="time"} 0.000422885
node_scrape_collector_duration_seconds{collector="timex"} 0.000182517
node_scrape_collector_duration_seconds{collector="udp_queues"} 0.001325488
node_scrape_collector_duration_seconds{collector="uname"} 7.0184e-05
node_scrape_collector_duration_seconds{collector="vmstat"} 0.000352664
node_scrape_collector_duration_seconds{collector="xfs"} 4.2481e-05
node_scrape_collector_duration_seconds{collector="zfs"} 0.00011237
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="arp"} 0
node_scrape_collector_success{collector="bcache"} 1
node_scrape_collector_success{collector="bonding"} 0
node_scrape_collector_success{collector="btrfs"} 1
node_scrape_collector_success{collector="conntrack"} 0
node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="cpufreq"} 1
node_scrape_collector_success{collector="diskstats"} 1
node_scrape_collector_success{collector="dmi"} 0
node_scrape_collector_success{collector="edac"} 1
node_scrape_collector_success{collector="entropy"} 1
node_scrape_collector_success{collector="fibrechannel"} 0
node_scrape_collector_success{collector="filefd"} 1
node_scrape_collector_success{collector="filesystem"} 1
node_scrape_collector_success{collector="hwmon"} 1
node_scrape_collector_success{collector="infiniband"} 0
node_scrape_collector_success{collector="ipvs"} 1
node_scrape_collector_success{collector="loadavg"} 1
node_scrape_collector_success{collector="mdadm"} 0
node_scrape_collector_success{collector="meminfo"} 1
node_scrape_collector_success{collector="netclass"} 1
node_scrape_collector_success{collector="netdev"} 1
node_scrape_collector_success{collector="netstat"} 1
node_scrape_collector_success{collector="nfs"} 1
node_scrape_collector_success{collector="nfsd"} 0
node_scrape_collector_success{collector="nvme"} 1
node_scrape_collector_success{collector="os"} 1
node_scrape_collector_success{collector="powersupplyclass"} 1
node_scrape_collector_success{collector="pressure"} 0
node_scrape_collector_success{collector="rapl"} 0
node_scrape_collector_success{collector="schedstat"} 1
node_scrape_collector_success{collector="selinux"} 1
node_scrape_collector_success{collector="sockstat"} 1
node_scrape_collector_success{collector="softnet"} 1
node_scrape_collector_success{collector="stat"} 1
node_scrape_collector_success{collector="tapestats"} 0
node_scrape_collector_success{collector="textfile"} 1
node_scrape_collector_success{collector="thermal_zone"} 1
node_scrape_collector_success{collector="time"} 1
node_scrape_collector_success{collector="timex"} 1
node_scrape_collector_success{collector="udp_queues"} 1
node_scrape_collector_success{collector="uname"} 1
node_scrape_collector_success{collector="vmstat"} 1
node_scrape_collector_success{collector="xfs"} 1
node_scrape_collector_success{collector="zfs"} 0
# HELP node_selinux_enabled SELinux is enabled, 1 is true, 0 is false
# TYPE node_selinux_enabled gauge
node_selinux_enabled 0
# HELP node_sockstat_FRAG6_inuse Number of FRAG6 sockets in state inuse.
# TYPE node_sockstat_FRAG6_inuse gauge
node_sockstat_FRAG6_inuse 0
# HELP node_sockstat_FRAG6_memory Number of FRAG6 sockets in state memory.
# TYPE node_sockstat_FRAG6_memory gauge
node_sockstat_FRAG6_memory 0
# HELP node_sockstat_FRAG_inuse Number of FRAG sockets in state inuse.
# TYPE node_sockstat_FRAG_inuse gauge
node_sockstat_FRAG_inuse 0
# HELP node_sockstat_FRAG_memory Number of FRAG sockets in state memory.
# TYPE node_sockstat_FRAG_memory gauge
node_sockstat_FRAG_memory 0
# HELP node_sockstat_RAW6_inuse Number of RAW6 sockets in state inuse.
# TYPE node_sockstat_RAW6_inuse gauge
node_sockstat_RAW6_inuse 1
# HELP node_sockstat_RAW_inuse Number of RAW sockets in state inuse.
# TYPE node_sockstat_RAW_inuse gauge
node_sockstat_RAW_inuse 0
# HELP node_sockstat_TCP6_inuse Number of TCP6 sockets in state inuse.
# TYPE node_sockstat_TCP6_inuse gauge
node_sockstat_TCP6_inuse 44
# HELP node_sockstat_TCP_alloc Number of TCP sockets in state alloc.
# TYPE node_sockstat_TCP_alloc gauge
node_sockstat_TCP_alloc 272
# HELP node_sockstat_TCP_inuse Number of TCP sockets in state inuse.
# TYPE node_sockstat_TCP_inuse gauge
node_sockstat_TCP_inuse 211
# HELP node_sockstat_TCP_mem Number of TCP sockets in state mem.
# TYPE node_sockstat_TCP_mem gauge
node_sockstat_TCP_mem 665
# HELP node_sockstat_TCP_mem_bytes Number of TCP sockets in state mem_bytes.
# TYPE node_sockstat_TCP_mem_bytes gauge
node_sockstat_TCP_mem_bytes 2.72384e+06
# HELP node_sockstat_TCP_orphan Number of TCP sockets in state orphan.
# TYPE node_sockstat_TCP_orphan gauge
node_sockstat_TCP_orphan 0
# HELP node_sockstat_TCP_tw Number of TCP sockets in state tw.
# TYPE node_sockstat_TCP_tw gauge
node_sockstat_TCP_tw 55
# HELP node_sockstat_UDP6_inuse Number of UDP6 sockets in state inuse.
# TYPE node_sockstat_UDP6_inuse gauge
node_sockstat_UDP6_inuse 2
# HELP node_sockstat_UDPLITE6_inuse Number of UDPLITE6 sockets in state inuse.
# TYPE node_sockstat_UDPLITE6_inuse gauge
node_sockstat_UDPLITE6_inuse 0
# HELP node_sockstat_UDPLITE_inuse Number of UDPLITE sockets in state inuse.
# TYPE node_sockstat_UDPLITE_inuse gauge
node_sockstat_UDPLITE_inuse 0
# HELP node_sockstat_UDP_inuse Number of UDP sockets in state inuse.
# TYPE node_sockstat_UDP_inuse gauge
node_sockstat_UDP_inuse 3
# HELP node_sockstat_UDP_mem Number of UDP sockets in state mem.
# TYPE node_sockstat_UDP_mem gauge
node_sockstat_UDP_mem 249
# HELP node_sockstat_UDP_mem_bytes Number of UDP sockets in state mem_bytes.
# TYPE node_sockstat_UDP_mem_bytes gauge
node_sockstat_UDP_mem_bytes 1.019904e+06
# HELP node_sockstat_sockets_used Number of IPv4 sockets in use.
# TYPE node_sockstat_sockets_used gauge
node_sockstat_sockets_used 563
# HELP node_softnet_backlog_len Softnet backlog status
# TYPE node_softnet_backlog_len gauge
node_softnet_backlog_len{cpu="0"} 0
node_softnet_backlog_len{cpu="1"} 0
node_softnet_backlog_len{cpu="2"} 0
node_softnet_backlog_len{cpu="3"} 0
# HELP node_softnet_cpu_collision_total Number of collision occur while obtaining device lock while transmitting
# TYPE node_softnet_cpu_collision_total counter
node_softnet_cpu_collision_total{cpu="0"} 0
node_softnet_cpu_collision_total{cpu="1"} 0
node_softnet_cpu_collision_total{cpu="2"} 0
node_softnet_cpu_collision_total{cpu="3"} 0
# HELP node_softnet_dropped_total Number of dropped packets
# TYPE node_softnet_dropped_total counter
node_softnet_dropped_total{cpu="0"} 0
node_softnet_dropped_total{cpu="1"} 0
node_softnet_dropped_total{cpu="2"} 0
node_softnet_dropped_total{cpu="3"} 0
# HELP node_softnet_flow_limit_count_total Number of times flow limit has been reached
# TYPE node_softnet_flow_limit_count_total counter
node_softnet_flow_limit_count_total{cpu="0"} 0
node_softnet_flow_limit_count_total{cpu="1"} 0
node_softnet_flow_limit_count_total{cpu="2"} 0
node_softnet_flow_limit_count_total{cpu="3"} 0
# HELP node_softnet_processed_total Number of processed packets
# TYPE node_softnet_processed_total counter
node_softnet_processed_total{cpu="0"} 3.91430308e+08
node_softnet_processed_total{cpu="1"} 7.0427743e+07
node_softnet_processed_total{cpu="2"} 7.2377954e+07
node_softnet_processed_total{cpu="3"} 7.0743949e+07
# HELP node_softnet_received_rps_total Number of times cpu woken up received_rps
# TYPE node_softnet_received_rps_total counter
node_softnet_received_rps_total{cpu="0"} 0
node_softnet_received_rps_total{cpu="1"} 0
node_softnet_received_rps_total{cpu="2"} 0
node_softnet_received_rps_total{cpu="3"} 0
# HELP node_softnet_times_squeezed_total Number of times processing packets ran out of quota
# TYPE node_softnet_times_squeezed_total counter
node_softnet_times_squeezed_total{cpu="0"} 298183
node_softnet_times_squeezed_total{cpu="1"} 0
node_softnet_times_squeezed_total{cpu="2"} 0
node_softnet_times_squeezed_total{cpu="3"} 0
# HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise
# TYPE node_textfile_scrape_error gauge
node_textfile_scrape_error 0
# HELP node_thermal_zone_temp Zone temperature in Celsius
# TYPE node_thermal_zone_temp gauge
node_thermal_zone_temp{type="cpu-thermal",zone="0"} 28.232
# HELP node_time_clocksource_available_info Available clocksources read from '/sys/devices/system/clocksource'.
# TYPE node_time_clocksource_available_info gauge
node_time_clocksource_available_info{clocksource="arch_sys_counter",device="0"} 1
# HELP node_time_clocksource_current_info Current clocksource read from '/sys/devices/system/clocksource'.
# TYPE node_time_clocksource_current_info gauge
node_time_clocksource_current_info{clocksource="arch_sys_counter",device="0"} 1
# HELP node_time_seconds System time in seconds since epoch (1970).
# TYPE node_time_seconds gauge
node_time_seconds 1.7097658934862518e+09
# HELP node_time_zone_offset_seconds System time zone offset in seconds.
# TYPE node_time_zone_offset_seconds gauge
node_time_zone_offset_seconds{time_zone="UTC"} 0
# HELP node_timex_estimated_error_seconds Estimated error in seconds.
# TYPE node_timex_estimated_error_seconds gauge
node_timex_estimated_error_seconds 0
# HELP node_timex_frequency_adjustment_ratio Local clock frequency adjustment.
# TYPE node_timex_frequency_adjustment_ratio gauge
node_timex_frequency_adjustment_ratio 0.9999922578277588
# HELP node_timex_loop_time_constant Phase-locked loop time constant.
# TYPE node_timex_loop_time_constant gauge
node_timex_loop_time_constant 7
# HELP node_timex_maxerror_seconds Maximum error in seconds.
# TYPE node_timex_maxerror_seconds gauge
node_timex_maxerror_seconds 0.672
# HELP node_timex_offset_seconds Time offset in between local system and reference clock.
# TYPE node_timex_offset_seconds gauge
node_timex_offset_seconds -0.000593063
# HELP node_timex_pps_calibration_total Pulse per second count of calibration intervals.
# TYPE node_timex_pps_calibration_total counter
node_timex_pps_calibration_total 0
# HELP node_timex_pps_error_total Pulse per second count of calibration errors.
# TYPE node_timex_pps_error_total counter
node_timex_pps_error_total 0
# HELP node_timex_pps_frequency_hertz Pulse per second frequency.
# TYPE node_timex_pps_frequency_hertz gauge
node_timex_pps_frequency_hertz 0
# HELP node_timex_pps_jitter_seconds Pulse per second jitter.
# TYPE node_timex_pps_jitter_seconds gauge
node_timex_pps_jitter_seconds 0
# HELP node_timex_pps_jitter_total Pulse per second count of jitter limit exceeded events.
# TYPE node_timex_pps_jitter_total counter
node_timex_pps_jitter_total 0
# HELP node_timex_pps_shift_seconds Pulse per second interval duration.
# TYPE node_timex_pps_shift_seconds gauge
node_timex_pps_shift_seconds 0
# HELP node_timex_pps_stability_exceeded_total Pulse per second count of stability limit exceeded events.
# TYPE node_timex_pps_stability_exceeded_total counter
node_timex_pps_stability_exceeded_total 0
# HELP node_timex_pps_stability_hertz Pulse per second stability, average of recent frequency changes.
# TYPE node_timex_pps_stability_hertz gauge
node_timex_pps_stability_hertz 0
# HELP node_timex_status Value of the status array bits.
# TYPE node_timex_status gauge
node_timex_status 24577
# HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
# TYPE node_timex_sync_status gauge
node_timex_sync_status 1
# HELP node_timex_tai_offset_seconds International Atomic Time (TAI) offset.
# TYPE node_timex_tai_offset_seconds gauge
node_timex_tai_offset_seconds 0
# HELP node_timex_tick_seconds Seconds between clock ticks.
# TYPE node_timex_tick_seconds gauge
node_timex_tick_seconds 0.01
# HELP node_udp_queues Number of allocated memory in the kernel for UDP datagrams in bytes.
# TYPE node_udp_queues gauge
node_udp_queues{ip="v4",queue="rx"} 0
node_udp_queues{ip="v4",queue="tx"} 0
node_udp_queues{ip="v6",queue="rx"} 0
node_udp_queues{ip="v6",queue="tx"} 0
# HELP node_uname_info Labeled system information as provided by the uname system call.
# TYPE node_uname_info gauge
node_uname_info{domainname="(none)",machine="aarch64",nodename="bettley",release="6.1.0-rpi7-rpi-v8",sysname="Linux",version="#1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24)"} 1
# HELP node_vmstat_oom_kill /proc/vmstat information field oom_kill.
# TYPE node_vmstat_oom_kill untyped
node_vmstat_oom_kill 0
# HELP node_vmstat_pgfault /proc/vmstat information field pgfault.
# TYPE node_vmstat_pgfault untyped
node_vmstat_pgfault 3.706999478e+09
# HELP node_vmstat_pgmajfault /proc/vmstat information field pgmajfault.
# TYPE node_vmstat_pgmajfault untyped
node_vmstat_pgmajfault 5791
# HELP node_vmstat_pgpgin /proc/vmstat information field pgpgin.
# TYPE node_vmstat_pgpgin untyped
node_vmstat_pgpgin 1.115617e+06
# HELP node_vmstat_pgpgout /proc/vmstat information field pgpgout.
# TYPE node_vmstat_pgpgout untyped
node_vmstat_pgpgout 2.55770725e+08
# HELP node_vmstat_pswpin /proc/vmstat information field pswpin.
# TYPE node_vmstat_pswpin untyped
node_vmstat_pswpin 0
# HELP node_vmstat_pswpout /proc/vmstat information field pswpout.
# TYPE node_vmstat_pswpout untyped
node_vmstat_pswpout 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.05
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.2292096e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.7097658257e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.269604352e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

So... yay?

We could shift this to a separate repository, or we can just rip it back out of the incubator and create a separate Application resource for it in this task file. We could organize it a thousand different ways. A prometheus_node_exporter repository? A prometheus repository? A monitoring repository?

Because I'm not really sure which I'd like to do, I'll just defer the decision until a later date and move on to other things.

Router BGP Configuration

Before I go too much further, I want to get load balancer services working.

With major cloud vendors that support Kubernetes, creating a service of type LoadBalancer will create a load balancer within that platform that provides external access to that service. This spares us from having to use ClusterIP, etc, to access our services.

This functionality isn't automatically available in a homelab. Why would it be? How could it know what you want? Regardless of the complexities preventing this from Just Working™, this topic is often a source of irritation to the homelabber.

Fortunately, a gentleman and scholar named Dave Anderson spent (I assume) a significant amount of time and devised a system, MetalLB, to bring load balancer functionality to bare metal clusters.

With a reasonable amount of effort, we can configure a router supporting BGP and a Kubernetes cluster running MetalLB into a pretty clean network infrastructure.

Network Architecture Overview

The BGP configuration creates a sophisticated routing topology that enables dynamic load balancer allocation:

Network Segmentation

  • Infrastructure CIDR: 10.4.0.0/20 (main cluster network)
  • Service CIDR: 172.16.0.0/20 (Kubernetes internal services)
  • Pod CIDR: 192.168.0.0/16 (container networking)
  • MetalLB Pool: 10.4.11.0/24 (load balancer IP range: 10.4.11.0 - 10.4.15.254)

BGP Autonomous System Design

  • Router ASN: 64500 (OPNsense gateway acting as route reflector)
  • Cluster ASN: 64501 (all Kubernetes nodes share this AS number)
  • Peer relationship: eBGP (External BGP) between different AS numbers

This design follows RFC 7938 recommendations for private AS numbers in the range 64512-65534.

OPNsense Router Configuration

In my case, this starts with configuring my router/firewall (running OPNsense) to support BGP.

Step 1: FRR Plugin Installation

This means installing the os-frr (for "Free-Range Routing") plugin:

Installing the os-frr plugin

Free-Range Routing (FRR) is a routing software suite that provides:

  • BGP-4: Border Gateway Protocol implementation
  • OSPF: Open Shortest Path First for dynamic routing
  • ISIS/RIP: Additional routing protocol support
  • Route maps: Sophisticated traffic engineering capabilities

Step 2: Enable Global Routing

Then we enable routing:

Enabling routing

This configuration enables:

  • Kernel route injection: FRR can modify the system routing table
  • Route redistribution: Between different routing protocols
  • Multi-protocol support: IPv4 and IPv6 route advertisement

Step 3: BGP Configuration

Then we enable BGP. We give the router an AS number of 64500.

Enabling BGP

BGP Configuration Parameters:

  • Router ID: Typically set to the router's loopback or primary interface IP (10.4.0.1)
  • AS Number: 64500 (private ASN for the gateway)
  • Network advertisements: Routes to be advertised to BGP peers
  • Redistribution: Connected routes, static routes, and other protocols

Step 4: BGP Neighbor Configuration

Then we add each of the nodes that might run MetalLB "speakers" as neighbors. They all will share a single AS number, 64501.

Kubernetes Node BGP Peers:

# Control Plane Nodes (also run MetalLB speakers)
10.4.0.11 (bettley)  - ASN 64501
10.4.0.12 (cargyll)  - ASN 64501  
10.4.0.13 (dalt)     - ASN 64501

# Worker Nodes (potential MetalLB speakers)
10.4.0.14 (erenford) - ASN 64501
10.4.0.15 (fenn)     - ASN 64501
10.4.0.16 (gardener) - ASN 64501
10.4.0.17 (harlton)  - ASN 64501
10.4.0.18 (inchfield) - ASN 64501
10.4.0.19 (jast)     - ASN 64501
10.4.0.20 (karstark) - ASN 64501
10.4.0.21 (lipps)    - ASN 64501
10.4.1.10 (velaryon) - ASN 64501

Neighbor Configuration Details:

  • Peer Type: External BGP (eBGP) due to different AS numbers
  • Authentication: Can use MD5 authentication for security
  • Timers: Hold time (180s) and keepalive (60s) for session management
  • Route filters: Accept only specific route prefixes from cluster

BGP Route Advertisement Strategy

Router Advertisements

The OPNsense router advertises:

  • Default route (0.0.0.0/0) to provide internet access
  • Infrastructure networks (10.4.0.0/20) for internal cluster communication
  • External services that may be hosted outside the cluster

Cluster Advertisements

MetalLB speakers advertise:

  • LoadBalancer service IPs from the 10.4.11.0/24 pool
  • Individual /32 routes for each allocated load balancer IP
  • Equal-cost multi-path (ECMP) when multiple speakers announce the same service

Route Selection and Load Balancing

BGP Path Selection

When multiple MetalLB speakers advertise the same service IP:

  1. Prefer shortest AS path (all speakers have same path length)
  2. Prefer lowest origin code (IGP over EGP over incomplete)
  3. Prefer lowest MED (Multi-Exit Discriminator)
  4. Prefer eBGP over iBGP (not applicable here)
  5. Prefer lowest IGP cost to BGP next-hop
  6. Prefer oldest route (route stability)

Router Load Balancing

OPNsense can be configured for:

  • Per-packet load balancing: Maximum utilization but potential packet reordering
  • Per-flow load balancing: Maintains flow affinity while distributing across paths
  • Weighted load balancing: Different weights for different next-hops

Security Considerations

BGP Session Security

  • MD5 Authentication: Prevents unauthorized BGP session establishment
  • TTL Security: Ensures BGP packets come from directly connected neighbors
  • Prefix filters: Prevent route hijacking by filtering unexpected announcements

Route Filtering

# Example prefix filter configuration  
prefix-list METALLB-ROUTES permit 10.4.11.0/24 le 32
neighbor 10.4.0.11 prefix-list METALLB-ROUTES in

This ensures the router only accepts MetalLB routes within the designated pool.

Monitoring and Troubleshooting

BGP Session Monitoring

Key commands for BGP troubleshooting:

# View BGP summary
vtysh -c "show ip bgp summary"

# Check specific neighbor status  
vtysh -c "show ip bgp neighbor 10.4.0.11"

# View advertised routes
vtysh -c "show ip bgp advertised-routes"

# Check routing table
ip route show table main

Common BGP Issues

  • Session flapping: Often due to network connectivity or timer mismatches
  • Route installation failures: Check routing table limits and memory
  • Asymmetric routing: Verify return path routing and firewalls

Integration with MetalLB

The BGP configuration on the router side enables MetalLB to:

  1. Establish BGP sessions with the cluster gateway
  2. Advertise LoadBalancer service IPs dynamically as services are created
  3. Withdraw routes automatically when services are deleted
  4. Provide redundancy through multiple speaker nodes

This creates a fully dynamic load balancing solution where:

  • Services get real IP addresses from the external network
  • Traffic routes optimally through the cluster
  • Failover happens automatically via BGP reconvergence
  • No manual network configuration required for new services

In the next section, we'll configure MetalLB to establish these BGP sessions and begin advertising load balancer routes.

MetalLB

MetalLB requires that its namespace have some extra privileges:

  apiVersion: 'v1'
  kind: 'Namespace'
  metadata:
    name: 'metallb'
    labels:
      name: 'metallb'
      managed-by: 'argocd'
      pod-security.kubernetes.io/enforce: privileged
      pod-security.kubernetes.io/audit: privileged
      pod-security.kubernetes.io/warn: privileged

Its application is (perhaps surprisingly) rather simple to configure:

apiVersion: 'argoproj.io/v1alpha1'
kind: 'Application'
metadata:
  name: 'metallb'
  namespace: 'argocd'
  labels:
    name: 'metallb'
    managed-by: 'argocd'
spec:
  project: 'metallb'
  source:
    repoURL: 'https://metallb.github.io/metallb'
    chart: 'metallb'
    targetRevision: '0.14.3'
    helm:
      releaseName: 'metallb'
      valuesObject:
        rbac:
          create: true
        prometheus:
          scrapeAnnotations: true
          metricsPort: 7472
          rbacPrometheus: true
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: 'metallb'
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - Validate=true
      - CreateNamespace=false
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - RespectIgnoreDifferences=true
      - ApplyOutOfSyncOnly=true

It does require some extra resources, though. The first of these is an address pool from which to allocate IP addresses. It's important that this not overlap with a DHCP pool.

The full network is 10.4.0.0/20 and I've configured the DHCP server to only serve addresses in 10.4.0.100-254, so we have plenty of space to play with. Right now, I'll use 10.4.11.0-10.4.15.254, which gives ~1250 usable addresses. I don't think I'll use quite that many.

apiVersion: 'metallb.io/v1beta1'
kind: 'IPAddressPool'
metadata:
  name: 'primary'
  namespace: 'metallb'
spec:
  addresses:
  - 10.4.11.0 - 10.4.15.254

Then we need to configure MetalLB to act as a BGP peer:

apiVersion: 'metallb.io/v1beta2'
kind: 'BGPPeer'
metadata:
  name: 'marbrand'
  namespace: 'metallb'
spec:
  myASN: 64501
  peerASN: 64500
  peerAddress: 10.4.0.1

And advertise the IP address pool:

apiVersion: 'metallb.io/v1beta1'
kind: 'BGPAdvertisement'
metadata:
  name: 'primary'
  namespace: 'metallb'
spec:
  ipAddressPools:
    - 'primary'

That's that; we can deploy it, and soon we'll be up and running, although we can't yet test it.

MetalLB deployed in Argo CD

Testing MetalLB

The simplest way to test MetalLB is just to deploy an application with a LoadBalancer service and see if it works.

I'm a fan of httpbin and its Go port, httpbingo, so up it goes:

apiVersion: 'argoproj.io/v1alpha1'
kind: 'Application'
metadata:
  name: 'httpbin'
  namespace: 'argocd'
  labels:
    name: 'httpbin'
    managed-by: 'argocd'
spec:
  project: 'httpbin'
  source:
    repoURL: 'https://matheusfm.dev/charts'
    chart: 'httpbin'
    targetRevision: '0.1.1'
    helm:
      releaseName: 'httpbin'
      valuesObject:
        service:
          type: 'LoadBalancer'
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: 'httpbin'
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - Validate=true
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - RespectIgnoreDifferences=true
      - ApplyOutOfSyncOnly=true

Very quickly, it's synced:

httpbin deployed

We can get the IP address allocated for the load balancer with kubectl -n httpbin get svc:

httpbin service

And sure enough, it's allocated from the IP address pool we specified. That seems like an excellent sign!

Can we access it from a web browser running on a computer on a different network?

httpbin webpage

Yes, we can! Our load balancer system is working!

Comprehensive MetalLB Testing Suite

While the httpbin test demonstrates basic functionality, production MetalLB deployments require more thorough validation of various scenarios and failure modes.

Phase 1: Basic Functionality Tests

1.1 IP Address Allocation Verification

First, verify that MetalLB allocates IP addresses from the configured pool:

# Check the configured IP address pool
kubectl -n metallb get ipaddresspool primary -o yaml

# Deploy multiple LoadBalancer services and verify allocations
kubectl create deployment test-nginx --image=nginx
kubectl expose deployment test-nginx --type=LoadBalancer --port=80

# Verify sequential allocation from pool
kubectl get svc test-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Expected behavior:

  • IPs allocated from 10.4.11.0 - 10.4.15.254 range
  • Sequential allocation starting from pool beginning
  • No IP conflicts between services

1.2 Service Lifecycle Testing

Test the complete service lifecycle to ensure proper cleanup:

# Create service and note allocated IP
kubectl create deployment lifecycle-test --image=httpd
kubectl expose deployment lifecycle-test --type=LoadBalancer --port=80
ALLOCATED_IP=$(kubectl get svc lifecycle-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Verify service is accessible
curl -s http://$ALLOCATED_IP/ | grep "It works!"

# Delete service and verify IP is released
kubectl delete svc lifecycle-test
kubectl delete deployment lifecycle-test

# Verify IP is available for reallocation
kubectl create deployment reallocation-test --image=nginx
kubectl expose deployment reallocation-test --type=LoadBalancer --port=80
NEW_IP=$(kubectl get svc reallocation-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Should reuse the previously released IP
echo "Original IP: $ALLOCATED_IP, New IP: $NEW_IP"

Phase 2: BGP Advertisement Testing

2.1 BGP Session Health Verification

Monitor BGP session establishment and health:

# Check MetalLB speaker status
kubectl -n metallb get pods -l component=speaker

# Verify BGP sessions from router perspective
goldentooth command allyrion 'vtysh -c "show ip bgp summary"'

# Check BGP neighbor status for specific node
goldentooth command allyrion 'vtysh -c "show ip bgp neighbor 10.4.0.11"'

Expected BGP session states:

  • Established: BGP session is active and exchanging routes
  • Route count: Number of routes received from each speaker
  • Session uptime: Indicates session stability

2.2 Route Advertisement Verification

Verify that LoadBalancer IPs are properly advertised via BGP:

# Create test service
kubectl create deployment bgp-test --image=nginx
kubectl expose deployment bgp-test --type=LoadBalancer --port=80
TEST_IP=$(kubectl get svc bgp-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Check route advertisement on router
goldentooth command allyrion "vtysh -c 'show ip bgp | grep $TEST_IP'"

# Verify route in kernel routing table
goldentooth command allyrion "ip route show | grep $TEST_IP"

# Test route withdrawal
kubectl delete svc bgp-test
sleep 30

# Verify route is withdrawn
goldentooth command allyrion "vtysh -c 'show ip bgp | grep $TEST_IP' || echo 'Route withdrawn'"

Phase 3: High Availability Testing

3.1 Speaker Node Failure Simulation

Test MetalLB's behavior when speaker nodes fail:

# Identify which node is announcing a service
kubectl create deployment ha-test --image=nginx
kubectl expose deployment ha-test --type=LoadBalancer --port=80
HA_IP=$(kubectl get svc ha-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Find announcing node from BGP table
goldentooth command allyrion "vtysh -c 'show ip bgp $HA_IP'"

# Simulate node failure by stopping kubelet on announcing node
ANNOUNCING_NODE=$(kubectl get svc ha-test -o jsonpath='{.metadata.annotations.metallb\.universe\.tf/announcing-node}' 2>/dev/null || echo "bettley")
goldentooth command_root $ANNOUNCING_NODE 'systemctl stop kubelet'

# Wait for BGP reconvergence (typically 30-180 seconds)
sleep 60

# Verify service is still accessible (new node should announce)
curl -s http://$HA_IP/ | grep "Welcome to nginx"

# Check new announcing node
goldentooth command allyrion "vtysh -c 'show ip bgp $HA_IP'"

# Restore failed node
goldentooth command_root $ANNOUNCING_NODE 'systemctl start kubelet'

3.2 Split-Brain Prevention Testing

Verify that MetalLB prevents split-brain scenarios where multiple nodes announce the same service:

# Deploy service with specific node selector to control placement
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: split-brain-test
  annotations:
    metallb.universe.tf/allow-shared-ip: "split-brain-test"
spec:
  type: LoadBalancer
  selector:
    app: split-brain-test
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: split-brain-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: split-brain-test
  template:
    metadata:
      labels:
        app: split-brain-test
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF

# Monitor BGP announcements for the service IP
SPLIT_IP=$(kubectl get svc split-brain-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
goldentooth command allyrion "vtysh -c 'show ip bgp $SPLIT_IP detail'"

# Should see only one announcement path, not multiple conflicting paths

Phase 4: Performance and Scale Testing

4.1 IP Pool Exhaustion Testing

Test behavior when IP address pool is exhausted:

# Calculate available IPs in pool (10.4.11.0 - 10.4.15.254 = ~1250 IPs)
# Deploy services until pool exhaustion

for i in {1..10}; do
  kubectl create deployment scale-test-$i --image=nginx
  kubectl expose deployment scale-test-$i --type=LoadBalancer --port=80
  echo "Created service $i"
  sleep 5
done

# Check for services stuck in Pending state
kubectl get svc | grep Pending

# Verify MetalLB events for pool exhaustion
kubectl -n metallb get events --sort-by='.lastTimestamp'

4.2 BGP Convergence Time Measurement

Measure BGP convergence time under various scenarios:

# Create test service and measure initial advertisement time
start_time=$(date +%s)
kubectl create deployment convergence-test --image=nginx
kubectl expose deployment convergence-test --type=LoadBalancer --port=80

# Wait for IP allocation
while [ -z "$(kubectl get svc convergence-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" ]; do
  sleep 1
done

CONV_IP=$(kubectl get svc convergence-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "IP allocated: $CONV_IP"

# Wait for BGP advertisement
while ! goldentooth command allyrion "ip route show | grep $CONV_IP" >/dev/null 2>&1; do
  sleep 1
done

end_time=$(date +%s)
convergence_time=$((end_time - start_time))
echo "BGP convergence time: ${convergence_time} seconds"

Phase 5: Integration Testing

5.1 ExternalDNS Integration

Test automatic DNS record creation for LoadBalancer services:

# Deploy service with DNS annotation
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: dns-integration-test
  annotations:
    external-dns.alpha.kubernetes.io/hostname: test.goldentooth.net
spec:
  type: LoadBalancer
  selector:
    app: dns-test
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dns-test
  template:
    metadata:
      labels:
        app: dns-test
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF

# Wait for DNS propagation
sleep 60

# Test DNS resolution
nslookup test.goldentooth.net

# Test HTTP access via DNS name
curl -s http://test.goldentooth.net/ | grep "Welcome to nginx"

5.2 TLS Certificate Integration

Test automatic TLS certificate provisioning for LoadBalancer services:

# Deploy service with cert-manager annotations
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: tls-integration-test
  annotations:
    external-dns.alpha.kubernetes.io/hostname: tls-test.goldentooth.net
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  type: LoadBalancer
  selector:
    app: tls-test
  ports:
  - port: 443
    targetPort: 443
    name: https
  - port: 80
    targetPort: 80
    name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tls-test-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  tls:
  - hosts:
    - tls-test.goldentooth.net
    secretName: tls-test-cert
  rules:
  - host: tls-test.goldentooth.net
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tls-integration-test
            port:
              number: 80
EOF

# Wait for certificate provisioning
kubectl wait --for=condition=Ready certificate/tls-test-cert --timeout=300s

# Test HTTPS access
curl -s https://tls-test.goldentooth.net/ | grep "Welcome to nginx"

Phase 6: Troubleshooting and Monitoring

6.1 MetalLB Component Health

Monitor MetalLB component health and logs:

# Check MetalLB controller status
kubectl -n metallb get pods -l component=controller
kubectl -n metallb logs -l component=controller --tail=50

# Check MetalLB speaker status on each node
kubectl -n metallb get pods -l component=speaker -o wide
kubectl -n metallb logs -l component=speaker --tail=50

# Check MetalLB configuration
kubectl -n metallb get ipaddresspool,bgppeer,bgpadvertisement -o wide

6.2 BGP Session Troubleshooting

Debug BGP session issues:

# Check BGP session state
goldentooth command allyrion 'vtysh -c "show ip bgp summary"'

# Detailed neighbor analysis
for node_ip in 10.4.0.11 10.4.0.12 10.4.0.13; do
  echo "=== BGP Neighbor $node_ip ==="
  goldentooth command allyrion "vtysh -c 'show ip bgp neighbor $node_ip'"
done

# Check for BGP route-map and prefix-list configurations
goldentooth command allyrion 'vtysh -c "show ip bgp route-map"'
goldentooth command allyrion 'vtysh -c "show ip prefix-list"'

# Monitor BGP route changes in real-time
goldentooth command allyrion 'vtysh -c "debug bgp events"'

6.3 Network Connectivity Testing

Comprehensive network path testing:

# Test connectivity from external networks
for test_ip in $(kubectl get svc -A -o jsonpath='{.items[?(@.spec.type=="LoadBalancer")].status.loadBalancer.ingress[0].ip}'); do
  echo "Testing connectivity to $test_ip"
  
  # Test from router
  goldentooth command allyrion "ping -c 3 $test_ip"
  
  # Test HTTP connectivity
  goldentooth command allyrion "curl -s -o /dev/null -w '%{http_code}' http://$test_ip/ || echo 'Connection failed'"
  
  # Test from external network (if possible)
  # ping -c 3 $test_ip
done

# Test internal cluster connectivity
kubectl run network-test --image=busybox --rm -it --restart=Never -- /bin/sh
# From within the pod:
# wget -qO- http://test-service.default.svc.cluster.local/

This comprehensive testing suite ensures MetalLB is functioning correctly across all operational scenarios, from basic IP allocation to complex failure recovery and integration testing. Each test phase builds confidence in the load balancer implementation and helps identify potential issues before they impact production workloads.

Refactoring Argo CD

We're only a few projects in, and using Ansible to install our Argo CD applications seems a bit weak. It's not very GitOps-y to run a Bash command that runs an Ansible playbook that kubectls some manifests into our Kubernetes cluster.

In fact, the less we mess with Argo CD itself, the better. Eventually, we'll be able to create a repository on GitHub and see resources appear within our Kubernetes cluster without having to touch Argo CD at all!

We'll do this by using the power of ApplicationSet resources.

First, we'll create a secret to hold a GitHub token. This part is optional, but it'll allow us to use the API more.

Second, we'll create an AppProject to encompass these applications. It'll have pretty broad permissions at first, though I'll try and tighten them up a bit.

apiVersion: 'argoproj.io/v1alpha1'
kind: 'AppProject'
metadata:
  name: 'gitops-repo'
  namespace: 'argocd'
  finalizers:
    - 'resources-finalizer.argocd.argoproj.io'
spec:
  description: 'Goldentooth GitOps-Repo project'
  sourceRepos:
    - '*'
  destinations:
    - namespace: '!kube-system'
      server: '*'
    - namespace: '*'
      server: '*'
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'

Then an ApplicationSet.

apiVersion: 'argoproj.io/v1alpha1'
kind: 'ApplicationSet'
metadata:
  name: 'gitops-repo'
  namespace: 'argocd'
spec:
  generators:
    - scmProvider:
        github:
          organization: 'goldentooth'
          tokenRef:
            secretName: 'github-token'
            key: 'token'
        filters:
          - labelMatch: 'gitops-repo'
  template:
    goTemplate: true
    goTemplateOptions: ["missingkey=error"]
    metadata:
      # Prefix name with `gitops-repo-`.
      # This allows us to define the `Application` manifest within the repo and
      # have significantly greater flexibility, at the cost of an additional
      # application in the Argo CD UI.
      name: 'gitops-repo-{{ .repository }}'
    spec:
      source:
        repoURL: '{{ .url }}'
        targetRevision: '{{ .branch }}'
        path: './'
      project: 'gitops-repo'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{ .repository }}'

The idea is that I'll create a repository and give it a topic of gitops-repo. This will be matched by the labelMatch filter, and then Argo CD will deploy whatever manifests it finds there.

MetalLB is the natural place to start.

We don't actually have to do that much to get this working:

  1. Create a new repository metallb.
  2. Add a Chart.yaml file with some boilerplate.
  3. Add the manifests to a templates/ directory.
  4. Add a values.yaml file with values to substitute into the manifests.
  5. As mentioned above, edit the repo to give it the gitops-repo topic.

Within a few minutes, Argo CD will notice the changes and deploy a gitops-repo-metallb application:

gitops-repo-metallb synced

If we click into it, we'll see the resources deployed by the manifests within the repository:

gitops-repo-metallb contents

So we see the resources we created previously for the BGPPeer, IPAddressPool, and BGPAdvertisement. We also see an Application, metallb, which we can also see in the general Applications overview in Argo CD:

metallb synced

Clicking into it, we'll see all of the resources deployed by the metallb Helm chart we referenced.

metallb contents

A quick test to verify that our httpbin application is still assigned a working load balancer, and we can declare victory!

While I'm here, I might as well shift httpbin and prometheus-node-exporter as well...

Giving Argo CD a Load Balancer

All this time, the Argo CD server has been operating with a ClusterIP service, and I've been manually port forwarding it via kubectl to be able to show all of these beautiful screenshots of the web UI.

That's annoying and we don't have to do it anymore. Fortunately, it's very easy to change this now; all we need to do is modify the Helm release values slightly; change server.service.type from 'ClusterIP' to 'LoadBalancer' and redeploy. A few minutes later, we can access Argo CD via http://10.4.11.1, no port forwarding required.

ExternalDNS

The workflow for accessing our LoadBalancer services ain't great.

If we deploy a new application, we need to run kubectl -n <namespace> get svc and read through a list to determine the IP address on which it's exposed. And that's not going to be stable; there's nothing at all guaranteeing that Argo CD will always be available at http://10.4.11.1.

Enter ExternalDNS. The idea is that we annotate our services with external-dns.alpha.kubernetes.io/hostname: "argocd.my-cluster.my-domain.com" and a DNS record will be created pointing to the actual IP address of the LoadBalancer service.

This is comparatively straightforward to configure if you host your DNS in one of the supported services. I host mine via AWS Route53, which is supported.

The complication is that we don't yet have a great way of managing secrets, so there's a manual step here that I find unpleasant, but we'll cross that bridge when we get to it.

Architecture Overview

ExternalDNS creates a bridge between Kubernetes services and external DNS providers, enabling automatic DNS record management:

DNS Infrastructure

  • Primary Domain: goldentooth.net managed in AWS Route53
  • Zone ID: Z0736727S7ZH91VKK44A (defined in Terraform)
  • Cluster Subdomain: Services automatically get <service>.goldentooth.net
  • TTL Configuration: Default 60 seconds for rapid updates during development

Integration Points

  • MetalLB: Provides LoadBalancer IPs from pool 10.4.11.0/24
  • Route53: AWS DNS service for public domain management
  • Argo CD: GitOps deployment and lifecycle management
  • Terraform: Infrastructure-as-code for Route53 zone and ACM certificates

Helm Chart Configuration

Because of work we've done previously with Argo CD, we can just create a new repository to deploy ExternalDNS within our cluster.

The ExternalDNS deployment is managed through a custom Helm chart with comprehensive configuration:

Chart Structure (Chart.yaml)

apiVersion: v2
name: external-dns
description: ExternalDNS for automatic DNS record management
type: application
version: 0.0.1
appVersion: "v0.14.2"

Values Configuration (values.yaml)

metadata:
  namespace: external-dns
  name: external-dns
  projectName: gitops-repo

spec:
  domainFilter: goldentooth.net
  version: v0.14.2

This configuration provides:

  • Namespace isolation: Dedicated external-dns namespace
  • GitOps integration: Part of the gitops-repo project for Argo CD
  • Domain scoping: Only manages records for goldentooth.net
  • Version pinning: Uses ExternalDNS v0.14.2 for stability

Deployment Architecture

Core Deployment Configuration

This has the following manifests:

Deployment: The deployment has several interesting features:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: external-dns
spec:
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:v0.14.2
        args:
        - --source=service
        - --domain-filter=goldentooth.net
        - --provider=aws
        - --aws-zone-type=public
        - --registry=txt
        - --txt-owner-id=external-dns-external-dns
        - --log-level=debug
        - --aws-region=us-east-1
        env:
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /.aws/credentials
        volumeMounts:
        - name: aws-credentials
          mountPath: /.aws
          readOnly: true
      volumes:
      - name: aws-credentials
        secret:
          secretName: external-dns

Key Configuration Parameters:

  • Provider: aws for Route53 integration
  • Sources: service (monitors Kubernetes LoadBalancer services)
  • Domain Filter: goldentooth.net (restricts DNS management scope)
  • AWS Zone Type: public (only manages public DNS records)
  • Registry: txt (uses TXT records for ownership tracking)
  • Owner ID: external-dns-external-dns (namespace-app format)
  • Region: us-east-1 (AWS region for Route53 operations)

AWS Credentials Management

Secret Configuration:

apiVersion: v1
kind: Secret
metadata:
  name: external-dns
  namespace: external-dns
type: Opaque
data:
  credentials: |
    [default]
    aws_access_key_id = {{ secret_vault.aws.access_key_id | b64encode }}
    aws_secret_access_key = {{ secret_vault.aws.secret_access_key | b64encode }}

This setup:

  • Secure storage: AWS credentials stored in Ansible vault
  • Minimal permissions: IAM user with only Route53 zone modification rights
  • File-based auth: Uses AWS credentials file format for compatibility
  • Namespace isolation: Secret accessible only within external-dns namespace

RBAC Configuration

ServiceAccount: Just adds a service account for ExternalDNS.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: external-dns

ClusterRole: Describes an ability to observe changes in services.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods", "nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "watch", "list"]

ClusterRoleBinding: Binds the above cluster role and ExternalDNS.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: external-dns

Permission Scope:

  • Read-only access: ExternalDNS cannot modify Kubernetes resources
  • Cluster-wide monitoring: Can watch services across all namespaces
  • Resource types: Services, endpoints, pods, nodes, and ingresses
  • Security principle: Least privilege for DNS management operations

Service Annotation Patterns

Basic DNS Record Creation

Services use annotations to trigger DNS record creation:

apiVersion: v1
kind: Service
metadata:
  name: httpbin
  namespace: httpbin
  annotations:
    external-dns.alpha.kubernetes.io/hostname: httpbin.goldentooth.net
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: httpbin

Annotation Functions:

  • Hostname: external-dns.alpha.kubernetes.io/hostname specifies the FQDN
  • TTL: external-dns.alpha.kubernetes.io/ttl sets DNS record time-to-live
  • Automatic A record: Points to MetalLB-allocated LoadBalancer IP
  • Automatic TXT record: Ownership tracking with txt-owner-id

Advanced Annotation Options

annotations:
  # Multiple hostnames for the same service
  external-dns.alpha.kubernetes.io/hostname: "app.goldentooth.net,api.goldentooth.net"
  
  # Custom TTL for caching strategy
  external-dns.alpha.kubernetes.io/ttl: "300"
  
  # AWS-specific: Route53 weighted routing
  external-dns.alpha.kubernetes.io/aws-weight: "100"
  
  # AWS-specific: Health check configuration
  external-dns.alpha.kubernetes.io/aws-health-check-id: "12345678-1234-1234-1234-123456789012"

DNS Record Lifecycle Management

Record Creation Process

  1. Service Creation: LoadBalancer service deployed with ExternalDNS annotations
  2. IP Allocation: MetalLB assigns IP from configured pool (10.4.11.0/24)
  3. Service Discovery: ExternalDNS watches Kubernetes API for service changes
  4. DNS Creation: Creates A record pointing to LoadBalancer IP
  5. Ownership Tracking: Creates TXT record for ownership verification

Record Cleanup Process

  1. Service Deletion: LoadBalancer service removed from cluster
  2. Change Detection: ExternalDNS detects service removal event
  3. Ownership Verification: Checks TXT record ownership before deletion
  4. DNS Cleanup: Removes both A and TXT records from Route53
  5. IP Release: MetalLB returns IP to available pool

TXT Record Ownership System

ExternalDNS uses TXT records for safe multi-cluster DNS management:

# Example TXT record for ownership tracking
dig TXT httpbin.goldentooth.net

# Response includes:
# httpbin.goldentooth.net. 60 IN TXT "heritage=external-dns,external-dns/owner=external-dns-external-dns"

This prevents:

  • Record conflicts: Multiple ExternalDNS instances managing same domain
  • Accidental deletion: Only owner can modify/delete records
  • Split-brain scenarios: Clear ownership prevents conflicting updates

Integration with GitOps

Argo CD Application Configuration

ExternalDNS is deployed via GitOps using the ApplicationSet pattern:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: gitops-repo
  namespace: argocd
spec:
  generators:
  - scmProvider:
      github:
        organization: goldentooth
        allBranches: false
        labelSelector:
          matchLabels:
            gitops-repo: "true"
  template:
    metadata:
      name: '{{repository}}'
    spec:
      project: gitops-repo
      source:
        repoURL: '{{url}}'
        targetRevision: '{{branch}}'
        path: .
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{repository}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true

This provides:

  • Automatic deployment: Changes to external-dns repository trigger redeployment
  • Namespace creation: Automatically creates external-dns namespace
  • Self-healing: Argo CD corrects configuration drift
  • Pruning: Removes resources deleted from Git repository

Repository Structure

external-dns/
├── Chart.yaml          # Helm chart metadata
├── values.yaml         # Configuration values
└── templates/
    ├── Deployment.yaml  # ExternalDNS deployment
    ├── ServiceAccount.yaml
    ├── ClusterRole.yaml
    ├── ClusterRoleBinding.yaml
    └── Secret.yaml      # AWS credentials (Ansible-templated)

Monitoring and Troubleshooting

Health Verification

# Check ExternalDNS pod status
kubectl -n external-dns get pods

# Monitor ExternalDNS logs
kubectl -n external-dns logs -l app=external-dns --tail=50

# Verify AWS credentials
kubectl -n external-dns exec deployment/external-dns -- cat /.aws/credentials

# Check service discovery
kubectl -n external-dns logs deployment/external-dns | grep "Creating record"

DNS Record Validation

# Verify A record creation
dig A httpbin.goldentooth.net

# Check TXT record ownership
dig TXT httpbin.goldentooth.net

# Validate Route53 changes
aws route53 list-resource-record-sets --hosted-zone-id Z0736727S7ZH91VKK44A | jq '.ResourceRecordSets[] | select(.Name | contains("httpbin"))'

Common Issues and Solutions

Issue: DNS records not created

  • Check: Service has type: LoadBalancer and LoadBalancer IP is assigned
  • Verify: ExternalDNS has RBAC permissions to read services
  • Debug: Check ExternalDNS logs for AWS API errors

Issue: DNS records not cleaned up

  • Check: TXT record ownership matches ExternalDNS txt-owner-id
  • Verify: AWS credentials have Route53 delete permissions
  • Debug: Monitor ExternalDNS logs during service deletion

Issue: Multiple DNS entries for same service

  • Check: Only one ExternalDNS instance should manage each domain
  • Verify: txt-owner-id is unique across clusters
  • Fix: Use different owner IDs for different environments

Integration Examples

Argo CD Access

A few minutes after pushing changes to the repository, we can reach Argo CD via https://argocd.goldentooth.net/.

Service Configuration:

apiVersion: v1
kind: Service
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    external-dns.alpha.kubernetes.io/hostname: argocd.goldentooth.net
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  type: LoadBalancer
  ports:
  - port: 443
    targetPort: 8080
    protocol: TCP
    name: https
  selector:
    app.kubernetes.io/component: server
    app.kubernetes.io/name: argocd-server

This automatically creates:

  • A record: argocd.goldentooth.net → 10.4.11.1 (MetalLB-assigned IP)
  • TXT record: Ownership tracking for safe management
  • 60-second TTL: Rapid DNS propagation for development workflows

The combination of MetalLB for LoadBalancer IP allocation and ExternalDNS for automatic DNS management creates a seamless experience where services become accessible via friendly DNS names without manual intervention, enabling true infrastructure-as-code patterns for both networking and DNS.

Killing the Incubator

At this point, given the ease of spinning up new applications with the gitops-repo ApplicationSet, there's really not much benefit to the Incubator app-of-apps repo.

I'd also added a way of easily spinning up generic projects, but I don't think that's necessary either. The ApplicationSet approach is really pretty powerful 🙂

Welcome Back

So, uh, it's been a while. Things got busy and I didn't really touch the cluster for a while, and now I'm interested in it again and of course have completely forgotten everything about it.

I also ditched my OPNsense firewall because I felt it was probably costing too much power and replaced with with a simpler Unifi device, which is great but I just realized that I now have to reconfigure MetalLB to use Layer 2 instead of BGP. I probably should've used Layer 2 from the beginning, but I thought BGP would make me look a little cooler. So no load balancer integration is working right now on the cluster, which means I can't easily check in on ArgoCD. But that's fine, that's not really my highest priority.

Also, I have some new interests; I've gotten into HPC and MLOps, and some of the people I'm interested in working with use Nomad, which I've used for a couple of throwaway play projects but never on an ongoing basis. So I'm going to set up Slurm and Nomad and probably some other things. Should be fun and teach me a good amount. Of course, that's moving away from Kubernetes, but I figure I'll keep the name of this blog the same because frankly I just don't have any interest in renaming it.

First, though, I need to make sure the cluster itself is up to snuff.

Now, even I remember that I have a little Bash tool to ease administering the cluster. And because I know me, it has online help:

$ goldentooth
Usage: goldentooth <subcommand> [arguments...]

Subcommands:
            autocomplete Enable bash autocompletion.
                 install Install Ansible dependencies.
                    lint Lint all roles.
                    ping Ping all hosts.
                  uptime Get uptime for all hosts.
                 command Run an arbitrary command on all hosts.
              edit_vault Edit the vault.
        ansible_playbook Run a specified Ansible playbook.
                   usage Display usage information.
           bootstrap_k8s Bootstrap Kubernetes cluster with kubeadm.
                 cleanup Perform various cleanup tasks.
       configure_cluster Configure the hosts in the cluster.
          install_argocd Install Argo CD on Kubernetes cluster.
     install_argocd_apps Install Argo CD applications.
            install_helm Install Helm on Kubernetes cluster.
    install_k8s_packages Install Kubernetes packages.
               reset_k8s Reset Kubernetes cluster with kubeadm.
     setup_load_balancer Setup the load balancer.
                shutdown Cleanly shut down the hosts in the cluster.
  uninstall_k8s_packages Uninstall Kubernetes packages.

so I can ping all of the nodes:

$ goldentooth ping
allyrion | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
gardener | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
inchfield | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
cargyll | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
erenford | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
dalt | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
bettley | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
jast | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
harlton | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
fenn | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

and... yes, that's all of them. Okay, that's a good sign.

And then I can get their uptime:

$ goldentooth uptime
gardener | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:48,  0 user,  load average: 0.13, 0.17, 0.14
allyrion | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:49,  0 user,  load average: 0.10, 0.06, 0.01
inchfield | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:47,  0 user,  load average: 0.25, 0.59, 0.60
erenford | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:48,  0 user,  load average: 0.08, 0.15, 0.12
jast | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:47,  0 user,  load average: 0.11, 0.19, 0.27
dalt | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:49,  0 user,  load average: 0.84, 0.64, 0.59
fenn | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:48,  0 user,  load average: 0.27, 0.34, 0.23
harlton | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:48,  0 user,  load average: 0.27, 0.14, 0.20
bettley | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:49,  0 user,  load average: 0.41, 0.49, 0.81
cargyll | CHANGED | rc=0 >>
 19:26:23 up 17 days,  4:49,  0 user,  load average: 0.26, 0.42, 0.64

17 days, which is about when I set up the new router and had to reorganize a lot of my network. Seems legit. So it looks like the power supplies are still fine. When I first set up the cluster, I think there was a flaky USB cable on one of the Pis, so it would occasionally drop off. I'd prefer to control my chaos engineering, not have it arise spontaneously from my poor QC, thank you very much.

My first node just runs HAProxy (currently) and is the simplest, so I'm going to check and see what needs to be updated. Nobody cares about apt stuff so I'll skip the details.

TL;DR: It wasn't that much, really, though it does appear that I had some files in /etc/modprobe.d that should've been in /etc/modules-load.d. I blame... someone else.

So I'll update all of the nodes, hope they rejoin the cluster when they reboot, and in the next entry I'll try to update Kubernetes...

NFS Exports

Just kidding, I'm going to set up a USB thumb drive and NFS exports on Allyrion (my load balancer node).

The thumbdrive is just a Sandisk 64GB. Should be enough to do some fun stuff. fdisk it (hey, I remember the commands!), mkfs.ext4 it, get the UUID, add it to /etc/fstab (not "f-stab", "fs-tab"), and we have a bright shiny new volume.

NFS Server Implementation

NFS isn't hard to set up, but I'm going to use Jeff's ansible-role-nfs for consistency and maintainability.

The implementation consists of two main components:

Server Configuration

The NFS server setup is managed through the setup_nfs_exports.yaml playbook, which performs these operations:

  1. Install NFS utilities on all nodes:
- name: 'Install NFS utilities.'
  hosts: 'all'
  remote_user: 'root'
  tasks:
    - name: 'Ensure NFS utilities are installed.'
      ansible.builtin.apt:
        name:
          - nfs-common
        state: present
  1. Configure NFS server on allyrion:
- name: 'Setup NFS exports.'
  hosts: 'nfs'
  remote_user: 'root'
  roles:
    - { role: 'geerlingguy.nfs' }

Export Configuration

The NFS export is configured through host variables in inventory/host_vars/allyrion.yaml:

nfs_exports:
- "/mnt/usb1 *(rw,sync,no_root_squash,no_subtree_check)"

This export configuration provides:

  • Path: /mnt/usb1 - The USB thumb drive mount point
  • Access: * - Allow access from any host within the cluster network
  • Permissions: rw - Read-write access for all clients
  • Sync Policy: sync - Synchronous writes (safer but slower than async)
  • Root Mapping: no_root_squash - Allow root user from clients to maintain root privileges
  • Performance: no_subtree_check - Disable subtree checking for better performance

Network Integration

The NFS server integrates with the cluster's network architecture:

Server Information:

  • Host: allyrion (10.4.0.10)
  • Role: Dual-purpose load balancer and NFS server
  • Network: Infrastructure CIDR 10.4.0.0/20

Global NFS Configuration (in group_vars/all/vars.yaml):

nfs:
  server: "{{ groups['nfs_server'] | first}}"
  mounts:
    primary:
      share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
      mount: '/mnt/nfs'
      safe_name: 'mnt-nfs'
      type: 'nfs'
      options: {}

This configuration:

  • Dynamically determines the NFS server from the nfs_server inventory group
  • Uses the server's IP address for robust connectivity
  • Standardizes the client mount point as /mnt/nfs
  • Provides a safe filesystem name for systemd units

Security Considerations

Internal Network Trust Model: The NFS configuration uses a simplified security model appropriate for an internal cluster:

  • Open Access: The * wildcard allows any host to mount the share
  • No Kerberos: Uses IP-based authentication rather than user-based
  • Root Access: no_root_squash enables administrative operations from clients
  • Network Boundary: Security relies on the trusted internal network (10.4.0.0/20)

Storage Infrastructure

Physical Storage:

  • Device: SanDisk 64GB USB thumb drive
  • Filesystem: ext4 for reliability and broad compatibility
  • Mount Point: /mnt/usb1
  • Persistence: Configured in /etc/fstab using UUID for reliability

Performance Characteristics:

  • Capacity: 64GB available storage
  • Access Pattern: Shared read-write across 13 cluster nodes
  • Use Cases: Configuration files, shared data, cluster coordination

Verification and Testing

The NFS export can be verified using standard tools:

$ showmount -e allyrion
Exports list on allyrion:
/mnt/usb1                           *

This output confirms:

  • The export is properly configured and accessible
  • The path /mnt/usb1 is being served
  • Access is open to all hosts (*)

Command Line Integration

The NFS setup integrates with the goldentooth CLI for consistent cluster management:

# Configure NFS server
goldentooth setup_nfs_exports

# Configure client mounts (covered in Chapter 031)
goldentooth setup_nfs_mounts

# Verify exports
goldentooth command allyrion 'showmount -e allyrion'

Future Evolution

Note: This represents the initial NFS implementation. The cluster later evolves to include more sophisticated storage with ZFS pools and replication (documented in Chapter 050), while maintaining compatibility with this foundational NFS export.

We'll return to this later and find out if it actually works when we configure the client mounts in the next section.

Kubernetes Updates

Because I'm not a particularly smart man, I've allowed my cluster to fall behind. The current version, as of today, is 1.32.3, and my cluster is on 1.29.something.

So that means I need to upgrade 1.29 -> 1.30, 1.30 -> 1.31, and 1.31 -> 1.32.

1.29 -> 1.30

First, I update the repo URL in /etc/apt/sources.list.d/kubernetes.sources and run:

$ sudo apt update
Hit:1 http://deb.debian.org/debian bookworm InRelease
Hit:2 http://deb.debian.org/debian-security bookworm-security InRelease
Hit:3 http://deb.debian.org/debian bookworm-updates InRelease
Hit:4 https://download.docker.com/linux/debian bookworm InRelease
Hit:6 http://archive.raspberrypi.com/debian bookworm InRelease
Hit:7 https://baltocdn.com/helm/stable/debian all InRelease
Get:5 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.30/deb  InRelease [1,189 B]
Err:5 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.30/deb  InRelease
  The following signatures were invalid: EXPKEYSIG 234654DA9A296436 isv:kubernetes OBS Project <isv:kubernetes@build.opensuse.org>
Reading package lists... Done
W: https://download.docker.com/linux/debian/dists/bookworm/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: GPG error: https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.30/deb  InRelease: The following signatures were invalid: EXPKEYSIG 234654DA9A296436 isv:kubernetes OBS Project <isv:kubernetes@build.opensuse.org>
E: The repository 'https://pkgs.k8s.io/core:/stable:/v1.30/deb  InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Well, shit. Looks like I need to do some surgery elsewhere.

Fortunately, I had some code for setting up the Kubernetes package repositories in install_k8s_packages. Of course, I don't want to install new versions of the packages – the upgrade process is a little more delicate than that – so I factored it out into a new role called setup_k8s_apt. Running that role against my cluster with goldentooth setup_k8s_apt made the necessary changes.

$ sudo apt-cache madison kubeadm
   kubeadm | 1.30.11-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.10-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.9-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.8-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.7-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.6-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.5-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.4-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.3-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.2-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.1-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages
   kubeadm | 1.30.0-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb  Packages

There we go. That wasn't that bad.

Now, the next steps are things I'm going to do repeatedly, and I don't want to type a bunch of commands, so I'm going to do it in Ansible. I need to do that advisedly, though.

I created a new role, goldentooth.upgrade_k8s. I'm working through the upgrade documentation, Ansible-izing it as I go.

So I added some tasks to update the Apt cache, unhold kubeadm, upgrade it, and then hold it again (via a handler). I tagged these with first_control_plane and invoke the role dynamically (because that is the only context in which you can limit execution of a role to the specified tags).

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"30", GitVersion:"v1.30.11", GitCommit:"6a074997c960757de911780f250ecd9931917366", GitTreeState:"clean", BuildDate:"2025-03-11T19:56:25Z", GoVersion:"go1.23.6", Compiler:"gc", Platform:"linux/arm64"}

It worked!

The plan operation similarly looks fine.

$ sudo kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: 1.29.6
[upgrade/versions] kubeadm version: v1.30.11
I0403 11:18:34.338987  564280 version.go:256] remote version is much newer: v1.32.3; falling back to: stable-1.30
[upgrade/versions] Target version: v1.30.11
[upgrade/versions] Latest version in the v1.29 series: v1.29.15

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   NODE        CURRENT   TARGET
kubelet     bettley     v1.29.2   v1.29.15
kubelet     cargyll     v1.29.2   v1.29.15
kubelet     dalt        v1.29.2   v1.29.15
kubelet     erenford    v1.29.2   v1.29.15
kubelet     fenn        v1.29.2   v1.29.15
kubelet     gardener    v1.29.2   v1.29.15
kubelet     harlton     v1.29.2   v1.29.15
kubelet     inchfield   v1.29.2   v1.29.15
kubelet     jast        v1.29.2   v1.29.15

Upgrade to the latest version in the v1.29 series:

COMPONENT                 NODE      CURRENT    TARGET
kube-apiserver            bettley   v1.29.6    v1.29.15
kube-apiserver            cargyll   v1.29.6    v1.29.15
kube-apiserver            dalt      v1.29.6    v1.29.15
kube-controller-manager   bettley   v1.29.6    v1.29.15
kube-controller-manager   cargyll   v1.29.6    v1.29.15
kube-controller-manager   dalt      v1.29.6    v1.29.15
kube-scheduler            bettley   v1.29.6    v1.29.15
kube-scheduler            cargyll   v1.29.6    v1.29.15
kube-scheduler            dalt      v1.29.6    v1.29.15
kube-proxy                          1.29.6     v1.29.15
CoreDNS                             v1.11.1    v1.11.3
etcd                      bettley   3.5.10-0   3.5.15-0
etcd                      cargyll   3.5.10-0   3.5.15-0
etcd                      dalt      3.5.10-0   3.5.15-0

You can now apply the upgrade by executing the following command:

	kubeadm upgrade apply v1.29.15

_____________________________________________________________________

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   NODE        CURRENT   TARGET
kubelet     bettley     v1.29.2   v1.30.11
kubelet     cargyll     v1.29.2   v1.30.11
kubelet     dalt        v1.29.2   v1.30.11
kubelet     erenford    v1.29.2   v1.30.11
kubelet     fenn        v1.29.2   v1.30.11
kubelet     gardener    v1.29.2   v1.30.11
kubelet     harlton     v1.29.2   v1.30.11
kubelet     inchfield   v1.29.2   v1.30.11
kubelet     jast        v1.29.2   v1.30.11

Upgrade to the latest stable version:

COMPONENT                 NODE      CURRENT    TARGET
kube-apiserver            bettley   v1.29.6    v1.30.11
kube-apiserver            cargyll   v1.29.6    v1.30.11
kube-apiserver            dalt      v1.29.6    v1.30.11
kube-controller-manager   bettley   v1.29.6    v1.30.11
kube-controller-manager   cargyll   v1.29.6    v1.30.11
kube-controller-manager   dalt      v1.29.6    v1.30.11
kube-scheduler            bettley   v1.29.6    v1.30.11
kube-scheduler            cargyll   v1.29.6    v1.30.11
kube-scheduler            dalt      v1.29.6    v1.30.11
kube-proxy                          1.29.6     v1.30.11
CoreDNS                             v1.11.1    v1.11.3
etcd                      bettley   3.5.10-0   3.5.15-0
etcd                      cargyll   3.5.10-0   3.5.15-0
etcd                      dalt      3.5.10-0   3.5.15-0

You can now apply the upgrade by executing the following command:

	kubeadm upgrade apply v1.30.11

_____________________________________________________________________


The table below shows the current state of component configs as understood by this version of kubeadm.
Configs that have a "yes" mark in the "MANUAL UPGRADE REQUIRED" column require manual config upgrade or
resetting to kubeadm defaults before a successful upgrade can be performed. The version to manually
upgrade to is denoted in the "PREFERRED VERSION" column.

API GROUP                 CURRENT VERSION   PREFERRED VERSION   MANUAL UPGRADE REQUIRED
kubeproxy.config.k8s.io   v1alpha1          v1alpha1            no
kubelet.config.k8s.io     v1beta1           v1beta1             no
_____________________________________________________________________

Of course, I won't automate the actual upgrade process; that seems unwise.

I'm skipping certificate renewal because I'd like to fight with one thing at a time.

$ sudo kubeadm upgrade apply v1.30.11 --certificate-renewal=false
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.30.11"
[upgrade/versions] Cluster version: v1.29.6
[upgrade/versions] kubeadm version: v1.30.11
[upgrade] Are you sure you want to proceed? [y/N]: y
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
W0403 11:23:42.086815  566901 checks.go:844] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.9" as the CRI sandbox image.
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.30.11" (timeout: 5m0s)...
[upgrade/etcd] Upgrading to TLS for etcd
[upgrade/staticpods] Preparing for "etcd" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=etcd
[upgrade/staticpods] Component "etcd" upgraded successfully!
[upgrade/etcd] Waiting for etcd to become available
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests1796562509"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upgrade] Backing up kubelet config file to /etc/kubernetes/tmp/kubeadm-kubelet-config2173844632/config.yaml
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[upgrade/addons] skip upgrade addons because control plane instances [cargyll dalt] have not been upgraded

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.30.11". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.

The next steps for the other two control plane nodes are fairly straightforward. This mostly just consisted of duplicating the playbook block to add a new step for when the playbook is executed with the 'other_control_plane' tag and adding that tag to the steps already added in the setup_k8s role.

$ goldentooth command cargyll,dalt 'sudo kubeadm upgrade node'

And a few minutes later, both of the remaining control plane nodes have updated.

The next step is to upgrade the kubelet in each node.

Serially, for obvious reasons, we need to drain each node (from a control plane node), upgrade the kubelet (unhold, upgrade, hold), then uncordon the node (again, from a control plane node).

That's not too bad, and it's included in the latest changes to the upgrade_k8s role.

The final step is upgrading kubectl on each of the control plane nodes, which is a comparative cakewalk.

$ sudo kubectl version
Client Version: v1.30.11
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.11

Nice!

1.30 -> 1.31

Now that the Ansible playbook and role are fleshed out, the process moving forward is comparatively simple.

  1. Change the k8s_version_clean variable to 1.31.
  2. goldentooth setup_k8s_apt
  3. goldentooth upgrade_k8s --tags=kubeadm_first
  4. goldentooth command bettley 'kubeadm version'
  5. goldentooth command bettley 'sudo kubeadm upgrade plan'
  6. goldentooth command bettley 'sudo kubeadm upgrade apply v1.31.7 --certificate-renewal=false -y'
  7. goldentooth upgrade_k8s --tags=kubeadm_rest
  8. goldentooth command cargyll,dalt 'sudo kubeadm upgrade node'
  9. goldentooth upgrade_k8s --tags=kubelet
  10. goldentooth upgrade_k8s --tags=kubectl

1.31 -> 1.32

Hell, this is kinda fun now.

  1. Change the k8s_version_clean variable to 1.32.
  2. goldentooth setup_k8s_apt
  3. goldentooth upgrade_k8s --tags=kubeadm_first
  4. goldentooth command bettley 'kubeadm version'
  5. goldentooth command bettley 'sudo kubeadm upgrade plan'
  6. goldentooth command bettley 'sudo kubeadm upgrade apply v1.32.3 --certificate-renewal=false -y'
  7. goldentooth upgrade_k8s --tags=kubeadm_rest
  8. goldentooth command cargyll,dalt 'sudo kubeadm upgrade node'
  9. goldentooth upgrade_k8s --tags=kubelet
  10. goldentooth upgrade_k8s --tags=kubectl

And eventually, everything is fine:

$ sudo kubectl get nodes
NAME        STATUS   ROLES           AGE    VERSION
bettley     Ready    control-plane   286d   v1.32.3
cargyll     Ready    control-plane   286d   v1.32.3
dalt        Ready    control-plane   286d   v1.32.3
erenford    Ready    <none>          286d   v1.32.3
fenn        Ready    <none>          286d   v1.32.3
gardener    Ready    <none>          286d   v1.32.3
harlton     Ready    <none>          286d   v1.32.3
inchfield   Ready    <none>          286d   v1.32.3
jast        Ready    <none>          286d   v1.32.3

Fixing MetalLB

As mentioned here, I purchased a new router to replace a power-hungry Dell server running OPNsense, and that cost me BGP support. This kills my MetalLB configuration, so I need to switch it to use Layer 2.

This transition represents a fundamental change in how MetalLB operates and requires understanding the trade-offs between BGP and Layer 2 modes.

BGP vs Layer 2 Architecture Comparison

BGP Mode (Previous Configuration)

  • Dynamic routing: BGP speakers advertise LoadBalancer IPs to upstream routers
  • True load balancing: Multiple nodes can announce the same service IP with ECMP
  • Scalability: Router handles load distribution and failover automatically
  • Network integration: Works with enterprise routing infrastructure
  • Requirements: Router must support BGP (FRR, Quagga, hardware routers)

Layer 2 Mode (New Configuration)

  • ARP announcements: Nodes respond to ARP requests for LoadBalancer IPs
  • Active/passive failover: Only one node answers ARP for each service IP
  • Simpler setup: No routing protocol configuration required
  • Limited scalability: All traffic for a service goes through single node
  • Requirements: Nodes must be on same Layer 2 network segment

Hardware Infrastructure Change

The transition was necessitated by hardware changes:

Previous Setup:

  • Dell server: Power-hungry (likely PowerEdge) running OPNsense
  • BGP support: FRR (Free Range Routing) plugin provided full BGP implementation
  • Power consumption: High power draw from server-class hardware
  • Complexity: Full routing stack with BGP, OSPF, and other protocols

New Setup:

  • Consumer router: Lower power consumption
  • No BGP support: Consumer-grade firmware lacks routing protocol support
  • Simplified networking: Standard static routing and NAT
  • Cost efficiency: Reduced power costs and hardware complexity

Migration Process

The migration involved several coordinated steps to minimize service disruption:

Step 1: Remove BGP Configuration

That shouldn't be too bad.

I think it's just a matter of deleting the BGP advertisement:

$ sudo kubectl -n metallb delete BGPAdvertisement primary
bgpadvertisement.metallb.io "primary" deleted

This command removes the BGP advertisement configuration, which:

  • Stops route announcements: MetalLB speakers stop advertising LoadBalancer IPs via BGP
  • Maintains IP allocation: Existing LoadBalancer services keep their assigned IPs
  • Preserves connectivity: Services remain accessible until Layer 2 mode is configured

Step 2: Configure Layer 2 Advertisement

and creating an L2 advertisement:

$ cat tmp.yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: primary
  namespace: metallb

$ sudo kubectl apply -f tmp.yaml
l2advertisement.metallb.io/primary created

L2Advertisement Configuration Details:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: primary
  namespace: metallb
spec:
  ipAddressPools:
  - primary
  nodeSelectors:
  - matchLabels:
      kubernetes.io/hostname: "*"
  interfaces:
  - eth0

Key behaviors in Layer 2 mode:

  • ARP responder: Nodes respond to ARP requests for LoadBalancer IPs
  • Leader election: One node per service IP elected as ARP responder
  • Gratuitous ARP: Leader sends gratuitous ARP to announce IP ownership
  • Failover: New leader elected if current leader becomes unavailable

Step 3: Router Static Route Configuration

After adding the static route to my router, I can see the friendly go-httpbin response when I navigate to https://10.4.11.1/

Static Route Configuration:

# Router configuration (varies by model)
# Destination: 10.4.11.0/24 (MetalLB IP pool)
# Gateway: 10.4.0.X (any cluster node IP)
# Interface: LAN interface connected to cluster network

Why static routes are necessary:

  • IP pool isolation: MetalLB pool (10.4.11.0/24) is separate from cluster network (10.4.0.0/20)
  • Router awareness: Router needs to know how to reach LoadBalancer IPs
  • Return path: Ensures bidirectional connectivity for external clients

Network Topology Changes

Layer 2 Network Requirements

Physical topology:

[Internet] → [Router] → [Switch] → [Cluster Nodes]
                ↓
         Static Route:
         10.4.11.0/24 → cluster

ARP behavior:

  1. Client request: External client sends packet to LoadBalancer IP
  2. Router forwarding: Router forwards based on static route to cluster network
  3. ARP resolution: Router/switch broadcasts ARP request for LoadBalancer IP
  4. Node response: MetalLB leader node responds with its MAC address
  5. Traffic delivery: Subsequent packets sent directly to leader node

Failover Mechanism

Leader election process:

# Check current leader for a service
kubectl -n metallb logs -l app.kubernetes.io/component=speaker | grep "announcing"

# Example output:
# {"level":"info","ts":"2024-01-15T10:30:00Z","msg":"announcing","ip":"10.4.11.1","node":"bettley"}

Failover sequence:

  1. Leader failure: Current announcing node becomes unavailable
  2. Detection: MetalLB speakers detect leader absence (typically 10-30 seconds)
  3. Election: Remaining speakers elect new leader using deterministic algorithm
  4. Gratuitous ARP: New leader sends gratuitous ARP to update network caches
  5. Service restoration: Traffic resumes through new leader node

DNS Infrastructure Migration

I also lost some control over DNS, e.g. the router's DNS server will override all lookups for hellholt.net rather than forwarding requests to my DNS servers.

So I created a new domain, goldentooth.net, to handle this cluster. A couple of tweaks to ExternalDNS and some service definitions and I can verify that ExternalDNS is setting the DNS records correctly, although I don't seem to be able to resolve names just yet.

Domain Migration Impact

Previous Domain: hellholt.net

  • Router control: New router overrides DNS resolution
  • Local DNS interference: Router's DNS server intercepts queries
  • Limited delegation: Consumer router lacks sophisticated DNS forwarding

New Domain: goldentooth.net

  • External control: Managed entirely in AWS Route53
  • Clean delegation: No local DNS interference
  • ExternalDNS compatibility: Full automation support

ExternalDNS Configuration Updates

Domain filter change:

# Previous configuration
args:
- --domain-filter=hellholt.net

# New configuration
args:
- --domain-filter=goldentooth.net

Service annotation updates:

# httpbin service example
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: httpbin.goldentooth.net
    # Previously: httpbin.hellholt.net

DNS record verification:

# Check Route53 records
aws route53 list-resource-record-sets --hosted-zone-id Z0736727S7ZH91VKK44A

# Verify DNS propagation
dig A httpbin.goldentooth.net
dig TXT httpbin.goldentooth.net  # Ownership records

Performance and Operational Considerations

Layer 2 Mode Limitations

Single point of failure:

  • Only one node handles traffic for each LoadBalancer IP
  • Node failure causes service interruption until failover completes
  • No load distribution across multiple nodes

Network broadcast traffic:

  • ARP announcements increase broadcast traffic
  • Gratuitous ARP during failover events
  • Potential impact on large Layer 2 domains

Scalability constraints:

  • All service traffic passes through single node
  • Node bandwidth becomes bottleneck for high-traffic services
  • Limited horizontal scaling compared to BGP mode

Monitoring and Troubleshooting

MetalLB speaker logs:

# Monitor speaker activities
kubectl -n metallb logs -l component=speaker --tail=50

# Check for leader election events
kubectl -n metallb logs -l component=speaker | grep -E "(leader|announcing|failover)"

# Verify ARP announcements
kubectl -n metallb logs -l component=speaker | grep "gratuitous ARP"

Network connectivity testing:

# Test ARP resolution for LoadBalancer IPs
arping -c 3 10.4.11.1

# Check MAC address consistency
arp -a | grep "10.4.11"

# Verify static routes on router
ip route show | grep "10.4.11.0/24"

Future TLS Strategy

I think I still need to get TLS working too, but I've soured on the idea of maintaining a cert per domain name and per service. I think I'll just have a wildcard over goldentooth.net and share that out. Too much aggravation otherwise. That's a problem for another time, though.

Wildcard certificate benefits:

  • Simplified management: Single certificate for all subdomains
  • Reduced complexity: No per-service certificate automation
  • Cost efficiency: One certificate instead of multiple Let's Encrypt certs
  • Faster deployment: No certificate provisioning delays for new services

Implementation considerations:

# Wildcard certificate configuration
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: goldentooth-wildcard
  namespace: default
spec:
  secretName: goldentooth-wildcard-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - "*.goldentooth.net"
  - "goldentooth.net"

Configuration Persistence

The Layer 2 configuration is maintained in the gitops repository structure:

MetalLB Helm chart updates:

# values.yaml changes
spec:
  # BGP configuration removed
  # bgpPeers: []
  # bgpAdvertisements: []

  # Layer 2 configuration added
  l2Advertisements:
  - name: primary
    ipAddressPools:
    - primary

This transition demonstrates the flexibility of MetalLB to adapt to different network environments while maintaining service availability. While Layer 2 mode has limitations compared to BGP, it provides a viable solution for simpler network infrastructures and reduces operational complexity in exchange for some scalability constraints.

Post-Implementation Updates and Additional Fixes

After the initial MetalLB L2 migration, several additional issues were discovered and resolved to achieve full operational status.

Network Interface Selection Issues

During verification, a critical issue emerged with "super shaky" primary interface selection on cluster nodes. Some nodes (particularly newer ones like lipps and karstark) had both wired (eth0) and wireless (wlan0) interfaces active, causing:

  • Calico confusion: CNI plugin using wireless interfaces for pod networking
  • MetalLB routing failures: ARP announcements on wrong interfaces
  • Inconsistent connectivity: Services unreachable from certain nodes

Solution implemented:

  1. Enhanced networking role: Created robust interface detection logic preferring eth0
  2. Wireless interface management: Automatic detection and disabling of wlan0 on dual-homed nodes
  3. SystemD persistence: Network configurations and wireless disable service survive reboots
  4. Network debugging tools: Installed comprehensive toolset (arping, tcpdump, mtr, etc.)

Networking role improvements:

# /ansible/roles/goldentooth.setup_networking/tasks/main.yaml
- name: 'Set primary interface to eth0 if available'
  ansible.builtin.set_fact:
    metallb_interface: 'eth0'
  when:
    - 'network.metallb.interface == ""'
    - 'eth0_exists.rc == 0'

- name: 'Disable wireless interface if both eth0 and wireless exist'
  ansible.builtin.shell:
    cmd: "ip link set {{ wireless_interface_name.stdout }} down"
  when:
    - 'wireless_interface_count.stdout | int > 0'
    - 'eth0_exists.rc == 0'

DNS Architecture Migration

The L2 migration coincided with a broader DNS restructuring from hellholt.net to goldentooth.net with hierarchical service domains:

New domain structure:

  • Nodes: <node>.nodes.goldentooth.net
  • Kubernetes services: <service>.services.k8s.goldentooth.net
  • Nomad services: <service>.services.nomad.goldentooth.net
  • General services: <service>.services.goldentooth.net

ExternalDNS integration:

# Service annotations for automatic DNS management
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "argocd.services.k8s.goldentooth.net"
    external-dns.alpha.kubernetes.io/ttl: "60"

Current Operational Status (July 2025)

The MetalLB L2 configuration is now fully operational with the following verified services:

Active LoadBalancer services:

  • ArgoCD: argocd.services.k8s.goldentooth.net10.4.11.0
  • HTTPBin: httpbin.services.k8s.goldentooth.net10.4.11.1

Verification commands (updated):

# Check MetalLB speaker status
kubectl -n metallb logs -l app.kubernetes.io/component=speaker --tail=20

# Verify L2 announcements
kubectl -n metallb logs -l app.kubernetes.io/component=speaker | grep "announcing"

# Test connectivity to LoadBalancer IPs
curl -I http://10.4.11.1/  # HTTPBin
curl -I http://10.4.11.0/  # ArgoCD

# Verify DNS resolution
dig argocd.services.k8s.goldentooth.net
dig httpbin.services.k8s.goldentooth.net

# Check interface status on all nodes
goldentooth command all_nodes "ip link show | grep -E '(eth0|wlan)'"

MetalLB configuration summary:

  • Mode: Layer 2 (BGP disabled)
  • IP Pool: 10.4.11.0 - 10.4.15.254
  • Interface: eth0 (consistently across all nodes)
  • FRR: Disabled in Helm values for pure L2 operation

NFS Mounts

Now that Kubernetes is kinda squared away, I'm going to set up NFS mounts on the cluster nodes.

For the sake of simplicity, I'll just set up the mounts on every node, including the load balancer (which is currently exporting the share).

Implementation Architecture

Systemd-Based Mounting

Rather than using traditional /etc/fstab entries, I implemented NFS mounting using systemd mount and automount units. This approach provides several advantages:

  • Dynamic mounting: Automount units mount filesystems on-demand
  • Service management: Standard systemd service lifecycle management
  • Dependency handling: Proper ordering with network services
  • Logging: Integration with systemd journal for troubleshooting

Global Configuration

The NFS mount configuration is defined in group_vars/all/vars.yaml:

nfs:
  server: "{{ groups['nfs_server'] | first}}"
  mounts:
    primary:
      share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
      mount: '/mnt/nfs'
      safe_name: 'mnt-nfs'
      type: 'nfs'
      options: {}

This configuration:

  • Dynamically determines NFS server: Uses first host in nfs_server group (allyrion)
  • IP-based addressing: Uses 10.4.0.10:/mnt/usb1 for reliable connectivity
  • Standardized mount point: All nodes mount at /mnt/nfs
  • Safe naming: Provides mnt-nfs for systemd unit names

Systemd Template Implementation

Mount Unit Template

The mount service template (templates/mount.j2) creates individual systemd mount units:

[Unit]
Description=Mount {{ item.key }}

[Mount]
What={{ item.value.share }}
Where={{ item.value.mount }}
Type={{ item.value.type }}
Options={{ item.value.options | join(',') }}

[Install]
WantedBy=default.target

This generates a unit file at /etc/systemd/system/mnt-nfs.mount with:

  • What: 10.4.0.10:/mnt/usb1 (NFS export path)
  • Where: /mnt/nfs (local mount point)
  • Type: nfs (filesystem type)
  • Options: Default NFS mount options

Automount Unit Template

The automount template (templates/automount.j2) provides on-demand mounting:

[Unit]
Description=Automount {{ item.key }}
After=remote-fs-pre.target network-online.target network.target
Before=umount.target remote-fs.target

[Automount]
Where={{ item.value.mount }}

[Install]
WantedBy=default.target

Key features:

  • Network dependencies: Waits for network availability before attempting mounts
  • Lazy mounting: Only mounts when the path is accessed
  • Proper ordering: Correctly sequences with system startup and shutdown

Deployment Process

Ansible Role Implementation

The goldentooth.setup_nfs_mounts role handles the complete deployment:

- name: 'Generate mount unit for {{ item.key }}.'
  ansible.builtin.template:
    src: 'mount.j2'
    dest: "/etc/systemd/system/{{ item.value.safe_name }}.mount"
    mode: '0644'
  loop: "{{ nfs.mounts | dict2items }}"
  notify: 'reload systemd'

- name: 'Generate automount unit for {{ item.key }}.'
  ansible.builtin.template:
    src: 'automount.j2'
    dest: "/etc/systemd/system/{{ item.value.safe_name }}.automount"
    mode: '0644'
  loop: "{{ nfs.mounts | dict2items }}"
  notify: 'reload systemd'

Service Management

The role ensures proper service lifecycle:

- name: 'Enable and start automount services.'
  ansible.builtin.systemd:
    name: "{{ item.value.safe_name }}.automount"
    enabled: true
    state: started
    daemon_reload: true
  loop: "{{ nfs.mounts | dict2items }}"

Network Integration

Client Targeting

The NFS mounts are deployed across the entire cluster:

Target Hosts: All cluster nodes (hosts: 'all')

  • 12 Raspberry Pi nodes: allyrion, bettley, cargyll, dalt, erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps
  • 1 x86 GPU node: velaryon

Including NFS Server: Even allyrion (the NFS server) mounts its own export, providing:

  • Consistent access patterns: Same path (/mnt/nfs) on all nodes
  • Testing capability: Server can verify export functionality
  • Simplified administration: Uniform management across cluster

Network Configuration

Infrastructure Network: All communication occurs within the trusted 10.4.0.0/20 CIDR NFS Protocol: Standard NFSv3/v4 with default options Firewall: No additional firewall rules needed within cluster network

Directory Structure and Permissions

Mount Point Creation

- name: 'Ensure mount directories exist.'
  ansible.builtin.file:
    path: "{{ item.value.mount }}"
    state: directory
    mode: '0755'
  loop: "{{ nfs.mounts | dict2items }}"

Shared Directory Usage

The NFS mount serves multiple cluster functions:

Slurm Integration:

slurm_nfs_base_path: "{{ nfs.mounts.primary.mount }}/slurm"

Common Patterns:

  • /mnt/nfs/slurm/ - HPC job shared storage
  • /mnt/nfs/shared/ - General cluster shared data
  • /mnt/nfs/config/ - Configuration file distribution

Command Line Integration

goldentooth CLI Commands

# Configure NFS mounts on all nodes
goldentooth setup_nfs_mounts

# Verify mount status
goldentooth command all 'systemctl status mnt-nfs.automount'
goldentooth command all 'df -h /mnt/nfs'

# Test shared storage
goldentooth command allyrion 'echo "test" > /mnt/nfs/test.txt'
goldentooth command bettley 'cat /mnt/nfs/test.txt'

Troubleshooting and Verification

Service Status Verification

# Check automount service status
systemctl status mnt-nfs.automount

# Check mount service status (after access)
systemctl status mnt-nfs.mount

# View mount information
mount | grep nfs
df -h /mnt/nfs

Common Issues and Solutions

Network Dependencies: The automount units properly wait for network availability through After=network-online.target

Permission Issues: The NFS export uses no_root_squash, allowing proper root access from clients

Mount Persistence: Automount units ensure mounts survive reboots and network interruptions

Security Considerations

Trust Model

Internal Network Security: Security relies on the trusted cluster network boundary No User Authentication: Uses IP-based access control rather than user credentials Root Access: no_root_squash on server allows administrative operations

Future Enhancements

The current implementation could be enhanced with:

  • Kerberos authentication for user-based security
  • Network policies for additional access control
  • Encryption in transit for sensitive data protection

Integration with Storage Evolution

Note: This NFS mounting system provides the foundation for shared storage. As documented in Chapter 050, the cluster later evolves to include ZFS-based storage with replication, while maintaining compatibility with these NFS mount patterns.

This in itself wasn't too complicated, but I created two template files (one for a .mount service, another for a .automount service), fought with the variables for a bit, and it seems to work. The result is robust, cluster-wide shared storage accessible at /mnt/nfs on every node.

Slurm

Okay, finally, geez.

So this is about Slurm, an open-source, highly scalable, and fault-tolerant cluster management and job-scheduling system.

Before we get started: I want to express tremendous gratitude to Hossein Ghorbanfekr, for this Medium article and this second Medium article, which helped me set up Slurm and the modules and illustrated how to work with the system and verify its functionality. I'm a Slurm newbie and his articles were invaluable.

First, we're going to set up MUNGE, which is an authentication service designed for scalability within HPC environments. This is just a matter of installing the munge package, synchronizing the MUNGE key across the cluster (which isn't as ergonomic as I'd like, but oh well), and restarting the service.

Slurm itself isn't too complex to install, but we want to switch off slurmctld for the compute nodes and on for the controller nodes.

The next part is the configuration, which, uh, I'm not going to run through here. There are a ton of options and I'm figuring it out directive by directive by reading the documentation. Suffice to say that it's detailed, I had to hack some things in, and everything appears to work but I can't verify that just yet.

The control nodes write state to the NFS volume, the idea being that if one of them fails there'll be a short nonresponsive period and then another will take over. It recommends not using NFS, and I think it wants something like Ceph or GlusterFS or something, but I'm not going to bother; this is just an educational cluster, and these distributed filesystems really introduce a lot of complexity that I don't want to deal with right now.

Ultimately, I end up with this:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
general*     up   infinite      9   idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
debug        up   infinite      9   idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
$ scontrol show nodes
NodeName=bettley Arch=aarch64 CoresPerSocket=4
   CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.84
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=10.4.0.11 NodeHostName=bettley Version=22.05.8
   OS=Linux 6.12.20+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.12.20-1+rpt1~bpo12+1 (2025-03-19)
   RealMemory=4096 AllocMem=0 FreeMem=1086 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=general,debug
   BootTime=2025-04-02T20:28:31 SlurmdStartTime=2025-04-04T12:43:13
   LastBusyTime=2025-04-04T12:43:21
   CfgTRES=cpu=1,mem=4G,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

... etc ...

The next step is to set up Lua and Lmod for managing environments. Lua of course is a scripting language, and the Lmod system allows users of a Slurm cluster to flexibly modify their environment, use different versions of libraries and tools, etc by loading and unloading modules.

Setting this up isn't terribly fun or interesting. Lmod is on sourceforge, Lua is in Apt, we install some things, build Lmod from source, create some symlinks to ensure that Lmod is available in users' shell environments, and when we shell in and type a command, we can list our modules.

$ module av

------------------------------------------------------------------ /mnt/nfs/slurm/apps/modulefiles -------------------------------------------------------------------
   StdEnv

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

After the StdEnv, we can set up OpenMPI. OpenMPI is an implementation of Message Passing Interface (MPI), used to coordinate communication between processes running across different nodes in a cluster. It's built for speed and flexibility in environments where you need to split computation across many CPUs or machines, and allows us to quickly and easily execute processes on multiple Slurm nodes.

OpenMPI is comparatively straightforward to set up, mostly just installing a few system packages for libraries and headers and creating a module file.

The next step is setting up Golang, which is unfortunately a bit more aggravating than it should be, involving "manual" work (in Ansible terms, so executing commands and operating via trial-and-error rather than using predefined modules) because the latest version of Go in the Apt repos appears to be 1.19 but the latest version is 1.24 and I apparently need 1.23 at least to build Singularity (see next section).

Singularity is a method for running containers without the full Docker daemon and its complications. It's written in Go, which is why we had to install 1.23.0 and couldn't rest on our laurels with 1.19.0 in the Apt repository (or, indeed, 1.21.0 as I originally thought).

Building Singularity requires additional packages, and it takes quite a while. But when done:

$ module av

------------------------------------------------------------------ /mnt/nfs/slurm/apps/modulefiles -------------------------------------------------------------------
   Golang/1.21.0    Golang/1.23.0 (D)    OpenMPI    Singularity/4.3.0    StdEnv

  Where:
   D:  Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

Then we can use it:

$ module load Singularity
$ singularity pull docker://arm64v8/hello-world
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
INFO:    Fetching OCI image...
INFO:    Extracting OCI image...
INFO:    Inserting Singularity configuration...
INFO:    Creating SIF file...
$ srun singularity run hello-world_latest.sif
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

We can also build a Singularity definition file with

$ cat > ~/torch.def << EOF
Bootstrap: docker
From: ubuntu:20.04

%post
    apt-get -y update
    apt-get -y install python3-pip
    pip3 install numpy torch

%environment
    export LC_ALL=C
EOF
$ singularity build --fakeroot torch.sif torch.def
INFO:    Starting build...
INFO:    Fetching OCI image...
24.8MiB / 24.8MiB [===============================================================================================================================] 100 % 2.8 MiB/s 0s
INFO:    Extracting OCI image...
INFO:    Inserting Singularity configuration...
....
INFO:    Adding environment to container
INFO:    Creating SIF file...
INFO:    Build complete: torch.sif

and finally run it interactively:

$ salloc --tasks=1 --cpus-per-task=2 --mem=1gb
$ srun singularity run torch.sif \
    python3 -c "import torch; print(torch.tensor(range(5)))"
tensor([0, 1, 2, 3, 4])
$ exit

We can also submit it as a batch:

$ cat > ~/submit_torch.sh << EOF
#!/usr/bin/sh -l

#SBATCH --job-name=torch
#SBATCH --mem=1gb
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00

module load Singularity

srun singularity run torch.sif \
    python3 -c "import torch; print(torch.tensor(range(5)))"
EOF
$ sbatch submit_torch.sh
Submitted batch job 398
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               398   general    torch   nathan  R       0:03      1 bettley
$ cat slurm-398.out
tensor([0, 1, 2, 3, 4])

The next part will be setting up Conda, which is similarly a bit more aggravating than it probably should.

Once that's done, though:

$ conda env list

# conda environments:
#
base                   /mnt/nfs/slurm/miniforge
default-env            /mnt/nfs/slurm/miniforge/envs/default-env
python3.10             /mnt/nfs/slurm/miniforge/user_envs/python3.10
python3.11             /mnt/nfs/slurm/miniforge/user_envs/python3.11
python3.12             /mnt/nfs/slurm/miniforge/user_envs/python3.12
python3.13             /mnt/nfs/slurm/miniforge/user_envs/python3.13

And we can easily activate an environment...

$ source activate python3.13
(python3.13) $

And we can schedule jobs to run across multiple nodes:

$ cat > ./submit_conda.sh << EOF
#!/usr/bin/env bash

#SBATCH --job-name=conda
#SBATCH --mem=1gb
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00

# Load Conda and activate Python 3.13 environment.
module load Conda
source activate python3.13

srun python --version

sleep 5
EOF
$ sbatch submit_conda.sh
Submitted batch job 403
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               403   general    conda   nathan  R       0:01      3 bettley,cargyll,dalt
$ cat slurm-403.out
Python 3.13.2
Python 3.13.2
Python 3.13.2
Python 3.13.2
Python 3.13.2
Python 3.13.2

Super cool.

Terraform

I would like my Goldentooth services to be reachable from anywhere, but 1) I'd like trouble-free TLS encryption to the domain, and 2) I'd like to hide my home IP address. I can set up LetsEncrypt, but I've had issues in the past with local storage of certificates on what is essentially an ephemeral setup that might get blown up at any point. I'd like to prevent spurious strain on their systems, outages due to me spamming, etc etc etc. I'd prefer it if I could use AWS ACM. If I can use ACM and CloudFront and just not cache anything, or cache intelligently on a per-domain basis, that would be nice. I don't know if I can get that working - I know AWS intends ACM for AWS services – but I'll try.

So the first step of this is to set up Terraform; to create an S3 bucket to hold the state and a lock to support state locking.

We can bootstrap this by just creating the S3 bucket, then creating a Terraform configuration that only contains that S3 bucket and imports the existing bucket (mostly so I don't forget what the bucket is for or what it is using). I apply that - yup, works.

The next thing I add is configuration for an OIDC provider for GitHub. Fortunately, there's a provider for this, so it's easy to set up. I apply that and it creates an IAM role. I assign it Administrator access temporarily.

I create a GitHub Actions workflow to set up Terraform, plan, and apply the configuration. That works when I push to main. Pretty sweet.

Dynamic DNS

As previously stated: I would like my Goldentooth services to be reachable from anywhere, but 1) I'd like trouble-free TLS encryption to the domain, and 2) I'd like to hide my home IP address. I can set up LetsEncrypt, but I've had issues in the past with local storage of certificates on what is essentially an ephemeral setup that might get blown up at any point. I'd like to prevent spurious strain on their systems, outages due to me spamming, etc etc etc. I'd prefer it if I could use AWS ACM. If I can use ACM and CloudFront and just not cache anything, or cache intelligently on a per-domain basis, that would be nice. I don't know if I can get that working - I know AWS intends ACM for AWS services – but I'll try.

The next step of this is to get my router to update Route53 with my home IP address whenever it changes. That's going to require a Lambda function, API Gateway, an SSM Parameter for the credentials, an IAM role, etc. That's all going to be deployed and managed via Terraform.

dynamic-dns graph

Load Balancer Revisited

As previously stated: I would like my Goldentooth services to be reachable from anywhere, but 1) I'd like trouble-free TLS encryption to the domain, and 2) I'd like to hide my home IP address. I can set up LetsEncrypt, but I've had issues in the past with local storage of certificates on what is essentially an ephemeral setup that might get blown up at any point. I'd like to prevent spurious strain on their systems, outages due to me spamming, etc etc etc. I'd prefer it if I could use AWS ACM. If I can use ACM and CloudFront and just not cache anything, or cache intelligently on a per-domain basis, that would be nice. I don't know if I can get that working - I know AWS intends ACM for AWS services – but I'll try.

Now, one thing I want to be able to do for this is to have a single origin for the CloudFront distribution, e.g. *.my-home.goldentooth.net, which will resolve to my home IP address. But I want to be able to route based on domain name. I already have <service>.goldentooth.net working with ExternalDNS and MetalLB. So I want my reverse proxy to map an incoming request for <service>.my-home.goldentooth.net to a backend <service>.goldentooth.net with as little extra work as possible. Performance is less of an issue here than the fact that it works, that it's easy to maintain and repair if it breaks three year from now, and that I can complete this and move on to something else.

These factors combined mean that I should not use HAProxy for this. HAProxy is incredibly powerful and very performant, but it is not incredibly flexible for this sort of ad-hoc YOLO kind of work. Nginx, however, is.

So, alongside HAProxy, which I'm using for Kubernetes high-availability, I'll open a port on my router and forward it to Nginx, which will reverse-proxy that based on the domain name to the appropriate local load balancer service.

The resulting configuration is pretty simple:

server {
  listen 8080;
  resolver 8.8.8.8 valid=10s;
  server_name ~^(?<subdomain>[^.]+)\.{{ cluster.cloudfront_origin_domain }}$;
  location / {
    set $target_host "$subdomain.{{ cluster.domain }}";
    proxy_pass http://$target_host;
    proxy_set_header Host $target_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_ssl_verify off;
  }
}

And it just works; requesting http://httpbin.my-home.goldentooth.net:7463/ returns the appropriate service.

CloudFront and ACM

The next step will be to set up a CloudFront distribution that uses this address format as an origin, with no caching, and an ACM certificate. Assuming I can do that. If I can't, I might need to figure something else out. I could also use CloudFlare, and indeed if anyone ever reads this they're probably screaming at me, "just use CloudFlare, you idiot," but I'm trying to restrict the number of services and complications that I need to keep operational simultaneously.

Plus, I use Safari (and Brave) rather than Chrome, and one of the only systems with which I seem to encounter persistent issues using Safari is... CloudFlare. It might not for my use case, but I figure I would need to set it up just to test it.

So, yes, I'm totally aware this is a nasty hack, but... I'm gonna try it.

Spelling this out a little, here's the explicit idea:

  • Make a request to service.home-proxy.goldentooth.net
  • That does DNS lookup, which points to a CloudFront distribution
  • TLS certificate loads for CloudFront
  • CloudFront makes request to my home internet, preserving the Host header
  • That request gets port-forwarded to Nginx
  • Nginx matches host header service.home-proxy.goldentooth.net and sets $subdomain to service
  • Nginx sets upstream server name to service.goldentooth.net
  • Nginx does DNS lookup for upstream server and finds 10.4.11.43
  • Nginx proxies request back to 10.4.11.43

And this appears to work:

$ curl https://httpbin.home-proxy.goldentooth.net/ip
{
  "origin": "66.61.26.32"
}

The latency is nonzero but not noticeable to me. It's still an ugly hack, and there are some security implications I'll need to deal with. I ended up adding basic auth on the Nginx listener which, while not fantastic, is probably as much as I really need.

Prometheus

Way back in Chapter 19, I set up an Prometheus Node Exporter "app" for Argo CD, but I never actually set up Prometheus itself.

That's really fairly odd for me, since I'm normally super twitchy about metrics, logging, and observability. I guess I put it off because I was dealing with some kind of existential questions; where would Prometheus live, how would it communicate, etc, but then ended up kinda running out of steam before I answered the questions.

So, better late than never, I'm going to work on setting up Prometheus in a nice, decentralized kind of way.

Implementation Architecture

Installation Method

I'm using the official prometheus.prometheus.prometheus Ansible role from the Prometheus community. The depth to Prometheus is, after all, configuring and using it, not merely in installing it.

The installation is managed through:

  • Playbook: setup_prometheus.yaml
  • Custom role: goldentooth.setup_prometheus (wraps the community role)
  • CLI command: goldentooth setup_prometheus

Deployment Location

Prometheus runs on allyrion (10.4.0.10), which consolidates multiple infrastructure services:

  • HAProxy load balancer
  • NFS server
  • Prometheus monitoring server

This placement provides several advantages:

  • Central location for cluster-wide monitoring
  • Proximity to load balancer for HAProxy metrics
  • Reduced resource usage on Kubernetes worker nodes

Service Configuration

Core Settings

The Prometheus service is configured with production-ready settings:

# Storage and retention
prometheus_storage_retention_time: "15d"
prometheus_storage_retention_size: "5GB"
prometheus_storage_tsdb_path: "/var/lib/prometheus"

# Network and performance
prometheus_web_listen_address: "0.0.0.0:9090"
prometheus_config_global_scrape_interval: "60s"
prometheus_config_global_evaluation_interval: "15s"
prometheus_config_global_scrape_timeout: "15s"

Security Hardening

The service implements comprehensive security measures:

  • Dedicated user: Runs as prometheus user/group
  • Systemd hardening: NoNewPrivileges, PrivateDevices, ProtectSystem=strict
  • Capability restrictions: Limited to CAP_SET_UID only
  • Resource limits: GOMAXPROCS=4 to prevent CPU exhaustion

External Labels

Cluster identification through external labels:

external_labels:
  environment: goldentooth
  cluster: goldentooth
  domain: goldentooth.net

Service Discovery Implementation

File-Based Service Discovery

Rather than relying on complex auto-discovery, I implement file-based service discovery for reliability and explicit control:

Target Generation (/etc/prometheus/file_sd/node.yaml):

{% for host in groups['all'] %}
- targets:
    - "{{ hostvars[host]['ipv4_address'] }}:9100"
  labels:
    instance: "{{ host }}"
    job: 'node'
{% endfor %}

This approach:

  • Auto-generates targets from Ansible inventory
  • Covers all 13 cluster nodes (12 Raspberry Pis + 1 x86 GPU node)
  • Provides consistent labeling with instance and job labels
  • Updates automatically when nodes are added/removed

Scrape Configurations

Core Infrastructure Monitoring

Prometheus Self-Monitoring:

- job_name: 'prometheus'
  static_configs:
    - targets: ['allyrion:9090']

HAProxy Load Balancer:

- job_name: 'haproxy'
  static_configs:
    - targets: ['allyrion:8405']

HAProxy includes a built-in Prometheus exporter accessible at /metrics on port 8405, providing load balancer performance and health metrics.

Nginx Reverse Proxy:

- job_name: 'nginx'
  static_configs:
    - targets: ['allyrion:9113']

Node Monitoring

File Service Discovery for all cluster nodes:

- job_name: "unknown"
  file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/*.yaml"
      - "/etc/prometheus/file_sd/*.json"

This targets all Node Exporter instances across the cluster, providing comprehensive infrastructure metrics.

Advanced Integrations

Loki Log Aggregation:

- job_name: 'loki'
  static_configs:
    - targets: ['inchfield:3100']
  scheme: 'https'
  tls_config:
    ca_file: /etc/ssl/certs/goldentooth.pem

This integration uses the Step-CA certificate authority for secure communication with the Loki log aggregation service.

Exporter Ecosystem

Node Exporter Deployment

Kubernetes Nodes (via Argo CD):

  • Helm Chart: prometheus-node-exporter v4.46.1
  • Namespace: prometheus-node-exporter
  • Extra Collectors: --collector.systemd, --collector.processes
  • Management: Automated GitOps deployment with auto-sync

Infrastructure Node (allyrion):

  • Installation: Via prometheus.prometheus.node_exporter role
  • Enabled Collectors: systemd for service monitoring
  • Integration: Direct scraping by local Prometheus instance

Application Exporters

I also configured several application-specific exporters:

HAProxy Built-in Exporter: Provides load balancer metrics including backend health, response times, and traffic distribution

Nginx Exporter: Monitors reverse proxy performance and request patterns

Network Access and Security

Nginx Reverse Proxy

To provide secure external access to Prometheus, I configured an Nginx reverse proxy:

server {
    listen 8081;
    location / {
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Application prometheus;
    }
}

This provides:

  • Network isolation (Prometheus only accessible locally)
  • Header injection for request identification
  • Potential for future authentication layer

Certificate Integration

The cluster uses Step-CA for comprehensive certificate management. Prometheus leverages this infrastructure for:

  • Secure scraping of TLS-enabled services (Loki)
  • Potential future TLS termination
  • Integration with the broader security model

Alerting Configuration

Basic Alert Rules

The installation includes foundational alerting rules in /etc/prometheus/rules/ansible_managed.yml:

Watchdog Alert: Always-firing alert to verify the alerting pipeline is functional

Instance Down Alert: Critical alert when up == 0 for 5 minutes, indicating node or service failure

Future Expansion

The alert rule framework is prepared for expansion with application-specific alerts, SLA monitoring, and capacity planning alerts.

Integration with Monitoring Stack

Grafana Integration

Prometheus serves as the primary data source for Grafana dashboards:

datasources:
  - name: prometheus
    type: prometheus  
    url: http://allyrion:9090
    access: proxy

This enables rich visualization of cluster metrics through pre-configured and custom dashboards.

Storage and Performance

TSDB Configuration:

  • Retention: 15 days (time) and 5GB (size) for appropriate data lifecycle
  • Storage: Local disk at /var/lib/prometheus
  • Compaction: Automatic TSDB compaction for optimal query performance

The scrape configuration was fairly straightforward, and the result is a comprehensive monitoring foundation covering all infrastructure components and preparing for future application-specific monitoring expansion.

Consul

I wanted to install a service discovery system to manage, well, all of the other services that exist only to manage other services on this cluster.

I have the idea of installing Authelia, then Envoy, then Consul in a chain as a replacement for Nginx. Obviously it's far more complicated than Nginx, but by now that's the point; to increase the complexity of this homelab until it collapses under its own weight. Alas poor Goldentooth. I knew him, Gentle Reader, a cluster of infinite GPIO!

First order of business is to set up the Consul servers – leader and followers – which will occupy Bettley, Cargyll, and Dalt.

For most of this, I just followed the deployment guide. Then I followed the guide for creating client agent tokens.

Unfortunately, I encountered some issues when it came to setting up ACLs. For some reason, my server nodes worked precisely as expected, but my nodes would not join the cluster.

Apr 12 13:44:56 fenn consul[328873]: ==> Starting Consul agent...
Apr 12 13:44:56 fenn consul[328873]:                Version: '1.20.5'
Apr 12 13:44:56 fenn consul[328873]:             Build Date: '2025-03-11 10:16:18 +0000 UTC'
Apr 12 13:44:56 fenn consul[328873]:                Node ID: 'a5c6a1f2-8811-9de7-917f-acc1cd9fc8b7'
Apr 12 13:44:56 fenn consul[328873]:              Node name: 'fenn'
Apr 12 13:44:56 fenn consul[328873]:             Datacenter: 'dc1' (Segment: '')
Apr 12 13:44:56 fenn consul[328873]:                 Server: false (Bootstrap: false)
Apr 12 13:44:56 fenn consul[328873]:            Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: -1, DNS: 8600)
Apr 12 13:44:56 fenn consul[328873]:           Cluster Addr: 10.4.0.15 (LAN: 8301, WAN: 8302)
Apr 12 13:44:56 fenn consul[328873]:      Gossip Encryption: true
Apr 12 13:44:56 fenn consul[328873]:       Auto-Encrypt-TLS: true
Apr 12 13:44:56 fenn consul[328873]:            ACL Enabled: true
Apr 12 13:44:56 fenn consul[328873]:     ACL Default Policy: deny
Apr 12 13:44:56 fenn consul[328873]:              HTTPS TLS: Verify Incoming: true, Verify Outgoing: true, Min Version: TLSv1_2
Apr 12 13:44:56 fenn consul[328873]:               gRPC TLS: Verify Incoming: true, Min Version: TLSv1_2
Apr 12 13:44:56 fenn consul[328873]:       Internal RPC TLS: Verify Incoming: true, Verify Outgoing: true (Verify Hostname: true), Min Version: TLSv1_2
Apr 12 13:44:56 fenn consul[328873]: ==> Log data will now stream in as it occurs:
Apr 12 13:44:56 fenn consul[328873]: 2025-04-12T13:44:55.999-0400 [WARN]  agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config f
ormat must be set
Apr 12 13:44:56 fenn consul[328873]: 2025-04-12T13:44:56.021-0400 [WARN]  agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json,
 or config format must be set
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.216-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.11:8300 error="rpcinsecure: err
or making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly
when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.225-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.12:8300 error="rpcinsecure: err
or making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify
a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.240-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.240-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.240-0400 [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.255-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.11:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.263-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.12:8300 error="rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.277-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.277-0400 [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.508-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.11:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.515-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.12:8300 error="rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.538-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.538-0400 [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request

It seemed that the token would not be persisted on the client node after running consul acl set-agent-token agent <acl-token-secret-id>, even though I have enable_token_persistence set to true. As a result, I needed to go back and set it in the consul.hcl configuration file.

The fiddliness of the ACL bootstrapping also led me to split that out into a separate Ansible role.

Vault

As long as I'm setting up Consul, I figure I might as well set up Vault too.

This wasn't that bad, compared to the experience I had with ACLs in Consul. I set up a KMS key for unsealing, generated a certificate authority and regenerated TLS assets for my three server nodes, and the Consul storage backend works seamlessly.

Vault Cluster Architecture

Deployment Configuration

The Vault cluster runs on three Raspberry Pi nodes: bettley, cargyll, and dalt. This provides high availability with automatic leader election and fault tolerance.

Key Design Decisions:

  • Storage Backend: Consul (not Raft) - leverages existing Consul cluster for data persistence
  • Auto-Unsealing: AWS KMS integration eliminates manual unsealing after restarts
  • TLS Everywhere: Full mutual TLS with Step-CA integration
  • Service Integration: Deep integration with Consul service discovery

AWS KMS Auto-Unsealing

Rather than managing unseal keys manually, I implemented AWS KMS auto-unsealing through Terraform:

KMS Key Configuration (terraform/modules/vault_seal/kms.tf):

resource "aws_kms_key" "vault_seal" {
  description             = "KMS key for managing the Goldentooth vault seal"
  key_usage               = "ENCRYPT_DECRYPT"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

resource "aws_kms_alias" "vault_seal" {
  name          = "alias/goldentooth/vault-seal"
  target_key_id = aws_kms_key.vault_seal.key_id
}

This provides:

  • Automatic unsealing on service restart
  • Key rotation managed by AWS
  • Audit trail through CloudTrail
  • No manual intervention required for cluster recovery

Vault Server Configuration

Core Settings

The main Vault configuration demonstrates production-ready patterns:

ui                                  = true
cluster_addr                        = "https://{{ ipv4_address }}:8201"
api_addr                            = "https://{{ ipv4_address }}:8200"
disable_mlock                       = true
cluster_name                        = "goldentooth"
enable_response_header_raft_node_id = true
log_level                           = "debug"

Key Features:

  • Web UI enabled for administrative access
  • Per-node cluster addressing using individual IP addresses
  • Memory lock disabled (appropriate for container/Pi environments)
  • Debug logging for troubleshooting and development

Storage Backend: Consul Integration

storage "consul" {
  address           = "{{ ipv4_address }}:8500"
  check_timeout     = "5s"
  consistency_mode  = "strong"
  path              = "vault/"
  token             = "{{ vault_consul_token.token.SecretID }}"
}

The Consul storage backend provides:

  • Strong consistency for data integrity
  • Leveraged infrastructure - reuses existing Consul cluster
  • ACL integration with dedicated Consul tokens
  • Service discovery through Consul's native mechanisms

TLS Configuration

listener "tcp" {
  address                             = "{{ ipv4_address }}:8200"
  tls_cert_file                       = "/opt/vault/tls/tls.crt"
  tls_key_file                        = "/opt/vault/tls/tls.key"
  tls_require_and_verify_client_cert  = true
  telemetry {
    unauthenticated_metrics_access = true
  }
}

Security Features:

  • Mutual TLS required for all client connections
  • Step-CA certificates with multiple Subject Alternative Names
  • Automatic certificate renewal via systemd timers
  • Telemetry access for monitoring without authentication

Certificate Management Integration

Step-CA Integration

Vault certificates are issued by the cluster's Step-CA with comprehensive SAN coverage:

Certificate Attributes:

  • vault.service.consul - Service discovery name
  • localhost - Local access
  • Node hostname (e.g., bettley.nodes.goldentooth.net)
  • Node IP address (e.g., 10.4.0.11)

Renewal Automation:

[Service]
Environment=CERT_LOCATION=/opt/vault/tls/tls.crt \
            KEY_LOCATION=/opt/vault/tls/tls.key

# Restart Vault service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active vault.service || systemctl try-reload-or-restart vault.service"

Certificate Lifecycle

  • Validity: 24 hours (short-lived certificates)
  • Renewal: Automatic via cert-renewer@vault.timer
  • Service Integration: Automatic Vault restart after renewal
  • CLI Management: goldentooth rotate_vault_certs

Consul Backend Configuration

Dedicated ACL Policy

Vault nodes use dedicated Consul ACL tokens with specific permissions:

key_prefix "vault/" {
  policy = "write"
}

service "vault" {
  policy = "write"
}

agent_prefix "" {
  policy = "read"
}

session_prefix "" {
  policy = "write"
}

This provides:

  • Minimal permissions for Vault's operational needs
  • Isolated key space under vault/ prefix
  • Service registration capabilities
  • Session management for locking mechanisms

Security and Service Configuration

Systemd Hardening

[Service]
User=vault
Group=vault
SecureBits=keep-caps
AmbientCapabilities=CAP_IPC_LOCK
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK
NoNewPrivileges=yes

Security Measures:

  • Dedicated user/group isolation
  • Capability restrictions - only IPC_LOCK and SYSLOG
  • Memory locking capability for sensitive data
  • No privilege escalation permitted

Environment Security

AWS credentials for KMS access are managed through environment files:

AWS_ACCESS_KEY_ID={{ vault.aws.access_key_id }}
AWS_SECRET_ACCESS_KEY={{ vault.aws.secret_access_key }}
AWS_REGION={{ vault.aws.region }}
  • File permissions: 0600 (owner read/write only)
  • Encrypted at rest in Ansible vault
  • Least privilege IAM policies for KMS access only

Monitoring and Observability

Prometheus Integration

telemetry {
  prometheus_retention_time = "24h"
  usage_gauge_period = "10m"
  maximum_gauge_cardinality = 500
  enable_hostname_label = true
  lease_metrics_epsilon = "1h"
  num_lease_metrics_buckets = 168
  add_lease_metrics_namespace_labels = false
  filter_default = true
  disable_hostname = true
}

Metrics Features:

  • 24-hour retention for operational metrics
  • 10-minute usage gauges for capacity planning
  • Hostname labeling for per-node identification
  • Lease metrics with weekly granularity (168 buckets)
  • Unauthenticated metrics access for Prometheus scraping

Command Line Integration

goldentooth CLI Commands

# Deploy and configure Vault cluster
goldentooth setup_vault

# Rotate TLS certificates
goldentooth rotate_vault_certs

# Edit encrypted secrets
goldentooth edit_vault

Environment Configuration

For Vault CLI operations:

export VAULT_ADDR=https://{{ ipv4_address }}:8200
export VAULT_CLIENT_CERT=/opt/vault/tls/tls.crt
export VAULT_CLIENT_KEY=/opt/vault/tls/tls.key

External Secrets Integration

Kubernetes Integration

The cluster includes External Secrets Operator (v0.9.13) for Kubernetes secrets management:

  • Namespace: external-secrets
  • Management: Argo CD GitOps deployment
  • Integration: Direct Vault API access for secret retrieval
  • Use Cases: Database credentials, API keys, TLS certificates

Directory Structure

/opt/vault/                 # Base directory
├── tls/                   # TLS certificates
│   ├── tls.crt           # Server certificate (Step-CA issued)
│   └── tls.key           # Private key
├── data/                 # Data directory (unused with Consul backend)
└── raft/                 # Raft storage (unused with Consul backend)

/etc/vault.d/              # Configuration directory
├── vault.hcl             # Main configuration
└── vault.env             # Environment variables (AWS credentials)

High Availability and Operations

Cluster Behavior

  • Leader Election: Automatic through Consul backend
  • Split-Brain Protection: Consul quorum requirements
  • Rolling Updates: One node at a time with certificate renewal
  • Disaster Recovery: AWS KMS auto-unsealing enables rapid recovery

Operational Patterns

Health Checks: Consul health checks monitor Vault API availability Service Discovery: vault.service.consul provides load balancing Monitoring: Prometheus metrics for capacity and performance monitoring Logging: systemd journal integration with structured logging

That said, I haven't actually put anything into it yet, so the real test will come when I start using it for secrets management across the cluster infrastructure. The External Secrets integration provides the foundation for Kubernetes secrets management, while the Consul integration enables broader service authentication.

Envoy

I would like to replace Nginx with an edge routing solution of Envoy + Consul. Consul is setup, so let's get cracking on Envoy.

Unfortunately, it doesn't work out of the box:

$ envoy --version
external/com_github_google_tcmalloc/tcmalloc/system-alloc.cc:625] MmapAligned() failed - unable to allocate with tag (hint, size, alignment) - is something limiting address placement? 0x17f840000000 1073741824 1073741824 @ 0x5560994c54 0x5560990f40 0x5560990830 0x5560971b6c 0x556098de00 0x556098dbd0 0x5560966e60 0x55608964ec 0x556089314c 0x556095e340 0x7fa24a77fc
external/com_github_google_tcmalloc/tcmalloc/arena.cc:58] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size); is something preventing mmap from succeeding (sandbox, VSS limitations)? 131072 632 @ 0x5560994fb8 0x5560971bfc 0x556098de00 0x556098dbd0 0x5560966e60 0x55608964ec 0x556089314c 0x556095e340 0x7fa24a77fc
Aborted

That's because of this issue.

I don't really have the horsepower on these Pis to compile Envoy, and I don't want to recompile the kernel, so for the time being I think I'll need to run a special build of Envoy in Docker. Unfortunately, I can't find a version that both 1) runs on Raspberry Pis, and 2) is compatible with a current version of Consul, so I think I'm kinda screwed for the moment.

Cross-Compilation Investigation

To solve the tcmalloc issue, I attempted to cross-compile Envoy v1.32.0 for ARM64 with --define tcmalloc=disabled on Velaryon (the x86 node). This would theoretically produce a Raspberry Pi-compatible binary without the memory alignment problems.

Setup Completed

  • ✅ Created cross-compilation toolkit with ARM64 toolchain (aarch64-linux-gnu-gcc)
  • ✅ Built containerized build environment with Bazel 6.5.0 (required by Envoy)
  • ✅ Verified ARM64 cross-compilation works for simple C programs
  • ✅ Confirmed Envoy source has ARM64 configurations (//bazel:linux_aarch64)
  • ✅ Found Envoy's CI system officially supports ARM64 builds

Fundamental Blocker

All cross-compilation attempts failed with the same error:

cc_toolchain_suite '@local_config_cc//:toolchain' does not contain a toolchain for cpu 'aarch64'

The root cause is a version compatibility gap:

  • Envoy v1.32.0 requires Bazel 6.5.0 for compatibility
  • Bazel 6.5.0 predates built-in ARM64 toolchain support
  • Envoy's CI likely uses custom Docker images with pre-configured ARM64 toolchains

Attempts Made

  1. Custom cross-compilation setup - Blocked by missing Bazel ARM64 toolchain
  2. Platform-based approach - Wrong platform type (config_setting vs platform)
  3. CPU-based configuration - Same toolchain issue
  4. Official Envoy CI approach - Same fundamental Bazel limitation

Verdict

Cross-compiling Envoy for ARM64 would require either:

  • Creating custom Bazel ARM64 toolchain definitions (complex, undocumented)
  • Finding Envoy's exact CI Docker environment (may not be public)
  • Upgrading to newer Bazel (likely breaks Envoy v1.32.0 compatibility)

The juice isn't worth the squeeze. For edge routing on Raspberry Pi, simpler alternatives exist:

  • nginx (lightweight, excellent ARM64 support)
  • HAProxy (proven load balancer, ARM64 packages available)
  • Traefik (modern proxy, native ARM64 builds)
  • Caddy (simple reverse proxy, ARM64 support)

Step-CA

Apparently, another thing I did recently was to set up Nomad, but I didn't take any notes about it.

That's not really that big of a deal, though, because what I need to do is to get Nomad and Consul and Vault working together, and currently they aren't.

This is complicated by the fact that if I do want AutoEncrypt working between Nomad and Consul, the two have to have a certificate chain proceeding from either 1) the same root certificate, or 2) different root certificates that have cross-signed. Currently, Vault has its own root certificate that I generate from scratch with the Ansible x509 tools, and then Nomad and Consul generate their own certificates using the built-in tools.

This seems messy, so it's probably time to dive into some kind of meaningful, long-term TLS infrastructure.

The choice seemed fairly clear: step-ca. Although I hadn't used it before, I'd flirted with it a time or two and it seemed to be fairly straightforward.

I poked around a bit in other people's implementations and pilfered them ruthlessly (I've bought Max Hösel a couple coffees and I'm crediting him, never fear). I don't really need the full range of his features (and they are wonderful, it's really a lovely collection), so I cribbed the basic flow.

Once that's done, we have a few new Ansible playbooks:

  • apt_smallstep: Configure the Smallstep Apt repository.
  • install_step_ca: Install step-ca and step-cli on the CA node (which I've set to be Jast, the tenth node).
  • install_step_cli: Performed on all nodes.
  • init_cluster_ca: Initialize the certificate authority on the CA node.
  • bootstrap_cluster_ca: Install the root certificate in the trust store on every node.
  • zap_cluster_ca: To clean up, just nuke every file in the step-ca data directory.

The playbooks mentioned above get us most of the way there, but we need to revisit some of the places we've generated certificates (Vault, Consul, and Nomad) and integrate them into this system.

Refactoring HashiApp certificate management.

As it turned out, doing that involved refactor a good amount of my Ansible IaC. One thing I've learned about code quality:

Can you make the change easily? If so, make the change. If not, fix the most obvious obstacle, then reevaluate.

In this case, the change in question was to use the step CLI tool to generate certificates signed by the step-ca root certificate authority for services like Nomad, Vault, and Consul.

I knew immediately this would not be an easy change to make, just because of how I had written my Ansible roles. I had adopted conventional patterns for these roles, even though I knew they were not for general use and I didn't really have much intention of distributing them. Conventional patterns included naming variables expecting them to be reused across modules, etc. So I would declare variables in a general fashion within the defaults/main.yaml and then override them within my inventory's group_vars and host_vars.

I now consider this to be a mistake. In reality, the modules weren't really designed cleanly; there were a lot of assumptions based on my own use cases that I baked into the modules, and that affected which modules I declared, etc. So yeah, I had an Ansible role to set up Slurm, but it was by no means general enough to actually help most people set up Slurm. It just gathered together a lot of tasks that I found appropriate that had to do with setting up Slurm.

Nevertheless, I persisted for a while. Mostly, I think, out of a belief that I should at least pay lip service to community style guidelines.

This task, getting Nomad and Consul and Vault working with TLS courtesy of step-ca, was my breaking point. There was just too much crap that needed to be renamed, just to maintain the internal consistency of an increasingly clumsy architecture intended to please people who didn't notice and almost surely wouldn't care if they had.

So, TL;DR: there was a great reduction in redundancy and I shifted to specifying variables in dictionaries rather than distinctly-named snake-cased variables that reminded me a little too much of Java naming conventions.

Configuring HashiApps to use Step-CA

Once refactoring was done, configuring the apps to use Step-CA was mostly straightforward. A single step command was needed to generate the certificates, then another Ansible block to adjust the permissions and ownership of the generated files. For our labors, we're eventually greeted with Consul, Vault, and Nomad running exactly as they had before, but secured by a coherent certificate chain that can span all Goldentooth services.

Ray

Finally, we're getting back to something that's associated directly with machine learning: Ray.

It would be normal to opt for KubeRay here, since I am actually running Kubernetes on Goldentooth, but I'm not normal 🤷‍♂️ Instead, I'll be going with the on-prem approach, which... has some implications.

First of these is that I need to install Conda on every node. This is fine and probably something I should've already done anyway, just as a normal matter of course. Except I kind of did as part of setting up Slurm. Which, yeah, probably means a refactor is in order.

So let's install and configure Conda, then setup a Ray cluster!

24 Hours Later...

TL;DR: The attempt on my life has left me scarred and deformed.

So, that ended up being a major pain in the ass. The conda-forge channel didn't have builds of Ray for aarch64, so I needed to configure the defaults channel. Once the correct packages were installed, I encountered mysterious issues where the Ray dashboard wouldn't start up, causing the entire service to crash. It turned out, after prolonged debugging, that the Ray dashboard was apparently segfaulting because of issues with a grpcio wheel – not sure if it was built improperly, or what.

After figuring that out, I managed to get the cluster up, but still encountered issues. Well, the cluster was running Ray 2.46.0, and my MBP was running 2.7.0, so... that checks out. Unfortunately, I was attempting to follow MadeWithML based on a recommendation, and there were no Pi builds available for 2.7.0.

So I updated the MadeWithML project to use 2.46.0, brute-force-ishly, and that worked - for a time, but then incompatibilities started popping up. So I guess MadeWithML and my cluster weren't meant to be together.

Nevertheless, I do have a somewhat functioning Ray cluster, so I'm going to call this a victory (the only one I can) and move on.

Grafana

This, the next "article" (on Loki), and the successive one (on Vector), are occurring mostly in parallel so that I can validate these services as I go.

I (minimally) set up Vector first, then Loki, then Grafana, just to verify I could pass info around in some coherent fashion and see it in Grafana. However, that's not really sufficient.

The fact is that I'm not really experienced with Grafana. I've used it to debug things, I've installed and managed it, I've created and updated dashboards, etc. But I don't have a deep understanding of it or its featureset.

At work, we use Datadog. I love Datadog. Datadog has incredible features and a wonderful user interface. Datadog makes more money than I do, and costs more than I can afford. Also, they won't hire me, but I'm not bitter. The fact is that they don't really have a hobbyist tier, or at least not one that makes a ten-node cluster affordable.

At work, I prioritize observability. I rely heavily on logs, metrics, and traces to do my job. In my work on Goldentooth, I've been neglecting that. I've been using journalctl to review logs and debug services, and that's a pretty poor experience. It's recently become very, very clear that I need to have a better system here, and that means learning how to use Grafana and how to configure it best for my needs.

So, yeah, Grafana.

Grafana

My initial installation was bog-standard, basic Grafana. Not a thing changed. It worked! Okay, let's make it better.

The first thing I did was to throw that SQLite DB on a tmpfs. I'm not really concerned enough about the volume or load to consider moving to something like PostgreSQL, but 1) it also doesn't matter if I keep logs/metrics past a reboot, and 2) it's probably good to avoid any writes to the SD card that I can.

Next thing was to create a new repository, grafana-dashboards, to manage dashboards. I want a bunch of these dudes and it's better to manage them in a separate repository than in Ansible itself. I checked it out via Git, added a script to sync the repo every so often, added that to cron.

Of course, then I needed a dashboard to test it out, so I grabbed a nice one to incorporate data from Prometheus Node Exporter here. (Thanks, Ricardo F!)

Then I had to connect Grafana to Prometheus Node Exporter, then I realized I was missing a couple of command-line arguments in my Prometheus Node Exporter Helm chart that were nice to have, so I added those to the Argo CD Application, re-synced the app, etc, and finally things started showing up.

Grafana Dashboard for Prometheus Node Exporter

Pretty cool, I think.

Grafana Implementation Details

tmpfs Database Configuration

The first optimization I implemented was mounting the Grafana data directory on tmpfs to avoid SD card writes:

- name: 'Manage the mount for the Grafana data directory.'
  ansible.posix.mount:
    path: '/var/lib/grafana'
    src: 'tmpfs'
    fstype: 'tmpfs'
    opts: 'size=100M,mode=0755'
    state: 'present'

This configuration:

  • Avoids SD card wear: Eliminates database writes to flash storage
  • Improves performance: RAM-based storage for faster access
  • Ephemeral data: Acceptable for a lab environment where persistence across reboots isn't critical
  • Size limit: 100MB allocation prevents memory exhaustion

TLS Configuration

I finished up by adding comprehensive TLS support to Grafana using Step-CA integration:

Server Configuration (in grafana.ini):

[server]
protocol = https
http_addr = {{ ipv4_address }}
http_port = 3000
cert_file = {{ grafana.cert_path }}
cert_key = {{ grafana.key_path }}

[grpc_server]
use_tls = true
cert_file = {{ grafana.cert_path }}
key_file = {{ grafana.key_path }}

Certificate Management:

  • Source: Step-CA issued certificates with 24-hour validity
  • Renewal: Automatic via cert-renewer@grafana.timer
  • Service Integration: Automatic Grafana restart after certificate renewal
  • Paths: /opt/grafana/tls/tls.crt and /opt/grafana/tls/tls.key

Dashboard Repository Management

Next thing was to create a new repository, grafana-dashboards, to manage dashboards externally:

Repository Integration:

- name: 'Check out the Grafana dashboards repository.'
  ansible.builtin.git:
    repo: "https://github.com/{{ cluster.github.organization }}/{{ grafana.provisioners.dashboards.repository_name }}.git"
    dest: '/var/lib/grafana/dashboards'
  become_user: 'grafana'

Dashboard Provisioning (provisioners.dashboards.yaml):

apiVersion: 1
providers:
  - name: "grafana-dashboards"
    orgId: 1
    type: file
    folder: ''
    disableDeletion: false
    updateIntervalSeconds: 15
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Automatic Dashboard Updates

I added a script to sync the repository periodically via cron:

Update Script (/usr/local/bin/grafana-update-dashboards.sh):

#!/usr/bin/env bash
dashboard_path="/var/lib/grafana/dashboards"
cd "${dashboard_path}"
git fetch --all
git reset --hard origin/master
git pull

Cron Integration: Updates every 15 minutes to keep dashboards current with the repository

Data Source Provisioning

The Prometheus integration is configured through automatic data source provisioning:

datasources:
  - name: 'Prometheus'
    type: 'prometheus'
    access: 'proxy'
    url: http://{{ groups['prometheus'] | first }}:9090
    jsonData:
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      cacheLevel: 'High'
      disableRecordingRules: false
      incrementalQueryOverlapWindow: 10m

This configuration:

  • Automatic discovery: Uses Ansible inventory to find Prometheus server
  • High performance: POST method and high cache level for better performance
  • Alert management: Enables Grafana to manage Prometheus alerts
  • Query optimization: 10-minute overlap window for incremental queries

Advanced Monitoring Integration

Loki Integration for State History:

[unified_alerting.state_history]
backend = "multiple"
primary = "loki"
loki_remote_url = "https://{{ groups['loki'] | first }}:3100"

This enables:

  • Alert state history: Stored in Loki for long-term retention
  • Multi-backend support: Primary storage in Loki with annotations fallback
  • HTTPS integration: Secure communication with Loki using Step-CA certificates

Security and Authentication

Password Management:

- name: 'Reset Grafana admin password.'
  ansible.builtin.command:
    cmd: grafana-cli admin reset-admin-password "{{ grafana.admin_password }}"

Security Headers: The configuration includes comprehensive security settings:

  • TLS enforcement: HTTPS-only communication
  • Cookie security: Secure cookie settings for HTTPS
  • Content security: XSS protection and content type options enabled

Service Integration

Certificate Renewal Automation:

[Service]
Environment=CERT_LOCATION=/opt/grafana/tls/tls.crt \
            KEY_LOCATION=/opt/grafana/tls/tls.key

# Restart Grafana service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active grafana.service || systemctl try-reload-or-restart grafana.service"

Systemd Integration:

  • Service runs as dedicated grafana user
  • Automatic startup and dependency management
  • Integration with cluster-wide certificate renewal system

Dashboard Ecosystem

Of course, then I needed a dashboard to test it out, so I grabbed a nice one to incorporate data from Prometheus Node Exporter here. (Thanks, Ricardo F!)

The dashboard management system provides:

  • Version control: All dashboards tracked in Git
  • Automatic updates: Regular synchronization from repository
  • Folder organization: File system structure maps to Grafana folders
  • Community integration: Easy incorporation of community dashboards

Monitoring Stack Integration

Then I had to connect Grafana to Prometheus Node Exporter, then I realized I was missing a couple of command-line arguments in my Prometheus Node Exporter Helm chart that were nice to have, so I added those to the Argo CD Application, re-synced the app, etc, and finally things started showing up.

Node Exporter Enhancement:

  • Additional collectors: --collector.systemd, --collector.processes
  • GitOps deployment: Changes managed through Argo CD
  • Automatic synchronization: Dashboard updates reflect new metrics immediately

This comprehensive Grafana setup provides a production-ready observability platform that integrates seamlessly with the broader goldentooth monitoring ecosystem, combining security, automation, and extensibility.

Loki

This, the previous "article" (on Grafana), and the next one (on Vector), are occurring mostly in parallel so that I can validate these services as I go.

Loki is... there's a whole lot going on there.

Log Retention Configuration

I enabled a retention policy so that my logs wouldn't grow without bound until the end of time. This coincided with me noticing that my /var/log/journal directories had gotten up to about 4GB, which led me to perform a similar change in the journald configuration.

Retention Policy Configuration:

limits_config:
  retention_period: 168h  # 7 days

compactor:
  working_directory: /tmp/retention
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 5
  delete_request_store: filesystem

I reduced the retention_delete_worker_count from 150 to 5 🙂 This optimization:

  • Reduces resource usage: Less CPU overhead on Raspberry Pi nodes
  • Maintains efficiency: 5 workers sufficient for 7-day retention window
  • Prevents overload: Avoids overwhelming the Pi's limited resources

Consul Integration for Ring Management

I also configured Loki to use Consul as its ring kvstore, which involved sketching out an ACL policy and generating a token, but nothing too weird. (Assuming that it works.)

Ring Configuration:

common:
  ring:
    kvstore:
      store: consul
      consul:
        acl_token: {{ loki_consul_token }}
        host: {{ ipv4_address }}:8500

Consul ACL Policy (loki.policy.hcl):

key_prefix "collectors/" {
  policy = "write"
}

key_prefix "loki/" {
  policy = "write"
}

This integration provides:

  • Service discovery: Automatic discovery of Loki components
  • Consistent hashing: Proper ring distribution for ingester scaling
  • High availability: Shared state management across cluster nodes
  • Security: ACL-based access control to Consul KV store

Comprehensive TLS Configuration

The next several hours involved cleanup after I rashly configured Loki to use TLS. I didn't know that I'd then need to configure Loki to communicate with itself via TLS, and that I would have to do so in several different places and that those places would have different syntax for declaring the same core ideas (CA cert, TLS cert, TLS key).

Server TLS Configuration

GRPC and HTTP Server:

server:
  grpc_listen_address: {{ ipv4_address }}
  grpc_listen_port: 9096
  grpc_tls_config: &http_tls_config
    cert_file: "{{ loki.cert_path }}"
    key_file: "{{ loki.key_path }}"
    client_ca_file: "{{ step_ca.root_cert_path }}"
    client_auth_type: "VerifyClientCertIfGiven"
  http_listen_address: {{ ipv4_address }}
  http_listen_port: 3100
  http_tls_config: *http_tls_config

TLS Features:

  • Mutual TLS: Client certificate verification when provided
  • Step-CA Integration: Uses cluster certificate authority
  • YAML Anchors: Reuses TLS config across components to reduce duplication

Component-Level TLS Configuration

Frontend Configuration:

frontend:
  grpc_client_config: &grpc_client_config
    tls_enabled: true
    tls_cert_path: "{{ loki.cert_path }}"
    tls_key_path: "{{ loki.key_path }}"
    tls_ca_path: "{{ step_ca.root_cert_path }}"
  tail_tls_config:
    tls_cert_path: "{{ loki.cert_path }}"
    tls_key_path: "{{ loki.key_path }}"
    tls_ca_path: "{{ step_ca.root_cert_path }}"

Pattern Ingester TLS:

pattern_ingester:
  metric_aggregation:
    loki_address: {{ ipv4_address }}:3100
    use_tls: true
    http_client_config:
      tls_config:
        ca_file: "{{ step_ca.root_cert_path }}"
        cert_file: "{{ loki.cert_path }}"
        key_file: "{{ loki.key_path }}"

Internal Component Communication

The configuration ensures TLS across all internal communications:

  • Ingester Client: grpc_client_config: *grpc_client_config
  • Frontend Worker: grpc_client_config: *grpc_client_config
  • Query Scheduler: grpc_client_config: *grpc_client_config
  • Ruler: Uses separate alertmanager client TLS config

And holy crap, the Loki site is absolutely awful for finding and understanding where some configuration is needed.

Advanced Configuration Features

Pattern Recognition and Analytics

Pattern Ingester:

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: {{ ipv4_address }}:3100
    use_tls: true

This enables:

  • Log pattern detection: Automatic recognition of log patterns
  • Metric generation: Convert log patterns to Prometheus metrics
  • Performance insights: Understanding log volume and patterns

Schema and Storage Configuration

TSDB Schema (v13):

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

Storage Paths:

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules

Query Performance Optimization

Caching Configuration:

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 20

Performance Features:

  • Embedded cache: 20MB query result cache for faster repeated queries
  • Protobuf encoding: Efficient data serialization for frontend communication
  • Concurrent streams: 1000 max concurrent GRPC streams

Certificate Management Integration

Automatic Certificate Renewal:

[Service]
Environment=CERT_LOCATION={{ loki.cert_path }} \
            KEY_LOCATION={{ loki.key_path }}

# Restart Loki service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active loki.service || systemctl try-reload-or-restart loki.service"

Certificate Lifecycle:

  • 24-hour validity: Short-lived certificates for enhanced security
  • Automatic renewal: cert-renewer@loki.timer handles renewal
  • Service restart: Seamless certificate updates with service reload
  • Step-CA integration: Consistent with cluster-wide PKI infrastructure

Monitoring and Alerting Integration

Ruler Configuration:

ruler:
  alertmanager_url: http://{{ ipv4_address }}:9093
  alertmanager_client:
    tls_cert_path: "{{ loki.cert_path }}"
    tls_key_path: "{{ loki.key_path }}"
    tls_ca_path: "{{ step_ca.root_cert_path }}"

Observability Features:

  • Structured logging: JSON format for better parsing
  • Debug logging: Detailed logging for troubleshooting
  • Request logging: Log requests at info level for monitoring
  • Grafana integration: Primary storage for alert state history

Deployment Architecture

Single-Node Deployment: Currently deployed on inchfield node Replication Factor: 1 (appropriate for single-node setup) Resource Optimization: Configured for Raspberry Pi resource constraints Integration Points:

  • Vector: Log shipping from all cluster nodes
  • Grafana: Log visualization and alerting
  • Prometheus: Metrics scraping from Loki endpoints

This comprehensive Loki configuration provides a production-ready log aggregation platform with enterprise-grade security, retention management, and integration capabilities, despite the complexity of getting all the TLS configurations properly aligned across the numerous internal components.

Vector

This and the two previous "articles" (on Grafana and on Vector) are occurring mostly in parallel so that I can validate these services as I go.

The main thing I wanted to do immediately with Vector was hook up more sources. A couple were turnkey (journald, kubernetes_logs, internal_logs) but most were just log files. These latter are not currently parsed according to any specific format, so I'll need to revisit this and extract as much information as possible from each.

It would also be good for me to inject some more fields into this that are set on a per-node level. I already have hostname, but I should probably inject IP address, etc, and anything else I can think of.

Other than that, it doesn't really seem like there's a lot to discuss here. Vector's cool, though. And in the future, I should remember that adding a whole bunch of log files into Vector from ten nodes, all at once, is not a great idea, as it will flood the Loki sink...

New Server!

Today I saw Beyond NanoGPT: Go From LLM Beginner to AI Researcher! on HackerNews, and while I'm less interested than most in LLMs specifically, I'm still interested.

The notes included the following:

The codebase will generally work with either a CPU or GPU, but most implementations basically require a GPU as they will be untenably slow otherwise. I recommend either a consumer laptop with GPU, paying for Colab/Runpod, or simply asking a compute provider or local university for a compute grant if those are out of budget (this works surprisingly well, people are very generous).

If this was expected to be slow on a standard CPU, it'd probably be unbearable (or not run at all) on a Pi, so this gave me pause 🤔

Fortunately, I had a solution. I have an extra PC that's a few years old but still relatively beefy (a Ryzen 9 3900X (12 cores) with 32GB RAM and an RTX 2070 Super). I built it as a VR PC and my kid and I haven't played VR in quite a while, so... it's just kinda sitting there. But it occurred to me that it was probably sufficiently powerful to run most of Beyond NanoGPT, and if it struggled with anything I might be able to upgrade to an RTX 4XXX or 5XXX.

Of course, this single machine by itself dominates the rest of Goldentooth, so I'll need to take some steps to minimize its usefulness.

Setup

I installed Ubuntu 24.04, which I felt was probably a decent parity for the Raspberry Pi OS on Goldentooth. Perhaps I should've installed Ubuntu on the Pis as well, but hindsight is 20/20 and I don't have enough complaints about RPOS to switch now. At some point, SD cards are going to start dropping like flies and I'll probably make the switch at that time.

The installation itself was over in a flash, quickly enough that I thought something might've failed. Admittedly, it's been a while since I've installed Ubuntu Server Minimal on a modern-ish PC.

After that, I just needed to lug the damned thing down to the basement, wire it in, and start running Ansible playbooks on it to set it up. A few minutes later:

New Server!

Hello, Velaryon!

Oh, and install Nvidia's kernel modules and other tools. None of that was particularly difficult, although it was a tad more irritating than it should've been.

Once I had the GPU showing up, and the relevant tools and libraries installed, I wanted to verify that I could actually run things on the GPU, so I checked out NVIDIA's cuda-samples and built 'em.

With that done:

🐠nathan@velaryon:~/cuda-samples/build/Samples/1_Utilities/deviceQueryDrv
$ ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2070 SUPER"
  CUDA Driver Version:                           12.9
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 7786 MBytes (8164081664 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1770 MHz (1.77 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 45 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

Not the sexiest thing I've ever seen, but it's a step in the right direction.

Kubernetes

Again, I only want this machine to run in very limited circumstances. I figure it'll make a nice box for cross-compiling, and for running GPU-heavy workloads when necessary, but otherwise I want it to stay in the background.

After I added it to the Kubernetes cluster:

New Node

I tainted it to prevent standard pods from being scheduled on it:

kubectl taint nodes velaryon gpu=true:NoSchedule

and labeled it so that pods requiring a GPU would be scheduled on it:

kubectl label nodes velaryon gpu=true arch=amd64

Now, any pod I wish to run on this node should have the following:

tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
nodeSelector:
  gpu: "true"

Nomad

A similar tweak was needed for the nomad.hcl config:

{% if clean_hostname in groups['nomad_client'] -%}
client {
  enabled     = true
  node_class  = "{{ nomad.client.node_class }}"
  meta {
    arch  = "{{ ansible_architecture }}"
    gpu   = "{{ 'true' if 'gpu' == nomad.client.node_class else 'false' }}"
  }
}
{% endif %}

I think this will work for a constraint:

constraint {
  attribute = "${node.class}"
  operator  = "="
  value     = "gpu"
}

But I haven't tried it yet.

After applying, we see the class show up:

🐠root@velaryon:~
$ nomad node status
ID        Node Pool  DC   Name       Class    Drain  Eligibility  Status
76ff3ff3  default    dc1  velaryon   gpu      false  eligible     ready
30ffab50  default    dc1  inchfield  default  false  eligible     ready
db6ae26b  default    dc1  gardener   default  false  eligible     ready
02174920  default    dc1  jast       default  false  eligible     ready
9ffa31f3  default    dc1  fenn       default  false  eligible     ready
01f0cd94  default    dc1  harlton    default  false  eligible     ready
793b9a5a  default    dc1  erenford   default  false  eligible     ready

Other than that, it should get the standard complement of features - Vector, Consul, etc. I initially set up Slurm, then undid it; I felt it would just complicate matters.

New Rack!

I poked around a bit and realized that I had two extra Raspberry Pi 4B+'s, so I ended up spending an absolutely absurd amount of money to build a 10" rack and get all of the existing and new Pis into it, along with some fans, 5V and 12V power supplies, a 16-port switch, etc. It was absolutely ridiculous and I would not recommend this course of action to anyone, and I'll never financially recover from this.

The main goal of this was to take my existing Picocluster (which was screwed together and looked nice and... well, was already paid for) and have something where I could pull out an individual Pi and replace or repair it if I needed. Another issue was that I didn't really have any substantial external storage, e.g. SSDs.

New Rack

I've been playing with some other things recently, and have delayed updating this too much. I was intending my current focus to be the next article in this clog, but I think it's going to take quite a lot longer (and will likely be the subject of a great many articles), so I think in the meantime I need to return to the subject of the actual cluster and progress it along.

TLS Certificate Renewal

So some time back I configured step-ca to generate TLS certificates for various services, but I gave the certs very short lifetimes and didn't set up renewal, so... whenever I step away from the cluster for a few days, everything breaks 🙃

Today's goal is to fix that.

$ consul members
Error retrieving members: Get "http://127.0.0.1:8500/v1/agent/members?segment=_all": dial tcp 127.0.0.1:8500: connect: connection refused

Indeed, very little is working.

Fortunately, step-ca provides good instructions for dealing with this sort of situation. I created a cert-renewer@service file:

[Unit]
Description=Certificate renewer for %I
After=network-online.target
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
StartLimitIntervalSec=0
; PartOf=cert-renewer.target

[Service]
Type=oneshot
User=root

Environment=STEPPATH=/etc/step-ca \
            CERT_LOCATION=/etc/step/certs/%i.crt \
            KEY_LOCATION=/etc/step/certs/%i.key

; ExecCondition checks if the certificate is ready for renewal,
; based on the exit status of the command.
; (In systemd <242, you can use ExecStartPre= here.)
ExecCondition=/usr/bin/step certificate needs-renewal ${CERT_LOCATION}

; ExecStart renews the certificate, if ExecStartPre was successful.
ExecStart=/usr/bin/step ca renew --force ${CERT_LOCATION} ${KEY_LOCATION}

; Try to reload or restart the systemd service that relies on this cert-renewer
; If the relying service doesn't exist, forge ahead.
; (In systemd <229, use `reload-or-try-restart` instead of `try-reload-or-restart`)
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active %i.service || systemctl try-reload-or-restart %i"

[Install]
WantedBy=multi-user.target

and cert-renewer@.timer:

[Unit]
Description=Timer for certificate renewal of %I
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
; PartOf=cert-renewer.target

[Timer]
Persistent=true

; Run the timer unit every 5 minutes.
OnCalendar=*:1/5

; Always run the timer on time.
AccuracySec=1us

; Add jitter to prevent a "thundering hurd" of simultaneous certificate renewals.
RandomizedDelaySec=1m

[Install]
WantedBy=timers.target

and the necessary Ansible to throw it into place, and synced that over.

Then I created an overrides file for Consul:

[Service]
; `Environment=` overrides are applied per environment variable. This line does not
; affect any other variables set in the service template.
Environment=CERT_LOCATION={{ consul.cert_path }} \
            KEY_LOCATION={{ consul.key_path }}
WorkingDirectory={{ consul.key_path | dirname }}

; Restart Consul service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active consul.service || systemctl try-reload-or-restart consul.service"

Unfortunately, I couldn't build the update the Consul configuration because the TLS certs had expired:

TASK [goldentooth.setup_consul : Create a Consul agent policy for each node.] ****************************************************
Wednesday 16 July 2025  18:43:18 -0400 (0:00:57.623)       0:01:24.371 ********
skipping: [bettley]
skipping: [cargyll]
skipping: [dalt]
FAILED - RETRYING: [allyrion -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [harlton -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [erenford -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [inchfield -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [velaryon -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [jast -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [karstark -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [fenn -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [gardener -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [lipps -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [allyrion -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [harlton -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [erenford -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [jast -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [inchfield -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [velaryon -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [fenn -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [gardener -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [karstark -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [lipps -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [allyrion -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [harlton -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [erenford -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [fenn -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [jast -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [inchfield -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [velaryon -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [gardener -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [lipps -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [karstark -> bettley]: Create a Consul agent policy for each node. (1 retries left).
fatal: [allyrion -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [harlton -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [erenford -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [fenn -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [jast -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [inchfield -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [velaryon -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [gardener -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [karstark -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [lipps -> bettley]: FAILED! => changed=false
  attempts: 3
  msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>

And it was then that I noticed that the dates on all of the Raspberry Pis were off by about 8 days 😑. I'd never set up NTP. A quick Ansible playbook later, every Pi agrees on the same date and time, but now:

● consul.service - "HashiCorp Consul"
     Loaded: loaded (/etc/systemd/system/consul.service; enabled; preset: enabled)
     Active: active (running) since Wed 2025-07-16 18:51:09 EDT; 13s ago
       Docs: https://www.consul.io/
   Main PID: 733215 (consul)
      Tasks: 9 (limit: 8737)
     Memory: 19.4M
        CPU: 551ms
     CGroup: /system.slice/consul.service
             └─733215 /usr/bin/consul agent -config-dir=/etc/consul.d

Jul 16 18:51:09 bettley consul[733215]:               gRPC TLS: Verify Incoming: true, Min Version: TLSv1_2
Jul 16 18:51:09 bettley consul[733215]:       Internal RPC TLS: Verify Incoming: true, Verify Outgoing: true (Verify Hostname: true), Min Version: TLSv1_2
Jul 16 18:51:09 bettley consul[733215]: ==> Log data will now stream in as it occurs:
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.903-0400 [WARN]  agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.903-0400 [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.963-0400 [WARN]  agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.963-0400 [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.966-0400 [WARN]  agent:  keyring doesn't include key provided with -encrypt, using keyring: keyring=WAN
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.967-0400 [ERROR] agent: startup error: error="refusing to rejoin cluster because server has been offline for more than the configured server_rejoin_age_max (168h0m0s) - consider wiping your data dir"
Jul 16 18:51:19 bettley consul[733215]: 2025-07-16T18:51:19.968-0400 [ERROR] agent: startup error: error="refusing to rejoin cluster because server has been offline for more than the configured server_rejoin_age_max (168h0m0s) - consider wiping your data dir"

It won't rebuild the cluster because it's been offline too long 🙃 So I had to zap a file on the nodes:

$ goldentooth command bettley,cargyll,dalt 'sudo rm -rf /opt/consul/server_metadata.json*'
dalt | CHANGED | rc=0 >>

bettley | CHANGED | rc=0 >>

cargyll | CHANGED | rc=0 >>

and then I was able to restart the cluster.

As it turned out, I had to rotate the Consul certificates anyway, since they were invalid, but I think it's working now. I've shortened the cert lifetime to 24 hours, so I should find out pretty quickly 🙂

After that, it's the same procedure (rotate the certs, then re-setup the app and install the cert renewal timer) for Grafana, Loki, Nomad, Vault, and Vector.

SSH Certificates

So remember back in chapter 32 when I set up Step-CA as our internal certificate authority? Step-CA also handle SSH certificates, which allows a less peer-to-peer model for authenticating between nodes. I'd actually tried to set these up before and it was an enormous pain in the pass and didn't really work well, so when I saw Step-CA included it in its featureset, I was excited.

It's very easy to allow authorized_keys to grow without bound, and I'm fairly sure very few people actually read these messages:

The authenticity of host 'wtf.node.goldentooth.net (192.168.10.51)' can't be established.
ED25519 key fingerprint is SHA256:8xKJ5Fw6K+YFGxqR5EWsM4w3t5Y7MzO1p3G9kPvXHDo.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

So I wanted something that would allow seamless interconnection between the nodes while maintaining good security.

SSH certificates solve both of these problems elegantly. Instead of managing individual keys, you have a certificate authority that signs certificates. For user authentication, the SSH server trusts the CA's public key. For host authentication, your SSH client trusts the CA's public key.

It's basically the same model as TLS certificates, but for SSH. And since we already have Step-CA running, why not use it?

The Implementation

I created an Ansible role called goldentooth.setup_ssh_certificates to handle all of this. Let me walk through what it does.

Setting Up the CA Trust

First, we need to grab the SSH CA public keys from our Step-CA server. There are actually two different keys - one for signing user certificates and one for signing host certificates:

- name: 'Get SSH User CA public key'
  ansible.builtin.slurp:
    src: "{{ step_ca.ca.etc_path }}/certs/ssh_user_ca_key.pub"
  register: 'ssh_user_ca_key_b64'
  delegate_to: "{{ step_ca.server }}"
  run_once: true
  become: true

- name: 'Get SSH Host CA public key'
  ansible.builtin.slurp:
    src: "{{ step_ca.ca.etc_path }}/certs/ssh_host_ca_key.pub"
  register: 'ssh_host_ca_key_b64'
  delegate_to: "{{ step_ca.server }}"
  run_once: true
  become: true

Then we configure sshd to trust certificates signed by our User CA:

- name: 'Configure sshd to trust User CA'
  ansible.builtin.lineinfile:
    path: '/etc/ssh/sshd_config'
    regexp: '^#?TrustedUserCAKeys'
    line: 'TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub'
    state: 'present'
    validate: '/usr/sbin/sshd -t -f %s'
  notify: 'reload sshd'

Host Certificates

For host certificates, we generate a certificate for each node that includes multiple principals (names the certificate is valid for):

- name: 'Generate SSH host certificate'
  ansible.builtin.shell:
    cmd: |
      step ssh certificate \
        --host \
        --sign \
        --force \
        --no-password \
        --insecure \
        --provisioner="{{ step_ca.default_provisioner.name }}" \
        --provisioner-password-file="{{ step_ca.default_provisioner.password_path }}" \
        --principal="{{ ansible_hostname }}" \
        --principal="{{ ansible_hostname }}.{{ cluster.node_domain }}" \
        --principal="{{ ansible_hostname }}.{{ cluster.domain }}" \
        --principal="{{ ansible_default_ipv4.address }}" \
        --ca-url="https://{{ hostvars[step_ca.server].ipv4_address }}:9443" \
        --root="{{ step_ca.root_cert_path }}" \
        --not-after=24h \
        {{ ansible_hostname }} \
        /etc/step/certs/ssh_host.key.pub

Automatic Certificate Renewal

Notice the --not-after=24h? Yeah, these certificates expire daily. Which means it's very important that the automatic renewal works 😀

Enter systemd timers:

[Unit]
Description=Timer for SSH host certificate renewal
Documentation=https://smallstep.com/docs/step-cli/reference/ssh/certificate

[Timer]
OnBootSec=5min
OnUnitActiveSec=15min
RandomizedDelaySec=5min

[Install]
WantedBy=timers.target

This runs every 15 minutes (with some randomization to avoid thundering herd problems). The service itself checks if the certificate needs renewal before actually doing anything:

# Check if certificate needs renewal
ExecCondition=/usr/bin/step certificate needs-renewal /etc/step/certs/ssh_host.key-cert.pub

User Certificates

For user certificates, I set up both root and my regular user account. The process is similar - generate a certificate with appropriate principals:

- name: 'Generate root user SSH certificate'
  ansible.builtin.shell:
    cmd: |
      step ssh certificate \
        --sign \
        --force \
        --no-password \
        --insecure \
        --provisioner="{{ step_ca.default_provisioner.name }}" \
        --provisioner-password-file="{{ step_ca.default_provisioner.password_path }}" \
        --principal="root" \
        --principal="{{ ansible_hostname }}-root" \
        --ca-url="https://{{ hostvars[step_ca.server].ipv4_address }}:9443" \
        --root="{{ step_ca.root_cert_path }}" \
        --not-after=24h \
        root@{{ ansible_hostname }} \
        /etc/step/certs/root_ssh_key.pub

Then configure SSH to actually use the certificate:

- name: 'Configure root SSH to use certificate'
  ansible.builtin.blockinfile:
    path: '/root/.ssh/config'
    create: true
    owner: 'root'
    group: 'root'
    mode: '0600'
    block: |
      Host *
          CertificateFile /etc/step/certs/root_ssh_key-cert.pub
          IdentityFile /etc/step/certs/root_ssh_key
    marker: '# {mark} ANSIBLE MANAGED BLOCK - SSH CERTIFICATE'

The Trust Configuration

For the client side, we need to tell SSH to trust host certificates signed by our CA:

- name: 'Configure SSH client to trust Host CA'
  ansible.builtin.lineinfile:
    path: '/etc/ssh/ssh_known_hosts'
    line: "@cert-authority * {{ ssh_host_ca_key }}"
    create: true
    owner: 'root'
    group: 'root'
    mode: '0644'

And since we're all friends here in the cluster, I disabled strict host key checking for cluster nodes:

- name: 'Disable StrictHostKeyChecking for cluster nodes'
  ansible.builtin.blockinfile:
    path: '/etc/ssh/ssh_config'
    block: |
      Host *.{{ cluster.node_domain }} *.{{ cluster.domain }}
          StrictHostKeyChecking no
          UserKnownHostsFile /dev/null
    marker: '# {mark} ANSIBLE MANAGED BLOCK - CLUSTER SSH CONFIG'

Is this less secure? Technically yes. Do I care? Not really. These are all nodes in my internal cluster that I control. The certificates provide the actual authentication.

The Results

After running the playbook, I can now SSH between any nodes in the cluster without passwords or key management:

root@bramble-ca:~# ssh bramble-01
Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-1017-raspi aarch64)
...
Last login: Sat Jul 19 00:15:23 2025 from 192.168.10.50
root@bramble-01:~#

No host key verification prompts. No password prompts. Just instant access.

And the best part? I can verify that certificates are being used:

root@bramble-01:~# ssh-keygen -L -f /etc/step/certs/ssh_host.key-cert.pub
/etc/step/certs/ssh_host.key-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com host certificate
        Public key: ED25519-CERT SHA256:M5PQn6zVH7xJL+OFQzH4yVwR5EHrF2xQPm9QR5xKXBc
        Signing CA: ED25519 SHA256:gNPpOqPsZW6YZDmhWQWqJ4l+L8E5Xgg8FQyAAbPi7Ss (using ssh-ed25519)
        Key ID: "bramble-01"
        Serial: 8485811653946933657
        Valid: from 2025-07-18T20:13:42 to 2025-07-19T20:14:42
        Principals:
                bramble-01
                bramble-01.node.goldentooth.net
                bramble-01.goldentooth.net
                192.168.10.51
        Critical Options: (none)
        Extensions: (none)

Look at that! The certificate is valid for exactly 24 hours and includes all the names I might use to connect to this host.

ZFS and Replication

So remember back in chapters 28 and 31 when I set up NFS exports using a USB thumbdrive? Obviously my crowning achievement as an infrastructure engineer.

After living with that setup for a bit, I finally got my hands on some SSDs. Not new ones, mind you – these are various drives I've accumulated over the years. Eight of them, to be precise:

  • 3x 120GB SSDs
  • 3x ~450GB SSDs
  • 2x 1TB SSDs

Time to do something more serious with storage.

The Storage Strategy

I spent way too much time researching distributed storage options. GlusterFS? Apparently dead. Lustre? Way overkill for a Pi cluster, and the complexity-to-benefit ratio is terrible. BeeGFS? Same story.

So I decided to split the drives across three different storage systems:

  • ZFS for the 3x 120GB drives – rock solid, great snapshot support, and I already know it
  • Ceph for the 3x 450GB drives – the gold standard for distributed block storage in Kubernetes
  • SeaweedFS for the 2x 1TB drives – interesting distributed object storage that's simpler than MinIO

Today we're tackling ZFS, because I actually have experience with it and it seemed like the easiest place to start.

The ZFS Setup

I created a role called goldentooth.setup_zfs to handle all of this. The basic idea is to set up ZFS on nodes that have SSDs attached, create datasets for shared storage, and then use Sanoid for snapshot management and Syncoid for replication between nodes.

First, let's install ZFS and configure it for the Pi's limited RAM:

- name: 'Install ZFS.'
  ansible.builtin.apt:
    name:
      - 'zfsutils-linux'
      - 'zfs-dkms'
      - 'zfs-zed'
      - 'sanoid'
    state: 'present'
    update_cache: true

- name: 'Configure ZFS Event Daemon.'
  ansible.builtin.lineinfile:
    path: '/etc/zfs/zed.d/zed.rc'
    regexp: '^#?ZED_EMAIL_ADDR='
    line: 'ZED_EMAIL_ADDR="{{ my.email }}"'
  notify: 'Restart ZFS-zed service.'

- name: 'Limit ZFS ARC to 128MB of RAM.'
  ansible.builtin.lineinfile:
    path: '/etc/modprobe.d/zfs.conf'
    line: 'options zfs zfs_arc_max=1073741824'
    create: true
  notify: 'Update initramfs.'

That ARC limit is important – by default ZFS will happily eat half your RAM for caching, which is not great when you only have 8GB to start with.

Creating the Pool

The pool creation is straightforward. I'm not doing anything fancy like RAID-Z because I only have one SSD per node:

- name: 'Create ZFS pool.'
  ansible.builtin.command: |
    zpool create {{ zfs.pool.name }} {{ zfs.pool.device }}
  args:
    creates: "/{{ zfs.pool.name }}"
  when: ansible_hostname == 'allyrion'

Wait, why when: ansible_hostname == 'allyrion'? Well, it turns out I'm only creating the pool on the primary node. The other nodes will receive the data via replication. This is a bit different from a typical ZFS setup where each node would have its own pool, but it makes sense for my use case.

Sanoid for Snapshots

Sanoid is a fantastic tool for managing ZFS snapshots. It handles creating snapshots on a schedule and pruning old ones according to a retention policy. The configuration is pretty simple:

# Primary dataset for source snapshots
[{{ zfs.pool.name }}/{{ zfs.datasets[0].name }}]
	use_template = production
	recursive = yes
	autosnap = yes
	autoprune = yes

[template_production]
	frequently = 0
	hourly = 36
	daily = 30
	monthly = 3
	yearly = 0
	autosnap = yes
	autoprune = yes

This keeps 36 hourly snapshots, 30 daily snapshots, and 3 monthly snapshots. No yearly snapshots because, let's be honest, this cluster probably won't last that long without me completely rebuilding it.

Syncoid for Replication

Here's where it gets interesting. Syncoid is Sanoid's companion tool that handles ZFS replication. It's basically a smart wrapper around zfs send and zfs receive that handles all the complexity of incremental replication.

I set up systemd services and timers to handle the replication:

[Unit]
Description=Syncoid ZFS replication to %i
After=zfs-import.target
Requires=zfs-import.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid --no-privilege-elevation {{ zfs.pool.name }}/{{ zfs.datasets[0].name }} root@%i:{{ zfs.pool.name }}/{{ zfs.datasets[0].name }}
StandardOutput=journal
StandardError=journal

The %i is systemd template magic – it gets replaced with whatever comes after the @ in the service name. So syncoid@bramble-01.service would replicate to bramble-01.

The timer runs every 15 minutes:

[Unit]
Description=Syncoid ZFS replication to %i timer
Requires=syncoid@%i.service

[Timer]
OnCalendar=*:0/15
RandomizedDelaySec=60
Persistent=true

SSH Configuration for Replication

Of course, Syncoid needs to SSH between nodes to do the replication. Initially, I tried to set this up with a separate SSH key for ZFS replication. That turned into such a mess that it actually motivated me to finally implement SSH certificates properly (see the previous chapter).

After setting up SSH certificates, I could simplify the configuration to just reference the certificates:

- name: 'Configure SSH config for ZFS replication using certificates.'
  ansible.builtin.blockinfile:
    path: '/root/.ssh/config'
    create: true
    mode: '0600'
    block: |
      # ZFS replication configuration using SSH certificates
      {% for host in groups['zfs'] %}
      {% if host != inventory_hostname %}
      Host {{ host }}
        HostName {{ hostvars[host]['ipv4_address'] }}
        User root
        CertificateFile /etc/step/certs/root_ssh_key-cert.pub
        IdentityFile /etc/step/certs/root_ssh_key
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null
      {% endif %}
      {% endfor %}

Much cleaner! No more key management, just point to the certificates that are already being automatically renewed. Sometimes a little pain is exactly what you need to motivate doing things the right way.

The Topology

The way I set this up, only the first node in the zfs group (allyrion) actually creates datasets and takes snapshots. The other nodes just receive replicated data:

- name: 'Enable and start Syncoid timers for replication targets.'
  ansible.builtin.systemd:
    name: "syncoid@{{ item }}.timer"
    enabled: true
    state: 'started'
  loop: "{{ groups['zfs'] | reject('eq', inventory_hostname) | list }}"
  when:
    - groups['zfs'] | length > 1
    - inventory_hostname == groups['zfs'][0]  # Only run on first ZFS node (allyrion)

This creates a hub-and-spoke topology where allyrion is the primary and replicates to all other ZFS nodes. It's not the most resilient topology (if allyrion dies, no new snapshots), but it's simple and works for my needs.

Does It Work?

Let's check using the goldentooth CLI:

$ goldentooth command allyrion 'zfs list'
allyrion | CHANGED | rc=0 >>
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool        546K   108G    24K  /rpool
rpool/data    53K   108G    25K  /data

Nice! The pool is there. Now let's look at snapshots:

$ goldentooth command allyrion 'zfs list -t snapshot'
allyrion | CHANGED | rc=0 >>
NAME                                                        USED  AVAIL  REFER  MOUNTPOINT
rpool/data@autosnap_2025-07-18_18:13:17_monthly               0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_daily                 0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_hourly                0B      -    24K  -
rpool/data@autosnap_2025-07-18_19:00:03_hourly                0B      -    24K  -
rpool/data@autosnap_2025-07-18_20:00:10_hourly                0B      -    24K  -
...
rpool/data@autosnap_2025-07-19_14:00:15_hourly                0B      -    24K  -
rpool/data@syncoid_allyrion_2025-07-19:10:45:32-GMT-04:00     0B      -    25K  -

Excellent! Sanoid is creating snapshots hourly, daily, and monthly. That last snapshot with the "syncoid" prefix shows that replication is happening too.

And on the replica nodes? Let me check which nodes have ZFS:

$ goldentooth command gardener 'zfs list'
gardener | CHANGED | rc=0 >>
NAME         USED  AVAIL  REFER  MOUNTPOINT
rpool        600K   108G    25K  /rpool
rpool/data    53K   108G    25K  /rpool/data

The replica has the same dataset structure. And the snapshots?

$ goldentooth command gardener 'zfs list -t snapshot | head -5'
gardener | CHANGED | rc=0 >>
NAME                                                        USED  AVAIL  REFER  MOUNTPOINT
rpool/data@autosnap_2025-07-18_18:13:17_monthly               0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_daily                 0B      -    24K  -
rpool/data@autosnap_2025-07-18_18:13:17_hourly                0B      -    24K  -
rpool/data@autosnap_2025-07-18_19:00:03_hourly                0B      -    24K  -

Perfect! The snapshots are being replicated from allyrion to gardener. The replication is working.

Performance

How's the performance? Well... it's ZFS on a single SSD connected to a Raspberry Pi. It's not going to win any benchmarks:

$ goldentooth command_root allyrion 'dd if=/dev/zero of=/data/test bs=1M count=100'
allyrion | CHANGED | rc=0 >>
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.205277 s, 511 MB/s

511 MB/s writes! That's... actually surprisingly good for a Pi with a SATA SSD over USB3. Clearly the ZFS caching is helping here, but even so, that's plenty fast for shared configuration files, build artifacts, and other cluster data.

Expanding the Kubernetes Cluster

With the Goldentooth cluster continuing to evolve, it was time to bring two more nodes into the Kubernetes fold... Karstark and Lipps, two Raspberry Pi 4Bs (4GB) that were just kinda sitting around.

The Current State

Before the expansion, our Kubernetes cluster consisted of:

  • Control Plane (3 nodes): bettley, cargyll, dalt
  • Workers (7 nodes): erenford, fenn, gardener, harlton, inchfield, jast, velaryon

Karstark and Lipps were already fully integrated into the cluster infrastructure:

  • Both were part of the Consul service mesh as clients
  • Both were configured as Nomad clients for workload scheduling
  • Both were included in other cluster services like Ray and Slurm

However, they weren't yet part of the Kubernetes cluster, which meant we were missing out on their compute capacity for containerized workloads.

Installing Kubernetes Packages

The first step was to ensure both nodes had the necessary Kubernetes packages installed. Using the goldentooth CLI:

ansible-playbook -i inventory/hosts playbooks/install_k8s_packages.yaml --limit karstark,lipps

This playbook handled:

  • Installing and configuring containerd as the container runtime
  • Installing kubeadm, kubectl, and kubelet packages
  • Setting up proper systemd cgroup configuration
  • Enabling and starting the kubelet service

Both nodes successfully installed Kubernetes v1.32.7, which was slightly newer than the existing cluster nodes running v1.32.3.

The Challenge: Certificate Issues

When attempting to use the standard goldentooth bootstrap_k8s command, we ran into certificate verification issues. The bootstrap process was timing out when trying to communicate with the Kubernetes API server.

The error manifested as:

tls: failed to verify certificate: x509: certificate signed by unknown authority

This is a common issue in clusters that have been running for a while (393 days in our case) and have undergone certificate rotations or updates.

The Solution: Manual Join Process

Instead of relying on the automated bootstrap, I took a more direct approach:

  1. Generate a join token from the control plane:

    goldentooth command_root bettley "kubeadm token create --print-join-command"
    
  2. Execute the join command on both nodes:

    goldentooth command_root karstark,lipps "kubeadm join 10.4.0.10:6443 --token yi3zz8.qf0ziy9ce7nhnkjv --discovery-token-ca-cert-hash sha256:0d6c8981d10e407429e135db4350e6bb21382af57addd798daf6c3c5663ac964 --skip-phases=preflight"
    

The --skip-phases=preflight flag was key here, as it bypassed the problematic preflight checks while still allowing the nodes to join successfully.

Verification

After the join process completed, both nodes appeared in the cluster:

goldentooth command_root bettley "kubectl get nodes"
NAME        STATUS   ROLES           AGE    VERSION
bettley     Ready    control-plane   393d   v1.32.3
cargyll     Ready    control-plane   393d   v1.32.3
dalt        Ready    control-plane   393d   v1.32.3
erenford    Ready    <none>          393d   v1.32.3
fenn        Ready    <none>          393d   v1.32.3
gardener    Ready    <none>          393d   v1.32.3
harlton     Ready    <none>          393d   v1.32.3
inchfield   Ready    <none>          393d   v1.32.3
jast        Ready    <none>          393d   v1.32.3
karstark    Ready    <none>          53s    v1.32.7
lipps       Ready    <none>          54s    v1.32.7
velaryon    Ready    <none>          52d    v1.32.5

Perfect! Both nodes transitioned from "NotReady" to "Ready" status within about a minute, indicating that the Calico CNI networking had successfully configured them.

The New Topology

Our Kubernetes cluster now consists of:

  • Control Plane (3 nodes): bettley, cargyll, dalt
  • Workers (9 nodes): erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps, velaryon (GPU)

This brings us to a total of 12 nodes in the Kubernetes cluster, matching the full complement of our Raspberry Pi bramble plus the x86 GPU node.

GPU Node Configuration

Velaryon, my x86 GPU node, required special configuration to ensure GPU workloads are only scheduled intentionally:

Hardware Specifications

  • GPU: NVIDIA GeForce RTX 2070 (8GB VRAM)
  • CPU: 24 cores (x86_64)
  • Memory: 32GB RAM
  • Architecture: amd64

Kubernetes Configuration

The node is configured with:

  • Label: gpu=true for workload targeting
  • Taint: gpu=true:NoSchedule to prevent accidental scheduling
  • Architecture: arch=amd64 for x86-specific workloads

Scheduling Requirements

To schedule workloads on Velaryon, pods must include:

tolerations:
- key: gpu
  operator: Equal
  value: "true"
  effect: NoSchedule
nodeSelector:
  gpu: "true"

This ensures that only workloads explicitly designed for GPU execution can access the expensive GPU resources, following the same intentional scheduling pattern used with Nomad.

GPU Resource Detection Challenge

While the taint-based scheduling was working correctly, getting Kubernetes to actually detect and expose the GPU resources proved more challenging. The NVIDIA device plugin is responsible for discovering GPUs and advertising them as nvidia.com/gpu resources that pods can request.

Initial Problem

The device plugin was failing with the error:

E0719 16:20:41.050191       1 factory.go:115] Incompatible platform detected
E0719 16:20:41.050193       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?

Despite having installed the NVIDIA Container Toolkit and configuring containerd, the device plugin couldn't detect the NVML library from within its container environment.

The Root Cause

The issue was that the device plugin container couldn't access:

  1. NVIDIA Management Library: libnvidia-ml.so.1 needed for GPU discovery
  2. Device files: /dev/nvidia* required for direct GPU communication
  3. Proper privileges: Needed to interact with kernel-level GPU drivers

The Solution

After several iterations, the working configuration required:

Library Access:

volumeMounts:
- name: nvidia-ml-lib
  mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
  readOnly: true
- name: nvidia-ml-actual
  mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
  readOnly: true

Device Access:

volumeMounts:
- name: dev
  mountPath: /dev
volumes:
- name: dev
  hostPath:
    path: /dev

Container Privileges:

securityContext:
  privileged: true

Verification

Once properly configured, the device plugin successfully reported:

I0719 16:56:06.462937       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0719 16:56:06.463631       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0719 16:56:06.465420       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

The GPU resource then appeared in the node's capacity:

kubectl get nodes -o json | jq '.items[] | select(.metadata.name=="velaryon") | .status.capacity'
{
  "cpu": "24",
  "ephemeral-storage": "102626232Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "32803048Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}

Testing GPU Resource Allocation

To verify the system was working end-to-end, I created a test pod that:

  • Requests GPU resources: nvidia.com/gpu: 1
  • Includes proper tolerations: To bypass the gpu=true:NoSchedule taint
  • Targets the GPU node: Using gpu: "true" node selector
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-workload
spec:
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  nodeSelector:
    gpu: "true"
  containers:
  - name: gpu-test
    image: busybox
    command: ["sleep", "60"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1

The pod successfully scheduled and the node showed:

nvidia.com/gpu     1          1

This confirmed that GPU resource allocation tracking was working correctly.

Final NVIDIA Device Plugin Configuration

For reference, here's the complete working NVIDIA device plugin DaemonSet configuration:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        gpu: "true"
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
        - name: nvidia-ml-lib
          mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
          readOnly: true
        - name: nvidia-ml-actual
          mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
          readOnly: true
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-ml-lib
        hostPath:
          path: /lib/x86_64-linux-gnu/libnvidia-ml.so.1
      - name: nvidia-ml-actual
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03

Key aspects of this configuration:

  • Targeted deployment: Only runs on nodes with gpu: "true" label
  • Taint tolerance: Can schedule on nodes with gpu=true:NoSchedule taint
  • Privileged access: Required for kernel-level GPU driver interaction
  • Library binding: Specific mounts for NVIDIA ML library files
  • Device access: Full /dev mount for GPU device communication

GPU Storage NFS Export

With the cluster expanded to include the Velaryon GPU node, a natural question emerged: how can the Raspberry Pi cluster nodes efficiently exchange data with the GPU node for machine learning workloads and other compute-intensive tasks?

The solution was to leverage Velaryon's secondary 1TB NVMe SSD and expose it to the entire cluster via NFS, creating a high-speed shared storage pool specifically for Pi-GPU data exchange.

The Challenge

Velaryon came with two storage devices:

  • Primary NVMe (nvme1n1): Linux system drive
  • Secondary NVMe (nvme0n1): 1TB drive with old NTFS partitions from previous Windows installation

The goal was to repurpose this secondary drive as shared storage while maintaining architectural separation - the GPU node should provide storage services without becoming a structural component of the Pi cluster.

Storage Architecture Decision

Rather than integrating Velaryon into the existing storage ecosystem (ZFS replication, Ceph distributed storage), I opted for a simpler approach:

  • Pure ext4: Single partition consuming the entire 1TB drive
  • NFS export: Simple, performant network filesystem
  • Subnet-wide access: Available to all 10.4.x.x nodes

This keeps the GPU node loosely coupled while providing the needed functionality.

Implementation

Drive Preparation

First, I cleared the old NTFS partitions and created a fresh GPT layout:

# Clear existing partition table
sudo wipefs -af /dev/nvme0n1

# Create new GPT partition table and single partition
sudo parted /dev/nvme0n1 mklabel gpt
sudo parted /dev/nvme0n1 mkpart primary ext4 0% 100%

# Format as ext4
sudo mkfs.ext4 -L gpu-storage /dev/nvme0n1p1

The resulting filesystem has UUID 5bc38d5b-a7a4-426e-acdb-e5caf0a809d9 and is mounted persistently at /mnt/gpu-storage.

NFS Server Configuration

Velaryon was configured as an NFS server with a single export:

# /etc/exports
/mnt/gpu-storage 10.4.0.0/20(rw,sync,no_root_squash,no_subtree_check)

This grants read-write access to the entire infrastructure subnet with synchronous writes for data integrity.

Ansible Integration

Rather than manually configuring each node, I integrated the GPU storage into the existing Ansible automation:

Inventory Updates (inventory/hosts):

nfs_server:
  hosts:
    allyrion:    # Existing NFS server
    velaryon:    # New GPU storage server

Host Variables (inventory/host_vars/velaryon.yaml):

nfs_exports:
  - "/mnt/gpu-storage 10.4.0.0/20(rw,sync,no_root_squash,no_subtree_check)"

Global Configuration (group_vars/all/vars.yaml):

nfs:
  mounts:
    primary:      # Existing allyrion NFS share
      share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
      mount: '/mnt/nfs'
      safe_name: 'mnt-nfs'
      type: 'nfs'
      options: {}
    gpu_storage:  # New GPU storage share
      share: "{{ hostvars['velaryon'].ipv4_address }}:/mnt/gpu-storage"
      mount: '/mnt/gpu-storage'
      safe_name: 'mnt-gpu\x2dstorage'  # Systemd unit name escaping
      type: 'nfs'
      options: {}

Systemd Automount Configuration

The trickiest part was configuring systemd automount units. Systemd requires special character escaping for mount paths - the mount point /mnt/gpu-storage must use the unit name mnt-gpu\x2dstorage (where \x2d is the escaped dash).

Mount Unit Template (templates/mount.j2):

[Unit]
Description=Mount {{ item.key }}

[Mount]
What={{ item.value.share }}
Where={{ item.value.mount }}
Type={{ item.value.type }}
{% if item.value.options -%}
Options={{ item.value.options | join(',') }}
{% else -%}
Options=defaults
{% endif %}

[Install]
WantedBy=default.target

Automount Unit Template (templates/automount.j2):

[Unit]
Description=Automount {{ item.key }}
After=remote-fs-pre.target network-online.target network.target
Before=umount.target remote-fs.target

[Automount]
Where={{ item.value.mount }}
TimeoutIdleSec=60

[Install]
WantedBy=default.target

Deployment Playbook

A new playbook setup_gpu_storage.yaml orchestrates the entire deployment:

---
# Setup GPU storage on Velaryon with NFS export
- name: 'Setup Velaryon GPU storage and NFS export'
  hosts: 'velaryon'
  become: true
  tasks:
    - name: 'Ensure GPU storage mount point exists'
      ansible.builtin.file:
        path: '/mnt/gpu-storage'
        state: 'directory'
        owner: 'root'
        group: 'root'
        mode: '0755'

    - name: 'Check if GPU storage is mounted'
      ansible.builtin.command:
        cmd: 'mountpoint -q /mnt/gpu-storage'
      register: gpu_storage_mounted
      failed_when: false
      changed_when: false

    - name: 'Mount GPU storage if not already mounted'
      ansible.builtin.mount:
        src: 'UUID=5bc38d5b-a7a4-426e-acdb-e5caf0a809d9'
        path: '/mnt/gpu-storage'
        fstype: 'ext4'
        opts: 'defaults'
        state: 'mounted'
      when: gpu_storage_mounted.rc != 0

- name: 'Configure NFS exports on Velaryon'
  hosts: 'velaryon'
  become: true
  roles:
    - 'geerlingguy.nfs'

- name: 'Setup NFS mounts on all nodes'
  hosts: 'all'
  become: true
  roles:
    - 'goldentooth.setup_nfs_mounts'

Usage

The GPU storage is now seamlessly integrated into the goldentooth CLI:

# Deploy/update GPU storage configuration
goldentooth setup_gpu_storage

# Enable automount on specific nodes
goldentooth command allyrion 'sudo systemctl enable --now mnt-gpu\x2dstorage.automount'

# Verify access (automounts on first access)
goldentooth command cargyll,bettley 'ls /mnt/gpu-storage/'

Results

The implementation provides:

  • 1TB shared storage available cluster-wide at /mnt/gpu-storage
  • Automatic mounting via systemd automount on directory access
  • Full Ansible automation via the goldentooth CLI
  • Clean separation between Pi cluster and GPU node architectures

Data written from any node is immediately visible across the cluster, enabling seamless Pi-GPU workflows for machine learning datasets, model artifacts, and computational results.

Prometheus Blackbox Exporter

The Observability Gap

Our Goldentooth cluster has comprehensive infrastructure monitoring through Prometheus, node exporters, and application metrics. But we've been missing a crucial piece: synthetic monitoring. We can see if our servers are running, but can we actually reach our services? Are our web UIs accessible? Can we connect to our APIs?

Enter the Prometheus Blackbox Exporter - our eyes and ears for service availability across the entire cluster.

What is Blackbox Monitoring?

Blackbox monitoring tests services from the outside, just like your users would. Instead of looking at internal metrics, it:

  • Probes HTTP/HTTPS endpoints - "Is the Consul web UI actually working?"
  • Tests TCP connectivity - "Can I connect to the Vault API port?"
  • Validates DNS resolution - "Do our cluster domains resolve correctly?"
  • Checks ICMP reachability - "Are all nodes responding to ping?"

It's called "blackbox" because we don't peek inside the service - we just test if it works from the outside.

Planning the Implementation

I needed to design a comprehensive monitoring strategy that would cover:

Service Categories

  • HashiCorp Stack: Consul, Vault, Nomad web UIs and APIs
  • Kubernetes Services: API server health, Argo CD, LoadBalancer services
  • Observability Stack: Prometheus, Grafana, Loki endpoints
  • Infrastructure: All 13 node homepages, HAProxy stats
  • External Services: CloudFront distributions
  • Network Health: DNS resolution for all cluster domains

Intelligent Probe Types

  • Internal HTTPS: Uses Step-CA certificates for cluster services
  • External HTTPS: Uses public CAs for external services
  • HTTP: Plain HTTP for internal services
  • TCP: Port connectivity for APIs and cluster communication
  • DNS: Domain resolution for cluster services
  • ICMP: Basic network connectivity for all nodes

The Ansible Implementation

I created a comprehensive Ansible role goldentooth.setup_blackbox_exporter that handles:

Core Deployment

# Install blackbox exporter v0.25.0
# Deploy on allyrion (same node as Prometheus)
# Configure systemd service with security hardening
# Set up TLS certificates via Step-CA

Security Integration

The blackbox exporter integrates seamlessly with our Step-CA infrastructure:

  • Client certificates for secure communication
  • CA validation for internal services
  • Automatic renewal via systemd timers
  • Proper certificate ownership for the service user

Service Discovery Magic

Instead of hardcoding targets, I implemented dynamic service discovery:

# Generate targets from Ansible inventory variables
blackbox_https_internal_targets:
  - "https://consul.goldentooth.net:8501"
  - "https://vault.goldentooth.net:8200"
  - "https://nomad.goldentooth.net:4646"
  # ... and many more

# Auto-generate ICMP targets for all cluster nodes
{% for host in groups['all'] %}
- targets:
    - "{{ hostvars[host]['ipv4_address'] }}"
  labels:
    job: 'blackbox-icmp'
    node: "{{ host }}"
{% endfor %}

Prometheus Integration

The trickiest part was configuring Prometheus to properly scrape blackbox targets. Blackbox exporter works differently than normal exporters:

# Instead of scraping the target directly...
# Prometheus scrapes the blackbox exporter with target as parameter
- job_name: 'blackbox-https-internal'
  metrics_path: '/probe'
  params:
    module: ['https_2xx_internal']
  relabel_configs:
    # Redirect to blackbox exporter
    - target_label: __address__
      replacement: "allyrion:9115"
    # Pass original target as parameter
    - source_labels: [__param_target]
      target_label: __param_target

Deployment Day

The deployment was mostly smooth with a few interesting challenges:

Certificate Duration Drama

# First attempt failed:
# "requested duration of 8760h is more than authorized maximum of 168h"

# Solution: Match Step-CA policy
--not-after=168h  # 1 week instead of 1 year

DNS Resolution Reality Check

Many of our internal domains (*.goldentooth.net) don't actually resolve yet, so probes show up=0. This is expected and actually valuable - it shows us what infrastructure we still need to set up!

Relabel Configuration Complexity

Getting the Prometheus relabel configs right for blackbox took several iterations. The key insight: blackbox exporter targets need to be "redirected" through the exporter itself.

What We're Monitoring Now

The blackbox exporter is now actively monitoring 40+ endpoints across our cluster:

Web UIs and APIs

  • Consul Web UI (https://consul.goldentooth.net:8501)
  • Vault Web UI (https://vault.goldentooth.net:8200)
  • Nomad Web UI (https://nomad.goldentooth.net:4646)
  • Grafana Dashboard (https://grafana.goldentooth.net:3000)
  • Argo CD Interface (https://argocd.goldentooth.net)

Infrastructure Endpoints

  • All 13 node homepages (http://[node].nodes.goldentooth.net)
  • HAProxy statistics page (with basic auth)
  • Prometheus web interface
  • Loki API endpoints

Network Connectivity

  • TCP connectivity to all critical service ports
  • DNS resolution for all cluster domains
  • ICMP ping for every cluster node
  • External CloudFront distributions

The Power of Synthetic Monitoring

Now when something breaks, we'll know immediately:

  • probe_success tells us if the service is reachable
  • probe_duration_seconds shows response times
  • probe_http_status_code reveals HTTP errors
  • probe_ssl_earliest_cert_expiry warns about certificate expiration

This complements our existing infrastructure monitoring perfectly. We can see both "the server is running" (node exporter) and "the service actually works" (blackbox exporter).

Comprehensive Metrics Collection

After establishing the foundation of our observability stack with Prometheus, Grafana, and the blackbox exporter, it's time to ensure we're collecting metrics from every critical component in our cluster. This chapter covers the addition of Nomad telemetry and Kubernetes object metrics to our monitoring infrastructure.

The Metrics Audit

A comprehensive audit of our cluster revealed which services were already exposing metrics:

Already Configured:

  • ✅ Kubernetes API server, controller manager, scheduler (via control plane endpoints)
  • ✅ HAProxy (custom exporter on port 8405)
  • ✅ Prometheus (self-monitoring)
  • ✅ Grafana (internal metrics)
  • ✅ Loki (log aggregation metrics)
  • ✅ Consul (built-in Prometheus endpoint)
  • ✅ Vault (telemetry endpoint)

Missing:

  • ❌ Nomad (no telemetry configuration)
  • ❌ Kubernetes object state (deployments, pods, services)

Enabling Nomad Telemetry

Nomad has built-in Prometheus support but requires explicit configuration. We added the telemetry block to our Nomad configuration template:

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

This configuration:

  • Enables Prometheus-compatible metrics on /v1/metrics?format=prometheus
  • Publishes detailed allocation and node metrics
  • Disables hostname labels (we add our own)
  • Sets a 1-second collection interval for fine-grained data

Certificate-Based Authentication

Unlike some services that expose metrics without authentication, Nomad requires mutual TLS for metrics access. We leveraged our Step-CA infrastructure to generate proper client certificates:

- name: 'Generate Prometheus client certificate for Nomad metrics.'
  ansible.builtin.shell:
    cmd: |
      {{ step_ca.executable }} ca certificate \
        "prometheus.client.nomad" \
        "/etc/prometheus/certs/nomad-client.crt" \
        "/etc/prometheus/certs/nomad-client.key" \
        --provisioner="{{ step_ca.default_provisioner.name }}" \
        --password-file="{{ step_ca.default_provisioner.password_path }}" \
        --san="prometheus.client.nomad" \
        --san="prometheus" \
        --san="{{ clean_hostname }}" \
        --san="{{ ipv4_address }}" \
        --not-after='24h' \
        --console \
        --force

This approach ensures:

  • Certificates are properly signed by our cluster CA
  • Client identity is clearly established
  • Automatic renewal via systemd timers
  • Consistent with our security model

Prometheus Scrape Configuration

With certificates in place, we configured Prometheus to scrape all Nomad nodes:

- job_name: 'nomad'
  metrics_path: '/v1/metrics'
  params:
    format: ['prometheus']
  static_configs:
    - targets:
        - "10.4.0.11:4646"  # bettley (server)
        - "10.4.0.12:4646"  # cargyll (server)
        - "10.4.0.13:4646"  # dalt (server)
        # ... all client nodes
  scheme: 'https'
  tls_config:
    ca_file: "{{ step_ca.root_cert_path }}"
    cert_file: "/etc/prometheus/certs/nomad-client.crt"
    key_file: "/etc/prometheus/certs/nomad-client.key"

Kubernetes Object Metrics with kube-state-metrics

While node-level metrics tell us about resource usage, we also need visibility into Kubernetes objects themselves. Enter kube-state-metrics, which exposes metrics about:

  • Deployment replica counts and rollout status
  • Pod phases and container states
  • Service endpoints and readiness
  • PersistentVolume claims and capacity
  • Job completion status
  • And much more

GitOps Deployment Pattern

Following our established patterns, we created a dedicated GitOps repository for kube-state-metrics:

# Create the repository
gh repo create goldentooth/kube-state-metrics --public

# Clone into our organization structure
cd ~/Projects/goldentooth
git clone https://github.com/goldentooth/kube-state-metrics.git

# Add the required label for Argo CD discovery
gh repo edit goldentooth/kube-state-metrics --add-topic gitops-repo

The key insight here is that our Argo CD ApplicationSet automatically discovers repositories with the gitops-repo label, eliminating manual application creation.

kube-state-metrics Configuration

The deployment includes comprehensive RBAC permissions to read all Kubernetes objects:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
# ... additional resources

We discovered that some resources like resourcequotas, replicationcontrollers, and limitranges were missing from the initial configuration, causing permission errors. A quick update to the ClusterRole resolved these issues.

Security Hardening

The kube-state-metrics deployment follows security best practices:

securityContext:
  fsGroup: 65534
  runAsGroup: 65534
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

Container-level security adds additional restrictions:

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true

Prometheus Auto-Discovery

The service includes annotations for automatic Prometheus discovery:

annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '8080'
  prometheus.io/path: '/metrics'

This eliminates the need for manual Prometheus configuration - the metrics are automatically discovered and scraped.

Verifying the Deployment

After deployment, we can verify metrics are being exposed:

# Port-forward to test locally
kubectl port-forward -n kube-state-metrics service/kube-state-metrics 8080:8080

# Check deployment metrics
curl -s http://localhost:8080/metrics | grep kube_deployment_status_replicas

Example output:

kube_deployment_status_replicas{namespace="argocd",deployment="argocd-redis-ha-haproxy"} 3
kube_deployment_status_replicas{namespace="kube-system",deployment="coredns"} 2

Blocking Docker Installation

The Problem

I don't know why, and I'm too lazy to dig much into it, but if I install docker on any node in the Kubernetes cluster, this conflicts with containerd (containerd.io), which causes Kubernetes to shit blood and stop working on that node. Great.

To prevent this, I implemented a clusterwide ban on Docker. I'm recording the details here in case I need to do it again.

Implementation

First, we removed Docker from nodes where it was already installed (like Allyrion):

# Stop and remove containers
goldentooth command_root allyrion "docker stop envoy && docker rm envoy"

# Remove all images
goldentooth command_root allyrion "docker images -q | xargs -r docker rmi -f"

# Stop and disable Docker
goldentooth command_root allyrion "systemctl stop docker && systemctl disable docker"
goldentooth command_root allyrion "systemctl stop docker.socket && systemctl disable docker.socket"

# Purge Docker packages
goldentooth command_root allyrion "apt-get purge -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin"
goldentooth command_root allyrion "apt-get autoremove -y"

# Clean up Docker directories
goldentooth command_root allyrion "rm -rf /var/lib/docker /etc/docker /var/run/docker.sock"
goldentooth command_root allyrion "rm -f /etc/apt/sources.list.d/docker.list /etc/apt/keyrings/docker.gpg"

APT Preferences Configuration

Next, we added an APT preferences file to the goldentooth.setup_security role that blocks Docker packages from being installed:

- name: 'Block Docker installation to prevent conflicts with Kubernetes containerd'
  ansible.builtin.copy:
    dest: '/etc/apt/preferences.d/block-docker'
    mode: '0644'
    owner: 'root'
    group: 'root'
    content: |
      # Block Docker installation to prevent conflicts with Kubernetes containerd
      # Docker packages can break the containerd installation used by Kubernetes
      # This preference file prevents accidental installation of Docker

      Package: docker-ce
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-ce-cli
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-ce-rootless-extras
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-buildx-plugin
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-compose-plugin
      Pin: origin ""
      Pin-Priority: -1

      Package: docker.io
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-compose
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-registry
      Pin: origin ""
      Pin-Priority: -1

      Package: docker-doc
      Pin: origin ""
      Pin-Priority: -1

      # Also block the older containerd.io package that comes with Docker
      # Kubernetes should use the standard containerd package instead
      Package: containerd.io
      Pin: origin ""
      Pin-Priority: -1

Deployment

The configuration was deployed to all nodes using:

goldentooth configure_cluster

Verification

We can verify that Docker is now blocked:

# Check Docker package policy
goldentooth command allyrion "apt-cache policy docker-ce"
# Output shows: Candidate: (none)

# Verify the preferences file exists
goldentooth command all "ls -la /etc/apt/preferences.d/block-docker"

How APT Preferences Work

APT preferences allow you to control which versions of packages are installed. By setting a Pin-Priority of -1, we effectively tell APT to never install these packages, regardless of their availability in the configured repositories.

This is more robust than simply removing Docker repositories because:

  1. It prevents installation from any source (including manual addition of repositories)
  2. It provides clear documentation of why these packages are blocked
  3. It's easily reversible if needed (just remove the preferences file)

Infrastructure Test Framework Improvements

After running comprehensive tests across the cluster, we discovered several critical issues with our test framework that were masking real infrastructure problems. This chapter documents the systematic fixes we implemented to ensure our automated testing provides accurate health monitoring.

The Initial Problem

When running goldentooth test all, multiple test failures appeared across different nodes:

PLAY RECAP *********************************************************************
bettley                    : ok=47   changed=0    unreachable=0    failed=1    skipped=3    rescued=0    ignored=2
cargyll                    : ok=47   changed=0    unreachable=0    failed=1    skipped=3    rescued=0    ignored=1
dalt                       : ok=47   changed=0    unreachable=0    failed=1    skipped=3    rescued=0    ignored=1

The challenge was determining whether these failures indicated real infrastructure issues or problems with the test framework itself.

Root Cause Analysis

1. Kubernetes API Server Connectivity Issues

The most critical failure was the Kubernetes API server health check consistently failing on the bettley control plane node:

Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
url: https://10.4.0.11:6443/healthz

Initial investigation revealed that while kubelet was running, both etcd and kube-apiserver pods were in CrashLoopBackOff state. This led us to discover that Kubernetes certificates had expired on June 20, 2025, but we were running tests in July 2025.

2. Test Framework Configuration Issues

Several test framework bugs were identified:

  • Vault decryption errors: Tests couldn't access encrypted vault secrets
  • Wrong certificate paths: Tests checking CA certificates instead of service certificates
  • Undefined variables: JMESPath dependencies and variable reference errors
  • Localhost binding assumptions: Services bound to specific IPs, not localhost

Infrastructure Fixes

Kubernetes Certificate Renewal

The most significant infrastructure issue was expired Kubernetes certificates. We resolved this using kubeadm:

# Backup existing certificates
ansible -i inventory/hosts bettley -m shell -a "cp -r /etc/kubernetes/pki /etc/kubernetes/pki.backup.$(date +%Y%m%d_%H%M%S)" --become

# Renew all certificates
ansible -i inventory/hosts bettley -m shell -a "kubeadm certs renew all" --become

# Restart control plane components by moving manifests temporarily
cd /etc/kubernetes/manifests
mv kube-apiserver.yaml kube-apiserver.yaml.tmp
mv etcd.yaml etcd.yaml.tmp
mv kube-controller-manager.yaml kube-controller-manager.yaml.tmp
mv kube-scheduler.yaml kube-scheduler.yaml.tmp

# Wait 10 seconds, then restore manifests
sleep 10
mv kube-apiserver.yaml.tmp kube-apiserver.yaml
mv etcd.yaml.tmp etcd.yaml
mv kube-controller-manager.yaml.tmp kube-controller-manager.yaml
mv kube-scheduler.yaml.tmp kube-scheduler.yaml

After renewal, certificates were valid until July 2026:

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
apiserver                  Jul 23, 2026 00:01 UTC   364d            ca                      no
etcd-peer                  Jul 23, 2026 00:01 UTC   364d            etcd-ca                 no
etcd-server                Jul 23, 2026 00:01 UTC   364d            etcd-ca                 no

Test Framework Fixes

1. Vault Authentication

Fixed missing vault password configuration in test environment:

# /Users/nathan/Projects/goldentooth/ansible/tests/ansible.cfg
[defaults]
vault_password_file = ~/.goldentooth_vault_password

2. Certificate Path Corrections

Updated tests to check actual service certificates instead of CA certificates:

# Before: Checking CA certificates (5-year lifespan)
path: /etc/consul.d/tls/consul-agent-ca.pem

# After: Checking service certificates (24-hour rotation)
path: /etc/consul.d/certs/tls.crt

3. API Connectivity Fixes

Fixed hardcoded localhost assumptions to use actual node IP addresses:

# Before: Assuming localhost binding
url: "https://127.0.0.1:8501/v1/status/leader"

# After: Using actual node IP
url: "http://{{ ansible_default_ipv4.address }}:8500/v1/status/leader"

4. Consul Members Command

Enhanced Consul connectivity testing with proper address specification:

- name: Check if consul command exists
  stat:
    path: /usr/bin/consul
  register: consul_command_stat

- name: Check Consul members
  command: consul members -status=alive -http-addr={{ ansible_default_ipv4.address }}:8500
  when:
    - consul_service.status.ActiveState == "active"
    - consul_command_stat.stat.exists

5. Kubernetes Test Improvements

Simplified Kubernetes tests to avoid JMESPath dependencies and fixed variable scoping:

# Simplified node readiness test
- name: Record node readiness test (simplified)
  set_fact:
    k8s_tests: "{{ k8s_tests + [{'name': 'k8s_cluster_accessible', 'category': 'kubernetes', 'success': (k8s_nodes_raw is defined and k8s_nodes_raw is succeeded) | bool, 'duration': 0.5}] }}"

# Fixed API health test scoping
- name: Record API health test
  set_fact:
    k8s_tests: "{{ k8s_tests + [{'name': 'k8s_api_healthy', 'category': 'kubernetes', 'success': (k8s_api.status == 200 and k8s_api.content | default('') == 'ok') | bool, 'duration': 0.2}] }}"
  when:
    - k8s_api is defined
    - inventory_hostname in groups['k8s_control_plane']

6. Step-CA Variable References

Fixed undefined variable references in Step-CA connectivity tests:

# Fixed IP address lookup
elif step ca health --ca-url https://{{ hostvars[groups['step_ca'] | first]['ipv4_address'] }}:9443 --root /etc/ssl/certs/goldentooth.pem; then

7. Localhost Aggregation Task

Fixed the test summary task that was failing due to missing facts:

- name: Aggregate test results
  hosts: localhost
  gather_facts: true  # Changed from false

Test Design Philosophy

We adopted a principle of separating certificate presence from validity testing:

# Test 1: Certificate exists
- name: Check Consul certificate exists
  stat:
    path: /etc/consul.d/certs/tls.crt
  register: consul_cert

- name: Record certificate presence test
  set_fact:
    consul_tests: "{{ consul_tests + [{'name': 'consul_certificate_present', 'category': 'consul', 'success': consul_cert.stat.exists | bool, 'duration': 0.1}] }}"

# Test 2: Certificate is valid (separate test)
- name: Check if certificate needs renewal
  command: step certificate needs-renewal /etc/consul.d/certs/tls.crt
  register: cert_needs_renewal
  when: consul_cert.stat.exists

- name: Record certificate validity test
  set_fact:
    consul_tests: "{{ consul_tests + [{'name': 'consul_certificate_valid', 'category': 'consul', 'success': (cert_needs_renewal.rc != 0) | bool, 'duration': 0.1}] }}"

This approach provides better debugging information and clearer failure isolation.

Slurm Refactoring and Improvements

Overview

After the initial Slurm deployment (documented in chapter 032), the cluster faced performance and reliability challenges that required significant refactoring. The monolithic setup role was taking 10+ minutes to execute and had idempotency issues, while memory configuration mismatches caused node validation failures.

It's my fault - it's because of my laziness. So this chapter is essentially me saying "yeah, I did a shitty thing, and so now I have to fix it."

Problems Identified

Performance Issues

  • Setup Duration: The original goldentooth.setup_slurm role took over 10 minutes
  • Non-idempotent: Re-running the role would repeat expensive operations
  • Monolithic Design: Single role handled everything from basic Slurm to complex HPC software stacks

Node Validation Failures

  • Memory Mismatch: karstark and lipps nodes (4GB Pi 4s) were configured with 4096MB but only had ~3797MB available
  • Invalid State: These nodes showed as "inval" in sinfo output
  • Authentication Issues: MUNGE key synchronization problems across nodes

Configuration Management

  • Static Memory Values: All nodes hardcoded to 4096MB regardless of actual capacity
  • Limited Flexibility: Single configuration approach didn't account for hardware variations

Refactoring Solution

Modular Role Architecture

Split the monolithic role into focused components:

Core Components (goldentooth.setup_slurm_core)

  • Purpose: Essential Slurm and MUNGE setup only
  • Duration: Reduced from 10+ minutes to ~50 seconds
  • Scope: Package installation, basic configuration, service management
  • Features: MUNGE key synchronization, systemd PID file fixes

Specialized Modules

  • goldentooth.setup_lmod: Environment module system
  • goldentooth.setup_hpc_software: HPC software stack (OpenMPI, Singularity, Conda)
  • goldentooth.setup_slurm_modules: Module files for installed software

Dynamic Memory Detection

Replaced static memory configuration with dynamic detection:

# Before: Static configuration
NodeName=DEFAULT CPUs=4 RealMemory=4096 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN

# After: Dynamic per-node configuration
NodeName=DEFAULT CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
{% for slurm_compute_name in groups['slurm_compute'] %}
NodeName={{ slurm_compute_name }} NodeAddr={{ hostvars[slurm_compute_name].ipv4_address }} RealMemory={{ hostvars[slurm_compute_name].ansible_memtotal_mb }}
{% endfor %}

Node Exclusion Strategy

For nodes with insufficient memory (karstark, lipps):

  • Inventory Update: Removed from slurm_compute group
  • Service Cleanup: Stopped and disabled slurmd/munge services
  • Package Removal: Uninstalled Slurm packages to prevent conflicts

Implementation Details

MUNGE Key Synchronization

Added permanent solution to MUNGE authentication issues:

- name: 'Synchronize MUNGE keys across cluster'
  block:
    - name: 'Retrieve MUNGE key from first controller'
      ansible.builtin.slurp:
        src: '/etc/munge/munge.key'
      register: 'controller_munge_key'
      run_once: true
      delegate_to: "{{ groups['slurm_controller'] | first }}"

    - name: 'Distribute MUNGE key to all nodes'
      ansible.builtin.copy:
        content: "{{ controller_munge_key.content | b64decode }}"
        dest: '/etc/munge/munge.key'
        owner: 'munge'
        group: 'munge'
        mode: '0400'
        backup: yes
      when: inventory_hostname != groups['slurm_controller'] | first
      notify: 'Restart MUNGE'

SystemD Integration Fixes

Resolved PID file path mismatches:

- name: 'Fix slurmctld pidfile path mismatch'
  ansible.builtin.copy:
    content: |
      [Service]
      PIDFile=/var/run/slurm/slurmctld.pid
    dest: '/etc/systemd/system/slurmctld.service.d/override.conf'
    mode: '0644'
  when: inventory_hostname in groups['slurm_controller']
  notify: 'Reload systemd and restart slurmctld'

NFS Permission Resolution

Fixed directory permissions that prevented slurm user access:

# Fixed root directory permissions on NFS server
chmod 755 /mnt/usb1  # Was 700, preventing slurm user access

Results

Performance Improvements

  • Setup Time: Reduced from 10+ minutes to ~50 seconds for core functionality
  • Idempotency: Role can be safely re-run without expensive operations
  • Modularity: Users can choose which components to install

Cluster Health

  • Node Status: 9 nodes operational and idle
  • Authentication: MUNGE working consistently across all nodes
  • Resource Detection: Accurate memory reporting per node

Final Cluster State

general*     up   infinite      9   idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
debug        up   infinite      9   idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast

Prometheus Slurm Exporter

Overview

Following the Slurm refactoring work, the next logical step was to add comprehensive monitoring for the HPC workload manager. This chapter documents the implementation of prometheus-slurm-exporter to provide real-time visibility into cluster utilization, job queues, and resource allocation.

The Challenge

While Slurm was operational with 9 nodes in idle state, there was no integration with the existing Prometheus/Grafana observability stack. Key missing capabilities:

  • No Cluster Metrics: Unable to monitor CPU/memory utilization across nodes
  • No Job Visibility: No insight into job queues, completion rates, or resource consumption
  • No Historical Data: No way to track cluster usage patterns over time
  • Limited Alerting: No proactive monitoring of cluster health or resource exhaustion

Implementation Approach

Exporter Selection

Initially attempted the original vpenso/prometheus-slurm-exporter but discovered it was unmaintained and lacked modern features. Switched to the rivosinc/prometheus-slurm-exporter fork which provided:

  • Active Maintenance: 87 commits, regular releases through v1.6.10
  • Pre-built Binaries: ARM64 support via GitHub releases
  • Enhanced Features: Job tracing, CLI fallback modes, throttling support
  • Better Performance: Optimized for multiple Prometheus instances

Architecture Design

Deployed the exporter following goldentooth cluster patterns:

# Deployment Strategy
Target Nodes: slurm_controller (bettley, cargyll, dalt)
Service Port: 9092 (HTTP)
Protocol: HTTP with Prometheus file-based service discovery
Integration: Full Step-CA certificate management ready
User Management: Dedicated slurm-exporter service user

Role Structure

Created goldentooth.setup_slurm_exporter following established conventions:

roles/goldentooth.setup_slurm_exporter/
├── CLAUDE.md              # Comprehensive documentation
├── tasks/main.yaml         # Main deployment tasks
├── templates/
│   ├── slurm-exporter.service.j2         # Systemd service
│   ├── slurm_targets.yaml.j2             # Prometheus targets
│   └── cert-renewer@slurm-exporter.conf.j2  # Certificate renewal
└── handlers/main.yaml      # Service management handlers

Technical Implementation

Binary Installation

- name: 'Download prometheus-slurm-exporter from rivosinc fork'
  ansible.builtin.get_url:
    url: 'https://github.com/rivosinc/prometheus-slurm-exporter/releases/download/v{{ prometheus_slurm_exporter.version }}/prometheus-slurm-exporter_linux_{{ host.architecture }}.tar.gz'
    dest: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}.tar.gz'
    mode: '0644'

Service Configuration

[Service]
Type=simple
User=slurm-exporter
Group=slurm-exporter
ExecStart=/usr/local/bin/prometheus-slurm-exporter \
  -web.listen-address={{ ansible_default_ipv4.address }}:{{ prometheus_slurm_exporter.port }} \
  -web.log-level=info

Prometheus Integration

Added to the existing scrape configuration:

prometheus_scrape_configs:
  - job_name: 'slurm'
    file_sd_configs:
      - files:
          - "/etc/prometheus/file_sd/slurm_targets.yaml"
    relabel_configs:
      - source_labels: [instance]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

Service Discovery

Dynamic target generation for all controller nodes:

- targets:
  - "{{ hostvars[slurm_controller].ansible_default_ipv4.address }}:{{ prometheus_slurm_exporter.port }}"
  labels:
    job: 'slurm'
    instance: '{{ slurm_controller }}'
    cluster: '{{ cluster_name }}'
    role: 'slurm-controller'

Metrics Exposed

The rivosinc exporter provides comprehensive cluster visibility:

Core Cluster Metrics

slurm_cpus_total 36          # Total CPU cores (9 nodes × 4 cores)
slurm_cpus_idle 36           # Available CPU cores
slurm_cpus_per_state{state="idle"} 36
slurm_node_count_per_state{state="idle"} 9

Memory Utilization

slurm_mem_real 7.0281e+10    # Total cluster memory (MB)
slurm_mem_alloc 6.0797e+10   # Allocated memory
slurm_mem_free 9.484e+09     # Available memory

Job Queue Metrics

slurm_jobs_pending 0         # Jobs waiting in queue
slurm_jobs_running 0         # Currently executing jobs
slurm_job_scrape_duration 29 # Metric collection performance

Performance Monitoring

slurm_cpu_load 5.83          # Current CPU load average
slurm_node_scrape_duration 35 # Node data collection time

Deployment Results

Service Health

All three controller nodes running successfully:

● slurm-exporter.service - Prometheus Slurm Exporter
     Loaded: loaded (/etc/systemd/system/slurm-exporter.service; enabled)
     Active: active (running)
   Main PID: 3692156 (prometheus-slur)
      Tasks: 5 (limit: 8737)
     Memory: 1.5M (max: 128.0M available)

Metrics Validation

curl http://10.4.0.11:9092/metrics | grep '^slurm_'
slurm_cpu_load 5.83
slurm_cpus_idle 36
slurm_cpus_per_state{state="idle"} 36
slurm_cpus_total 36
slurm_node_count_per_state{state="idle"} 9

Prometheus Integration

Targets automatically discovered and scraped:

  • bettley:9092 - Controller node metrics
  • cargyll:9092 - Controller node metrics
  • dalt:9092 - Controller node metrics

Configuration Management

Variables Structure

# Prometheus Slurm Exporter configuration (rivosinc fork)
prometheus_slurm_exporter:
  version: "1.6.10"
  port: 9092
  user: "slurm-exporter"
  group: "slurm-exporter"

Command Interface

# Deploy exporter
goldentooth setup_slurm_exporter

# Verify deployment
goldentooth command slurm_controller "systemctl status slurm-exporter"

# Check metrics
goldentooth command bettley "curl -s http://localhost:9092/metrics | head -10"

Troubleshooting Lessons

Initial Issues Encountered

  1. Wrong Repository: Started with unmaintained vpenso fork

    • Solution: Switched to actively maintained rivosinc fork
  2. TLS Configuration: Attempted HTTPS but exporter doesn't support TLS flags

    • Solution: Used HTTP with plans for future TLS proxy if needed
  3. Binary Availability: No pre-built ARM64 binaries in original version

    • Solution: rivosinc fork provides comprehensive release assets
  4. Port Conflicts: Initially used port 8080

    • Solution: Used exporter default 9092 to avoid conflicts

Debugging Process

Service logs were key to identifying configuration issues:

journalctl -u slurm-exporter --no-pager -l

Metrics endpoint testing confirmed functionality:

curl -s http://localhost:9092/metrics | grep -E '^slurm_'

Integration with Existing Stack

The exporter seamlessly integrates with goldentooth monitoring infrastructure:

Prometheus Configuration

  • File-based Service Discovery: Automatic target management
  • Label Strategy: Consistent with existing exporters
  • Scrape Intervals: Standard 60-second collection

Certificate Management

  • Step-CA Ready: Templates prepared for future TLS implementation
  • Automatic Renewal: Systemd timer configuration included
  • Service User: Dedicated account with minimal permissions

Observability Pipeline

  • Prometheus: Metrics collection and storage
  • Grafana: Dashboard visualization (ready for implementation)
  • Alerting: Rule definition for cluster health monitoring

Performance Impact

Resource Usage

  • Memory: ~1.5MB RSS per exporter instance
  • CPU: Minimal impact during scraping
  • Network: Standard HTTP metrics collection
  • Slurm Load: Read-only operations with built-in throttling

Scalability Considerations

  • Multiple Controllers: Distributed across all controller nodes
  • High Availability: No single point of failure
  • Data Consistency: Each exporter provides complete cluster view

Certificate Renewal Debugging Odyssey

Some time after setting up the certificate renewal system, the cluster was humming along nicely with 24-hour certificate lifetimes and automatic renewal every 5 minutes. Or so I thought.

One morning, I discovered that Vault certificates had mysteriously expired overnight, despite the renewal system supposedly working. This kicked off a multi-day investigation that would lead to significant improvements in our certificate management and monitoring infrastructure.

The Mystery: Why Didn't Vault Certificates Renew?

The first clue was puzzling - some services had renewed their certificates successfully (Consul, Nomad), while others (Vault) had failed silently. The cert-renewer systemd service showed no errors, and the timers were running on schedule.

$ goldentooth command_root jast 'systemctl status cert-renewer@vault.timer'
● cert-renewer@vault.timer - Timer for certificate renewal of vault
     Loaded: loaded (/etc/systemd/system/cert-renewer@.timer; enabled)
     Active: active (waiting) since Wed 2025-07-23 14:05:12 EDT; 3h ago

The timer was active, but the certificates were still expired. Something was fundamentally wrong with our renewal logic.

Building a Certificate Renewal Canary

Rather than guessing at the problem, I decided to build proper test infrastructure. The solution was a "canary" service - a minimal certificate renewal setup with extremely short lifetimes that would fail fast and give us rapid feedback.

Creating the Canary Service

I created a new Ansible role goldentooth.setup_cert_renewer_canary that:

  1. Creates a dedicated user and service: cert-canary user with its own systemd service
  2. Uses 15-minute certificate lifetimes: Fast enough to debug quickly
  3. Runs on a 5-minute timer: Same schedule as production services
  4. Provides comprehensive logging: Detailed output for debugging
# roles/goldentooth.setup_cert_renewer_canary/defaults/main.yaml
cert_canary:
  username: cert-canary
  group: cert-canary
  cert_lifetime: 15m
  cert_path: /opt/cert-canary/certs/tls.crt
  key_path: /opt/cert-canary/certs/tls.key

The canary service template includes detailed logging:

[Unit]
Description=Certificate Canary Service
After=network-online.target

[Service]
Type=oneshot
User=cert-canary
WorkingDirectory=/opt/cert-canary
ExecStart=/bin/echo "Certificate canary service executed successfully"

Discovering the Root Cause

With the canary in place, I could observe the renewal process in real-time. The breakthrough came when I examined the step certificate needs-renewal command more carefully.

The 66% Threshold Problem

The default cert-renewer configuration uses a 66% threshold for renewal - certificates renew when they have less than 66% of their lifetime remaining. For 24-hour certificates, this means renewal occurs when there are about 8 hours left.

But here's the critical issue: with a 5-minute timer interval, there's only a narrow window for successful renewal. If the renewal fails during that window (due to network issues, service restarts, etc.), the next attempt won't occur until the timer fires again.

The math was unforgiving:

  • 24-hour certificate: 66% threshold = ~8 hour renewal window
  • 5-minute timer: 12 attempts per hour
  • Network/service instability: Occasional renewal failures
  • Result: Certificates could expire if multiple renewal attempts failed in succession

The Solution: Environment Variable Configuration

The fix involved making the cert-renewer system more configurable and robust. I updated the base cert-renewer@.service template to support environment variable overrides:

[Unit]
Description=Certificate renewer for %I
After=network-online.target
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
StartLimitIntervalSec=0

[Service]
Type=oneshot
User=root
Environment=STEPPATH=/etc/step-ca
Environment=CERT_LOCATION=/etc/step/certs/%i.crt
Environment=KEY_LOCATION=/etc/step/certs/%i.key
Environment=EXPIRES_IN_THRESHOLD=66%

ExecCondition=/usr/bin/step certificate needs-renewal ${CERT_LOCATION} --expires-in ${EXPIRES_IN_THRESHOLD}
ExecStart=/usr/bin/step ca renew --force ${CERT_LOCATION} ${KEY_LOCATION}
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active %i.service || systemctl try-reload-or-restart %i.service"

[Install]
WantedBy=multi-user.target

Service-Specific Overrides

Each service now gets its own override configuration that specifies the exact certificate paths and renewal behavior:

# /etc/systemd/system/cert-renewer@vault.service.d/override.conf
[Service]
Environment=CERT_LOCATION=/opt/vault/tls/tls.crt
Environment=KEY_LOCATION=/opt/vault/tls/tls.key
WorkingDirectory=/opt/vault/tls

ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active vault.service || systemctl try-reload-or-restart vault.service"

The beauty of this approach is that we can now tune renewal behavior per service without modifying the base template.

Comprehensive Monitoring Infrastructure

While debugging the certificate issue, I also built comprehensive monitoring dashboards and alerting to prevent future incidents.

New Grafana Dashboards

I created three major monitoring dashboards:

  1. Slurm Cluster Overview: Job queue metrics, resource utilization, historical trends
  2. HashiCorp Services Overview: Consul health, Vault status, Nomad allocation monitoring
  3. Infrastructure Health Overview: Node uptime, storage capacity, network metrics

Enhanced Metrics Collection

The monitoring improvements included:

  • Vector Internal Metrics: Enabled Vector's internal metrics and Prometheus exporter
  • Certificate Expiration Tracking: Automated monitoring of certificate days-remaining
  • Service Health Indicators: Real-time status for all critical cluster services
  • Alert Rules: Proactive notifications for certificate expiration and service failures

Testing Infrastructure Improvements

The certificate renewal investigation led to significant improvements in our testing infrastructure.

Certificate-Aware Test Suite

I created a comprehensive test_certificate_renewal role that:

  1. Node-Specific Testing: Only tests certificates for services actually deployed on each node
  2. Multi-Layered Validation: Certificate presence, validity, timer status, renewal capability
  3. Chain Validation: Verifies certificates against the cluster CA
  4. Canary Health Monitoring: Tracks the certificate canary's renewal cycles

Smart Service Filtering

The test improvements included "intelligent" service filtering:

# Filter services to only those deployed on this node
- name: Filter services for current node
  set_fact:
    node_certificate_services: |-
      {%- set filtered_services = [] -%}
      {%- for service in certificate_services -%}
        {%- set should_include = false -%}
        {%- if service.get('specific_hosts') -%}
          {%- if inventory_hostname in service.specific_hosts -%}
            {%- set should_include = true -%}
          {%- endif -%}
        {%- elif service.host_groups -%}
          {%- for group in service.host_groups -%}
            {%- if inventory_hostname in groups.get(group, []) -%}
              {%- set should_include = true -%}
            {%- endif -%}
          {%- endfor -%}
        {%- endif -%}
        {%- if should_include -%}
          {%- set _ = filtered_services.append(service) -%}
        {%- endif -%}
      {%- endfor -%}
      {{ filtered_services }}

This eliminated false positives where tests were failing for missing certificates on nodes where services weren't supposed to be running.

Nextflow Workflow Management System

Overview

After successfully establishing a robust Slurm HPC cluster with comprehensive monitoring and observability, the next logical step was to add a modern workflow management system. Nextflow provides a powerful solution for data-intensive computational pipelines, enabling scalable and reproducible scientific workflows using software containers.

This chapter documents the installation and integration of Nextflow 24.10.0 with the existing Slurm cluster, complete with Singularity container support, shared storage integration, and module system configuration.

The Challenge

While our Slurm cluster was fully functional for individual job submission, we lacked a sophisticated workflow management system that could:

  • Orchestrate Complex Pipelines: Chain multiple computational steps with dependency management
  • Provide Reproducibility: Ensure consistent results across different execution environments
  • Support Containers: Leverage containerized software for portable and consistent environments
  • Integrate with Slurm: Seamlessly submit jobs to our existing cluster scheduler
  • Enable Scalability: Automatically parallelize workflows across cluster nodes

Modern bioinformatics and data science workflows often require hundreds of interconnected tasks, making manual job submission impractical and error-prone.

Implementation Approach

The solution involved creating a comprehensive Nextflow installation that integrates deeply with our existing infrastructure:

1. Architecture Design

  • Shared Storage Integration: Install Nextflow on NFS to ensure cluster-wide accessibility
  • Slurm Executor: Configure native Slurm executor for distributed job execution
  • Container Runtime: Leverage existing Singularity installation for reproducible environments
  • Module System: Integrate with Lmod for consistent environment management

2. Installation Strategy

  • Java Runtime: Install OpenJDK 17 as a prerequisite across all compute nodes
  • Centralized Installation: Single installation on shared storage accessible by all nodes
  • Configuration Templates: Create reusable configuration for common workflow patterns
  • Example Workflows: Provide ready-to-run examples for testing and learning

Technical Implementation

New Ansible Role Creation

Created goldentooth.setup_nextflow role with comprehensive installation logic:

# Install Java OpenJDK (required for Nextflow)
- name: 'Install Java OpenJDK (required for Nextflow)'
  ansible.builtin.apt:
    name:
      - 'openjdk-17-jdk'
      - 'openjdk-17-jre'
    state: 'present'

# Download and install Nextflow
- name: 'Download Nextflow binary'
  ansible.builtin.get_url:
    url: "https://github.com/nextflow-io/nextflow/releases/download/v{{ slurm.nextflow_version }}/nextflow"
    dest: "{{ slurm.nfs_base_path }}/nextflow/{{ slurm.nextflow_version }}/nextflow"
    owner: 'slurm'
    group: 'slurm'
    mode: '0755'

Slurm Executor Configuration

Created comprehensive Nextflow configuration optimized for our cluster:

// Nextflow Configuration for Goldentooth Cluster
process {
    executor = 'slurm'
    queue = 'general'

    // Default resource requirements
    cpus = 1
    memory = '1GB'
    time = '1h'

    // Enable Singularity containers
    container = 'ubuntu:20.04'

    // Process-specific configurations
    withLabel: 'small' {
        cpus = 1
        memory = '2GB'
        time = '30m'
    }

    withLabel: 'large' {
        cpus = 4
        memory = '8GB'
        time = '6h'
    }
}

// Slurm executor configuration
executor {
    name = 'slurm'
    queueSize = 100
    submitRateLimit = '10/1min'

    clusterOptions = {
        "--account=default " +
        "--partition=\${task.queue} " +
        "--job-name=nf-\${task.hash}"
    }
}

Container Integration

Configured Singularity integration for reproducible workflows:

singularity {
    enabled = true
    autoMounts = true
    envWhitelist = 'SLURM_*'

    // Cache directory on shared storage
    cacheDir = '/mnt/nfs/slurm/singularity/cache'

    // Mount shared directories
    runOptions = '--bind /mnt/nfs/slurm:/mnt/nfs/slurm'
}

Module System Integration

Extended the existing Lmod system with a Nextflow module:

-- Nextflow Workflow Management System
whatis("Nextflow workflow management system 24.10.0")

-- Load required Java module (dependency)
depends_on("java/17")

-- Add Nextflow to PATH
prepend_path("PATH", "/mnt/nfs/slurm/nextflow/24.10.0")

-- Set Nextflow environment variables
setenv("NXF_HOME", "/mnt/nfs/slurm/nextflow/24.10.0")
setenv("NXF_WORK", "/mnt/nfs/slurm/nextflow/workspace")

-- Enable Singularity by default
setenv("NXF_SINGULARITY_CACHEDIR", "/mnt/nfs/slurm/singularity/cache")

Example Pipeline

Created a comprehensive hello-world pipeline demonstrating cluster integration:

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

// Pipeline parameters
params {
    greeting = 'Hello'
    names = ['World', 'Goldentooth', 'Slurm', 'Nextflow']
    output_dir = './results'
}

process sayHello {
    tag "$name"
    label 'small'
    publishDir params.output_dir, mode: 'copy'

    container 'ubuntu:20.04'

    input:
    val name

    output:
    path "${name}_greeting.txt"

    script:
    """
    echo "${params.greeting} ${name}!" > ${name}_greeting.txt
    echo "Running on node: \$(hostname)" >> ${name}_greeting.txt
    echo "Slurm Job ID: \${SLURM_JOB_ID:-'Not running under Slurm'}" >> ${name}_greeting.txt
    """
}

workflow {
    names_ch = Channel.fromList(params.names)
    greetings_ch = sayHello(names_ch)

    workflow.onComplete {
        log.info "Pipeline completed successfully!"
        log.info "Results saved to: ${params.output_dir}"
    }
}

Deployment Results

Installation Success

The deployment was executed successfully across all Slurm compute nodes:

cd /Users/nathan/Projects/goldentooth/ansible
ansible-playbook -i inventory/hosts playbooks/setup_nextflow.yaml --limit slurm_compute

Installation Summary:

  • Java OpenJDK 17 installed on 9 compute nodes
  • Nextflow 24.10.0 downloaded and configured
  • Slurm executor configured with resource profiles
  • Singularity integration enabled with shared cache
  • Module file created and integrated with Lmod
  • Example pipeline deployed and tested

Verification Output

Nextflow Installation Test:
      N E X T F L O W
      version 24.10.0 build 5928
      created 27-10-2024 18:36 UTC (14:36 GMT-04:00)
      cite doi:10.1038/nbt.3820
      http://nextflow.io

Installation paths:
- Nextflow: /mnt/nfs/slurm/nextflow/24.10.0
- Config: /mnt/nfs/slurm/nextflow/24.10.0/nextflow.config
- Examples: /mnt/nfs/slurm/nextflow/24.10.0/examples
- Workspace: /mnt/nfs/slurm/nextflow/workspace

Configuration Management

Usage Workflow

Users can now access Nextflow through the module system:

# Load the Nextflow environment
module load Nextflow/24.10.0

# Run the example pipeline
nextflow run /mnt/nfs/slurm/nextflow/24.10.0/examples/hello-world.nf

# Run with development profile (reduced resources)
nextflow run pipeline.nf -profile dev

# Run with custom configuration
nextflow run pipeline.nf -c custom.config

Prometheus Node Exporter Migration: From Kubernetes to Native

The Problem

While working on Grafana dashboard configuration, I discovered that the node exporter dashboard was completely empty - no metrics, no data, just a sad empty dashboard that looked like it had given up on life.

The issue? Our Prometheus Node Exporter was deployed via Kubernetes and Argo CD, but Prometheus itself was running as a systemd service on allyrion. The Kubernetes deployment created a ClusterIP service at 172.16.12.161:9100, but Prometheus (running outside the cluster) couldn't reach this internal Kubernetes service.

Meanwhile, Prometheus was configured to scrape node exporters directly at each node's IP on port 9100 (e.g., 10.4.0.11:9100), but nothing was listening there because the actual exporters were only accessible through the Kubernetes service mesh.

The Solution: Raw-dogging Node Exporter

Time to embrace the chaos and deploy node exporter directly on the nodes as systemd services. Sometimes the simplest solution is the best solution.

Step 1: Create the Ansible Playbook

First, I created a new playbook to deploy node exporter cluster-wide using the same prometheus.prometheus.node_exporter role that HAProxy was already using:

# ansible/playbooks/setup_node_exporter.yaml
# Description: Setup Prometheus Node Exporter on all cluster nodes.

- name: 'Setup Prometheus Node Exporter.'
  hosts: 'all'
  remote_user: 'root'
  roles:
    - { role: 'prometheus.prometheus.node_exporter' }
  handlers:
    - name: 'Restart Node Exporter.'
      ansible.builtin.service:
        name: 'node_exporter'
        state: 'restarted'
        enabled: true

Step 2: Deploy via Goldentooth CLI

Thanks to the goldentooth CLI's fallback behavior (it automatically runs Ansible playbooks with matching names), deployment was as simple as:

goldentooth setup_node_exporter

This installed node exporter on all 13 cluster nodes, creating:

  • node-exp system user and group
  • /usr/local/bin/node_exporter binary
  • /etc/systemd/system/node_exporter.service systemd service
  • /var/lib/node_exporter textfile collector directory

Step 3: Handle Port Conflicts

The deployment initially failed on most nodes with "address already in use" errors. The Kubernetes node exporter pods were still running and had claimed port 9100.

Investigation revealed the conflict:

goldentooth command bettley "journalctl -u node_exporter --no-pager -n 10"
# Error: listen tcp 0.0.0.0:9100: bind: address already in use

Step 4: Clean Up Kubernetes Deployment

I removed the Kubernetes deployment entirely:

# Delete the daemonset and namespace
kubectl delete daemonset prometheus-node-exporter -n prometheus-node-exporter
kubectl delete namespace prometheus-node-exporter

# Delete the Argo CD applications managing this
kubectl delete application prometheus-node-exporter gitops-repo-prometheus-node-exporter -n argocd

# Delete the GitHub repository (to prevent ApplicationSet from recreating it)
gh repo delete goldentooth/prometheus-node-exporter --yes

Step 5: Restart Failed Services

With the port conflicts resolved, I restarted the systemd services:

goldentooth command bettley,dalt "systemctl restart node_exporter"

All nodes now showed healthy node exporter services:

● node_exporter.service - Prometheus Node Exporter
     Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled)
     Active: active (running) since Wed 2025-07-23 19:36:30 EDT; 7s ago

Step 6: Reload Prometheus

With native node exporters now listening on port 9100 on all nodes, I reloaded Prometheus to pick up the new targets:

goldentooth command allyrion "systemctl reload prometheus"

Verified metrics were accessible:

goldentooth command allyrion "curl -s http://10.4.0.11:9100/metrics | grep node_cpu_seconds_total | head -3"
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.42238869e+06

The Result

Within minutes, the Grafana node exporter dashboard came alive with beautiful metrics from all cluster nodes. CPU usage, memory consumption, disk I/O, network statistics - everything was flowing perfectly.

Authelia Authentication Infrastructure

In our quest to provide secure access to the Goldentooth cluster for AI assistants, we needed a robust authentication and authorization solution. This chapter chronicles the implementation of Authelia, a comprehensive authentication server that provides OAuth 2.0 and OpenID Connect capabilities for our cluster services.

The Authentication Challenge

As we began developing the MCP (Model Context Protocol) server to enable AI assistants like Claude Code to interact with our cluster, we faced a critical security requirement: how to provide secure, standards-based authentication without compromising cluster security or creating a poor user experience.

Traditional authentication approaches like API keys or basic authentication felt inadequate for this use case. We needed:

  • Standards-based OAuth 2.0 and OpenID Connect support
  • Multi-factor authentication capabilities
  • Fine-grained authorization policies
  • Integration with our existing Step-CA certificate infrastructure
  • Single Sign-On (SSO) for multiple cluster services

Why Authelia?

After evaluating various authentication solutions, Authelia emerged as the ideal choice for our cluster:

  • Comprehensive Feature Set: OAuth 2.0, OpenID Connect, LDAP, 2FA/MFA support
  • Self-Hosted: No dependency on external authentication providers
  • Lightweight: Perfect for deployment on Raspberry Pi infrastructure
  • Flexible Storage: Supports SQLite for simplicity or PostgreSQL for scale
  • Policy Engine: Fine-grained access control based on users, groups, and resources

Architecture Overview

Authelia fits into our cluster architecture as the central authentication authority:

Claude Code (OAuth Client)
    ↓ OAuth 2.0 Authorization Code Flow
Authelia (auth.services.goldentooth.net)
    ↓ JWT/Token Validation
MCP Server (mcp.services.goldentooth.net)
    ↓ Authenticated API Calls
Goldentooth Cluster Services

The authentication flow follows industry-standard OAuth 2.0 patterns:

  1. Discovery: Client discovers OAuth endpoints via well-known URLs
  2. Authorization: User authenticates with Authelia and grants permissions
  3. Token Exchange: Authorization code exchanged for access/ID tokens
  4. API Access: Bearer tokens used for authenticated MCP requests

Ansible Implementation

Role Structure

The goldentooth.setup_authelia role provides comprehensive deployment automation:

ansible/roles/goldentooth.setup_authelia/
├── defaults/main.yml      # Default configuration variables
├── tasks/main.yml         # Primary deployment tasks
├── templates/             # Configuration templates
│   ├── configuration.yml.j2           # Main Authelia config
│   ├── users_database.yml.j2          # User definitions
│   ├── authelia.service.j2             # Systemd service
│   ├── authelia-consul-service.json.j2 # Consul registration
│   └── cert-renewer@authelia.conf.j2   # Certificate renewal
├── handlers/main.yml      # Service restart handlers
└── CLAUDE.md             # Role documentation

Key Configuration Elements

OIDC Provider Configuration: Authelia acts as a full OpenID Connect provider with pre-configured clients for the MCP server:

identity_providers:
  oidc:
    hmac_secret: {{ authelia_oidc_hmac_secret }}
    clients:
      - client_id: goldentooth-mcp
        client_name: Goldentooth MCP Server
        client_secret: "$argon2id$v=19$m=65536,t=3,p=4$..."
        authorization_policy: one_factor
        redirect_uris:
          - https://mcp.services.{{ authelia_domain }}/callback
        scopes:
          - openid
          - profile
          - email
          - groups
          - offline_access
        grant_types:
          - authorization_code
          - refresh_token

Security Hardening: Multiple layers of security protection:

authentication_backend:
  file:
    password:
      algorithm: argon2id
      iterations: 3
      memory: 65536
      parallelism: 4
      key_length: 32
      salt_length: 16

regulation:
  max_retries: 3
  find_time: 2m
  ban_time: 5m

session:
  secret: {{ authelia_session_secret }}
  expiration: 12h
  inactivity: 45m

Certificate Integration

Authelia integrates seamlessly with our Step-CA infrastructure:

# Generate TLS certificate for Authelia server
step ca certificate \
  "authelia.{{ authelia_domain }}" \
  /etc/authelia/tls.crt \
  /etc/authelia/tls.key \
  --provisioner="default" \
  --san="authelia.{{ authelia_domain }}" \
  --san="auth.services.{{ authelia_domain }}" \
  --not-after='24h' \
  --force

The role also configures automatic certificate renewal through our cert-renewer@authelia.timer service, ensuring continuous operation without manual intervention.

Consul Integration

Authelia registers itself as a service in our Consul service mesh, enabling service discovery and health monitoring:

{
  "service": {
    "name": "authelia",
    "port": 9091,
    "address": "{{ ansible_hostname }}.{{ cluster.node_domain }}",
    "tags": ["authentication", "oauth", "oidc"],
    "check": {
      "http": "https://{{ ansible_hostname }}.{{ cluster.node_domain }}:9091/api/health",
      "interval": "30s",
      "timeout": "10s",
      "tls_skip_verify": false
    }
  }
}

This integration provides:

  • Service Discovery: Other services can locate Authelia via Consul DNS
  • Health Monitoring: Consul tracks Authelia's health status
  • Load Balancing: Support for multiple Authelia instances if needed

User Management and Policies

Default User Configuration

The deployment creates essential user accounts:

users:
  admin:
    displayname: "Administrator"
    password: "$argon2id$v=19$m=65536,t=3,p=4$..."
    email: admin@goldentooth.net
    groups:
      - admins
      - users

  mcp-service:
    displayname: "MCP Service Account"
    password: "$argon2id$v=19$m=65536,t=3,p=4$..."
    email: mcp-service@goldentooth.net
    groups:
      - services

Access Control Policies

Authelia implements fine-grained access control:

access_control:
  default_policy: one_factor
  rules:
    # Public access to health checks
    - domain: "*.{{ authelia_domain }}"
      policy: bypass
      resources:
        - "^/api/health$"

    # Admin resources require two-factor
    - domain: "*.{{ authelia_domain }}"
      policy: two_factor
      subject:
        - "group:admins"

    # Regular user access
    - domain: "*.{{ authelia_domain }}"
      policy: one_factor

Multi-Factor Authentication

Authelia supports multiple 2FA methods out of the box:

TOTP (Time-based One-Time Password):

  • Compatible with Google Authenticator, Authy, 1Password
  • 6-digit codes with 30-second rotation
  • QR code enrollment process

WebAuthn/FIDO2:

  • Hardware security keys (YubiKey, SoloKey)
  • Platform authenticators (TouchID, Windows Hello)
  • Phishing-resistant authentication

Push Notifications (planned):

  • Integration with Duo Security for push-based 2FA
  • SMS fallback for environments without smartphone access

Deployment and Management

Installation Command

Deploy Authelia across the cluster with a single command:

# Deploy to default Authelia nodes
goldentooth setup_authelia

# Deploy to specific node
goldentooth setup_authelia --limit jast

Service Management

Monitor and manage Authelia using familiar systemd commands:

# Check service status
goldentooth command authelia "systemctl status authelia"

# View logs
goldentooth command authelia "journalctl -u authelia -f"

# Restart service
goldentooth command_root authelia "systemctl restart authelia"

# Validate configuration
goldentooth command authelia "/usr/local/bin/authelia validate-config --config /etc/authelia/configuration.yml"

Health Monitoring

Authelia exposes health and metrics endpoints:

  • Health Check: https://auth.goldentooth.net/api/health
  • Metrics: http://auth.goldentooth.net:9959/metrics (Prometheus format)

These endpoints integrate with our monitoring stack (Prometheus, Grafana) for observability.

Security Considerations

Threat Mitigation

Authelia addresses multiple attack vectors:

Session Security:

  • Secure, HTTP-only cookies
  • CSRF protection via state parameters
  • Session timeout and inactivity limits

Rate Limiting:

  • Failed login attempt throttling
  • IP-based temporary bans
  • Progressive delays for repeated failures

Password Security:

  • Argon2id hashing (memory-hard, side-channel resistant)
  • Configurable complexity requirements
  • Protection against timing attacks

Network Security

All Authelia communication is secured:

  • TLS 1.3: All external communications encrypted
  • Certificate Validation: Mutual TLS with cluster CA
  • HSTS: HTTP Strict Transport Security headers
  • Secure Headers: Complete security header suite

Integration with MCP Server

The MCP server integrates with Authelia through standard OAuth 2.0 flows:

OAuth Discovery

The MCP server exposes OAuth discovery endpoints that delegate to Authelia:

#![allow(unused)]
fn main() {
// In http_server.rs
async fn handle_oauth_metadata() -> Result<Response<Full<Bytes>>, Infallible> {
    let discovery = auth.discover_oidc_config().await?;
    let metadata = serde_json::json!({
        "issuer": discovery.issuer,
        "authorization_endpoint": discovery.authorization_endpoint,
        "token_endpoint": discovery.token_endpoint,
        "jwks_uri": discovery.jwks_uri,
        // ... additional OAuth metadata
    });
    Ok(Response::builder()
        .status(StatusCode::OK)
        .header("Content-Type", "application/json")
        .body(Full::new(Bytes::from(metadata.to_string())))
        .unwrap())
}
}

Token Validation

The MCP server validates tokens using both JWT verification and OAuth token introspection:

#![allow(unused)]
fn main() {
async fn validate_token(&self, token: &str) -> AuthResult<Claims> {
    if self.is_jwt_token(token) {
        // JWT validation for ID tokens
        self.validate_jwt_token(token).await
    } else {
        // Token introspection for opaque access tokens
        self.introspect_access_token(token).await
    }
}
}

This dual approach supports both JWT ID tokens and opaque access tokens that Authelia issues.

Performance and Scalability

Resource Utilization

Authelia runs efficiently on Raspberry Pi hardware:

  • Memory: ~50MB RSS under normal load
  • CPU: <1% utilization during authentication flows
  • Storage: SQLite database grows slowly (~10MB for hundreds of users)
  • Network: Minimal bandwidth requirements

Scaling Strategies

For high-availability deployments:

  1. Multiple Instances: Deploy Authelia on multiple nodes with shared database
  2. PostgreSQL Backend: Replace SQLite with PostgreSQL for concurrent access
  3. Redis Sessions: Use Redis for distributed session storage
  4. Load Balancing: HAProxy or similar for request distribution

SeaweedFS Distributed Storage Implementation

With Ceph providing robust block storage for Kubernetes, Goldentooth needed an object storage solution optimized for file-based workloads. SeaweedFS emerged as the perfect complement: a simple, fast distributed file system that excels at handling large numbers of files with minimal operational overhead.

The Architecture Decision

SeaweedFS follows a different philosophy from traditional distributed storage systems. Instead of complex replication schemes, it uses a simple master-volume architecture inspired by Google's Colossus and Facebook's Haystack:

  • Master servers: Coordinate volume assignments with HashiCorp Raft consensus
  • Volume servers: Store actual file data in append-only volumes
  • HA consensus: Raft-based leadership election with automatic failover

Target Deployment

I implemented a high availability cluster using fenn and karstark with true HA clustering:

  • Storage capacity: ~1TB total (491GB + 515GB across dedicated SSDs)
  • Fault tolerance: Automatic failover with zero-downtime leadership transitions
  • Consensus protocol: HashiCorp Raft for distributed coordination
  • Architecture support: Native ARM64 and x86_64 binaries
  • Version: SeaweedFS 3.66 with HA clustering capabilities

Storage Foundation

The SeaweedFS deployment builds on the existing goldentooth.bootstrap_seaweedfs infrastructure:

SSD Preparation

Each storage node gets a dedicated SSD mounted at /mnt/seaweedfs-ssd/:

- name: Format SSD with ext4 filesystem
  ansible.builtin.filesystem:
    fstype: "{{ seaweedfs.filesystem_type }}"
    dev: "{{ seaweedfs.device }}"
    force: true

- name: Set proper ownership on SSD mount
  ansible.builtin.file:
    path: "{{ seaweedfs.mount_path }}"
    owner: "{{ seaweedfs.uid }}"
    group: "{{ seaweedfs.gid }}"
    mode: '0755'
    recurse: true

Directory Structure

The bootstrap creates organized storage directories:

  • /mnt/seaweedfs-ssd/data/ - Volume server storage
  • /mnt/seaweedfs-ssd/master/ - Master server metadata
  • /mnt/seaweedfs-ssd/index/ - Volume indexing
  • /mnt/seaweedfs-ssd/filer/ - Future filer service data

Service Implementation

The goldentooth.setup_seaweedfs role handles the complete service deployment:

Binary Management

Cross-architecture support with automatic download:

- name: Download SeaweedFS binary
  ansible.builtin.get_url:
    url: "https://github.com/seaweedfs/seaweedfs/releases/download/{{ seaweedfs.version }}/linux_arm64.tar.gz"
    dest: "/tmp/seaweedfs-{{ seaweedfs.version }}.tar.gz"
  when: ansible_architecture == "aarch64"

- name: Download SeaweedFS binary (x86_64)
  ansible.builtin.get_url:
    url: "https://github.com/seaweedfs/seaweedfs/releases/download/{{ seaweedfs.version }}/linux_amd64.tar.gz"
    dest: "/tmp/seaweedfs-{{ seaweedfs.version }}.tar.gz"
  when: ansible_architecture == "x86_64"

High Availability Master Configuration

Each node runs a master server with HashiCorp Raft consensus for true HA clustering:

[Unit]
Description=SeaweedFS Master Server
After=network.target
Wants=network.target

[Service]
Type=simple
User=seaweedfs
Group=seaweedfs
ExecStart=/usr/local/bin/weed master \
    -port=9333 \
    -mdir=/mnt/seaweedfs-ssd/master \
    -ip=10.4.x.x \
    -peers=fenn:9333,karstark:9333 \
    -raftHashicorp=true \
    -defaultReplication=001 \
    -volumeSizeLimitMB=1024
Restart=always
RestartSec=5s

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/mnt/seaweedfs-ssd

Volume Server Configuration

Volume servers automatically track the current cluster leader:

[Unit]
Description=SeaweedFS Volume Server
After=network.target seaweedfs-master.service
Wants=network.target

[Service]
Type=simple
User=seaweedfs
Group=seaweedfs
ExecStart=/usr/local/bin/weed volume \
    -port=8080 \
    -dir=/mnt/seaweedfs-ssd/data \
    -max=64 \
    -mserver=fenn:9333,karstark:9333 \
    -ip=10.4.x.x
Restart=always
RestartSec=5s

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/mnt/seaweedfs-ssd

Security Hardening

SeaweedFS services run with comprehensive systemd security constraints:

  • User isolation: Dedicated seaweedfs user (UID/GID 985)
  • Filesystem protection: ProtectSystem=strict with explicit write paths
  • Privilege containment: NoNewPrivileges=yes
  • Process isolation: PrivateTmp=yes and ProtectHome=yes

Deployment Process

The deployment uses serial execution to ensure proper cluster formation:

- name: Enable and start SeaweedFS services
  ansible.builtin.systemd:
    name: "{{ item }}"
    enabled: true
    state: started
    daemon_reload: true
  loop:
    - seaweedfs-master
    - seaweedfs-volume

- name: Wait for SeaweedFS master to be ready
  ansible.builtin.uri:
    url: "http://{{ ansible_default_ipv4.address }}:9333/cluster/status"
    method: GET
  until: master_health_check.status == 200
  retries: 10
  delay: 5

Service Verification

Post-deployment health checks confirm proper operation:

HA Cluster Status

curl http://fenn:9333/cluster/status

Returns cluster topology, current leader, and peer status.

Leadership Monitoring

# Watch leadership changes (healthy flapping every 3 seconds)
watch -n 1 'curl -s http://fenn:9333/cluster/status | jq .Leader'

Volume Server Status

curl http://fenn:8080/status

Shows volume allocation and current master server connections.

Volume Assignment Testing

curl -X POST http://fenn:9333/dir/assign

Demonstrates automatic request routing to the current cluster leader.

High Availability Cluster Status

The SeaweedFS cluster now operates as a true HA system:

  • Raft consensus: HashiCorp Raft manages leadership election and state replication
  • Automatic failover: Zero-downtime master transitions when nodes fail
  • Leadership rotation: Healthy 3-second leadership cycling for load balancing
  • Cluster awareness: Volume servers automatically follow leadership changes
  • Fault tolerance: Cluster recovers gracefully from network partitions
  • Storage capacity: Nearly 1TB with redundancy and automatic replication

Command Integration

SeaweedFS operations integrate with the goldentooth CLI:

# Deploy SeaweedFS cluster
goldentooth setup_seaweedfs

# Check HA cluster status
goldentooth command fenn,karstark "systemctl status seaweedfs-master seaweedfs-volume"

# View cluster leadership and peers
goldentooth command fenn "curl -s http://localhost:9333/cluster/status | jq"

# Monitor leadership changes
goldentooth command fenn "watch -n 1 'curl -s http://localhost:9333/cluster/status | jq .Leader'"

# Monitor storage utilization
goldentooth command fenn,karstark "df -h /mnt/seaweedfs-ssd"

Step-CA Certificate Monitoring Implementation

With the goldentooth cluster now heavily dependent on Step-CA for certificate management across Consul, Vault, Nomad, Grafana, Loki, Vector, HAProxy, Blackbox Exporter, and the newly deployed SeaweedFS distributed storage, we needed comprehensive certificate monitoring to prevent service outages from expired certificates.

The existing certificate monitoring was basic - we had file-based certificate expiry alerts, but lacked the visibility and proactive alerting necessary for enterprise-grade PKI management.

The Monitoring Challenge

Our cluster runs multiple services with Step-CA certificates:

  • Consul: Service mesh certificates for all nodes
  • Vault: Secrets management with HA cluster
  • Nomad: Workload orchestration across the cluster
  • Grafana: Observability dashboard access
  • Loki: Log aggregation infrastructure
  • Vector: Log shipping to Loki
  • HAProxy: Load balancer with TLS termintion
  • Blackbox Exporter: Synthetic monitoring service
  • SeaweedFS: Distributed storage with master/volume servers

Each service has automated certificate renewal via cert-renewer@.service systemd timers, but we needed comprehensive monitoring to ensure the renewal system itself was healthy and catch any failures before they caused outages.

Enhanced Blackbox Monitoring

The first enhancement expanded our synthetic monitoring to include comprehensive TLS validation for all Step-CA services.

SeaweedFS Integration

With SeaweedFS newly deployed as a high-availability distributed storage system, I added its endpoints to blackbox monitoring:

# SeaweedFS Master servers (HA cluster)
- targets:
  - "https://fenn:9333"
  - "https://karstark:9333" 
  labels:
    service: "seaweedfs-master"

# SeaweedFS Volume servers  
- targets:
  - "https://fenn:8080"
  - "https://karstark:8080"
  labels:
    service: "seaweedfs-volume"

Comprehensive TLS Endpoint Monitoring

Every Step-CA managed service now has synthetic TLS validation:

blackbox_https_internal_targets:
  - "https://consul.goldentooth.net:8501"
  - "https://vault.goldentooth.net:8200" 
  - "https://nomad.goldentooth.net:4646"
  - "https://grafana.goldentooth.net:3000"
  - "https://loki.goldentooth.net:3100"
  - "https://vector.goldentooth.net:8686"
  - "https://fenn:9115"  # blackbox exporter itself
  - "https://fenn:9333"  # seaweedfs master
  - "https://karstark:9333"
  - "https://fenn:8080"  # seaweedfs volume
  - "https://karstark:8080"

The blackbox exporter validates not just connectivity, but certificate chain validity, expiry dates, and proper TLS negotiation for each endpoint.

Advanced Prometheus Alert System

The core enhancement was implementing a comprehensive multi-tier alerting system for certificate management.

Certificate Expiry Alerts

I implemented three tiers of certificate expiry warnings:

# 30-day advance warning
- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate expiring in 30 days"
    description: "Certificate for {{ $labels.instance }} expires in 30 days. Plan renewal."

# 7-day critical alert  
- alert: CertificateExpiringCritical
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Certificate expiring in 7 days"
    description: "Certificate for {{ $labels.instance }} expires in 7 days. Immediate attention required."

# 2-day emergency alert
- alert: CertificateExpiringEmergency  
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 2
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Certificate expiring in 2 days"
    description: "Certificate for {{ $labels.instance }} expires in 2 days. Emergency action required."

Certificate Renewal System Monitoring

Beyond expiry monitoring, I added alerts for certificate renewal system health:

# File-based certificate monitoring
- alert: CertificateFileExpiring
  expr: (file_certificate_expiry_seconds - time()) / 86400 < 7
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Certificate file expiring soon"
    description: "Certificate file {{ $labels.path }} expires in less than 7 days"

# Certificate renewal timer failure
- alert: CertificateRenewalTimerFailed
  expr: systemd_timer_last_trigger_seconds{name=~"cert-renewer@.*"} < time() - 86400 * 8
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Certificate renewal timer failed"
    description: "Certificate renewal timer {{ $labels.name }} hasn't run in over 8 days"

Step-CA Server Health

Critical infrastructure monitoring for the Step-CA service itself:

# Step-CA service availability
- alert: StepCADown
  expr: up{job="step-ca"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Step-CA server is down"
    description: "Step-CA certificate authority is unreachable"

# TLS endpoint failures
- alert: TLSEndpointDown
  expr: probe_success{job=~"blackbox-https.*"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "TLS endpoint unreachable"
    description: "TLS endpoint {{ $labels.instance }} is unreachable via HTTPS"

Comprehensive Certificate Dashboard

The monitoring enhancement includes a dedicated Grafana dashboard providing complete PKI visibility.

Dashboard Features

The Step-CA Certificate Dashboard displays:

  • Certificate Expiry Timeline: Color-coded visualization showing all certificates with expiry thresholds (green > 30 days, yellow 7-30 days, red < 7 days)
  • TLS Endpoint Status: Real-time status of all HTTPS endpoints monitored via blackbox probes
  • Certificate Renewal Health: Status of systemd renewal timers across all services
  • Step-CA Server Status: Availability and responsiveness of the certificate authority
  • Certificate Inventory: Table showing all managed certificates with expiry dates and renewal status

Dashboard Implementation

- name: Deploy Step-CA certificate monitoring dashboard
  ansible.builtin.copy:
    src: "{{ playbook_dir }}/../grafana-dashboards/step-ca-certificate-dashboard.json"
    dest: "/var/lib/grafana/dashboards/"
    owner: grafana
    group: grafana
    mode: '0644'
  notify: restart grafana

The dashboard provides at-a-glance visibility into the health of the entire PKI infrastructure, with drill-down capabilities to investigate specific certificate issues.

Infrastructure Integration

Enhanced Grafana Role

The Grafana setup role now includes automated dashboard deployment:

- name: Create dashboards directory
  ansible.builtin.file:
    path: "/var/lib/grafana/dashboards"
    state: present
    owner: grafana
    group: grafana
    mode: '0755'

- name: Deploy certificate monitoring dashboard
  ansible.builtin.copy:
    src: "step-ca-certificate-dashboard.json"
    dest: "/var/lib/grafana/dashboards/"
    owner: grafana
    group: grafana
    mode: '0644'
  notify: restart grafana

Prometheus Configuration Updates

The Prometheus alerting rules required careful template escaping for proper alert message formatting:

# Proper Prometheus alert template escaping
annotations:
  summary: "Certificate for {{ "{{ $labels.instance }}" }} expires in 30 days"
  description: "Certificate renewal required for {{ "{{ $labels.instance }}" }}"

Service Targets Configuration

All Step-CA certificate endpoints are now systematically monitored:

blackbox_targets:
  https_internal:
    # Core HashiCorp services
    - "https://consul.goldentooth.net:8501"
    - "https://vault.goldentooth.net:8200"
    - "https://nomad.goldentooth.net:4646"
    
    # Observability stack
    - "https://grafana.goldentooth.net:3000"
    - "https://loki.goldentooth.net:3100"
    - "https://vector.goldentooth.net:8686"
    
    # Infrastructure services
    - "https://fenn:9115"  # blackbox exporter
    
    # SeaweedFS distributed storage
    - "https://fenn:9333"   # seaweedfs master
    - "https://karstark:9333"
    - "https://fenn:8080"   # seaweedfs volume  
    - "https://karstark:8080"

Deployment Results

Monitoring Coverage

The enhanced certificate monitoring now provides:

  • Complete PKI visibility: All 20+ Step-CA certificates monitored
  • Proactive alerting: 30/7/2 day advance warnings prevent surprises
  • System health monitoring: Renewal timer and Step-CA service health tracking
  • Synthetic validation: Real TLS endpoint testing via blackbox probes
  • Centralized dashboard: Single pane of glass for certificate infrastructure

Alert Integration

The alert system provides:

  • Early warning system: 30-day alerts allow planned certificate maintenance
  • Escalating severity: 7-day critical and 2-day emergency alerts ensure attention
  • Renewal system monitoring: Catches failures in automated renewal timers
  • Infrastructure monitoring: Step-CA server availability tracking

Operational Impact

Before this enhancement:

  • Basic file-based certificate expiry alerts
  • Limited visibility into certificate health
  • Potential for service outages from unnoticed certificate expiry
  • Manual certificate status checking required

After implementation:

  • Enterprise-grade certificate lifecycle monitoring
  • Proactive alerting preventing service disruptions
  • Complete synthetic validation of certificate-dependent services
  • Real-time visibility into PKI infrastructure health
  • Automated dashboard providing immediate certificate status overview

Repository Integration

Multi-Repository Changes

The implementation spans two repositories:

goldentooth/ansible: Core infrastructure implementation

  • Enhanced blackbox exporter role with SeaweedFS targets
  • Comprehensive Prometheus alerting rules
  • Improved Grafana role with dashboard deployment
  • Certificate monitoring integration across all Step-CA services

goldentooth/grafana-dashboards: Dashboard repository

  • New Step-CA Certificate Dashboard with complete PKI visibility
  • Dashboard committed for reuse across environments
  • JSON format compatible with Grafana provisioning

Command Integration

Certificate monitoring integrates with goldentooth CLI:

# Deploy enhanced certificate monitoring
goldentooth setup_blackbox_exporter
goldentooth setup_grafana  
goldentooth setup_prometheus

# Check certificate monitoring status
goldentooth command allyrion "systemctl status blackbox-exporter"

# View certificate expiry alerts
goldentooth command allyrion "curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname | contains(\"Certificate\"))'"

# Monitor renewal timers
goldentooth command_all "systemctl list-timers 'cert-renewer@*'"

This comprehensive Step-CA certificate monitoring implementation transforms goldentooth from basic certificate management to enterprise-grade PKI infrastructure with complete lifecycle visibility, proactive alerting, and automated health monitoring. The system now prevents certificate-related service outages through early warning and comprehensive synthetic validation of all certificate-dependent services.

HAProxy Dataplane API Integration

With the cluster's load balancing infrastructure established through our initial HAProxy setup and subsequent revisiting, the next evolution was to enable dynamic configuration management. HAProxy's traditional configuration model requires service restarts for changes, which creates service disruption and doesn't align well with modern infrastructure automation needs.

The HAProxy Dataplane API provides a RESTful interface for runtime configuration management, allowing backend server manipulation, health check configuration, and statistics collection without HAProxy restarts. This capability is essential for automated deployment pipelines and dynamic infrastructure management.

Implementation Strategy

The implementation focused on integrating HAProxy Dataplane API v3.2.1 into the existing goldentooth.setup_haproxy Ansible role while maintaining the cluster's security and operational standards.

Configuration Architecture

The API requires a specific YAML v2 configuration format with a nested structure significantly different from HAProxy's traditional flat configuration:

config_version: 2
haproxy:
  config_file: /etc/haproxy/haproxy.cfg
  userlist: controller
  reload:
    reload_cmd: systemctl reload haproxy
    reload_delay: 5
    restart_cmd: systemctl restart haproxy
name: dataplaneapi
mode: single
resources:
  maps_dir: /etc/haproxy/maps
  ssl_certs_dir: /etc/haproxy/ssl
  general_storage_dir: /etc/haproxy/general
  spoe_dir: /etc/haproxy/spoe
  spoe_transaction_dir: /tmp/spoe-haproxy
  backups_dir: /etc/haproxy/backups
  config_snippets_dir: /etc/haproxy/config_snippets
  acl_dir: /etc/haproxy/acl
  transactions_dir: /etc/haproxy/transactions
user:
  insecure: false
  username: "{{ vault.cluster_credentials.username }}"
  password: "{{ vault.cluster_credentials.password }}"
advertised:
  api_address: 0.0.0.0
  api_port: 5555

This configuration structure enables the API to manage HAProxy through systemd reload commands rather than requiring full restarts, maintaining service availability during configuration changes.

Directory Structure Implementation

The API requires an extensive directory hierarchy for storing various configuration components:

# Primary API configuration
/etc/haproxy-dataplane/

# HAProxy configuration storage
/etc/haproxy/dataplane/
/etc/haproxy/maps/
/etc/haproxy/ssl/
/etc/haproxy/general/
/etc/haproxy/spoe/
/etc/haproxy/acl/
/etc/haproxy/transactions/
/etc/haproxy/config_snippets/
/etc/haproxy/backups/

# Temporary processing
/tmp/spoe-haproxy/

All directories are created with proper ownership (haproxy:haproxy) and permissions to ensure the API service can read and write configuration data while maintaining security boundaries.

HAProxy Configuration Integration

The implementation required specific HAProxy configuration changes to enable API communication:

Master-Worker Mode

global
    master-worker
    
    # Admin socket with proper group permissions
    stats socket /run/haproxy/admin.sock mode 660 level admin group haproxy
    
    # User authentication for API access
    userlist controller
        user {{ vault.cluster_credentials.username }} password {{ vault.cluster_credentials.password }}

The master-worker mode enables the API to communicate with HAProxy's runtime through the admin socket, while the userlist provides authentication for API requests.

Backend Configuration

backend haproxy-dataplane-api
    server dataplane 127.0.0.1:5555 check

This backend configuration allows external access to the API through the existing reverse proxy infrastructure, integrating seamlessly with the cluster's routing patterns.

Systemd Service Implementation

The service configuration prioritizes security while providing necessary filesystem access:

[Unit]
Description=HAProxy Dataplane API
After=network.target haproxy.service
Requires=haproxy.service

[Service]
Type=exec
User=haproxy
Group=haproxy
ExecStart=/usr/local/bin/dataplaneapi --config-file=/etc/haproxy-dataplane/dataplaneapi.yaml
Restart=always
RestartSec=5

# Security settings
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

# Required filesystem access
ReadWritePaths=/etc/haproxy
ReadWritePaths=/etc/haproxy-dataplane
ReadWritePaths=/var/lib/haproxy
ReadWritePaths=/run/haproxy
ReadWritePaths=/tmp/spoe-haproxy

[Install]
WantedBy=multi-user.target

The security-focused configuration uses ProtectSystem=strict with explicit ReadWritePaths declarations, ensuring the service has access only to required directories while maintaining system protection.

Problem Resolution Process

The implementation encountered several configuration challenges that required systematic debugging:

YAML Configuration Format Issues

Problem: Initial configuration used HAProxy's flat format rather than the required nested YAML v2 structure.

Solution: Implemented proper config_version: 2 with nested haproxy: sections and structured resource directories.

Socket Permission Problems

Problem: HAProxy admin socket was inaccessible to the dataplane API service.

ERRO[0000] error fetching configuration: dial unix /run/haproxy/admin.sock: connect: permission denied

Solution: Added group haproxy to the HAProxy socket configuration, allowing the dataplane API service running as the haproxy user to access the socket.

Directory Permission Resolution

Problem: Multiple permission denied errors for various storage directories.

ERRO[0000] Cannot create dir /etc/haproxy/maps: mkdir /etc/haproxy/maps: permission denied

Solution: Systematically created all required directories with proper ownership:

- name: Create HAProxy dataplane directories
  file:
    path: "{{ item }}"
    state: directory
    owner: haproxy
    group: haproxy
    mode: '0755'
  loop:
    - /etc/haproxy/dataplane
    - /etc/haproxy/maps
    - /etc/haproxy/ssl
    - /etc/haproxy/general
    - /etc/haproxy/spoe
    - /etc/haproxy/acl
    - /etc/haproxy/transactions
    - /etc/haproxy/config_snippets
    - /etc/haproxy/backups
    - /tmp/spoe-haproxy

Filesystem Write Access

Problem: The /etc/haproxy directory was read-only for the haproxy user, preventing configuration updates.

Solution: Modified directory ownership and permissions to allow write access while maintaining security:

chgrp haproxy /etc/haproxy
chmod g+w /etc/haproxy

Service Integration and Accessibility

The API integrates with the cluster's existing infrastructure patterns:

  • Service Discovery: Available at https://haproxy-api.services.goldentooth.net
  • Authentication: Uses cluster credentials for API access
  • Monitoring: Integrated with existing health check patterns
  • Security: TLS termination through existing certificate management

Operational Capabilities

The successful implementation enables several advanced load balancer management capabilities:

Dynamic Backend Management

# Add backend servers without HAProxy restart
curl -X POST https://haproxy-api.services.goldentooth.net/v3/services/haproxy/configuration/servers \
  -d '{"name": "new-server", "address": "10.4.1.10", "port": 8080}'

# Modify server weights for traffic distribution
curl -X PUT https://haproxy-api.services.goldentooth.net/v3/services/haproxy/configuration/servers/web1 \
  -d '{"weight": 150}'

Health Check Configuration

# Configure health checks dynamically
curl -X PUT https://haproxy-api.services.goldentooth.net/v3/services/haproxy/configuration/backends/web \
  -d '{"health_check": {"uri": "/health", "interval": "5s"}}'

Runtime Statistics and Monitoring

The API provides comprehensive runtime statistics and configuration state information, enabling advanced monitoring and automated decision-making for infrastructure management.

Current Status and Integration

The HAProxy Dataplane API is now:

  • Active and stable on the allyrion load balancer node
  • Listening on port 5555 with proper systemd management
  • Responding to HTTP API requests with full functionality
  • Integrated with HAProxy through the admin socket interface
  • Accessible externally via the configured domain endpoint
  • Authenticated using cluster credential standards

This implementation represents a significant enhancement to the cluster's load balancing capabilities, moving from static configuration management to dynamic, API-driven infrastructure control. The systematic approach to troubleshooting configuration issues demonstrates the methodical problem-solving required for complex infrastructure integration while maintaining operational security and reliability standards.

Dynamic Service Discovery with Consul + HAProxy Dataplane API

Building upon our HAProxy Dataplane API integration, the next architectural evolution was implementing dynamic service discovery. This transformation moved the cluster away from static backend configurations toward a fully dynamic, Consul-driven service mesh architecture where services can relocate between nodes without manual load balancer reconfiguration.

The Static Configuration Problem

Traditional HAProxy configurations require explicit backend server definitions:

backend grafana-backend
    server grafana1 10.4.1.15:3000 check ssl verify none
    server grafana2 10.4.1.16:3000 check ssl verify none backup

This approach creates several operational challenges:

  • Manual Updates: Adding or removing services requires HAProxy configuration changes
  • Node Dependencies: Services tied to specific IP addresses can't migrate freely
  • Health Check Duplication: Both HAProxy and service discovery systems monitor health
  • Configuration Drift: Static configurations become outdated as infrastructure evolves

Dynamic Service Discovery Architecture

The new implementation leverages Consul's service registry with HAProxy Dataplane API's dynamic backend creation:

Service Registration → Consul Service Registry → HAProxy Dataplane API → Dynamic Backends

Core Components

  1. Consul Service Registry: Central service discovery database
  2. Service Registration Template: Reusable Ansible template for consistent service registration
  3. HAProxy Dataplane API: Dynamic backend management interface
  4. Service-to-Backend Mappings: Configuration linking Consul services to HAProxy backends

Implementation: Reusable Service Registration Template

The foundation of dynamic service discovery is the consul-service-registration.json.j2 template in the goldentooth.setup_consul role:

{
  "name": "{{ consul_service_name }}",
  "id": "{{ consul_service_name }}-{{ ansible_hostname }}",
  "address": "{{ consul_service_address | default(ipv4_address) }}",
  "port": {{ consul_service_port }},
  "tags": {{ consul_service_tags | default(['goldentooth']) | to_json }},
  "meta": {
    "version": "{{ consul_service_version | default('unknown') }}",
    "environment": "{{ consul_service_environment | default('production') }}",
    "service_type": "{{ consul_service_type | default('application') }}",
    "cluster": "goldentooth",
    "hostname": "{{ ansible_hostname }}",
    "protocol": "{{ consul_service_protocol | default('http') }}",
    "path": "{{ consul_service_health_path | default('/') }}"
  },
  "checks": [
    {
      "id": "{{ consul_service_name }}-http-health",
      "name": "{{ consul_service_name | title }} HTTP Health Check",
      "http": "{{ consul_service_health_http }}",
      "method": "{{ consul_service_health_method | default('GET') }}",
      "interval": "{{ consul_service_health_interval | default('30s') }}",
      "timeout": "{{ consul_service_health_timeout | default('10s') }}",
      "status": "passing"
    }
  ]
}

This template provides:

  • Standardized Service Registration: Consistent metadata and health check patterns
  • Flexible Health Checks: HTTP and TCP checks with configurable endpoints
  • Rich Metadata: Protocol, version, and environment information for routing decisions
  • Health Check Integration: Native Consul health monitoring replacing static HAProxy checks

Service Integration Patterns

Grafana Service Registration

The goldentooth.setup_grafana role demonstrates the integration pattern:

- name: Register Grafana with Consul
  include_role:
    name: goldentooth.setup_consul
    tasks_from: register_service
  vars:
    consul_service_name: grafana
    consul_service_port: 3000
    consul_service_tags:
      - monitoring
      - dashboard
      - goldentooth
      - https
    consul_service_type: monitoring
    consul_service_protocol: https
    consul_service_health_path: /api/health
    consul_service_health_http: "https://{{ ipv4_address }}:3000/api/health"
    consul_service_health_tls_skip_verify: true

This registration creates a Grafana service entry in Consul with:

  • HTTPS Health Checks: Direct validation of Grafana's API endpoint
  • Service Metadata: Rich tagging for service discovery and routing
  • TLS Configuration: Proper SSL handling for encrypted services

Service-Specific Health Check Endpoints

Each service uses appropriate health check endpoints:

  • Grafana: /api/health - Grafana's built-in health endpoint
  • Prometheus: /-/healthy - Standard Prometheus health check
  • Loki: /ready - Loki readiness endpoint
  • MCP Server: /health - Custom health endpoint

HAProxy Dataplane API Configuration

The dataplaneapi.yaml.j2 template defines service-to-backend mappings:

service_discovery:
  consuls:
    - address: 127.0.0.1:8500
      enabled: true
      services:
        - name: grafana
          backend_name: consul-grafana
          mode: http
          balance: roundrobin
          check: enabled
          check_ssl: enabled
          check_path: /api/health
          ssl: enabled
          ssl_verify: none
          
        - name: prometheus  
          backend_name: consul-prometheus
          mode: http
          balance: roundrobin
          check: enabled
          check_path: /-/healthy
          
        - name: loki
          backend_name: consul-loki
          mode: http
          balance: roundrobin
          check: enabled
          check_ssl: enabled
          check_path: /ready
          ssl: enabled
          ssl_verify: none

This configuration:

  • Maps Consul Services: Links service registry entries to HAProxy backends
  • Configures SSL Settings: Handles HTTPS services with appropriate SSL verification
  • Defines Load Balancing: Sets algorithm and health check behavior per service
  • Creates Dynamic Backends: Automatically generates consul-* backend names

Frontend Routing Transformation

HAProxy frontend configuration transitioned from static to dynamic backends:

Before: Static Backend References

frontend goldentooth-services
  use_backend grafana-backend if { hdr(host) -i grafana.services.goldentooth.net }
  use_backend prometheus-backend if { hdr(host) -i prometheus.services.goldentooth.net }

After: Dynamic Backend References

frontend goldentooth-services
  use_backend consul-grafana if { hdr(host) -i grafana.services.goldentooth.net }
  use_backend consul-prometheus if { hdr(host) -i prometheus.services.goldentooth.net }
  use_backend consul-loki if { hdr(host) -i loki.services.goldentooth.net }
  use_backend consul-mcp-server if { hdr(host) -i mcp.services.goldentooth.net }

The consul-* naming convention distinguishes dynamically managed backends from static ones.

Multi-Service Role Implementation

Each service role now includes Consul registration:

Prometheus Registration

- name: Register Prometheus with Consul
  include_role:
    name: goldentooth.setup_consul
    tasks_from: register_service
  vars:
    consul_service_name: prometheus
    consul_service_port: 9090
    consul_service_health_http: "http://{{ ipv4_address }}:9090/-/healthy"

Loki Registration

- name: Register Loki with Consul
  include_role:
    name: goldentooth.setup_consul
    tasks_from: register_service
  vars:
    consul_service_name: loki
    consul_service_port: 3100
    consul_service_health_http: "https://{{ ipv4_address }}:3100/ready"
    consul_service_health_tls_skip_verify: true

MCP Server Registration

- name: Register MCP Server with Consul
  include_role:
    name: goldentooth.setup_consul
    tasks_from: register_service
  vars:
    consul_service_name: mcp-server
    consul_service_port: 3001
    consul_service_health_http: "http://{{ ipv4_address }}:3001/health"

Technical Benefits

Service Mobility

Services can now migrate between nodes without load balancer reconfiguration. When a service starts on a different node, it registers with Consul, and HAProxy automatically updates backend server lists.

Health Check Integration

Consul's health checking replaces static HAProxy health checks, providing:

  • Centralized Health Monitoring: Single source of truth for service health
  • Rich Health Check Types: HTTP, TCP, script-based, and TTL checks
  • Health Check Inheritance: HAProxy backends inherit health status from Consul

Configuration Simplification

Static backend definitions are eliminated, reducing HAProxy configuration complexity and maintenance overhead.

Service Discovery Foundation

The implementation establishes patterns for:

  • Service Registration: Standardized across all cluster services
  • Health Check Consistency: Uniform health monitoring approaches
  • Metadata Management: Rich service information for advanced routing
  • Dynamic Backend Naming: Clear separation between static and dynamic backends

Operational Impact

Deployment Flexibility

Services can be deployed to any cluster node without infrastructure configuration changes. The service registers itself with Consul, and HAProxy automatically includes it in load balancing.

Zero-Downtime Updates

Service updates can leverage Consul's health checking for gradual rollouts. Unhealthy instances are automatically removed from load balancing until they pass health checks.

Monitoring Integration

Consul's web UI provides real-time service health visualization, complementing existing Prometheus/Grafana monitoring infrastructure.

Future Service Mesh Evolution

This implementation represents the foundation for comprehensive service mesh architecture:

  • Additional Service Registration: Extending dynamic discovery to all cluster services
  • Advanced Routing: Consul metadata-based traffic routing and service versioning
  • Security Integration: Service-to-service authentication and authorization
  • Circuit Breaking: Automated failure handling and traffic management

The transformation from static to dynamic service discovery fundamentally changes how the Goldentooth cluster manages service routing, establishing patterns that will support continued infrastructure evolution and automation.

SeaweedFS Pi 5 Migration and CSI Integration

After the successful initial SeaweedFS deployment on the Pi 4B nodes (fenn and karstark), a significant hardware upgrade opportunity arose. Four new Raspberry Pi 5 nodes with 1TB NVMe SSDs had joined the cluster: Manderly, Norcross, Oakheart, and Payne. This chapter chronicles the complete migration of the SeaweedFS distributed storage system to these more powerful nodes and the resolution of critical clustering issues that enabled full Kubernetes CSI integration.

The New Hardware Foundation

Meet the Storage Powerhouses

The four new Pi 5 nodes represent a massive upgrade in storage capacity and performance:

  • Manderly (10.4.0.22) - 1TB NVMe SSD via PCIe
  • Norcross (10.4.0.23) - 1TB NVMe SSD via PCIe
  • Oakheart (10.4.0.24) - 1TB NVMe SSD via PCIe
  • Payne (10.4.0.25) - 1TB NVMe SSD via PCIe

Total Raw Capacity: 4TB across four nodes (vs. ~1TB across two Pi 4B nodes)

Performance Characteristics

The Pi 5 + NVMe combination delivers substantial improvements:

  • Storage Interface: PCIe NVMe vs. USB 3.0 SSD
  • Sequential Read/Write: ~400MB/s vs. ~100MB/s
  • Random IOPS: 10x improvement for small file operations
  • CPU Performance: Cortex-A76 vs. Cortex-A72 cores
  • Memory: 8GB LPDDR4X vs. 4GB on old nodes

Migration Strategy

Cluster Topology Decision

Rather than attempt in-place migration, the decision was made to completely rebuild the SeaweedFS cluster on the new hardware. This approach provided:

  1. Clean Architecture: No legacy configuration artifacts
  2. Improved Topology: Optimize for 4-node distributed storage
  3. Zero Downtime: Keep old cluster running during migration
  4. Rollback Safety: Ability to revert if issues arose

Node Role Assignment

The four Pi 5 nodes were configured with hybrid roles to maximize both performance and fault tolerance:

  • Masters: Manderly, Norcross, Oakheart (3-node Raft consensus)
  • Volume Servers: All four nodes (maximizing storage capacity)

This design provides proper Raft consensus with an odd number of masters while utilizing all available storage capacity.

The Critical Discovery: Raft Consensus Requirements

The Leadership Election Problem

The initial migration attempt using all four nodes as masters immediately revealed a fundamental issue:

F0804 21:16:33.246267 master.go:285 Only odd number of masters are supported:
[10.4.0.22:9333 10.4.0.23:9333 10.4.0.24:9333 10.4.0.25:9333]

SeaweedFS requires an odd number of masters for Raft consensus. This is a fundamental requirement of distributed consensus algorithms to avoid split-brain scenarios where no majority can be established.

The Mathematics of Consensus

With 4 masters:

  • Split scenarios: 2-2 splits prevent majority formation
  • Leadership impossible: No node can achieve >50% votes
  • Cluster paralysis: "Leader not selected yet" errors continuously

With 3 masters:

  • Majority possible: 2 out of 3 can form majority
  • Fault tolerance: 1 node failure still allows operation
  • Clear leadership: Proper Raft election cycles

Infrastructure Template Updates

Fixing Hardcoded Configurations

The migration revealed template issues that needed correction:

Dynamic Peer Discovery

# Before (hardcoded)
-peers=fenn:9333,karstark:9333

# After (dynamic)
-peers={% for host in groups['seaweedfs'] %}{{ host }}:9333{% if not loop.last %},{% endif %}{% endfor %}

Consul Service Template Fix

{
  "peer_addresses": "{% for host in groups['seaweedfs'] %}{{ host }}:9333{% if not loop.last %},{% endif %}{% endfor %}"
}

Removing Problematic Parameters

The -ip= parameter in master service templates was causing duplicate peer entries:

# Problematic configuration
ExecStart=/usr/local/bin/weed master \
    -port=9333 \
    -mdir=/mnt/seaweedfs-nvme/master \
    -peers=manderly:9333,norcross:9333,oakheart:9333 \
    -ip=10.4.0.22 \                    # <-- This caused duplicates
    -raftHashicorp=true

# Clean configuration
ExecStart=/usr/local/bin/weed master \
    -port=9333 \
    -mdir=/mnt/seaweedfs-nvme/master \
    -peers=manderly:9333,norcross:9333,oakheart:9333 \
    -raftHashicorp=true

Kubernetes CSI Integration Challenge

The DNS Resolution Problem

With the SeaweedFS cluster running on bare metal and Kubernetes CSI components running in pods, a networking challenge emerged:

Problem: Kubernetes pods couldn't resolve SeaweedFS node hostnames because they exist outside the cluster DNS.

Solution: Kubernetes Services with explicit Endpoints to bridge the DNS gap.

Service-Based DNS Resolution

# Headless service for each SeaweedFS node
apiVersion: v1
kind: Service
metadata:
  name: manderly
  namespace: default
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: master
    port: 9333
  - name: volume
    port: 8080
---
# Explicit endpoint mapping
apiVersion: v1
kind: Endpoints
metadata:
  name: manderly
  namespace: default
subsets:
- addresses:
  - ip: 10.4.0.22
  ports:
  - name: master
    port: 9333
  - name: volume
    port: 8080

This approach allows the SeaweedFS filer (running in Kubernetes) to connect to the bare metal cluster using service names like manderly:9333.

Migration Execution

Phase 1: Infrastructure Preparation

# Update inventory to reflect new nodes
goldentooth edit_vault
# Configure new SeaweedFS group with Pi 5 nodes

# Clean deployment of storage infrastructure
goldentooth cleanup_old_storage
goldentooth setup_seaweedfs

Phase 2: Cluster Formation with Proper Topology

# Deploy 3-master configuration
goldentooth command_root manderly,norcross,oakheart "systemctl start seaweedfs-master"

# Verify leadership election
curl http://10.4.0.22:9333/dir/status

# Start volume servers on all nodes
goldentooth command_root manderly,norcross,oakheart,payne "systemctl start seaweedfs-volume"

Phase 3: Kubernetes Integration

# Deploy DNS bridge services
kubectl apply -f seaweedfs-services-endpoints.yaml

# Deploy and verify filer
kubectl get pods -l app=seaweedfs-filer
kubectl logs seaweedfs-filer-xxx | grep "Start Seaweed Filer"

Verification and Testing

Cluster Health Verification

# Leadership confirmation
curl http://10.4.0.22:9333/cluster/status
# Returns proper topology with elected leader

# Service status across all nodes
goldentooth command manderly,norcross,oakheart,payne "systemctl status seaweedfs-master seaweedfs-volume"

CSI Integration Testing

# Test PVC creation
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: seaweedfs-test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: seaweedfs-storage

Result: Successful dynamic volume provisioning with NFS-style mounting via seaweedfs-filer:8888:/buckets/pvc-xxx.

End-to-End Functionality

# Pod with mounted SeaweedFS volume
kubectl exec test-pod -- df -h /data
# Filesystem: seaweedfs-filer:8888:/buckets/pvc-xxx Size: 512M

# File I/O verification
kubectl exec test-pod -- touch /data/test-file
kubectl exec test-pod -- ls -la /data/
# Files persist across pod restarts via distributed storage

Final Architecture

Cluster Topology

  • Masters: 3 nodes (Manderly, Norcross, Oakheart) with Raft consensus
  • Volume Servers: 4 nodes (all Pi 5s) with 1TB NVMe each
  • Total Capacity: ~3.6TB usable distributed storage
  • Fault Tolerance: Can survive 1 master failure + multiple volume server failures
  • Performance: NVMe speeds with distributed redundancy

Integration Status

  • Kubernetes CSI: Dynamic volume provisioning working
  • DNS Resolution: Service-based hostname resolution
  • Leadership Election: Stable Raft consensus
  • Filer Services: HTTP/gRPC endpoints operational
  • Volume Mounting: NFS-style filesystem access
  • High Availability: Multi-node fault tolerance

Monitoring Integration

SeaweedFS metrics integrate with the existing Goldentooth observability stack:

  • Prometheus: Master and volume server metrics collection
  • Grafana: Storage capacity and performance dashboards
  • Consul: Service discovery and health monitoring
  • Step-CA: TLS certificate management for secure communications

Performance Impact

Storage Capacity Comparison

MetricOld Cluster (Pi 4B)New Cluster (Pi 5)Improvement
Total Capacity~1TB~3.6TB3.6x
Node Count242x
Per-Node Storage500GB1TB2x
Storage InterfaceUSB 3.0 SSDPCIe NVMePCIe speed
Fault ToleranceSingle failureMulti-failureHigher

Architectural Benefits

  • Proper Consensus: 3-master Raft eliminates split-brain scenarios
  • Expanded Capacity: 3.6TB enables larger workloads and datasets
  • Performance Scaling: NVMe storage handles high-IOPS workloads
  • Kubernetes Native: CSI integration enables GitOps storage workflows
  • Future Ready: Foundation for S3 gateway and advanced SeaweedFS features

P5.js Creative Coding Platform

Goldentooth's journey into creative computing required a platform for hosting and showcasing interactive p5.js sketches. The p5js-sketches project emerged as a Kubernetes-native solution that combines modern DevOps practices with artistic expression, providing a robust foundation for creative coding experiments and demonstrations.

Project Overview

Vision and Purpose

The p5js-sketches platform serves multiple purposes within the Goldentooth ecosystem:

  • Creative Expression: A canvas for computational art and interactive visualizations
  • Educational Demos: Showcase machine learning algorithms and mathematical concepts
  • Technical Exhibition: Demonstrate Kubernetes deployment patterns for static content
  • Community Sharing: Provide a gallery format for browsing and discovering sketches

Architecture Philosophy

The platform embraces cloud-native principles while optimizing for the unique constraints of a Raspberry Pi cluster:

  • Container-Native: Docker-based deployments with multi-architecture support
  • GitOps Workflow: Code-to-deployment automation via Argo CD
  • Edge-Optimized: Resource limits tailored for ARM64 Pi hardware
  • Automated Content: CI/CD pipeline for preview generation and deployment

Technical Architecture

Core Components

The platform consists of several integrated components:

Static File Server

  • Base: nginx optimized for ARM64 Raspberry Pi hardware
  • Content: p5.js sketches with HTML, JavaScript, and assets
  • Security: Non-root container with read-only filesystem
  • Performance: Tuned for low-memory Pi environments

Storage Foundation

  • Backend: local-path storage provisioner
  • Capacity: 10Gi persistent volume for sketch content
  • Limitation: Single-replica deployment (ReadWriteOnce constraint)
  • Future: Ready for migration to SeaweedFS distributed storage

Networking Integration

  • Load Balancer: MetalLB for external access
  • DNS: external-dns automatic hostname management
  • SSL: Future integration with cert-manager and Step-CA

Container Configuration

The deployment leverages advanced Kubernetes security features:

# Security hardening
security:
  runAsNonRoot: true
  runAsUser: 101              # nginx user
  runAsGroup: 101
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true

# Resource optimization for Pi hardware
resources:
  requests:
    memory: "32Mi"
    cpu: "50m"
  limits:
    memory: "64Mi"
    cpu: "100m"

Deployment Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   GitHub Repo   │───▶│   Argo CD       │───▶│  Kubernetes     │
│   p5js-sketches │    │   GitOps        │    │  Deployment     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                                              │
         ▼                                              ▼
┌─────────────────┐                            ┌─────────────────┐
│ GitHub Actions  │                            │    nginx Pod    │
│ Preview Gen     │                            │  serving static │
└─────────────────┘                            │     content     │
                                               └─────────────────┘

Automated Preview Generation System

The Challenge

p5.js sketches are interactive, dynamic content that can't be represented by static screenshots. The platform needed a way to automatically generate compelling preview images that capture the essence of each sketch's visual output.

Solution: Headless Browser Automation

The preview generation system uses Puppeteer for sophisticated browser automation:

Technology Stack

  • Puppeteer v21.5.0: Headless Chrome automation
  • GitHub Actions: CI/CD execution environment
  • Node.js: Runtime environment for capture scripts
  • Canvas Capture: Direct p5.js canvas element extraction

Capture Process

const CONFIG = {
  sketches_dir: './sketches',
  capture_delay: 10000,           // Wait for sketch initialization
  animation_duration: 3000,       // Record animation period
  viewport: { width: 600, height: 600 },
  screenshot_options: {
    type: 'png',
    clip: { x: 0, y: 0, width: 400, height: 400 }  // Crop to canvas
  }
};

Advanced Capture Features

Sketch Lifecycle Awareness

  • Initialization Delay: Configurable per-sketch startup time
  • Animation Sampling: Capture representative frames from animations
  • Canvas Detection: Automatic identification of p5.js canvas elements
  • Error Handling: Graceful fallback for problematic sketches

GitHub Actions Integration

on:
  push:
    paths:
      - 'sketches/**'      # Trigger on sketch modifications
  workflow_dispatch:       # Manual execution capability
    inputs:
      force_regenerate:    # Regenerate all previews
      capture_delay:       # Configurable timing

Automated Workflow

  1. Trigger Detection: Sketch files modified or manual dispatch
  2. Environment Setup: Node.js, Puppeteer browser installation
  3. Dependency Caching: Optimize build times with browser cache
  4. Preview Generation: Execute capture script across all sketches
  5. Change Detection: Identify new or modified preview images
  6. Auto-Commit: Commit generated images back to repository
  7. Artifact Upload: Preserve previews for debugging and archives

Sketch Organization and Metadata

Directory Structure

Each sketch follows a standardized organization pattern:

sketches/
├── linear-regression/
│   ├── index.html          # Entry point with p5.js setup
│   ├── sketch.js          # Main p5.js code
│   ├── style.css          # Styling and layout
│   ├── metadata.json      # Sketch configuration
│   ├── preview.png        # Auto-generated preview (400x400)
│   └── libraries/         # p5.js and extensions
│       ├── p5.min.js
│       └── p5.sound.min.js
└── robbie-the-robot/
    ├── index.html
    ├── main.js            # Entry point
    ├── robot.js           # Agent implementation
    ├── simulation.js      # GA evolution logic
    ├── world.js           # Environment simulation
    ├── ga-worker.js       # Web Worker for GA
    ├── metadata.json
    ├── preview.png
    └── libraries/

Metadata Configuration

Each sketch includes rich metadata for gallery display and capture configuration:

{
  "title": "Robby GA with Worker",
  "description": "Genetic algorithm simulation where robots learn to collect cans in a grid world using neural network evolution",
  "isAnimated": true,
  "captureDelay": 30000,
  "lastUpdated": "2025-08-04T19:06:01.506Z"
}

Metadata Fields

  • title: Display name for gallery
  • description: Detailed explanation of the sketch concept
  • isAnimated: Indicates dynamic content requiring longer capture
  • captureDelay: Custom initialization time in milliseconds
  • lastUpdated: Automatic timestamp for version tracking

Example Sketches

Linear Regression Visualization

A educational demonstration of machine learning fundamentals:

Purpose: Interactive visualization of gradient descent optimization Features:

  • Real-time data point plotting
  • Animated regression line fitting
  • Loss function visualization
  • Parameter adjustment controls

Technical Implementation:

  • Single-file sketch with mathematical calculations
  • Real-time chart updates using p5.js drawing primitives
  • Interactive mouse controls for data manipulation

Robbie the Robot - Genetic Algorithm

A sophisticated multi-agent simulation demonstrating evolutionary computation:

Purpose: Showcase genetic algorithms learning optimal can-collection strategies Features:

  • Multi-generational population evolution
  • Neural network-based agent decision making
  • Web Worker-based GA computation for performance
  • Real-time fitness and generation statistics

Technical Architecture:

  • Main Thread: p5.js rendering and user interaction
  • Web Worker: Genetic algorithm computation (ga-worker.js)
  • Modular Design: Separate files for robot, simulation, and world logic
  • Performance Optimization: Efficient canvas rendering for multiple agents

Deployment Integration

Helm Chart Configuration

The platform uses Helm for templated Kubernetes deployments:

# Chart.yaml
apiVersion: 'v2'
name: 'p5js-sketches'
description: 'P5.js Sketch Server - Static file server for hosting p5.js sketches'
type: 'application'
version: '0.0.1'

Key Templates:

  • Deployment: nginx container with security hardening
  • Service: LoadBalancer with MetalLB integration
  • ConfigMap: nginx configuration optimization
  • Namespace: Isolated environment for sketch server
  • ServiceAccount: RBAC configuration for security

Argo CD GitOps Integration

The platform deploys automatically via Argo CD:

Repository Structure:

  • Source: github.com/goldentooth/p5js-sketches
  • Target: p5js-sketches namespace
  • Sync Policy: Automatic deployment on git changes
  • Health Checks: Kubernetes-native readiness and liveness probes

Deployment URL: https://p5js-sketches.services.k8s.goldentooth.net/

The platform includes sophisticated gallery generation:

Features:

  • Responsive Grid: CSS Grid layout optimized for various screen sizes
  • Preview Integration: Auto-generated preview images with fallbacks
  • Metadata Display: Title, description, and technical details
  • Interactive Navigation: Direct links to individual sketches
  • Search and Filter: Future enhancement for large sketch collections

Template System:

<!-- Gallery template with dynamic sketch injection -->
<div class="gallery-grid">
  {{#each sketches}}
  <div class="sketch-card">
    <img src="{{preview}}" alt="{{title}}" loading="lazy">
    <h3>{{title}}</h3>
    <p>{{description}}</p>
    <a href="{{url}}" class="sketch-link">View Sketch</a>
  </div>
  {{/each}}
</div>

CLI Ergonomics

The Goldentooth CLI underwent a fundamental transformation, evolving from a verbose, Ansible-heavy interface into a sleek, ergonomic command suite optimized for both human operators and programmatic consumption. This architectural revolution introduced direct SSH operations, intelligent MOTD systems, distributed computing integration, and performance improvements that deliver 3x faster execution times.

The Transformation

From Ansible-Heavy to SSH-Native Operations

The original CLI relied exclusively on Ansible playbooks for every operation, creating unnecessary overhead for simple tasks. The new architecture introduces direct SSH operations that bypass Ansible entirely for appropriate use cases:

Before: Every command required Ansible overhead

# Old approach - always through Ansible
goldentooth command all "systemctl status consul"  # ~10-15 seconds

After: Direct SSH with intelligent routing

# New approach - direct SSH operations
goldentooth shell bettley                    # Instant interactive session
goldentooth command all "systemctl status consul"  # ~3-5 seconds with parallel

Revolutionary SSH-Based Command Suite

Interactive Shell Sessions

The shell command provides seamless access to cluster nodes with intelligent behavior:

# Single node - direct SSH session with beautiful MOTD
goldentooth shell bettley

# Multiple nodes - broadcast mode with synchronized output
goldentooth shell all

Smart Behavior:

  • Single node: Interactive SSH session with full MOTD display
  • Multiple nodes: Broadcast mode with synchronized command execution
  • Automatic host resolution from Ansible inventory groups

Stream Processing with Pipe

The pipe command transforms stdin into distributed execution:

# Stream commands to multiple nodes
echo "df -h" | goldentooth pipe storage_nodes
echo "systemctl status consul" | goldentooth pipe consul_server

Advanced Features:

  • Comment filtering (lines starting with # are ignored)
  • Empty line skipping for clean script processing
  • Parallel execution across multiple hosts
  • Clean error handling and output formatting

File Transfer with CP

Node-aware file transfer using intuitive syntax:

# Copy from cluster to local
goldentooth cp bettley:/var/log/consul.log ./logs/

# Copy from local to cluster
goldentooth cp ./config.yaml allyrion:/etc/myapp/

# Inter-node transfers
goldentooth cp allyrion:/tmp/data.json bettley:/opt/processing/

Batch Script Execution

Execute shell scripts across the cluster:

# Run maintenance script on storage nodes
goldentooth batch maintenance.sh storage_nodes

# Execute deployment script on all nodes
goldentooth batch deploy.sh all

Multi-line Command Execution

The heredoc command enables complex multi-line operations:

goldentooth heredoc consul_server <<'EOF'
consul kv put config/database/host "db.goldentooth.net"
consul kv put config/database/port "5432"
systemctl reload myapp
EOF

Performance Architecture

GNU Parallel Integration

The CLI intelligently detects and leverages GNU parallel for concurrent operations:

Automatic Parallelization:

  • Single host: Direct SSH connection
  • Multiple hosts: GNU parallel with job control (-j0 for optimal concurrency)
  • Fallback: Sequential execution if parallel unavailable

Performance Improvements:

  • 3x faster execution for multi-host operations
  • Optimal resource utilization across cluster nodes
  • Tagged output for clear host identification

Intelligent SSH Configuration

Optimized SSH behavior for different use cases:

Clean Command Output:

ssh_opts="-T -o StrictHostKeyChecking=no -o LogLevel=ERROR -q"

Features:

  • -T flag disables pseudo-terminal allocation (suppresses MOTD for commands)
  • Error suppression for clean programmatic consumption
  • Connection optimization for repeated operations

MOTD System Overhaul

Visual Node Identification

Each cluster node features unique ASCII art MOTD for instant visual recognition:

Implementation:

  • Node-specific colorized ASCII artwork stored in /etc/motd
  • Beautiful visual identification during interactive SSH sessions
  • SSH PrintMotd yes configuration for proper display

Examples:

  • bettley: Distinctive golden-colored ASCII art design
  • allyrion: Unique visual signature for immediate recognition
  • Each node: Custom artwork matching cluster theme and node personality

Smart MOTD Behavior

The system provides context-appropriate MOTD display:

Interactive Sessions: Full MOTD display with ASCII art Command Execution: Suppressed MOTD for clean output Programmatic Access: No visual interference with data processing

Technical Implementation:

  • Removed complex PAM-based conditional MOTD system
  • Leveraged SSH's built-in PrintMotd behavior
  • Clean separation between interactive and programmatic access

Inventory Integration System

Ansible Group Compatibility

The CLI seamlessly integrates Ansible inventory definitions with SSH operations:

Inventory Parsing:

# parse-inventory.py converts YAML inventory to bash functions
def generate_bash_variables(groups):
    # Creates goldentooth:resolve_hosts() function
    # Generates case statements for each group
    # Maintains compatibility with existing Ansible workflows

Generated Functions:

function goldentooth:resolve_hosts() {
  case "$expression" in
    "consul_server")
      echo "allyrion bettley cargyll"
      ;;
    "storage_nodes")
      echo "jast karstark lipps"
      ;;
    # ... all inventory groups
  esac
}

Installation Integration:

  • Inventory parsing during CLI installation (make install)
  • Automatic generation of /usr/local/bin/goldentooth-inventory.sh
  • Dynamic loading of inventory groups into CLI

Distributed LLaMA Integration

Cross-Platform Compilation

Advanced cross-compilation support for ARM64 distributed computing:

Architecture:

  • x86_64 Velaryon node: Cross-compilation host
  • ARM64 Pi nodes: Deployment targets
  • Automated binary distribution and service management

Commands:

# Model management
goldentooth dllama_download_model meta-llama/Llama-3.2-1B

# Service lifecycle
goldentooth dllama_start_workers
goldentooth dllama_stop

# Cluster status
goldentooth dllama_status

# Distributed inference
goldentooth dllama_inference "Explain quantum computing"

Technical Features:

  • Automatic model download and conversion
  • Distributed worker node management
  • Cross-architecture binary deployment
  • Performance monitoring and status reporting

Command Line Interface Enhancements

Bash Completion System

Comprehensive tab completion for all operations:

Features:

  • Command completion for all CLI functions
  • Host and group name completion
  • Context-aware parameter suggestions
  • Integration with existing shell environments

Error Handling and Output Management

Professional error management with proper stream handling:

Implementation:

  • Error messages directed to stderr
  • Clean stdout for programmatic consumption
  • Consistent exit codes for automation integration
  • Detailed error reporting with actionable suggestions

Help and Documentation

Built-in documentation system:

# List available commands
goldentooth help

# Command-specific help
goldentooth help shell
goldentooth help dllama_inference

# Show available inventory groups
goldentooth list_groups

Integration with Existing Infrastructure

Ansible Compatibility

The new CLI maintains full compatibility with existing Ansible workflows:

Hybrid Approach:

  • SSH operations for simple, fast tasks
  • Ansible playbooks for complex configuration management
  • Seamless switching between approaches based on task requirements

Examples:

# Quick status check - SSH
goldentooth command all "uptime"

# Complex configuration - Ansible
goldentooth setup_consul

Monitoring and Observability

CLI operations integrate with existing monitoring systems:

Features:

  • Command execution logging
  • Performance metrics collection
  • Integration with Prometheus/Grafana monitoring
  • Audit trail for security compliance

User Experience Improvements

Intuitive Command Syntax

Natural, memorable command patterns:

# Intuitive file operations
goldentooth cp source destination

# Clear service management
goldentooth dllama_start_workers

# Obvious interactive access
goldentooth shell hostname