Introduction
Who am I?
A portrait of the author in the form he will assume over the course of this project, having returned to our time to warn his present self against pursuing this course of action.
My name is Nathan Douglas. The best source of information about my electronic life is probably my GitHub profile. It almost certainly would not be my LinkedIn profile. I also have a blog about non-computer-related stuff here.
What Do I Do?
The author in his eventual form advising the author in his present form not to do the thing, and why.
I've been trying to get computers to do what I want, with mixed success, since the early-to-mid nineties. I earned my Bachelor's in Computer Science from the University of Nevada, Las Vegas in 2011, and I've been working as a Software/DevOps engineer ever since, depending on the gig.
I consider DevOps a methodology and a role, in that I try to work in whatever capacity I can to improve the product delivery lifecycle and shorten delivery lead time. I generally do the work that is referred to as "DevOps" or "platform engineering" or "site reliability engineering", but I try to emphasize the theoretical aspects, e.g. Lean Management, systems thinking, etc. That's not to say that I'm an expert, just that I try to keep the technical details grounded in the philosophical justifications, the big picture.
Update (2025-04-05): At present I consider myself more of a platform engineer. I'm trying to move into an MLOps space, though, and from there into High-Performance Computing. I also would like to eventually shift into deep tech research and possibly get my PhD in mathematics or computer science.
Background
"What would you do if you had an AMD K6-2 333MHz and 96MB RAM?" "I'd run two copies of Windows 98, my dude."
At some point in the very early '00s, I believe, I first encountered VMware and the idea that I could run a computer inside another computer. That wasn't the first time I'd encountered a virtual machine -- I'd played with Java in the '90s, and played Zork and other Infocom and Inform games -- but it might've been the first time that I really understood the idea.
And I made use of it. For a long time – most of my twenties – I was occupied by a writing project. I maintained a virtual machine that ran a LAMP server and hosted various content management systems and related technologies: raw HTML pages, MediaWiki, DokuWiki, Drupal, etc, all to organize my thoughts on this and other projects. Along the way, I learned a whole lot about this sort of deployment: namely, that it was a pain in the ass.
I finally abandoned that writing project around the time Docker came out. I immediately understood what it was: a less tedious VM. (Admittedly, my understanding was not that sophisticated.) I built a decent set of skills with Docker and used it wherever I could. I thought Docker was about as good as it got.
At some point around 2016 or 2017, I became aware of Kubernetes. I immediately built a 4-node cluster with old PCs, doing a version of Kubernetes the Hard Way on bare metal, and then shifted to a custom system: four VMware VMs that PXE-booted, picked up a CoreOS configuration via Ignition and what was then called Matchbox, and formed a self-healing cluster with some neat toys like GlusterFS. Eventually, though, I started neglecting the cluster and tore it down.
Around 2021, my teammates and I started considering a Kubernetes-based infrastructure for our applications, so I got back into it. I set up a rather complicated infrastructure on a three-node Proxmox VE cluster that would create three three-node Kubernetes clusters using LXC containers. From there I explored ArgoCD and GitOps and Helm and some other things that I hadn't really played with before. But again, my interest waned and the cluster didn't actually get much action.
A large part of this, I think, is that I didn't trust it to run high-FAF (Family Acceptance Factor) apps, like Plex. After all, this was supposed to be a cluster I could tinker with, and tear down and destroy and rebuild at a moment's notice. So in practice, it ended up being a toy cluster.
And while I'd gone through Kubernetes the Hard Way (twice!), I got the irritating feeling that I hadn't really learned all that much. I'd done Linux From Scratch, and had run Gentoo for several years, so I was no stranger to the idea of following a painfully manual process filled with shell commands and waiting for days for my computer to be useful again. And I did learn a lot from all three projects, but, for whatever reason, it didn't stick all that well.
Motivation
In late 2023, my team's contract concluded, and there was a possibility I might be laid off. My employer quickly offered me a position on another team, which I happily and gratefully accepted, but I had already applied to several other positions. I had some promising paths forward, but... not as many as I would like. It was an unnerving experience.
Not everyone is using Kubernetes, of course, but it's an increasingly essential skill in my field. There are other skills I have – Ansible, Terraform, Linux system administration, etc – but I'm not entirely comfortable with my knowledge of Kubernetes, so I'd like to deepen and broaden that as effectively as possible.
Goals
I want to get really good at Kubernetes. Not just administering it, but having a good understanding of what is going on under the hood at any point, and how best to inspect and troubleshoot and repair a cluster.
I want to have a fertile playground for experimenting; something that is not used for other purposes, not expected to be stable, ideally not even accessed by anyone else. Something I can do the DevOps equivalent of destroy with an axe, without consequences.
I want to document everything I've learned exhaustively. I don't want to take a command for granted, or copy and paste, or even copy and paste after nodding thoughtfully at a wall of text. I want to embed things deeply into my thiccc skull.
Generally, I want to be beyond prepared for my CKA, CKAD, and CKS certification exams. I hate test anxiety. I hate feeling like there are gaps in my knowledge. I want to go in confident, and I want my employers and teammates to be confident of my abilities.
Approach
This is largely going to consist of me reading documentation and banging my head against the wall. I'll provide links to the relevant information, and type out the commands, but I also want to persist this in Infrastructure-as-Code. Consequently, I'll link to Ansible tasks/roles/playbooks for each task as well.
Cluster Hardware
I went with a PicoCluster 10H. I'm well aware that I could've cobbled something together and spent much less money; I have indeed done the thing with a bunch of Raspberry Pis screwed to a board and plugged into an Anker USB charger and a TP-Link switch.
I didn't want to do that again, though. For one, I've experienced problems with USB chargers seeming to lose power over time, and some small switches getting flaky when powered from USB. I liked the power supply of the PicoCluster and its cooling configuration. I liked that it did pretty much exactly what I wanted, and if I had problems I could yell at someone else about it rather than getting derailed by hardware rabbit holes.
I also purchased ten large heatsinks with fans, specifically these. There were others I liked a bit more, and the ones I chose interfered with the standoffs used to build each stack of five Raspberry Pis, but they seemed likely to be the most reliable in the long run.
I purchased SanDisk 128GB Extreme microSDXC cards for local storage. I've been using SanDisk cards for years with no significant issues or complaints.
The individual nodes are Raspberry Pi 4B/8GB. As of the time I'm writing this, Raspberry Pi 5s are out, and they offer very substantial benefits over the 4B. That said, they also have higher energy consumption, lower availability, and so forth. I'm opting for a lower likelihood of surprises because, again, I just don't want to spend much time dealing with hardware and I don't expect performance to hinder me.
Technical Specifications
Complete Node Inventory
The cluster consists of 13 nodes with specific roles and configurations:
Raspberry Pi Nodes (12 total):
- allyrion (10.4.0.10) - NFS server, HAProxy load balancer, Docker host
- bettley (10.4.0.11) - Kubernetes control plane, Consul server, Vault server
- cargyll (10.4.0.12) - Kubernetes control plane, Consul server, Vault server
- dalt (10.4.0.13) - Kubernetes control plane, Consul server, Vault server
- erenford (10.4.0.14) - Kubernetes worker, Ray head node, ZFS storage
- fenn (10.4.0.15) - Kubernetes worker, Ceph storage node
- gardener (10.4.0.16) - Kubernetes worker, Grafana host, ZFS storage
- harlton (10.4.0.17) - Kubernetes worker
- inchfield (10.4.0.18) - Kubernetes worker, Loki log aggregation
- jast (10.4.0.19) - Kubernetes worker, Step-CA certificate authority
- karstark (10.4.0.20) - Kubernetes worker, Ceph storage node
- lipps (10.4.0.21) - Kubernetes worker, Ceph storage node
x86 GPU Node:
- velaryon (10.4.0.30) - AMD Ryzen 9 3900X, 32GB RAM, NVIDIA RTX 2070 Super
Hardware Architecture
Raspberry Pi 4B Specifications:
- CPU: ARM Cortex-A72 quad-core @ 2.0GHz (overclocked from 1.5GHz)
- RAM: 8GB LPDDR4
- Storage: SanDisk 128GB Extreme microSDXC (UHS-I Class 10)
- Network: Gigabit Ethernet (onboard)
- GPIO: Used for fan control (pin 14) and hardware monitoring
Performance Optimizations:
arm_freq=2000
over_voltage=6
These overclocking settings provide approximately 33% performance increase while maintaining thermal stability with active cooling.
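If you want to confirm that the overclock took and that the SoC isn't throttling, the stock Raspberry Pi OS tooling is enough; a quick check might look like:
$ vcgencmd measure_temp        # current SoC temperature
$ vcgencmd measure_clock arm   # current ARM clock, in Hz
$ vcgencmd get_throttled       # throttled=0x0 means no throttling or undervoltage events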
Network Infrastructure
Network Segmentation:
- Infrastructure CIDR: 10.4.0.0/20 - Physical network backbone
- Service CIDR: 172.16.0.0/20 - Kubernetes virtual services
- Pod CIDR: 192.168.0.0/16 - Container networking
- MetalLB Range: 10.4.11.0/24 - Load balancer IP allocation
MAC Address Registry: Each node has documented MAC addresses for network boot and management:
- Raspberry Pi nodes: d8:3a:dd:* and dc:a6:32:* prefixes
- x86 node: 2c:f0:5d:0f:ff:39 (velaryon)
Storage Architecture
Distributed Storage Strategy:
NFS Shared Storage:
- Server: allyrion exports /mnt/usb1
- Clients: All 13 nodes mount at /mnt/nfs
- Use Cases: Configuration files, shared datasets, cluster coordination
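As a rough sketch of that arrangement (the paths come from above; the export options and client commands are my assumptions, not necessarily what the Ansible role renders):
# /etc/exports on allyrion (assumed options)
/mnt/usb1 10.4.0.0/20(rw,sync,no_subtree_check)

$ sudo exportfs -ra                               # re-read /etc/exports on the server
$ showmount -e allyrion                           # from a client, list what allyrion exports
$ sudo mount -t nfs allyrion:/mnt/usb1 /mnt/nfs   # mount it where the cluster expects it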
ZFS Storage Pool:
- Nodes: allyrion, erenford, gardener
- Pool: rpool with rpool/data dataset
- Features: Snapshots, replication, compression
- Optimization: 128MB ARC limit for Raspberry Pi RAM constraints
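A minimal sketch of that setup, assuming a USB-attached disk at /dev/sda (the device name, pool layout, and compression choice are assumptions):
$ sudo zpool create rpool /dev/sda                # create the pool
$ sudo zfs create rpool/data                      # create the dataset
$ sudo zfs set compression=lz4 rpool/data         # cheap compression is usually a win on SD/USB storage
$ echo 'options zfs zfs_arc_max=134217728' | sudo tee /etc/modprobe.d/zfs.conf   # cap the ARC at 128MB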
Ceph Distributed Storage:
- Nodes: fenn, karstark, lipps
- Purpose: Highly available distributed block and object storage
- Integration: Kubernetes persistent volumes
Thermal Management
Cooling Configuration:
- Heatsinks: Large aluminum heatsinks with 40mm fans per node
- Fan Control: GPIO-based temperature control at 60°C threshold
- Airflow: PicoCluster chassis provides directed airflow path
- Monitoring: Temperature sensors exposed via Prometheus metrics
Thermal Performance:
- Idle: ~45-50°C ambient
- Load: ~60-65°C under sustained workload
- Throttling: No thermal throttling observed during normal operations
Power Architecture
Power Supply:
- Input: Single AC connection to PicoCluster power distribution
- Per Node: 5V/3A regulated power (avoiding USB charger degradation)
- Efficiency: ~90% efficiency at typical load
- Redundancy: Single point of failure by design (acceptable for lab environment)
Power Consumption:
- Raspberry Pi: ~8W idle, ~15W peak per node
- Total Pi Load: ~96W idle, ~180W peak (12 nodes)
- x86 Node: ~150W idle, ~300W peak
- Cluster Total: ~250W idle, ~480W peak
Hardware Monitoring
Metrics Collection:
- Node Exporter: Hardware sensors, thermal data, power metrics
- Prometheus: Centralized metrics aggregation
- Grafana: Real-time dashboards with thermal and performance alerts
Monitored Parameters:
- CPU temperature and frequency
- Memory usage and availability
- Storage I/O and capacity
- Network interface statistics
- Fan speed and cooling device status
Reliability Considerations
Hardware Resilience:
- No RAID: Individual node failure acceptable (distributed applications)
- Network Redundancy: Single switch (acceptable for lab)
- Power Redundancy: Single PSU (lab environment limitation)
- Cooling Redundancy: Individual fan failure affects single node only
Failure Recovery:
- Kubernetes: Automatic pod rescheduling on node failure
- Consul/Vault: Multi-node quorum survives single node loss
- Storage: ZFS replication and Ceph redundancy provide data protection
Future Expansion
Planned Upgrades:
- SSD Storage: USB 3.0 SSD expansion for high-IOPS workloads
- Network Upgrades: Potential 10GbE expansion via USB adapters
- Additional GPU: PCIe expansion for ML workloads
Frequently Asked Questions
So, how do you like the PicoCluster so far?
I have no complaints. Putting it together was straightforward; the documentation was great, everything was labeled correctly, etc. Cooling seems very adequate and performance and appearance are perfect.
The integrated power supply has been particularly reliable compared to previous experiences with USB charger-based setups. The structured cabling and chassis design make maintenance and monitoring much easier than ad-hoc Raspberry Pi clusters.
Have you considered adding SSDs for mass storage?
Yes, and I have some cables and spare SSDs for doing so. I'm not sure if I actually will. We'll see.
The current storage architecture with ZFS pools on USB-attached SSDs and distributed Ceph storage has proven adequate for most workloads. The microSD cards handle the OS and container storage well, while shared storage needs are met through the NFS and distributed storage layers.
Meet the Nodes
It's generally frowned upon nowadays to treat servers like "pets" as opposed to "cattle". And, indeed, I'm trying not to personify these little guys too much, but... you can have my custom MOTD, hostnames, and prompts when you pry them from my cold, dead fingers.
The nodes are identified with a letter A-J and labeled accordingly on the ethernet port so that if one needs to be replaced or repaired, that can be done with a minimum of confusion. Then, I gave each the name of a noble house from A Song of Ice and Fire and gave it a MOTD (based on the coat of arms) and a themed Bash prompt.
In my experience, when I'm working on multiple servers simultaneously, it's good for me to have a bright warning sign letting me know, as unambiguously as possible, which server I'm actually logged in on. (I've never blown up prod thinking it was staging, but if I'm shelled into prod, I'm deeply concerned about that possibility.)
This is just me being a bit over-the-top, I guess.
✋ Allyrion
🐞 Bettley
🦢 Cargyll
🍋 Dalt
🦩 Erenford
🌺 Fenn
🧤 Gardener
🌳 Harlton
🏁 Inchfield
🦁 Jast
Node Configuration
After physically installing and setting up the nodes, the next step is to perform basic configuration. You can see the Ansible playbook I use for this, which currently runs the following roles:
goldentooth.configure:
- Set timezone; the last thing I need when working with computers is to perform arithmetic on times and dates.
- Set keyboard layout; this should be set already, but I want to be sure.
- Enable overclocking; I've installed an adequate cooling system to support the Pis running flat-out at the higher clock speed.
- Enable fan control; the heatsinks I've installed include fans to prevent CPU throttling under heavy load.
- Enable and configure certain cgroups; this allows Kubernetes to manage and limit resources on the system.
- cpuset: This is used to manage the assignment of individual CPUs (both physical and logical) and memory nodes to tasks running in a cgroup. It allows for pinning processes to specific CPUs and memory nodes, which can be very useful in a containerized environment for performance tuning and ensuring that certain processes have dedicated CPU time. Kubernetes can use cpuset to ensure that workloads (containers/pods) have dedicated processing resources. This is particularly important in multi-tenant environments or when running workloads that require guaranteed CPU cycles. By controlling CPU affinity and ensuring that processes are not competing for CPU time, Kubernetes can improve the predictability and efficiency of applications.
- memory: This is used to limit the amount of memory that tasks in a cgroup can use. This includes both RAM and swap space. It provides mechanisms to monitor memory usage and enforce hard or soft limits on the memory available to processes. When a limit is reached, the cgroup can trigger the OOM (Out of Memory) killer to select and kill processes exceeding their allocation. Kubernetes uses the memory cgroup to enforce memory limits specified for pods and containers, preventing a single workload from consuming all available memory, which could lead to system instability or affect other workloads. It allows for better resource isolation, efficient use of system resources, and ensures that applications adhere to their specified resource limits, promoting fairness and reliability.
- hugetlb: This is used to manage huge pages, a feature of modern operating systems that allows the allocation of memory in larger blocks (huge pages) compared to standard page sizes. This can significantly improve performance for certain workloads by reducing the overhead of page translation and increasing TLB (Translation Lookaside Buffer) hits. Some applications, particularly those dealing with large datasets or high-performance computing tasks, can benefit significantly from using huge pages, and Kubernetes can allocate huge pages to those workloads, improving performance and efficiency. This is not going to be a concern for my use, but I'm enabling it anyway simply because it's recommended.
- Disable swap. Kubernetes doesn't like swap by default, and although this can be worked around, I'd prefer to avoid swapping on SD cards. I don't really expect a high memory pressure condition anyway.
- Set preferred editor; I like nano, although I can (after years of practice) safely and reliably exit vi.
- Set certain kernel modules to load at boot:
- overlay: This supports OverlayFS, a type of union filesystem. It allows one filesystem to be overlaid on top of another, combining their contents. In the context of containers, OverlayFS can be used to create a layered filesystem that combines multiple layers into a single view, making it efficient to manage container images and writable container layers.
- br_netfilter: This allows bridged network traffic to be filtered by iptables and ip6tables. This is essential for implementing network policies, including those related to Network Address Translation (NAT), port forwarding, and traffic filtering. Kubernetes uses it to enforce network policies that control ingress and egress traffic to pods and between pods. This is crucial for maintaining the security and isolation of containerized applications. It also enables the necessary manipulation of traffic for services to direct traffic to pods, and for pods to communicate with each other and the outside world. This includes the implementation of services, load balancing, and NAT for pod networking. And by allowing iptables to filter bridged traffic, br_netfilter helps Kubernetes manage network traffic more efficiently, ensuring consistent network performance and reliability across the cluster.
- Load above kernel modules on every boot.
- Set some kernel parameters (a sketch of the resulting module and sysctl config files follows this role list):
- net.bridge.bridge-nf-call-iptables: This allows iptables to inspect and manipulate the traffic that passes through a Linux bridge. A bridge is a way to connect two network segments, acting somewhat like a virtual network switch. When enabled, it allows iptables rules to be applied to traffic coming in or going out of a bridge, effectively enabling network policies, NAT, and other iptables-based functionalities for bridged traffic. This is essential in Kubernetes for implementing network policies that control access to and from pods running on the same node, ensuring the necessary level of network isolation and security.
- net.bridge.bridge-nf-call-ip6tables: As above, but for IPv6 traffic.
- net.ipv4.ip_forward: This controls the ability of the Linux kernel to forward IP packets from one network interface to another, a fundamental capability for any router or gateway. Enabling IP forwarding is crucial for a node to route traffic between pods, across different nodes, or between pods and the external network. It allows the node to act as a forwarder or router, which is essential for the connectivity of pods across the cluster, service exposure, and for pods to access the internet or external resources when necessary.
- Add SSH public key to root's authorized keys; this is already performed for my normal user by Raspberry Pi Imager.
goldentooth.set_hostname: Set the hostname of the node (including a line in /etc/hosts). This doesn't need to be a separate role, obviously; I just like the structure as I have it.
goldentooth.set_motd: Set the MOTD, as described in the previous chapter.
goldentooth.set_bash_prompt: Set the Bash prompt, as described in the previous chapter.
goldentooth.setup_security: Some basic security configuration. Currently, this just uses Jeff Geerling's ansible-role-security to perform some basic tasks, like setting up unattended upgrades, but I might expand this in the future.
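For reference, the end state those kernel-module and sysctl tasks are driving toward looks roughly like this (the file names are my assumption; the role may organize them differently):
# /etc/modules-load.d/k8s.conf
overlay
br_netfilter

# /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1

$ sudo modprobe overlay        # load the modules now, without rebooting
$ sudo modprobe br_netfilter
$ sudo sysctl --system         # apply the sysctl files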
Raspberry Pi Imager doesn't allow you to specify an SSH key for the root user, so I do this in goldentooth.configure. However, I also have Kubespray installed (for when I want things to Just Work™), and Kubespray expects the remote user to be root. As a result, I specify that the remote user is my normal user account in the configure_cluster playbook. This means a lot of become: true in the roles, but I would prefer eventually to ditch Kubespray and disallow root login via SSH.
Anyway, we need to rerun goldentooth.set_bash_prompt, but as the root user. This almost never matters, since I prefer to SSH as a normal user and use sudo, but I like my prompts and you can't take them away from me.
With the nodes configured, we can start talking about the different roles they will serve.
Cluster Roles and Responsibilities
Observations:
- The cluster has a single power supply but two power distribution units (PDUs) and two network switches, so it seems reasonable to segment the cluster into left and right halves.
- I want high availability, which requires a control plane capable of a quorum, so a minimum of three nodes in the control plane.
- I want to use a dedicated external load balancer for the control plane rather than configure my existing OPNsense firewall/router. (I'll have to do that to enable MetalLB via BGP, sadly.)
- So that would yield one load balancer, three control plane nodes, and six worker nodes.
- With the left-right segmentation, I can locate one load balancer and one control plane node on the left side, two control plane nodes on the right side, and three worker nodes on each side.
This isn't really high-availability; the cluster has multiple single points of failure:
- the load balancer node
- whichever network switch is connected to the upstream
- the power supply
- the PDU powering the LB
- the PDU powering the upstream switch
- etc.
That said, I find those acceptable given the nature of this project.
Load Balancer
Allyrion, the first node alphabetically and the top node on the left side, will run a load balancer. I had a number of options here, but I ended up going with HAProxy. HAProxy was my introduction to load balancing, reverse proxying, and so forth, and I have kind of a soft spot for it.
I'd also considered Traefik, which I use elsewhere in my homelab, but I believe I'll use it as an ingress controller. Similarly, I think I prefer to use Nginx at the per-application level. I'm pursuing this project first and foremost to learn and to document my learning, and I'd prefer to cover as much ground as possible, as clearly as possible; I believe I can do that best if I don't have to specify which installation of $proxy I'm referring to at any given time.
So:
- HAProxy: Load balancer
- Traefik: Ingress controller
- Nginx: Miscellaneous
Control Plane
Bettley (the second node on the left side), Fenn, and Gardener (the first and second nodes on the right side) will be the control plane nodes.
It's common, in small home Kubernetes clusters, to remove the control plane taint (node-role.kubernetes.io/control-plane) to allow miscellaneous pods to be scheduled on the control plane nodes. I won't be doing that here; six worker nodes should be sufficient for my purposes, and I'll try (where possible and practical) to follow best practices. That said, I might find some random fun things to run on my control plane nodes, and I'll adjust their tolerations accordingly.
Workers
The remaining nodes (Cargyll, Dalt, and Erenford on the left, and Harlton, Inchfield, and Jast on the right) are dedicated workers. What sort of workloads will they run?
Well, probably nothing interesting. Not Plex, not torrent clients or *darrs. Mostly logging, metrics, and similar. I'll probably end up gathering a lot of data about data. And that's fine – these Raspberry Pis are running off SD cards; I don't really want them to be doing anything interesting anyway.
Network Topology
In case you don't quite have a picture of the infrastructure so far, it looks roughly like this:
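    (upstream network, OPNsense router)
                    |
             [network switch]
                    |
      allyrion: HAProxy load balancer (10.4.0.10:6443)
                    |
        +-----------+------------------------------+
        |                                          |
  control plane:                               workers:
  bettley, fenn, gardener                      cargyll, dalt, erenford,
                                               harlton, inchfield, jast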
Frequently Asked Questions
Why didn't you make Etcd high-availability?
It seems like I'd need that cluster to have a quorum too, so we're talking about three nodes for the control plane, three nodes for Etcd, one for the load balancer, and, uh, three worker nodes. That's a bit more than I'd like to invest, and I'd like to avoid doubling up anywhere (although I'll probably add additional functionality to the load balancer). I'm interested in the etcd side of things, but not really enough to compromise elsewhere. I could be missing something obvious, though; if so, please let me know.
Why didn't you just do A=load balancer, B-D=control plane, and E-J=workers?
I could've and should've and still might. But because I'm a bit of a fool and wasn't really paying attention, I put A-E on the left and F-J on the right, rather than A,C,E,G,I on the left and B,D,F,H,J on the right, which would've been a bit cleaner. As it is, I need to think a second about which nodes are control nodes, since they aren't in a strict alphabetical order.
I might adjust this in the future; it should be easy to do so, after all, I just don't particularly want to take the cluster apart and rebuild it, especially since the standoffs were kind of messy as a consequence of the heatsinks.
Load Balancer
This cluster should have a high-availability control plane, and we can start laying the groundwork for that immediately.
This might sound complex, but all we're doing is:
- creating a load balancer
- configuring the load balancer to use all of the control plane nodes as a list of backends
- telling anything that sends requests to a control plane node to send them to the load balancer instead
As mentioned before, we're using HAProxy as a load balancer. First, though, I'll install rsyslog, a log processing system. It will gather logs from HAProxy and deposit them in a more ergonomic location.
$ sudo apt install -y rsyslog
At least at the time of writing (February 2024), rsyslog on Raspberry Pi OS includes a bit of configuration that relocates HAProxy logs:
# /etc/rsyslog.d/49-haproxy.conf
# Create an additional socket in haproxy's chroot in order to allow logging via
# /dev/log to chroot'ed HAProxy processes
$AddUnixListenSocket /var/lib/haproxy/dev/log
# Send HAProxy messages to a dedicated logfile
:programname, startswith, "haproxy" {
/var/log/haproxy.log
stop
}
In Raspberry Pi OS, installing and configuring HAProxy is a simple matter.
$ sudo apt install -y haproxy
Here is the configuration I'm working with for HAProxy at the time of writing (February 2024); I've done my best to comment it thoroughly. You can also see the Jinja2 template and the role that deploys the template to configure HAProxy.
# /etc/haproxy/haproxy.cfg
# This is the HAProxy configuration file for the load balancer in my Kubernetes
# cluster. It is used to load balance the API server traffic between the
# control plane nodes.
# Global parameters
global
# Sets uid for haproxy process.
user haproxy
# Sets gid for haproxy process.
group haproxy
# Sets the maximum per-process number of concurrent connections.
maxconn 4096
# Configure logging.
log /dev/log local0
log /dev/log local1 notice
# Default parameters
defaults
# Use global log configuration.
log global
# Frontend configuration for the HAProxy stats page.
frontend stats-frontend
# Listen on all IPv4 addresses on port 8404.
bind *:8404
# Use HTTP mode.
mode http
# Enable the stats page.
stats enable
# Set the URI to access the stats page.
stats uri /stats
# Set the refresh rate of the stats page.
stats refresh 10s
# Set the realm to access the stats page.
stats realm HAProxy\ Statistics
# Set the username and password to access the stats page.
stats auth nathan:<redacted>
# Hide HAProxy version to improve security.
stats hide-version
# Kubernetes API server frontend configuration.
frontend k8s-api-server
# Listen on the IPv4 address of the load balancer on port 6443.
bind 10.4.0.10:6443
# Use TCP mode, which means that the connection will be passed to the server
# without TLS termination, etc.
mode tcp
# Enable logging of the client's IP address and port.
option tcplog
# Use the Kubernetes API server backend.
default_backend k8s-api-server
# Kubernetes API server backend configuration.
backend k8s-api-server
# Use TCP mode here as well; no TLS termination.
mode tcp
# Sets the maximum time to wait for a connection attempt to a server to
# succeed.
timeout connect 10s
# Sets the maximum inactivity time on the client side. I might reduce this at
# some point.
timeout client 86400s
# Sets the maximum inactivity time on the server side. I might reduce this at
# some point.
timeout server 86400s
# Sets the load balancing algorithm.
# `roundrobin` means that each server is used in turns, according to their
# weights.
balance roundrobin
# Enable health checks.
option tcp-check
# For each control plane node, add a server line with the node's hostname and
# IP address.
# The `check` parameter enables health checks.
# The `fall` parameter sets the number of consecutive health check failures
# after which the server is considered to be down.
# The `rise` parameter sets the number of consecutive health check successes
# after which the server is considered to be up.
server bettley 10.4.0.11:6443 check fall 3 rise 2
server fenn 10.4.0.15:6443 check fall 3 rise 2
server gardener 10.4.0.16:6443 check fall 3 rise 2
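Whenever the template changes, it's worth validating the configuration before reloading the service; something like:
$ sudo haproxy -c -f /etc/haproxy/haproxy.cfg   # syntax/sanity check the configuration
$ sudo systemctl reload haproxy                 # apply it without dropping existing connections
$ sudo systemctl status haproxy                 # confirm the service came back healthy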
This enables the HAProxy stats frontend, which allows us to gain some insight into the operation of the frontend in something like real time.
We see that our backends are unavailable, which is of course expected at this time. We can also read the logs in /var/log/haproxy.log:
$ cat /var/log/haproxy.log
2024-02-21T07:03:16.603651-05:00 allyrion haproxy[1305383]: [NOTICE] (1305383) : haproxy version is 2.6.12-1+deb12u1
2024-02-21T07:03:16.603906-05:00 allyrion haproxy[1305383]: [NOTICE] (1305383) : path to executable is /usr/sbin/haproxy
2024-02-21T07:03:16.604085-05:00 allyrion haproxy[1305383]: [WARNING] (1305383) : Exiting Master process...
2024-02-21T07:03:16.607180-05:00 allyrion haproxy[1305383]: [ALERT] (1305383) : Current worker (1305385) exited with code 143 (Terminated)
2024-02-21T07:03:16.607558-05:00 allyrion haproxy[1305383]: [WARNING] (1305383) : All workers exited. Exiting... (0)
2024-02-21T07:03:16.771133-05:00 allyrion haproxy[1305569]: [NOTICE] (1305569) : New worker (1305572) forked
2024-02-21T07:03:16.772082-05:00 allyrion haproxy[1305569]: [NOTICE] (1305569) : Loading success.
2024-02-21T07:03:16.775819-05:00 allyrion haproxy[1305572]: [WARNING] (1305572) : Server k8s-api-server/bettley is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:16.776309-05:00 allyrion haproxy[1305572]: Server k8s-api-server/bettley is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:16.776584-05:00 allyrion haproxy[1305572]: Server k8s-api-server/bettley is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.423831-05:00 allyrion haproxy[1305572]: [WARNING] (1305572) : Server k8s-api-server/fenn is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.424229-05:00 allyrion haproxy[1305572]: Server k8s-api-server/fenn is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.424446-05:00 allyrion haproxy[1305572]: Server k8s-api-server/fenn is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:17.653803-05:00 allyrion haproxy[1305572]: Connect from 10.0.2.162:53155 to 10.4.0.10:8404 (stats-frontend/HTTP)
2024-02-21T07:03:17.677482-05:00 allyrion haproxy[1305572]: Connect from 10.0.2.162:53156 to 10.4.0.10:8404 (stats-frontend/HTTP)
2024-02-21T07:03:18.114561-05:00 allyrion haproxy[1305572]: [WARNING] (1305572) : Server k8s-api-server/gardener is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:18.115141-05:00 allyrion haproxy[1305572]: [ALERT] (1305572) : backend 'k8s-api-server' has no server available!
2024-02-21T07:03:18.115560-05:00 allyrion haproxy[1305572]: Server k8s-api-server/gardener is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:18.116133-05:00 allyrion haproxy[1305572]: Server k8s-api-server/gardener is DOWN, reason: Layer4 connection problem, info: "Connection refused at initial connection step of tcp-check", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-02-21T07:03:18.117560-05:00 allyrion haproxy[1305572]: backend k8s-api-server has no server available!
2024-02-21T07:03:18.118458-05:00 allyrion haproxy[1305572]: backend k8s-api-server has no server available!
This is fine and dandy, and will be addressed in future chapters.
Container Runtime
Kubernetes is a container orchestration platform and therefore requires some container runtime to be installed.
This is a simple step; containerd is well-supported, well-regarded, and I don't have any reason not to use it.
I used Jeff Geerling's Ansible role to install and configure containerd on my cluster; this is really the point at which some kind of IaC/configuration management system becomes something more than a polite suggestion 🙂
Configuration Details
The containerd installation and configuration is managed through several key components:
Ansible Role Configuration
The geerlingguy.containerd role is specified in my requirements.yml and configured with these critical variables in group_vars/all/vars.yaml:
# geerlingguy.containerd configuration
containerd_package: 'containerd.io'
containerd_package_state: 'present'
containerd_service_state: 'started'
containerd_service_enabled: true
containerd_config_cgroup_driver_systemd: true # Critical for Kubernetes integration
Runtime Integration with Kubernetes
The most important aspect of the containerd configuration is its integration with Kubernetes. The cluster explicitly configures the CRI socket path:
kubernetes:
  cri_socket_path: 'unix:///var/run/containerd/containerd.sock'
This socket path is used throughout the kubeadm initialization and join processes, ensuring Kubernetes can communicate with the container runtime.
Systemd Cgroup Driver
The configuration sets SystemdCgroup = true in the containerd configuration file (/etc/containerd/config.toml), which is essential because:
- Kubernetes 1.22+: kubeadm defaults the kubelet to the systemd cgroup driver on systemd hosts
- Consistency: Both kubelet and containerd must use the same cgroup driver
- Resource Management: Enables proper CPU/memory limits enforcement
Generated Configuration
The Ansible role generates a complete containerd configuration with these key settings:
# Runtime configuration
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true # Critical for Kubernetes cgroup management
# Socket configuration
[grpc]
address = "/run/containerd/containerd.sock"
Installation Process
The Ansible role performs these steps:
- Repository Setup: Adds the Docker CE repository (the source of the containerd.io package)
- Package Installation: Installs the containerd.io package
- Default Config Generation: Runs containerd config default to generate the base config
- Systemd Cgroup Modification: Patches the config to set SystemdCgroup = true
- Service Management: Enables and starts the containerd service
Architecture Support
The configuration automatically handles ARM64 architecture for the Raspberry Pi nodes through architecture detection in the Ansible variables, ensuring proper package selection for both ARM64 (Pi nodes) and AMD64 (x86 nodes).
Troubleshooting Tools
The installation also provides crictl (the Container Runtime Interface CLI) for debugging and inspecting containers directly at the runtime level, which proves invaluable when troubleshooting Kubernetes pod issues.
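For example, with crictl pointed at the containerd socket above (via /etc/crictl.yaml or the --runtime-endpoint flag), a few commands that come in handy; treat this as a sketch rather than a tutorial:
$ sudo crictl ps                    # running containers
$ sudo crictl pods                  # pod sandboxes on this node
$ sudo crictl images                # images pulled by containerd
$ sudo crictl logs <container-id>   # logs straight from the runtime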
The container runtime installation is handled in my install_k8s_packages.yaml playbook, which is where we'll be spending some time in subsequent sections.
Networking
Kubernetes uses three different networks:
- Infrastructure: The physical or virtual backbone connecting the machines hosting the nodes. The infrastructure network enables connectivity between the nodes; this is essential for the Kubernetes control plane components (like the kube-apiserver, etcd, scheduler, and controller-manager) and the worker nodes to communicate with each other. Although pods communicate with each other via the pod network (overlay network), the underlying infrastructure network supports this by facilitating the physical or virtual network paths between nodes.
- Service: This is a purely virtual and internal network. It allows services to communicate with each other and with Pods seamlessly. This network layer abstracts the actual network details from the services, providing a consistent and simplified interface for inter-service communication. When a Service is created, it is automatically assigned a unique IP address from the service network's address space. This IP address is stable for the lifetime of the Service, even if the Pods that make up the Service change. This stable IP address makes it easier to configure DNS or other service discovery mechanisms.
- Pod: This is a crucial component that allows for seamless communication between pods across the cluster, regardless of which node they are running on. This networking model is designed to ensure that each pod gets its own unique IP address, making it appear as though each pod is on a flat network where every pod can communicate with every other pod directly without NAT.
My infrastructure network is already up and running at 10.4.0.0/20. I'll configure my service network at 172.16.0.0/20 and my pod network at 192.168.0.0/16.
Network Architecture Implementation
CIDR Block Allocations
The goldentooth cluster uses a carefully planned network segmentation strategy:
- Infrastructure Network: 10.4.0.0/20 - Physical network backbone
- Service Network: 172.16.0.0/20 - Kubernetes virtual services
- Pod Network: 192.168.0.0/16 - Container-to-container communication
- MetalLB Range: 10.4.11.0/24 - Load balancer service IPs
Physical Network Topology
The cluster consists of:
Control Plane Nodes (High Availability):
- bettley (10.4.0.11), cargyll (10.4.0.12), dalt (10.4.0.13)
Load Balancer and Services:
- allyrion (10.4.0.10) - HAProxy load balancer, NFS server
Worker Nodes:
- 8 Raspberry Pi ARM64 workers: erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps
- 1 x86 GPU worker: velaryon (10.4.0.30)
CNI Implementation: Calico
The cluster uses Calico as the Container Network Interface (CNI) plugin. Calico is configured during the kubeadm initialization:
kubeadm init \
--control-plane-endpoint="10.4.0.10:6443" \
--service-cidr="172.16.0.0/20" \
--pod-network-cidr="192.168.0.0/16" \
--kubernetes-version="stable-1.32"
Calico provides:
- Layer 3 networking with BGP routing
- Network policies for microsegmentation
- Cross-node pod communication without overlay networks
- Integration with the existing BGP infrastructure
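Once Calico is applied, a quick sanity check might look like the following; the namespace depends on whether you install the raw manifest or the Tigera operator, so both are shown as possibilities:
$ kubectl get pods -n kube-system -o wide   # manifest installs run calico-node here
$ kubectl get pods -n calico-system         # operator installs use this namespace instead
$ kubectl get nodes -o wide                 # nodes should report Ready once the CNI is up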
Load Balancer Architecture
HAProxy Configuration: The cluster uses HAProxy running on allyrion (10.4.0.10) to provide high availability for the Kubernetes API server:
- Frontend: Listens on port 6443
- Backend: Round-robin load balancing across all three control plane nodes
- Health Checks: TCP-based health checks with fall=3, rise=2 configuration
- Monitoring: Prometheus metrics endpoint on port 8405
This ensures the cluster remains accessible even if individual control plane nodes fail.
BGP Integration with MetalLB
The cluster implements BGP-based load balancing using MetalLB:
Router Configuration (OPNsense with FRR):
- Router AS Number: 64500
- Cluster AS Number: 64501
- BGP Peer: Router at 10.4.0.1
MetalLB Configuration:
spec:
myASN: 64501
peerASN: 64500
peerAddress: 10.4.0.1
addressPool: '10.4.11.0 - 10.4.15.254'
This allows Kubernetes LoadBalancer services to receive real IP addresses that are automatically routed through the network infrastructure.
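To confirm MetalLB picked up its pool and peered with the router, something along these lines works (these assume the standard metallb-system namespace and the metallb.io CRDs):
$ kubectl -n metallb-system get pods                        # the controller and per-node speakers
$ kubectl -n metallb-system get ipaddresspools,bgppeers     # the pool and BGP peer described above
$ kubectl get svc -A | grep LoadBalancer                    # services that received a 10.4.11.x address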
Static Route Management
The networking Ansible role automatically:
- Detects the primary network interface using ip route show 10.4.0.0/20
- Adds static routes for the MetalLB range: ip route add 10.4.11.0/24 dev <interface>
- Persists routes in /etc/network/interfaces.d/<interface>.cfg for boot persistence
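Concretely, on a given node that boils down to something like this (eth0 is an assumption; use whatever interface the first command reports):
$ ip route show 10.4.0.0/20                 # find the interface that serves the infrastructure network
$ sudo ip route add 10.4.11.0/24 dev eth0   # route the MetalLB range out that interface
$ ip route get 10.4.11.1                    # verify which route an address in the range would take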
Service Discovery and DNS
The cluster implements comprehensive service discovery:
- Cluster Domain: goldentooth.net
- Node Domain: nodes.goldentooth.net
- Services Domain: services.goldentooth.net
- External DNS: Automated DNS record management via the external-dns operator
Network Security
Certificate-Based Security:
- Step-CA: Provides automated certificate management for all services
- TLS Everywhere: All inter-service communication is encrypted
- SSH Certificates: Automated SSH certificate provisioning
Service Mesh Integration:
- Consul: Provides service discovery and health checking across both Kubernetes and Nomad
- Network Policies: Configured but not strictly enforced by default
Multi-Orchestrator Networking
The cluster supports both Kubernetes and HashiCorp Nomad workloads on the same physical network:
- Kubernetes: Calico CNI with BGP routing
- Nomad: Bridge networking with Consul Connect service mesh
- Vault: Network-based authentication and secrets distribution
Monitoring Network Integration
Observability Stack:
- Prometheus: Scrapes metrics across all network endpoints
- Grafana: Centralized dashboards accessible via MetalLB LoadBalancer
- Loki: Log aggregation with Vector log shipping across nodes
- Node Exporter: Per-node metrics collection
With this network architecture decided and implemented, we can move forward to the next phase of cluster construction.
Configuring Packages
Rather than YOLOing binaries onto our nodes like heathens, we'll use Apt and Ansible.
I wrote the above line before a few hours or so of fighting with Apt, Ansible, the repository signing key, documentation on the greater internet, my emotions, etc.
The long and short of it is that apt-key add is deprecated in Debian and Ubuntu, and consequently ansible.builtin.apt_key should be deprecated, but cannot be at this time for backward compatibility with older versions of Debian and Ubuntu and other derivative distributions.
The reason for this deprecation, as I understand it, is that apt-key add adds a key to /etc/apt/trusted.gpg.d. Keys here can be used to sign any package, including a package downloaded from an official distro package repository. This weakens our defenses against supply-chain attacks.
The new recommendation is to add the key to /etc/apt/keyrings, where it will be used where appropriate but not, apparently, to validate packages from the official distro repositories.
A further complication is that the Kubernetes project has moved its package repositories a time or two and completely rewrote the repository structure.
As a result, if you Google™, you will find a number of ways of using Ansible or a shell command to configure the Kubernetes apt repository on Debian/Ubuntu/Raspberry Pi OS, but none of them are optimal.
The Desired End-State
Here are my expectations:
- use the new deb822 format, not the old sources.list format
- preserve idempotence
- don't point to deprecated package repositories
- actually work
Existing solutions failed at one or all of these.
For the record, what we're trying to create is:
- a file located at /etc/apt/keyrings/kubernetes.asc containing the Kubernetes package repository signing key
- a file located at /etc/apt/sources.list.d/kubernetes.sources containing information about the Kubernetes package repository
The latter should look something like the following:
X-Repolib-Name: kubernetes
Types: deb
URIs: https://pkgs.k8s.io/core:/stable:/v1.29/deb/
Suites: /
Architectures: arm64
Signed-By: /etc/apt/keyrings/kubernetes.asc
The Solution
After quite some time and effort and suffering, I arrived at a solution.
You can review the original task file for changes, but I'm embedding it here because it was weirdly a nightmare to arrive at a working solution.
I've edited this only to substitute strings for the variables that point to them, so it should be a working solution more-or-less out-of-the-box.
---
- name: 'Install packages needed to use the Kubernetes Apt repository.'
  ansible.builtin.apt:
    name:
      - 'apt-transport-https'
      - 'ca-certificates'
      - 'curl'
      - 'gnupg'
      - 'python3-debian'
    state: 'present'

- name: 'Add Kubernetes repository.'
  ansible.builtin.deb822_repository:
    name: 'kubernetes'
    types:
      - 'deb'
    uris:
      - "https://pkgs.k8s.io/core:/stable:/v1.29/deb/"
    suites:
      - '/'
    architectures:
      - 'arm64'
    signed_by: "https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key"
After this, you will of course need to update your Apt cache and install the three Kubernetes tools we'll use shortly: kubeadm, kubectl, and kubelet.
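If you're doing it by hand rather than via Ansible, that amounts to roughly:
$ sudo apt update
$ sudo apt install -y kubeadm kubectl kubelet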
Installing Packages
Now that we have functional access to the Kubernetes Apt package repository, we can install some important Kubernetes tools:
- kubeadm provides a straightforward way to set up and configure a Kubernetes cluster (API server, Controller Manager, DNS, etc.). Kubernetes the Hard Way basically does what kubeadm does. I use kubeadm because my goal is to go not necessarily deeper, but farther.
- kubectl is a CLI tool for administering a Kubernetes cluster; you can deploy applications, inspect resources, view logs, etc. As I'm studying for my CKA, I want to use kubectl for as much as possible.
- kubelet runs on each and every node in the cluster and ensures that pods are functioning as desired, taking steps to correct their behavior when it does not match the desired state.
Package Installation Implementation
Kubernetes Package Configuration
The package installation is managed through Ansible variables in group_vars/all/vars.yaml
:
kubernetes_version: '1.32'
kubernetes:
  apt_packages:
    - 'kubeadm'
    - 'kubectl'
    - 'kubelet'
  apt_repo_url: "https://pkgs.k8s.io/core:/stable:/v{{ kubernetes_version }}/deb/"
This configuration:
- Version management: Uses Kubernetes 1.32 (latest stable at time of writing)
- Repository pinning: Uses version-specific repository for consistency
- Package selection: Core Kubernetes tools required for cluster operation
Installation Process
The installation is handled by the install_k8s_packages.yaml
playbook, which performs these steps:
1. Container Runtime Setup:
- name: 'Setup `containerd`.'
  hosts: 'k8s_cluster'
  remote_user: 'root'
  roles:
    - { role: 'geerlingguy.containerd' }
This ensures containerd is installed and configured before Kubernetes packages.
2. Package Installation:
- name: 'Install Kubernetes packages.'
  ansible.builtin.apt:
    name: "{{ kubernetes.apt_packages }}"
    state: 'present'
  notify:
    - 'Hold Kubernetes packages.'
    - 'Enable and restart kubelet service.'
3. Package Hold Management:
- name: 'Hold Kubernetes packages.'
  ansible.builtin.dpkg_selections:
    name: "{{ package }}"
    selection: 'hold'
  loop: "{{ kubernetes.apt_packages }}"
This prevents accidental upgrades during regular system updates, ensuring cluster stability.
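The manual equivalent, for reference:
$ sudo apt-mark hold kubeadm kubectl kubelet   # pin the packages
$ apt-mark showhold | grep kube                # confirm the holds took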
Service Configuration
kubelet Service Activation:
- name: 'Enable and restart kubelet service.'
  ansible.builtin.systemd_service:
    name: 'kubelet'
    state: 'restarted'
    enabled: true
    daemon_reload: true
Key features:
- Auto-start: Enables kubelet to start automatically on boot
- Service restart: Ensures kubelet starts with new configuration
- Daemon reload: Refreshes systemd to recognize any unit file changes
Target Nodes
The installation targets the k8s_cluster inventory group, which includes:
- Control plane nodes: bettley, cargyll, dalt (3 nodes)
- Worker nodes: All remaining Raspberry Pi nodes + velaryon GPU node (10 nodes)
This ensures all cluster nodes have consistent Kubernetes tooling.
Version Management Strategy
Repository Strategy:
- Version-pinned repositories: Uses the v1.32-specific repository
- Package holds: Prevents accidental upgrades via dpkg --set-selections
- Coordinated updates: Cluster-wide version management through Ansible
Upgrade Process:
- Update the kubernetes_version variable
- Run the install_k8s_packages.yaml playbook
- Coordinate the cluster upgrade using kubeadm upgrade (see the sketch below)
- Update containerd and other runtime components as needed
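On the kubeadm side, the actual upgrade looks roughly like the sketch below; this is the standard kubeadm flow rather than anything specific to this cluster, the target version is a placeholder, and the first command set runs on the first control plane node:
$ sudo apt-mark unhold kubeadm && sudo apt install -y kubeadm && sudo apt-mark hold kubeadm
$ sudo kubeadm upgrade plan              # see which versions the cluster can move to
$ sudo kubeadm upgrade apply v1.32.x     # upgrade the first control plane node (placeholder version)
$ sudo kubeadm upgrade node              # on each remaining control plane and worker node
$ sudo apt-mark unhold kubelet kubectl && sudo apt install -y kubelet kubectl && sudo apt-mark hold kubelet kubectl
$ sudo systemctl daemon-reload && sudo systemctl restart kubelet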
Integration with Container Runtime
The playbook ensures proper integration between Kubernetes and containerd:
Runtime Configuration:
- CRI socket: /var/run/containerd/containerd.sock
- Cgroup driver: systemd (required for Kubernetes 1.22+)
- Image service: containerd handles container image management
Service Dependencies:
- containerd must be running before kubelet starts
- kubelet configured to use containerd as container runtime
- Proper systemd service ordering ensures reliable startup
Command Line Integration
The installation integrates with the goldentooth CLI:
# Install Kubernetes packages across cluster
goldentooth install_k8s_packages
# Uninstall if needed (cleanup)
goldentooth uninstall_k8s_packages
Post-Installation Verification
After installation, you can verify the tools are properly installed:
# Check versions
goldentooth command k8s_cluster 'kubeadm version'
goldentooth command k8s_cluster 'kubectl version --client'
goldentooth command k8s_cluster 'kubelet --version'
# Verify package holds
goldentooth command k8s_cluster 'apt-mark showhold | grep kube'
Installing these tools by hand is comparatively simple (just sudo apt-get install -y kubeadm kubectl kubelet), but the Ansible implementation adds important production considerations like version pinning, service management, and cluster-wide coordination that manual installation would miss.
kubeadm init
kubeadm does a wonderful job of simplifying Kubernetes cluster bootstrapping (if you don't believe me, just read Kubernetes the Hard Way), but there's still a decent amount of work involved. Since we're creating a high-availability cluster, we need to do some magic to convey secrets between the control plane nodes, generate join tokens for the worker nodes, etc.
So, we will:
- run kubeadm on the first control plane node
- copy some data around
- run a different kubeadm command to join the rest of the control plane nodes to the cluster
- copy some more data around
- run a different kubeadm command to join the worker nodes to the cluster

and then we're done!
kubeadm init takes a number of command-line arguments.
You can look at the actual Ansible tasks bootstrapping my cluster, but this is what my command evaluates out to:
kubeadm init \
--control-plane-endpoint="10.4.0.10:6443" \
--kubernetes-version="stable-1.29" \
--service-cidr="172.16.0.0/20" \
--pod-network-cidr="192.168.0.0/16" \
--cert-dir="/etc/kubernetes/pki" \
--cri-socket="unix:///var/run/containerd/containerd.sock" \
--upload-certs
I'll break that down line by line:
# Run through all of the phases of initializing a Kubernetes control plane.
kubeadm init \
# Requests should target the load balancer, not this particular node.
--control-plane-endpoint="10.4.0.10:6443" \
# We don't need any more instability than we already have.
# At time of writing, 1.29 is the current release.
--kubernetes-version="stable-1.29" \
# As described in the chapter on Networking, this is the CIDR from which
# service IP addresses will be allocated.
# This gives us 4,094 IP addresses to work with.
--service-cidr="172.16.0.0/20" \
# As described in the chapter on Networking, this is the CIDR from which
# pod IP addresses will be allocated.
# This gives us 65,534 IP addresses to work with.
--pod-network-cidr="192.168.0.0/16" \
# This is the directory that will host TLS certificates, keys, etc for
# cluster communication.
--cert-dir="/etc/kubernetes/pki" \
# This is the URI of the container runtime interface socket, which allows
# direct interaction with the container runtime.
--cri-socket="unix:///var/run/containerd/containerd.sock" \
# Upload certificates into the appropriate secrets, rather than making us
# do that manually.
--upload-certs
Oh, you thought I was just going to blow right by this, didncha? No, this ain't Kubernetes the Hard Way, but I do want to make an effort to understand what's going on here. So here, courtesy of kubeadm init --help
, is the list of phases that kubeadm
runs through by default.
preflight Run pre-flight checks
certs Certificate generation
/ca Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
/apiserver Generate the certificate for serving the Kubernetes API
/apiserver-kubelet-client Generate the certificate for the API server to connect to kubelet
/front-proxy-ca Generate the self-signed CA to provision identities for front proxy
/front-proxy-client Generate the certificate for the front proxy client
/etcd-ca Generate the self-signed CA to provision identities for etcd
/etcd-server Generate the certificate for serving etcd
/etcd-peer Generate the certificate for etcd nodes to communicate with each other
/etcd-healthcheck-client Generate the certificate for liveness probes to healthcheck etcd
/apiserver-etcd-client Generate the certificate the apiserver uses to access etcd
/sa Generate a private key for signing service account tokens along with its public key
kubeconfig Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
/admin Generate a kubeconfig file for the admin to use and for kubeadm itself
/super-admin Generate a kubeconfig file for the super-admin
/kubelet Generate a kubeconfig file for the kubelet to use *only* for cluster bootstrapping purposes
/controller-manager Generate a kubeconfig file for the controller manager to use
/scheduler Generate a kubeconfig file for the scheduler to use
etcd Generate static Pod manifest file for local etcd
/local Generate the static Pod manifest file for a local, single-node local etcd instance
control-plane Generate all static Pod manifest files necessary to establish the control plane
/apiserver Generates the kube-apiserver static Pod manifest
/controller-manager Generates the kube-controller-manager static Pod manifest
/scheduler Generates the kube-scheduler static Pod manifest
kubelet-start Write kubelet settings and (re)start the kubelet
upload-config Upload the kubeadm and kubelet configuration to a ConfigMap
/kubeadm Upload the kubeadm ClusterConfiguration to a ConfigMap
/kubelet Upload the kubelet component config to a ConfigMap
upload-certs Upload certificates to kubeadm-certs
mark-control-plane Mark a node as a control-plane
bootstrap-token Generates bootstrap tokens used to join a node to a cluster
kubelet-finalize Updates settings relevant to the kubelet after TLS bootstrap
/experimental-cert-rotation Enable kubelet client certificate rotation
addon Install required addons for passing conformance tests
/coredns Install the CoreDNS addon to a Kubernetes cluster
/kube-proxy Install the kube-proxy addon to a Kubernetes cluster
show-join-command Show the join command for control-plane and worker node
So now I will go through each of these in turn to explain how the cluster is created.
kubeadm init phases
preflight
The preflight phase performs a number of checks of the environment to ensure it is suitable. These aren't, as far as I can tell, documented anywhere -- perhaps because documentation would inevitably drift out of sync with the code rather quickly. And, besides, we're engineers and this is an open-source project; if we care that much, we can just read the source code!
But I'll go through and mention a few of these checks, just for the sake of discussion and because there are some important concepts.
- Networking: It checks that certain ports are available and firewall settings do not prevent communication.
- Container Runtime: It requires a container runtime, since... Kubernetes is a container orchestration platform.
- Swap: Historically, Kubernetes has balked at running on a system with swap enabled, for performance and stability reasons, but this has been lifted recently.
- Uniqueness: It checks that each hostname is different in order to prevent networking conflicts.
- Kernel Parameters: It checks for certain cgroups (see the Node configuration chapter for more information). It used to check for some networking parameters as well, to ensure traffic can flow properly, but it appears this might not be a thing anymore in 1.30.
certs
This phase generates important certificates for communication between cluster components.
/ca
This generates a self-signed certificate authority that will be used to provision identities for all of the other Kubernetes components, and lays the groundwork for the security and reliability of their communication by ensuring that all components are able to trust one another.
By generating its own root CA, a Kubernetes cluster can be self-sufficient in managing the lifecycle of the certificates it uses for TLS. This includes generating, distributing, rotating, and revoking certificates as needed. This autonomy simplifies the setup and ongoing management of the cluster, especially in environments where integrating with an external CA might be challenging.
It's worth mentioning that this includes client certificates as well as server certificates, since client certificates aren't currently as well-known and ubiquitous as server certificates. So just as the API server has a server certificate that allows clients making requests to verify its identity, so clients will have a client certificate that allows the server to verify their identity.
So these certificate relationships provide confidentiality, integrity, and authentication (a slight riff on the classic CIA triad, whose "A" usually stands for availability) by:
- encrypting the data transmitted between the client and the server (Confidentiality)
- preventing tampering with the data transmitted between the client and the server (Integrity)
- verifying the identity of the server and the client (Authentication)
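A quick way to see these relationships on a control plane node is to poke at the certificates themselves. This is just a sketch, assuming the default kubeadm layout under /etc/kubernetes/pki and an OpenSSL new enough (1.1.1+) to support the -ext flag:
# The API server's serving certificate: issued by the cluster CA, with SANs
# that should include the load balancer address (10.4.0.10 in this cluster).
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -subject -issuer -ext subjectAltName
# The client certificate the API server presents when it talks to kubelets.
openssl x509 -in /etc/kubernetes/pki/apiserver-kubelet-client.crt -noout -subject -issuer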
/apiserver
The Kubernetes API server is the central management entity of the cluster. The Kubernetes API allows users, internal components, and external processes to report on and manage the state of the cluster. The API server accepts, validates, and executes REST operations, and is the only cluster component that interacts with etcd directly. etcd is the source of truth within the cluster, so it is essential that communication with the API server be secure.
/apiserver-kubelet-client
This is a client certificate for the API server, ensuring that it can authenticate itself to each kubelet and prove that it is a legitimate source of commands and requests.
/front-proxy-ca and /front-proxy-client
These certificates support the API aggregation layer: the front proxy CA provisions identities for the front proxy, and the front proxy client certificate is what the aggregator (the kube-apiserver) presents when it forwards requests to an extension API server. This is beyond the scope of this project.
/etcd-ca
etcd can be configured to run "stacked" (deployed onto the control plane) or as an external cluster. For various reasons (security via isolation, access control, simplified rotation and management, etc), etcd is provided its own certificate authority.
/etcd-server
This is a server certificate for each etcd node, assuring the Kubernetes API server and etcd peers of its identity.
/etcd-peer
This is a client and server certificate, distributed to each etcd node, that enables them to communicate securely with one another.
/etcd-healthcheck-client
This is a client certificate that enables the caller to probe etcd. It permits broader access, in that multiple clients can use it, but the degree of that access is very restricted.
/apiserver-etcd-client
This is a client certificate permitting the API server to communicate with etcd.
/sa
This is a public and private key pair that is used for signing service account tokens.
Service accounts are used to provide an identity for processes that run in a Pod, permitting them to interact securely with the API server.
Service account tokens are JWTs (JSON Web Tokens). When a Pod accesses the Kubernetes API, it can present a service account token as a bearer token in the HTTP Authorization header. The API server then uses the public key to verify the signature on the token, and can then evaluate whether the claims are valid, etc.
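To make that concrete, here's a minimal sketch of what a Pod actually does with its token; the paths are the standard projected-volume locations, and whether the request is allowed depends entirely on the RBAC bound to the service account:
# Run from inside a Pod: the kubelet mounts the token and the cluster CA here.
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
CACERT="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
# Present the token as a bearer token; the API server verifies its signature
# with the public half of the /sa key pair, then evaluates RBAC.
curl --cacert "${CACERT}" \
  -H "Authorization: Bearer ${TOKEN}" \
  https://kubernetes.default.svc/api/v1/namespaces/default/pods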
kubeconfig
These phases write the necessary configuration files to secure and facilitate communication within the cluster and between administrator tools (like kubectl) and the cluster.
/admin
This is the kubeconfig file for the cluster administrator. It provides the admin user with full access to the cluster.
Now, per a change in 1.29, as Rory McCune explains, this admin credential is no longer a member of system:masters and instead has access granted via RBAC. This means that access can be revoked without having to manually rotate all of the cluster certificates.
/super-admin
This new credential also provides full access to the cluster, but via the system:masters group mechanism (read: irrevocable without rotating certificates). This also explains why, when watching my cluster spin up while using the admin.conf credentials, a time or two I saw access denied errors!
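Out of curiosity, we can check which group each credential's client certificate actually belongs to by decoding it out of the kubeconfig. This is a sketch; on 1.29+ I'd expect admin.conf to show an organization of kubeadm:cluster-admins and super-admin.conf to show system:masters.
# Pull the embedded client certificate out of each kubeconfig and print its subject.
for conf in admin.conf super-admin.conf; do
  echo "== ${conf}"
  grep 'client-certificate-data' "/etc/kubernetes/${conf}" \
    | awk '{print $2}' | base64 -d \
    | openssl x509 -noout -subject
done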
/kubelet
This credential is for use with the kubelet during cluster bootstrapping. It provides a baseline cluster-wide configuration for all kubelets in the cluster. It points to the client certificates that allow the kubelet to communicate with the API server so we can propagate cluster-level configuration to each kubelet.
/controller-manager
This credential is used by the Controller Manager. The Controller Manager is responsible for running controller processes, which watch the state of the cluster through the API server and make changes attempting to move the current state towards the desired state. This file contains credentials that allow the Controller Manager to communicate securely with the API server.
/scheduler
This credential is used by the Kubernetes Scheduler. The Scheduler is responsible for assigning work, in the form of Pods, to different nodes in the cluster. It makes these decisions based on resource availability, workload requirements, and other policies. This file contains the credentials needed for the Scheduler to interact with the API server.
etcd
This phase generates the static pod manifest file for local etcd.
Static pod manifests are files kept in (in our case) /etc/kubernetes/manifests; the kubelet observes this directory and will start/replace/delete pods accordingly. In the case of a "stacked" cluster, where we have critical control plane components like etcd and the API server running within pods, we need some method of creating and managing pods without those components. Static pod manifests provide this capability.
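On any of my control plane nodes, that directory looks roughly like this (the filenames are kubeadm's defaults):
# The kubelet watches this directory; adding or removing a manifest here
# creates or deletes the corresponding static pod.
ls /etc/kubernetes/manifests/
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml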
/local
This phase configures a local etcd instance to run on the same node as the other control plane components. This is what we'll be doing; later, when we join additional nodes to the control plane, the etcd cluster will expand.
For instance, the static pod manifest file for etcd on bettley, my first control plane node, has a spec.containers[0].command that looks like this:
....
- command:
- etcd
- --advertise-client-urls=https://10.4.0.11:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://10.4.0.11:2380
- --initial-cluster=bettley=https://10.4.0.11:2380
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://10.4.0.11:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://10.4.0.11:2380
- --name=bettley
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
....
whereas on fenn, the second control plane node, the corresponding static pod manifest file looks like this:
- command:
- etcd
- --advertise-client-urls=https://10.4.0.15:2379
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --client-cert-auth=true
- --data-dir=/var/lib/etcd
- --experimental-initial-corrupt-check=true
- --experimental-watch-progress-notify-interval=5s
- --initial-advertise-peer-urls=https://10.4.0.15:2380
- --initial-cluster=fenn=https://10.4.0.15:2380,gardener=https://10.4.0.16:2380,bettley=https://10.4.0.11:2380
- --initial-cluster-state=existing
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --listen-client-urls=https://127.0.0.1:2379,https://10.4.0.15:2379
- --listen-metrics-urls=http://127.0.0.1:2381
- --listen-peer-urls=https://10.4.0.15:2380
- --name=fenn
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-client-cert-auth=true
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --snapshot-count=10000
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
and correspondingly, we can see three pods:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
etcd-bettley 1/1 Running 19 3h23m
etcd-fenn 1/1 Running 0 3h22m
etcd-gardener 1/1 Running 0 3h23m
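Since etcdctl ships in the etcd image, we can also ask etcd itself about its membership by exec'ing into one of those pods; the pod name below assumes my node names, and the certificate paths are the kubeadm defaults:
# List the etcd cluster members from inside the etcd pod on bettley.
kubectl -n kube-system exec etcd-bettley -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table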
control-plane
This phase generates the static pod manifest files for the other (non-etcd) control plane components.
/apiserver
This generates the static pod manifest file for the API server, which we've already discussed quite a bit.
/controller-manager
This generates the static pod manifest file for the controller manager. The controller manager embeds the core control loops shipped with Kubernetes. A controller is a loop that watches the shared state of the cluster through the API server and makes changes attempting to move the current state towards the desired state. Examples of controllers that are part of the Controller Manager include the Replication Controller, Endpoints Controller, Namespace Controller, and ServiceAccounts Controller.
/scheduler
This phase generates the static pod manifest file for the scheduler. The scheduler is responsible for allocating pods to nodes in the cluster based on various scheduling principles, including resource availability, constraints, affinities, etc.
kubelet-start
Throughout this process, the kubelet has been in a crash loop because it hasn't had a valid configuration.
This phase generates a config which (at least on my system) is stored at /var/lib/kubelet/config.yaml, as well as a "bootstrap" configuration that allows the kubelet to connect to the control plane (and retrieve credentials for long-term use).
Then the kubelet is restarted and will bootstrap with the control plane.
upload-certs
This phase enables the secure distribution of the certificates we created above, in the certs phases.
Some certificates need to be shared across the cluster (or at least across the control plane) for secure communication. This includes the certificates for the API server, etcd, the front proxy, etc.
kubeadm generates an encryption key that is used to encrypt the certificates, so they're not actually exposed in plain text at any point. Then the encrypted certificates are uploaded to etcd, a distributed key-value store that Kubernetes uses for persisting cluster state. To facilitate future joins of control plane nodes without having to manually distribute certificates, these encrypted certificates are stored in a specific kubeadm-certs secret.
The encryption key is required to decrypt the certificates for use by joining nodes. This key is not uploaded to the cluster for security reasons. Instead, it must be manually shared with any future control plane nodes that join the cluster. kubeadm outputs this key upon completion of the upload-certs phase, and it's the administrator's responsibility to securely transfer this key when adding new control plane nodes.
This process allows for the secure addition of new control plane nodes to the cluster by ensuring they have access to the necessary certificates to communicate securely with the rest of the cluster. Without this phase, administrators would have to manually copy certificates to each new node, which can be error-prone and insecure.
By automating the distribution of these certificates and utilizing encryption for their transfer, kubeadm significantly simplifies the process of scaling the cluster's control plane, while maintaining high standards of security.
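One wrinkle worth noting: the kubeadm-certs secret (and with it the certificate key) expires after two hours by default, so if I come back later to add another control plane node, I can re-run just this phase to re-upload the certificates and print a fresh key:
# Re-upload the control plane certificates and print a new certificate key,
# without re-running the rest of kubeadm init.
kubeadm init phase upload-certs --upload-certs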
mark-control-plane
In this phase, kubeadm applies a specific label to control plane nodes: node-role.kubernetes.io/control-plane=""; this marks the node as part of the control plane. Additionally, the node receives a taint, node-role.kubernetes.io/control-plane:NoSchedule, which will prevent normal workloads from being scheduled on it.
As noted previously, I see no reason to remove this taint, although I'll probably enable some tolerations for certain workloads (monitoring, etc).
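Verifying the label and the taint is straightforward; the node name below is, of course, specific to my cluster:
# Show nodes carrying the control plane role label.
kubectl get nodes -l node-role.kubernetes.io/control-plane
# Inspect the taint on a specific control plane node.
kubectl describe node bettley | grep -A1 'Taints:'
# If I ever change my mind, the trailing '-' removes the taint:
# kubectl taint nodes bettley node-role.kubernetes.io/control-plane:NoSchedule-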
bootstrap-token
This phase creates bootstrap tokens, which are used to authenticate new nodes joining the cluster. This is how we are able to easily scale the cluster dynamically without copying multiple certificates and private keys around.
The "TLS bootstrap" process allows a kubelet to automatically request a certificate from the Kubernetes API server. This certificate is then used for secure communication within the cluster. The process involves the use of a bootstrap token and a Certificate Signing Request (CSR) that the kubelet generates. Once approved, the kubelet receives a certificate and key that it uses for authenticated communication with the API server.
Bootstrap tokens are simple bearer tokens, composed of two parts: an ID and a secret, formatted as <id>.<secret>. The ID and secret are randomly generated strings that authenticate joining nodes to the cluster.
The generated token is configured with specific permissions using RBAC policies. These permissions typically allow the token to create a certificate signing request (CSR) that the Kubernetes control plane can then approve, granting the joining node the necessary certificates to communicate securely within the cluster.
By default, bootstrap tokens are set to expire after a certain period (24 hours by default), ensuring that tokens cannot be reused indefinitely for joining new nodes to the cluster. This behavior enhances the security posture of the cluster by limiting the window during which a token is valid.
Once generated and configured, the bootstrap token is stored as a secret in the kube-system namespace.
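Both views of the tokens are easy to inspect, which is handy when a join fails because a token has quietly expired:
# The kubeadm view: token IDs, expiry, and usages.
kubeadm token list
# The same tokens as they live in the cluster, stored as bootstrap-token secrets.
kubectl -n kube-system get secrets --field-selector type=bootstrap.kubernetes.io/token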
kubelet-finalize
This phase ensures that the kubelet is fully configured with the necessary settings to securely and effectively participate in the cluster. It involves applying any final kubelet configurations that might depend on the completion of the TLS bootstrap process.
addon
This phase sets up essential add-ons required for the cluster to meet the Kubernetes Conformance Tests.
/coredns
CoreDNS provides DNS services for the internal cluster network, allowing pods to find each other by name and services to load-balance across a set of pods.
/kube-proxy
kube-proxy is responsible for managing network communication inside the cluster, implementing part of the Kubernetes Service concept by maintaining network rules on nodes. These rules allow network communication to pods from network sessions inside or outside the cluster.
kube-proxy ensures that the networking aspect of Kubernetes Services is correctly handled, routing traffic to the appropriate destinations. It typically runs in iptables mode (or IPVS mode), manipulating kernel rules to steer service traffic; the legacy userspace mode has been deprecated. This allows Services to be exposed, including to the external network, and load-balances traffic across the multiple pod instances behind each Service.
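If you're curious which mode a kubeadm-built cluster is actually using, the kube-proxy ConfigMap records it (an empty value means the iptables default), and the NAT table shows the rules it maintains. A quick sketch:
# Check the configured proxy mode (an empty string means the iptables default).
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -m1 'mode:'
# Peek at the Service rules kube-proxy programs into the nat table on a node.
sudo iptables -t nat -L KUBE-SERVICES | head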
show-join-command
This phase simplifies the process of expanding a Kubernetes cluster by generating bootstrap tokens and providing the necessary command to join additional nodes, whether they are worker nodes or additional control plane nodes.
In the next section, we'll actually bootstrap the cluster.
Bootstrapping the First Control Plane Node
With a solid idea of what it is that kubeadm init actually does, we can return to our command:
kubeadm init \
--control-plane-endpoint="10.4.0.10:6443" \
--kubernetes-version="stable-1.29" \
--service-cidr="172.16.0.0/20" \
--pod-network-cidr="192.168.0.0/16" \
--cert-dir="/etc/kubernetes/pki" \
--cri-socket="unix:///var/run/containerd/containerd.sock" \
--upload-certs
It's really pleasantly concise, given how much is going on under the hood.
The Ansible tasks also symlink the /etc/kubernetes/admin.conf file to ~/.kube/config (so we can use kubectl without having to specify the config file).
Then they set up my preferred Container Network Interface addon, Calico. I have sometimes used Flannel in the past, but Flannel is a simple Layer 3 overlay that doesn't implement NetworkPolicy resources, whereas Calico enforces policy at Layers 3 and 4, which allows fine-grained control over traffic based on ports, protocol types, sources, destinations, etc.
I want to play with NetworkPolicy resources, so Calico it is.
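As a taste of what Calico will let me do later, here's a minimal sketch of a NetworkPolicy that denies all ingress to a hypothetical sandbox namespace except from pods in that same namespace; nothing in the cluster actually uses this namespace yet.
# Apply a same-namespace-only ingress policy to the (hypothetical) sandbox namespace.
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: sandbox
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
EOF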
The next couple of steps create bootstrap tokens so we can join the cluster.
Joining the Rest of the Control Plane
The next phase of bootstrapping is to admit the rest of the control plane nodes to the control plane.
Certificate Key Extraction
Before joining additional control plane nodes, we need to extract the certificate key from the initial kubeadm init output:
- name: 'Set the kubeadm certificate key.'
ansible.builtin.set_fact:
k8s_certificate_key: "{{ line | regex_search('--certificate-key ([^ ]+)', '\\1') | first }}"
loop: "{{ hostvars[kubernetes.first]['kubeadm_init'].stdout_lines | default([]) }}"
when: '(line | trim) is match(".*--certificate-key.*")'
This certificate key is crucial for securely downloading control plane certificates during the join process. The --upload-certs flag from the initial kubeadm init uploaded these certificates to the cluster, encrypted with this key.
Dynamic Token Generation
Rather than using a static token, we generate a fresh token for the join process:
- name: 'Create kubeadm token for joining nodes.'
ansible.builtin.command:
cmd: "kubeadm --kubeconfig {{ kubernetes.admin_conf_path }} token create"
delegate_to: "{{ kubernetes.first }}"
register: 'temp_token'
- name: 'Set kubeadm token fact.'
ansible.builtin.set_fact:
kubeadm_token: "{{ temp_token.stdout }}"
This approach has several advantages:
- Security: Uses short-lived tokens (24-hour default expiry)
- Automation: No need to manually specify or distribute tokens
- Reliability: Fresh token for each bootstrap operation
JoinConfiguration Template
The JoinConfiguration manifest is generated from a Jinja2 template (kubeadm-controlplane.yaml.j2):
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
bootstrapToken:
apiServerEndpoint: {{ haproxy.ipv4_address }}:6443
token: {{ kubeadm_token }}
unsafeSkipCAVerification: true
timeout: 5m0s
tlsBootstrapToken: {{ kubeadm_token }}
controlPlane:
localAPIEndpoint:
advertiseAddress: {{ ipv4_address }}
bindPort: 6443
certificateKey: {{ k8s_certificate_key }}
nodeRegistration:
name: {{ inventory_hostname }}
criSocket: {{ kubernetes.cri_socket_path }}
{% if inventory_hostname in kubernetes.rest %}
taints:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
{% else %}
taints: []
{% endif %}
Key Configuration Elements:
Discovery Configuration:
- API Server Endpoint: Points to the HAProxy load balancer (10.4.0.10:6443)
- Bootstrap Token: Dynamically generated token for secure cluster discovery
- CA Verification: Skipped for simplicity (acceptable in trusted network)
- Timeout: 5-minute timeout for discovery operations
Control Plane Configuration:
- Local API Endpoint: Each node advertises its own IP for API server communication
- Certificate Key: Enables secure download of control plane certificates
- Bind Port: Standard Kubernetes API server port (6443)
Node Registration:
- CRI Socket: Uses the containerd socket (unix:///var/run/containerd/containerd.sock)
- Node Name: Uses the Ansible inventory hostname for consistency
- Taints: Control plane nodes get NoSchedule taint to prevent workload scheduling
Control Plane Join Process
The actual joining process involves several orchestrated steps:
1. Configuration Setup
- name: 'Ensure presence of Kubernetes directory.'
ansible.builtin.file:
path: '/etc/kubernetes'
state: 'directory'
mode: '0755'
- name: 'Create kubeadm control plane config.'
ansible.builtin.template:
src: 'kubeadm-controlplane.yaml.j2'
dest: '/etc/kubernetes/kubeadm-controlplane.yaml'
mode: '0640'
backup: true
2. Readiness Verification
- name: 'Wait for the kube-apiserver to be ready.'
ansible.builtin.wait_for:
host: "{{ haproxy.ipv4_address }}"
port: '6443'
timeout: 180
This ensures the load balancer and initial control plane node are ready before attempting to join.
3. Clean State Reset
- name: 'Reset certificate directory.'
ansible.builtin.shell:
cmd: |
if [ -f /etc/kubernetes/manifests/kube-apiserver.yaml ]; then
kubeadm reset -f --cert-dir {{ kubernetes.pki_path }};
fi
This conditional reset ensures a clean state if a node was previously part of a cluster.
4. Control Plane Join
- name: 'Join the control plane node to the cluster.'
ansible.builtin.command:
cmd: kubeadm join --config /etc/kubernetes/kubeadm-controlplane.yaml
register: 'kubeadm_join'
5. Administrative Access Setup
- name: 'Ensure .kube directory exists.'
ansible.builtin.file:
path: '~/.kube'
state: 'directory'
mode: '0755'
- name: 'Symlink the kubectl admin.conf to ~/.kube/config.'
ansible.builtin.file:
src: '/etc/kubernetes/admin.conf'
dest: '~/.kube/config'
state: 'link'
mode: '0600'
This sets up kubectl access for the root user on each control plane node.
Target Nodes
The control plane joining process targets nodes in the kubernetes.rest group:
- bettley (10.4.0.11) - Second control plane node
- cargyll (10.4.0.12) - Third control plane node
This gives us a 3-node control plane for high availability, capable of surviving the failure of any single node.
High Availability Considerations
Load Balancer Integration:
- All control plane nodes use the HAProxy endpoint for cluster communication
- Even control plane nodes connect through the load balancer for API server access
- This ensures consistent behavior whether accessing from inside or outside the cluster
Certificate Management:
- Control plane certificates are automatically distributed via the certificate key mechanism
- Each node gets its own API server certificate with appropriate SANs
- Certificate rotation is handled through normal Kubernetes certificate management
Etcd Clustering:
- kubeadm automatically configures etcd clustering across all control plane nodes
- Stacked topology (etcd on same nodes as API server) for simplicity
- Quorum maintained with 3 nodes (can survive 1 node failure)
After these steps complete, a simple kubeadm join --config /etc/kubernetes/kubeadm-controlplane.yaml on each remaining control plane node is sufficient to complete the highly available control plane setup.
Admitting the Worker Nodes
After establishing a highly available control plane, the final phase of cluster bootstrapping involves admitting worker nodes. While conceptually simple, this process involves several important considerations for security, automation, and cluster topology.
Worker Node Join Command Generation
The process begins by generating a fresh join command from the first control plane node:
- name: 'Get a kubeadm join command for worker nodes.'
ansible.builtin.command:
cmd: 'kubeadm token create --print-join-command'
changed_when: false
when: 'ansible_hostname == kubernetes.first'
register: 'kubeadm_join_command'
This command:
- Dynamic tokens: Creates a new bootstrap token with 24-hour expiry
- Complete command: Returns fully formed join command with discovery information
- Security: Each bootstrap operation gets a fresh token to minimize exposure
Join Command Structure
The generated join command typically looks like:
kubeadm join 10.4.0.10:6443 \
--token abc123.defghijklmnopqrs \
--discovery-token-ca-cert-hash sha256:1234567890abcdef...
Key components:
- API Server Endpoint: HAProxy load balancer address (10.4.0.10:6443)
- Bootstrap Token: Temporary authentication token for initial cluster access
- CA Certificate Hash: SHA256 hash of cluster CA certificate for secure discovery
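The discovery hash isn't magic; it can be recomputed from the cluster CA at any time. A sketch of the standard openssl pipeline, assuming an RSA CA key (kubeadm's default):
# Compute the sha256 hash of the cluster CA's public key, as used by
# --discovery-token-ca-cert-hash.
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex \
  | sed 's/^.* //'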
Ansible Automation
The join command is distributed and executed across all worker nodes:
- name: 'Set the kubeadm join command fact.'
ansible.builtin.set_fact:
kubeadm_join_command: |
{{ hostvars[kubernetes.first]['kubeadm_join_command'].stdout }} --ignore-preflight-errors=all
- name: 'Join node to Kubernetes control plane.'
ansible.builtin.command:
cmd: "{{ kubeadm_join_command }}"
when: "clean_hostname in groups['k8s_worker']"
changed_when: false
Automation features:
- Fact distribution: Join command shared across all hosts via Ansible facts
- Selective execution: Only runs on nodes in the k8s_worker inventory group
- Preflight error handling: --ignore-preflight-errors=all allows the join despite minor configuration warnings
Worker Node Inventory
The worker nodes are organized in the Ansible inventory under k8s_worker:
Raspberry Pi Workers (8 nodes):
- erenford (10.4.0.14) - Ray head node, ZFS storage
- fenn (10.4.0.15) - Ceph storage node
- gardener (10.4.0.16) - Grafana host, ZFS storage
- harlton (10.4.0.17) - General purpose worker
- inchfield (10.4.0.18) - Loki host, Seaweed storage
- jast (10.4.0.19) - Step-CA host, Seaweed storage
- karstark (10.4.0.20) - Ceph storage node
- lipps (10.4.0.21) - Ceph storage node
GPU Worker (1 node):
- velaryon (10.4.1.10) - x86 node with GPU acceleration
This topology provides:
- Heterogeneous compute: Mix of ARM64 (Pi) and x86_64 (velaryon) architectures
- Specialized workloads: GPU node for ML/AI workloads
- Storage diversity: Nodes optimized for different storage backends (ZFS, Ceph, Seaweed)
Node Registration Process
When a worker node joins the cluster, several automated processes occur:
1. TLS Bootstrap
# kubelet initiates TLS bootstrapping
kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
--kubeconfig=/etc/kubernetes/kubelet.conf
This process:
- Uses bootstrap token for initial authentication
- Generates node-specific key pair
- Requests certificate signing from cluster CA
- Receives permanent kubeconfig upon approval
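The paper trail for this is visible in the cluster as CertificateSigningRequest objects; kubeadm's default RBAC auto-approves the node client CSRs, but they're worth a look when a node gets stuck joining:
# List CSRs created during kubelet TLS bootstrapping.
kubectl get csr
# If a CSR ever does need manual approval (e.g. kubelet serving certificates):
# kubectl certificate approve <csr-name>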
2. Node Labels and Taints
Automatic labels applied:
- kubernetes.io/arch=arm64 (Pi nodes) or kubernetes.io/arch=amd64 (velaryon)
- kubernetes.io/os=linux
- node.kubernetes.io/instance-type (based on node hardware)
No default taints: Worker nodes accept all workloads by default, unlike control plane nodes with their NoSchedule taints.
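These labels make it easy to slice the node pool by architecture, which will matter later when scheduling GPU or ARM-only workloads:
# The Raspberry Pi workers.
kubectl get nodes -l kubernetes.io/arch=arm64
# The lone x86_64 GPU node.
kubectl get nodes -l kubernetes.io/arch=amd64
# Or dump every label on a single node.
kubectl get node velaryon --show-labels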
3. Container Runtime Integration
Each worker node connects to the local containerd socket:
# kubelet configuration
criSocket: unix:///var/run/containerd/containerd.sock
This ensures:
- Container lifecycle: kubelet manages pod containers via containerd
- Image management: containerd handles container image pulls and storage
- Runtime security: Proper cgroup and namespace isolation
Cluster Topology Verification
After worker node admission, the cluster achieves the desired topology:
Control Plane (3 nodes)
- High availability: Survives single node failure
- Load balanced: All API requests go through HAProxy
- Etcd quorum: 3-node etcd cluster for data consistency
Worker Pool (9 nodes)
- Compute capacity: 8x Raspberry Pi 4B + 1x x86 GPU node
- Workload distribution: Scheduler can place pods across heterogeneous hardware
- Fault tolerance: Workloads can survive multiple worker node failures
Networking Integration
- Pod networking: Calico CNI provides cluster-wide pod connectivity
- Service networking: kube-proxy configures service load balancing
- External access: MetalLB provides LoadBalancer service implementation
Verification Commands
After worker node admission, verify cluster health:
# Check all nodes are Ready
kubectl get nodes -o wide
# Verify kubelet health across cluster
goldentooth command k8s_cluster 'systemctl status kubelet'
# Check pod networking
kubectl get pods -n kube-system -o wide
# Verify resource availability
kubectl top nodes
And voilà! We have a functioning cluster.
We can also see that the cluster is functioning well from HAProxy's perspective:
Implementation Details
The complete worker node admission process is automated in the bootstrap_k8s.yaml playbook, which orchestrates:
- Control plane initialization on the first node
- Control plane expansion to remaining master nodes
- Worker node admission across the entire worker pool
- Network configuration with Calico CNI
- Service mesh preparation for later HashiCorp Consul integration
This systematic approach ensures consistent cluster topology and provides a solid foundation for deploying containerized applications and platform services.
Where Do We Go From Here?
We have a functioning cluster now, which is to say that I've spent many hours of my life that I'm not going to get back just doing the same thing that the official documentation manages to convey in just a few lines.
Or that Jeff Geerling's geerlingguy.kubernetes has already managed to do.
And it's not a tenth of a percent as much as Kubespray can do.
Not much to be proud of, but again, this is a personal learning journey. I'm just trying to build a cluster thoughtfully, limiting the black boxes and the magic as much as practical.
The Foundation is Set
What we've accomplished so far represents the essential foundation of any production Kubernetes cluster:
Core Infrastructure ✅
- High availability control plane with 3 nodes and etcd quorum
- Load balanced API access through HAProxy for reliability
- Container runtime (containerd) with proper CRI integration
- Pod networking with Calico CNI providing cluster-wide connectivity
- Worker node pool with heterogeneous hardware (ARM64 + x86_64)
Automation and Reproducibility ✅
- Infrastructure as Code with comprehensive Ansible automation
- Idempotent operations ensuring consistent cluster state
- Version-pinned packages preventing unexpected upgrades
- Goldentooth CLI providing unified cluster management interface
But a bare Kubernetes cluster, while functional, is just the beginning. Real production workloads require additional platform services and operational capabilities.
The Platform Journey Ahead
The following phases will transform our basic cluster into a comprehensive container platform:
Phase 1: Application Platform Services
The next immediate priorities focus on making the cluster useful for application deployment:
GitOps and Application Management:
- Helm package management for standardized application packaging
- Argo CD for GitOps-based continuous deployment
- ApplicationSets for managing applications across environments
- Sealed Secrets for secure secret management in Git repositories
Ingress and Load Balancing:
- MetalLB for LoadBalancer service implementation
- BGP configuration for dynamic route advertisement
- External DNS for automatic DNS record management
- TLS certificate automation with cert-manager
Phase 2: Observability and Operations
Production clusters require comprehensive observability:
Metrics and Monitoring:
- Prometheus for metrics collection and alerting
- Grafana for visualization and dashboards
- Node exporters for hardware and OS metrics
- Custom metrics for application-specific monitoring
Logging and Troubleshooting:
- Loki for centralized log aggregation
- Vector for log collection and routing
- Distributed tracing for complex application debugging
- Alert routing for operational incident response
Phase 3: Storage and Data Management
Stateful applications require sophisticated storage solutions:
Distributed Storage:
- NFS exports for shared storage across the cluster
- Ceph cluster for distributed block and object storage
- ZFS replication for data durability and snapshots
- SeaweedFS for scalable object storage
Backup and Recovery:
- Velero for cluster backup and disaster recovery
- Database backup automation for stateful workloads
- Cross-datacenter replication for business continuity
Phase 4: Security and Compliance
Enterprise-grade security requires multiple layers:
PKI and Certificate Management:
- Step-CA for internal certificate authority
- Automatic certificate rotation for all cluster services
- SSH certificate authentication for secure node access
- mTLS everywhere for service-to-service communication
Secrets and Access Control:
- HashiCorp Vault for enterprise secret management
- AWS KMS integration for encryption key management
- RBAC policies for fine-grained access control
- Pod security standards for workload isolation
Phase 5: Multi-Orchestrator Hybrid Cloud
The final phase explores advanced orchestration patterns:
Service Mesh and Discovery:
- Consul service mesh for advanced networking and security
- Cross-platform service discovery between Kubernetes and Nomad
- Traffic management and circuit breaking patterns
Workload Distribution:
- Nomad integration for specialized workloads and batch jobs
- Ray cluster for distributed machine learning workloads
- GPU acceleration for AI/ML and scientific computing
Learning Philosophy
This journey prioritizes understanding over convenience:
Transparency Over Magic:
- Each component is manually configured to understand its purpose
- Ansible automation makes every configuration decision explicit
- Documentation captures the reasoning behind each choice
Production Patterns from Day One:
- High availability configurations even in the homelab
- Security-first approach with proper PKI and encryption
- Monitoring and observability built into every service
Platform Engineering Mindset:
- Reusable patterns that could scale to enterprise environments
- GitOps workflows that support team collaboration
- Self-service capabilities for application developers
The Road Ahead
The following chapters will implement these platform services systematically, building up the cluster's capabilities layer by layer. Each addition will:
- Solve a real operational problem (not just add complexity)
- Follow production best practices (high availability, security, monitoring)
- Integrate with existing services (leveraging our PKI, service discovery, etc.)
- Document the implementation (including failure modes and troubleshooting)
This methodical approach ensures that when we're done, we'll have not just a working cluster, but a deep understanding of how modern container platforms are built and operated.
In the following sections, I'll add more functionality.
Installing Helm
I have a lot of ambitions for this cluster, but after some deliberation, the thing I most want to do right now is deploy something to Kubernetes.
So I'll be starting out by installing Argo CD, and I'll do that... soon. In the next chapter. I decided to install Argo CD via Helm, since I expect that Helm will be useful in other situations as well, e.g. trying out applications before I commit (no pun intended) to bringing them into GitOps.
So I created a playbook and role to cover installing Helm.
Installation Implementation
Package Repository Approach
Rather than downloading binaries manually, I chose to use the official Helm APT repository for a more maintainable installation. The Ansible role adds the repository using the modern deb822_repository format:
- name: 'Add Helm package repository.'
ansible.builtin.deb822_repository:
name: 'helm'
types: ['deb']
uris: ['https://baltocdn.com/helm/stable/debian/']
suites: ['all']
components: ['main']
architectures: ['arm64']
signed_by: 'https://baltocdn.com/helm/signing.asc'
This approach provides several benefits:
- Automatic updates: Using state: 'latest' ensures we get the most recent Helm version
- Security: Uses the official Helm signing key for package verification
- Architecture support: Properly configured for ARM64 architecture on Raspberry Pi nodes
- Maintainability: Standard package management simplifies updates and removes manual binary management
Installation Scope
Helm is installed only on the Kubernetes control plane nodes (the k8s_control_plane group). This is sufficient because:
- Post-Tiller Architecture: Modern Helm (v3+) doesn't require a server-side component
- Client-only Tool: Helm operates entirely as a client-side tool that communicates with the Kubernetes API
- Administrative Access: Control plane nodes are where cluster administration typically occurs
- Resource Efficiency: No need to install on every worker node
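A quick sanity check from any control plane node confirms both the client install and its access to the cluster:
# Confirm the Helm client version.
helm version
# Confirm Helm can talk to the cluster via the local kubeconfig.
helm list --all-namespaces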
Integration with Cluster Architecture
Kubernetes Integration:
The installation leverages the existing kubernetes.core Ansible collection, ensuring proper integration with the cluster's Kubernetes components. The role depends on:
- Existing cluster RBAC configurations
- Kubernetes API server access from control plane nodes
- Standard kubeconfig files for authentication
GitOps Integration: Helm serves as a crucial component for the GitOps workflow, particularly for Argo CD installation. The integration follows this pattern:
- name: 'Add Argo Helm chart repository.'
kubernetes.core.helm_repository:
name: 'argo'
repo_url: "{{ argo_cd.chart_repo_url }}"
- name: 'Install Argo CD from Helm chart.'
kubernetes.core.helm:
atomic: false
chart_ref: 'argo/argo-cd'
chart_version: "{{ argo_cd.chart_version }}"
create_namespace: true
release_name: 'argocd'
release_namespace: "{{ argo_cd.namespace }}"
Security Considerations
The installation follows security best practices:
- Signed Packages: Uses official Helm signing key for package verification
- Trusted Repository: Sources packages directly from Helm's CDN
- No Custom RBAC: Relies on existing Kubernetes cluster RBAC rather than creating additional permissions
- System-level Installation: Installed as root for proper system integration
Command Line Integration
The installation integrates seamlessly with the goldentooth CLI:
goldentooth install_helm
This command maps directly to the Ansible playbook execution, maintaining consistency with the cluster's unified management interface.
Version Management Strategy
The configuration uses a state: 'latest' strategy, which means:
- Automatic Updates: Each playbook run ensures the latest Helm version is installed
- Application-level Pinning: Specific chart versions are managed at the application level (e.g., Argo CD chart version 7.1.5)
- Simplified Maintenance: No need to manually track Helm version updates
High Availability Considerations
By installing Helm on all control plane nodes, the configuration provides:
- Redundancy: Any control plane node can perform Helm operations
- Administrative Flexibility: Cluster administrators can use any control plane node
- Disaster Recovery: Helm operations can continue even if individual control plane nodes fail
Fortunately, this is fairly simple to install and trivial to configure, which is not something I can say for Argo CD 🙂
Installing Argo CD
GitOps is a methodology based around treating IaC stored in Git as a source of truth for the desired state of the infrastructure. Put simply, whatever you push to main becomes the desired state, and your IaC systems, whether they be Terraform, Ansible, etc, will be invoked to bring the actual state into alignment.
Argo CD is a popular system for implementing GitOps with Kubernetes. It can observe a Git repository for changes and react to those changes accordingly, creating/destroying/replacing resources as needed within the cluster.
Argo CD is a large, complicated application in its own right; its Helm chart is thousands of lines long. I'm not trying to learn it all right now, and fortunately, I have a fairly simple structure in mind.
I'll install Argo CD via a new Ansible playbook and role that use Helm, which we set up in the last section.
None of this is particularly complex, but I'll document some of my values overrides here:
# I've seen a mix of `argocd` and `argo-cd` scattered around. I preferred
# `argocd`, but I will shift to `argo-cd` where possible to improve
# consistency.
#
# EDIT: The `argocd` CLI tool appears to be broken and does not allow me to
# override the names of certain components when port forwarding.
# See https://github.com/argoproj/argo-cd/issues/16266 for details.
# As a result, I've gone through and reverted my changes to standardize as much
# as possible on `argocd`. FML.
nameOverride: 'argocd'
global:
# This evaluates to `argocd.goldentooth.hellholt.net`.
domain: "{{ argocd_domain }}"
# Add Prometheus scrape annotations to all metrics services. This can
# be used as an alternative to the ServiceMonitors.
addPrometheusAnnotations: true
# Default network policy rules used by all components.
networkPolicy:
# Create NetworkPolicy objects for all components; this is currently false
# but I think I'd like to create these later.
create: false
# Default deny all ingress traffic; I want to improve security, so I hope
# to enable this later.
defaultDenyIngress: false
configs:
secret:
createSecret: true
# Specify a password. I store an "easy" password, which is in my muscle
# memory, so I'll use that for right now.
argocdServerAdminPassword: "{{ vault.easy_password | password_hash('bcrypt') }}"
# Refer to the repositories that host our applications.
repositories:
# This is the main (and likely only) one.
gitops:
type: 'git'
name: 'gitops'
# This turns out to be https://github.com/goldentooth/gitops.git
url: "{{ argocd_app_repo_url }}"
redis-ha:
# Enable Redis high availability.
enabled: true
controller:
# The HA configuration keeps this at one, and I don't see a reason to change.
replicas: 1
server:
# Enable
autoscaling:
enabled: true
# This immediately scaled up to 3 replicas.
minReplicas: 2
# I'll make this more secure _soon_.
extraArgs:
- '--insecure'
# I don't have load balancing set up yet.
service:
type: 'ClusterIP'
repoServer:
autoscaling:
enabled: true
minReplicas: 2
applicationSet:
replicas: 2
Installation Architecture
The Argo CD installation uses a sophisticated Helm-based approach with the following components:
- Chart Version: 7.1.5 from the official Argo repository (https://argoproj.github.io/argo-helm)
- CLI Installation: ARM64-specific Argo CD CLI installed to /usr/local/bin/argocd
- Namespace: Dedicated argocd namespace with proper resource isolation
- Deployment Scope: Runs once on control plane nodes for efficient resource usage
High Availability Configuration
The installation implements enterprise-grade high availability:
Redis High Availability:
redis-ha:
enabled: true
Component Scaling:
- Server: Autoscaling enabled with minimum 2 replicas for redundancy
- Repo Server: Autoscaling enabled with minimum 2 replicas for Git repository operations
- Application Set Controller: 2 replicas for ApplicationSet management
- Controller: 1 replica (following HA recommendations for the core controller)
This configuration ensures that Argo CD remains operational even during node failures or maintenance.
Security and Authentication
Admin Authentication: The cluster uses bcrypt-hashed passwords stored in the encrypted Ansible vault:
argocdServerAdminPassword: "{{ secret_vault.easy_password | password_hash('bcrypt') }}"
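For reference, the same bcrypt hash can be generated ad hoc with an Ansible one-liner, which is handy for checking what the template will render; this assumes passlib is installed on the controller (it backs the password_hash filter), and 'hunter2' is obviously a stand-in:
# Render a bcrypt hash the same way the playbook does (requires passlib).
ansible localhost -m debug -a "msg={{ 'hunter2' | password_hash('bcrypt') }}"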
GitHub Integration: For private repository access, the installation creates a Kubernetes secret:
apiVersion: v1
kind: Secret
metadata:
name: github-token
namespace: argocd
data:
token: "{{ secret_vault.github_token | b64encode }}"
Current Security Posture:
- Server configured with the --insecure flag (temporary for initial setup)
- Network policies prepared but not yet enforced
- RBAC relies on default admin access patterns
Service and Network Integration
LoadBalancer Configuration: Unlike the basic ClusterIP shown in the values, the actual deployment uses:
service:
type: LoadBalancer
annotations:
external-dns.alpha.kubernetes.io/hostname: "argocd.{{ cluster.domain }}"
external-dns.alpha.kubernetes.io/ttl: "60"
This integration provides:
- MetalLB Integration: Automatic IP address assignment from the 10.4.11.0/24 pool
- External DNS: Automatic DNS record creation for argocd.goldentooth.net
- Public Access: Direct access from the broader network infrastructure
GitOps Implementation: App of Apps Pattern
The cluster implements the sophisticated "Application of Applications" pattern for managing GitOps workflows:
AppProject Configuration:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: gitops-repo
spec:
sourceRepos:
- '*' # Lab environment - all repositories allowed
destinations:
- namespace: '*'
server: https://kubernetes.default.svc
clusterResourceWhitelist:
- group: '*'
kind: '*'
ApplicationSet Generator: The cluster uses GitHub SCM Provider generator to automatically discover and deploy applications:
generators:
- scmProvider:
github:
organization: goldentooth
labelSelector:
matchLabels:
gitops-repo: "true"
This pattern automatically creates Argo CD Applications for any repository in the goldentooth organization with the gitops-repo label.
Application Standards and Sync Policies
Standardized Sync Configuration: All applications follow consistent sync policies:
syncPolicy:
automated:
prune: true # Remove resources not in Git
selfHeal: true # Automatically fix configuration drift
syncOptions:
- Validate=true
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
- RespectIgnoreDifferences=true
- ApplyOutOfSyncOnly=true
Wave-based Deployment:
Applications use argocd.argoproj.io/sync-wave annotations for ordered deployment, ensuring dependencies are deployed before dependent services.
Monitoring Integration
Prometheus Integration:
global:
addPrometheusAnnotations: true
This configuration ensures all Argo CD components expose metrics for the cluster's Prometheus monitoring stack, providing visibility into GitOps operations and performance.
Current Application Portfolio
The GitOps system currently manages:
- MetalLB: Load balancer implementation
- External Secrets: Integration with HashiCorp Vault
- Prometheus Node Exporter: Node-level monitoring
- Additional applications: Automatically discovered via the ApplicationSet pattern
Command Line Integration
The installation provides seamless CLI integration:
# Install Argo CD
goldentooth install_argo_cd
# Install managed applications
goldentooth install_argo_cd_apps
Access Methods
Web Interface Access:
- Production: Direct access via https://argocd.goldentooth.net (LoadBalancer + External DNS)
- Development: Port forwarding via kubectl -n argocd port-forward service/argocd-server 8081:443 --address 0.0.0.0
After running the port-forward command on one of my control plane nodes, I'm able to view the web interface and log in. With the App of Apps pattern configured, the interface shows automatically discovered applications and their sync status.
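With the port-forward still running, the argocd CLI can log in against the same endpoint; the --insecure flag here just mirrors the server's current --insecure configuration, and the CLI prompts for the admin password:
# Log in through the local port-forward and list the applications Argo CD manages.
argocd login localhost:8081 --username admin --insecure
argocd app list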
The GitOps foundation is now established, enabling declarative application management across the entire cluster infrastructure.
The "Incubator" GitOps Application
Previously, we discussed GitOps and how Argo CD provides a platform for implementing GitOps for Kubernetes.
As mentioned, the general idea is to have some Git repository somewhere that defines an application. We create a corresponding resource in Argo CD to represent that application, and Argo CD will henceforth watch the repository and make changes to the running application as needed.
What does the repository actually include? Well, it might be a Helm chart, or a kustomization, or raw manifests, etc. Pretty much anything that could be done in Kubernetes.
Of course, setting this up involves some manual work; you need to actually create the application within Argo CD and, if you want it to stick around, presumably commit that resource to some version control system somewhere. We want to be careful about who has access to that repository, though, and we might not want engineers to have access to Argo CD itself. So suddenly there's a rather uncomfortable amount of work and coupling in all of this.
GitOps Deployment Patterns
Traditional Application Management Challenges
Manual application creation:
- Platform engineers must create Argo CD Application resources manually
- Direct access to Argo CD UI required for application management
- Configuration drift between different environments
- Difficulty managing permissions and access control at scale
Repository proliferation:
- Each application requires its own repository or subdirectory
- Inconsistent structure and standards across teams
- Complex permission management across multiple repositories
- Operational overhead for maintaining repository access
The App-of-Apps Pattern
A common pattern in Argo CD is the "app-of-apps" pattern. This is simply an Argo CD application pointing to a repository that contains other Argo CD applications. Thus you can have a single application created for you by the principal platform engineer, and you can turn it into fifty or a hundred finely grained pieces of infrastructure that said principal engineer doesn't have to know about 🙂
(If they haven't configured the security settings carefully, it can all just be your little secret 😉)
App-of-Apps Architecture:
Root Application (managed by platform team)
├── Application 1 (e.g., monitoring stack)
├── Application 2 (e.g., ingress controllers)
├── Application 3 (e.g., security tools)
└── Application N (e.g., developer applications)
Benefits of App-of-Apps:
- Single entry point: Platform team manages one root application
- Delegated management: Development teams control their applications
- Hierarchical organization: Logical grouping of related applications
- Simplified bootstrapping: New environments start with root application
Limitations of App-of-Apps:
- Resource proliferation: Each application creates an Application resource
- Dependency management: Complex inter-application dependencies
- Scaling challenges: Manual management of hundreds of applications
- Limited templating: Difficult to apply consistent patterns
ApplicationSet Pattern (Modern Approach)
A (relatively) new construct in Argo CD is the ApplicationSet, which seeks to more clearly define how applications are created and to fix the problems with the "app-of-apps" approach. That's the approach we will take in this cluster for mature applications.
ApplicationSet Architecture:
ApplicationSet (template-driven)
├── Generator (Git directories, clusters, pull requests)
├── Template (Application template with parameters)
└── Applications (dynamically created from template)
ApplicationSet Generators:
- Git Generator: Scans Git repositories for directories or files
- Cluster Generator: Creates applications across multiple clusters
- List Generator: Creates applications from predefined lists
- Matrix Generator: Combines multiple generators for complex scenarios
Example ApplicationSet Configuration:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: gitops-repo
namespace: argocd
spec:
generators:
- scmProvider:
github:
organization: goldentooth
allBranches: false
labelSelector:
matchLabels:
gitops-repo: "true"
template:
metadata:
name: '{{repository}}'
spec:
project: gitops-repo
source:
repoURL: '{{url}}'
targetRevision: '{{branch}}'
path: .
destination:
server: https://kubernetes.default.svc
namespace: '{{repository}}'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
The Incubator Project Strategy
Given that we're operating in a lab environment, we can use the "app-of-apps" approach for the Incubator, which is where we can try out new configurations. We can give it fairly unrestricted access while we work on getting things to deploy correctly, and then lock things down as we zero in on a stable configuration.
Development vs Production Patterns
Incubator (Development):
- App-of-Apps pattern: Manual application management for experimentation
- Permissive security: Broad access for rapid prototyping
- Flexible structure: Accommodate diverse application types
- Quick iteration: Fast deployment and testing cycles
Production (Mature Applications):
- ApplicationSet pattern: Template-driven automation at scale
- Restrictive security: Principle of least privilege
- Standardized structure: Consistent patterns and practices
- Controlled deployment: Change management and approval processes
But meanwhile, we'll create an AppProject manifest for the Incubator:
---
apiVersion: 'argoproj.io/v1alpha1'
kind: 'AppProject'
metadata:
name: 'incubator'
# Argo CD resources need to deploy into the Argo CD namespace.
namespace: 'argocd'
finalizers:
- 'resources-finalizer.argocd.argoproj.io'
spec:
description: 'Goldentooth incubator project'
# Allow manifests to deploy from any Git repository.
# This is an acceptable security risk because this is a lab environment
# and I am the only user.
sourceRepos:
- '*'
destinations:
# Prevent any resources from deploying into the kube-system namespace.
- namespace: '!kube-system'
server: '*'
# Allow resources to deploy into any other namespace.
- namespace: '*'
server: '*'
clusterResourceWhitelist:
# Allow any cluster resources to deploy.
- group: '*'
kind: '*'
As mentioned before, this is very permissive. It only slightly differs from the default project by preventing resources from deploying into the kube-system namespace.
We'll also create an Application manifest:
apiVersion: 'argoproj.io/v1alpha1'
kind: 'Application'
metadata:
name: 'incubator'
namespace: 'argocd'
labels:
name: 'incubator'
managed-by: 'argocd'
spec:
project: 'incubator'
source:
repoURL: "https://github.com/goldentooth/incubator.git"
path: './'
targetRevision: 'HEAD'
destination:
server: 'https://kubernetes.default.svc'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- Validate=true
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
- RespectIgnoreDifferences=true
- ApplyOutOfSyncOnly=true
That's sufficient to get it to pop up in the Applications view in Argo CD.
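It's also visible from the command line, since AppProject and Application are just custom resources living in the argocd namespace:
# Confirm the project and application resources exist.
kubectl -n argocd get appprojects.argoproj.io incubator
kubectl -n argocd get applications.argoproj.io incubator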
AppProject Configuration Deep Dive
Security Boundary Configuration
The AppProject resource provides security boundaries and policy enforcement:
spec:
description: 'Goldentooth incubator project'
sourceRepos:
- '*' # Allow any Git repository (lab environment only)
destinations:
- namespace: '!kube-system' # Explicit exclusion
server: '*'
- namespace: '*' # Allow all other namespaces
server: '*'
clusterResourceWhitelist:
- group: '*' # Allow any cluster-scoped resources
kind: '*'
Security Trade-offs:
- Permissive source repos: Allows rapid experimentation with external charts
- Namespace protection: Prevents accidental modification of system namespaces
- Cluster resource access: Enables testing of operators and custom resources
- Lab environment justification: Security relaxed for learning and development
Production AppProject Example
For comparison, a production AppProject would be much more restrictive:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production-apps
  namespace: argocd
spec:
  description: 'Production applications with strict controls'
  sourceRepos:
    - 'https://github.com/goldentooth/helm-charts'
    - 'https://charts.bitnami.com/bitnami'
  destinations:
    - namespace: 'production-*'
      server: 'https://kubernetes.default.svc'
  clusterResourceWhitelist:
    - group: ''
      kind: 'Namespace'
    - group: 'rbac.authorization.k8s.io'
      kind: 'ClusterRole'
  namespaceResourceWhitelist:
    - group: 'apps'
      kind: 'Deployment'
    - group: ''
      kind: 'Service'
  roles:
    - name: 'developers'
      policies:
        - 'p, proj:production-apps:developers, applications, get, production-apps/*, allow'
        - 'p, proj:production-apps:developers, applications, sync, production-apps/*, allow'
Application Configuration Patterns
Sync Policy Configuration
The Application's sync policy defines automated behavior:
syncPolicy:
  automated:
    prune: true      # Remove resources deleted from Git
    selfHeal: true   # Automatically fix configuration drift
  syncOptions:
    - Validate=true                      # Validate resources before applying
    - CreateNamespace=true               # Auto-create target namespaces
    - PrunePropagationPolicy=foreground  # Wait for dependent resources
    - PruneLast=true                     # Prune resources after the rest of the sync completes
    - RespectIgnoreDifferences=true      # Honor ignoreDifferences rules
    - ApplyOutOfSyncOnly=true            # Only apply changed resources
Sync Policy Implications:
- Prune: Ensures Git repository is single source of truth
- Self-heal: Prevents manual changes from persisting
- Validation: Catches configuration errors before deployment
- Namespace creation: Reduces manual setup for new applications
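RespectIgnoreDifferences only matters if the Application actually declares ignoreDifferences rules. Here's a minimal sketch of that part of an Application spec; the HPA-managed replica count is just a common illustrative case, not something the incubator currently needs:

spec:
  ignoreDifferences:
    - group: 'apps'
      kind: 'Deployment'
      # Ignore replica counts managed by a HorizontalPodAutoscaler.
      jsonPointers:
        - '/spec/replicas'
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true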
Repository Structure for App-of-Apps
The incubator repository structure supports the app-of-apps pattern:
incubator/
├── README.md
├── applications/
│   ├── monitoring/
│   │   ├── prometheus.yaml
│   │   ├── grafana.yaml
│   │   └── alertmanager.yaml
│   ├── networking/
│   │   ├── metallb.yaml
│   │   ├── external-dns.yaml
│   │   └── cert-manager.yaml
│   └── storage/
│       ├── nfs-provisioner.yaml
│       ├── ceph-operator.yaml
│       └── seaweedfs.yaml
└── environments/
    ├── dev/
    ├── staging/
    └── production/
Directory Organization Benefits:
- Logical grouping: Applications organized by functional area
- Environment separation: Different configurations per environment
- Clear ownership: Teams can own specific directories
- Selective deployment: Enable/disable application groups easily
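Concretely, each YAML file under applications/ is itself an Application manifest that the parent incubator Application syncs into the argocd namespace. A hypothetical applications/networking/metallb.yaml might look roughly like this (the chart version is illustrative):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: metallb
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: incubator
  destination:
    namespace: metallb-system
    server: 'https://kubernetes.default.svc'
  source:
    repoURL: https://metallb.github.io/metallb
    chart: metallb
    targetRevision: 0.14.3
  syncPolicy:
    syncOptions:
      # Create the target namespace if it doesn't already exist.
      - CreateNamespace=true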
Integration with ApplicationSets
Migration Path from App-of-Apps
As applications mature, they can be migrated from the incubator to ApplicationSet management:
Migration Steps:
- Stabilize configuration: Test thoroughly in incubator environment
- Create Helm chart: Package application as reusable chart
- Add to gitops-repo: Tag repository for ApplicationSet discovery
- Remove from incubator: Delete Application from incubator repository
- Verify automation: Confirm ApplicationSet creates new Application
Example Migration:
# Before: Manual Application in incubator
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring-stack
  namespace: argocd
spec:
  project: incubator
  source:
    repoURL: 'https://github.com/goldentooth/monitoring'
    path: './manifests'

# After: Automatically generated by ApplicationSet
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
  ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      kind: ApplicationSet
      name: gitops-repo
spec:
  project: gitops-repo
  source:
    repoURL: 'https://github.com/goldentooth/monitoring'
    path: '.'
ApplicationSet Template Advantages
Consistent Configuration:
- All applications get same sync policy
- Standardized labeling and annotations
- Uniform security settings across applications
- Reduced configuration drift between applications
Template Parameters:
template:
  metadata:
    name: '{{repository}}'
    labels:
      environment: '{{environment}}'
      team: '{{team}}'
      gitops-managed: 'true'
  spec:
    project: '{{project}}'
    source:
      repoURL: '{{url}}'
      targetRevision: '{{branch}}'
      helm:
        valueFiles:
          - 'values-{{environment}}.yaml'
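Those {{...}} parameters have to come from a generator. As a sketch, a list generator keeps the example self-contained and lines up exactly with the template above; in practice a Git or SCM-provider generator would populate the same fields dynamically (all values below are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: gitops-repo
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - repository: monitoring
            environment: dev
            team: platform
            project: gitops-repo
            url: 'https://github.com/goldentooth/monitoring'
            branch: 'main'
  # ...followed by the template block shown above.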
Operational Workflows
Development Workflow
Incubator Development Process:
- Create feature branch: Develop new application in isolated branch
- Add Application manifest: Define application in incubator repository
- Test deployment: Verify application deploys correctly
- Iterate configuration: Refine settings based on testing
- Merge to main: Deploy to shared incubator environment
- Monitor and debug: Observe application behavior and logs
Production Promotion
Graduation from Incubator:
- Create dedicated repository: Move application to own repository
- Package as Helm chart: Standardize configuration management
- Add gitops-repo label: Enable ApplicationSet discovery
- Configure environments: Set up dev/staging/production values
- Test automation: Verify ApplicationSet creates Application
- Remove from incubator: Clean up experimental Application
Monitoring and Observability
Application Health Monitoring:
# Check application sync status
kubectl -n argocd get applications
# View application details
argocd app get incubator
# Monitor sync operations
argocd app sync incubator --dry-run
# Check for configuration drift
argocd app diff incubator
Common Issues and Troubleshooting:
- Sync failures: Check resource validation and RBAC permissions
- Resource conflicts: Verify namespace isolation and resource naming
- Git access issues: Confirm repository permissions and authentication
- Health check failures: Review application health status and events
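When one of these crops up, the controller and repo-server logs usually point at the cause. A few commands I reach for (component names are those of a default Argo CD install; the target namespace is a placeholder):

# Force a refresh and show the resulting status and conditions.
argocd app get incubator --refresh

# Repo-server logs cover manifest generation and Git access problems.
kubectl -n argocd logs deployment/argocd-repo-server --tail=100

# Application controller logs cover sync and health evaluation.
kubectl -n argocd logs statefulset/argocd-application-controller --tail=100

# Kubernetes events in the target namespace often explain health failures.
kubectl -n <target-namespace> get events --sort-by=.lastTimestamp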
Best Practices for GitOps
Repository Management
Separation of Concerns:
- Application code: Business logic and container images
- Configuration: Kubernetes manifests and Helm values
- Infrastructure: Cluster setup and platform services
- Policies: Security rules and governance configurations
Version Control Strategy:
main branch → Production environment
staging branch → Staging environment
dev branch → Development environment
feature/* → Feature testing
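In Argo CD terms, that mapping is usually just a different targetRevision (and values file) per environment's Application. A rough sketch for a staging Application (names are illustrative):

spec:
  source:
    repoURL: 'https://github.com/goldentooth/monitoring'
    targetRevision: 'staging'      # track the staging branch
    helm:
      valueFiles:
        - 'values-staging.yaml'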
Security Considerations
Credential Management:
- Use Argo CD's credential templates for repository access
- Implement least-privilege access for Git repositories
- Rotate credentials regularly and audit access
- Consider using Git over SSH for enhanced security
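Argo CD's credential templates are declared as Secrets in the argocd namespace labeled argocd.argoproj.io/secret-type: repo-creds; any repository whose URL starts with the configured prefix inherits them. A minimal sketch, with placeholder names and token:

apiVersion: v1
kind: Secret
metadata:
  name: goldentooth-repo-creds
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repo-creds
stringData:
  type: git
  url: 'https://github.com/goldentooth'   # URL prefix this credential applies to
  username: 'git'
  password: '<personal-access-token>'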
Resource Isolation:
- Separate AppProjects for different security domains
- Use namespace-based isolation for applications
- Implement RBAC policies aligned with organizational structure
- Monitor cross-namespace resource access
This incubator approach provides a safe environment for experimenting with GitOps patterns while establishing the foundation for scalable, automated application management through ApplicationSets as the platform matures.
Prometheus Node Exporter
Sure, I could just jump straight into kube-prometheus, but where's the fun (and, more importantly, the learning) in that?
I'm going to try to build a system from the ground up, tweaking each component as I go.
Prometheus Node Exporter seems like a reasonable place to begin, as it will give me per-node statistics that I can look at immediately. Or almost immediately.
The first order of business is to modify our incubator repository to refer to the Prometheus Node Exporter Helm chart, by adding the following manifest to the incubator repo:
# templates/prometheus_node_exporter.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus-node-exporter
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-node-exporter
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  destination:
    namespace: prometheus-node-exporter
    server: 'https://kubernetes.default.svc'
  project: incubator
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus-node-exporter
    targetRevision: 4.31.0
    helm:
      releaseName: prometheus-node-exporter
We'll soon see the resources created in the Argo CD UI.
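We can also confirm from the command line that the chart's DaemonSet landed on every node (resource names follow the chart's defaults):

$ kubectl -n prometheus-node-exporter get daemonset,pods -o wide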
And we can curl a metric butt-ton of information:
$ curl localhost:9100/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.21.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 829976
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 829976
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.445756e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 704
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.909376e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 829976
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 1.458176e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.310144e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 8628
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.458176e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.76832e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 9332
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 1200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 37968
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 48888
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 795876
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 425984
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 425984
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 9.4098e+06
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 6
# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1.706835386e+09
# HELP node_context_switches_total Total number of context switches.
# TYPE node_context_switches_total counter
node_context_switches_total 1.8612307682e+10
# HELP node_cooling_device_cur_state Current throttle state of the cooling device
# TYPE node_cooling_device_cur_state gauge
node_cooling_device_cur_state{name="0",type="gpio-fan"} 1
# HELP node_cooling_device_max_state Maximum throttle state of the cooling device
# TYPE node_cooling_device_max_state gauge
node_cooling_device_max_state{name="0",type="gpio-fan"} 1
# HELP node_cpu_frequency_max_hertz Maximum CPU thread frequency in hertz.
# TYPE node_cpu_frequency_max_hertz gauge
node_cpu_frequency_max_hertz{cpu="0"} 2e+09
node_cpu_frequency_max_hertz{cpu="1"} 2e+09
node_cpu_frequency_max_hertz{cpu="2"} 2e+09
node_cpu_frequency_max_hertz{cpu="3"} 2e+09
# HELP node_cpu_frequency_min_hertz Minimum CPU thread frequency in hertz.
# TYPE node_cpu_frequency_min_hertz gauge
node_cpu_frequency_min_hertz{cpu="0"} 6e+08
node_cpu_frequency_min_hertz{cpu="1"} 6e+08
node_cpu_frequency_min_hertz{cpu="2"} 6e+08
node_cpu_frequency_min_hertz{cpu="3"} 6e+08
# HELP node_cpu_guest_seconds_total Seconds the CPUs spent in guests (VMs) for each mode.
# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="2",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="2",mode="user"} 0
node_cpu_guest_seconds_total{cpu="3",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="3",mode="user"} 0
# HELP node_cpu_scaling_frequency_hertz Current scaled CPU thread frequency in hertz.
# TYPE node_cpu_scaling_frequency_hertz gauge
node_cpu_scaling_frequency_hertz{cpu="0"} 2e+09
node_cpu_scaling_frequency_hertz{cpu="1"} 2e+09
node_cpu_scaling_frequency_hertz{cpu="2"} 2e+09
node_cpu_scaling_frequency_hertz{cpu="3"} 7e+08
# HELP node_cpu_scaling_frequency_max_hertz Maximum scaled CPU thread frequency in hertz.
# TYPE node_cpu_scaling_frequency_max_hertz gauge
node_cpu_scaling_frequency_max_hertz{cpu="0"} 2e+09
node_cpu_scaling_frequency_max_hertz{cpu="1"} 2e+09
node_cpu_scaling_frequency_max_hertz{cpu="2"} 2e+09
node_cpu_scaling_frequency_max_hertz{cpu="3"} 2e+09
# HELP node_cpu_scaling_frequency_min_hertz Minimum scaled CPU thread frequency in hertz.
# TYPE node_cpu_scaling_frequency_min_hertz gauge
node_cpu_scaling_frequency_min_hertz{cpu="0"} 6e+08
node_cpu_scaling_frequency_min_hertz{cpu="1"} 6e+08
node_cpu_scaling_frequency_min_hertz{cpu="2"} 6e+08
node_cpu_scaling_frequency_min_hertz{cpu="3"} 6e+08
# HELP node_cpu_scaling_governor Current enabled CPU frequency governor.
# TYPE node_cpu_scaling_governor gauge
node_cpu_scaling_governor{cpu="0",governor="conservative"} 0
node_cpu_scaling_governor{cpu="0",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="0",governor="performance"} 0
node_cpu_scaling_governor{cpu="0",governor="powersave"} 0
node_cpu_scaling_governor{cpu="0",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="0",governor="userspace"} 0
node_cpu_scaling_governor{cpu="1",governor="conservative"} 0
node_cpu_scaling_governor{cpu="1",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="1",governor="performance"} 0
node_cpu_scaling_governor{cpu="1",governor="powersave"} 0
node_cpu_scaling_governor{cpu="1",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="1",governor="userspace"} 0
node_cpu_scaling_governor{cpu="2",governor="conservative"} 0
node_cpu_scaling_governor{cpu="2",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="2",governor="performance"} 0
node_cpu_scaling_governor{cpu="2",governor="powersave"} 0
node_cpu_scaling_governor{cpu="2",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="2",governor="userspace"} 0
node_cpu_scaling_governor{cpu="3",governor="conservative"} 0
node_cpu_scaling_governor{cpu="3",governor="ondemand"} 1
node_cpu_scaling_governor{cpu="3",governor="performance"} 0
node_cpu_scaling_governor{cpu="3",governor="powersave"} 0
node_cpu_scaling_governor{cpu="3",governor="schedutil"} 0
node_cpu_scaling_governor{cpu="3",governor="userspace"} 0
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 2.68818165e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 8376.2
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 64.64
node_cpu_seconds_total{cpu="0",mode="softirq"} 17095.42
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 69354.3
node_cpu_seconds_total{cpu="0",mode="user"} 100985.22
node_cpu_seconds_total{cpu="1",mode="idle"} 2.70092994e+06
node_cpu_seconds_total{cpu="1",mode="iowait"} 10578.32
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 61.07
node_cpu_seconds_total{cpu="1",mode="softirq"} 3442.94
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 72718.57
node_cpu_seconds_total{cpu="1",mode="user"} 112849.28
node_cpu_seconds_total{cpu="2",mode="idle"} 2.70036651e+06
node_cpu_seconds_total{cpu="2",mode="iowait"} 10596.56
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 44.05
node_cpu_seconds_total{cpu="2",mode="softirq"} 3462.77
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 73257.94
node_cpu_seconds_total{cpu="2",mode="user"} 112932.46
node_cpu_seconds_total{cpu="3",mode="idle"} 2.7039725e+06
node_cpu_seconds_total{cpu="3",mode="iowait"} 10525.98
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 56.42
node_cpu_seconds_total{cpu="3",mode="softirq"} 3434.8
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 71924.93
node_cpu_seconds_total{cpu="3",mode="user"} 111615.13
# HELP node_disk_discard_time_seconds_total This is the total number of seconds spent by all discards.
# TYPE node_disk_discard_time_seconds_total counter
node_disk_discard_time_seconds_total{device="mmcblk0"} 6.008
node_disk_discard_time_seconds_total{device="mmcblk0p1"} 0.11800000000000001
node_disk_discard_time_seconds_total{device="mmcblk0p2"} 5.889
# HELP node_disk_discarded_sectors_total The total number of sectors discarded successfully.
# TYPE node_disk_discarded_sectors_total counter
node_disk_discarded_sectors_total{device="mmcblk0"} 2.7187894e+08
node_disk_discarded_sectors_total{device="mmcblk0p1"} 4.57802e+06
node_disk_discarded_sectors_total{device="mmcblk0p2"} 2.6730092e+08
# HELP node_disk_discards_completed_total The total number of discards completed successfully.
# TYPE node_disk_discards_completed_total counter
node_disk_discards_completed_total{device="mmcblk0"} 1330
node_disk_discards_completed_total{device="mmcblk0p1"} 20
node_disk_discards_completed_total{device="mmcblk0p2"} 1310
# HELP node_disk_discards_merged_total The total number of discards merged.
# TYPE node_disk_discards_merged_total counter
node_disk_discards_merged_total{device="mmcblk0"} 306
node_disk_discards_merged_total{device="mmcblk0p1"} 20
node_disk_discards_merged_total{device="mmcblk0p2"} 286
# HELP node_disk_filesystem_info Info about disk filesystem.
# TYPE node_disk_filesystem_info gauge
node_disk_filesystem_info{device="mmcblk0p1",type="vfat",usage="filesystem",uuid="5DF9-E225",version="FAT32"} 1
node_disk_filesystem_info{device="mmcblk0p2",type="ext4",usage="filesystem",uuid="3b614a3f-4a65-4480-876a-8a998e01ac9b",version="1.0"} 1
# HELP node_disk_flush_requests_time_seconds_total This is the total number of seconds spent by all flush requests.
# TYPE node_disk_flush_requests_time_seconds_total counter
node_disk_flush_requests_time_seconds_total{device="mmcblk0"} 4597.003
node_disk_flush_requests_time_seconds_total{device="mmcblk0p1"} 0
node_disk_flush_requests_time_seconds_total{device="mmcblk0p2"} 0
# HELP node_disk_flush_requests_total The total number of flush requests completed successfully
# TYPE node_disk_flush_requests_total counter
node_disk_flush_requests_total{device="mmcblk0"} 2.0808855e+07
node_disk_flush_requests_total{device="mmcblk0p1"} 0
node_disk_flush_requests_total{device="mmcblk0p2"} 0
# HELP node_disk_info Info of /sys/block/<block_device>.
# TYPE node_disk_info gauge
node_disk_info{device="mmcblk0",major="179",minor="0",model="",path="platform-fe340000.mmc",revision="",serial="",wwn=""} 1
node_disk_info{device="mmcblk0p1",major="179",minor="1",model="",path="platform-fe340000.mmc",revision="",serial="",wwn=""} 1
node_disk_info{device="mmcblk0p2",major="179",minor="2",model="",path="platform-fe340000.mmc",revision="",serial="",wwn=""} 1
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="mmcblk0"} 0
node_disk_io_now{device="mmcblk0p1"} 0
node_disk_io_now{device="mmcblk0p2"} 0
# HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="mmcblk0"} 109481.804
node_disk_io_time_seconds_total{device="mmcblk0p1"} 4.172
node_disk_io_time_seconds_total{device="mmcblk0p2"} 109479.144
# HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
# TYPE node_disk_io_time_weighted_seconds_total counter
node_disk_io_time_weighted_seconds_total{device="mmcblk0"} 254357.374
node_disk_io_time_weighted_seconds_total{device="mmcblk0p1"} 168.897
node_disk_io_time_weighted_seconds_total{device="mmcblk0p2"} 249591.36000000002
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="mmcblk0"} 1.142326272e+09
node_disk_read_bytes_total{device="mmcblk0p1"} 8.704e+06
node_disk_read_bytes_total{device="mmcblk0p2"} 1.132397568e+09
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="mmcblk0"} 72.763
node_disk_read_time_seconds_total{device="mmcblk0p1"} 0.8140000000000001
node_disk_read_time_seconds_total{device="mmcblk0p2"} 71.888
# HELP node_disk_reads_completed_total The total number of reads completed successfully.
# TYPE node_disk_reads_completed_total counter
node_disk_reads_completed_total{device="mmcblk0"} 26194
node_disk_reads_completed_total{device="mmcblk0p1"} 234
node_disk_reads_completed_total{device="mmcblk0p2"} 25885
# HELP node_disk_reads_merged_total The total number of reads merged.
# TYPE node_disk_reads_merged_total counter
node_disk_reads_merged_total{device="mmcblk0"} 4740
node_disk_reads_merged_total{device="mmcblk0p1"} 1119
node_disk_reads_merged_total{device="mmcblk0p2"} 3621
# HELP node_disk_write_time_seconds_total This is the total number of seconds spent by all writes.
# TYPE node_disk_write_time_seconds_total counter
node_disk_write_time_seconds_total{device="mmcblk0"} 249681.59900000002
node_disk_write_time_seconds_total{device="mmcblk0p1"} 167.964
node_disk_write_time_seconds_total{device="mmcblk0p2"} 249513.581
# HELP node_disk_writes_completed_total The total number of writes completed successfully.
# TYPE node_disk_writes_completed_total counter
node_disk_writes_completed_total{device="mmcblk0"} 6.356576e+07
node_disk_writes_completed_total{device="mmcblk0p1"} 749
node_disk_writes_completed_total{device="mmcblk0p2"} 6.3564908e+07
# HELP node_disk_writes_merged_total The number of writes merged.
# TYPE node_disk_writes_merged_total counter
node_disk_writes_merged_total{device="mmcblk0"} 9.074629e+06
node_disk_writes_merged_total{device="mmcblk0p1"} 1554
node_disk_writes_merged_total{device="mmcblk0p2"} 9.073075e+06
# HELP node_disk_written_bytes_total The total number of bytes written successfully.
# TYPE node_disk_written_bytes_total counter
node_disk_written_bytes_total{device="mmcblk0"} 2.61909222912e+11
node_disk_written_bytes_total{device="mmcblk0p1"} 8.3293696e+07
node_disk_written_bytes_total{device="mmcblk0p2"} 2.61825929216e+11
# HELP node_entropy_available_bits Bits of available entropy.
# TYPE node_entropy_available_bits gauge
node_entropy_available_bits 256
# HELP node_entropy_pool_size_bits Bits of entropy pool.
# TYPE node_entropy_pool_size_bits gauge
node_entropy_pool_size_bits 256
# HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which node_exporter was built, and the goos and goarch for the build.
# TYPE node_exporter_build_info gauge
node_exporter_build_info{branch="HEAD",goarch="arm64",goos="linux",goversion="go1.21.4",revision="7333465abf9efba81876303bb57e6fadb946041b",tags="netgo osusergo static_build",version="1.7.0"} 1
# HELP node_filefd_allocated File descriptor statistics: allocated.
# TYPE node_filefd_allocated gauge
node_filefd_allocated 2080
# HELP node_filefd_maximum File descriptor statistics: maximum.
# TYPE node_filefd_maximum gauge
node_filefd_maximum 9.223372036854776e+18
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 4.6998528e+08
node_filesystem_avail_bytes{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 1.12564281344e+11
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 6.7108864e+07
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 8.16193536e+08
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.226496e+06
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 8.19064832e+08
# HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
# TYPE node_filesystem_device_error gauge
node_filesystem_device_error{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_device_error{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 0
node_filesystem_device_error{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 0
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/0e619376-d2fe-4f79-bc74-64fe5b3c8232/volumes/kubernetes.io~projected/kube-api-access-2f5p9"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/61bde493-1e9f-4d4d-b77f-1df095f775c4/volumes/kubernetes.io~projected/kube-api-access-rdrm2"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/878c2007-167d-4437-b654-43ef9cc0a5f0/volumes/kubernetes.io~projected/kube-api-access-j5fzh"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/9660f563-0f88-41aa-9d38-654911a04158/volumes/kubernetes.io~projected/kube-api-access-n494p"} 1
node_filesystem_device_error{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/c008ef0e-9212-42ce-9a34-6ccaf6b087d1/volumes/kubernetes.io~projected/kube-api-access-9c8sx"} 1
# HELP node_filesystem_files Filesystem total file nodes.
# TYPE node_filesystem_files gauge
node_filesystem_files{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_files{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 7.500896e+06
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 999839
node_filesystem_files{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 999839
node_filesystem_files{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 999839
node_filesystem_files{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 999839
node_filesystem_files{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 199967
# HELP node_filesystem_files_free Filesystem total free file nodes.
# TYPE node_filesystem_files_free gauge
node_filesystem_files_free{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_files_free{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 7.421624e+06
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 999838
node_filesystem_files_free{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 999838
node_filesystem_files_free{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 998519
node_filesystem_files_free{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 999833
node_filesystem_files_free{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 199947
# HELP node_filesystem_free_bytes Filesystem free space in bytes.
# TYPE node_filesystem_free_bytes gauge
node_filesystem_free_bytes{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 4.6998528e+08
node_filesystem_free_bytes{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 1.18947086336e+11
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 6.7108864e+07
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 8.16193536e+08
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.226496e+06
node_filesystem_free_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 8.19064832e+08
# HELP node_filesystem_readonly Filesystem read-only status.
# TYPE node_filesystem_readonly gauge
node_filesystem_readonly{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 0
node_filesystem_readonly{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 0
node_filesystem_readonly{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/0e619376-d2fe-4f79-bc74-64fe5b3c8232/volumes/kubernetes.io~projected/kube-api-access-2f5p9"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/61bde493-1e9f-4d4d-b77f-1df095f775c4/volumes/kubernetes.io~projected/kube-api-access-rdrm2"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/878c2007-167d-4437-b654-43ef9cc0a5f0/volumes/kubernetes.io~projected/kube-api-access-j5fzh"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/9660f563-0f88-41aa-9d38-654911a04158/volumes/kubernetes.io~projected/kube-api-access-n494p"} 0
node_filesystem_readonly{device="tmpfs",fstype="tmpfs",mountpoint="/var/lib/kubelet/pods/c008ef0e-9212-42ce-9a34-6ccaf6b087d1/volumes/kubernetes.io~projected/kube-api-access-9c8sx"} 0
# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/mmcblk0p1",fstype="vfat",mountpoint="/boot/firmware"} 5.34765568e+08
node_filesystem_size_bytes{device="/dev/mmcblk0p2",fstype="ext4",mountpoint="/"} 1.25321166848e+11
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/0208f7a11dc33d5bc8bd289bad919bb17181316989d0b67797b9bc600eca5feb/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/11d4155c4c3ccf57a41200b7ec3de847c49956a051889aed26bcb0efe751d221/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/29e8b6a960d1f80aa5dba931d282e3e896f4689b6d27e0f29296860ac03fa6b4/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3577c3db4143954a3f4213a5a6dedd3dfb336f135900eecf207414ad4770f1b0/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/576fad5789d19fda4bfcc5999d388e6f99e262000d11112356e37c6a929059ed/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/8524e37b55f671a6cb14491142d00badcfe7dc62a7e73540d107378f68b68667/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a15cd8217349b0597cc3fb05844d99db669880444ca3957a26e5c57c326550c0/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/d2d3f63c7fd3e21f208a8a2a2d0428cc248f979655bc87ad89e38f6f93e7d1ac/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/e9738e9d4d1902e832e290dfac1f0a6b6a1d87ba172c64818a032f0ae131b124/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="shm",fstype="tmpfs",mountpoint="/run/containerd/io.containerd.grpc.v1.cri/sandboxes/eea73d4ab31da1cdcbbbfe69e4c1e3b2338d7b659fee3d8e05a33b3e6cf4638c/shm"} 6.7108864e+07
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 8.19068928e+08
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.24288e+06
node_filesystem_size_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/1000"} 8.19064832e+08
# HELP node_forks_total Total number of forks.
# TYPE node_forks_total counter
node_forks_total 1.9002994e+07
# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="platform_gpio_fan_0",chip_name="gpio_fan"} 1
node_hwmon_chip_names{chip="soc:firmware_raspberrypi_hwmon",chip_name="rpi_volt"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1
# HELP node_hwmon_fan_max_rpm Hardware monitor for fan revolutions per minute (max)
# TYPE node_hwmon_fan_max_rpm gauge
node_hwmon_fan_max_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 5000
# HELP node_hwmon_fan_min_rpm Hardware monitor for fan revolutions per minute (min)
# TYPE node_hwmon_fan_min_rpm gauge
node_hwmon_fan_min_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 0
# HELP node_hwmon_fan_rpm Hardware monitor for fan revolutions per minute (input)
# TYPE node_hwmon_fan_rpm gauge
node_hwmon_fan_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 5000
# HELP node_hwmon_fan_target_rpm Hardware monitor for fan revolutions per minute (target)
# TYPE node_hwmon_fan_target_rpm gauge
node_hwmon_fan_target_rpm{chip="platform_gpio_fan_0",sensor="fan1"} 5000
# HELP node_hwmon_in_lcrit_alarm_volts Hardware monitor for voltage (lcrit_alarm)
# TYPE node_hwmon_in_lcrit_alarm_volts gauge
node_hwmon_in_lcrit_alarm_volts{chip="soc:firmware_raspberrypi_hwmon",sensor="in0"} 0
# HELP node_hwmon_pwm Hardware monitor pwm element
# TYPE node_hwmon_pwm gauge
node_hwmon_pwm{chip="platform_gpio_fan_0",sensor="pwm1"} 255
# HELP node_hwmon_pwm_enable Hardware monitor pwm element enable
# TYPE node_hwmon_pwm_enable gauge
node_hwmon_pwm_enable{chip="platform_gpio_fan_0",sensor="pwm1"} 1
# HELP node_hwmon_pwm_mode Hardware monitor pwm element mode
# TYPE node_hwmon_pwm_mode gauge
node_hwmon_pwm_mode{chip="platform_gpio_fan_0",sensor="pwm1"} 0
# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp0"} 27.745
node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"} 27.745
# HELP node_hwmon_temp_crit_celsius Hardware monitor for temperature (crit)
# TYPE node_hwmon_temp_crit_celsius gauge
node_hwmon_temp_crit_celsius{chip="thermal_thermal_zone0",sensor="temp1"} 110
# HELP node_intr_total Total number of interrupts serviced.
# TYPE node_intr_total counter
node_intr_total 1.0312668562e+10
# HELP node_ipvs_connections_total The total number of connections made.
# TYPE node_ipvs_connections_total counter
node_ipvs_connections_total 2907
# HELP node_ipvs_incoming_bytes_total The total amount of incoming data.
# TYPE node_ipvs_incoming_bytes_total counter
node_ipvs_incoming_bytes_total 2.77474522e+08
# HELP node_ipvs_incoming_packets_total The total number of incoming packets.
# TYPE node_ipvs_incoming_packets_total counter
node_ipvs_incoming_packets_total 3.761541e+06
# HELP node_ipvs_outgoing_bytes_total The total amount of outgoing data.
# TYPE node_ipvs_outgoing_bytes_total counter
node_ipvs_outgoing_bytes_total 7.406631703e+09
# HELP node_ipvs_outgoing_packets_total The total number of outgoing packets.
# TYPE node_ipvs_outgoing_packets_total counter
node_ipvs_outgoing_packets_total 4.224817e+06
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.87
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.63
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.58
# HELP node_memory_Active_anon_bytes Memory information field Active_anon_bytes.
# TYPE node_memory_Active_anon_bytes gauge
node_memory_Active_anon_bytes 1.043009536e+09
# HELP node_memory_Active_bytes Memory information field Active_bytes.
# TYPE node_memory_Active_bytes gauge
node_memory_Active_bytes 1.62168832e+09
# HELP node_memory_Active_file_bytes Memory information field Active_file_bytes.
# TYPE node_memory_Active_file_bytes gauge
node_memory_Active_file_bytes 5.78678784e+08
# HELP node_memory_AnonPages_bytes Memory information field AnonPages_bytes.
# TYPE node_memory_AnonPages_bytes gauge
node_memory_AnonPages_bytes 1.043357696e+09
# HELP node_memory_Bounce_bytes Memory information field Bounce_bytes.
# TYPE node_memory_Bounce_bytes gauge
node_memory_Bounce_bytes 0
# HELP node_memory_Buffers_bytes Memory information field Buffers_bytes.
# TYPE node_memory_Buffers_bytes gauge
node_memory_Buffers_bytes 1.36790016e+08
# HELP node_memory_Cached_bytes Memory information field Cached_bytes.
# TYPE node_memory_Cached_bytes gauge
node_memory_Cached_bytes 4.609712128e+09
# HELP node_memory_CmaFree_bytes Memory information field CmaFree_bytes.
# TYPE node_memory_CmaFree_bytes gauge
node_memory_CmaFree_bytes 5.25586432e+08
# HELP node_memory_CmaTotal_bytes Memory information field CmaTotal_bytes.
# TYPE node_memory_CmaTotal_bytes gauge
node_memory_CmaTotal_bytes 5.36870912e+08
# HELP node_memory_CommitLimit_bytes Memory information field CommitLimit_bytes.
# TYPE node_memory_CommitLimit_bytes gauge
node_memory_CommitLimit_bytes 4.095340544e+09
# HELP node_memory_Committed_AS_bytes Memory information field Committed_AS_bytes.
# TYPE node_memory_Committed_AS_bytes gauge
node_memory_Committed_AS_bytes 3.449647104e+09
# HELP node_memory_Dirty_bytes Memory information field Dirty_bytes.
# TYPE node_memory_Dirty_bytes gauge
node_memory_Dirty_bytes 65536
# HELP node_memory_Inactive_anon_bytes Memory information field Inactive_anon_bytes.
# TYPE node_memory_Inactive_anon_bytes gauge
node_memory_Inactive_anon_bytes 3.25632e+06
# HELP node_memory_Inactive_bytes Memory information field Inactive_bytes.
# TYPE node_memory_Inactive_bytes gauge
node_memory_Inactive_bytes 4.168126464e+09
# HELP node_memory_Inactive_file_bytes Memory information field Inactive_file_bytes.
# TYPE node_memory_Inactive_file_bytes gauge
node_memory_Inactive_file_bytes 4.164870144e+09
# HELP node_memory_KReclaimable_bytes Memory information field KReclaimable_bytes.
# TYPE node_memory_KReclaimable_bytes gauge
node_memory_KReclaimable_bytes 4.01215488e+08
# HELP node_memory_KernelStack_bytes Memory information field KernelStack_bytes.
# TYPE node_memory_KernelStack_bytes gauge
node_memory_KernelStack_bytes 8.667136e+06
# HELP node_memory_Mapped_bytes Memory information field Mapped_bytes.
# TYPE node_memory_Mapped_bytes gauge
node_memory_Mapped_bytes 6.4243712e+08
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 6.829756416e+09
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1.837809664e+09
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.190685184e+09
# HELP node_memory_Mlocked_bytes Memory information field Mlocked_bytes.
# TYPE node_memory_Mlocked_bytes gauge
node_memory_Mlocked_bytes 0
# HELP node_memory_NFS_Unstable_bytes Memory information field NFS_Unstable_bytes.
# TYPE node_memory_NFS_Unstable_bytes gauge
node_memory_NFS_Unstable_bytes 0
# HELP node_memory_PageTables_bytes Memory information field PageTables_bytes.
# TYPE node_memory_PageTables_bytes gauge
node_memory_PageTables_bytes 1.128448e+07
# HELP node_memory_Percpu_bytes Memory information field Percpu_bytes.
# TYPE node_memory_Percpu_bytes gauge
node_memory_Percpu_bytes 3.52256e+06
# HELP node_memory_SReclaimable_bytes Memory information field SReclaimable_bytes.
# TYPE node_memory_SReclaimable_bytes gauge
node_memory_SReclaimable_bytes 4.01215488e+08
# HELP node_memory_SUnreclaim_bytes Memory information field SUnreclaim_bytes.
# TYPE node_memory_SUnreclaim_bytes gauge
node_memory_SUnreclaim_bytes 8.0576512e+07
# HELP node_memory_SecPageTables_bytes Memory information field SecPageTables_bytes.
# TYPE node_memory_SecPageTables_bytes gauge
node_memory_SecPageTables_bytes 0
# HELP node_memory_Shmem_bytes Memory information field Shmem_bytes.
# TYPE node_memory_Shmem_bytes gauge
node_memory_Shmem_bytes 2.953216e+06
# HELP node_memory_Slab_bytes Memory information field Slab_bytes.
# TYPE node_memory_Slab_bytes gauge
node_memory_Slab_bytes 4.81792e+08
# HELP node_memory_SwapCached_bytes Memory information field SwapCached_bytes.
# TYPE node_memory_SwapCached_bytes gauge
node_memory_SwapCached_bytes 0
# HELP node_memory_SwapFree_bytes Memory information field SwapFree_bytes.
# TYPE node_memory_SwapFree_bytes gauge
node_memory_SwapFree_bytes 0
# HELP node_memory_SwapTotal_bytes Memory information field SwapTotal_bytes.
# TYPE node_memory_SwapTotal_bytes gauge
node_memory_SwapTotal_bytes 0
# HELP node_memory_Unevictable_bytes Memory information field Unevictable_bytes.
# TYPE node_memory_Unevictable_bytes gauge
node_memory_Unevictable_bytes 0
# HELP node_memory_VmallocChunk_bytes Memory information field VmallocChunk_bytes.
# TYPE node_memory_VmallocChunk_bytes gauge
node_memory_VmallocChunk_bytes 0
# HELP node_memory_VmallocTotal_bytes Memory information field VmallocTotal_bytes.
# TYPE node_memory_VmallocTotal_bytes gauge
node_memory_VmallocTotal_bytes 2.65885319168e+11
# HELP node_memory_VmallocUsed_bytes Memory information field VmallocUsed_bytes.
# TYPE node_memory_VmallocUsed_bytes gauge
node_memory_VmallocUsed_bytes 2.3687168e+07
# HELP node_memory_WritebackTmp_bytes Memory information field WritebackTmp_bytes.
# TYPE node_memory_WritebackTmp_bytes gauge
node_memory_WritebackTmp_bytes 0
# HELP node_memory_Writeback_bytes Memory information field Writeback_bytes.
# TYPE node_memory_Writeback_bytes gauge
node_memory_Writeback_bytes 0
# HELP node_memory_Zswap_bytes Memory information field Zswap_bytes.
# TYPE node_memory_Zswap_bytes gauge
node_memory_Zswap_bytes 0
# HELP node_memory_Zswapped_bytes Memory information field Zswapped_bytes.
# TYPE node_memory_Zswapped_bytes gauge
node_memory_Zswapped_bytes 0
# HELP node_netstat_Icmp6_InErrors Statistic Icmp6InErrors.
# TYPE node_netstat_Icmp6_InErrors untyped
node_netstat_Icmp6_InErrors 0
# HELP node_netstat_Icmp6_InMsgs Statistic Icmp6InMsgs.
# TYPE node_netstat_Icmp6_InMsgs untyped
node_netstat_Icmp6_InMsgs 2
# HELP node_netstat_Icmp6_OutMsgs Statistic Icmp6OutMsgs.
# TYPE node_netstat_Icmp6_OutMsgs untyped
node_netstat_Icmp6_OutMsgs 1601
# HELP node_netstat_Icmp_InErrors Statistic IcmpInErrors.
# TYPE node_netstat_Icmp_InErrors untyped
node_netstat_Icmp_InErrors 1
# HELP node_netstat_Icmp_InMsgs Statistic IcmpInMsgs.
# TYPE node_netstat_Icmp_InMsgs untyped
node_netstat_Icmp_InMsgs 17
# HELP node_netstat_Icmp_OutMsgs Statistic IcmpOutMsgs.
# TYPE node_netstat_Icmp_OutMsgs untyped
node_netstat_Icmp_OutMsgs 14
# HELP node_netstat_Ip6_InOctets Statistic Ip6InOctets.
# TYPE node_netstat_Ip6_InOctets untyped
node_netstat_Ip6_InOctets 3.997070725e+09
# HELP node_netstat_Ip6_OutOctets Statistic Ip6OutOctets.
# TYPE node_netstat_Ip6_OutOctets untyped
node_netstat_Ip6_OutOctets 3.997073515e+09
# HELP node_netstat_IpExt_InOctets Statistic IpExtInOctets.
# TYPE node_netstat_IpExt_InOctets untyped
node_netstat_IpExt_InOctets 1.08144717251e+11
# HELP node_netstat_IpExt_OutOctets Statistic IpExtOutOctets.
# TYPE node_netstat_IpExt_OutOctets untyped
node_netstat_IpExt_OutOctets 1.56294035787e+11
# HELP node_netstat_Ip_Forwarding Statistic IpForwarding.
# TYPE node_netstat_Ip_Forwarding untyped
node_netstat_Ip_Forwarding 1
# HELP node_netstat_TcpExt_ListenDrops Statistic TcpExtListenDrops.
# TYPE node_netstat_TcpExt_ListenDrops untyped
node_netstat_TcpExt_ListenDrops 0
# HELP node_netstat_TcpExt_ListenOverflows Statistic TcpExtListenOverflows.
# TYPE node_netstat_TcpExt_ListenOverflows untyped
node_netstat_TcpExt_ListenOverflows 0
# HELP node_netstat_TcpExt_SyncookiesFailed Statistic TcpExtSyncookiesFailed.
# TYPE node_netstat_TcpExt_SyncookiesFailed untyped
node_netstat_TcpExt_SyncookiesFailed 0
# HELP node_netstat_TcpExt_SyncookiesRecv Statistic TcpExtSyncookiesRecv.
# TYPE node_netstat_TcpExt_SyncookiesRecv untyped
node_netstat_TcpExt_SyncookiesRecv 0
# HELP node_netstat_TcpExt_SyncookiesSent Statistic TcpExtSyncookiesSent.
# TYPE node_netstat_TcpExt_SyncookiesSent untyped
node_netstat_TcpExt_SyncookiesSent 0
# HELP node_netstat_TcpExt_TCPSynRetrans Statistic TcpExtTCPSynRetrans.
# TYPE node_netstat_TcpExt_TCPSynRetrans untyped
node_netstat_TcpExt_TCPSynRetrans 342
# HELP node_netstat_TcpExt_TCPTimeouts Statistic TcpExtTCPTimeouts.
# TYPE node_netstat_TcpExt_TCPTimeouts untyped
node_netstat_TcpExt_TCPTimeouts 513
# HELP node_netstat_Tcp_ActiveOpens Statistic TcpActiveOpens.
# TYPE node_netstat_Tcp_ActiveOpens untyped
node_netstat_Tcp_ActiveOpens 7.121624e+06
# HELP node_netstat_Tcp_CurrEstab Statistic TcpCurrEstab.
# TYPE node_netstat_Tcp_CurrEstab untyped
node_netstat_Tcp_CurrEstab 236
# HELP node_netstat_Tcp_InErrs Statistic TcpInErrs.
# TYPE node_netstat_Tcp_InErrs untyped
node_netstat_Tcp_InErrs 0
# HELP node_netstat_Tcp_InSegs Statistic TcpInSegs.
# TYPE node_netstat_Tcp_InSegs untyped
node_netstat_Tcp_InSegs 5.82648533e+08
# HELP node_netstat_Tcp_OutRsts Statistic TcpOutRsts.
# TYPE node_netstat_Tcp_OutRsts untyped
node_netstat_Tcp_OutRsts 5.798397e+06
# HELP node_netstat_Tcp_OutSegs Statistic TcpOutSegs.
# TYPE node_netstat_Tcp_OutSegs untyped
node_netstat_Tcp_OutSegs 6.13524809e+08
# HELP node_netstat_Tcp_PassiveOpens Statistic TcpPassiveOpens.
# TYPE node_netstat_Tcp_PassiveOpens untyped
node_netstat_Tcp_PassiveOpens 6.751246e+06
# HELP node_netstat_Tcp_RetransSegs Statistic TcpRetransSegs.
# TYPE node_netstat_Tcp_RetransSegs untyped
node_netstat_Tcp_RetransSegs 173853
# HELP node_netstat_Udp6_InDatagrams Statistic Udp6InDatagrams.
# TYPE node_netstat_Udp6_InDatagrams untyped
node_netstat_Udp6_InDatagrams 279
# HELP node_netstat_Udp6_InErrors Statistic Udp6InErrors.
# TYPE node_netstat_Udp6_InErrors untyped
node_netstat_Udp6_InErrors 0
# HELP node_netstat_Udp6_NoPorts Statistic Udp6NoPorts.
# TYPE node_netstat_Udp6_NoPorts untyped
node_netstat_Udp6_NoPorts 0
# HELP node_netstat_Udp6_OutDatagrams Statistic Udp6OutDatagrams.
# TYPE node_netstat_Udp6_OutDatagrams untyped
node_netstat_Udp6_OutDatagrams 236
# HELP node_netstat_Udp6_RcvbufErrors Statistic Udp6RcvbufErrors.
# TYPE node_netstat_Udp6_RcvbufErrors untyped
node_netstat_Udp6_RcvbufErrors 0
# HELP node_netstat_Udp6_SndbufErrors Statistic Udp6SndbufErrors.
# TYPE node_netstat_Udp6_SndbufErrors untyped
node_netstat_Udp6_SndbufErrors 0
# HELP node_netstat_UdpLite6_InErrors Statistic UdpLite6InErrors.
# TYPE node_netstat_UdpLite6_InErrors untyped
node_netstat_UdpLite6_InErrors 0
# HELP node_netstat_UdpLite_InErrors Statistic UdpLiteInErrors.
# TYPE node_netstat_UdpLite_InErrors untyped
node_netstat_UdpLite_InErrors 0
# HELP node_netstat_Udp_InDatagrams Statistic UdpInDatagrams.
# TYPE node_netstat_Udp_InDatagrams untyped
node_netstat_Udp_InDatagrams 6.547468e+06
# HELP node_netstat_Udp_InErrors Statistic UdpInErrors.
# TYPE node_netstat_Udp_InErrors untyped
node_netstat_Udp_InErrors 0
# HELP node_netstat_Udp_NoPorts Statistic UdpNoPorts.
# TYPE node_netstat_Udp_NoPorts untyped
node_netstat_Udp_NoPorts 9
# HELP node_netstat_Udp_OutDatagrams Statistic UdpOutDatagrams.
# TYPE node_netstat_Udp_OutDatagrams untyped
node_netstat_Udp_OutDatagrams 3.213419e+06
# HELP node_netstat_Udp_RcvbufErrors Statistic UdpRcvbufErrors.
# TYPE node_netstat_Udp_RcvbufErrors untyped
node_netstat_Udp_RcvbufErrors 0
# HELP node_netstat_Udp_SndbufErrors Statistic UdpSndbufErrors.
# TYPE node_netstat_Udp_SndbufErrors untyped
node_netstat_Udp_SndbufErrors 0
# HELP node_network_address_assign_type Network device property: address_assign_type
# TYPE node_network_address_assign_type gauge
node_network_address_assign_type{device="cali60e575ce8db"} 3
node_network_address_assign_type{device="cali85a56337055"} 3
node_network_address_assign_type{device="cali8c459f6702e"} 3
node_network_address_assign_type{device="eth0"} 0
node_network_address_assign_type{device="lo"} 0
node_network_address_assign_type{device="tunl0"} 0
node_network_address_assign_type{device="wlan0"} 0
# HELP node_network_carrier Network device property: carrier
# TYPE node_network_carrier gauge
node_network_carrier{device="cali60e575ce8db"} 1
node_network_carrier{device="cali85a56337055"} 1
node_network_carrier{device="cali8c459f6702e"} 1
node_network_carrier{device="eth0"} 1
node_network_carrier{device="lo"} 1
node_network_carrier{device="tunl0"} 1
node_network_carrier{device="wlan0"} 0
# HELP node_network_carrier_changes_total Network device property: carrier_changes_total
# TYPE node_network_carrier_changes_total counter
node_network_carrier_changes_total{device="cali60e575ce8db"} 4
node_network_carrier_changes_total{device="cali85a56337055"} 4
node_network_carrier_changes_total{device="cali8c459f6702e"} 4
node_network_carrier_changes_total{device="eth0"} 1
node_network_carrier_changes_total{device="lo"} 0
node_network_carrier_changes_total{device="tunl0"} 0
node_network_carrier_changes_total{device="wlan0"} 1
# HELP node_network_carrier_down_changes_total Network device property: carrier_down_changes_total
# TYPE node_network_carrier_down_changes_total counter
node_network_carrier_down_changes_total{device="cali60e575ce8db"} 2
node_network_carrier_down_changes_total{device="cali85a56337055"} 2
node_network_carrier_down_changes_total{device="cali8c459f6702e"} 2
node_network_carrier_down_changes_total{device="eth0"} 0
node_network_carrier_down_changes_total{device="lo"} 0
node_network_carrier_down_changes_total{device="tunl0"} 0
node_network_carrier_down_changes_total{device="wlan0"} 1
# HELP node_network_carrier_up_changes_total Network device property: carrier_up_changes_total
# TYPE node_network_carrier_up_changes_total counter
node_network_carrier_up_changes_total{device="cali60e575ce8db"} 2
node_network_carrier_up_changes_total{device="cali85a56337055"} 2
node_network_carrier_up_changes_total{device="cali8c459f6702e"} 2
node_network_carrier_up_changes_total{device="eth0"} 1
node_network_carrier_up_changes_total{device="lo"} 0
node_network_carrier_up_changes_total{device="tunl0"} 0
node_network_carrier_up_changes_total{device="wlan0"} 0
# HELP node_network_device_id Network device property: device_id
# TYPE node_network_device_id gauge
node_network_device_id{device="cali60e575ce8db"} 0
node_network_device_id{device="cali85a56337055"} 0
node_network_device_id{device="cali8c459f6702e"} 0
node_network_device_id{device="eth0"} 0
node_network_device_id{device="lo"} 0
node_network_device_id{device="tunl0"} 0
node_network_device_id{device="wlan0"} 0
# HELP node_network_dormant Network device property: dormant
# TYPE node_network_dormant gauge
node_network_dormant{device="cali60e575ce8db"} 0
node_network_dormant{device="cali85a56337055"} 0
node_network_dormant{device="cali8c459f6702e"} 0
node_network_dormant{device="eth0"} 0
node_network_dormant{device="lo"} 0
node_network_dormant{device="tunl0"} 0
node_network_dormant{device="wlan0"} 0
# HELP node_network_flags Network device property: flags
# TYPE node_network_flags gauge
node_network_flags{device="cali60e575ce8db"} 4099
node_network_flags{device="cali85a56337055"} 4099
node_network_flags{device="cali8c459f6702e"} 4099
node_network_flags{device="eth0"} 4099
node_network_flags{device="lo"} 9
node_network_flags{device="tunl0"} 129
node_network_flags{device="wlan0"} 4099
# HELP node_network_iface_id Network device property: iface_id
# TYPE node_network_iface_id gauge
node_network_iface_id{device="cali60e575ce8db"} 73
node_network_iface_id{device="cali85a56337055"} 74
node_network_iface_id{device="cali8c459f6702e"} 70
node_network_iface_id{device="eth0"} 2
node_network_iface_id{device="lo"} 1
node_network_iface_id{device="tunl0"} 18
node_network_iface_id{device="wlan0"} 3
# HELP node_network_iface_link Network device property: iface_link
# TYPE node_network_iface_link gauge
node_network_iface_link{device="cali60e575ce8db"} 4
node_network_iface_link{device="cali85a56337055"} 4
node_network_iface_link{device="cali8c459f6702e"} 4
node_network_iface_link{device="eth0"} 2
node_network_iface_link{device="lo"} 1
node_network_iface_link{device="tunl0"} 0
node_network_iface_link{device="wlan0"} 3
# HELP node_network_iface_link_mode Network device property: iface_link_mode
# TYPE node_network_iface_link_mode gauge
node_network_iface_link_mode{device="cali60e575ce8db"} 0
node_network_iface_link_mode{device="cali85a56337055"} 0
node_network_iface_link_mode{device="cali8c459f6702e"} 0
node_network_iface_link_mode{device="eth0"} 0
node_network_iface_link_mode{device="lo"} 0
node_network_iface_link_mode{device="tunl0"} 0
node_network_iface_link_mode{device="wlan0"} 1
# HELP node_network_info Non-numeric data from /sys/class/net/<iface>, value is always 1.
# TYPE node_network_info gauge
node_network_info{address="00:00:00:00",adminstate="up",broadcast="00:00:00:00",device="tunl0",duplex="",ifalias="",operstate="unknown"} 1
node_network_info{address="00:00:00:00:00:00",adminstate="up",broadcast="00:00:00:00:00:00",device="lo",duplex="",ifalias="",operstate="unknown"} 1
node_network_info{address="d8:3a:dd:89:c1:0b",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="eth0",duplex="full",ifalias="",operstate="up"} 1
node_network_info{address="d8:3a:dd:89:c1:0c",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="wlan0",duplex="",ifalias="",operstate="down"} 1
node_network_info{address="ee:ee:ee:ee:ee:ee",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="cali60e575ce8db",duplex="full",ifalias="",operstate="up"} 1
node_network_info{address="ee:ee:ee:ee:ee:ee",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="cali85a56337055",duplex="full",ifalias="",operstate="up"} 1
node_network_info{address="ee:ee:ee:ee:ee:ee",adminstate="up",broadcast="ff:ff:ff:ff:ff:ff",device="cali8c459f6702e",duplex="full",ifalias="",operstate="up"} 1
# HELP node_network_mtu_bytes Network device property: mtu_bytes
# TYPE node_network_mtu_bytes gauge
node_network_mtu_bytes{device="cali60e575ce8db"} 1480
node_network_mtu_bytes{device="cali85a56337055"} 1480
node_network_mtu_bytes{device="cali8c459f6702e"} 1480
node_network_mtu_bytes{device="eth0"} 1500
node_network_mtu_bytes{device="lo"} 65536
node_network_mtu_bytes{device="tunl0"} 1480
node_network_mtu_bytes{device="wlan0"} 1500
# HELP node_network_name_assign_type Network device property: name_assign_type
# TYPE node_network_name_assign_type gauge
node_network_name_assign_type{device="cali60e575ce8db"} 3
node_network_name_assign_type{device="cali85a56337055"} 3
node_network_name_assign_type{device="cali8c459f6702e"} 3
node_network_name_assign_type{device="eth0"} 1
node_network_name_assign_type{device="lo"} 2
# HELP node_network_net_dev_group Network device property: net_dev_group
# TYPE node_network_net_dev_group gauge
node_network_net_dev_group{device="cali60e575ce8db"} 0
node_network_net_dev_group{device="cali85a56337055"} 0
node_network_net_dev_group{device="cali8c459f6702e"} 0
node_network_net_dev_group{device="eth0"} 0
node_network_net_dev_group{device="lo"} 0
node_network_net_dev_group{device="tunl0"} 0
node_network_net_dev_group{device="wlan0"} 0
# HELP node_network_protocol_type Network device property: protocol_type
# TYPE node_network_protocol_type gauge
node_network_protocol_type{device="cali60e575ce8db"} 1
node_network_protocol_type{device="cali85a56337055"} 1
node_network_protocol_type{device="cali8c459f6702e"} 1
node_network_protocol_type{device="eth0"} 1
node_network_protocol_type{device="lo"} 772
node_network_protocol_type{device="tunl0"} 768
node_network_protocol_type{device="wlan0"} 1
# HELP node_network_receive_bytes_total Network device statistic receive_bytes.
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="cali60e575ce8db"} 6.800154e+07
node_network_receive_bytes_total{device="cali85a56337055"} 6.6751833e+07
node_network_receive_bytes_total{device="cali8c459f6702e"} 5.9727975e+07
node_network_receive_bytes_total{device="eth0"} 5.6372248596e+10
node_network_receive_bytes_total{device="lo"} 6.0342387372e+10
node_network_receive_bytes_total{device="tunl0"} 3.599596e+06
node_network_receive_bytes_total{device="wlan0"} 0
# HELP node_network_receive_compressed_total Network device statistic receive_compressed.
# TYPE node_network_receive_compressed_total counter
node_network_receive_compressed_total{device="cali60e575ce8db"} 0
node_network_receive_compressed_total{device="cali85a56337055"} 0
node_network_receive_compressed_total{device="cali8c459f6702e"} 0
node_network_receive_compressed_total{device="eth0"} 0
node_network_receive_compressed_total{device="lo"} 0
node_network_receive_compressed_total{device="tunl0"} 0
node_network_receive_compressed_total{device="wlan0"} 0
# HELP node_network_receive_drop_total Network device statistic receive_drop.
# TYPE node_network_receive_drop_total counter
node_network_receive_drop_total{device="cali60e575ce8db"} 1
node_network_receive_drop_total{device="cali85a56337055"} 1
node_network_receive_drop_total{device="cali8c459f6702e"} 1
node_network_receive_drop_total{device="eth0"} 0
node_network_receive_drop_total{device="lo"} 0
node_network_receive_drop_total{device="tunl0"} 0
node_network_receive_drop_total{device="wlan0"} 0
# HELP node_network_receive_errs_total Network device statistic receive_errs.
# TYPE node_network_receive_errs_total counter
node_network_receive_errs_total{device="cali60e575ce8db"} 0
node_network_receive_errs_total{device="cali85a56337055"} 0
node_network_receive_errs_total{device="cali8c459f6702e"} 0
node_network_receive_errs_total{device="eth0"} 0
node_network_receive_errs_total{device="lo"} 0
node_network_receive_errs_total{device="tunl0"} 0
node_network_receive_errs_total{device="wlan0"} 0
# HELP node_network_receive_fifo_total Network device statistic receive_fifo.
# TYPE node_network_receive_fifo_total counter
node_network_receive_fifo_total{device="cali60e575ce8db"} 0
node_network_receive_fifo_total{device="cali85a56337055"} 0
node_network_receive_fifo_total{device="cali8c459f6702e"} 0
node_network_receive_fifo_total{device="eth0"} 0
node_network_receive_fifo_total{device="lo"} 0
node_network_receive_fifo_total{device="tunl0"} 0
node_network_receive_fifo_total{device="wlan0"} 0
# HELP node_network_receive_frame_total Network device statistic receive_frame.
# TYPE node_network_receive_frame_total counter
node_network_receive_frame_total{device="cali60e575ce8db"} 0
node_network_receive_frame_total{device="cali85a56337055"} 0
node_network_receive_frame_total{device="cali8c459f6702e"} 0
node_network_receive_frame_total{device="eth0"} 0
node_network_receive_frame_total{device="lo"} 0
node_network_receive_frame_total{device="tunl0"} 0
node_network_receive_frame_total{device="wlan0"} 0
# HELP node_network_receive_multicast_total Network device statistic receive_multicast.
# TYPE node_network_receive_multicast_total counter
node_network_receive_multicast_total{device="cali60e575ce8db"} 0
node_network_receive_multicast_total{device="cali85a56337055"} 0
node_network_receive_multicast_total{device="cali8c459f6702e"} 0
node_network_receive_multicast_total{device="eth0"} 3.336362e+06
node_network_receive_multicast_total{device="lo"} 0
node_network_receive_multicast_total{device="tunl0"} 0
node_network_receive_multicast_total{device="wlan0"} 0
# HELP node_network_receive_nohandler_total Network device statistic receive_nohandler.
# TYPE node_network_receive_nohandler_total counter
node_network_receive_nohandler_total{device="cali60e575ce8db"} 0
node_network_receive_nohandler_total{device="cali85a56337055"} 0
node_network_receive_nohandler_total{device="cali8c459f6702e"} 0
node_network_receive_nohandler_total{device="eth0"} 0
node_network_receive_nohandler_total{device="lo"} 0
node_network_receive_nohandler_total{device="tunl0"} 0
node_network_receive_nohandler_total{device="wlan0"} 0
# HELP node_network_receive_packets_total Network device statistic receive_packets.
# TYPE node_network_receive_packets_total counter
node_network_receive_packets_total{device="cali60e575ce8db"} 800641
node_network_receive_packets_total{device="cali85a56337055"} 781891
node_network_receive_packets_total{device="cali8c459f6702e"} 680023
node_network_receive_packets_total{device="eth0"} 3.3310639e+08
node_network_receive_packets_total{device="lo"} 2.57029971e+08
node_network_receive_packets_total{device="tunl0"} 39699
node_network_receive_packets_total{device="wlan0"} 0
# HELP node_network_speed_bytes Network device property: speed_bytes
# TYPE node_network_speed_bytes gauge
node_network_speed_bytes{device="cali60e575ce8db"} 1.25e+09
node_network_speed_bytes{device="cali85a56337055"} 1.25e+09
node_network_speed_bytes{device="cali8c459f6702e"} 1.25e+09
node_network_speed_bytes{device="eth0"} 1.25e+08
# HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
# TYPE node_network_transmit_bytes_total counter
node_network_transmit_bytes_total{device="cali60e575ce8db"} 5.2804647e+07
node_network_transmit_bytes_total{device="cali85a56337055"} 5.4239763e+07
node_network_transmit_bytes_total{device="cali8c459f6702e"} 1.115901473e+09
node_network_transmit_bytes_total{device="eth0"} 1.02987658518e+11
node_network_transmit_bytes_total{device="lo"} 6.0342387372e+10
node_network_transmit_bytes_total{device="tunl0"} 8.407628e+06
node_network_transmit_bytes_total{device="wlan0"} 0
# HELP node_network_transmit_carrier_total Network device statistic transmit_carrier.
# TYPE node_network_transmit_carrier_total counter
node_network_transmit_carrier_total{device="cali60e575ce8db"} 0
node_network_transmit_carrier_total{device="cali85a56337055"} 0
node_network_transmit_carrier_total{device="cali8c459f6702e"} 0
node_network_transmit_carrier_total{device="eth0"} 0
node_network_transmit_carrier_total{device="lo"} 0
node_network_transmit_carrier_total{device="tunl0"} 0
node_network_transmit_carrier_total{device="wlan0"} 0
# HELP node_network_transmit_colls_total Network device statistic transmit_colls.
# TYPE node_network_transmit_colls_total counter
node_network_transmit_colls_total{device="cali60e575ce8db"} 0
node_network_transmit_colls_total{device="cali85a56337055"} 0
node_network_transmit_colls_total{device="cali8c459f6702e"} 0
node_network_transmit_colls_total{device="eth0"} 0
node_network_transmit_colls_total{device="lo"} 0
node_network_transmit_colls_total{device="tunl0"} 0
node_network_transmit_colls_total{device="wlan0"} 0
# HELP node_network_transmit_compressed_total Network device statistic transmit_compressed.
# TYPE node_network_transmit_compressed_total counter
node_network_transmit_compressed_total{device="cali60e575ce8db"} 0
node_network_transmit_compressed_total{device="cali85a56337055"} 0
node_network_transmit_compressed_total{device="cali8c459f6702e"} 0
node_network_transmit_compressed_total{device="eth0"} 0
node_network_transmit_compressed_total{device="lo"} 0
node_network_transmit_compressed_total{device="tunl0"} 0
node_network_transmit_compressed_total{device="wlan0"} 0
# HELP node_network_transmit_drop_total Network device statistic transmit_drop.
# TYPE node_network_transmit_drop_total counter
node_network_transmit_drop_total{device="cali60e575ce8db"} 0
node_network_transmit_drop_total{device="cali85a56337055"} 0
node_network_transmit_drop_total{device="cali8c459f6702e"} 0
node_network_transmit_drop_total{device="eth0"} 0
node_network_transmit_drop_total{device="lo"} 0
node_network_transmit_drop_total{device="tunl0"} 0
node_network_transmit_drop_total{device="wlan0"} 0
# HELP node_network_transmit_errs_total Network device statistic transmit_errs.
# TYPE node_network_transmit_errs_total counter
node_network_transmit_errs_total{device="cali60e575ce8db"} 0
node_network_transmit_errs_total{device="cali85a56337055"} 0
node_network_transmit_errs_total{device="cali8c459f6702e"} 0
node_network_transmit_errs_total{device="eth0"} 0
node_network_transmit_errs_total{device="lo"} 0
node_network_transmit_errs_total{device="tunl0"} 0
node_network_transmit_errs_total{device="wlan0"} 0
# HELP node_network_transmit_fifo_total Network device statistic transmit_fifo.
# TYPE node_network_transmit_fifo_total counter
node_network_transmit_fifo_total{device="cali60e575ce8db"} 0
node_network_transmit_fifo_total{device="cali85a56337055"} 0
node_network_transmit_fifo_total{device="cali8c459f6702e"} 0
node_network_transmit_fifo_total{device="eth0"} 0
node_network_transmit_fifo_total{device="lo"} 0
node_network_transmit_fifo_total{device="tunl0"} 0
node_network_transmit_fifo_total{device="wlan0"} 0
# HELP node_network_transmit_packets_total Network device statistic transmit_packets.
# TYPE node_network_transmit_packets_total counter
node_network_transmit_packets_total{device="cali60e575ce8db"} 560412
node_network_transmit_packets_total{device="cali85a56337055"} 582260
node_network_transmit_packets_total{device="cali8c459f6702e"} 733054
node_network_transmit_packets_total{device="eth0"} 3.54151866e+08
node_network_transmit_packets_total{device="lo"} 2.57029971e+08
node_network_transmit_packets_total{device="tunl0"} 39617
node_network_transmit_packets_total{device="wlan0"} 0
# HELP node_network_transmit_queue_length Network device property: transmit_queue_length
# TYPE node_network_transmit_queue_length gauge
node_network_transmit_queue_length{device="cali60e575ce8db"} 0
node_network_transmit_queue_length{device="cali85a56337055"} 0
node_network_transmit_queue_length{device="cali8c459f6702e"} 0
node_network_transmit_queue_length{device="eth0"} 1000
node_network_transmit_queue_length{device="lo"} 1000
node_network_transmit_queue_length{device="tunl0"} 1000
node_network_transmit_queue_length{device="wlan0"} 1000
# HELP node_network_up Value is 1 if operstate is 'up', 0 otherwise.
# TYPE node_network_up gauge
node_network_up{device="cali60e575ce8db"} 1
node_network_up{device="cali85a56337055"} 1
node_network_up{device="cali8c459f6702e"} 1
node_network_up{device="eth0"} 1
node_network_up{device="lo"} 0
node_network_up{device="tunl0"} 0
node_network_up{device="wlan0"} 0
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 474
# HELP node_nf_conntrack_entries_limit Maximum size of connection tracking table.
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 131072
# HELP node_nfs_connections_total Total number of NFSd TCP connections.
# TYPE node_nfs_connections_total counter
node_nfs_connections_total 0
# HELP node_nfs_packets_total Total NFSd network packets (sent+received) by protocol type.
# TYPE node_nfs_packets_total counter
node_nfs_packets_total{protocol="tcp"} 0
node_nfs_packets_total{protocol="udp"} 0
# HELP node_nfs_requests_total Number of NFS procedures invoked.
# TYPE node_nfs_requests_total counter
node_nfs_requests_total{method="Access",proto="3"} 0
node_nfs_requests_total{method="Access",proto="4"} 0
node_nfs_requests_total{method="Allocate",proto="4"} 0
node_nfs_requests_total{method="BindConnToSession",proto="4"} 0
node_nfs_requests_total{method="Clone",proto="4"} 0
node_nfs_requests_total{method="Close",proto="4"} 0
node_nfs_requests_total{method="Commit",proto="3"} 0
node_nfs_requests_total{method="Commit",proto="4"} 0
node_nfs_requests_total{method="Create",proto="2"} 0
node_nfs_requests_total{method="Create",proto="3"} 0
node_nfs_requests_total{method="Create",proto="4"} 0
node_nfs_requests_total{method="CreateSession",proto="4"} 0
node_nfs_requests_total{method="DeAllocate",proto="4"} 0
node_nfs_requests_total{method="DelegReturn",proto="4"} 0
node_nfs_requests_total{method="DestroyClientID",proto="4"} 0
node_nfs_requests_total{method="DestroySession",proto="4"} 0
node_nfs_requests_total{method="ExchangeID",proto="4"} 0
node_nfs_requests_total{method="FreeStateID",proto="4"} 0
node_nfs_requests_total{method="FsInfo",proto="3"} 0
node_nfs_requests_total{method="FsInfo",proto="4"} 0
node_nfs_requests_total{method="FsLocations",proto="4"} 0
node_nfs_requests_total{method="FsStat",proto="2"} 0
node_nfs_requests_total{method="FsStat",proto="3"} 0
node_nfs_requests_total{method="FsidPresent",proto="4"} 0
node_nfs_requests_total{method="GetACL",proto="4"} 0
node_nfs_requests_total{method="GetAttr",proto="2"} 0
node_nfs_requests_total{method="GetAttr",proto="3"} 0
node_nfs_requests_total{method="GetDeviceInfo",proto="4"} 0
node_nfs_requests_total{method="GetDeviceList",proto="4"} 0
node_nfs_requests_total{method="GetLeaseTime",proto="4"} 0
node_nfs_requests_total{method="Getattr",proto="4"} 0
node_nfs_requests_total{method="LayoutCommit",proto="4"} 0
node_nfs_requests_total{method="LayoutGet",proto="4"} 0
node_nfs_requests_total{method="LayoutReturn",proto="4"} 0
node_nfs_requests_total{method="LayoutStats",proto="4"} 0
node_nfs_requests_total{method="Link",proto="2"} 0
node_nfs_requests_total{method="Link",proto="3"} 0
node_nfs_requests_total{method="Link",proto="4"} 0
node_nfs_requests_total{method="Lock",proto="4"} 0
node_nfs_requests_total{method="Lockt",proto="4"} 0
node_nfs_requests_total{method="Locku",proto="4"} 0
node_nfs_requests_total{method="Lookup",proto="2"} 0
node_nfs_requests_total{method="Lookup",proto="3"} 0
node_nfs_requests_total{method="Lookup",proto="4"} 0
node_nfs_requests_total{method="LookupRoot",proto="4"} 0
node_nfs_requests_total{method="MkDir",proto="2"} 0
node_nfs_requests_total{method="MkDir",proto="3"} 0
node_nfs_requests_total{method="MkNod",proto="3"} 0
node_nfs_requests_total{method="Null",proto="2"} 0
node_nfs_requests_total{method="Null",proto="3"} 0
node_nfs_requests_total{method="Null",proto="4"} 0
node_nfs_requests_total{method="Open",proto="4"} 0
node_nfs_requests_total{method="OpenConfirm",proto="4"} 0
node_nfs_requests_total{method="OpenDowngrade",proto="4"} 0
node_nfs_requests_total{method="OpenNoattr",proto="4"} 0
node_nfs_requests_total{method="PathConf",proto="3"} 0
node_nfs_requests_total{method="Pathconf",proto="4"} 0
node_nfs_requests_total{method="Read",proto="2"} 0
node_nfs_requests_total{method="Read",proto="3"} 0
node_nfs_requests_total{method="Read",proto="4"} 0
node_nfs_requests_total{method="ReadDir",proto="2"} 0
node_nfs_requests_total{method="ReadDir",proto="3"} 0
node_nfs_requests_total{method="ReadDir",proto="4"} 0
node_nfs_requests_total{method="ReadDirPlus",proto="3"} 0
node_nfs_requests_total{method="ReadLink",proto="2"} 0
node_nfs_requests_total{method="ReadLink",proto="3"} 0
node_nfs_requests_total{method="ReadLink",proto="4"} 0
node_nfs_requests_total{method="ReclaimComplete",proto="4"} 0
node_nfs_requests_total{method="ReleaseLockowner",proto="4"} 0
node_nfs_requests_total{method="Remove",proto="2"} 0
node_nfs_requests_total{method="Remove",proto="3"} 0
node_nfs_requests_total{method="Remove",proto="4"} 0
node_nfs_requests_total{method="Rename",proto="2"} 0
node_nfs_requests_total{method="Rename",proto="3"} 0
node_nfs_requests_total{method="Rename",proto="4"} 0
node_nfs_requests_total{method="Renew",proto="4"} 0
node_nfs_requests_total{method="RmDir",proto="2"} 0
node_nfs_requests_total{method="RmDir",proto="3"} 0
node_nfs_requests_total{method="Root",proto="2"} 0
node_nfs_requests_total{method="Secinfo",proto="4"} 0
node_nfs_requests_total{method="SecinfoNoName",proto="4"} 0
node_nfs_requests_total{method="Seek",proto="4"} 0
node_nfs_requests_total{method="Sequence",proto="4"} 0
node_nfs_requests_total{method="ServerCaps",proto="4"} 0
node_nfs_requests_total{method="SetACL",proto="4"} 0
node_nfs_requests_total{method="SetAttr",proto="2"} 0
node_nfs_requests_total{method="SetAttr",proto="3"} 0
node_nfs_requests_total{method="SetClientID",proto="4"} 0
node_nfs_requests_total{method="SetClientIDConfirm",proto="4"} 0
node_nfs_requests_total{method="Setattr",proto="4"} 0
node_nfs_requests_total{method="StatFs",proto="4"} 0
node_nfs_requests_total{method="SymLink",proto="2"} 0
node_nfs_requests_total{method="SymLink",proto="3"} 0
node_nfs_requests_total{method="Symlink",proto="4"} 0
node_nfs_requests_total{method="TestStateID",proto="4"} 0
node_nfs_requests_total{method="WrCache",proto="2"} 0
node_nfs_requests_total{method="Write",proto="2"} 0
node_nfs_requests_total{method="Write",proto="3"} 0
node_nfs_requests_total{method="Write",proto="4"} 0
# HELP node_nfs_rpc_authentication_refreshes_total Number of RPC authentication refreshes performed.
# TYPE node_nfs_rpc_authentication_refreshes_total counter
node_nfs_rpc_authentication_refreshes_total 0
# HELP node_nfs_rpc_retransmissions_total Number of RPC transmissions performed.
# TYPE node_nfs_rpc_retransmissions_total counter
node_nfs_rpc_retransmissions_total 0
# HELP node_nfs_rpcs_total Total number of RPCs performed.
# TYPE node_nfs_rpcs_total counter
node_nfs_rpcs_total 0
# HELP node_os_info A metric with a constant '1' value labeled by build_id, id, id_like, image_id, image_version, name, pretty_name, variant, variant_id, version, version_codename, version_id.
# TYPE node_os_info gauge
node_os_info{build_id="",id="debian",id_like="",image_id="",image_version="",name="Debian GNU/Linux",pretty_name="Debian GNU/Linux 12 (bookworm)",variant="",variant_id="",version="12 (bookworm)",version_codename="bookworm",version_id="12"} 1
# HELP node_os_version Metric containing the major.minor part of the OS version.
# TYPE node_os_version gauge
node_os_version{id="debian",id_like="",name="Debian GNU/Linux"} 12
# HELP node_procs_blocked Number of processes blocked waiting for I/O to complete.
# TYPE node_procs_blocked gauge
node_procs_blocked 0
# HELP node_procs_running Number of processes in runnable state.
# TYPE node_procs_running gauge
node_procs_running 2
# HELP node_schedstat_running_seconds_total Number of seconds CPU spent running a process.
# TYPE node_schedstat_running_seconds_total counter
node_schedstat_running_seconds_total{cpu="0"} 193905.40964483
node_schedstat_running_seconds_total{cpu="1"} 201807.778053838
node_schedstat_running_seconds_total{cpu="2"} 202480.951626566
node_schedstat_running_seconds_total{cpu="3"} 199368.582085578
# HELP node_schedstat_timeslices_total Number of timeslices executed by CPU.
# TYPE node_schedstat_timeslices_total counter
node_schedstat_timeslices_total{cpu="0"} 2.671310666e+09
node_schedstat_timeslices_total{cpu="1"} 2.839935261e+09
node_schedstat_timeslices_total{cpu="2"} 2.840250945e+09
node_schedstat_timeslices_total{cpu="3"} 2.791566809e+09
# HELP node_schedstat_waiting_seconds_total Number of seconds spent by processing waiting for this CPU.
# TYPE node_schedstat_waiting_seconds_total counter
node_schedstat_waiting_seconds_total{cpu="0"} 146993.907550125
node_schedstat_waiting_seconds_total{cpu="1"} 148954.872956911
node_schedstat_waiting_seconds_total{cpu="2"} 149496.824640957
node_schedstat_waiting_seconds_total{cpu="3"} 148325.351612478
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
# TYPE node_scrape_collector_duration_seconds gauge
node_scrape_collector_duration_seconds{collector="arp"} 0.000472051
node_scrape_collector_duration_seconds{collector="bcache"} 9.7776e-05
node_scrape_collector_duration_seconds{collector="bonding"} 0.00025022
node_scrape_collector_duration_seconds{collector="btrfs"} 0.018567631
node_scrape_collector_duration_seconds{collector="conntrack"} 0.014180114
node_scrape_collector_duration_seconds{collector="cpu"} 0.004748662
node_scrape_collector_duration_seconds{collector="cpufreq"} 0.049445245
node_scrape_collector_duration_seconds{collector="diskstats"} 0.001468727
node_scrape_collector_duration_seconds{collector="dmi"} 1.093e-06
node_scrape_collector_duration_seconds{collector="edac"} 7.6574e-05
node_scrape_collector_duration_seconds{collector="entropy"} 0.000781326
node_scrape_collector_duration_seconds{collector="fibrechannel"} 3.0574e-05
node_scrape_collector_duration_seconds{collector="filefd"} 0.000214998
node_scrape_collector_duration_seconds{collector="filesystem"} 0.041031802
node_scrape_collector_duration_seconds{collector="hwmon"} 0.007842633
node_scrape_collector_duration_seconds{collector="infiniband"} 4.1777e-05
node_scrape_collector_duration_seconds{collector="ipvs"} 0.000964547
node_scrape_collector_duration_seconds{collector="loadavg"} 0.000368979
node_scrape_collector_duration_seconds{collector="mdadm"} 7.6555e-05
node_scrape_collector_duration_seconds{collector="meminfo"} 0.001052527
node_scrape_collector_duration_seconds{collector="netclass"} 0.036469213
node_scrape_collector_duration_seconds{collector="netdev"} 0.002758901
node_scrape_collector_duration_seconds{collector="netstat"} 0.002033075
node_scrape_collector_duration_seconds{collector="nfs"} 0.000542699
node_scrape_collector_duration_seconds{collector="nfsd"} 0.000331331
node_scrape_collector_duration_seconds{collector="nvme"} 0.000140017
node_scrape_collector_duration_seconds{collector="os"} 0.000326923
node_scrape_collector_duration_seconds{collector="powersupplyclass"} 0.000183962
node_scrape_collector_duration_seconds{collector="pressure"} 6.4647e-05
node_scrape_collector_duration_seconds{collector="rapl"} 0.000149461
node_scrape_collector_duration_seconds{collector="schedstat"} 0.000511218
node_scrape_collector_duration_seconds{collector="selinux"} 0.000327182
node_scrape_collector_duration_seconds{collector="sockstat"} 0.001023898
node_scrape_collector_duration_seconds{collector="softnet"} 0.000578402
node_scrape_collector_duration_seconds{collector="stat"} 0.013851062
node_scrape_collector_duration_seconds{collector="tapestats"} 0.000176499
node_scrape_collector_duration_seconds{collector="textfile"} 5.7296e-05
node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.017899137
node_scrape_collector_duration_seconds{collector="time"} 0.000422885
node_scrape_collector_duration_seconds{collector="timex"} 0.000182517
node_scrape_collector_duration_seconds{collector="udp_queues"} 0.001325488
node_scrape_collector_duration_seconds{collector="uname"} 7.0184e-05
node_scrape_collector_duration_seconds{collector="vmstat"} 0.000352664
node_scrape_collector_duration_seconds{collector="xfs"} 4.2481e-05
node_scrape_collector_duration_seconds{collector="zfs"} 0.00011237
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
# TYPE node_scrape_collector_success gauge
node_scrape_collector_success{collector="arp"} 0
node_scrape_collector_success{collector="bcache"} 1
node_scrape_collector_success{collector="bonding"} 0
node_scrape_collector_success{collector="btrfs"} 1
node_scrape_collector_success{collector="conntrack"} 0
node_scrape_collector_success{collector="cpu"} 1
node_scrape_collector_success{collector="cpufreq"} 1
node_scrape_collector_success{collector="diskstats"} 1
node_scrape_collector_success{collector="dmi"} 0
node_scrape_collector_success{collector="edac"} 1
node_scrape_collector_success{collector="entropy"} 1
node_scrape_collector_success{collector="fibrechannel"} 0
node_scrape_collector_success{collector="filefd"} 1
node_scrape_collector_success{collector="filesystem"} 1
node_scrape_collector_success{collector="hwmon"} 1
node_scrape_collector_success{collector="infiniband"} 0
node_scrape_collector_success{collector="ipvs"} 1
node_scrape_collector_success{collector="loadavg"} 1
node_scrape_collector_success{collector="mdadm"} 0
node_scrape_collector_success{collector="meminfo"} 1
node_scrape_collector_success{collector="netclass"} 1
node_scrape_collector_success{collector="netdev"} 1
node_scrape_collector_success{collector="netstat"} 1
node_scrape_collector_success{collector="nfs"} 1
node_scrape_collector_success{collector="nfsd"} 0
node_scrape_collector_success{collector="nvme"} 1
node_scrape_collector_success{collector="os"} 1
node_scrape_collector_success{collector="powersupplyclass"} 1
node_scrape_collector_success{collector="pressure"} 0
node_scrape_collector_success{collector="rapl"} 0
node_scrape_collector_success{collector="schedstat"} 1
node_scrape_collector_success{collector="selinux"} 1
node_scrape_collector_success{collector="sockstat"} 1
node_scrape_collector_success{collector="softnet"} 1
node_scrape_collector_success{collector="stat"} 1
node_scrape_collector_success{collector="tapestats"} 0
node_scrape_collector_success{collector="textfile"} 1
node_scrape_collector_success{collector="thermal_zone"} 1
node_scrape_collector_success{collector="time"} 1
node_scrape_collector_success{collector="timex"} 1
node_scrape_collector_success{collector="udp_queues"} 1
node_scrape_collector_success{collector="uname"} 1
node_scrape_collector_success{collector="vmstat"} 1
node_scrape_collector_success{collector="xfs"} 1
node_scrape_collector_success{collector="zfs"} 0
# HELP node_selinux_enabled SELinux is enabled, 1 is true, 0 is false
# TYPE node_selinux_enabled gauge
node_selinux_enabled 0
# HELP node_sockstat_FRAG6_inuse Number of FRAG6 sockets in state inuse.
# TYPE node_sockstat_FRAG6_inuse gauge
node_sockstat_FRAG6_inuse 0
# HELP node_sockstat_FRAG6_memory Number of FRAG6 sockets in state memory.
# TYPE node_sockstat_FRAG6_memory gauge
node_sockstat_FRAG6_memory 0
# HELP node_sockstat_FRAG_inuse Number of FRAG sockets in state inuse.
# TYPE node_sockstat_FRAG_inuse gauge
node_sockstat_FRAG_inuse 0
# HELP node_sockstat_FRAG_memory Number of FRAG sockets in state memory.
# TYPE node_sockstat_FRAG_memory gauge
node_sockstat_FRAG_memory 0
# HELP node_sockstat_RAW6_inuse Number of RAW6 sockets in state inuse.
# TYPE node_sockstat_RAW6_inuse gauge
node_sockstat_RAW6_inuse 1
# HELP node_sockstat_RAW_inuse Number of RAW sockets in state inuse.
# TYPE node_sockstat_RAW_inuse gauge
node_sockstat_RAW_inuse 0
# HELP node_sockstat_TCP6_inuse Number of TCP6 sockets in state inuse.
# TYPE node_sockstat_TCP6_inuse gauge
node_sockstat_TCP6_inuse 44
# HELP node_sockstat_TCP_alloc Number of TCP sockets in state alloc.
# TYPE node_sockstat_TCP_alloc gauge
node_sockstat_TCP_alloc 272
# HELP node_sockstat_TCP_inuse Number of TCP sockets in state inuse.
# TYPE node_sockstat_TCP_inuse gauge
node_sockstat_TCP_inuse 211
# HELP node_sockstat_TCP_mem Number of TCP sockets in state mem.
# TYPE node_sockstat_TCP_mem gauge
node_sockstat_TCP_mem 665
# HELP node_sockstat_TCP_mem_bytes Number of TCP sockets in state mem_bytes.
# TYPE node_sockstat_TCP_mem_bytes gauge
node_sockstat_TCP_mem_bytes 2.72384e+06
# HELP node_sockstat_TCP_orphan Number of TCP sockets in state orphan.
# TYPE node_sockstat_TCP_orphan gauge
node_sockstat_TCP_orphan 0
# HELP node_sockstat_TCP_tw Number of TCP sockets in state tw.
# TYPE node_sockstat_TCP_tw gauge
node_sockstat_TCP_tw 55
# HELP node_sockstat_UDP6_inuse Number of UDP6 sockets in state inuse.
# TYPE node_sockstat_UDP6_inuse gauge
node_sockstat_UDP6_inuse 2
# HELP node_sockstat_UDPLITE6_inuse Number of UDPLITE6 sockets in state inuse.
# TYPE node_sockstat_UDPLITE6_inuse gauge
node_sockstat_UDPLITE6_inuse 0
# HELP node_sockstat_UDPLITE_inuse Number of UDPLITE sockets in state inuse.
# TYPE node_sockstat_UDPLITE_inuse gauge
node_sockstat_UDPLITE_inuse 0
# HELP node_sockstat_UDP_inuse Number of UDP sockets in state inuse.
# TYPE node_sockstat_UDP_inuse gauge
node_sockstat_UDP_inuse 3
# HELP node_sockstat_UDP_mem Number of UDP sockets in state mem.
# TYPE node_sockstat_UDP_mem gauge
node_sockstat_UDP_mem 249
# HELP node_sockstat_UDP_mem_bytes Number of UDP sockets in state mem_bytes.
# TYPE node_sockstat_UDP_mem_bytes gauge
node_sockstat_UDP_mem_bytes 1.019904e+06
# HELP node_sockstat_sockets_used Number of IPv4 sockets in use.
# TYPE node_sockstat_sockets_used gauge
node_sockstat_sockets_used 563
# HELP node_softnet_backlog_len Softnet backlog status
# TYPE node_softnet_backlog_len gauge
node_softnet_backlog_len{cpu="0"} 0
node_softnet_backlog_len{cpu="1"} 0
node_softnet_backlog_len{cpu="2"} 0
node_softnet_backlog_len{cpu="3"} 0
# HELP node_softnet_cpu_collision_total Number of collision occur while obtaining device lock while transmitting
# TYPE node_softnet_cpu_collision_total counter
node_softnet_cpu_collision_total{cpu="0"} 0
node_softnet_cpu_collision_total{cpu="1"} 0
node_softnet_cpu_collision_total{cpu="2"} 0
node_softnet_cpu_collision_total{cpu="3"} 0
# HELP node_softnet_dropped_total Number of dropped packets
# TYPE node_softnet_dropped_total counter
node_softnet_dropped_total{cpu="0"} 0
node_softnet_dropped_total{cpu="1"} 0
node_softnet_dropped_total{cpu="2"} 0
node_softnet_dropped_total{cpu="3"} 0
# HELP node_softnet_flow_limit_count_total Number of times flow limit has been reached
# TYPE node_softnet_flow_limit_count_total counter
node_softnet_flow_limit_count_total{cpu="0"} 0
node_softnet_flow_limit_count_total{cpu="1"} 0
node_softnet_flow_limit_count_total{cpu="2"} 0
node_softnet_flow_limit_count_total{cpu="3"} 0
# HELP node_softnet_processed_total Number of processed packets
# TYPE node_softnet_processed_total counter
node_softnet_processed_total{cpu="0"} 3.91430308e+08
node_softnet_processed_total{cpu="1"} 7.0427743e+07
node_softnet_processed_total{cpu="2"} 7.2377954e+07
node_softnet_processed_total{cpu="3"} 7.0743949e+07
# HELP node_softnet_received_rps_total Number of times cpu woken up received_rps
# TYPE node_softnet_received_rps_total counter
node_softnet_received_rps_total{cpu="0"} 0
node_softnet_received_rps_total{cpu="1"} 0
node_softnet_received_rps_total{cpu="2"} 0
node_softnet_received_rps_total{cpu="3"} 0
# HELP node_softnet_times_squeezed_total Number of times processing packets ran out of quota
# TYPE node_softnet_times_squeezed_total counter
node_softnet_times_squeezed_total{cpu="0"} 298183
node_softnet_times_squeezed_total{cpu="1"} 0
node_softnet_times_squeezed_total{cpu="2"} 0
node_softnet_times_squeezed_total{cpu="3"} 0
# HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise
# TYPE node_textfile_scrape_error gauge
node_textfile_scrape_error 0
# HELP node_thermal_zone_temp Zone temperature in Celsius
# TYPE node_thermal_zone_temp gauge
node_thermal_zone_temp{type="cpu-thermal",zone="0"} 28.232
# HELP node_time_clocksource_available_info Available clocksources read from '/sys/devices/system/clocksource'.
# TYPE node_time_clocksource_available_info gauge
node_time_clocksource_available_info{clocksource="arch_sys_counter",device="0"} 1
# HELP node_time_clocksource_current_info Current clocksource read from '/sys/devices/system/clocksource'.
# TYPE node_time_clocksource_current_info gauge
node_time_clocksource_current_info{clocksource="arch_sys_counter",device="0"} 1
# HELP node_time_seconds System time in seconds since epoch (1970).
# TYPE node_time_seconds gauge
node_time_seconds 1.7097658934862518e+09
# HELP node_time_zone_offset_seconds System time zone offset in seconds.
# TYPE node_time_zone_offset_seconds gauge
node_time_zone_offset_seconds{time_zone="UTC"} 0
# HELP node_timex_estimated_error_seconds Estimated error in seconds.
# TYPE node_timex_estimated_error_seconds gauge
node_timex_estimated_error_seconds 0
# HELP node_timex_frequency_adjustment_ratio Local clock frequency adjustment.
# TYPE node_timex_frequency_adjustment_ratio gauge
node_timex_frequency_adjustment_ratio 0.9999922578277588
# HELP node_timex_loop_time_constant Phase-locked loop time constant.
# TYPE node_timex_loop_time_constant gauge
node_timex_loop_time_constant 7
# HELP node_timex_maxerror_seconds Maximum error in seconds.
# TYPE node_timex_maxerror_seconds gauge
node_timex_maxerror_seconds 0.672
# HELP node_timex_offset_seconds Time offset in between local system and reference clock.
# TYPE node_timex_offset_seconds gauge
node_timex_offset_seconds -0.000593063
# HELP node_timex_pps_calibration_total Pulse per second count of calibration intervals.
# TYPE node_timex_pps_calibration_total counter
node_timex_pps_calibration_total 0
# HELP node_timex_pps_error_total Pulse per second count of calibration errors.
# TYPE node_timex_pps_error_total counter
node_timex_pps_error_total 0
# HELP node_timex_pps_frequency_hertz Pulse per second frequency.
# TYPE node_timex_pps_frequency_hertz gauge
node_timex_pps_frequency_hertz 0
# HELP node_timex_pps_jitter_seconds Pulse per second jitter.
# TYPE node_timex_pps_jitter_seconds gauge
node_timex_pps_jitter_seconds 0
# HELP node_timex_pps_jitter_total Pulse per second count of jitter limit exceeded events.
# TYPE node_timex_pps_jitter_total counter
node_timex_pps_jitter_total 0
# HELP node_timex_pps_shift_seconds Pulse per second interval duration.
# TYPE node_timex_pps_shift_seconds gauge
node_timex_pps_shift_seconds 0
# HELP node_timex_pps_stability_exceeded_total Pulse per second count of stability limit exceeded events.
# TYPE node_timex_pps_stability_exceeded_total counter
node_timex_pps_stability_exceeded_total 0
# HELP node_timex_pps_stability_hertz Pulse per second stability, average of recent frequency changes.
# TYPE node_timex_pps_stability_hertz gauge
node_timex_pps_stability_hertz 0
# HELP node_timex_status Value of the status array bits.
# TYPE node_timex_status gauge
node_timex_status 24577
# HELP node_timex_sync_status Is clock synchronized to a reliable server (1 = yes, 0 = no).
# TYPE node_timex_sync_status gauge
node_timex_sync_status 1
# HELP node_timex_tai_offset_seconds International Atomic Time (TAI) offset.
# TYPE node_timex_tai_offset_seconds gauge
node_timex_tai_offset_seconds 0
# HELP node_timex_tick_seconds Seconds between clock ticks.
# TYPE node_timex_tick_seconds gauge
node_timex_tick_seconds 0.01
# HELP node_udp_queues Number of allocated memory in the kernel for UDP datagrams in bytes.
# TYPE node_udp_queues gauge
node_udp_queues{ip="v4",queue="rx"} 0
node_udp_queues{ip="v4",queue="tx"} 0
node_udp_queues{ip="v6",queue="rx"} 0
node_udp_queues{ip="v6",queue="tx"} 0
# HELP node_uname_info Labeled system information as provided by the uname system call.
# TYPE node_uname_info gauge
node_uname_info{domainname="(none)",machine="aarch64",nodename="bettley",release="6.1.0-rpi7-rpi-v8",sysname="Linux",version="#1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24)"} 1
# HELP node_vmstat_oom_kill /proc/vmstat information field oom_kill.
# TYPE node_vmstat_oom_kill untyped
node_vmstat_oom_kill 0
# HELP node_vmstat_pgfault /proc/vmstat information field pgfault.
# TYPE node_vmstat_pgfault untyped
node_vmstat_pgfault 3.706999478e+09
# HELP node_vmstat_pgmajfault /proc/vmstat information field pgmajfault.
# TYPE node_vmstat_pgmajfault untyped
node_vmstat_pgmajfault 5791
# HELP node_vmstat_pgpgin /proc/vmstat information field pgpgin.
# TYPE node_vmstat_pgpgin untyped
node_vmstat_pgpgin 1.115617e+06
# HELP node_vmstat_pgpgout /proc/vmstat information field pgpgout.
# TYPE node_vmstat_pgpgout untyped
node_vmstat_pgpgout 2.55770725e+08
# HELP node_vmstat_pswpin /proc/vmstat information field pswpin.
# TYPE node_vmstat_pswpin untyped
node_vmstat_pswpin 0
# HELP node_vmstat_pswpout /proc/vmstat information field pswpout.
# TYPE node_vmstat_pswpout untyped
node_vmstat_pswpout 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.05
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.2292096e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.7097658257e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.269604352e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_errors_total Total number of internal errors encountered by the promhttp metric handler.
# TYPE promhttp_metric_handler_errors_total counter
promhttp_metric_handler_errors_total{cause="encoding"} 0
promhttp_metric_handler_errors_total{cause="gathering"} 0
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
So... yay?
We could shift this to a separate repository, or we could just rip it back out of the incubator and create a separate Application resource for it in this task file. We could organize it a thousand different ways. A prometheus_node_exporter repository? A prometheus repository? A monitoring repository?
Because I'm not really sure which I'd like to do, I'll just defer the decision until a later date and move on to other things.
Router BGP Configuration
Before I go too much further, I want to get load balancer services working.
With the major cloud vendors' Kubernetes offerings, creating a service of type LoadBalancer provisions a load balancer in that platform that exposes the service externally. This spares us from having to fall back on NodePort services, port-forwarding, and the like to reach our services from outside the cluster.
This functionality isn't automatically available in a homelab. Why would it be? How could it know what you want? Regardless of the complexities preventing this from Just Working™, the topic is a perennial source of irritation to the homelabber.
Fortunately, a gentleman and scholar named Dave Anderson spent (I assume) a significant amount of time and devised a system, MetalLB, to bring load balancer functionality to bare metal clusters.
With a reasonable amount of effort, we can configure a router supporting BGP and a Kubernetes cluster running MetalLB into a pretty clean network infrastructure.
Network Architecture Overview
The BGP configuration creates a sophisticated routing topology that enables dynamic load balancer allocation:
Network Segmentation
- Infrastructure CIDR: 10.4.0.0/20 (main cluster network)
- Service CIDR: 172.16.0.0/20 (Kubernetes internal services)
- Pod CIDR: 192.168.0.0/16 (container networking)
- MetalLB Pool: 10.4.11.0 - 10.4.15.254 (load balancer IP range)
BGP Autonomous System Design
- Router ASN: 64500 (the OPNsense gateway)
- Cluster ASN: 64501 (all Kubernetes nodes share this AS number)
- Peer relationship: eBGP (External BGP), since the two sides use different AS numbers
RFC 7938 recommends private AS numbers (64512-65534, per RFC 6996) for designs like this; 64500 and 64501 actually sit in the adjacent block that RFC 5398 reserves for documentation, which is fine so long as they never leak beyond the lab.
OPNsense Router Configuration
In my case, this starts with configuring my router/firewall (running OPNsense) to support BGP.
Step 1: FRR Plugin Installation
This means installing the os-frr (for "Free-Range Routing") plugin:
Free-Range Routing (FRR) is a routing software suite that provides:
- BGP-4: Border Gateway Protocol implementation
- OSPF: Open Shortest Path First for dynamic routing
- ISIS/RIP: Additional routing protocol support
- Route maps: Sophisticated traffic engineering capabilities
Step 2: Enable Global Routing
Then we enable routing:
This configuration enables:
- Kernel route injection: FRR can modify the system routing table
- Route redistribution: Between different routing protocols
- Multi-protocol support: IPv4 and IPv6 route advertisement
Step 3: BGP Configuration
Then we enable BGP. We give the router an AS number of 64500.
BGP Configuration Parameters:
- Router ID: Typically set to the router's loopback or primary interface IP (10.4.0.1)
- AS Number: 64500 (the gateway's ASN)
- Network advertisements: Routes to be advertised to BGP peers
- Redistribution: Connected routes, static routes, and other protocols
(A rough sketch of the equivalent FRR configuration appears after Step 4.)
Step 4: BGP Neighbor Configuration
Then we add each of the nodes that might run MetalLB "speakers" as neighbors. They all will share a single AS number, 64501.
Kubernetes Node BGP Peers:
# Control Plane Nodes (also run MetalLB speakers)
10.4.0.11 (bettley) - ASN 64501
10.4.0.12 (cargyll) - ASN 64501
10.4.0.13 (dalt) - ASN 64501
# Worker Nodes (potential MetalLB speakers)
10.4.0.14 (erenford) - ASN 64501
10.4.0.15 (fenn) - ASN 64501
10.4.0.16 (gardener) - ASN 64501
10.4.0.17 (harlton) - ASN 64501
10.4.0.18 (inchfield) - ASN 64501
10.4.0.19 (jast) - ASN 64501
10.4.0.20 (karstark) - ASN 64501
10.4.0.21 (lipps) - ASN 64501
10.4.1.10 (velaryon) - ASN 64501
Neighbor Configuration Details:
- Peer Type: External BGP (eBGP) due to different AS numbers
- Authentication: Can use MD5 authentication for security
- Timers: Hold time (180s) and keepalive (60s) for session management
- Route filters: Accept only specific route prefixes from the cluster
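To make the GUI settings above concrete, here's roughly what the resulting BGP stanza looks like if you dump the running FRR configuration on the router with vtysh -c "show running-config". This is a hand-written sketch based on the parameters above, not my literal config; the peer-group name is made up and most of the neighbor lines are elided.
router bgp 64500
 bgp router-id 10.4.0.1
 ! one peer-group for all Kubernetes nodes (name is illustrative)
 neighbor KUBE peer-group
 neighbor KUBE remote-as 64501
 neighbor KUBE timers 60 180
 neighbor 10.4.0.11 peer-group KUBE
 neighbor 10.4.0.12 peer-group KUBE
 neighbor 10.4.0.13 peer-group KUBE
 ! ...and so on for 10.4.0.14-10.4.0.21 and 10.4.1.10
 address-family ipv4 unicast
  redistribute connected
  neighbor KUBE activate
 exit-address-family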
BGP Route Advertisement Strategy
Router Advertisements
The OPNsense router advertises:
- Default route (0.0.0.0/0) to provide internet access
- Infrastructure networks (10.4.0.0/20) for internal cluster communication
- External services that may be hosted outside the cluster
Cluster Advertisements
MetalLB speakers advertise:
- LoadBalancer service IPs from the 10.4.11.0 - 10.4.15.254 pool
- Individual /32 routes for each allocated load balancer IP
- Equal-cost multi-path (ECMP) when multiple speakers announce the same service
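Once LoadBalancer services exist, these /32s are visible from the router; filtering the routing table down to BGP-learned routes (via the FRR plugin's diagnostics pages or the shell) shows exactly what MetalLB is announcing:
# Show only BGP-learned routes; each allocated LoadBalancer IP appears as a /32
vtysh -c "show ip route bgp"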
Route Selection and Load Balancing
BGP Path Selection
When multiple MetalLB speakers advertise the same service IP:
- Prefer shortest AS path (all speakers have same path length)
- Prefer lowest origin code (IGP over EGP over incomplete)
- Prefer lowest MED (Multi-Exit Discriminator)
- Prefer eBGP over iBGP (not applicable here)
- Prefer lowest IGP cost to BGP next-hop
- Prefer oldest route (route stability)
Router Load Balancing
OPNsense can be configured for:
- Per-packet load balancing: Maximum utilization but potential packet reordering
- Per-flow load balancing: Maintains flow affinity while distributing across paths
- Weighted load balancing: Different weights for different next-hops
Security Considerations
BGP Session Security
- MD5 Authentication: Prevents unauthorized BGP session establishment
- TTL Security: Ensures BGP packets come from directly connected neighbors
- Prefix filters: Prevent route hijacking by filtering unexpected announcements
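For MD5 authentication in particular, both ends need the same shared secret. As I understand it, the MetalLB BGPPeer resource (which we'll write out in the MetalLB section below) accepts a password field, and FRR takes a per-neighbor password statement; a minimal sketch with an obviously placeholder secret:
! FRR (router) side, inside the "router bgp 64500" stanza -- placeholder secret
 neighbor KUBE password changeme-bgp-secret
# MetalLB side: the BGPPeer spec gains a matching password (same placeholder; keep it in a Secret in real life)
spec:
  password: 'changeme-bgp-secret'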
Route Filtering
# Example prefix filter configuration (the pool spans 10.4.11.0/24 plus 10.4.12.0/22)
ip prefix-list METALLB-ROUTES seq 10 permit 10.4.11.0/24 le 32
ip prefix-list METALLB-ROUTES seq 20 permit 10.4.12.0/22 le 32
neighbor 10.4.0.11 prefix-list METALLB-ROUTES in
This ensures the router only accepts routes for load balancer IPs within the designated pool (repeat the neighbor statement for each node, or attach the filter to a peer-group).
Monitoring and Troubleshooting
BGP Session Monitoring
Key commands for BGP troubleshooting:
# View BGP summary
vtysh -c "show ip bgp summary"
# Check specific neighbor status
vtysh -c "show ip bgp neighbor 10.4.0.11"
# View advertised routes
vtysh -c "show ip bgp advertised-routes"
# Check routing table
ip route show table main
Common BGP Issues
- Session flapping: Often due to network connectivity or timer mismatches
- Route installation failures: Check routing table limits and memory
- Asymmetric routing: Verify return path routing and firewalls
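The cluster-side counterpart to the vtysh commands lives mostly in the MetalLB speaker pods. Once MetalLB is installed (it's deployed in the next section), and assuming the upstream Helm chart's default labels (yours may differ), something like the following is usually enough to see whether the speakers think their sessions are healthy:
# List the speaker pods (label selector assumes the chart's defaults)
kubectl -n metallb get pods -l app.kubernetes.io/component=speaker
# Tail the speaker logs and look for BGP session events
kubectl -n metallb logs -l app.kubernetes.io/component=speaker --tail=50 | grep -i bgp
# Sanity-check the MetalLB resources themselves
kubectl -n metallb get ipaddresspools,bgppeers,bgpadvertisements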
Integration with MetalLB
The BGP configuration on the router side enables MetalLB to:
- Establish BGP sessions with the cluster gateway
- Advertise LoadBalancer service IPs dynamically as services are created
- Withdraw routes automatically when services are deleted
- Provide redundancy through multiple speaker nodes
This creates a fully dynamic load balancing solution where:
- Services get real IP addresses from the external network
- Traffic routes optimally through the cluster
- Failover happens automatically via BGP reconvergence
- No manual network configuration required for new services
In the next section, we'll configure MetalLB to establish these BGP sessions and begin advertising load balancer routes.
MetalLB
MetalLB requires that its namespace have some extra privileges:
apiVersion: 'v1'
kind: 'Namespace'
metadata:
  name: 'metallb'
  labels:
    name: 'metallb'
    managed-by: 'argocd'
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
Its application is (perhaps surprisingly) rather simple to configure:
apiVersion: 'argoproj.io/v1alpha1'
kind: 'Application'
metadata:
  name: 'metallb'
  namespace: 'argocd'
  labels:
    name: 'metallb'
    managed-by: 'argocd'
spec:
  project: 'metallb'
  source:
    repoURL: 'https://metallb.github.io/metallb'
    chart: 'metallb'
    targetRevision: '0.14.3'
    helm:
      releaseName: 'metallb'
      valuesObject:
        rbac:
          create: true
        prometheus:
          scrapeAnnotations: true
          metricsPort: 7472
          rbacPrometheus: true
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: 'metallb'
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - Validate=true
      - CreateNamespace=false
      - PrunePropagationPolicy=foreground
      - PruneLast=true
      - RespectIgnoreDifferences=true
      - ApplyOutOfSyncOnly=true
It does require some extra resources, though. The first of these is an address pool from which to allocate IP addresses. It's important that this not overlap with a DHCP pool.
The full network is 10.4.0.0/20 and I've configured the DHCP server to only serve addresses in 10.4.0.100-254, so we have plenty of space to play with. Right now, I'll use 10.4.11.0-10.4.15.254, which gives ~1250 usable addresses. I don't think I'll use quite that many.
apiVersion: 'metallb.io/v1beta1'
kind: 'IPAddressPool'
metadata:
  name: 'primary'
  namespace: 'metallb'
spec:
  addresses:
    - 10.4.11.0 - 10.4.15.254
Then we need to configure MetalLB to act as a BGP peer:
apiVersion: 'metallb.io/v1beta2'
kind: 'BGPPeer'
metadata:
  name: 'marbrand'
  namespace: 'metallb'
spec:
  myASN: 64501
  peerASN: 64500
  peerAddress: 10.4.0.1
And advertise the IP address pool:
apiVersion: 'metallb.io/v1beta1'
kind: 'BGPAdvertisement'
metadata:
  name: 'primary'
  namespace: 'metallb'
spec:
  ipAddressPools:
    - 'primary'
That's that; we can deploy it, and soon we'll be up and running, although we can't yet test it.
Testing MetalLB
The simplest way to test MetalLB is just to deploy an application with a LoadBalancer service and see if it works. I'm a fan of httpbin and its Go port, httpbingo, so up it goes:
apiVersion: 'argoproj.io/v1alpha1'
kind: 'Application'
metadata:
name: 'httpbin'
namespace: 'argocd'
labels:
name: 'httpbin'
managed-by: 'argocd'
spec:
project: 'httpbin'
source:
repoURL: 'https://matheusfm.dev/charts'
chart: 'httpbin'
targetRevision: '0.1.1'
helm:
releaseName: 'httpbin'
valuesObject:
service:
type: 'LoadBalancer'
destination:
server: 'https://kubernetes.default.svc'
namespace: 'httpbin'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- Validate=true
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
- RespectIgnoreDifferences=true
- ApplyOutOfSyncOnly=true
Very quickly, it's synced:
We can get the IP address allocated for the load balancer with kubectl -n httpbin get svc:
And sure enough, it's allocated from the IP address pool we specified. That seems like an excellent sign!
Can we access it from a web browser running on a computer on a different network?
Yes, we can! Our load balancer system is working!
Comprehensive MetalLB Testing Suite
While the httpbin test demonstrates basic functionality, production MetalLB deployments require more thorough validation of various scenarios and failure modes.
Phase 1: Basic Functionality Tests
1.1 IP Address Allocation Verification
First, verify that MetalLB allocates IP addresses from the configured pool:
# Check the configured IP address pool
kubectl -n metallb get ipaddresspool primary -o yaml
# Deploy multiple LoadBalancer services and verify allocations
kubectl create deployment test-nginx --image=nginx
kubectl expose deployment test-nginx --type=LoadBalancer --port=80
# Verify sequential allocation from pool
kubectl get svc test-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
Expected behavior:
- IPs allocated from the 10.4.11.0 - 10.4.15.254 range
- Sequential allocation starting from the beginning of the pool
- No IP conflicts between services
1.2 Service Lifecycle Testing
Test the complete service lifecycle to ensure proper cleanup:
# Create service and note allocated IP
kubectl create deployment lifecycle-test --image=httpd
kubectl expose deployment lifecycle-test --type=LoadBalancer --port=80
ALLOCATED_IP=$(kubectl get svc lifecycle-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Verify service is accessible
curl -s http://$ALLOCATED_IP/ | grep "It works!"
# Delete service and verify IP is released
kubectl delete svc lifecycle-test
kubectl delete deployment lifecycle-test
# Verify IP is available for reallocation
kubectl create deployment reallocation-test --image=nginx
kubectl expose deployment reallocation-test --type=LoadBalancer --port=80
NEW_IP=$(kubectl get svc reallocation-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Should reuse the previously released IP
echo "Original IP: $ALLOCATED_IP, New IP: $NEW_IP"
Phase 2: BGP Advertisement Testing
2.1 BGP Session Health Verification
Monitor BGP session establishment and health:
# Check MetalLB speaker status
kubectl -n metallb get pods -l component=speaker
# Verify BGP sessions from router perspective
goldentooth command allyrion 'vtysh -c "show ip bgp summary"'
# Check BGP neighbor status for specific node
goldentooth command allyrion 'vtysh -c "show ip bgp neighbor 10.4.0.11"'
Expected BGP session states:
- Established: BGP session is active and exchanging routes
- Route count: Number of routes received from each speaker
- Session uptime: Indicates session stability
2.2 Route Advertisement Verification
Verify that LoadBalancer IPs are properly advertised via BGP:
# Create test service
kubectl create deployment bgp-test --image=nginx
kubectl expose deployment bgp-test --type=LoadBalancer --port=80
TEST_IP=$(kubectl get svc bgp-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Check route advertisement on router
goldentooth command allyrion "vtysh -c 'show ip bgp | grep $TEST_IP'"
# Verify route in kernel routing table
goldentooth command allyrion "ip route show | grep $TEST_IP"
# Test route withdrawal
kubectl delete svc bgp-test
sleep 30
# Verify route is withdrawn
goldentooth command allyrion "vtysh -c 'show ip bgp | grep $TEST_IP' || echo 'Route withdrawn'"
Phase 3: High Availability Testing
3.1 Speaker Node Failure Simulation
Test MetalLB's behavior when speaker nodes fail:
# Identify which node is announcing a service
kubectl create deployment ha-test --image=nginx
kubectl expose deployment ha-test --type=LoadBalancer --port=80
HA_IP=$(kubectl get svc ha-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Find announcing node from BGP table
goldentooth command allyrion "vtysh -c 'show ip bgp $HA_IP'"
# Simulate node failure by stopping kubelet on announcing node
ANNOUNCING_NODE=$(kubectl get svc ha-test -o jsonpath='{.metadata.annotations.metallb\.universe\.tf/announcing-node}' 2>/dev/null || echo "bettley")
goldentooth command_root $ANNOUNCING_NODE 'systemctl stop kubelet'
# Wait for BGP reconvergence (typically 30-180 seconds)
sleep 60
# Verify service is still accessible (new node should announce)
curl -s http://$HA_IP/ | grep "Welcome to nginx"
# Check new announcing node
goldentooth command allyrion "vtysh -c 'show ip bgp $HA_IP'"
# Restore failed node
goldentooth command_root $ANNOUNCING_NODE 'systemctl start kubelet'
3.2 Split-Brain Prevention Testing
Verify that MetalLB prevents split-brain scenarios where multiple nodes announce the same service:
# Deploy service with specific node selector to control placement
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
name: split-brain-test
annotations:
metallb.universe.tf/allow-shared-ip: "split-brain-test"
spec:
type: LoadBalancer
selector:
app: split-brain-test
ports:
- port: 80
targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: split-brain-test
spec:
replicas: 2
selector:
matchLabels:
app: split-brain-test
template:
metadata:
labels:
app: split-brain-test
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
EOF
# Monitor BGP announcements for the service IP
SPLIT_IP=$(kubectl get svc split-brain-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
goldentooth command allyrion "vtysh -c 'show ip bgp $SPLIT_IP detail'"
# Should see only one announcement path, not multiple conflicting paths
Phase 4: Performance and Scale Testing
4.1 IP Pool Exhaustion Testing
Test behavior when IP address pool is exhausted:
# Calculate available IPs in pool (10.4.11.0 - 10.4.15.254 = ~1250 IPs)
# Deploy services until pool exhaustion
for i in {1..10}; do
kubectl create deployment scale-test-$i --image=nginx
kubectl expose deployment scale-test-$i --type=LoadBalancer --port=80
echo "Created service $i"
sleep 5
done
# Check for services stuck in Pending state
kubectl get svc | grep Pending
# Verify MetalLB events for pool exhaustion
kubectl -n metallb get events --sort-by='.lastTimestamp'
4.2 BGP Convergence Time Measurement
Measure BGP convergence time under various scenarios:
# Create test service and measure initial advertisement time
start_time=$(date +%s)
kubectl create deployment convergence-test --image=nginx
kubectl expose deployment convergence-test --type=LoadBalancer --port=80
# Wait for IP allocation
while [ -z "$(kubectl get svc convergence-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null)" ]; do
sleep 1
done
CONV_IP=$(kubectl get svc convergence-test -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "IP allocated: $CONV_IP"
# Wait for BGP advertisement
while ! goldentooth command allyrion "ip route show | grep $CONV_IP" >/dev/null 2>&1; do
sleep 1
done
end_time=$(date +%s)
convergence_time=$((end_time - start_time))
echo "BGP convergence time: ${convergence_time} seconds"
Phase 5: Integration Testing
5.1 ExternalDNS Integration
Test automatic DNS record creation for LoadBalancer services:
# Deploy service with DNS annotation
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
name: dns-integration-test
annotations:
external-dns.alpha.kubernetes.io/hostname: test.goldentooth.net
spec:
type: LoadBalancer
selector:
app: dns-test
ports:
- port: 80
targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: dns-test
spec:
replicas: 1
selector:
matchLabels:
app: dns-test
template:
metadata:
labels:
app: dns-test
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
EOF
# Wait for DNS propagation
sleep 60
# Test DNS resolution
nslookup test.goldentooth.net
# Test HTTP access via DNS name
curl -s http://test.goldentooth.net/ | grep "Welcome to nginx"
5.2 TLS Certificate Integration
Test automatic TLS certificate provisioning for LoadBalancer services:
# Deploy service with cert-manager annotations
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
name: tls-integration-test
annotations:
external-dns.alpha.kubernetes.io/hostname: tls-test.goldentooth.net
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
type: LoadBalancer
selector:
app: tls-test
ports:
- port: 443
targetPort: 443
name: https
- port: 80
targetPort: 80
name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: tls-test-ingress
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
tls:
- hosts:
- tls-test.goldentooth.net
secretName: tls-test-cert
rules:
- host: tls-test.goldentooth.net
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: tls-integration-test
port:
number: 80
EOF
# Wait for certificate provisioning
kubectl wait --for=condition=Ready certificate/tls-test-cert --timeout=300s
# Test HTTPS access
curl -s https://tls-test.goldentooth.net/ | grep "Welcome to nginx"
Phase 6: Troubleshooting and Monitoring
6.1 MetalLB Component Health
Monitor MetalLB component health and logs:
# Check MetalLB controller status
kubectl -n metallb get pods -l component=controller
kubectl -n metallb logs -l component=controller --tail=50
# Check MetalLB speaker status on each node
kubectl -n metallb get pods -l component=speaker -o wide
kubectl -n metallb logs -l component=speaker --tail=50
# Check MetalLB configuration
kubectl -n metallb get ipaddresspool,bgppeer,bgpadvertisement -o wide
6.2 BGP Session Troubleshooting
Debug BGP session issues:
# Check BGP session state
goldentooth command allyrion 'vtysh -c "show ip bgp summary"'
# Detailed neighbor analysis
for node_ip in 10.4.0.11 10.4.0.12 10.4.0.13; do
echo "=== BGP Neighbor $node_ip ==="
goldentooth command allyrion "vtysh -c 'show ip bgp neighbor $node_ip'"
done
# Check for BGP route-map and prefix-list configurations
goldentooth command allyrion 'vtysh -c "show ip bgp route-map"'
goldentooth command allyrion 'vtysh -c "show ip prefix-list"'
# Monitor BGP route changes in real-time
goldentooth command allyrion 'vtysh -c "debug bgp events"'
6.3 Network Connectivity Testing
Comprehensive network path testing:
# Test connectivity from external networks
for test_ip in $(kubectl get svc -A -o jsonpath='{.items[?(@.spec.type=="LoadBalancer")].status.loadBalancer.ingress[0].ip}'); do
echo "Testing connectivity to $test_ip"
# Test from router
goldentooth command allyrion "ping -c 3 $test_ip"
# Test HTTP connectivity
goldentooth command allyrion "curl -s -o /dev/null -w '%{http_code}' http://$test_ip/ || echo 'Connection failed'"
# Test from external network (if possible)
# ping -c 3 $test_ip
done
# Test internal cluster connectivity
kubectl run network-test --image=busybox --rm -it --restart=Never -- /bin/sh
# From within the pod:
# wget -qO- http://test-service.default.svc.cluster.local/
This comprehensive testing suite ensures MetalLB is functioning correctly across all operational scenarios, from basic IP allocation to complex failure recovery and integration testing. Each test phase builds confidence in the load balancer implementation and helps identify potential issues before they impact production workloads.
Refactoring Argo CD
We're only a few projects in, and using Ansible to install our Argo CD applications seems a bit weak. It's not very GitOps-y to run a Bash command that runs an Ansible playbook that kubectls some manifests into our Kubernetes cluster.
In fact, the less we mess with Argo CD itself, the better. Eventually, we'll be able to create a repository on GitHub and see resources appear within our Kubernetes cluster without having to touch Argo CD at all!
We'll do this by using the power of ApplicationSet resources.
First, we'll create a secret to hold a GitHub token. This part is optional, but it'll allow us to use the API more.
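A minimal sketch of that secret, assuming a classic GitHub personal access token (the secret name and key match what the ApplicationSet below references; the token value is obviously a placeholder):
kubectl -n argocd create secret generic github-token \
  --from-literal=token=ghp_xxxxxxxxxxxxxxxxxxxx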
Second, we'll create an AppProject to encompass these applications. It'll have pretty broad permissions at first, though I'll try and tighten them up a bit.
apiVersion: 'argoproj.io/v1alpha1'
kind: 'AppProject'
metadata:
name: 'gitops-repo'
namespace: 'argocd'
finalizers:
- 'resources-finalizer.argocd.argoproj.io'
spec:
description: 'Goldentooth GitOps-Repo project'
sourceRepos:
- '*'
destinations:
- namespace: '!kube-system'
server: '*'
- namespace: '*'
server: '*'
clusterResourceWhitelist:
- group: '*'
kind: '*'
Then an ApplicationSet:
apiVersion: 'argoproj.io/v1alpha1'
kind: 'ApplicationSet'
metadata:
name: 'gitops-repo'
namespace: 'argocd'
spec:
generators:
- scmProvider:
github:
organization: 'goldentooth'
tokenRef:
secretName: 'github-token'
key: 'token'
filters:
- labelMatch: 'gitops-repo'
template:
goTemplate: true
goTemplateOptions: ["missingkey=error"]
metadata:
# Prefix name with `gitops-repo-`.
# This allows us to define the `Application` manifest within the repo and
# have significantly greater flexibility, at the cost of an additional
# application in the Argo CD UI.
name: 'gitops-repo-{{ .repository }}'
spec:
source:
repoURL: '{{ .url }}'
targetRevision: '{{ .branch }}'
path: './'
project: 'gitops-repo'
destination:
server: https://kubernetes.default.svc
namespace: '{{ .repository }}'
The idea is that I'll create a repository and give it a topic of gitops-repo. This will be matched by the labelMatch filter, and then Argo CD will deploy whatever manifests it finds there.
MetalLB is the natural place to start.
We don't actually have to do that much to get this working:
- Create a new repository, metallb.
- Add a Chart.yaml file with some boilerplate (a minimal sketch follows below).
- Add the manifests to a templates/ directory.
- Add a values.yaml file with values to substitute into the manifests.
- As mentioned above, edit the repo to give it the gitops-repo topic.
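For reference, the Chart.yaml boilerplate can be as small as this; the description and version here are illustrative rather than the repository's actual contents:
apiVersion: v2
name: metallb
description: MetalLB configuration resources for the goldentooth cluster
type: application
version: 0.0.1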
Within a few minutes, Argo CD will notice the changes and deploy a gitops-repo-metallb application:
If we click into it, we'll see the resources deployed by the manifests within the repository:
So we see the resources we created previously for the BGPPeer, IPAddressPool, and BGPAdvertisement. We also see an Application, metallb, which we can also see in the general Applications overview in Argo CD:
Clicking into it, we'll see all of the resources deployed by the metallb Helm chart we referenced.
A quick test to verify that our httpbin application is still assigned a working load balancer, and we can declare victory!
While I'm here, I might as well shift httpbin and prometheus-node-exporter as well...
Giving Argo CD a Load Balancer
All this time, the Argo CD server has been operating with a ClusterIP service, and I've been manually port forwarding it via kubectl to be able to show all of these beautiful screenshots of the web UI.
That's annoying, and we don't have to do it anymore. Fortunately, it's very easy to change this now; all we need to do is modify the Helm release values slightly: change server.service.type from 'ClusterIP' to 'LoadBalancer' and redeploy. A few minutes later, we can access Argo CD via http://10.4.11.1, no port forwarding required.
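For the curious, the relevant values change is tiny; a sketch, assuming the argo-cd Helm chart's server.service.type value:
server:
  service:
    type: 'LoadBalancer'  # was 'ClusterIP'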
ExternalDNS
The workflow for accessing our LoadBalancer services ain't great.
If we deploy a new application, we need to run kubectl -n <namespace> get svc and read through a list to determine the IP address on which it's exposed. And that's not going to be stable; there's nothing at all guaranteeing that Argo CD will always be available at http://10.4.11.1.
Enter ExternalDNS. The idea is that we annotate our services with external-dns.alpha.kubernetes.io/hostname: "argocd.my-cluster.my-domain.com" and a DNS record will be created pointing to the actual IP address of the LoadBalancer service.
This is comparatively straightforward to configure if you host your DNS in one of the supported services. I host mine via AWS Route53, which is supported.
The complication is that we don't yet have a great way of managing secrets, so there's a manual step here that I find unpleasant, but we'll cross that bridge when we get to it.
Architecture Overview
ExternalDNS creates a bridge between Kubernetes services and external DNS providers, enabling automatic DNS record management:
DNS Infrastructure
- Primary Domain: goldentooth.net, managed in AWS Route53
- Zone ID: Z0736727S7ZH91VKK44A (defined in Terraform)
- Cluster Subdomain: Services automatically get <service>.goldentooth.net
- TTL Configuration: Default 60 seconds for rapid updates during development
Integration Points
- MetalLB: Provides LoadBalancer IPs from the 10.4.11.0/24 pool
- Route53: AWS DNS service for public domain management
- Argo CD: GitOps deployment and lifecycle management
- Terraform: Infrastructure-as-code for Route53 zone and ACM certificates
Helm Chart Configuration
Because of work we've done previously with Argo CD, we can just create a new repository to deploy ExternalDNS within our cluster.
The ExternalDNS deployment is managed through a custom Helm chart with comprehensive configuration:
Chart Structure (Chart.yaml)
apiVersion: v2
name: external-dns
description: ExternalDNS for automatic DNS record management
type: application
version: 0.0.1
appVersion: "v0.14.2"
Values Configuration (values.yaml)
metadata:
  namespace: external-dns
  name: external-dns
  projectName: gitops-repo
spec:
  domainFilter: goldentooth.net
  version: v0.14.2
This configuration provides:
- Namespace isolation: Dedicated external-dns namespace
- GitOps integration: Part of the gitops-repo project for Argo CD
- Domain scoping: Only manages records for goldentooth.net
- Version pinning: Uses ExternalDNS v0.14.2 for stability
Deployment Architecture
Core Deployment Configuration
This has the following manifests:
Deployment: The deployment has several interesting features:
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-dns
namespace: external-dns
spec:
selector:
matchLabels:
app: external-dns
template:
metadata:
labels:
app: external-dns
spec:
containers:
- name: external-dns
image: registry.k8s.io/external-dns/external-dns:v0.14.2
args:
- --source=service
- --domain-filter=goldentooth.net
- --provider=aws
- --aws-zone-type=public
- --registry=txt
- --txt-owner-id=external-dns-external-dns
- --log-level=debug
- --aws-region=us-east-1
env:
- name: AWS_SHARED_CREDENTIALS_FILE
value: /.aws/credentials
volumeMounts:
- name: aws-credentials
mountPath: /.aws
readOnly: true
volumes:
- name: aws-credentials
secret:
secretName: external-dns
Key Configuration Parameters:
- Provider: aws for Route53 integration
- Sources: service (monitors Kubernetes LoadBalancer services)
- Domain Filter: goldentooth.net (restricts DNS management scope)
- AWS Zone Type: public (only manages public DNS records)
- Registry: txt (uses TXT records for ownership tracking)
- Owner ID: external-dns-external-dns (namespace-app format)
- Region: us-east-1 (AWS region for Route53 operations)
AWS Credentials Management
Secret Configuration:
apiVersion: v1
kind: Secret
metadata:
  name: external-dns
  namespace: external-dns
type: Opaque
data:
  credentials: |
    [default]
    aws_access_key_id = {{ secret_vault.aws.access_key_id | b64encode }}
    aws_secret_access_key = {{ secret_vault.aws.secret_access_key | b64encode }}
This setup:
- Secure storage: AWS credentials stored in Ansible vault
- Minimal permissions: IAM user with only Route53 zone modification rights
- File-based auth: Uses AWS credentials file format for compatibility
- Namespace isolation: Secret accessible only within the external-dns namespace
RBAC Configuration
ServiceAccount: Just adds a service account for ExternalDNS.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: external-dns
ClusterRole: Describes an ability to observe changes in services.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods", "nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["extensions", "networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "watch", "list"]
ClusterRoleBinding: Binds the above cluster role and ExternalDNS.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
  - kind: ServiceAccount
    name: external-dns
    namespace: external-dns
Permission Scope:
- Read-only access: ExternalDNS cannot modify Kubernetes resources
- Cluster-wide monitoring: Can watch services across all namespaces
- Resource types: Services, endpoints, pods, nodes, and ingresses
- Security principle: Least privilege for DNS management operations
Service Annotation Patterns
Basic DNS Record Creation
Services use annotations to trigger DNS record creation:
apiVersion: v1
kind: Service
metadata:
name: httpbin
namespace: httpbin
annotations:
external-dns.alpha.kubernetes.io/hostname: httpbin.goldentooth.net
external-dns.alpha.kubernetes.io/ttl: "60"
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 8080
selector:
app: httpbin
Annotation Functions:
- Hostname: external-dns.alpha.kubernetes.io/hostname specifies the FQDN
- TTL: external-dns.alpha.kubernetes.io/ttl sets the DNS record time-to-live
- Automatic A record: Points to the MetalLB-allocated LoadBalancer IP
- Automatic TXT record: Ownership tracking with txt-owner-id
Advanced Annotation Options
annotations:
# Multiple hostnames for the same service
external-dns.alpha.kubernetes.io/hostname: "app.goldentooth.net,api.goldentooth.net"
# Custom TTL for caching strategy
external-dns.alpha.kubernetes.io/ttl: "300"
# AWS-specific: Route53 weighted routing
external-dns.alpha.kubernetes.io/aws-weight: "100"
# AWS-specific: Health check configuration
external-dns.alpha.kubernetes.io/aws-health-check-id: "12345678-1234-1234-1234-123456789012"
DNS Record Lifecycle Management
Record Creation Process
- Service Creation: LoadBalancer service deployed with ExternalDNS annotations
- IP Allocation: MetalLB assigns an IP from the configured pool (10.4.11.0/24)
- Service Discovery: ExternalDNS watches the Kubernetes API for service changes
- DNS Creation: Creates A record pointing to LoadBalancer IP
- Ownership Tracking: Creates TXT record for ownership verification
Record Cleanup Process
- Service Deletion: LoadBalancer service removed from cluster
- Change Detection: ExternalDNS detects service removal event
- Ownership Verification: Checks TXT record ownership before deletion
- DNS Cleanup: Removes both A and TXT records from Route53
- IP Release: MetalLB returns IP to available pool
TXT Record Ownership System
ExternalDNS uses TXT records for safe multi-cluster DNS management:
# Example TXT record for ownership tracking
dig TXT httpbin.goldentooth.net
# Response includes:
# httpbin.goldentooth.net. 60 IN TXT "heritage=external-dns,external-dns/owner=external-dns-external-dns"
This prevents:
- Record conflicts: Multiple ExternalDNS instances managing same domain
- Accidental deletion: Only owner can modify/delete records
- Split-brain scenarios: Clear ownership prevents conflicting updates
Integration with GitOps
Argo CD Application Configuration
ExternalDNS is deployed via GitOps using the ApplicationSet pattern:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: gitops-repo
namespace: argocd
spec:
generators:
- scmProvider:
github:
organization: goldentooth
allBranches: false
labelSelector:
matchLabels:
gitops-repo: "true"
template:
metadata:
name: '{{repository}}'
spec:
project: gitops-repo
source:
repoURL: '{{url}}'
targetRevision: '{{branch}}'
path: .
destination:
server: https://kubernetes.default.svc
namespace: '{{repository}}'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
This provides:
- Automatic deployment: Changes to external-dns repository trigger redeployment
- Namespace creation: Automatically creates the external-dns namespace
- Self-healing: Argo CD corrects configuration drift
- Pruning: Removes resources deleted from Git repository
Repository Structure
external-dns/
├── Chart.yaml # Helm chart metadata
├── values.yaml # Configuration values
└── templates/
├── Deployment.yaml # ExternalDNS deployment
├── ServiceAccount.yaml
├── ClusterRole.yaml
├── ClusterRoleBinding.yaml
└── Secret.yaml # AWS credentials (Ansible-templated)
Monitoring and Troubleshooting
Health Verification
# Check ExternalDNS pod status
kubectl -n external-dns get pods
# Monitor ExternalDNS logs
kubectl -n external-dns logs -l app=external-dns --tail=50
# Verify AWS credentials
kubectl -n external-dns exec deployment/external-dns -- cat /.aws/credentials
# Check service discovery
kubectl -n external-dns logs deployment/external-dns | grep "Creating record"
DNS Record Validation
# Verify A record creation
dig A httpbin.goldentooth.net
# Check TXT record ownership
dig TXT httpbin.goldentooth.net
# Validate Route53 changes
aws route53 list-resource-record-sets --hosted-zone-id Z0736727S7ZH91VKK44A | jq '.ResourceRecordSets[] | select(.Name | contains("httpbin"))'
Common Issues and Solutions
Issue: DNS records not created
- Check: Service has type: LoadBalancer and a LoadBalancer IP is assigned
- Verify: ExternalDNS has RBAC permissions to read services
- Debug: Check ExternalDNS logs for AWS API errors
Issue: DNS records not cleaned up
- Check: TXT record ownership matches ExternalDNS txt-owner-id
- Verify: AWS credentials have Route53 delete permissions
- Debug: Monitor ExternalDNS logs during service deletion
Issue: Multiple DNS entries for same service
- Check: Only one ExternalDNS instance should manage each domain
- Verify: txt-owner-id is unique across clusters
- Fix: Use different owner IDs for different environments
Integration Examples
Argo CD Access
A few minutes after pushing changes to the repository, we can reach Argo CD via https://argocd.goldentooth.net/.
Service Configuration:
apiVersion: v1
kind: Service
metadata:
name: argocd-server
namespace: argocd
annotations:
external-dns.alpha.kubernetes.io/hostname: argocd.goldentooth.net
external-dns.alpha.kubernetes.io/ttl: "60"
spec:
type: LoadBalancer
ports:
- port: 443
targetPort: 8080
protocol: TCP
name: https
selector:
app.kubernetes.io/component: server
app.kubernetes.io/name: argocd-server
This automatically creates:
- A record: argocd.goldentooth.net → 10.4.11.1 (MetalLB-assigned IP)
- TXT record: Ownership tracking for safe management
- 60-second TTL: Rapid DNS propagation for development workflows
The combination of MetalLB for LoadBalancer IP allocation and ExternalDNS for automatic DNS management creates a seamless experience where services become accessible via friendly DNS names without manual intervention, enabling true infrastructure-as-code patterns for both networking and DNS.
Killing the Incubator
At this point, given the ease of spinning up new applications with the gitops-repo ApplicationSet, there's really not much benefit to the Incubator app-of-apps repo.
I'd also added a way of easily spinning up generic projects, but I don't think that's necessary either. The ApplicationSet approach is really pretty powerful 🙂
Welcome Back
So, uh, it's been a while. Things got busy and I didn't really touch the cluster for a while, and now I'm interested in it again and of course have completely forgotten everything about it.
I also ditched my OPNsense firewall because I felt it was probably costing too much power and replaced it with a simpler Unifi device, which is great, but I just realized that I now have to reconfigure MetalLB to use Layer 2 instead of BGP. I probably should've used Layer 2 from the beginning, but I thought BGP would make me look a little cooler. So no load balancer integration is working right now on the cluster, which means I can't easily check in on ArgoCD. But that's fine; that's not really my highest priority.
Also, I have some new interests; I've gotten into HPC and MLOps, and some of the people I'm interested in working with use Nomad, which I've used for a couple of throwaway play projects but never on an ongoing basis. So I'm going to set up Slurm and Nomad and probably some other things. Should be fun and teach me a good amount. Of course, that's moving away from Kubernetes, but I figure I'll keep the name of this blog the same because frankly I just don't have any interest in renaming it.
First, though, I need to make sure the cluster itself is up to snuff.
Now, even I remember that I have a little Bash tool to ease administering the cluster. And because I know me, it has online help:
$ goldentooth
Usage: goldentooth <subcommand> [arguments...]
Subcommands:
autocomplete Enable bash autocompletion.
install Install Ansible dependencies.
lint Lint all roles.
ping Ping all hosts.
uptime Get uptime for all hosts.
command Run an arbitrary command on all hosts.
edit_vault Edit the vault.
ansible_playbook Run a specified Ansible playbook.
usage Display usage information.
bootstrap_k8s Bootstrap Kubernetes cluster with kubeadm.
cleanup Perform various cleanup tasks.
configure_cluster Configure the hosts in the cluster.
install_argocd Install Argo CD on Kubernetes cluster.
install_argocd_apps Install Argo CD applications.
install_helm Install Helm on Kubernetes cluster.
install_k8s_packages Install Kubernetes packages.
reset_k8s Reset Kubernetes cluster with kubeadm.
setup_load_balancer Setup the load balancer.
shutdown Cleanly shut down the hosts in the cluster.
uninstall_k8s_packages Uninstall Kubernetes packages.
so I can ping all of the nodes:
$ goldentooth ping
allyrion | SUCCESS => {
"changed": false,
"ping": "pong"
}
gardener | SUCCESS => {
"changed": false,
"ping": "pong"
}
inchfield | SUCCESS => {
"changed": false,
"ping": "pong"
}
cargyll | SUCCESS => {
"changed": false,
"ping": "pong"
}
erenford | SUCCESS => {
"changed": false,
"ping": "pong"
}
dalt | SUCCESS => {
"changed": false,
"ping": "pong"
}
bettley | SUCCESS => {
"changed": false,
"ping": "pong"
}
jast | SUCCESS => {
"changed": false,
"ping": "pong"
}
harlton | SUCCESS => {
"changed": false,
"ping": "pong"
}
fenn | SUCCESS => {
"changed": false,
"ping": "pong"
}
and... yes, that's all of them. Okay, that's a good sign.
And then I can get their uptime:
$ goldentooth uptime
gardener | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:48, 0 user, load average: 0.13, 0.17, 0.14
allyrion | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:49, 0 user, load average: 0.10, 0.06, 0.01
inchfield | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:47, 0 user, load average: 0.25, 0.59, 0.60
erenford | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:48, 0 user, load average: 0.08, 0.15, 0.12
jast | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:47, 0 user, load average: 0.11, 0.19, 0.27
dalt | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:49, 0 user, load average: 0.84, 0.64, 0.59
fenn | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:48, 0 user, load average: 0.27, 0.34, 0.23
harlton | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:48, 0 user, load average: 0.27, 0.14, 0.20
bettley | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:49, 0 user, load average: 0.41, 0.49, 0.81
cargyll | CHANGED | rc=0 >>
19:26:23 up 17 days, 4:49, 0 user, load average: 0.26, 0.42, 0.64
17 days, which is about when I set up the new router and had to reorganize a lot of my network. Seems legit. So it looks like the power supplies are still fine. When I first set up the cluster, I think there was a flaky USB cable on one of the Pis, so it would occasionally drop off. I'd prefer to control my chaos engineering, not have it arise spontaneously from my poor QC, thank you very much.
My first node just runs HAProxy (currently) and is the simplest, so I'm going to check and see what needs to be updated. Nobody cares about apt stuff, so I'll skip the details.
TL;DR: It wasn't that much, really, though it does appear that I had some files in /etc/modprobe.d that should've been in /etc/modules-load.d. I blame... someone else.
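For context, a typical kubeadm node wants its container-networking modules declared in /etc/modules-load.d so they load at boot; a sketch (the filename is arbitrary):
# /etc/modules-load.d/k8s.conf
overlay
br_netfilter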
So I'll update all of the nodes, hope they rejoin the cluster when they reboot, and in the next entry I'll try to update Kubernetes...
NFS Exports
Just kidding, I'm going to set up a USB thumb drive and NFS exports on Allyrion (my load balancer node).
The thumb drive is just a SanDisk 64GB. Should be enough to do some fun stuff. fdisk it (hey, I remember the commands!), mkfs.ext4 it, get the UUID, add it to /etc/fstab (not "f-stab", "fs-tab"), and we have a bright shiny new volume.
NFS Server Implementation
NFS isn't hard to set up, but I'm going to use Jeff's ansible-role-nfs for consistency and maintainability.
The implementation consists of two main components:
Server Configuration
The NFS server setup is managed through the setup_nfs_exports.yaml playbook, which performs these operations:
- Install NFS utilities on all nodes:
- name: 'Install NFS utilities.'
  hosts: 'all'
  remote_user: 'root'
  tasks:
    - name: 'Ensure NFS utilities are installed.'
      ansible.builtin.apt:
        name:
          - nfs-common
        state: present
- Configure NFS server on allyrion:
- name: 'Setup NFS exports.'
  hosts: 'nfs'
  remote_user: 'root'
  roles:
    - { role: 'geerlingguy.nfs' }
Export Configuration
The NFS export is configured through host variables in inventory/host_vars/allyrion.yaml:
nfs_exports:
- "/mnt/usb1 *(rw,sync,no_root_squash,no_subtree_check)"
This export configuration provides:
- Path: /mnt/usb1 - The USB thumb drive mount point
- Access: * - Allow access from any host within the cluster network
- Permissions: rw - Read-write access for all clients
- Sync Policy: sync - Synchronous writes (safer but slower than async)
- Root Mapping: no_root_squash - Allow root users from clients to maintain root privileges
- Performance: no_subtree_check - Disable subtree checking for better performance
Network Integration
The NFS server integrates with the cluster's network architecture:
Server Information:
- Host: allyrion (10.4.0.10)
- Role: Dual-purpose load balancer and NFS server
- Network: Infrastructure CIDR 10.4.0.0/20
Global NFS Configuration (in group_vars/all/vars.yaml):
nfs:
  server: "{{ groups['nfs_server'] | first}}"
  mounts:
    primary:
      share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
      mount: '/mnt/nfs'
      safe_name: 'mnt-nfs'
      type: 'nfs'
      options: {}
This configuration:
- Dynamically determines the NFS server from the nfs_server inventory group
- Uses the server's IP address for robust connectivity
- Standardizes the client mount point as /mnt/nfs
- Provides a safe filesystem name for systemd units
Security Considerations
Internal Network Trust Model: The NFS configuration uses a simplified security model appropriate for an internal cluster:
- Open Access: The * wildcard allows any host to mount the share
- No Kerberos: Uses IP-based authentication rather than user-based
- Root Access: no_root_squash enables administrative operations from clients
- Network Boundary: Security relies on the trusted internal network (10.4.0.0/20)
Storage Infrastructure
Physical Storage:
- Device: SanDisk 64GB USB thumb drive
- Filesystem: ext4 for reliability and broad compatibility
- Mount Point: /mnt/usb1
- Persistence: Configured in /etc/fstab using UUID for reliability
Performance Characteristics:
- Capacity: 64GB available storage
- Access Pattern: Shared read-write across 13 cluster nodes
- Use Cases: Configuration files, shared data, cluster coordination
Verification and Testing
The NFS export can be verified using standard tools:
$ showmount -e allyrion
Exports list on allyrion:
/mnt/usb1 *
This output confirms:
- The export is properly configured and accessible
- The path /mnt/usb1 is being served
- Access is open to all hosts (*)
Command Line Integration
The NFS setup integrates with the goldentooth CLI for consistent cluster management:
# Configure NFS server
goldentooth setup_nfs_exports
# Configure client mounts (covered in Chapter 031)
goldentooth setup_nfs_mounts
# Verify exports
goldentooth command allyrion 'showmount -e allyrion'
Future Evolution
Note: This represents the initial NFS implementation. The cluster later evolves to include more sophisticated storage with ZFS pools and replication (documented in Chapter 050), while maintaining compatibility with this foundational NFS export.
We'll return to this later and find out if it actually works when we configure the client mounts in the next section.
Kubernetes Updates
Because I'm not a particularly smart man, I've allowed my cluster to fall behind. The current version, as of today, is 1.32.3, and my cluster is on 1.29.something.
So that means I need to upgrade 1.29 -> 1.30, 1.30 -> 1.31, and 1.31 -> 1.32.
1.29 -> 1.30
First, I update the repo URL in /etc/apt/sources.list.d/kubernetes.sources
and run:
$ sudo apt update
Hit:1 http://deb.debian.org/debian bookworm InRelease
Hit:2 http://deb.debian.org/debian-security bookworm-security InRelease
Hit:3 http://deb.debian.org/debian bookworm-updates InRelease
Hit:4 https://download.docker.com/linux/debian bookworm InRelease
Hit:6 http://archive.raspberrypi.com/debian bookworm InRelease
Hit:7 https://baltocdn.com/helm/stable/debian all InRelease
Get:5 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.30/deb InRelease [1,189 B]
Err:5 https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.30/deb InRelease
The following signatures were invalid: EXPKEYSIG 234654DA9A296436 isv:kubernetes OBS Project <isv:kubernetes@build.opensuse.org>
Reading package lists... Done
W: https://download.docker.com/linux/debian/dists/bookworm/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: GPG error: https://prod-cdn.packages.k8s.io/repositories/isv:/kubernetes:/core:/stable:/v1.30/deb InRelease: The following signatures were invalid: EXPKEYSIG 234654DA9A296436 isv:kubernetes OBS Project <isv:kubernetes@build.opensuse.org>
E: The repository 'https://pkgs.k8s.io/core:/stable:/v1.30/deb InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
Well, shit. Looks like I need to do some surgery elsewhere.
Fortunately, I had some code for setting up the Kubernetes package repositories in install_k8s_packages. Of course, I don't want to install new versions of the packages – the upgrade process is a little more delicate than that – so I factored it out into a new role called setup_k8s_apt. Running that role against my cluster with goldentooth setup_k8s_apt made the necessary changes.
$ sudo apt-cache madison kubeadm
kubeadm | 1.30.11-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.10-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.9-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.8-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.7-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.6-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.5-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.4-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.3-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.2-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.1-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
kubeadm | 1.30.0-1.1 | https://pkgs.k8s.io/core:/stable:/v1.30/deb Packages
There we go. That wasn't that bad.
Now, the next steps are things I'm going to do repeatedly, and I don't want to type a bunch of commands, so I'm going to do it in Ansible. I need to do that advisedly, though.
I created a new role, goldentooth.upgrade_k8s. I'm working through the upgrade documentation, Ansible-izing it as I go.
So I added some tasks to update the Apt cache, unhold kubeadm, upgrade it, and then hold it again (via a handler). I tagged these with first_control_plane and invoke the role dynamically (because that is the only context in which you can limit execution of a role to the specified tags).
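A rough sketch of what those tasks might look like (the task names, version variable, and handler wiring here are illustrative, not the role's actual contents):
- name: 'Update the Apt cache.'
  ansible.builtin.apt:
    update_cache: true

- name: 'Unhold kubeadm.'
  ansible.builtin.dpkg_selections:
    name: 'kubeadm'
    selection: 'install'

- name: 'Upgrade kubeadm.'
  ansible.builtin.apt:
    name: "kubeadm={{ kubeadm_target_version }}"  # e.g. 1.30.11-1.1
    state: 'present'
  notify: 'Hold kubeadm.'

# handlers/main.yaml
- name: 'Hold kubeadm.'
  ansible.builtin.dpkg_selections:
    name: 'kubeadm'
    selection: 'hold'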
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"30", GitVersion:"v1.30.11", GitCommit:"6a074997c960757de911780f250ecd9931917366", GitTreeState:"clean", BuildDate:"2025-03-11T19:56:25Z", GoVersion:"go1.23.6", Compiler:"gc", Platform:"linux/arm64"}
It worked!
The plan operation similarly looks fine.
$ sudo kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: 1.29.6
[upgrade/versions] kubeadm version: v1.30.11
I0403 11:18:34.338987 564280 version.go:256] remote version is much newer: v1.32.3; falling back to: stable-1.30
[upgrade/versions] Target version: v1.30.11
[upgrade/versions] Latest version in the v1.29 series: v1.29.15
Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT NODE CURRENT TARGET
kubelet bettley v1.29.2 v1.29.15
kubelet cargyll v1.29.2 v1.29.15
kubelet dalt v1.29.2 v1.29.15
kubelet erenford v1.29.2 v1.29.15
kubelet fenn v1.29.2 v1.29.15
kubelet gardener v1.29.2 v1.29.15
kubelet harlton v1.29.2 v1.29.15
kubelet inchfield v1.29.2 v1.29.15
kubelet jast v1.29.2 v1.29.15
Upgrade to the latest version in the v1.29 series:
COMPONENT NODE CURRENT TARGET
kube-apiserver bettley v1.29.6 v1.29.15
kube-apiserver cargyll v1.29.6 v1.29.15
kube-apiserver dalt v1.29.6 v1.29.15
kube-controller-manager bettley v1.29.6 v1.29.15
kube-controller-manager cargyll v1.29.6 v1.29.15
kube-controller-manager dalt v1.29.6 v1.29.15
kube-scheduler bettley v1.29.6 v1.29.15
kube-scheduler cargyll v1.29.6 v1.29.15
kube-scheduler dalt v1.29.6 v1.29.15
kube-proxy 1.29.6 v1.29.15
CoreDNS v1.11.1 v1.11.3
etcd bettley 3.5.10-0 3.5.15-0
etcd cargyll 3.5.10-0 3.5.15-0
etcd dalt 3.5.10-0 3.5.15-0
You can now apply the upgrade by executing the following command:
kubeadm upgrade apply v1.29.15
_____________________________________________________________________
Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT NODE CURRENT TARGET
kubelet bettley v1.29.2 v1.30.11
kubelet cargyll v1.29.2 v1.30.11
kubelet dalt v1.29.2 v1.30.11
kubelet erenford v1.29.2 v1.30.11
kubelet fenn v1.29.2 v1.30.11
kubelet gardener v1.29.2 v1.30.11
kubelet harlton v1.29.2 v1.30.11
kubelet inchfield v1.29.2 v1.30.11
kubelet jast v1.29.2 v1.30.11
Upgrade to the latest stable version:
COMPONENT NODE CURRENT TARGET
kube-apiserver bettley v1.29.6 v1.30.11
kube-apiserver cargyll v1.29.6 v1.30.11
kube-apiserver dalt v1.29.6 v1.30.11
kube-controller-manager bettley v1.29.6 v1.30.11
kube-controller-manager cargyll v1.29.6 v1.30.11
kube-controller-manager dalt v1.29.6 v1.30.11
kube-scheduler bettley v1.29.6 v1.30.11
kube-scheduler cargyll v1.29.6 v1.30.11
kube-scheduler dalt v1.29.6 v1.30.11
kube-proxy 1.29.6 v1.30.11
CoreDNS v1.11.1 v1.11.3
etcd bettley 3.5.10-0 3.5.15-0
etcd cargyll 3.5.10-0 3.5.15-0
etcd dalt 3.5.10-0 3.5.15-0
You can now apply the upgrade by executing the following command:
kubeadm upgrade apply v1.30.11
_____________________________________________________________________
The table below shows the current state of component configs as understood by this version of kubeadm.
Configs that have a "yes" mark in the "MANUAL UPGRADE REQUIRED" column require manual config upgrade or
resetting to kubeadm defaults before a successful upgrade can be performed. The version to manually
upgrade to is denoted in the "PREFERRED VERSION" column.
API GROUP CURRENT VERSION PREFERRED VERSION MANUAL UPGRADE REQUIRED
kubeproxy.config.k8s.io v1alpha1 v1alpha1 no
kubelet.config.k8s.io v1beta1 v1beta1 no
_____________________________________________________________________
Of course, I won't automate the actual upgrade process; that seems unwise.
I'm skipping certificate renewal because I'd like to fight with one thing at a time.
$ sudo kubeadm upgrade apply v1.30.11 --certificate-renewal=false
[preflight] Running pre-flight checks.
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to "v1.30.11"
[upgrade/versions] Cluster version: v1.29.6
[upgrade/versions] kubeadm version: v1.30.11
[upgrade] Are you sure you want to proceed? [y/N]: y
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
W0403 11:23:42.086815 566901 checks.go:844] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.9" as the CRI sandbox image.
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.30.11" (timeout: 5m0s)...
[upgrade/etcd] Upgrading to TLS for etcd
[upgrade/staticpods] Preparing for "etcd" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=etcd
[upgrade/staticpods] Component "etcd" upgraded successfully!
[upgrade/etcd] Waiting for etcd to become available
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests1796562509"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2025-04-03-11-25-50/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This can take up to 5m0s
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upgrade] Backing up kubelet config file to /etc/kubernetes/tmp/kubeadm-kubelet-config2173844632/config.yaml
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[upgrade/addons] skip upgrade addons because control plane instances [cargyll dalt] have not been upgraded
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.30.11". Enjoy!
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
The next steps for the other two control plane nodes are fairly straightforward. This mostly just consisted of duplicating the playbook block to add a new step for when the playbook is executed with the 'other_control_plane' tag and adding that tag to the steps already added in the setup_k8s role.
$ goldentooth command cargyll,dalt 'sudo kubeadm upgrade node'
And a few minutes later, both of the remaining control plane nodes have updated.
The next step is to upgrade the kubelet in each node.
Serially, for obvious reasons, we need to drain each node (from a control plane node), upgrade the kubelet (unhold, upgrade, hold), then uncordon the node (again, from a control plane node).
That's not too bad, and it's included in the latest changes to the upgrade_k8s role.
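Per node, the manual equivalent is roughly this, following the upstream kubeadm upgrade docs (the node name and version pin are placeholders):
# From a control plane node:
kubectl drain erenford --ignore-daemonsets

# On the node being upgraded:
sudo apt-mark unhold kubelet kubectl
sudo apt-get install -y kubelet=1.30.11-1.1 kubectl=1.30.11-1.1
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Back on a control plane node:
kubectl uncordon erenford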
The final step is upgrading kubectl on each of the control plane nodes, which is a comparative cakewalk.
$ sudo kubectl version
Client Version: v1.30.11
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.11
Nice!
1.30 -> 1.31
Now that the Ansible playbook and role are fleshed out, the process moving forward is comparatively simple.
- Change the k8s_version_clean variable to 1.31.
- goldentooth setup_k8s_apt
- goldentooth upgrade_k8s --tags=kubeadm_first
- goldentooth command bettley 'kubeadm version'
- goldentooth command bettley 'sudo kubeadm upgrade plan'
- goldentooth command bettley 'sudo kubeadm upgrade apply v1.31.7 --certificate-renewal=false -y'
- goldentooth upgrade_k8s --tags=kubeadm_rest
- goldentooth command cargyll,dalt 'sudo kubeadm upgrade node'
- goldentooth upgrade_k8s --tags=kubelet
- goldentooth upgrade_k8s --tags=kubectl
1.31 -> 1.32
Hell, this is kinda fun now.
- Change the k8s_version_clean variable to 1.32.
- goldentooth setup_k8s_apt
- goldentooth upgrade_k8s --tags=kubeadm_first
- goldentooth command bettley 'kubeadm version'
- goldentooth command bettley 'sudo kubeadm upgrade plan'
- goldentooth command bettley 'sudo kubeadm upgrade apply v1.32.3 --certificate-renewal=false -y'
- goldentooth upgrade_k8s --tags=kubeadm_rest
- goldentooth command cargyll,dalt 'sudo kubeadm upgrade node'
- goldentooth upgrade_k8s --tags=kubelet
- goldentooth upgrade_k8s --tags=kubectl
And eventually, everything is fine:
$ sudo kubectl get nodes
NAME STATUS ROLES AGE VERSION
bettley Ready control-plane 286d v1.32.3
cargyll Ready control-plane 286d v1.32.3
dalt Ready control-plane 286d v1.32.3
erenford Ready <none> 286d v1.32.3
fenn Ready <none> 286d v1.32.3
gardener Ready <none> 286d v1.32.3
harlton Ready <none> 286d v1.32.3
inchfield Ready <none> 286d v1.32.3
jast Ready <none> 286d v1.32.3
Fixing MetalLB
As mentioned here, I purchased a new router to replace a power-hungry Dell server running OPNsense, and that cost me BGP support. This kills my MetalLB configuration, so I need to switch it to use Layer 2.
This transition represents a fundamental change in how MetalLB operates and requires understanding the trade-offs between BGP and Layer 2 modes.
BGP vs Layer 2 Architecture Comparison
BGP Mode (Previous Configuration)
- Dynamic routing: BGP speakers advertise LoadBalancer IPs to upstream routers
- True load balancing: Multiple nodes can announce the same service IP with ECMP
- Scalability: Router handles load distribution and failover automatically
- Network integration: Works with enterprise routing infrastructure
- Requirements: Router must support BGP (FRR, Quagga, hardware routers)
Layer 2 Mode (New Configuration)
- ARP announcements: Nodes respond to ARP requests for LoadBalancer IPs
- Active/passive failover: Only one node answers ARP for each service IP
- Simpler setup: No routing protocol configuration required
- Limited scalability: All traffic for a service goes through single node
- Requirements: Nodes must be on same Layer 2 network segment
Hardware Infrastructure Change
The transition was necessitated by hardware changes:
Previous Setup:
- Dell server: Power-hungry (likely PowerEdge) running OPNsense
- BGP support: FRR (Free Range Routing) plugin provided full BGP implementation
- Power consumption: High power draw from server-class hardware
- Complexity: Full routing stack with BGP, OSPF, and other protocols
New Setup:
- Consumer router: Lower power consumption
- No BGP support: Consumer-grade firmware lacks routing protocol support
- Simplified networking: Standard static routing and NAT
- Cost efficiency: Reduced power costs and hardware complexity
Migration Process
The migration involved several coordinated steps to minimize service disruption:
Step 1: Remove BGP Configuration
That shouldn't be too bad.
I think it's just a matter of deleting the BGP advertisement:
$ sudo kubectl -n metallb delete BGPAdvertisement primary
bgpadvertisement.metallb.io "primary" deleted
This command removes the BGP advertisement configuration, which:
- Stops route announcements: MetalLB speakers stop advertising LoadBalancer IPs via BGP
- Maintains IP allocation: Existing LoadBalancer services keep their assigned IPs
- Preserves connectivity: Services remain accessible until Layer 2 mode is configured
Step 2: Configure Layer 2 Advertisement
and creating an L2 advertisement:
$ cat tmp.yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: primary
  namespace: metallb
$ sudo kubectl apply -f tmp.yaml
l2advertisement.metallb.io/primary created
L2Advertisement Configuration Details:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: primary
  namespace: metallb
spec:
  ipAddressPools:
    - primary
  nodeSelectors:
    - matchLabels:
        kubernetes.io/hostname: "*"
  interfaces:
    - eth0
Key behaviors in Layer 2 mode:
- ARP responder: Nodes respond to ARP requests for LoadBalancer IPs
- Leader election: One node per service IP elected as ARP responder
- Gratuitous ARP: Leader sends gratuitous ARP to announce IP ownership
- Failover: New leader elected if current leader becomes unavailable
Step 3: Router Static Route Configuration
After adding the static route to my router, I can see the friendly go-httpbin response when I navigate to https://10.4.11.1/.
Static Route Configuration:
# Router configuration (varies by model)
# Destination: 10.4.11.0/24 (MetalLB IP pool)
# Gateway: 10.4.0.X (any cluster node IP)
# Interface: LAN interface connected to cluster network
Why static routes are necessary:
- IP pool isolation: MetalLB pool (10.4.11.0/24) is separate from cluster network (10.4.0.0/20)
- Router awareness: Router needs to know how to reach LoadBalancer IPs
- Return path: Ensures bidirectional connectivity for external clients
Network Topology Changes
Layer 2 Network Requirements
Physical topology:
[Internet] → [Router] → [Switch] → [Cluster Nodes]
                ↓
    Static Route: 10.4.11.0/24 → cluster
ARP behavior:
- Client request: External client sends packet to LoadBalancer IP
- Router forwarding: Router forwards based on static route to cluster network
- ARP resolution: Router/switch broadcasts ARP request for LoadBalancer IP
- Node response: MetalLB leader node responds with its MAC address
- Traffic delivery: Subsequent packets sent directly to leader node
Failover Mechanism
Leader election process:
# Check current leader for a service
kubectl -n metallb logs -l app.kubernetes.io/component=speaker | grep "announcing"
# Example output:
# {"level":"info","ts":"2024-01-15T10:30:00Z","msg":"announcing","ip":"10.4.11.1","node":"bettley"}
Failover sequence:
- Leader failure: Current announcing node becomes unavailable
- Detection: MetalLB speakers detect leader absence (typically 10-30 seconds)
- Election: Remaining speakers elect new leader using deterministic algorithm
- Gratuitous ARP: New leader sends gratuitous ARP to update network caches
- Service restoration: Traffic resumes through new leader node
DNS Infrastructure Migration
I also lost some control over DNS, e.g. the router's DNS server will override all lookups for hellholt.net rather than forwarding requests to my DNS servers.
So I created a new domain, goldentooth.net, to handle this cluster. A couple of tweaks to ExternalDNS and some service definitions and I can verify that ExternalDNS is setting the DNS records correctly, although I don't seem to be able to resolve names just yet.
Domain Migration Impact
Previous Domain: hellholt.net
- Router control: New router overrides DNS resolution
- Local DNS interference: Router's DNS server intercepts queries
- Limited delegation: Consumer router lacks sophisticated DNS forwarding
New Domain: goldentooth.net
- External control: Managed entirely in AWS Route53
- Clean delegation: No local DNS interference
- ExternalDNS compatibility: Full automation support
ExternalDNS Configuration Updates
Domain filter change:
# Previous configuration
args:
- --domain-filter=hellholt.net
# New configuration
args:
- --domain-filter=goldentooth.net
Service annotation updates:
# httpbin service example
metadata:
annotations:
external-dns.alpha.kubernetes.io/hostname: httpbin.goldentooth.net
# Previously: httpbin.hellholt.net
DNS record verification:
# Check Route53 records
aws route53 list-resource-record-sets --hosted-zone-id Z0736727S7ZH91VKK44A
# Verify DNS propagation
dig A httpbin.goldentooth.net
dig TXT httpbin.goldentooth.net # Ownership records
Performance and Operational Considerations
Layer 2 Mode Limitations
Single point of failure:
- Only one node handles traffic for each LoadBalancer IP
- Node failure causes service interruption until failover completes
- No load distribution across multiple nodes
Network broadcast traffic:
- ARP announcements increase broadcast traffic
- Gratuitous ARP during failover events
- Potential impact on large Layer 2 domains
Scalability constraints:
- All service traffic passes through single node
- Node bandwidth becomes bottleneck for high-traffic services
- Limited horizontal scaling compared to BGP mode
Monitoring and Troubleshooting
MetalLB speaker logs:
# Monitor speaker activities
kubectl -n metallb logs -l component=speaker --tail=50
# Check for leader election events
kubectl -n metallb logs -l component=speaker | grep -E "(leader|announcing|failover)"
# Verify ARP announcements
kubectl -n metallb logs -l component=speaker | grep "gratuitous ARP"
Network connectivity testing:
# Test ARP resolution for LoadBalancer IPs
arping -c 3 10.4.11.1
# Check MAC address consistency
arp -a | grep "10.4.11"
# Verify static routes on router
ip route show | grep "10.4.11.0/24"
Future TLS Strategy
I think I still need to get TLS working too, but I've soured on the idea of maintaining a cert per domain name and per service. I think I'll just have a wildcard over goldentooth.net and share that out. Too much aggravation otherwise. That's a problem for another time, though.
Wildcard certificate benefits:
- Simplified management: Single certificate for all subdomains
- Reduced complexity: No per-service certificate automation
- Cost efficiency: One certificate instead of multiple Let's Encrypt certs
- Faster deployment: No certificate provisioning delays for new services
Implementation considerations:
# Wildcard certificate configuration
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: goldentooth-wildcard
namespace: default
spec:
secretName: goldentooth-wildcard-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- "*.goldentooth.net"
- "goldentooth.net"
Configuration Persistence
The Layer 2 configuration is maintained in the gitops repository structure:
MetalLB Helm chart updates:
# values.yaml changes
spec:
# BGP configuration removed
# bgpPeers: []
# bgpAdvertisements: []
# Layer 2 configuration added
l2Advertisements:
- name: primary
ipAddressPools:
- primary
This transition demonstrates the flexibility of MetalLB to adapt to different network environments while maintaining service availability. While Layer 2 mode has limitations compared to BGP, it provides a viable solution for simpler network infrastructures and reduces operational complexity in exchange for some scalability constraints.
Post-Implementation Updates and Additional Fixes
After the initial MetalLB L2 migration, several additional issues were discovered and resolved to achieve full operational status.
Network Interface Selection Issues
During verification, a critical issue emerged with "super shaky" primary interface selection on cluster nodes. Some nodes (particularly newer ones like lipps and karstark) had both wired (eth0) and wireless (wlan0) interfaces active, causing:
- Calico confusion: CNI plugin using wireless interfaces for pod networking
- MetalLB routing failures: ARP announcements on wrong interfaces
- Inconsistent connectivity: Services unreachable from certain nodes
Solution implemented:
- Enhanced networking role: Created robust interface detection logic preferring eth0
- Wireless interface management: Automatic detection and disabling of wlan0 on dual-homed nodes
- SystemD persistence: Network configurations and wireless disable service survive reboots
- Network debugging tools: Installed comprehensive toolset (arping, tcpdump, mtr, etc.)
Networking role improvements:
# /ansible/roles/goldentooth.setup_networking/tasks/main.yaml
- name: 'Set primary interface to eth0 if available'
ansible.builtin.set_fact:
metallb_interface: 'eth0'
when:
- 'network.metallb.interface == ""'
- 'eth0_exists.rc == 0'
- name: 'Disable wireless interface if both eth0 and wireless exist'
ansible.builtin.shell:
cmd: "ip link set {{ wireless_interface_name.stdout }} down"
when:
- 'wireless_interface_count.stdout | int > 0'
- 'eth0_exists.rc == 0'
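The when: conditions above lean on a few registered facts (eth0_exists, wireless_interface_count, wireless_interface_name). The actual role gathers these its own way; a hypothetical sketch of the kind of pre-checks involved might look like:
- name: 'Check whether eth0 exists.'
  ansible.builtin.command:
    cmd: 'ip link show eth0'
  register: eth0_exists
  changed_when: false
  failed_when: false
- name: 'Count wireless interfaces.'
  ansible.builtin.shell:
    cmd: "ip -o link show | awk -F': ' '{print $2}' | grep -c '^wl' || true"
  register: wireless_interface_count
  changed_when: false
- name: 'Determine the wireless interface name, if any.'
  ansible.builtin.shell:
    cmd: "ip -o link show | awk -F': ' '{print $2}' | grep '^wl' | head -n1"
  register: wireless_interface_name
  changed_when: false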
DNS Architecture Migration
The L2 migration coincided with a broader DNS restructuring from hellholt.net to goldentooth.net with hierarchical service domains:
New domain structure:
- Nodes: <node>.nodes.goldentooth.net
- Kubernetes services: <service>.services.k8s.goldentooth.net
- Nomad services: <service>.services.nomad.goldentooth.net
- General services: <service>.services.goldentooth.net
ExternalDNS integration:
# Service annotations for automatic DNS management
metadata:
annotations:
external-dns.alpha.kubernetes.io/hostname: "argocd.services.k8s.goldentooth.net"
external-dns.alpha.kubernetes.io/ttl: "60"
Current Operational Status (July 2025)
The MetalLB L2 configuration is now fully operational with the following verified services:
Active LoadBalancer services:
- ArgoCD: argocd.services.k8s.goldentooth.net → 10.4.11.0
- HTTPBin: httpbin.services.k8s.goldentooth.net → 10.4.11.1
Verification commands (updated):
# Check MetalLB speaker status
kubectl -n metallb logs -l app.kubernetes.io/component=speaker --tail=20
# Verify L2 announcements
kubectl -n metallb logs -l app.kubernetes.io/component=speaker | grep "announcing"
# Test connectivity to LoadBalancer IPs
curl -I http://10.4.11.1/ # HTTPBin
curl -I http://10.4.11.0/ # ArgoCD
# Verify DNS resolution
dig argocd.services.k8s.goldentooth.net
dig httpbin.services.k8s.goldentooth.net
# Check interface status on all nodes
goldentooth command all_nodes "ip link show | grep -E '(eth0|wlan)'"
MetalLB configuration summary:
- Mode: Layer 2 (BGP disabled)
- IP Pool: 10.4.11.0 - 10.4.15.254
- Interface: eth0 (consistently across all nodes)
- FRR: Disabled in Helm values for pure L2 operation
NFS Mounts
Now that Kubernetes is kinda squared away, I'm going to set up NFS mounts on the cluster nodes.
For the sake of simplicity, I'll just set up the mounts on every node, including the load balancer (which is currently exporting the share).
Implementation Architecture
Systemd-Based Mounting
Rather than using traditional /etc/fstab entries, I implemented NFS mounting using systemd mount and automount units. This approach provides several advantages:
- Dynamic mounting: Automount units mount filesystems on-demand
- Service management: Standard systemd service lifecycle management
- Dependency handling: Proper ordering with network services
- Logging: Integration with systemd journal for troubleshooting
Global Configuration
The NFS mount configuration is defined in group_vars/all/vars.yaml:
nfs:
server: "{{ groups['nfs_server'] | first}}"
mounts:
primary:
share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
mount: '/mnt/nfs'
safe_name: 'mnt-nfs'
type: 'nfs'
options: {}
This configuration:
- Dynamically determines NFS server: Uses first host in nfs_server group (allyrion)
- IP-based addressing: Uses 10.4.0.10:/mnt/usb1 for reliable connectivity
- Standardized mount point: All nodes mount at /mnt/nfs
- Safe naming: Provides mnt-nfs for systemd unit names
Systemd Template Implementation
Mount Unit Template
The mount service template (templates/mount.j2) creates individual systemd mount units:
[Unit]
Description=Mount {{ item.key }}
[Mount]
What={{ item.value.share }}
Where={{ item.value.mount }}
Type={{ item.value.type }}
Options={{ item.value.options | join(',') }}
[Install]
WantedBy=default.target
This generates a unit file at /etc/systemd/system/mnt-nfs.mount with:
- What: 10.4.0.10:/mnt/usb1 (NFS export path)
- Where: /mnt/nfs (local mount point)
- Type: nfs (filesystem type)
- Options: Default NFS mount options
Automount Unit Template
The automount template (templates/automount.j2) provides on-demand mounting:
[Unit]
Description=Automount {{ item.key }}
After=remote-fs-pre.target network-online.target network.target
Before=umount.target remote-fs.target
[Automount]
Where={{ item.value.mount }}
[Install]
WantedBy=default.target
Key features:
- Network dependencies: Waits for network availability before attempting mounts
- Lazy mounting: Only mounts when the path is accessed
- Proper ordering: Correctly sequences with system startup and shutdown
Deployment Process
Ansible Role Implementation
The goldentooth.setup_nfs_mounts role handles the complete deployment:
- name: 'Generate mount unit for {{ item.key }}.'
ansible.builtin.template:
src: 'mount.j2'
dest: "/etc/systemd/system/{{ item.value.safe_name }}.mount"
mode: '0644'
loop: "{{ nfs.mounts | dict2items }}"
notify: 'reload systemd'
- name: 'Generate automount unit for {{ item.key }}.'
ansible.builtin.template:
src: 'automount.j2'
dest: "/etc/systemd/system/{{ item.value.safe_name }}.automount"
mode: '0644'
loop: "{{ nfs.mounts | dict2items }}"
notify: 'reload systemd'
Service Management
The role ensures proper service lifecycle:
- name: 'Enable and start automount services.'
ansible.builtin.systemd:
name: "{{ item.value.safe_name }}.automount"
enabled: true
state: started
daemon_reload: true
loop: "{{ nfs.mounts | dict2items }}"
Network Integration
Client Targeting
The NFS mounts are deployed across the entire cluster:
Target Hosts: All cluster nodes (hosts: 'all')
- 12 Raspberry Pi nodes: allyrion, bettley, cargyll, dalt, erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps
- 1 x86 GPU node: velaryon
Including NFS Server: Even allyrion (the NFS server) mounts its own export, providing:
- Consistent access patterns: Same path (/mnt/nfs) on all nodes
- Testing capability: Server can verify export functionality
- Simplified administration: Uniform management across cluster
Network Configuration
Infrastructure Network: All communication occurs within the trusted 10.4.0.0/20 CIDR
NFS Protocol: Standard NFSv3/v4 with default options
Firewall: No additional firewall rules needed within cluster network
Directory Structure and Permissions
Mount Point Creation
- name: 'Ensure mount directories exist.'
ansible.builtin.file:
path: "{{ item.value.mount }}"
state: directory
mode: '0755'
loop: "{{ nfs.mounts | dict2items }}"
Shared Directory Usage
The NFS mount serves multiple cluster functions:
Slurm Integration:
slurm_nfs_base_path: "{{ nfs.mounts.primary.mount }}/slurm"
Common Patterns:
- /mnt/nfs/slurm/ - HPC job shared storage
- /mnt/nfs/shared/ - General cluster shared data
- /mnt/nfs/config/ - Configuration file distribution
Command Line Integration
goldentooth CLI Commands
# Configure NFS mounts on all nodes
goldentooth setup_nfs_mounts
# Verify mount status
goldentooth command all 'systemctl status mnt-nfs.automount'
goldentooth command all 'df -h /mnt/nfs'
# Test shared storage
goldentooth command allyrion 'echo "test" > /mnt/nfs/test.txt'
goldentooth command bettley 'cat /mnt/nfs/test.txt'
Troubleshooting and Verification
Service Status Verification
# Check automount service status
systemctl status mnt-nfs.automount
# Check mount service status (after access)
systemctl status mnt-nfs.mount
# View mount information
mount | grep nfs
df -h /mnt/nfs
Common Issues and Solutions
Network Dependencies: The automount units properly wait for network availability through After=network-online.target
Permission Issues: The NFS export uses no_root_squash, allowing proper root access from clients
Mount Persistence: Automount units ensure mounts survive reboots and network interruptions
Security Considerations
Trust Model
Internal Network Security: Security relies on the trusted cluster network boundary
No User Authentication: Uses IP-based access control rather than user credentials
Root Access: no_root_squash on server allows administrative operations
Future Enhancements
The current implementation could be enhanced with:
- Kerberos authentication for user-based security
- Network policies for additional access control
- Encryption in transit for sensitive data protection
Integration with Storage Evolution
Note: This NFS mounting system provides the foundation for shared storage. As documented in Chapter 050, the cluster later evolves to include ZFS-based storage with replication, while maintaining compatibility with these NFS mount patterns.
This in itself wasn't too complicated, but I created two template files (one for a .mount service, another for a .automount service), fought with the variables for a bit, and it seems to work. The result is robust, cluster-wide shared storage accessible at /mnt/nfs on every node.
Slurm
Okay, finally, geez.
So this is about Slurm, an open-source, highly scalable, and fault-tolerant cluster management and job-scheduling system.
Before we get started: I want to express tremendous gratitude to Hossein Ghorbanfekr, for this Medium article and this second Medium article, which helped me set up Slurm and the modules and illustrated how to work with the system and verify its functionality. I'm a Slurm newbie and his articles were invaluable.
First, we're going to set up MUNGE, which is an authentication service designed for scalability within HPC environments. This is just a matter of installing the munge package, synchronizing the MUNGE key across the cluster (which isn't as ergonomic as I'd like, but oh well), and restarting the service.
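For the curious, one plausible way to do that key synchronization in Ansible is sketched below; it assumes a slurm_controller inventory group and a restart munge handler, neither of which is necessarily what the actual goldentooth role uses:
- name: 'Fetch the MUNGE key from the first controller.'
  ansible.builtin.fetch:
    src: '/etc/munge/munge.key'
    dest: '/tmp/goldentooth-munge.key'
    flat: true
  delegate_to: "{{ groups['slurm_controller'] | first }}"
  run_once: true
- name: 'Distribute the MUNGE key to every node.'
  ansible.builtin.copy:
    src: '/tmp/goldentooth-munge.key'
    dest: '/etc/munge/munge.key'
    owner: 'munge'
    group: 'munge'
    mode: '0400'
  notify: 'restart munge'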
Slurm itself isn't too complex to install, but we want to switch slurmctld off for the compute nodes and on for the controller nodes.
The next part is the configuration, which, uh, I'm not going to run through here. There are a ton of options and I'm figuring it out directive by directive by reading the documentation. Suffice to say that it's detailed, I had to hack some things in, and everything appears to work but I can't verify that just yet.
The control nodes write state to the NFS volume, the idea being that if one of them fails there'll be a short nonresponsive period and then another will take over. The Slurm documentation recommends not using NFS for this, and I think it wants something like Ceph or GlusterFS or something, but I'm not going to bother; this is just an educational cluster, and these distributed filesystems really introduce a lot of complexity that I don't want to deal with right now.
Ultimately, I end up with this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
general* up infinite 9 idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
debug up infinite 9 idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
$ scontrol show nodes
NodeName=bettley Arch=aarch64 CoresPerSocket=4
CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.84
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=10.4.0.11 NodeHostName=bettley Version=22.05.8
OS=Linux 6.12.20+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.12.20-1+rpt1~bpo12+1 (2025-03-19)
RealMemory=4096 AllocMem=0 FreeMem=1086 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=general,debug
BootTime=2025-04-02T20:28:31 SlurmdStartTime=2025-04-04T12:43:13
LastBusyTime=2025-04-04T12:43:21
CfgTRES=cpu=1,mem=4G,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
... etc ...
The next step is to set up Lua and Lmod for managing environments. Lua of course is a scripting language, and the Lmod system allows users of a Slurm cluster to flexibly modify their environment, use different versions of libraries and tools, etc by loading and unloading modules.
Setting this up isn't terribly fun or interesting. Lmod is on SourceForge, Lua is in Apt; we install some things, build Lmod from source, create some symlinks to ensure that Lmod is available in users' shell environments, and when we shell in and type a command, we can list our modules.
$ module av
------------------------------------------------------------------ /mnt/nfs/slurm/apps/modulefiles -------------------------------------------------------------------
StdEnv
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
After the StdEnv, we can set up OpenMPI. OpenMPI is an implementation of Message Passing Interface (MPI), used to coordinate communication between processes running across different nodes in a cluster. It's built for speed and flexibility in environments where you need to split computation across many CPUs or machines, and allows us to quickly and easily execute processes on multiple Slurm nodes.
OpenMPI is comparatively straightforward to set up, mostly just installing a few system packages for libraries and headers and creating a module file.
The next step is setting up Golang, which is unfortunately a bit more aggravating than it should be. It involves "manual" work (in Ansible terms: executing commands and iterating via trial and error rather than using predefined modules), because the latest version of Go in the Apt repos appears to be 1.19, while the latest upstream release is 1.24, and I apparently need at least 1.23 to build Singularity (see next section).
Singularity is a method for running containers without the full Docker daemon and its complications. It's written in Go, which is why we had to install 1.23.0 and couldn't rest on our laurels with 1.19.0 in the Apt repository (or, indeed, 1.21.0 as I originally thought).
Building Singularity requires additional packages, and it takes quite a while. But when done:
$ module av
------------------------------------------------------------------ /mnt/nfs/slurm/apps/modulefiles -------------------------------------------------------------------
Golang/1.21.0 Golang/1.23.0 (D) OpenMPI Singularity/4.3.0 StdEnv
Where:
D: Default Module
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
Then we can use it:
$ module load Singularity
$ singularity pull docker://arm64v8/hello-world
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
INFO: Fetching OCI image...
INFO: Extracting OCI image...
INFO: Inserting Singularity configuration...
INFO: Creating SIF file...
$ srun singularity run hello-world_latest.sif
WARNING: passwd file doesn't exist in container, not updating
WARNING: group file doesn't exist in container, not updating
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(arm64v8)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
We can also build a Singularity definition file with
$ cat > ~/torch.def << EOF
Bootstrap: docker
From: ubuntu:20.04
%post
apt-get -y update
apt-get -y install python3-pip
pip3 install numpy torch
%environment
export LC_ALL=C
EOF
$ singularity build --fakeroot torch.sif torch.def
INFO: Starting build...
INFO: Fetching OCI image...
24.8MiB / 24.8MiB [===============================================================================================================================] 100 % 2.8 MiB/s 0s
INFO: Extracting OCI image...
INFO: Inserting Singularity configuration...
....
INFO: Adding environment to container
INFO: Creating SIF file...
INFO: Build complete: torch.sif
and finally run it interactively:
$ salloc --tasks=1 --cpus-per-task=2 --mem=1gb
$ srun singularity run torch.sif \
python3 -c "import torch; print(torch.tensor(range(5)))"
tensor([0, 1, 2, 3, 4])
$ exit
We can also submit it as a batch:
$ cat > ~/submit_torch.sh << EOF
#!/usr/bin/sh -l
#SBATCH --job-name=torch
#SBATCH --mem=1gb
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00
module load Singularity
srun singularity run torch.sif \
python3 -c "import torch; print(torch.tensor(range(5)))"
EOF
$ sbatch submit_torch.sh
Submitted batch job 398
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
398 general torch nathan R 0:03 1 bettley
$ cat slurm-398.out
tensor([0, 1, 2, 3, 4])
The next part will be setting up Conda, which is similarly a bit more aggravating than it probably should be.
Once that's done, though:
$ conda env list
# conda environments:
#
base /mnt/nfs/slurm/miniforge
default-env /mnt/nfs/slurm/miniforge/envs/default-env
python3.10 /mnt/nfs/slurm/miniforge/user_envs/python3.10
python3.11 /mnt/nfs/slurm/miniforge/user_envs/python3.11
python3.12 /mnt/nfs/slurm/miniforge/user_envs/python3.12
python3.13 /mnt/nfs/slurm/miniforge/user_envs/python3.13
And we can easily activate an environment...
$ source activate python3.13
(python3.13) $
And we can schedule jobs to run across multiple nodes:
$ cat > ./submit_conda.sh << EOF
#!/usr/bin/env bash
#SBATCH --job-name=conda
#SBATCH --mem=1gb
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=2
#SBATCH --time=00:05:00
# Load Conda and activate Python 3.13 environment.
module load Conda
source activate python3.13
srun python --version
sleep 5
EOF
$ sbatch submit_conda.sh
Submitted batch job 403
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
403 general conda nathan R 0:01 3 bettley,cargyll,dalt
$ cat slurm-403.out
Python 3.13.2
Python 3.13.2
Python 3.13.2
Python 3.13.2
Python 3.13.2
Python 3.13.2
Super cool.
Terraform
I would like my Goldentooth services to be reachable from anywhere, but 1) I'd like trouble-free TLS encryption to the domain, and 2) I'd like to hide my home IP address. I can set up LetsEncrypt, but I've had issues in the past with local storage of certificates on what is essentially an ephemeral setup that might get blown up at any point. I'd like to prevent spurious strain on their systems, outages due to me spamming, etc etc etc. I'd prefer it if I could use AWS ACM. If I can use ACM and CloudFront and just not cache anything, or cache intelligently on a per-domain basis, that would be nice. I don't know if I can get that working - I know AWS intends ACM for AWS services – but I'll try.
So the first step of this is to set up Terraform; to create an S3 bucket to hold the state and a lock to support state locking.
We can bootstrap this by just creating the S3 bucket, then creating a Terraform configuration that only contains that S3 bucket and imports the existing bucket (mostly so I don't forget what the bucket is for or what it is using). I apply that - yup, works.
The next thing I add is configuration for an OIDC provider for GitHub. Fortunately, there's a provider for this, so it's easy to set up. I apply that and it creates an IAM role. I assign it Administrator access temporarily.
I create a GitHub Actions workflow to set up Terraform, plan, and apply the configuration. That works when I push to main. Pretty sweet.
Dynamic DNS
As previously stated: I would like my Goldentooth services to be reachable from anywhere, but 1) I'd like trouble-free TLS encryption to the domain, and 2) I'd like to hide my home IP address. I can set up LetsEncrypt, but I've had issues in the past with local storage of certificates on what is essentially an ephemeral setup that might get blown up at any point. I'd like to prevent spurious strain on their systems, outages due to me spamming, etc etc etc. I'd prefer it if I could use AWS ACM. If I can use ACM and CloudFront and just not cache anything, or cache intelligently on a per-domain basis, that would be nice. I don't know if I can get that working - I know AWS intends ACM for AWS services – but I'll try.
The next step of this is to get my router to update Route53 with my home IP address whenever it changes. That's going to require a Lambda function, API Gateway, an SSM Parameter for the credentials, an IAM role, etc. That's all going to be deployed and managed via Terraform.
Load Balancer Revisited
As previously stated: I would like my Goldentooth services to be reachable from anywhere, but 1) I'd like trouble-free TLS encryption to the domain, and 2) I'd like to hide my home IP address. I can set up LetsEncrypt, but I've had issues in the past with local storage of certificates on what is essentially an ephemeral setup that might get blown up at any point. I'd like to prevent spurious strain on their systems, outages due to me spamming, etc etc etc. I'd prefer it if I could use AWS ACM. If I can use ACM and CloudFront and just not cache anything, or cache intelligently on a per-domain basis, that would be nice. I don't know if I can get that working - I know AWS intends ACM for AWS services – but I'll try.
Now, one thing I want to be able to do for this is to have a single origin for the CloudFront distribution, e.g. *.my-home.goldentooth.net, which will resolve to my home IP address. But I want to be able to route based on domain name. I already have <service>.goldentooth.net working with ExternalDNS and MetalLB. So I want my reverse proxy to map an incoming request for <service>.my-home.goldentooth.net to a backend <service>.goldentooth.net with as little extra work as possible. Performance is less of an issue here than the fact that it works, that it's easy to maintain and repair if it breaks three years from now, and that I can complete this and move on to something else.
These factors combined mean that I should not use HAProxy for this. HAProxy is incredibly powerful and very performant, but it is not incredibly flexible for this sort of ad-hoc YOLO kind of work. Nginx, however, is.
So, alongside HAProxy, which I'm using for Kubernetes high-availability, I'll open a port on my router and forward it to Nginx, which will reverse-proxy that based on the domain name to the appropriate local load balancer service.
The resulting configuration is pretty simple:
server {
listen 8080;
resolver 8.8.8.8 valid=10s;
server_name ~^(?<subdomain>[^.]+)\.{{ cluster.cloudfront_origin_domain }}$;
location / {
set $target_host "$subdomain.{{ cluster.domain }}";
proxy_pass http://$target_host;
proxy_set_header Host $target_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_ssl_verify off;
}
}
And it just works; requesting http://httpbin.my-home.goldentooth.net:7463/ returns the appropriate service.
CloudFront and ACM
The next step will be to set up a CloudFront distribution that uses this address format as an origin, with no caching, and an ACM certificate. Assuming I can do that. If I can't, I might need to figure something else out. I could also use CloudFlare, and indeed if anyone ever reads this they're probably screaming at me, "just use CloudFlare, you idiot," but I'm trying to restrict the number of services and complications that I need to keep operational simultaneously.
Plus, I use Safari (and Brave) rather than Chrome, and one of the only systems with which I seem to encounter persistent issues in Safari is... CloudFlare. It might not be a problem for my use case, but I figure I'd need to set it up just to test it.
So, yes, I'm totally aware this is a nasty hack, but... I'm gonna try it.
Spelling this out a little, here's the explicit idea:
- Make a request to service.home-proxy.goldentooth.net
- That does DNS lookup, which points to a CloudFront distribution
- TLS certificate loads for CloudFront
- CloudFront makes request to my home internet, preserving the Host header
- That request gets port-forwarded to Nginx
- Nginx matches host header service.home-proxy.goldentooth.net and sets $subdomain to service
- Nginx sets upstream server name to service.goldentooth.net
- Nginx does DNS lookup for upstream server and finds 10.4.11.43
- Nginx proxies request back to 10.4.11.43
And this appears to work:
$ curl https://httpbin.home-proxy.goldentooth.net/ip
{
"origin": "66.61.26.32"
}
The latency is nonzero but not noticeable to me. It's still an ugly hack, and there are some security implications I'll need to deal with. I ended up adding basic auth on the Nginx listener which, while not fantastic, is probably as much as I really need.
Prometheus
Way back in Chapter 19, I set up a Prometheus Node Exporter "app" for Argo CD, but I never actually set up Prometheus itself.
That's really fairly odd for me, since I'm normally super twitchy about metrics, logging, and observability. I guess I put it off because I was dealing with some kind of existential questions; where would Prometheus live, how would it communicate, etc, but then ended up kinda running out of steam before I answered the questions.
So, better late than never, I'm going to work on setting up Prometheus in a nice, decentralized kind of way.
Implementation Architecture
Installation Method
I'm using the official prometheus.prometheus.prometheus Ansible role from the Prometheus community. The depth of Prometheus lies, after all, in configuring and using it, not merely in installing it.
The installation is managed through:
- Playbook: setup_prometheus.yaml
- Custom role: goldentooth.setup_prometheus (wraps the community role)
- CLI command: goldentooth setup_prometheus
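The playbook itself is presumably a thin wrapper around the role, roughly along these lines (a sketch, not the actual file; the prometheus host group is an assumption):
# setup_prometheus.yaml (sketch)
- name: 'Setup Prometheus.'
  hosts: 'prometheus'
  become: true
  roles:
    - 'goldentooth.setup_prometheus'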
Deployment Location
Prometheus runs on allyrion (10.4.0.10), which consolidates multiple infrastructure services:
- HAProxy load balancer
- NFS server
- Prometheus monitoring server
This placement provides several advantages:
- Central location for cluster-wide monitoring
- Proximity to load balancer for HAProxy metrics
- Reduced resource usage on Kubernetes worker nodes
Service Configuration
Core Settings
The Prometheus service is configured with production-ready settings:
# Storage and retention
prometheus_storage_retention_time: "15d"
prometheus_storage_retention_size: "5GB"
prometheus_storage_tsdb_path: "/var/lib/prometheus"
# Network and performance
prometheus_web_listen_address: "0.0.0.0:9090"
prometheus_config_global_scrape_interval: "60s"
prometheus_config_global_evaluation_interval: "15s"
prometheus_config_global_scrape_timeout: "15s"
Security Hardening
The service implements comprehensive security measures:
- Dedicated user: Runs as prometheus user/group
- Systemd hardening: NoNewPrivileges, PrivateDevices, ProtectSystem=strict
- Capability restrictions: Limited to CAP_SET_UID only
- Resource limits: GOMAXPROCS=4 to prevent CPU exhaustion
External Labels
Cluster identification through external labels:
external_labels:
environment: goldentooth
cluster: goldentooth
domain: goldentooth.net
Service Discovery Implementation
File-Based Service Discovery
Rather than relying on complex auto-discovery, I implement file-based service discovery for reliability and explicit control:
Target Generation (/etc/prometheus/file_sd/node.yaml):
{% for host in groups['all'] %}
- targets:
- "{{ hostvars[host]['ipv4_address'] }}:9100"
labels:
instance: "{{ host }}"
job: 'node'
{% endfor %}
This approach:
- Auto-generates targets from Ansible inventory
- Covers all 13 cluster nodes (12 Raspberry Pis + 1 x86 GPU node)
- Provides consistent labeling with instance and job labels
- Updates automatically when nodes are added/removed
Scrape Configurations
Core Infrastructure Monitoring
Prometheus Self-Monitoring:
- job_name: 'prometheus'
static_configs:
- targets: ['allyrion:9090']
HAProxy Load Balancer:
- job_name: 'haproxy'
static_configs:
- targets: ['allyrion:8405']
HAProxy includes a built-in Prometheus exporter accessible at /metrics on port 8405, providing load balancer performance and health metrics.
Nginx Reverse Proxy:
- job_name: 'nginx'
static_configs:
- targets: ['allyrion:9113']
Node Monitoring
File Service Discovery for all cluster nodes:
- job_name: "unknown"
file_sd_configs:
- files:
- "/etc/prometheus/file_sd/*.yaml"
- "/etc/prometheus/file_sd/*.json"
This targets all Node Exporter instances across the cluster, providing comprehensive infrastructure metrics.
Advanced Integrations
Loki Log Aggregation:
- job_name: 'loki'
static_configs:
- targets: ['inchfield:3100']
scheme: 'https'
tls_config:
ca_file: /etc/ssl/certs/goldentooth.pem
This integration uses the Step-CA certificate authority for secure communication with the Loki log aggregation service.
Exporter Ecosystem
Node Exporter Deployment
Kubernetes Nodes (via Argo CD):
- Helm Chart: prometheus-node-exporter v4.46.1
- Namespace: prometheus-node-exporter
- Extra Collectors: --collector.systemd, --collector.processes
- Management: Automated GitOps deployment with auto-sync
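For reference, an Argo CD Application wrapping this chart would look roughly like the sketch below. The chart, version, namespace, and collector flags come from the list above; the repoURL, project, and sync options are assumptions:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-node-exporter
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus-node-exporter
    targetRevision: 4.46.1
    helm:
      valuesObject:
        extraArgs:
          - --collector.systemd
          - --collector.processes
  destination:
    server: https://kubernetes.default.svc
    namespace: prometheus-node-exporter
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true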
Infrastructure Node (allyrion):
- Installation: Via prometheus.prometheus.node_exporter role
- Enabled Collectors: systemd for service monitoring
- Integration: Direct scraping by local Prometheus instance
Application Exporters
I also configured several application-specific exporters:
HAProxy Built-in Exporter: Provides load balancer metrics including backend health, response times, and traffic distribution
Nginx Exporter: Monitors reverse proxy performance and request patterns
Network Access and Security
Nginx Reverse Proxy
To provide secure external access to Prometheus, I configured an Nginx reverse proxy:
server {
listen 8081;
location / {
proxy_pass http://127.0.0.1:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Application prometheus;
}
}
This provides:
- Network isolation (Prometheus only accessible locally)
- Header injection for request identification
- Potential for future authentication layer
Certificate Integration
The cluster uses Step-CA for comprehensive certificate management. Prometheus leverages this infrastructure for:
- Secure scraping of TLS-enabled services (Loki)
- Potential future TLS termination
- Integration with the broader security model
Alerting Configuration
Basic Alert Rules
The installation includes foundational alerting rules in /etc/prometheus/rules/ansible_managed.yml:
Watchdog Alert: Always-firing alert to verify the alerting pipeline is functional
Instance Down Alert: Critical alert when up == 0 for 5 minutes, indicating node or service failure
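Expressed as a Prometheus rule file, those two rules look roughly like this (a sketch of the Ansible-managed rules, not a verbatim copy of the generated file):
groups:
  - name: ansible_managed
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: warning
        annotations:
          description: 'Always firing; confirms the alerting pipeline is functional.'
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'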
Future Expansion
The alert rule framework is prepared for expansion with application-specific alerts, SLA monitoring, and capacity planning alerts.
Integration with Monitoring Stack
Grafana Integration
Prometheus serves as the primary data source for Grafana dashboards:
datasources:
- name: prometheus
type: prometheus
url: http://allyrion:9090
access: proxy
This enables rich visualization of cluster metrics through pre-configured and custom dashboards.
Storage and Performance
TSDB Configuration:
- Retention: 15 days (time) and 5GB (size) for appropriate data lifecycle
- Storage: Local disk at /var/lib/prometheus
- Compaction: Automatic TSDB compaction for optimal query performance
The scrape configuration was fairly straightforward, and the result is a comprehensive monitoring foundation covering all infrastructure components and preparing for future application-specific monitoring expansion.
Consul
I wanted to install a service discovery system to manage, well, all of the other services that exist only to manage other services on this cluster.
I have the idea of installing Authelia, then Envoy, then Consul in a chain as a replacement for Nginx. Obviously it's far more complicated than Nginx, but by now that's the point; to increase the complexity of this homelab until it collapses under its own weight. Alas poor Goldentooth. I knew him, Gentle Reader, a cluster of infinite GPIO!
First order of business is to set up the Consul servers – leader and followers – which will occupy Bettley, Cargyll, and Dalt.
For most of this, I just followed the deployment guide. Then I followed the guide for creating client agent tokens.
Unfortunately, I encountered some issues when it came to setting up ACLs. For some reason, my server nodes worked precisely as expected, but my client nodes would not join the cluster.
Apr 12 13:44:56 fenn consul[328873]: ==> Starting Consul agent...
Apr 12 13:44:56 fenn consul[328873]: Version: '1.20.5'
Apr 12 13:44:56 fenn consul[328873]: Build Date: '2025-03-11 10:16:18 +0000 UTC'
Apr 12 13:44:56 fenn consul[328873]: Node ID: 'a5c6a1f2-8811-9de7-917f-acc1cd9fc8b7'
Apr 12 13:44:56 fenn consul[328873]: Node name: 'fenn'
Apr 12 13:44:56 fenn consul[328873]: Datacenter: 'dc1' (Segment: '')
Apr 12 13:44:56 fenn consul[328873]: Server: false (Bootstrap: false)
Apr 12 13:44:56 fenn consul[328873]: Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: -1, DNS: 8600)
Apr 12 13:44:56 fenn consul[328873]: Cluster Addr: 10.4.0.15 (LAN: 8301, WAN: 8302)
Apr 12 13:44:56 fenn consul[328873]: Gossip Encryption: true
Apr 12 13:44:56 fenn consul[328873]: Auto-Encrypt-TLS: true
Apr 12 13:44:56 fenn consul[328873]: ACL Enabled: true
Apr 12 13:44:56 fenn consul[328873]: ACL Default Policy: deny
Apr 12 13:44:56 fenn consul[328873]: HTTPS TLS: Verify Incoming: true, Verify Outgoing: true, Min Version: TLSv1_2
Apr 12 13:44:56 fenn consul[328873]: gRPC TLS: Verify Incoming: true, Min Version: TLSv1_2
Apr 12 13:44:56 fenn consul[328873]: Internal RPC TLS: Verify Incoming: true, Verify Outgoing: true (Verify Hostname: true), Min Version: TLSv1_2
Apr 12 13:44:56 fenn consul[328873]: ==> Log data will now stream in as it occurs:
Apr 12 13:44:56 fenn consul[328873]: 2025-04-12T13:44:55.999-0400 [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Apr 12 13:44:56 fenn consul[328873]: 2025-04-12T13:44:56.021-0400 [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.216-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.11:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.225-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.12:8300 error="rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.240-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.240-0400 [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.255-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.11:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.263-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.12:8300 error="rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.277-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:06 fenn consul[328873]: 2025-04-12T13:45:06.277-0400 [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.508-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.11:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.515-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.12:8300 error="rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.538-0400 [ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=10.4.0.13:8300 error="rpcinsecure: error making call: rpcinsecure: error making call: Permission denied: anonymous token lacks permission 'node:write' on \"fenn\". The anonymous token is used implicitly when a request does not specify a token."
Apr 12 13:45:07 fenn consul[328873]: 2025-04-12T13:45:07.538-0400 [ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
It seemed that the token would not be persisted on the client node after running consul acl set-agent-token agent <acl-token-secret-id>, even though I have enable_token_persistence set to true. As a result, I needed to go back and set it in the consul.hcl configuration file.
The fiddliness of the ACL bootstrapping also led me to split that out into a separate Ansible role.
Vault
As long as I'm setting up Consul, I figure I might as well set up Vault too.
This wasn't that bad, compared to the experience I had with ACLs in Consul. I set up a KMS key for unsealing, generated a certificate authority and regenerated TLS assets for my three server nodes, and the Consul storage backend works seamlessly.
Vault Cluster Architecture
Deployment Configuration
The Vault cluster runs on three Raspberry Pi nodes: bettley, cargyll, and dalt. This provides high availability with automatic leader election and fault tolerance.
Key Design Decisions:
- Storage Backend: Consul (not Raft) - leverages existing Consul cluster for data persistence
- Auto-Unsealing: AWS KMS integration eliminates manual unsealing after restarts
- TLS Everywhere: Full mutual TLS with Step-CA integration
- Service Integration: Deep integration with Consul service discovery
AWS KMS Auto-Unsealing
Rather than managing unseal keys manually, I implemented AWS KMS auto-unsealing through Terraform:
KMS Key Configuration (terraform/modules/vault_seal/kms.tf):
resource "aws_kms_key" "vault_seal" {
description = "KMS key for managing the Goldentooth vault seal"
key_usage = "ENCRYPT_DECRYPT"
enable_key_rotation = true
deletion_window_in_days = 30
}
resource "aws_kms_alias" "vault_seal" {
name = "alias/goldentooth/vault-seal"
target_key_id = aws_kms_key.vault_seal.key_id
}
This provides:
- Automatic unsealing on service restart
- Key rotation managed by AWS
- Audit trail through CloudTrail
- No manual intervention required for cluster recovery
Vault Server Configuration
Core Settings
The main Vault configuration demonstrates production-ready patterns:
ui = true
cluster_addr = "https://{{ ipv4_address }}:8201"
api_addr = "https://{{ ipv4_address }}:8200"
disable_mlock = true
cluster_name = "goldentooth"
enable_response_header_raft_node_id = true
log_level = "debug"
Key Features:
- Web UI enabled for administrative access
- Per-node cluster addressing using individual IP addresses
- Memory lock disabled (appropriate for container/Pi environments)
- Debug logging for troubleshooting and development
Storage Backend: Consul Integration
storage "consul" {
address = "{{ ipv4_address }}:8500"
check_timeout = "5s"
consistency_mode = "strong"
path = "vault/"
token = "{{ vault_consul_token.token.SecretID }}"
}
The Consul storage backend provides:
- Strong consistency for data integrity
- Leveraged infrastructure - reuses existing Consul cluster
- ACL integration with dedicated Consul tokens
- Service discovery through Consul's native mechanisms
TLS Configuration
listener "tcp" {
address = "{{ ipv4_address }}:8200"
tls_cert_file = "/opt/vault/tls/tls.crt"
tls_key_file = "/opt/vault/tls/tls.key"
tls_require_and_verify_client_cert = true
telemetry {
unauthenticated_metrics_access = true
}
}
Security Features:
- Mutual TLS required for all client connections
- Step-CA certificates with multiple Subject Alternative Names
- Automatic certificate renewal via systemd timers
- Telemetry access for monitoring without authentication
Certificate Management Integration
Step-CA Integration
Vault certificates are issued by the cluster's Step-CA with comprehensive SAN coverage:
Certificate Attributes:
- vault.service.consul - Service discovery name
- localhost - Local access
- Node hostname (e.g., bettley.nodes.goldentooth.net)
- Node IP address (e.g., 10.4.0.11)
Renewal Automation:
[Service]
Environment=CERT_LOCATION=/opt/vault/tls/tls.crt \
KEY_LOCATION=/opt/vault/tls/tls.key
# Restart Vault service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active vault.service || systemctl try-reload-or-restart vault.service"
Certificate Lifecycle
- Validity: 24 hours (short-lived certificates)
- Renewal: Automatic via cert-renewer@vault.timer
- Service Integration: Automatic Vault restart after renewal
- CLI Management: goldentooth rotate_vault_certs
Consul Backend Configuration
Dedicated ACL Policy
Vault nodes use dedicated Consul ACL tokens with specific permissions:
key_prefix "vault/" {
policy = "write"
}
service "vault" {
policy = "write"
}
agent_prefix "" {
policy = "read"
}
session_prefix "" {
policy = "write"
}
This provides:
- Minimal permissions for Vault's operational needs
- Isolated key space under vault/ prefix
- Service registration capabilities
- Session management for locking mechanisms
Security and Service Configuration
Systemd Hardening
[Service]
User=vault
Group=vault
SecureBits=keep-caps
AmbientCapabilities=CAP_IPC_LOCK
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK
NoNewPrivileges=yes
Security Measures:
- Dedicated user/group isolation
- Capability restrictions - only IPC_LOCK and SYSLOG
- Memory locking capability for sensitive data
- No privilege escalation permitted
Environment Security
AWS credentials for KMS access are managed through environment files:
AWS_ACCESS_KEY_ID={{ vault.aws.access_key_id }}
AWS_SECRET_ACCESS_KEY={{ vault.aws.secret_access_key }}
AWS_REGION={{ vault.aws.region }}
- File permissions: 0600 (owner read/write only)
- Encrypted at rest in Ansible vault
- Least privilege IAM policies for KMS access only
Monitoring and Observability
Prometheus Integration
telemetry {
prometheus_retention_time = "24h"
usage_gauge_period = "10m"
maximum_gauge_cardinality = 500
enable_hostname_label = true
lease_metrics_epsilon = "1h"
num_lease_metrics_buckets = 168
add_lease_metrics_namespace_labels = false
filter_default = true
disable_hostname = true
}
Metrics Features:
- 24-hour retention for operational metrics
- 10-minute usage gauges for capacity planning
- Hostname labeling for per-node identification
- Lease metrics with weekly granularity (168 buckets)
- Unauthenticated metrics access for Prometheus scraping
Command Line Integration
goldentooth CLI Commands
# Deploy and configure Vault cluster
goldentooth setup_vault
# Rotate TLS certificates
goldentooth rotate_vault_certs
# Edit encrypted secrets
goldentooth edit_vault
Environment Configuration
For Vault CLI operations:
export VAULT_ADDR=https://{{ ipv4_address }}:8200
export VAULT_CLIENT_CERT=/opt/vault/tls/tls.crt
export VAULT_CLIENT_KEY=/opt/vault/tls/tls.key
External Secrets Integration
Kubernetes Integration
The cluster includes External Secrets Operator (v0.9.13) for Kubernetes secrets management:
- Namespace: external-secrets
- Management: Argo CD GitOps deployment
- Integration: Direct Vault API access for secret retrieval
- Use Cases: Database credentials, API keys, TLS certificates
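Concretely, the integration hinges on a SecretStore (or ClusterSecretStore) that points at Vault plus ExternalSecret resources that pull individual values into Kubernetes Secrets. A hypothetical sketch, with paths, auth method, and names assumed rather than taken from the cluster's actual configuration:
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault
spec:
  provider:
    vault:
      server: https://vault.service.consul:8200
      path: secret          # assumed KV mount
      version: v2
      auth:
        tokenSecretRef:     # assumed auth method; Kubernetes auth is another option
          name: vault-token
          namespace: external-secrets
          key: token
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-credentials
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: example-credentials
  data:
    - secretKey: password
      remoteRef:
        key: example/app    # hypothetical KV path
        property: password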
Directory Structure
/opt/vault/ # Base directory
├── tls/ # TLS certificates
│ ├── tls.crt # Server certificate (Step-CA issued)
│ └── tls.key # Private key
├── data/ # Data directory (unused with Consul backend)
└── raft/ # Raft storage (unused with Consul backend)
/etc/vault.d/ # Configuration directory
├── vault.hcl # Main configuration
└── vault.env # Environment variables (AWS credentials)
High Availability and Operations
Cluster Behavior
- Leader Election: Automatic through Consul backend
- Split-Brain Protection: Consul quorum requirements
- Rolling Updates: One node at a time with certificate renewal
- Disaster Recovery: AWS KMS auto-unsealing enables rapid recovery
Operational Patterns
Health Checks: Consul health checks monitor Vault API availability
Service Discovery: vault.service.consul provides load balancing
Monitoring: Prometheus metrics for capacity and performance monitoring
Logging: systemd journal integration with structured logging
That said, I haven't actually put anything into it yet, so the real test will come when I start using it for secrets management across the cluster infrastructure. The External Secrets integration provides the foundation for Kubernetes secrets management, while the Consul integration enables broader service authentication.
Envoy
I would like to replace Nginx with an edge routing solution of Envoy + Consul. Consul is set up, so let's get cracking on Envoy.
Unfortunately, it doesn't work out of the box:
$ envoy --version
external/com_github_google_tcmalloc/tcmalloc/system-alloc.cc:625] MmapAligned() failed - unable to allocate with tag (hint, size, alignment) - is something limiting address placement? 0x17f840000000 1073741824 1073741824 @ 0x5560994c54 0x5560990f40 0x5560990830 0x5560971b6c 0x556098de00 0x556098dbd0 0x5560966e60 0x55608964ec 0x556089314c 0x556095e340 0x7fa24a77fc
external/com_github_google_tcmalloc/tcmalloc/arena.cc:58] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size); is something preventing mmap from succeeding (sandbox, VSS limitations)? 131072 632 @ 0x5560994fb8 0x5560971bfc 0x556098de00 0x556098dbd0 0x5560966e60 0x55608964ec 0x556089314c 0x556095e340 0x7fa24a77fc
Aborted
That's because of this issue.
I don't really have the horsepower on these Pis to compile Envoy, and I don't want to recompile the kernel, so for the time being I think I'll need to run a special build of Envoy in Docker. Unfortunately, I can't find a version that both 1) runs on Raspberry Pis, and 2) is compatible with a current version of Consul, so I think I'm kinda screwed for the moment.
Cross-Compilation Investigation
To solve the tcmalloc issue, I attempted to cross-compile Envoy v1.32.0 for ARM64 with --define tcmalloc=disabled on Velaryon (the x86 node). This would theoretically produce a Raspberry Pi-compatible binary without the memory alignment problems.
Setup Completed
- ✅ Created cross-compilation toolkit with ARM64 toolchain (aarch64-linux-gnu-gcc)
- ✅ Built containerized build environment with Bazel 6.5.0 (required by Envoy)
- ✅ Verified ARM64 cross-compilation works for simple C programs
- ✅ Confirmed Envoy source has ARM64 configurations (//bazel:linux_aarch64)
- ✅ Found Envoy's CI system officially supports ARM64 builds
Fundamental Blocker
All cross-compilation attempts failed with the same error:
cc_toolchain_suite '@local_config_cc//:toolchain' does not contain a toolchain for cpu 'aarch64'
The root cause is a version compatibility gap:
- Envoy v1.32.0 requires Bazel 6.5.0 for compatibility
- Bazel 6.5.0 predates built-in ARM64 toolchain support
- Envoy's CI likely uses custom Docker images with pre-configured ARM64 toolchains
Attempts Made
- Custom cross-compilation setup - Blocked by missing Bazel ARM64 toolchain
- Platform-based approach - Wrong platform type (config_setting vs platform)
- CPU-based configuration - Same toolchain issue
- Official Envoy CI approach - Same fundamental Bazel limitation
Verdict
Cross-compiling Envoy for ARM64 would require either:
- Creating custom Bazel ARM64 toolchain definitions (complex, undocumented)
- Finding Envoy's exact CI Docker environment (may not be public)
- Upgrading to newer Bazel (likely breaks Envoy v1.32.0 compatibility)
The juice isn't worth the squeeze. For edge routing on Raspberry Pi, simpler alternatives exist:
- nginx (lightweight, excellent ARM64 support)
- HAProxy (proven load balancer, ARM64 packages available)
- Traefik (modern proxy, native ARM64 builds)
- Caddy (simple reverse proxy, ARM64 support)
Step-CA
Apparently, another thing I did recently was to set up Nomad, but I didn't take any notes about it.
That's not really that big of a deal, though, because what I need to do is to get Nomad and Consul and Vault working together, and currently they aren't.
This is complicated by the fact that if I do want AutoEncrypt working between Nomad and Consul, the two have to have a certificate chain proceeding from either 1) the same root certificate, or 2) different root certificates that have cross-signed each other. Currently, Vault has its own root certificate that I generate from scratch with the Ansible x509 tools, and then Nomad and Consul generate their own certificates using the built-in tools.
This seems messy, so it's probably time to dive into some kind of meaningful, long-term TLS infrastructure.
The choice seemed fairly clear: step-ca. Although I hadn't used it before, I'd flirted with it a time or two and it seemed to be fairly straightforward.
I poked around a bit in other people's implementations and pilfered them ruthlessly (I've bought Max Hösel a couple coffees and I'm crediting him, never fear). I don't really need the full range of his features (and they are wonderful, it's really a lovely collection), so I cribbed the basic flow.
Once that's done, we have a few new Ansible playbooks:
- apt_smallstep: Configure the Smallstep Apt repository.
- install_step_ca: Install step-ca and step-cli on the CA node (which I've set to be Jast, the tenth node).
- install_step_cli: Performed on all nodes.
- init_cluster_ca: Initialize the certificate authority on the CA node (see the sketch below).
- bootstrap_cluster_ca: Install the root certificate in the trust store on every node.
- zap_cluster_ca: To clean up, just nuke every file in the step-ca data directory.
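For reference, the init and bootstrap playbooks boil down to roughly the following step commands (a sketch; the CA name, hostname, and fingerprint placeholder are assumptions, though the 9443 port matches the rest of the cluster config):
# On the CA node (Jast): initialize the certificate authority.
step ca init --name "Goldentooth" --dns jast --address ":9443" --provisioner admin
# On every node: fetch the root certificate and install it into the trust store.
step ca bootstrap --ca-url https://jast:9443 --fingerprint <root-fingerprint> --install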
The playbooks mentioned above get us most of the way there, but we need to revisit some of the places we've generated certificates (Vault, Consul, and Nomad) and integrate them into this system.
Refactoring HashiApp certificate management
As it turned out, doing that involved refactoring a good amount of my Ansible IaC. One thing I've learned about code quality:
Can you make the change easily? If so, make the change. If not, fix the most obvious obstacle, then reevaluate.
In this case, the change in question was to use the step CLI tool to generate certificates signed by the step-ca root certificate authority for services like Nomad, Vault, and Consul.
I knew immediately this would not be an easy change to make, just because of how I had written my Ansible roles. I had adopted conventional patterns for these roles, even though I knew they were not for general use and I didn't really have much intention of distributing them. Conventional patterns included naming variables as though they would be reused across roles, etc. So I would declare variables in a general fashion within defaults/main.yaml and then override them within my inventory's group_vars and host_vars.
I now consider this to be a mistake. In reality, the roles weren't really designed cleanly; there were a lot of assumptions based on my own use cases baked into them, and that affected which variables I declared, etc. So yeah, I had an Ansible role to set up Slurm, but it was by no means general enough to actually help most people set up Slurm. It just gathered together a lot of tasks I found appropriate that had to do with setting up Slurm.
Nevertheless, I persisted for a while. Mostly, I think, out of a belief that I should at least pay lip service to community style guidelines.
This task, getting Nomad and Consul and Vault working with TLS courtesy of step-ca, was my breaking point. There was just too much crap that needed to be renamed, just to maintain the internal consistency of an increasingly clumsy architecture intended to please people who didn't notice and almost surely wouldn't care if they had.
So, TL;DR: there was a great reduction in redundancy and I shifted to specifying variables in dictionaries rather than distinctly-named snake-cased variables that reminded me a little too much of Java naming conventions.
Configuring HashiApps to use Step-CA
Once refactoring was done, configuring the apps to use Step-CA was mostly straightforward. A single step command was needed to generate the certificates, then another Ansible block to adjust the permissions and ownership of the generated files. For our labors, we're eventually greeted with Consul, Vault, and Nomad running exactly as they had before, but secured by a coherent certificate chain that can span all Goldentooth services.
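The step command in question looks roughly like this (a sketch; the subject, SANs, provisioner name, and file paths here are assumptions, not the exact values from the role):
# Issue a short-lived certificate for Consul on this node, signed by the cluster CA.
step ca certificate "consul.service.consul" consul.crt consul.key \
  --san "$(hostname -f)" \
  --provisioner admin \
  --not-after 24h \
  --force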
Ray
Finally, we're getting back to something that's associated directly with machine learning: Ray.
It would be normal to opt for KubeRay here, since I am actually running Kubernetes on Goldentooth, but I'm not normal 🤷♂️ Instead, I'll be going with the on-prem approach, which... has some implications.
First of these is that I need to install Conda on every node. This is fine and probably something I should've already done anyway, just as a normal matter of course. Except I kind of did as part of setting up Slurm. Which, yeah, probably means a refactor is in order.
So let's install and configure Conda, then set up a Ray cluster!
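For context, the on-prem approach amounts to starting Ray by hand (or via Ansible) on each node, roughly like this (a sketch; the port and dashboard settings are illustrative):
# On the head node:
ray start --head --port=6379 --dashboard-host=0.0.0.0
# On each worker node, pointing at the head:
ray start --address='<head-node-ip>:6379'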
TL;DR: The attempt on my life has left me scarred and deformed.
So, that ended up being a major pain in the ass. The conda-forge channel didn't have builds of Ray for aarch64, so I needed to configure the defaults channel. Once the correct packages were installed, I encountered mysterious issues where the Ray dashboard wouldn't start up, causing the entire service to crash. It turned out, after prolonged debugging, that the Ray dashboard was apparently segfaulting because of issues with a grpcio wheel – not sure if it was built improperly, or what.
After figuring that out, I managed to get the cluster up, but still encountered issues. Well, the cluster was running Ray 2.46.0, and my MBP was running 2.7.0, so... that checks out. Unfortunately, I was attempting to follow MadeWithML based on a recommendation, and there were no Pi builds available for 2.7.0.
So I updated the MadeWithML project to use 2.46.0, brute-force-ishly, and that worked - for a time, but then incompatibilities started popping up. So I guess MadeWithML and my cluster weren't meant to be together.
Nevertheless, I do have a somewhat functioning Ray cluster, so I'm going to call this a victory (the only one I can) and move on.
Grafana
This, the next "article" (on Loki), and the successive one (on Vector), are occurring mostly in parallel so that I can validate these services as I go.
I (minimally) set up Vector first, then Loki, then Grafana, just to verify I could pass info around in some coherent fashion and see it in Grafana. However, that's not really sufficient.
The fact is that I'm not really experienced with Grafana. I've used it to debug things, I've installed and managed it, I've created and updated dashboards, etc. But I don't have a deep understanding of it or its featureset.
At work, we use Datadog. I love Datadog. Datadog has incredible features and a wonderful user interface. Datadog makes more money than I do, and costs more than I can afford. Also, they won't hire me, but I'm not bitter. The fact is that they don't really have a hobbyist tier, or at least not one that makes a ten-node cluster affordable.
At work, I prioritize observability. I rely heavily on logs, metrics, and traces to do my job. In my work on Goldentooth, I've been neglecting that. I've been using journalctl to review logs and debug services, and that's a pretty poor experience. It's recently become very, very clear that I need to have a better system here, and that means learning how to use Grafana and how to configure it best for my needs.
So, yeah, Grafana.
Grafana
My initial installation was bog-standard, basic Grafana. Not a thing changed. It worked! Okay, let's make it better.
The first thing I did was to throw that SQLite DB on a tmpfs. I'm not really concerned enough about the volume or load to consider moving to something like PostgreSQL, but 1) it also doesn't matter if I keep logs/metrics past a reboot, and 2) it's probably good to avoid any writes to the SD card that I can.
Next thing was to create a new repository, grafana-dashboards, to manage dashboards. I want a bunch of these dudes, and it's better to manage them in a separate repository than in Ansible itself. I checked it out via Git, added a script to sync the repo every so often, and added that to cron.
Of course, then I needed a dashboard to test it out, so I grabbed a nice one to incorporate data from Prometheus Node Exporter here. (Thanks, Ricardo F!)
Then I had to connect Grafana to Prometheus Node Exporter, then I realized I was missing a couple of command-line arguments in my Prometheus Node Exporter Helm chart that were nice to have, so I added those to the Argo CD Application, re-synced the app, etc, and finally things started showing up.
Pretty cool, I think.
Grafana Implementation Details
tmpfs Database Configuration
The first optimization I implemented was mounting the Grafana data directory on tmpfs to avoid SD card writes:
- name: 'Manage the mount for the Grafana data directory.'
ansible.posix.mount:
path: '/var/lib/grafana'
src: 'tmpfs'
fstype: 'tmpfs'
opts: 'size=100M,mode=0755'
state: 'present'
This configuration:
- Avoids SD card wear: Eliminates database writes to flash storage
- Improves performance: RAM-based storage for faster access
- Ephemeral data: Acceptable for a lab environment where persistence across reboots isn't critical
- Size limit: 100MB allocation prevents memory exhaustion
TLS Configuration
I finished up by adding comprehensive TLS support to Grafana using Step-CA integration:
Server Configuration (in grafana.ini):
[server]
protocol = https
http_addr = {{ ipv4_address }}
http_port = 3000
cert_file = {{ grafana.cert_path }}
cert_key = {{ grafana.key_path }}
[grpc_server]
use_tls = true
cert_file = {{ grafana.cert_path }}
key_file = {{ grafana.key_path }}
Certificate Management:
- Source: Step-CA issued certificates with 24-hour validity
- Renewal: Automatic via cert-renewer@grafana.timer
- Service Integration: Automatic Grafana restart after certificate renewal
- Paths: /opt/grafana/tls/tls.crt and /opt/grafana/tls/tls.key
Dashboard Repository Management
As noted above, dashboards are managed externally in the grafana-dashboards repository:
Repository Integration:
- name: 'Check out the Grafana dashboards repository.'
ansible.builtin.git:
repo: "https://github.com/{{ cluster.github.organization }}/{{ grafana.provisioners.dashboards.repository_name }}.git"
dest: '/var/lib/grafana/dashboards'
become_user: 'grafana'
Dashboard Provisioning (provisioners.dashboards.yaml):
apiVersion: 1
providers:
- name: "grafana-dashboards"
orgId: 1
type: file
folder: ''
disableDeletion: false
updateIntervalSeconds: 15
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
foldersFromFilesStructure: true
Automatic Dashboard Updates
I added a script to sync the repository periodically via cron:
Update Script (/usr/local/bin/grafana-update-dashboards.sh):
#!/usr/bin/env bash
dashboard_path="/var/lib/grafana/dashboards"
cd "${dashboard_path}"
git fetch --all
git reset --hard origin/master
git pull
Cron Integration: Updates every 15 minutes to keep dashboards current with the repository
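The cron wiring is roughly the following (a sketch; the cron.d filename and the grafana user field are assumptions):
# Run the sync script every 15 minutes.
echo '*/15 * * * * grafana /usr/local/bin/grafana-update-dashboards.sh' \
  | sudo tee /etc/cron.d/grafana-dashboards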
Data Source Provisioning
The Prometheus integration is configured through automatic data source provisioning:
datasources:
- name: 'Prometheus'
type: 'prometheus'
access: 'proxy'
url: http://{{ groups['prometheus'] | first }}:9090
jsonData:
httpMethod: POST
manageAlerts: true
prometheusType: Prometheus
cacheLevel: 'High'
disableRecordingRules: false
incrementalQueryOverlapWindow: 10m
This configuration:
- Automatic discovery: Uses Ansible inventory to find Prometheus server
- High performance: POST method and high cache level for better performance
- Alert management: Enables Grafana to manage Prometheus alerts
- Query optimization: 10-minute overlap window for incremental queries
Advanced Monitoring Integration
Loki Integration for State History:
[unified_alerting.state_history]
backend = "multiple"
primary = "loki"
loki_remote_url = "https://{{ groups['loki'] | first }}:3100"
This enables:
- Alert state history: Stored in Loki for long-term retention
- Multi-backend support: Primary storage in Loki with annotations fallback
- HTTPS integration: Secure communication with Loki using Step-CA certificates
Security and Authentication
Password Management:
- name: 'Reset Grafana admin password.'
ansible.builtin.command:
cmd: grafana-cli admin reset-admin-password "{{ grafana.admin_password }}"
Security Headers: The configuration includes comprehensive security settings:
- TLS enforcement: HTTPS-only communication
- Cookie security: Secure cookie settings for HTTPS
- Content security: XSS protection and content type options enabled
Service Integration
Certificate Renewal Automation:
[Service]
Environment=CERT_LOCATION=/opt/grafana/tls/tls.crt \
KEY_LOCATION=/opt/grafana/tls/tls.key
# Restart Grafana service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active grafana.service || systemctl try-reload-or-restart grafana.service"
Systemd Integration:
- Service runs as dedicated grafana user
- Automatic startup and dependency management
- Integration with cluster-wide certificate renewal system
Dashboard Ecosystem
As mentioned earlier, the first dashboard in the repository was a community Prometheus Node Exporter dashboard (thanks again, Ricardo F!), used to validate the pipeline.
The dashboard management system provides:
- Version control: All dashboards tracked in Git
- Automatic updates: Regular synchronization from repository
- Folder organization: File system structure maps to Grafana folders
- Community integration: Easy incorporation of community dashboards
Monitoring Stack Integration
As described above, wiring Grafana up to Prometheus Node Exporter also prompted a couple of additions to the Node Exporter Helm chart, which are managed through the Argo CD Application.
Node Exporter Enhancement:
- Additional collectors: --collector.systemd, --collector.processes
- GitOps deployment: Changes managed through Argo CD
- Automatic synchronization: Dashboard updates reflect new metrics immediately
This comprehensive Grafana setup provides a production-ready observability platform that integrates seamlessly with the broader Goldentooth monitoring ecosystem, combining security, automation, and extensibility.
Loki
This, the previous "article" (on Grafana), and the next one (on Vector), are occurring mostly in parallel so that I can validate these services as I go.
Loki is... there's a whole lot going on there.
Log Retention Configuration
I enabled a retention policy so that my logs wouldn't grow without bound until the end of time. This coincided with me noticing that my /var/log/journal directories had gotten up to about 4GB, which led me to perform a similar change in the journald configuration.
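The journald side of that change looks roughly like this (a sketch; the specific size and retention caps shown are illustrative, not necessarily the values I settled on):
# Cap journald disk usage via a drop-in, then restart journald.
sudo mkdir -p /etc/systemd/journald.conf.d
printf '[Journal]\nSystemMaxUse=500M\nMaxRetentionSec=7day\n' \
  | sudo tee /etc/systemd/journald.conf.d/retention.conf
sudo systemctl restart systemd-journald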
Retention Policy Configuration:
limits_config:
retention_period: 168h # 7 days
compactor:
working_directory: /tmp/retention
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 5
delete_request_store: filesystem
I reduced the retention_delete_worker_count from 150 to 5 🙂 This optimization:
- Reduces resource usage: Less CPU overhead on Raspberry Pi nodes
- Maintains efficiency: 5 workers sufficient for 7-day retention window
- Prevents overload: Avoids overwhelming the Pi's limited resources
Consul Integration for Ring Management
I also configured Loki to use Consul as its ring kvstore, which involved sketching out an ACL policy and generating a token, but nothing too weird. (Assuming that it works.)
Ring Configuration:
common:
ring:
kvstore:
store: consul
consul:
acl_token: {{ loki_consul_token }}
host: {{ ipv4_address }}:8500
Consul ACL Policy (loki.policy.hcl):
key_prefix "collectors/" {
policy = "write"
}
key_prefix "loki/" {
policy = "write"
}
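Creating that policy and token amounts to something like this (a sketch; the policy name and token description are assumptions):
# Register the policy above and mint a token bound to it.
consul acl policy create -name loki -rules @loki.policy.hcl
consul acl token create -description "loki ring kvstore" -policy-name loki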
This integration provides:
- Service discovery: Automatic discovery of Loki components
- Consistent hashing: Proper ring distribution for ingester scaling
- High availability: Shared state management across cluster nodes
- Security: ACL-based access control to Consul KV store
Comprehensive TLS Configuration
The next several hours involved cleanup after I rashly configured Loki to use TLS. I didn't know that I'd then need to configure Loki to communicate with itself via TLS, and that I would have to do so in several different places and that those places would have different syntax for declaring the same core ideas (CA cert, TLS cert, TLS key).
Server TLS Configuration
GRPC and HTTP Server:
server:
grpc_listen_address: {{ ipv4_address }}
grpc_listen_port: 9096
grpc_tls_config: &http_tls_config
cert_file: "{{ loki.cert_path }}"
key_file: "{{ loki.key_path }}"
client_ca_file: "{{ step_ca.root_cert_path }}"
client_auth_type: "VerifyClientCertIfGiven"
http_listen_address: {{ ipv4_address }}
http_listen_port: 3100
http_tls_config: *http_tls_config
TLS Features:
- Mutual TLS: Client certificate verification when provided
- Step-CA Integration: Uses cluster certificate authority
- YAML Anchors: Reuses TLS config across components to reduce duplication
Component-Level TLS Configuration
Frontend Configuration:
frontend:
grpc_client_config: &grpc_client_config
tls_enabled: true
tls_cert_path: "{{ loki.cert_path }}"
tls_key_path: "{{ loki.key_path }}"
tls_ca_path: "{{ step_ca.root_cert_path }}"
tail_tls_config:
tls_cert_path: "{{ loki.cert_path }}"
tls_key_path: "{{ loki.key_path }}"
tls_ca_path: "{{ step_ca.root_cert_path }}"
Pattern Ingester TLS:
pattern_ingester:
metric_aggregation:
loki_address: {{ ipv4_address }}:3100
use_tls: true
http_client_config:
tls_config:
ca_file: "{{ step_ca.root_cert_path }}"
cert_file: "{{ loki.cert_path }}"
key_file: "{{ loki.key_path }}"
Internal Component Communication
The configuration ensures TLS across all internal communications:
- Ingester Client: grpc_client_config: *grpc_client_config
- Frontend Worker: grpc_client_config: *grpc_client_config
- Query Scheduler: grpc_client_config: *grpc_client_config
- Ruler: Uses separate alertmanager client TLS config
And holy crap, the Loki site is absolutely awful for finding and understanding where some configuration is needed.
Advanced Configuration Features
Pattern Recognition and Analytics
Pattern Ingester:
pattern_ingester:
enabled: true
metric_aggregation:
loki_address: {{ ipv4_address }}:3100
use_tls: true
This enables:
- Log pattern detection: Automatic recognition of log patterns
- Metric generation: Convert log patterns to Prometheus metrics
- Performance insights: Understanding log volume and patterns
Schema and Storage Configuration
TSDB Schema (v13):
schema_config:
configs:
- from: 2020-10-24
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
Storage Paths:
common:
path_prefix: /tmp/loki
storage:
filesystem:
chunks_directory: /tmp/loki/chunks
rules_directory: /tmp/loki/rules
Query Performance Optimization
Caching Configuration:
query_range:
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 20
Performance Features:
- Embedded cache: 20MB query result cache for faster repeated queries
- Protobuf encoding: Efficient data serialization for frontend communication
- Concurrent streams: 1000 max concurrent GRPC streams
Certificate Management Integration
Automatic Certificate Renewal:
[Service]
Environment=CERT_LOCATION={{ loki.cert_path }} \
KEY_LOCATION={{ loki.key_path }}
# Restart Loki service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active loki.service || systemctl try-reload-or-restart loki.service"
Certificate Lifecycle:
- 24-hour validity: Short-lived certificates for enhanced security
- Automatic renewal: cert-renewer@loki.timer handles renewal
- Service restart: Seamless certificate updates with service reload
- Step-CA integration: Consistent with cluster-wide PKI infrastructure
Monitoring and Alerting Integration
Ruler Configuration:
ruler:
alertmanager_url: http://{{ ipv4_address }}:9093
alertmanager_client:
tls_cert_path: "{{ loki.cert_path }}"
tls_key_path: "{{ loki.key_path }}"
tls_ca_path: "{{ step_ca.root_cert_path }}"
Observability Features:
- Structured logging: JSON format for better parsing
- Debug logging: Detailed logging for troubleshooting
- Request logging: Log requests at info level for monitoring
- Grafana integration: Primary storage for alert state history
Deployment Architecture
Single-Node Deployment: Currently deployed on the inchfield node
Replication Factor: 1 (appropriate for single-node setup)
Resource Optimization: Configured for Raspberry Pi resource constraints
Integration Points:
- Vector: Log shipping from all cluster nodes
- Grafana: Log visualization and alerting
- Prometheus: Metrics scraping from Loki endpoints
This comprehensive Loki configuration provides a production-ready log aggregation platform with enterprise-grade security, retention management, and integration capabilities, despite the complexity of getting all the TLS configurations properly aligned across the numerous internal components.
Vector
This and the two previous "articles" (on Grafana and on Loki) are occurring mostly in parallel so that I can validate these services as I go.
The main thing I wanted to do immediately with Vector was hook up more sources. A couple were turnkey (journald, kubernetes_logs, internal_logs), but most were just log files. These latter are not currently parsed according to any specific format, so I'll need to revisit this and extract as much information as possible from each.
It would also be good for me to inject some more fields into this that are set on a per-node level. I already have hostname, but I should probably inject IP address, etc, and anything else I can think of.
Other than that, it doesn't really seem like there's a lot to discuss here. Vector's cool, though. And in the future, I should remember that adding a whole bunch of log files into Vector from ten nodes, all at once, is not a great idea, as it will flood the Loki sink...
New Server!
Today I saw Beyond NanoGPT: Go From LLM Beginner to AI Researcher! on HackerNews, and while I'm less interested than most in LLMs specifically, I'm still interested.
The notes included the following:
The codebase will generally work with either a CPU or GPU, but most implementations basically require a GPU as they will be untenably slow otherwise. I recommend either a consumer laptop with GPU, paying for Colab/Runpod, or simply asking a compute provider or local university for a compute grant if those are out of budget (this works surprisingly well, people are very generous).
If this was expected to be slow on a standard CPU, it'd probably be unbearable (or not run at all) on a Pi, so this gave me pause 🤔
Fortunately, I had a solution. I have an extra PC that's a few years old but still relatively beefy (a Ryzen 9 3900X (12 cores) with 32GB RAM and an RTX 2070 Super). I built it as a VR PC and my kid and I haven't played VR in quite a while, so... it's just kinda sitting there. But it occurred to me that it was probably sufficiently powerful to run most of Beyond NanoGPT, and if it struggled with anything I might be able to upgrade to an RTX 4XXX or 5XXX.
Of course, this single machine by itself dominates the rest of Goldentooth, so I'll need to take some steps to minimize its usefulness.
Setup
I installed Ubuntu 24.04, which I felt was probably a decent parity for the Raspberry Pi OS on Goldentooth. Perhaps I should've installed Ubuntu on the Pis as well, but hindsight is 20/20 and I don't have enough complaints about RPOS to switch now. At some point, SD cards are going to start dropping like flies and I'll probably make the switch at that time.
The installation itself was over in a flash, quickly enough that I thought something might've failed. Admittedly, it's been a while since I've installed Ubuntu Server Minimal on a modern-ish PC.
After that, I just needed to lug the damned thing down to the basement, wire it in, and start running Ansible playbooks on it to set it up. A few minutes later:
Hello, Velaryon!
Oh, and install Nvidia's kernel modules and other tools. None of that was particularly difficult, although it was a tad more irritating than it should've been.
Once I had the GPU showing up, and the relevant tools and libraries installed, I wanted to verify that I could actually run things on the GPU, so I checked out NVIDIA's cuda-samples and built 'em.
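Building them looks roughly like this (a sketch; recent cuda-samples releases build with CMake, which matches the build/ path in the output below):
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples && mkdir build && cd build
cmake .. && make -j"$(nproc)"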
With that done:
🐠nathan@velaryon:~/cuda-samples/build/Samples/1_Utilities/deviceQueryDrv
$ ./deviceQueryDrv
./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version: 12.9
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7786 MBytes (8164081664 bytes)
(40) Multiprocessors, ( 64) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1770 MHz (1.77 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 45 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS
Not the sexiest thing I've ever seen, but it's a step in the right direction.
Kubernetes
Again, I only want this machine to run in very limited circumstances. I figure it'll make a nice box for cross-compiling, and for running GPU-heavy workloads when necessary, but otherwise I want it to stay in the background.
After I added it to the Kubernetes cluster:
I tainted it to prevent standard pods from being scheduled on it:
kubectl taint nodes velaryon gpu=true:NoSchedule
and labeled it so that pods requiring a GPU would be scheduled on it:
kubectl label nodes velaryon gpu=true arch=amd64
Now, any pod I wish to run on this node should have the following:
tolerations:
- key: "gpu"
operator: "Equal"
value: "true"
effect: "NoSchedule"
nodeSelector:
gpu: "true"
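A quick way to sanity-check the taint and labels (illustrative commands, not captured output):
# Confirm the label and the NoSchedule taint on velaryon.
kubectl get nodes -l gpu=true
kubectl describe node velaryon | grep -A1 Taints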
Nomad
A similar tweak was needed for the nomad.hcl config:
{% if clean_hostname in groups['nomad_client'] -%}
client {
enabled = true
node_class = "{{ nomad.client.node_class }}"
meta {
arch = "{{ ansible_architecture }}"
gpu = "{{ 'true' if 'gpu' == nomad.client.node_class else 'false' }}"
}
}
{% endif %}
I think this will work for a constraint:
constraint {
attribute = "${node.class}"
operator = "="
value = "gpu"
}
But I haven't tried it yet.
After applying, we see the class show up:
🐠root@velaryon:~
$ nomad node status
ID Node Pool DC Name Class Drain Eligibility Status
76ff3ff3 default dc1 velaryon gpu false eligible ready
30ffab50 default dc1 inchfield default false eligible ready
db6ae26b default dc1 gardener default false eligible ready
02174920 default dc1 jast default false eligible ready
9ffa31f3 default dc1 fenn default false eligible ready
01f0cd94 default dc1 harlton default false eligible ready
793b9a5a default dc1 erenford default false eligible ready
Other than that, it should get the standard complement of features - Vector, Consul, etc. I initially set up Slurm, then undid it; I felt it would just complicate matters.
New Rack!
I poked around a bit and realized that I had two extra Raspberry Pi 4B+'s, so I ended up spending an absolutely absurd amount of money to build a 10" rack and get all of the existing and new Pis into it, along with some fans, 5V and 12V power supplies, a 16-port switch, etc. It was absolutely ridiculous and I would not recommend this course of action to anyone, and I'll never financially recover from this.
The main goal of this was to take my existing Picocluster (which was screwed together and looked nice and... well, was already paid for) and have something where I could pull out an individual Pi and replace or repair it if I needed. Another issue was that I didn't really have any substantial external storage, e.g. SSDs.
I've been playing with some other things recently, and have delayed updating this too much. I was intending my current focus to be the next article in this clog, but I think it's going to take quite a lot longer (and will likely be the subject of a great many articles), so I think in the meantime I need to return to the subject of the actual cluster and progress it along.
TLS Certificate Renewal
So some time back I configured step-ca to generate TLS certificates for various services, but I gave the certs very short lifetimes and didn't set up renewal, so... whenever I step away from the cluster for a few days, everything breaks 🙃
Today's goal is to fix that.
$ consul members
Error retrieving members: Get "http://127.0.0.1:8500/v1/agent/members?segment=_all": dial tcp 127.0.0.1:8500: connect: connection refused
Indeed, very little is working.
Fortunately, step-ca provides good instructions for dealing with this sort of situation. I created a cert-renewer@.service template unit:
[Unit]
Description=Certificate renewer for %I
After=network-online.target
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
StartLimitIntervalSec=0
; PartOf=cert-renewer.target
[Service]
Type=oneshot
User=root
Environment=STEPPATH=/etc/step-ca \
CERT_LOCATION=/etc/step/certs/%i.crt \
KEY_LOCATION=/etc/step/certs/%i.key
; ExecCondition checks if the certificate is ready for renewal,
; based on the exit status of the command.
; (In systemd <242, you can use ExecStartPre= here.)
ExecCondition=/usr/bin/step certificate needs-renewal ${CERT_LOCATION}
; ExecStart renews the certificate, if ExecStartPre was successful.
ExecStart=/usr/bin/step ca renew --force ${CERT_LOCATION} ${KEY_LOCATION}
; Try to reload or restart the systemd service that relies on this cert-renewer
; If the relying service doesn't exist, forge ahead.
; (In systemd <229, use `reload-or-try-restart` instead of `try-reload-or-restart`)
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active %i.service || systemctl try-reload-or-restart %i"
[Install]
WantedBy=multi-user.target
and cert-renewer@.timer
:
[Unit]
Description=Timer for certificate renewal of %I
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
; PartOf=cert-renewer.target
[Timer]
Persistent=true
; Run the timer unit every 5 minutes.
OnCalendar=*:1/5
; Always run the timer on time.
AccuracySec=1us
; Add jitter to prevent a "thundering herd" of simultaneous certificate renewals.
RandomizedDelaySec=1m
[Install]
WantedBy=timers.target
and the necessary Ansible to throw it into place, and synced that over.
Then I created an overrides file for Consul:
[Service]
; `Environment=` overrides are applied per environment variable. This line does not
; affect any other variables set in the service template.
Environment=CERT_LOCATION={{ consul.cert_path }} \
KEY_LOCATION={{ consul.key_path }}
WorkingDirectory={{ consul.key_path | dirname }}
; Restart Consul service after certificate renewal
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active consul.service || systemctl try-reload-or-restart consul.service"
Unfortunately, I couldn't update the Consul configuration because the TLS certs had expired:
TASK [goldentooth.setup_consul : Create a Consul agent policy for each node.] ****************************************************
Wednesday 16 July 2025 18:43:18 -0400 (0:00:57.623) 0:01:24.371 ********
skipping: [bettley]
skipping: [cargyll]
skipping: [dalt]
FAILED - RETRYING: [allyrion -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [harlton -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [erenford -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [inchfield -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [velaryon -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [jast -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [karstark -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [fenn -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [gardener -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [lipps -> bettley]: Create a Consul agent policy for each node. (3 retries left).
FAILED - RETRYING: [allyrion -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [harlton -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [erenford -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [jast -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [inchfield -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [velaryon -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [fenn -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [gardener -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [karstark -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [lipps -> bettley]: Create a Consul agent policy for each node. (2 retries left).
FAILED - RETRYING: [allyrion -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [harlton -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [erenford -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [fenn -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [jast -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [inchfield -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [velaryon -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [gardener -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [lipps -> bettley]: Create a Consul agent policy for each node. (1 retries left).
FAILED - RETRYING: [karstark -> bettley]: Create a Consul agent policy for each node. (1 retries left).
fatal: [allyrion -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [harlton -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [erenford -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [fenn -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [jast -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [inchfield -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [velaryon -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [gardener -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [karstark -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
fatal: [lipps -> bettley]: FAILED! => changed=false
attempts: 3
msg: Could not connect to consul agent at bettley:8500, error was <urlopen error [Errno 111] Connection refused>
And it was then that I noticed that the dates on all of the Raspberry Pis were off by about 8 days 😑. I'd never set up NTP. A quick Ansible playbook later, every Pi agrees on the same date and time, but now:
● consul.service - "HashiCorp Consul"
Loaded: loaded (/etc/systemd/system/consul.service; enabled; preset: enabled)
Active: active (running) since Wed 2025-07-16 18:51:09 EDT; 13s ago
Docs: https://www.consul.io/
Main PID: 733215 (consul)
Tasks: 9 (limit: 8737)
Memory: 19.4M
CPU: 551ms
CGroup: /system.slice/consul.service
└─733215 /usr/bin/consul agent -config-dir=/etc/consul.d
Jul 16 18:51:09 bettley consul[733215]: gRPC TLS: Verify Incoming: true, Min Version: TLSv1_2
Jul 16 18:51:09 bettley consul[733215]: Internal RPC TLS: Verify Incoming: true, Verify Outgoing: true (Verify Hostname: true), Min Version: TLSv1_2
Jul 16 18:51:09 bettley consul[733215]: ==> Log data will now stream in as it occurs:
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.903-0400 [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.903-0400 [WARN] agent: bootstrap_expect > 0: expecting 3 servers
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.963-0400 [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.963-0400 [WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.966-0400 [WARN] agent: keyring doesn't include key provided with -encrypt, using keyring: keyring=WAN
Jul 16 18:51:09 bettley consul[733215]: 2025-07-16T18:51:09.967-0400 [ERROR] agent: startup error: error="refusing to rejoin cluster because server has been offline for more than the configured server_rejoin_age_max (168h0m0s) - consider wiping your data dir"
Jul 16 18:51:19 bettley consul[733215]: 2025-07-16T18:51:19.968-0400 [ERROR] agent: startup error: error="refusing to rejoin cluster because server has been offline for more than the configured server_rejoin_age_max (168h0m0s) - consider wiping your data dir"
It won't rebuild the cluster because it's been offline too long 🙃 So I had to zap a file on the nodes:
$ goldentooth command bettley,cargyll,dalt 'sudo rm -rf /opt/consul/server_metadata.json*'
dalt | CHANGED | rc=0 >>
bettley | CHANGED | rc=0 >>
cargyll | CHANGED | rc=0 >>
and then I was able to restart the cluster.
As it turned out, I had to rotate the Consul certificates anyway, since they were invalid, but I think it's working now. I've shortened the cert lifetime to 24 hours, so I should find out pretty quickly 🙂
After that, it's the same procedure (rotate the certs, then re-setup the app and install the cert renewal timer) for Grafana, Loki, Nomad, Vault, and Vector.
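To confirm the timers are actually doing their job, a couple of quick checks help (illustrative; substitute each service's actual cert path):
# Are the renewal timers scheduled and firing?
systemctl list-timers 'cert-renewer@*'
# Spot-check a certificate's remaining lifetime.
step certificate inspect --short /path/to/service.crt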
SSH Certificates
So remember back in chapter 32 when I set up Step-CA as our internal certificate authority? Step-CA also handles SSH certificates, which allows a less peer-to-peer model for authenticating between nodes. I'd actually tried to set these up before and it was an enormous pain in the ass and didn't really work well, so when I saw that Step-CA included SSH certificates in its featureset, I was excited.
It's very easy to allow authorized_keys to grow without bound, and I'm fairly sure very few people actually read these messages:
The authenticity of host 'wtf.node.goldentooth.net (192.168.10.51)' can't be established.
ED25519 key fingerprint is SHA256:8xKJ5Fw6K+YFGxqR5EWsM4w3t5Y7MzO1p3G9kPvXHDo.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
So I wanted something that would allow seamless interconnection between the nodes while maintaining good security.
SSH certificates solve both of these problems elegantly. Instead of managing individual keys, you have a certificate authority that signs certificates. For user authentication, the SSH server trusts the CA's public key. For host authentication, your SSH client trusts the CA's public key.
It's basically the same model as TLS certificates, but for SSH. And since we already have Step-CA running, why not use it?
The Implementation
I created an Ansible role called goldentooth.setup_ssh_certificates to handle all of this. Let me walk through what it does.
Setting Up the CA Trust
First, we need to grab the SSH CA public keys from our Step-CA server. There are actually two different keys - one for signing user certificates and one for signing host certificates:
- name: 'Get SSH User CA public key'
ansible.builtin.slurp:
src: "{{ step_ca.ca.etc_path }}/certs/ssh_user_ca_key.pub"
register: 'ssh_user_ca_key_b64'
delegate_to: "{{ step_ca.server }}"
run_once: true
become: true
- name: 'Get SSH Host CA public key'
ansible.builtin.slurp:
src: "{{ step_ca.ca.etc_path }}/certs/ssh_host_ca_key.pub"
register: 'ssh_host_ca_key_b64'
delegate_to: "{{ step_ca.server }}"
run_once: true
become: true
Then we configure sshd to trust certificates signed by our User CA:
- name: 'Configure sshd to trust User CA'
ansible.builtin.lineinfile:
path: '/etc/ssh/sshd_config'
regexp: '^#?TrustedUserCAKeys'
line: 'TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub'
state: 'present'
validate: '/usr/sbin/sshd -t -f %s'
notify: 'reload sshd'
Host Certificates
For host certificates, we generate a certificate for each node that includes multiple principals (names the certificate is valid for):
- name: 'Generate SSH host certificate'
ansible.builtin.shell:
cmd: |
step ssh certificate \
--host \
--sign \
--force \
--no-password \
--insecure \
--provisioner="{{ step_ca.default_provisioner.name }}" \
--provisioner-password-file="{{ step_ca.default_provisioner.password_path }}" \
--principal="{{ ansible_hostname }}" \
--principal="{{ ansible_hostname }}.{{ cluster.node_domain }}" \
--principal="{{ ansible_hostname }}.{{ cluster.domain }}" \
--principal="{{ ansible_default_ipv4.address }}" \
--ca-url="https://{{ hostvars[step_ca.server].ipv4_address }}:9443" \
--root="{{ step_ca.root_cert_path }}" \
--not-after=24h \
{{ ansible_hostname }} \
/etc/step/certs/ssh_host.key.pub
Automatic Certificate Renewal
Notice the --not-after=24h
? Yeah, these certificates expire daily. Which means it's very important that the automatic renewal works 😀
Enter systemd timers:
[Unit]
Description=Timer for SSH host certificate renewal
Documentation=https://smallstep.com/docs/step-cli/reference/ssh/certificate
[Timer]
OnBootSec=5min
OnUnitActiveSec=15min
RandomizedDelaySec=5min
[Install]
WantedBy=timers.target
This runs every 15 minutes (with some randomization to avoid thundering herd problems). The service itself checks if the certificate needs renewal before actually doing anything:
# Check if certificate needs renewal
ExecCondition=/usr/bin/step certificate needs-renewal /etc/step/certs/ssh_host.key-cert.pub
User Certificates
For user certificates, I set up both root and my regular user account. The process is similar - generate a certificate with appropriate principals:
- name: 'Generate root user SSH certificate'
ansible.builtin.shell:
cmd: |
step ssh certificate \
--sign \
--force \
--no-password \
--insecure \
--provisioner="{{ step_ca.default_provisioner.name }}" \
--provisioner-password-file="{{ step_ca.default_provisioner.password_path }}" \
--principal="root" \
--principal="{{ ansible_hostname }}-root" \
--ca-url="https://{{ hostvars[step_ca.server].ipv4_address }}:9443" \
--root="{{ step_ca.root_cert_path }}" \
--not-after=24h \
root@{{ ansible_hostname }} \
/etc/step/certs/root_ssh_key.pub
Then configure SSH to actually use the certificate:
- name: 'Configure root SSH to use certificate'
ansible.builtin.blockinfile:
path: '/root/.ssh/config'
create: true
owner: 'root'
group: 'root'
mode: '0600'
block: |
Host *
CertificateFile /etc/step/certs/root_ssh_key-cert.pub
IdentityFile /etc/step/certs/root_ssh_key
marker: '# {mark} ANSIBLE MANAGED BLOCK - SSH CERTIFICATE'
The Trust Configuration
For the client side, we need to tell SSH to trust host certificates signed by our CA:
- name: 'Configure SSH client to trust Host CA'
ansible.builtin.lineinfile:
path: '/etc/ssh/ssh_known_hosts'
line: "@cert-authority * {{ ssh_host_ca_key }}"
create: true
owner: 'root'
group: 'root'
mode: '0644'
And since we're all friends here in the cluster, I disabled strict host key checking for cluster nodes:
- name: 'Disable StrictHostKeyChecking for cluster nodes'
ansible.builtin.blockinfile:
path: '/etc/ssh/ssh_config'
block: |
Host *.{{ cluster.node_domain }} *.{{ cluster.domain }}
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
marker: '# {mark} ANSIBLE MANAGED BLOCK - CLUSTER SSH CONFIG'
Is this less secure? Technically yes. Do I care? Not really. These are all nodes in my internal cluster that I control. The certificates provide the actual authentication.
The Results
After running the playbook, I can now SSH between any nodes in the cluster without passwords or key management:
root@bramble-ca:~# ssh bramble-01
Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-1017-raspi aarch64)
...
Last login: Sat Jul 19 00:15:23 2025 from 192.168.10.50
root@bramble-01:~#
No host key verification prompts. No password prompts. Just instant access.
And the best part? I can verify that certificates are being used:
root@bramble-01:~# ssh-keygen -L -f /etc/step/certs/ssh_host.key-cert.pub
/etc/step/certs/ssh_host.key-cert.pub:
Type: ssh-ed25519-cert-v01@openssh.com host certificate
Public key: ED25519-CERT SHA256:M5PQn6zVH7xJL+OFQzH4yVwR5EHrF2xQPm9QR5xKXBc
Signing CA: ED25519 SHA256:gNPpOqPsZW6YZDmhWQWqJ4l+L8E5Xgg8FQyAAbPi7Ss (using ssh-ed25519)
Key ID: "bramble-01"
Serial: 8485811653946933657
Valid: from 2025-07-18T20:13:42 to 2025-07-19T20:14:42
Principals:
bramble-01
bramble-01.node.goldentooth.net
bramble-01.goldentooth.net
192.168.10.51
Critical Options: (none)
Extensions: (none)
Look at that! The certificate is valid for exactly 24 hours and includes all the names I might use to connect to this host.
ZFS and Replication
So remember back in chapters 28 and 31 when I set up NFS exports using a USB thumbdrive? Obviously my crowning achievement as an infrastructure engineer.
After living with that setup for a bit, I finally got my hands on some SSDs. Not new ones, mind you – these are various drives I've accumulated over the years. Eight of them, to be precise:
- 3x 120GB SSDs
- 3x ~450GB SSDs
- 2x 1TB SSDs
Time to do something more serious with storage.
The Storage Strategy
I spent way too much time researching distributed storage options. GlusterFS? Apparently dead. Lustre? Way overkill for a Pi cluster, and the complexity-to-benefit ratio is terrible. BeeGFS? Same story.
So I decided to split the drives across three different storage systems:
- ZFS for the 3x 120GB drives – rock solid, great snapshot support, and I already know it
- Ceph for the 3x 450GB drives – the gold standard for distributed block storage in Kubernetes
- SeaweedFS for the 2x 1TB drives – interesting distributed object storage that's simpler than MinIO
Today we're tackling ZFS, because I actually have experience with it and it seemed like the easiest place to start.
The ZFS Setup
I created a role called goldentooth.setup_zfs to handle all of this. The basic idea is to set up ZFS on nodes that have SSDs attached, create datasets for shared storage, and then use Sanoid for snapshot management and Syncoid for replication between nodes.
First, let's install ZFS and configure it for the Pi's limited RAM:
- name: 'Install ZFS.'
ansible.builtin.apt:
name:
- 'zfsutils-linux'
- 'zfs-dkms'
- 'zfs-zed'
- 'sanoid'
state: 'present'
update_cache: true
- name: 'Configure ZFS Event Daemon.'
ansible.builtin.lineinfile:
path: '/etc/zfs/zed.d/zed.rc'
regexp: '^#?ZED_EMAIL_ADDR='
line: 'ZED_EMAIL_ADDR="{{ my.email }}"'
notify: 'Restart ZFS-zed service.'
- name: 'Limit ZFS ARC to 1GB of RAM.'
ansible.builtin.lineinfile:
path: '/etc/modprobe.d/zfs.conf'
line: 'options zfs zfs_arc_max=1073741824'
create: true
notify: 'Update initramfs.'
That ARC limit is important – by default ZFS will happily eat half your RAM for caching, which is not great when you only have 8GB to start with.
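Once the module option is in place (and after a reboot or module reload), the cap can be double-checked with a quick read of the kernel parameter:
# Current ARC cap in bytes, as seen by the ZFS kernel module.
cat /sys/module/zfs/parameters/zfs_arc_max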
Creating the Pool
The pool creation is straightforward. I'm not doing anything fancy like RAID-Z because I only have one SSD per node:
- name: 'Create ZFS pool.'
ansible.builtin.command: |
zpool create {{ zfs.pool.name }} {{ zfs.pool.device }}
args:
creates: "/{{ zfs.pool.name }}"
when: ansible_hostname == 'allyrion'
Wait, why when: ansible_hostname == 'allyrion'? Well, it turns out I'm only creating the pool on the primary node. The other nodes will receive the data via replication. This is a bit different from a typical ZFS setup where each node would have its own pool, but it makes sense for my use case.
Sanoid for Snapshots
Sanoid is a fantastic tool for managing ZFS snapshots. It handles creating snapshots on a schedule and pruning old ones according to a retention policy. The configuration is pretty simple:
# Primary dataset for source snapshots
[{{ zfs.pool.name }}/{{ zfs.datasets[0].name }}]
use_template = production
recursive = yes
autosnap = yes
autoprune = yes
[template_production]
frequently = 0
hourly = 36
daily = 30
monthly = 3
yearly = 0
autosnap = yes
autoprune = yes
This keeps 36 hourly snapshots, 30 daily snapshots, and 3 monthly snapshots. No yearly snapshots because, let's be honest, this cluster probably won't last that long without me completely rebuilding it.
Syncoid for Replication
Here's where it gets interesting. Syncoid is Sanoid's companion tool that handles ZFS replication. It's basically a smart wrapper around zfs send and zfs receive that handles all the complexity of incremental replication.
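Under the hood, what Syncoid automates is roughly the following (an illustrative sketch; the snapshot names are made up):
# Initial full replication of a snapshot, then an incremental follow-up.
zfs send rpool/data@snap1 | ssh root@gardener zfs receive rpool/data
zfs send -i rpool/data@snap1 rpool/data@snap2 | ssh root@gardener zfs receive rpool/data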
I set up systemd services and timers to handle the replication:
[Unit]
Description=Syncoid ZFS replication to %i
After=zfs-import.target
Requires=zfs-import.target
[Service]
Type=oneshot
ExecStart=/usr/sbin/syncoid --no-privilege-elevation {{ zfs.pool.name }}/{{ zfs.datasets[0].name }} root@%i:{{ zfs.pool.name }}/{{ zfs.datasets[0].name }}
StandardOutput=journal
StandardError=journal
The %i is systemd template magic – it gets replaced with whatever comes after the @ in the service name. So syncoid@bramble-01.service would replicate to bramble-01.
The timer runs every 15 minutes:
[Unit]
Description=Syncoid ZFS replication to %i timer
Requires=syncoid@%i.service
[Timer]
OnCalendar=*:0/15
RandomizedDelaySec=60
Persistent=true
SSH Configuration for Replication
Of course, Syncoid needs to SSH between nodes to do the replication. Initially, I tried to set this up with a separate SSH key for ZFS replication. That turned into such a mess that it actually motivated me to finally implement SSH certificates properly (see the previous chapter).
After setting up SSH certificates, I could simplify the configuration to just reference the certificates:
- name: 'Configure SSH config for ZFS replication using certificates.'
ansible.builtin.blockinfile:
path: '/root/.ssh/config'
create: true
mode: '0600'
block: |
# ZFS replication configuration using SSH certificates
{% for host in groups['zfs'] %}
{% if host != inventory_hostname %}
Host {{ host }}
HostName {{ hostvars[host]['ipv4_address'] }}
User root
CertificateFile /etc/step/certs/root_ssh_key-cert.pub
IdentityFile /etc/step/certs/root_ssh_key
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
{% endif %}
{% endfor %}
Much cleaner! No more key management, just point to the certificates that are already being automatically renewed. Sometimes a little pain is exactly what you need to motivate doing things the right way.
The Topology
The way I set this up, only the first node in the zfs group (allyrion) actually creates datasets and takes snapshots. The other nodes just receive replicated data:
- name: 'Enable and start Syncoid timers for replication targets.'
ansible.builtin.systemd:
name: "syncoid@{{ item }}.timer"
enabled: true
state: 'started'
loop: "{{ groups['zfs'] | reject('eq', inventory_hostname) | list }}"
when:
- groups['zfs'] | length > 1
- inventory_hostname == groups['zfs'][0] # Only run on first ZFS node (allyrion)
This creates a hub-and-spoke topology where allyrion is the primary and replicates to all other ZFS nodes. It's not the most resilient topology (if allyrion dies, no new snapshots), but it's simple and works for my needs.
Does It Work?
Let's check using the goldentooth CLI:
$ goldentooth command allyrion 'zfs list'
allyrion | CHANGED | rc=0 >>
NAME USED AVAIL REFER MOUNTPOINT
rpool 546K 108G 24K /rpool
rpool/data 53K 108G 25K /data
Nice! The pool is there. Now let's look at snapshots:
$ goldentooth command allyrion 'zfs list -t snapshot'
allyrion | CHANGED | rc=0 >>
NAME USED AVAIL REFER MOUNTPOINT
rpool/data@autosnap_2025-07-18_18:13:17_monthly 0B - 24K -
rpool/data@autosnap_2025-07-18_18:13:17_daily 0B - 24K -
rpool/data@autosnap_2025-07-18_18:13:17_hourly 0B - 24K -
rpool/data@autosnap_2025-07-18_19:00:03_hourly 0B - 24K -
rpool/data@autosnap_2025-07-18_20:00:10_hourly 0B - 24K -
...
rpool/data@autosnap_2025-07-19_14:00:15_hourly 0B - 24K -
rpool/data@syncoid_allyrion_2025-07-19:10:45:32-GMT-04:00 0B - 25K -
Excellent! Sanoid is creating snapshots hourly, daily, and monthly. That last snapshot with the "syncoid" prefix shows that replication is happening too.
And on the replica nodes? Let me check which nodes have ZFS:
$ goldentooth command gardener 'zfs list'
gardener | CHANGED | rc=0 >>
NAME USED AVAIL REFER MOUNTPOINT
rpool 600K 108G 25K /rpool
rpool/data 53K 108G 25K /rpool/data
The replica has the same dataset structure. And the snapshots?
$ goldentooth command gardener 'zfs list -t snapshot | head -5'
gardener | CHANGED | rc=0 >>
NAME USED AVAIL REFER MOUNTPOINT
rpool/data@autosnap_2025-07-18_18:13:17_monthly 0B - 24K -
rpool/data@autosnap_2025-07-18_18:13:17_daily 0B - 24K -
rpool/data@autosnap_2025-07-18_18:13:17_hourly 0B - 24K -
rpool/data@autosnap_2025-07-18_19:00:03_hourly 0B - 24K -
Perfect! The snapshots are being replicated from allyrion to gardener. The replication is working.
Performance
How's the performance? Well... it's ZFS on a single SSD connected to a Raspberry Pi. It's not going to win any benchmarks:
$ goldentooth command_root allyrion 'dd if=/dev/zero of=/data/test bs=1M count=100'
allyrion | CHANGED | rc=0 >>
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.205277 s, 511 MB/s
511 MB/s writes! That's... actually surprisingly good for a Pi with a SATA SSD over USB3. Clearly the ZFS caching is helping here, but even so, that's plenty fast for shared configuration files, build artifacts, and other cluster data.
Expanding the Kubernetes Cluster
With the Goldentooth cluster continuing to evolve, it was time to bring two more nodes into the Kubernetes fold... Karstark and Lipps, two Raspberry Pi 4Bs (4GB) that were just kinda sitting around.
The Current State
Before the expansion, our Kubernetes cluster consisted of:
- Control Plane (3 nodes): bettley, cargyll, dalt
- Workers (7 nodes): erenford, fenn, gardener, harlton, inchfield, jast, velaryon
Karstark and Lipps were already fully integrated into the cluster infrastructure:
- Both were part of the Consul service mesh as clients
- Both were configured as Nomad clients for workload scheduling
- Both were included in other cluster services like Ray and Slurm
However, they weren't yet part of the Kubernetes cluster, which meant we were missing out on their compute capacity for containerized workloads.
Installing Kubernetes Packages
The first step was to ensure both nodes had the necessary Kubernetes packages installed. Using the goldentooth CLI:
ansible-playbook -i inventory/hosts playbooks/install_k8s_packages.yaml --limit karstark,lipps
This playbook handled:
- Installing and configuring containerd as the container runtime
- Installing kubeadm, kubectl, and kubelet packages
- Setting up proper systemd cgroup configuration
- Enabling and starting the kubelet service
Both nodes successfully installed Kubernetes v1.32.7, which was slightly newer than the existing cluster nodes running v1.32.3.
The Challenge: Certificate Issues
When attempting to use the standard goldentooth bootstrap_k8s command, we ran into certificate verification issues. The bootstrap process was timing out when trying to communicate with the Kubernetes API server.
The error manifested as:
tls: failed to verify certificate: x509: certificate signed by unknown authority
This is a common issue in clusters that have been running for a while (393 days in our case) and have undergone certificate rotations or updates.
The Solution: Manual Join Process
Instead of relying on the automated bootstrap, I took a more direct approach:
- Generate a join token from the control plane:
goldentooth command_root bettley "kubeadm token create --print-join-command"
- Execute the join command on both nodes:
goldentooth command_root karstark,lipps "kubeadm join 10.4.0.10:6443 --token yi3zz8.qf0ziy9ce7nhnkjv --discovery-token-ca-cert-hash sha256:0d6c8981d10e407429e135db4350e6bb21382af57addd798daf6c3c5663ac964 --skip-phases=preflight"
The --skip-phases=preflight flag was key here, as it bypassed the problematic preflight checks while still allowing the nodes to join successfully.
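For future reference, if the join command ever needs to be rebuilt by hand, the discovery hash is just the SHA-256 of the cluster CA's public key. The standard kubeadm recipe, run on any control plane node, looks something like:
# Recompute the --discovery-token-ca-cert-hash value from the cluster CA certificate
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex | sed 's/^.* //'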
Verification
After the join process completed, both nodes appeared in the cluster:
goldentooth command_root bettley "kubectl get nodes"
NAME STATUS ROLES AGE VERSION
bettley Ready control-plane 393d v1.32.3
cargyll Ready control-plane 393d v1.32.3
dalt Ready control-plane 393d v1.32.3
erenford Ready <none> 393d v1.32.3
fenn Ready <none> 393d v1.32.3
gardener Ready <none> 393d v1.32.3
harlton Ready <none> 393d v1.32.3
inchfield Ready <none> 393d v1.32.3
jast Ready <none> 393d v1.32.3
karstark Ready <none> 53s v1.32.7
lipps Ready <none> 54s v1.32.7
velaryon Ready <none> 52d v1.32.5
Perfect! Both nodes transitioned from "NotReady" to "Ready" status within about a minute, indicating that the Calico CNI networking had successfully configured them.
The New Topology
Our Kubernetes cluster now consists of:
- Control Plane (3 nodes): bettley, cargyll, dalt
- Workers (9 nodes): erenford, fenn, gardener, harlton, inchfield, jast, karstark, lipps, velaryon (GPU)
This brings us to a total of 12 nodes in the Kubernetes cluster, matching the full complement of our Raspberry Pi bramble plus the x86 GPU node.
GPU Node Configuration
Velaryon, my x86 GPU node, required special configuration to ensure GPU workloads are only scheduled intentionally:
Hardware Specifications
- GPU: NVIDIA GeForce RTX 2070 (8GB VRAM)
- CPU: 24 cores (x86_64)
- Memory: 32GB RAM
- Architecture: amd64
Kubernetes Configuration
The node is configured with:
- Label: gpu=true for workload targeting
- Taint: gpu=true:NoSchedule to prevent accidental scheduling
- Architecture: arch=amd64 for x86-specific workloads
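Applying that label and taint is a one-time kubectl operation, roughly:
# Label the GPU node for targeting, and taint it so nothing lands there by accident
kubectl label node velaryon gpu=true
kubectl taint node velaryon gpu=true:NoSchedule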
Scheduling Requirements
To schedule workloads on Velaryon, pods must include:
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
nodeSelector:
gpu: "true"
This ensures that only workloads explicitly designed for GPU execution can access the expensive GPU resources, following the same intentional scheduling pattern used with Nomad.
GPU Resource Detection Challenge
While the taint-based scheduling was working correctly, getting Kubernetes to actually detect and expose the GPU resources proved more challenging. The NVIDIA device plugin is responsible for discovering GPUs and advertising them as nvidia.com/gpu resources that pods can request.
Initial Problem
The device plugin was failing with the error:
E0719 16:20:41.050191 1 factory.go:115] Incompatible platform detected
E0719 16:20:41.050193 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
Despite having installed the NVIDIA Container Toolkit and configuring containerd, the device plugin couldn't detect the NVML library from within its container environment.
The Root Cause
The issue was that the device plugin container couldn't access:
- NVIDIA Management Library: libnvidia-ml.so.1, needed for GPU discovery
- Device files: /dev/nvidia*, required for direct GPU communication
- Proper privileges: needed to interact with kernel-level GPU drivers
The Solution
After several iterations, the working configuration required:
Library Access:
volumeMounts:
- name: nvidia-ml-lib
mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
readOnly: true
- name: nvidia-ml-actual
mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
readOnly: true
Device Access:
volumeMounts:
- name: dev
mountPath: /dev
volumes:
- name: dev
hostPath:
path: /dev
Container Privileges:
securityContext:
privileged: true
Verification
Once properly configured, the device plugin successfully reported:
I0719 16:56:06.462937 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0719 16:56:06.463631 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0719 16:56:06.465420 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
The GPU resource then appeared in the node's capacity:
kubectl get nodes -o json | jq '.items[] | select(.metadata.name=="velaryon") | .status.capacity'
{
"cpu": "24",
"ephemeral-storage": "102626232Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "32803048Ki",
"nvidia.com/gpu": "1",
"pods": "110"
}
Testing GPU Resource Allocation
To verify the system was working end-to-end, I created a test pod that:
- Requests GPU resources: nvidia.com/gpu: 1
- Includes proper tolerations: To bypass the gpu=true:NoSchedule taint
- Targets the GPU node: Using the gpu: "true" node selector
apiVersion: v1
kind: Pod
metadata:
name: gpu-test-workload
spec:
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
nodeSelector:
gpu: "true"
containers:
- name: gpu-test
image: busybox
command: ["sleep", "60"]
resources:
requests:
nvidia.com/gpu: 1
limits:
nvidia.com/gpu: 1
The pod successfully scheduled and the node showed:
nvidia.com/gpu 1 1
This confirmed that GPU resource allocation tracking was working correctly.
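That requests/limits line is visible in the node's resource accounting, which makes for a quick check from any machine with kubectl access:
# Show requested vs. limit totals for the GPU (and everything else) on the node
kubectl describe node velaryon | grep -A8 'Allocated resources'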
Final NVIDIA Device Plugin Configuration
For reference, here's the complete working NVIDIA device plugin DaemonSet configuration:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: kube-system
spec:
selector:
matchLabels:
name: nvidia-device-plugin-ds
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
name: nvidia-device-plugin-ds
spec:
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
nodeSelector:
gpu: "true"
priorityClassName: system-node-critical
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
value: "false"
securityContext:
privileged: true
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: dev
mountPath: /dev
- name: nvidia-ml-lib
mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
readOnly: true
- name: nvidia-ml-actual
mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
readOnly: true
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
- name: dev
hostPath:
path: /dev
- name: nvidia-ml-lib
hostPath:
path: /lib/x86_64-linux-gnu/libnvidia-ml.so.1
- name: nvidia-ml-actual
hostPath:
path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.575.51.03
Key aspects of this configuration:
- Targeted deployment: Only runs on nodes with the gpu: "true" label
- Taint tolerance: Can schedule on nodes with the gpu=true:NoSchedule taint
- Privileged access: Required for kernel-level GPU driver interaction
- Library binding: Specific mounts for NVIDIA ML library files
- Device access: Full /dev mount for GPU device communication
GPU Storage NFS Export
With the cluster expanded to include the Velaryon GPU node, a natural question emerged: how can the Raspberry Pi cluster nodes efficiently exchange data with the GPU node for machine learning workloads and other compute-intensive tasks?
The solution was to leverage Velaryon's secondary 1TB NVMe SSD and expose it to the entire cluster via NFS, creating a high-speed shared storage pool specifically for Pi-GPU data exchange.
The Challenge
Velaryon came with two storage devices:
- Primary NVMe (nvme1n1): Linux system drive
- Secondary NVMe (nvme0n1): 1TB drive with old NTFS partitions from a previous Windows installation
The goal was to repurpose this secondary drive as shared storage while maintaining architectural separation - the GPU node should provide storage services without becoming a structural component of the Pi cluster.
Storage Architecture Decision
Rather than integrating Velaryon into the existing storage ecosystem (ZFS replication, Ceph distributed storage), I opted for a simpler approach:
- Pure ext4: Single partition consuming the entire 1TB drive
- NFS export: Simple, performant network filesystem
- Subnet-wide access: Available to all 10.4.x.x nodes
This keeps the GPU node loosely coupled while providing the needed functionality.
Implementation
Drive Preparation
First, I cleared the old NTFS partitions and created a fresh GPT layout:
# Clear existing partition table
sudo wipefs -af /dev/nvme0n1
# Create new GPT partition table and single partition
sudo parted /dev/nvme0n1 mklabel gpt
sudo parted /dev/nvme0n1 mkpart primary ext4 0% 100%
# Format as ext4
sudo mkfs.ext4 -L gpu-storage /dev/nvme0n1p1
The resulting filesystem has UUID 5bc38d5b-a7a4-426e-acdb-e5caf0a809d9 and is mounted persistently at /mnt/gpu-storage.
NFS Server Configuration
Velaryon was configured as an NFS server with a single export:
# /etc/exports
/mnt/gpu-storage 10.4.0.0/20(rw,sync,no_root_squash,no_subtree_check)
This grants read-write access to the entire infrastructure subnet with synchronous writes for data integrity.
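The geerlingguy.nfs role (used below) takes care of reloading the export table, but checking it by hand is straightforward:
# Re-read /etc/exports on the NFS server, then confirm what it's actually exporting
sudo exportfs -ra
showmount -e velaryon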
Ansible Integration
Rather than manually configuring each node, I integrated the GPU storage into the existing Ansible automation:
Inventory Updates (inventory/hosts):
nfs_server:
hosts:
allyrion: # Existing NFS server
velaryon: # New GPU storage server
Host Variables (inventory/host_vars/velaryon.yaml):
nfs_exports:
- "/mnt/gpu-storage 10.4.0.0/20(rw,sync,no_root_squash,no_subtree_check)"
Global Configuration (group_vars/all/vars.yaml):
nfs:
mounts:
primary: # Existing allyrion NFS share
share: "{{ hostvars[groups['nfs_server'] | first].ipv4_address }}:/mnt/usb1"
mount: '/mnt/nfs'
safe_name: 'mnt-nfs'
type: 'nfs'
options: {}
gpu_storage: # New GPU storage share
share: "{{ hostvars['velaryon'].ipv4_address }}:/mnt/gpu-storage"
mount: '/mnt/gpu-storage'
safe_name: 'mnt-gpu\x2dstorage' # Systemd unit name escaping
type: 'nfs'
options: {}
Systemd Automount Configuration
The trickiest part was configuring systemd automount units. Systemd requires special character escaping for mount paths - the mount point /mnt/gpu-storage must use the unit name mnt-gpu\x2dstorage (where \x2d is the escaped dash).
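Rather than escaping paths by hand, systemd-escape will produce the unit name for you, which is a handy way to double-check the safe_name values:
# Print the mount unit name systemd expects for this path
systemd-escape -p --suffix=mount /mnt/gpu-storage
# -> mnt-gpu\x2dstorage.mount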
Mount Unit Template (templates/mount.j2):
[Unit]
Description=Mount {{ item.key }}
[Mount]
What={{ item.value.share }}
Where={{ item.value.mount }}
Type={{ item.value.type }}
{% if item.value.options -%}
Options={{ item.value.options | join(',') }}
{% else -%}
Options=defaults
{% endif %}
[Install]
WantedBy=default.target
Automount Unit Template (templates/automount.j2):
[Unit]
Description=Automount {{ item.key }}
After=remote-fs-pre.target network-online.target network.target
Before=umount.target remote-fs.target
[Automount]
Where={{ item.value.mount }}
TimeoutIdleSec=60
[Install]
WantedBy=default.target
Deployment Playbook
A new playbook, setup_gpu_storage.yaml, orchestrates the entire deployment:
---
# Setup GPU storage on Velaryon with NFS export
- name: 'Setup Velaryon GPU storage and NFS export'
hosts: 'velaryon'
become: true
tasks:
- name: 'Ensure GPU storage mount point exists'
ansible.builtin.file:
path: '/mnt/gpu-storage'
state: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
- name: 'Check if GPU storage is mounted'
ansible.builtin.command:
cmd: 'mountpoint -q /mnt/gpu-storage'
register: gpu_storage_mounted
failed_when: false
changed_when: false
- name: 'Mount GPU storage if not already mounted'
ansible.builtin.mount:
src: 'UUID=5bc38d5b-a7a4-426e-acdb-e5caf0a809d9'
path: '/mnt/gpu-storage'
fstype: 'ext4'
opts: 'defaults'
state: 'mounted'
when: gpu_storage_mounted.rc != 0
- name: 'Configure NFS exports on Velaryon'
hosts: 'velaryon'
become: true
roles:
- 'geerlingguy.nfs'
- name: 'Setup NFS mounts on all nodes'
hosts: 'all'
become: true
roles:
- 'goldentooth.setup_nfs_mounts'
Usage
The GPU storage is now seamlessly integrated into the goldentooth CLI:
# Deploy/update GPU storage configuration
goldentooth setup_gpu_storage
# Enable automount on specific nodes
goldentooth command allyrion 'sudo systemctl enable --now mnt-gpu\x2dstorage.automount'
# Verify access (automounts on first access)
goldentooth command cargyll,bettley 'ls /mnt/gpu-storage/'
Results
The implementation provides:
- 1TB shared storage available cluster-wide at /mnt/gpu-storage
- Automatic mounting via systemd automount on directory access
- Full Ansible automation via the goldentooth CLI
- Clean separation between Pi cluster and GPU node architectures
Data written from any node is immediately visible across the cluster, enabling seamless Pi-GPU workflows for machine learning datasets, model artifacts, and computational results.
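A quick smoke test of that claim, using the goldentooth CLI (the node choices here are arbitrary):
# Write a file from one Pi...
goldentooth command_root bettley 'echo "hello from bettley" > /mnt/gpu-storage/smoke-test.txt'
# ...and read it back from the GPU node
goldentooth command velaryon 'cat /mnt/gpu-storage/smoke-test.txt'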
Prometheus Blackbox Exporter
The Observability Gap
Our Goldentooth cluster has comprehensive infrastructure monitoring through Prometheus, node exporters, and application metrics. But we've been missing a crucial piece: synthetic monitoring. We can see if our servers are running, but can we actually reach our services? Are our web UIs accessible? Can we connect to our APIs?
Enter the Prometheus Blackbox Exporter - our eyes and ears for service availability across the entire cluster.
What is Blackbox Monitoring?
Blackbox monitoring tests services from the outside, just like your users would. Instead of looking at internal metrics, it:
- Probes HTTP/HTTPS endpoints - "Is the Consul web UI actually working?"
- Tests TCP connectivity - "Can I connect to the Vault API port?"
- Validates DNS resolution - "Do our cluster domains resolve correctly?"
- Checks ICMP reachability - "Are all nodes responding to ping?"
It's called "blackbox" because we don't peek inside the service - we just test if it works from the outside.
Planning the Implementation
I needed to design a comprehensive monitoring strategy that would cover:
Service Categories
- HashiCorp Stack: Consul, Vault, Nomad web UIs and APIs
- Kubernetes Services: API server health, Argo CD, LoadBalancer services
- Observability Stack: Prometheus, Grafana, Loki endpoints
- Infrastructure: All 13 node homepages, HAProxy stats
- External Services: CloudFront distributions
- Network Health: DNS resolution for all cluster domains
Intelligent Probe Types
- Internal HTTPS: Uses Step-CA certificates for cluster services
- External HTTPS: Uses public CAs for external services
- HTTP: Plain HTTP for internal services
- TCP: Port connectivity for APIs and cluster communication
- DNS: Domain resolution for cluster services
- ICMP: Basic network connectivity for all nodes
The Ansible Implementation
I created a comprehensive Ansible role goldentooth.setup_blackbox_exporter that handles:
Core Deployment
# Install blackbox exporter v0.25.0
# Deploy on allyrion (same node as Prometheus)
# Configure systemd service with security hardening
# Set up TLS certificates via Step-CA
Security Integration
The blackbox exporter integrates seamlessly with our Step-CA infrastructure:
- Client certificates for secure communication
- CA validation for internal services
- Automatic renewal via systemd timers
- Proper certificate ownership for the service user
Service Discovery Magic
Instead of hardcoding targets, I implemented dynamic service discovery:
# Generate targets from Ansible inventory variables
blackbox_https_internal_targets:
- "https://consul.goldentooth.net:8501"
- "https://vault.goldentooth.net:8200"
- "https://nomad.goldentooth.net:4646"
# ... and many more
# Auto-generate ICMP targets for all cluster nodes
{% for host in groups['all'] %}
- targets:
- "{{ hostvars[host]['ipv4_address'] }}"
labels:
job: 'blackbox-icmp'
node: "{{ host }}"
{% endfor %}
Prometheus Integration
The trickiest part was configuring Prometheus to properly scrape blackbox targets. Blackbox exporter works differently than normal exporters:
# Instead of scraping the target directly...
# Prometheus scrapes the blackbox exporter with target as parameter
- job_name: 'blackbox-https-internal'
metrics_path: '/probe'
params:
module: ['https_2xx_internal']
relabel_configs:
# Redirect to blackbox exporter
- target_label: __address__
replacement: "allyrion:9115"
# Pass original target as parameter
- source_labels: [__param_target]
target_label: __param_target
Deployment Day
The deployment was mostly smooth with a few interesting challenges:
Certificate Duration Drama
# First attempt failed:
# "requested duration of 8760h is more than authorized maximum of 168h"
# Solution: Match Step-CA policy
--not-after=168h # 1 week instead of 1 year
DNS Resolution Reality Check
Many of our internal domains (*.goldentooth.net) don't actually resolve yet, so probes show up=0. This is expected and actually valuable - it shows us what infrastructure we still need to set up!
Relabel Configuration Complexity
Getting the Prometheus relabel configs right for blackbox took several iterations. The key insight: blackbox exporter targets need to be "redirected" through the exporter itself.
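The easiest way to internalize that redirection is to hit the probe endpoint by hand. The exporter takes the module and target as query parameters and returns the probe_* metrics directly; using the module and port configured above, something like:
# Ask the blackbox exporter on allyrion to probe a target and print the interesting metrics
curl -s 'http://allyrion:9115/probe?module=https_2xx_internal&target=https://consul.goldentooth.net:8501' \
  | grep -E '^probe_(success|duration_seconds|http_status_code)'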
What We're Monitoring Now
The blackbox exporter is now actively monitoring 40+ endpoints across our cluster:
Web UIs and APIs
- Consul Web UI (https://consul.goldentooth.net:8501)
- Vault Web UI (https://vault.goldentooth.net:8200)
- Nomad Web UI (https://nomad.goldentooth.net:4646)
- Grafana Dashboard (https://grafana.goldentooth.net:3000)
- Argo CD Interface (https://argocd.goldentooth.net)
Infrastructure Endpoints
- All 13 node homepages (http://[node].nodes.goldentooth.net)
- HAProxy statistics page (with basic auth)
- Prometheus web interface
- Loki API endpoints
Network Connectivity
- TCP connectivity to all critical service ports
- DNS resolution for all cluster domains
- ICMP ping for every cluster node
- External CloudFront distributions
The Power of Synthetic Monitoring
Now when something breaks, we'll know immediately:
- probe_success tells us if the service is reachable
- probe_duration_seconds shows response times
- probe_http_status_code reveals HTTP errors
- probe_ssl_earliest_cert_expiry warns about certificate expiration
This complements our existing infrastructure monitoring perfectly. We can see both "the server is running" (node exporter) and "the service actually works" (blackbox exporter).
Comprehensive Metrics Collection
After establishing the foundation of our observability stack with Prometheus, Grafana, and the blackbox exporter, it's time to ensure we're collecting metrics from every critical component in our cluster. This chapter covers the addition of Nomad telemetry and Kubernetes object metrics to our monitoring infrastructure.
The Metrics Audit
A comprehensive audit of our cluster revealed which services were already exposing metrics:
Already Configured:
- ✅ Kubernetes API server, controller manager, scheduler (via control plane endpoints)
- ✅ HAProxy (custom exporter on port 8405)
- ✅ Prometheus (self-monitoring)
- ✅ Grafana (internal metrics)
- ✅ Loki (log aggregation metrics)
- ✅ Consul (built-in Prometheus endpoint)
- ✅ Vault (telemetry endpoint)
Missing:
- ❌ Nomad (no telemetry configuration)
- ❌ Kubernetes object state (deployments, pods, services)
Enabling Nomad Telemetry
Nomad has built-in Prometheus support but requires explicit configuration. We added the telemetry block to our Nomad configuration template:
telemetry {
collection_interval = "1s"
disable_hostname = true
prometheus_metrics = true
publish_allocation_metrics = true
publish_node_metrics = true
}
This configuration:
- Enables Prometheus-compatible metrics on /v1/metrics?format=prometheus
- Publishes detailed allocation and node metrics
- Disables hostname labels (we add our own)
- Sets a 1-second collection interval for fine-grained data
Certificate-Based Authentication
Unlike some services that expose metrics without authentication, Nomad requires mutual TLS for metrics access. We leveraged our Step-CA infrastructure to generate proper client certificates:
- name: 'Generate Prometheus client certificate for Nomad metrics.'
ansible.builtin.shell:
cmd: |
{{ step_ca.executable }} ca certificate \
"prometheus.client.nomad" \
"/etc/prometheus/certs/nomad-client.crt" \
"/etc/prometheus/certs/nomad-client.key" \
--provisioner="{{ step_ca.default_provisioner.name }}" \
--password-file="{{ step_ca.default_provisioner.password_path }}" \
--san="prometheus.client.nomad" \
--san="prometheus" \
--san="{{ clean_hostname }}" \
--san="{{ ipv4_address }}" \
--not-after='24h' \
--console \
--force
This approach ensures:
- Certificates are properly signed by our cluster CA
- Client identity is clearly established
- Automatic renewal via systemd timers
- Consistent with our security model
Prometheus Scrape Configuration
With certificates in place, we configured Prometheus to scrape all Nomad nodes:
- job_name: 'nomad'
metrics_path: '/v1/metrics'
params:
format: ['prometheus']
static_configs:
- targets:
- "10.4.0.11:4646" # bettley (server)
- "10.4.0.12:4646" # cargyll (server)
- "10.4.0.13:4646" # dalt (server)
# ... all client nodes
scheme: 'https'
tls_config:
ca_file: "{{ step_ca.root_cert_path }}"
cert_file: "/etc/prometheus/certs/nomad-client.crt"
key_file: "/etc/prometheus/certs/nomad-client.key"
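To sanity-check the whole chain by hand, the same client certificate can be fed to curl against one of the Nomad servers. Assuming the Step-CA root resolves to /etc/ssl/certs/goldentooth.pem (as it does elsewhere in the cluster), that looks roughly like:
# Manually scrape Nomad's Prometheus endpoint with mTLS, the same way Prometheus will
curl --cacert /etc/ssl/certs/goldentooth.pem \
     --cert /etc/prometheus/certs/nomad-client.crt \
     --key /etc/prometheus/certs/nomad-client.key \
     'https://10.4.0.11:4646/v1/metrics?format=prometheus' | head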
Kubernetes Object Metrics with kube-state-metrics
While node-level metrics tell us about resource usage, we also need visibility into Kubernetes objects themselves. Enter kube-state-metrics, which exposes metrics about:
- Deployment replica counts and rollout status
- Pod phases and container states
- Service endpoints and readiness
- PersistentVolume claims and capacity
- Job completion status
- And much more
GitOps Deployment Pattern
Following our established patterns, we created a dedicated GitOps repository for kube-state-metrics:
# Create the repository
gh repo create goldentooth/kube-state-metrics --public
# Clone into our organization structure
cd ~/Projects/goldentooth
git clone https://github.com/goldentooth/kube-state-metrics.git
# Add the required label for Argo CD discovery
gh repo edit goldentooth/kube-state-metrics --add-topic gitops-repo
The key insight here is that our Argo CD ApplicationSet automatically discovers repositories with the gitops-repo label, eliminating manual application creation.
kube-state-metrics Configuration
The deployment includes comprehensive RBAC permissions to read all Kubernetes objects:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
# ... additional resources
We discovered that some resources like resourcequotas, replicationcontrollers, and limitranges were missing from the initial configuration, causing permission errors. A quick update to the ClusterRole resolved these issues.
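Permission errors like that are easy to spot in the pod logs; a quick check looks like:
# Surface RBAC 'forbidden' errors from kube-state-metrics
kubectl logs -n kube-state-metrics deploy/kube-state-metrics | grep -i forbidden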
Security Hardening
The kube-state-metrics deployment follows security best practices:
securityContext:
fsGroup: 65534
runAsGroup: 65534
runAsNonRoot: true
runAsUser: 65534
seccompProfile:
type: RuntimeDefault
Container-level security adds additional restrictions:
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
Prometheus Auto-Discovery
The service includes annotations for automatic Prometheus discovery:
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '8080'
prometheus.io/path: '/metrics'
This eliminates the need for manual Prometheus configuration - the metrics are automatically discovered and scraped.
Verifying the Deployment
After deployment, we can verify metrics are being exposed:
# Port-forward to test locally
kubectl port-forward -n kube-state-metrics service/kube-state-metrics 8080:8080
# Check deployment metrics
curl -s http://localhost:8080/metrics | grep kube_deployment_status_replicas
Example output:
kube_deployment_status_replicas{namespace="argocd",deployment="argocd-redis-ha-haproxy"} 3
kube_deployment_status_replicas{namespace="kube-system",deployment="coredns"} 2
Blocking Docker Installation
The Problem
I don't know why, and I'm too lazy to dig much into it, but if I install docker on any node in the Kubernetes cluster, this conflicts with containerd (containerd.io), which causes Kubernetes to shit blood and stop working on that node. Great.
To prevent this, I implemented a clusterwide ban on Docker. I'm recording the details here in case I need to do it again.
Implementation
First, we removed Docker from nodes where it was already installed (like Allyrion):
# Stop and remove containers
goldentooth command_root allyrion "docker stop envoy && docker rm envoy"
# Remove all images
goldentooth command_root allyrion "docker images -q | xargs -r docker rmi -f"
# Stop and disable Docker
goldentooth command_root allyrion "systemctl stop docker && systemctl disable docker"
goldentooth command_root allyrion "systemctl stop docker.socket && systemctl disable docker.socket"
# Purge Docker packages
goldentooth command_root allyrion "apt-get purge -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin"
goldentooth command_root allyrion "apt-get autoremove -y"
# Clean up Docker directories
goldentooth command_root allyrion "rm -rf /var/lib/docker /etc/docker /var/run/docker.sock"
goldentooth command_root allyrion "rm -f /etc/apt/sources.list.d/docker.list /etc/apt/keyrings/docker.gpg"
APT Preferences Configuration
Next, we added an APT preferences file to the goldentooth.setup_security role that blocks Docker packages from being installed:
- name: 'Block Docker installation to prevent conflicts with Kubernetes containerd'
ansible.builtin.copy:
dest: '/etc/apt/preferences.d/block-docker'
mode: '0644'
owner: 'root'
group: 'root'
content: |
# Block Docker installation to prevent conflicts with Kubernetes containerd
# Docker packages can break the containerd installation used by Kubernetes
# This preference file prevents accidental installation of Docker
Package: docker-ce
Pin: origin ""
Pin-Priority: -1
Package: docker-ce-cli
Pin: origin ""
Pin-Priority: -1
Package: docker-ce-rootless-extras
Pin: origin ""
Pin-Priority: -1
Package: docker-buildx-plugin
Pin: origin ""
Pin-Priority: -1
Package: docker-compose-plugin
Pin: origin ""
Pin-Priority: -1
Package: docker.io
Pin: origin ""
Pin-Priority: -1
Package: docker-compose
Pin: origin ""
Pin-Priority: -1
Package: docker-registry
Pin: origin ""
Pin-Priority: -1
Package: docker-doc
Pin: origin ""
Pin-Priority: -1
# Also block the older containerd.io package that comes with Docker
# Kubernetes should use the standard containerd package instead
Package: containerd.io
Pin: origin ""
Pin-Priority: -1
Deployment
The configuration was deployed to all nodes using:
goldentooth configure_cluster
Verification
We can verify that Docker is now blocked:
# Check Docker package policy
goldentooth command allyrion "apt-cache policy docker-ce"
# Output shows: Candidate: (none)
# Verify the preferences file exists
goldentooth command all "ls -la /etc/apt/preferences.d/block-docker"
How APT Preferences Work
APT preferences allow you to control which versions of packages are installed. By setting a Pin-Priority of -1, we effectively tell APT to never install these packages, regardless of their availability in the configured repositories.
This is more robust than simply removing Docker repositories because:
- It prevents installation from any source (including manual addition of repositories)
- It provides clear documentation of why these packages are blocked
- It's easily reversible if needed (just remove the preferences file)
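The same verification works for any of the pinned packages, and a simulated install makes the block obvious:
# Any pinned package should show 'Candidate: (none)'
goldentooth command bettley "apt-cache policy docker.io containerd.io"
# A dry-run install should refuse with 'has no installation candidate'
goldentooth command bettley "apt-get install -s docker-ce"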
Infrastructure Test Framework Improvements
After running comprehensive tests across the cluster, we discovered several critical issues with our test framework that were masking real infrastructure problems. This chapter documents the systematic fixes we implemented to ensure our automated testing provides accurate health monitoring.
The Initial Problem
When running goldentooth test all, multiple test failures appeared across different nodes:
PLAY RECAP *********************************************************************
bettley : ok=47 changed=0 unreachable=0 failed=1 skipped=3 rescued=0 ignored=2
cargyll : ok=47 changed=0 unreachable=0 failed=1 skipped=3 rescued=0 ignored=1
dalt : ok=47 changed=0 unreachable=0 failed=1 skipped=3 rescued=0 ignored=1
The challenge was determining whether these failures indicated real infrastructure issues or problems with the test framework itself.
Root Cause Analysis
1. Kubernetes API Server Connectivity Issues
The most critical failure was the Kubernetes API server health check consistently failing on the bettley control plane node:
Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
url: https://10.4.0.11:6443/healthz
Initial investigation revealed that while kubelet was running, both etcd and kube-apiserver pods were in CrashLoopBackOff state. This led us to discover that Kubernetes certificates had expired on June 20, 2025, but we were running tests in July 2025.
2. Test Framework Configuration Issues
Several test framework bugs were identified:
- Vault decryption errors: Tests couldn't access encrypted vault secrets
- Wrong certificate paths: Tests checking CA certificates instead of service certificates
- Undefined variables: JMESPath dependencies and variable reference errors
- Localhost binding assumptions: Services bound to specific IPs, not localhost
Infrastructure Fixes
Kubernetes Certificate Renewal
The most significant infrastructure issue was expired Kubernetes certificates. We resolved this using kubeadm:
# Backup existing certificates
ansible -i inventory/hosts bettley -m shell -a "cp -r /etc/kubernetes/pki /etc/kubernetes/pki.backup.$(date +%Y%m%d_%H%M%S)" --become
# Renew all certificates
ansible -i inventory/hosts bettley -m shell -a "kubeadm certs renew all" --become
# Restart control plane components by moving manifests temporarily
cd /etc/kubernetes/manifests
mv kube-apiserver.yaml kube-apiserver.yaml.tmp
mv etcd.yaml etcd.yaml.tmp
mv kube-controller-manager.yaml kube-controller-manager.yaml.tmp
mv kube-scheduler.yaml kube-scheduler.yaml.tmp
# Wait 10 seconds, then restore manifests
sleep 10
mv kube-apiserver.yaml.tmp kube-apiserver.yaml
mv etcd.yaml.tmp etcd.yaml
mv kube-controller-manager.yaml.tmp kube-controller-manager.yaml
mv kube-scheduler.yaml.tmp kube-scheduler.yaml
After renewal, certificates were valid until July 2026:
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
apiserver Jul 23, 2026 00:01 UTC 364d ca no
etcd-peer Jul 23, 2026 00:01 UTC 364d etcd-ca no
etcd-server Jul 23, 2026 00:01 UTC 364d etcd-ca no
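That expiration table is the standard output of kubeadm's own check, which is worth re-running after any renewal:
# Show expiration dates for all kubeadm-managed certificates
ansible -i inventory/hosts bettley -m shell -a "kubeadm certs check-expiration" --become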
Test Framework Fixes
1. Vault Authentication
Fixed missing vault password configuration in test environment:
# /Users/nathan/Projects/goldentooth/ansible/tests/ansible.cfg
[defaults]
vault_password_file = ~/.goldentooth_vault_password
2. Certificate Path Corrections
Updated tests to check actual service certificates instead of CA certificates:
# Before: Checking CA certificates (5-year lifespan)
path: /etc/consul.d/tls/consul-agent-ca.pem
# After: Checking service certificates (24-hour rotation)
path: /etc/consul.d/certs/tls.crt
3. API Connectivity Fixes
Fixed hardcoded localhost assumptions to use actual node IP addresses:
# Before: Assuming localhost binding
url: "https://127.0.0.1:8501/v1/status/leader"
# After: Using actual node IP
url: "http://{{ ansible_default_ipv4.address }}:8500/v1/status/leader"
4. Consul Members Command
Enhanced Consul connectivity testing with proper address specification:
- name: Check if consul command exists
stat:
path: /usr/bin/consul
register: consul_command_stat
- name: Check Consul members
command: consul members -status=alive -http-addr={{ ansible_default_ipv4.address }}:8500
when:
- consul_service.status.ActiveState == "active"
- consul_command_stat.stat.exists
5. Kubernetes Test Improvements
Simplified Kubernetes tests to avoid JMESPath dependencies and fixed variable scoping:
# Simplified node readiness test
- name: Record node readiness test (simplified)
set_fact:
k8s_tests: "{{ k8s_tests + [{'name': 'k8s_cluster_accessible', 'category': 'kubernetes', 'success': (k8s_nodes_raw is defined and k8s_nodes_raw is succeeded) | bool, 'duration': 0.5}] }}"
# Fixed API health test scoping
- name: Record API health test
set_fact:
k8s_tests: "{{ k8s_tests + [{'name': 'k8s_api_healthy', 'category': 'kubernetes', 'success': (k8s_api.status == 200 and k8s_api.content | default('') == 'ok') | bool, 'duration': 0.2}] }}"
when:
- k8s_api is defined
- inventory_hostname in groups['k8s_control_plane']
6. Step-CA Variable References
Fixed undefined variable references in Step-CA connectivity tests:
# Fixed IP address lookup
elif step ca health --ca-url https://{{ hostvars[groups['step_ca'] | first]['ipv4_address'] }}:9443 --root /etc/ssl/certs/goldentooth.pem; then
7. Localhost Aggregation Task
Fixed the test summary task that was failing due to missing facts:
- name: Aggregate test results
hosts: localhost
gather_facts: true # Changed from false
Test Design Philosophy
We adopted a principle of separating certificate presence from validity testing:
# Test 1: Certificate exists
- name: Check Consul certificate exists
stat:
path: /etc/consul.d/certs/tls.crt
register: consul_cert
- name: Record certificate presence test
set_fact:
consul_tests: "{{ consul_tests + [{'name': 'consul_certificate_present', 'category': 'consul', 'success': consul_cert.stat.exists | bool, 'duration': 0.1}] }}"
# Test 2: Certificate is valid (separate test)
- name: Check if certificate needs renewal
command: step certificate needs-renewal /etc/consul.d/certs/tls.crt
register: cert_needs_renewal
when: consul_cert.stat.exists
- name: Record certificate validity test
set_fact:
consul_tests: "{{ consul_tests + [{'name': 'consul_certificate_valid', 'category': 'consul', 'success': (cert_needs_renewal.rc != 0) | bool, 'duration': 0.1}] }}"
This approach provides better debugging information and clearer failure isolation.
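One slightly counterintuitive detail: step certificate needs-renewal exits 0 when the certificate does need renewal, which is why the validity test above treats a non-zero return code as success. It's quick to confirm on any node:
# Exit code 0 means 'renew me'; non-zero means the cert is still comfortably valid
step certificate needs-renewal /etc/consul.d/certs/tls.crt; echo "rc=$?"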
Slurm Refactoring and Improvements
Overview
After the initial Slurm deployment (documented in chapter 032), the cluster faced performance and reliability challenges that required significant refactoring. The monolithic setup role was taking 10+ minutes to execute and had idempotency issues, while memory configuration mismatches caused node validation failures.
It's my fault - it's because of my laziness. So this chapter is essentially me saying "yeah, I did a shitty thing, and so now I have to fix it."
Problems Identified
Performance Issues
- Setup Duration: The original goldentooth.setup_slurm role took over 10 minutes
- Non-idempotent: Re-running the role would repeat expensive operations
- Monolithic Design: Single role handled everything from basic Slurm to complex HPC software stacks
Node Validation Failures
- Memory Mismatch: karstark and lipps nodes (4GB Pi 4s) were configured with 4096MB but only had ~3797MB available
- Invalid State: These nodes showed as "inval" in sinfo output
- Authentication Issues: MUNGE key synchronization problems across nodes
Configuration Management
- Static Memory Values: All nodes hardcoded to 4096MB regardless of actual capacity
- Limited Flexibility: Single configuration approach didn't account for hardware variations
Refactoring Solution
Modular Role Architecture
Split the monolithic role into focused components:
Core Components (goldentooth.setup_slurm_core)
- Purpose: Essential Slurm and MUNGE setup only
- Duration: Reduced from 10+ minutes to ~50 seconds
- Scope: Package installation, basic configuration, service management
- Features: MUNGE key synchronization, systemd PID file fixes
Specialized Modules
- goldentooth.setup_lmod: Environment module system
- goldentooth.setup_hpc_software: HPC software stack (OpenMPI, Singularity, Conda)
- goldentooth.setup_slurm_modules: Module files for installed software
Dynamic Memory Detection
Replaced static memory configuration with dynamic detection:
# Before: Static configuration
NodeName=DEFAULT CPUs=4 RealMemory=4096 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
# After: Dynamic per-node configuration
NodeName=DEFAULT CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
{% for slurm_compute_name in groups['slurm_compute'] %}
NodeName={{ slurm_compute_name }} NodeAddr={{ hostvars[slurm_compute_name].ipv4_address }} RealMemory={{ hostvars[slurm_compute_name].ansible_memtotal_mb }}
{% endfor %}
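The ansible_memtotal_mb fact used in that template is easy to spot-check per node, which is exactly how the karstark/lipps discrepancy shows up:
# Show the memory fact Ansible will substitute into slurm.conf for a given node
ansible -i inventory/hosts karstark -m setup -a 'filter=ansible_memtotal_mb'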
Node Exclusion Strategy
For nodes with insufficient memory (karstark, lipps):
- Inventory Update: Removed from the slurm_compute group
- Service Cleanup: Stopped and disabled slurmd/munge services
- Package Removal: Uninstalled Slurm packages to prevent conflicts
Implementation Details
MUNGE Key Synchronization
Added permanent solution to MUNGE authentication issues:
- name: 'Synchronize MUNGE keys across cluster'
block:
- name: 'Retrieve MUNGE key from first controller'
ansible.builtin.slurp:
src: '/etc/munge/munge.key'
register: 'controller_munge_key'
run_once: true
delegate_to: "{{ groups['slurm_controller'] | first }}"
- name: 'Distribute MUNGE key to all nodes'
ansible.builtin.copy:
content: "{{ controller_munge_key.content | b64decode }}"
dest: '/etc/munge/munge.key'
owner: 'munge'
group: 'munge'
mode: '0400'
backup: yes
when: inventory_hostname != groups['slurm_controller'] | first
notify: 'Restart MUNGE'
SystemD Integration Fixes
Resolved PID file path mismatches:
- name: 'Fix slurmctld pidfile path mismatch'
ansible.builtin.copy:
content: |
[Service]
PIDFile=/var/run/slurm/slurmctld.pid
dest: '/etc/systemd/system/slurmctld.service.d/override.conf'
mode: '0644'
when: inventory_hostname in groups['slurm_controller']
notify: 'Reload systemd and restart slurmctld'
NFS Permission Resolution
Fixed directory permissions that prevented slurm user access:
# Fixed root directory permissions on NFS server
chmod 755 /mnt/usb1 # Was 700, preventing slurm user access
Results
Performance Improvements
- Setup Time: Reduced from 10+ minutes to ~50 seconds for core functionality
- Idempotency: Role can be safely re-run without expensive operations
- Modularity: Users can choose which components to install
Cluster Health
- Node Status: 9 nodes operational and idle
- Authentication: MUNGE working consistently across all nodes
- Resource Detection: Accurate memory reporting per node
Final Cluster State
general* up infinite 9 idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
debug up infinite 9 idle bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
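With everything idle, a throwaway job is a nice end-to-end check that MUNGE, slurmctld, and the slurmd daemons all agree; for example:
# Run a trivial command across all nine compute nodes
srun --nodes=9 --partition=general hostname
# Or queue it as a batch job and watch it complete
sbatch --wrap 'hostname' && squeue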
Prometheus Slurm Exporter
Overview
Following the Slurm refactoring work, the next logical step was to add comprehensive monitoring for the HPC workload manager. This chapter documents the implementation of prometheus-slurm-exporter to provide real-time visibility into cluster utilization, job queues, and resource allocation.
The Challenge
While Slurm was operational with 9 nodes in idle state, there was no integration with the existing Prometheus/Grafana observability stack. Key missing capabilities:
- No Cluster Metrics: Unable to monitor CPU/memory utilization across nodes
- No Job Visibility: No insight into job queues, completion rates, or resource consumption
- No Historical Data: No way to track cluster usage patterns over time
- Limited Alerting: No proactive monitoring of cluster health or resource exhaustion
Implementation Approach
Exporter Selection
Initially attempted the original vpenso/prometheus-slurm-exporter but discovered it was unmaintained and lacked modern features. Switched to the rivosinc/prometheus-slurm-exporter fork, which provided:
- Active Maintenance: 87 commits, regular releases through v1.6.10
- Pre-built Binaries: ARM64 support via GitHub releases
- Enhanced Features: Job tracing, CLI fallback modes, throttling support
- Better Performance: Optimized for multiple Prometheus instances
Architecture Design
Deployed the exporter following goldentooth cluster patterns:
# Deployment Strategy
Target Nodes: slurm_controller (bettley, cargyll, dalt)
Service Port: 9092 (HTTP)
Protocol: HTTP with Prometheus file-based service discovery
Integration: Full Step-CA certificate management ready
User Management: Dedicated slurm-exporter service user
Role Structure
Created goldentooth.setup_slurm_exporter following established conventions:
roles/goldentooth.setup_slurm_exporter/
├── CLAUDE.md # Comprehensive documentation
├── tasks/main.yaml # Main deployment tasks
├── templates/
│ ├── slurm-exporter.service.j2 # Systemd service
│ ├── slurm_targets.yaml.j2 # Prometheus targets
│ └── cert-renewer@slurm-exporter.conf.j2 # Certificate renewal
└── handlers/main.yaml # Service management handlers
Technical Implementation
Binary Installation
- name: 'Download prometheus-slurm-exporter from rivosinc fork'
ansible.builtin.get_url:
url: 'https://github.com/rivosinc/prometheus-slurm-exporter/releases/download/v{{ prometheus_slurm_exporter.version }}/prometheus-slurm-exporter_linux_{{ host.architecture }}.tar.gz'
dest: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}.tar.gz'
mode: '0644'
Service Configuration
[Service]
Type=simple
User=slurm-exporter
Group=slurm-exporter
ExecStart=/usr/local/bin/prometheus-slurm-exporter \
-web.listen-address={{ ansible_default_ipv4.address }}:{{ prometheus_slurm_exporter.port }} \
-web.log-level=info
Prometheus Integration
Added to the existing scrape configuration:
prometheus_scrape_configs:
- job_name: 'slurm'
file_sd_configs:
- files:
- "/etc/prometheus/file_sd/slurm_targets.yaml"
relabel_configs:
- source_labels: [instance]
target_label: instance
regex: '([^:]+):\d+'
replacement: '${1}'
Service Discovery
Dynamic target generation for all controller nodes:
- targets:
- "{{ hostvars[slurm_controller].ansible_default_ipv4.address }}:{{ prometheus_slurm_exporter.port }}"
labels:
job: 'slurm'
instance: '{{ slurm_controller }}'
cluster: '{{ cluster_name }}'
role: 'slurm-controller'
Metrics Exposed
The rivosinc exporter provides comprehensive cluster visibility:
Core Cluster Metrics
slurm_cpus_total 36 # Total CPU cores (9 nodes × 4 cores)
slurm_cpus_idle 36 # Available CPU cores
slurm_cpus_per_state{state="idle"} 36
slurm_node_count_per_state{state="idle"} 9
Memory Utilization
slurm_mem_real 7.0281e+10 # Total cluster memory (MB)
slurm_mem_alloc 6.0797e+10 # Allocated memory
slurm_mem_free 9.484e+09 # Available memory
Job Queue Metrics
slurm_jobs_pending 0 # Jobs waiting in queue
slurm_jobs_running 0 # Currently executing jobs
slurm_job_scrape_duration 29 # Metric collection performance
Performance Monitoring
slurm_cpu_load 5.83 # Current CPU load average
slurm_node_scrape_duration 35 # Node data collection time
Deployment Results
Service Health
All three controller nodes running successfully:
● slurm-exporter.service - Prometheus Slurm Exporter
Loaded: loaded (/etc/systemd/system/slurm-exporter.service; enabled)
Active: active (running)
Main PID: 3692156 (prometheus-slur)
Tasks: 5 (limit: 8737)
Memory: 1.5M (max: 128.0M available)
Metrics Validation
curl http://10.4.0.11:9092/metrics | grep '^slurm_'
slurm_cpu_load 5.83
slurm_cpus_idle 36
slurm_cpus_per_state{state="idle"} 36
slurm_cpus_total 36
slurm_node_count_per_state{state="idle"} 9
Prometheus Integration
Targets automatically discovered and scraped:
- bettley:9092 - Controller node metrics
- cargyll:9092 - Controller node metrics
- dalt:9092 - Controller node metrics
Configuration Management
Variables Structure
# Prometheus Slurm Exporter configuration (rivosinc fork)
prometheus_slurm_exporter:
version: "1.6.10"
port: 9092
user: "slurm-exporter"
group: "slurm-exporter"
Command Interface
# Deploy exporter
goldentooth setup_slurm_exporter
# Verify deployment
goldentooth command slurm_controller "systemctl status slurm-exporter"
# Check metrics
goldentooth command bettley "curl -s http://localhost:9092/metrics | head -10"
Troubleshooting Lessons
Initial Issues Encountered
- Wrong Repository: Started with unmaintained vpenso fork
  - Solution: Switched to actively maintained rivosinc fork
- TLS Configuration: Attempted HTTPS but exporter doesn't support TLS flags
  - Solution: Used HTTP with plans for future TLS proxy if needed
- Binary Availability: No pre-built ARM64 binaries in original version
  - Solution: rivosinc fork provides comprehensive release assets
- Port Conflicts: Initially used port 8080
  - Solution: Used exporter default 9092 to avoid conflicts
Debugging Process
Service logs were key to identifying configuration issues:
journalctl -u slurm-exporter --no-pager -l
Metrics endpoint testing confirmed functionality:
curl -s http://localhost:9092/metrics | grep -E '^slurm_'
Integration with Existing Stack
The exporter seamlessly integrates with goldentooth monitoring infrastructure:
Prometheus Configuration
- File-based Service Discovery: Automatic target management
- Label Strategy: Consistent with existing exporters
- Scrape Intervals: Standard 60-second collection
Certificate Management
- Step-CA Ready: Templates prepared for future TLS implementation
- Automatic Renewal: Systemd timer configuration included
- Service User: Dedicated account with minimal permissions
Observability Pipeline
- Prometheus: Metrics collection and storage
- Grafana: Dashboard visualization (ready for implementation)
- Alerting: Rule definition for cluster health monitoring
Performance Impact
Resource Usage
- Memory: ~1.5MB RSS per exporter instance
- CPU: Minimal impact during scraping
- Network: Standard HTTP metrics collection
- Slurm Load: Read-only operations with built-in throttling
Scalability Considerations
- Multiple Controllers: Distributed across all controller nodes
- High Availability: No single point of failure
- Data Consistency: Each exporter provides complete cluster view
Certificate Renewal Debugging Odyssey
Some time after setting up the certificate renewal system, the cluster was humming along nicely with 24-hour certificate lifetimes and automatic renewal every 5 minutes. Or so I thought.
One morning, I discovered that Vault certificates had mysteriously expired overnight, despite the renewal system supposedly working. This kicked off a multi-day investigation that would lead to significant improvements in our certificate management and monitoring infrastructure.
The Mystery: Why Didn't Vault Certificates Renew?
The first clue was puzzling - some services had renewed their certificates successfully (Consul, Nomad), while others (Vault) had failed silently. The cert-renewer systemd service showed no errors, and the timers were running on schedule.
$ goldentooth command_root jast 'systemctl status cert-renewer@vault.timer'
● cert-renewer@vault.timer - Timer for certificate renewal of vault
Loaded: loaded (/etc/systemd/system/cert-renewer@.timer; enabled)
Active: active (waiting) since Wed 2025-07-23 14:05:12 EDT; 3h ago
The timer was active, but the certificates were still expired. Something was fundamentally wrong with our renewal logic.
Building a Certificate Renewal Canary
Rather than guessing at the problem, I decided to build proper test infrastructure. The solution was a "canary" service - a minimal certificate renewal setup with extremely short lifetimes that would fail fast and give us rapid feedback.
Creating the Canary Service
I created a new Ansible role goldentooth.setup_cert_renewer_canary that:
- Creates a dedicated user and service: a cert-canary user with its own systemd service
- Uses 15-minute certificate lifetimes: Fast enough to debug quickly
- Runs on a 5-minute timer: Same schedule as production services
- Provides comprehensive logging: Detailed output for debugging
# roles/goldentooth.setup_cert_renewer_canary/defaults/main.yaml
cert_canary:
username: cert-canary
group: cert-canary
cert_lifetime: 15m
cert_path: /opt/cert-canary/certs/tls.crt
key_path: /opt/cert-canary/certs/tls.key
The canary service template includes detailed logging:
[Unit]
Description=Certificate Canary Service
After=network-online.target
[Service]
Type=oneshot
User=cert-canary
WorkingDirectory=/opt/cert-canary
ExecStart=/bin/echo "Certificate canary service executed successfully"
Discovering the Root Cause
With the canary in place, I could observe the renewal process in real-time. The breakthrough came when I examined the step certificate needs-renewal command more carefully.
The 66% Threshold Problem
The default cert-renewer configuration uses a 66% threshold for renewal - certificates renew when they have less than 66% of their lifetime remaining. For 24-hour certificates, this means renewal occurs when there are about 8 hours left.
But here's the critical issue: with a 5-minute timer interval, there's only a narrow window for successful renewal. If the renewal fails during that window (due to network issues, service restarts, etc.), the next attempt won't occur until the timer fires again.
The math was unforgiving:
- 24-hour certificate: 66% threshold = ~8 hour renewal window
- 5-minute timer: 12 attempts per hour
- Network/service instability: Occasional renewal failures
- Result: Certificates could expire if multiple renewal attempts failed in succession
The Solution: Environment Variable Configuration
The fix involved making the cert-renewer system more configurable and robust. I updated the base cert-renewer@.service template to support environment variable overrides:
[Unit]
Description=Certificate renewer for %I
After=network-online.target
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
StartLimitIntervalSec=0
[Service]
Type=oneshot
User=root
Environment=STEPPATH=/etc/step-ca
Environment=CERT_LOCATION=/etc/step/certs/%i.crt
Environment=KEY_LOCATION=/etc/step/certs/%i.key
Environment=EXPIRES_IN_THRESHOLD=66%
ExecCondition=/usr/bin/step certificate needs-renewal ${CERT_LOCATION} --expires-in ${EXPIRES_IN_THRESHOLD}
ExecStart=/usr/bin/step ca renew --force ${CERT_LOCATION} ${KEY_LOCATION}
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active %i.service || systemctl try-reload-or-restart %i.service"
[Install]
WantedBy=multi-user.target
Service-Specific Overrides
Each service now gets its own override configuration that specifies the exact certificate paths and renewal behavior:
# /etc/systemd/system/cert-renewer@vault.service.d/override.conf
[Service]
Environment=CERT_LOCATION=/opt/vault/tls/tls.crt
Environment=KEY_LOCATION=/opt/vault/tls/tls.key
WorkingDirectory=/opt/vault/tls
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active vault.service || systemctl try-reload-or-restart vault.service"
The beauty of this approach is that we can now tune renewal behavior per service without modifying the base template.
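With the override in place, a renewal can also be forced by hand, which is much faster than waiting for the timer when debugging (unit name as above):
# Reload units, run one renewal cycle immediately, and inspect the result
sudo systemctl daemon-reload
sudo systemctl start cert-renewer@vault.service
journalctl -u cert-renewer@vault.service --no-pager -n 20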
Comprehensive Monitoring Infrastructure
While debugging the certificate issue, I also built comprehensive monitoring dashboards and alerting to prevent future incidents.
New Grafana Dashboards
I created three major monitoring dashboards:
- Slurm Cluster Overview: Job queue metrics, resource utilization, historical trends
- HashiCorp Services Overview: Consul health, Vault status, Nomad allocation monitoring
- Infrastructure Health Overview: Node uptime, storage capacity, network metrics
Enhanced Metrics Collection
The monitoring improvements included:
- Vector Internal Metrics: Enabled Vector's internal metrics and Prometheus exporter
- Certificate Expiration Tracking: Automated monitoring of certificate days-remaining
- Service Health Indicators: Real-time status for all critical cluster services
- Alert Rules: Proactive notifications for certificate expiration and service failures
Testing Infrastructure Improvements
The certificate renewal investigation led to significant improvements in our testing infrastructure.
Certificate-Aware Test Suite
I created a comprehensive test_certificate_renewal role that:
- Node-Specific Testing: Only tests certificates for services actually deployed on each node
- Multi-Layered Validation: Certificate presence, validity, timer status, renewal capability
- Chain Validation: Verifies certificates against the cluster CA
- Canary Health Monitoring: Tracks the certificate canary's renewal cycles
Smart Service Filtering
The test improvements included "intelligent" service filtering:
# Filter services to only those deployed on this node
- name: Filter services for current node
set_fact:
node_certificate_services: |-
{%- set filtered_services = [] -%}
{%- for service in certificate_services -%}
{%- set should_include = false -%}
{%- if service.get('specific_hosts') -%}
{%- if inventory_hostname in service.specific_hosts -%}
{%- set should_include = true -%}
{%- endif -%}
{%- elif service.host_groups -%}
{%- for group in service.host_groups -%}
{%- if inventory_hostname in groups.get(group, []) -%}
{%- set should_include = true -%}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{%- if should_include -%}
{%- set _ = filtered_services.append(service) -%}
{%- endif -%}
{%- endfor -%}
{{ filtered_services }}
This eliminated false positives where tests were failing for missing certificates on nodes where services weren't supposed to be running.
Nextflow Workflow Management System
Overview
After successfully establishing a robust Slurm HPC cluster with comprehensive monitoring and observability, the next logical step was to add a modern workflow management system. Nextflow provides a powerful solution for data-intensive computational pipelines, enabling scalable and reproducible scientific workflows using software containers.
This chapter documents the installation and integration of Nextflow 24.10.0 with the existing Slurm cluster, complete with Singularity container support, shared storage integration, and module system configuration.
The Challenge
While our Slurm cluster was fully functional for individual job submission, we lacked a sophisticated workflow management system that could:
- Orchestrate Complex Pipelines: Chain multiple computational steps with dependency management
- Provide Reproducibility: Ensure consistent results across different execution environments
- Support Containers: Leverage containerized software for portable and consistent environments
- Integrate with Slurm: Seamlessly submit jobs to our existing cluster scheduler
- Enable Scalability: Automatically parallelize workflows across cluster nodes
Modern bioinformatics and data science workflows often require hundreds of interconnected tasks, making manual job submission impractical and error-prone.
Implementation Approach
The solution involved creating a comprehensive Nextflow installation that integrates deeply with our existing infrastructure:
1. Architecture Design
- Shared Storage Integration: Install Nextflow on NFS to ensure cluster-wide accessibility
- Slurm Executor: Configure native Slurm executor for distributed job execution
- Container Runtime: Leverage existing Singularity installation for reproducible environments
- Module System: Integrate with Lmod for consistent environment management
2. Installation Strategy
- Java Runtime: Install OpenJDK 17 as a prerequisite across all compute nodes
- Centralized Installation: Single installation on shared storage accessible by all nodes
- Configuration Templates: Create reusable configuration for common workflow patterns
- Example Workflows: Provide ready-to-run examples for testing and learning
Technical Implementation
New Ansible Role Creation
Created goldentooth.setup_nextflow role with comprehensive installation logic:
# Install Java OpenJDK (required for Nextflow)
- name: 'Install Java OpenJDK (required for Nextflow)'
ansible.builtin.apt:
name:
- 'openjdk-17-jdk'
- 'openjdk-17-jre'
state: 'present'
# Download and install Nextflow
- name: 'Download Nextflow binary'
ansible.builtin.get_url:
url: "https://github.com/nextflow-io/nextflow/releases/download/v{{ slurm.nextflow_version }}/nextflow"
dest: "{{ slurm.nfs_base_path }}/nextflow/{{ slurm.nextflow_version }}/nextflow"
owner: 'slurm'
group: 'slurm'
mode: '0755'
Slurm Executor Configuration
Created comprehensive Nextflow configuration optimized for our cluster:
// Nextflow Configuration for Goldentooth Cluster
process {
executor = 'slurm'
queue = 'general'
// Default resource requirements
cpus = 1
memory = '1GB'
time = '1h'
// Enable Singularity containers
container = 'ubuntu:20.04'
// Process-specific configurations
withLabel: 'small' {
cpus = 1
memory = '2GB'
time = '30m'
}
withLabel: 'large' {
cpus = 4
memory = '8GB'
time = '6h'
}
}
// Slurm executor configuration
executor {
name = 'slurm'
queueSize = 100
submitRateLimit = '10/1min'
clusterOptions = {
"--account=default " +
"--partition=\${task.queue} " +
"--job-name=nf-\${task.hash}"
}
}
Container Integration
Configured Singularity integration for reproducible workflows:
singularity {
enabled = true
autoMounts = true
envWhitelist = 'SLURM_*'
// Cache directory on shared storage
cacheDir = '/mnt/nfs/slurm/singularity/cache'
// Mount shared directories
runOptions = '--bind /mnt/nfs/slurm:/mnt/nfs/slurm'
}
Module System Integration
Extended the existing Lmod system with a Nextflow module:
-- Nextflow Workflow Management System
whatis("Nextflow workflow management system 24.10.0")
-- Load required Java module (dependency)
depends_on("java/17")
-- Add Nextflow to PATH
prepend_path("PATH", "/mnt/nfs/slurm/nextflow/24.10.0")
-- Set Nextflow environment variables
setenv("NXF_HOME", "/mnt/nfs/slurm/nextflow/24.10.0")
setenv("NXF_WORK", "/mnt/nfs/slurm/nextflow/workspace")
-- Enable Singularity by default
setenv("NXF_SINGULARITY_CACHEDIR", "/mnt/nfs/slurm/singularity/cache")
Example Pipeline
Created a comprehensive hello-world pipeline demonstrating cluster integration:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// Pipeline parameters
params {
greeting = 'Hello'
names = ['World', 'Goldentooth', 'Slurm', 'Nextflow']
output_dir = './results'
}
process sayHello {
tag "$name"
label 'small'
publishDir params.output_dir, mode: 'copy'
container 'ubuntu:20.04'
input:
val name
output:
path "${name}_greeting.txt"
script:
"""
echo "${params.greeting} ${name}!" > ${name}_greeting.txt
echo "Running on node: \$(hostname)" >> ${name}_greeting.txt
echo "Slurm Job ID: \${SLURM_JOB_ID:-'Not running under Slurm'}" >> ${name}_greeting.txt
"""
}
workflow {
names_ch = Channel.fromList(params.names)
greetings_ch = sayHello(names_ch)
workflow.onComplete {
log.info "Pipeline completed successfully!"
log.info "Results saved to: ${params.output_dir}"
}
}
Deployment Results
Installation Success
The deployment was executed successfully across all Slurm compute nodes:
cd /Users/nathan/Projects/goldentooth/ansible
ansible-playbook -i inventory/hosts playbooks/setup_nextflow.yaml --limit slurm_compute
Installation Summary:
- ✅ Java OpenJDK 17 installed on 9 compute nodes
- ✅ Nextflow 24.10.0 downloaded and configured
- ✅ Slurm executor configured with resource profiles
- ✅ Singularity integration enabled with shared cache
- ✅ Module file created and integrated with Lmod
- ✅ Example pipeline deployed and tested
Verification Output
Nextflow Installation Test:
N E X T F L O W
version 24.10.0 build 5928
created 27-10-2024 18:36 UTC (14:36 GMT-04:00)
cite doi:10.1038/nbt.3820
http://nextflow.io
Installation paths:
- Nextflow: /mnt/nfs/slurm/nextflow/24.10.0
- Config: /mnt/nfs/slurm/nextflow/24.10.0/nextflow.config
- Examples: /mnt/nfs/slurm/nextflow/24.10.0/examples
- Workspace: /mnt/nfs/slurm/nextflow/workspace
Configuration Management
Usage Workflow
Users can now access Nextflow through the module system:
# Load the Nextflow environment
module load Nextflow/24.10.0
# Run the example pipeline
nextflow run /mnt/nfs/slurm/nextflow/24.10.0/examples/hello-world.nf
# Run with development profile (reduced resources)
nextflow run pipeline.nf -profile dev
# Run with custom configuration
nextflow run pipeline.nf -c custom.config
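It's worth confirming that a pipeline is actually being dispatched through Slurm rather than run locally. As a minimal sketch using standard Slurm client tools: squeue --me requires a reasonably recent Slurm (older releases use -u $USER instead), and sacct assumes job accounting is enabled on the cluster.
# Watch the jobs Nextflow submits on your behalf (job names start with "nf-")
squeue --me --format="%.10i %.24j %.8T %.10M %R"
# After the run, summarize per-task accounting (requires Slurm accounting)
sacct -X --format=JobID,JobName%25,State,Elapsed,MaxRSS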
Prometheus Node Exporter Migration: From Kubernetes to Native
The Problem
While working on Grafana dashboard configuration, I discovered that the node exporter dashboard was completely empty - no metrics, no data, just a sad empty dashboard that looked like it had given up on life.
The issue? Our Prometheus Node Exporter was deployed via Kubernetes and Argo CD, but Prometheus itself was running as a systemd service on allyrion. The Kubernetes deployment created a ClusterIP service at 172.16.12.161:9100, but Prometheus (running outside the cluster) couldn't reach this internal Kubernetes service.
Meanwhile, Prometheus was configured to scrape node exporters directly at each node's IP on port 9100 (e.g., 10.4.0.11:9100), but nothing was listening there because the actual exporters were only accessible through the Kubernetes service mesh.
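A quick way to see this mismatch from the Prometheus side is to ask its HTTP API which targets are down. This is a sketch assuming the standard /api/v1/targets endpoint and that Prometheus listens on allyrion at the default port 9090:
# List unhealthy scrape targets and the URLs Prometheus is trying to reach
curl -s http://allyrion:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, url: .scrapeUrl, error: .lastError}'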
The Solution: Raw-dogging Node Exporter
Time to embrace the chaos and deploy node exporter directly on the nodes as systemd services. Sometimes the simplest solution is the best solution.
Step 1: Create the Ansible Playbook
First, I created a new playbook to deploy node exporter cluster-wide using the same prometheus.prometheus.node_exporter role that HAProxy was already using:
# ansible/playbooks/setup_node_exporter.yaml
# Description: Setup Prometheus Node Exporter on all cluster nodes.
- name: 'Setup Prometheus Node Exporter.'
hosts: 'all'
remote_user: 'root'
roles:
- { role: 'prometheus.prometheus.node_exporter' }
handlers:
- name: 'Restart Node Exporter.'
ansible.builtin.service:
name: 'node_exporter'
state: 'restarted'
enabled: true
Step 2: Deploy via Goldentooth CLI
Thanks to the goldentooth CLI's fallback behavior (it automatically runs Ansible playbooks with matching names), deployment was as simple as:
goldentooth setup_node_exporter
This installed node exporter on all 13 cluster nodes, creating:
- node-exp system user and group
- /usr/local/bin/node_exporter binary
- /etc/systemd/system/node_exporter.service systemd service
- /var/lib/node_exporter textfile collector directory
Step 3: Handle Port Conflicts
The deployment initially failed on most nodes with "address already in use" errors. The Kubernetes node exporter pods were still running and had claimed port 9100.
Investigation revealed the conflict:
goldentooth command bettley "journalctl -u node_exporter --no-pager -n 10"
# Error: listen tcp 0.0.0.0:9100: bind: address already in use
Step 4: Clean Up Kubernetes Deployment
I removed the Kubernetes deployment entirely:
# Delete the daemonset and namespace
kubectl delete daemonset prometheus-node-exporter -n prometheus-node-exporter
kubectl delete namespace prometheus-node-exporter
# Delete the Argo CD applications managing this
kubectl delete application prometheus-node-exporter gitops-repo-prometheus-node-exporter -n argocd
# Delete the GitHub repository (to prevent ApplicationSet from recreating it)
gh repo delete goldentooth/prometheus-node-exporter --yes
Step 5: Restart Failed Services
With the port conflicts resolved, I restarted the systemd services:
goldentooth command bettley,dalt "systemctl restart node_exporter"
All nodes now showed healthy node exporter services:
● node_exporter.service - Prometheus Node Exporter
Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled)
Active: active (running) since Wed 2025-07-23 19:36:30 EDT; 7s ago
Step 6: Reload Prometheus
With native node exporters now listening on port 9100 on all nodes, I reloaded Prometheus to pick up the new targets:
goldentooth command allyrion "systemctl reload prometheus"
Verified metrics were accessible:
goldentooth command allyrion "curl -s http://10.4.0.11:9100/metrics | grep node_cpu_seconds_total | head -3"
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.42238869e+06
The Result
Within minutes, the Grafana node exporter dashboard came alive with beautiful metrics from all cluster nodes. CPU usage, memory consumption, disk I/O, network statistics - everything was flowing perfectly.
Authelia Authentication Infrastructure
In our quest to provide secure access to the Goldentooth cluster for AI assistants, we needed a robust authentication and authorization solution. This chapter chronicles the implementation of Authelia, a comprehensive authentication server that provides OAuth 2.0 and OpenID Connect capabilities for our cluster services.
The Authentication Challenge
As we began developing the MCP (Model Context Protocol) server to enable AI assistants like Claude Code to interact with our cluster, we faced a critical security requirement: how to provide secure, standards-based authentication without compromising cluster security or creating a poor user experience.
Traditional authentication approaches like API keys or basic authentication felt inadequate for this use case. We needed:
- Standards-based OAuth 2.0 and OpenID Connect support
- Multi-factor authentication capabilities
- Fine-grained authorization policies
- Integration with our existing Step-CA certificate infrastructure
- Single Sign-On (SSO) for multiple cluster services
Why Authelia?
After evaluating various authentication solutions, Authelia emerged as the ideal choice for our cluster:
- Comprehensive Feature Set: OAuth 2.0, OpenID Connect, LDAP, 2FA/MFA support
- Self-Hosted: No dependency on external authentication providers
- Lightweight: Perfect for deployment on Raspberry Pi infrastructure
- Flexible Storage: Supports SQLite for simplicity or PostgreSQL for scale
- Policy Engine: Fine-grained access control based on users, groups, and resources
Architecture Overview
Authelia fits into our cluster architecture as the central authentication authority:
Claude Code (OAuth Client)
↓ OAuth 2.0 Authorization Code Flow
Authelia (auth.services.goldentooth.net)
↓ JWT/Token Validation
MCP Server (mcp.services.goldentooth.net)
↓ Authenticated API Calls
Goldentooth Cluster Services
The authentication flow follows industry-standard OAuth 2.0 patterns:
- Discovery: Client discovers OAuth endpoints via well-known URLs
- Authorization: User authenticates with Authelia and grants permissions
- Token Exchange: Authorization code exchanged for access/ID tokens
- API Access: Bearer tokens used for authenticated MCP requests
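The flow above can be exercised by hand. As a rough sketch, assuming Authelia serves the standard OIDC discovery document at its well-known path, the first step looks like this:
# Fetch Authelia's OIDC discovery document and pull out the endpoints a client needs
curl -s https://auth.services.goldentooth.net/.well-known/openid-configuration \
  | jq '{issuer, authorization_endpoint, token_endpoint, jwks_uri}'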
Ansible Implementation
Role Structure
The goldentooth.setup_authelia role provides comprehensive deployment automation:
ansible/roles/goldentooth.setup_authelia/
├── defaults/main.yml # Default configuration variables
├── tasks/main.yml # Primary deployment tasks
├── templates/ # Configuration templates
│ ├── configuration.yml.j2 # Main Authelia config
│ ├── users_database.yml.j2 # User definitions
│ ├── authelia.service.j2 # Systemd service
│ ├── authelia-consul-service.json.j2 # Consul registration
│ └── cert-renewer@authelia.conf.j2 # Certificate renewal
├── handlers/main.yml # Service restart handlers
└── CLAUDE.md # Role documentation
Key Configuration Elements
OIDC Provider Configuration: Authelia acts as a full OpenID Connect provider with pre-configured clients for the MCP server:
identity_providers:
oidc:
hmac_secret: {{ authelia_oidc_hmac_secret }}
clients:
- client_id: goldentooth-mcp
client_name: Goldentooth MCP Server
client_secret: "$argon2id$v=19$m=65536,t=3,p=4$..."
authorization_policy: one_factor
redirect_uris:
- https://mcp.services.{{ authelia_domain }}/callback
scopes:
- openid
- profile
- email
- groups
- offline_access
grant_types:
- authorization_code
- refresh_token
Security Hardening: Multiple layers of security protection:
authentication_backend:
file:
password:
algorithm: argon2id
iterations: 3
memory: 65536
parallelism: 4
key_length: 32
salt_length: 16
regulation:
max_retries: 3
find_time: 2m
ban_time: 5m
session:
secret: {{ authelia_session_secret }}
expiration: 12h
inactivity: 45m
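The hashed secrets referenced in the client and user definitions are argon2id digests. Recent Authelia releases ship a CLI helper for producing them; the exact subcommand and flags vary by version, so treat this as an illustrative sketch rather than the canonical invocation:
# Generate an argon2id digest suitable for users_database.yml or an OIDC client_secret
authelia crypto hash generate argon2 --password 'example-password'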
Certificate Integration
Authelia integrates seamlessly with our Step-CA infrastructure:
# Generate TLS certificate for Authelia server
step ca certificate \
"authelia.{{ authelia_domain }}" \
/etc/authelia/tls.crt \
/etc/authelia/tls.key \
--provisioner="default" \
--san="authelia.{{ authelia_domain }}" \
--san="auth.services.{{ authelia_domain }}" \
--not-after='24h' \
--force
The role also configures automatic certificate renewal through our cert-renewer@authelia.timer service, ensuring continuous operation without manual intervention.
Consul Integration
Authelia registers itself as a service in our Consul service mesh, enabling service discovery and health monitoring:
{
"service": {
"name": "authelia",
"port": 9091,
"address": "{{ ansible_hostname }}.{{ cluster.node_domain }}",
"tags": ["authentication", "oauth", "oidc"],
"check": {
"http": "https://{{ ansible_hostname }}.{{ cluster.node_domain }}:9091/api/health",
"interval": "30s",
"timeout": "10s",
"tls_skip_verify": false
}
}
}
This integration provides:
- Service Discovery: Other services can locate Authelia via Consul DNS
- Health Monitoring: Consul tracks Authelia's health status
- Load Balancing: Support for multiple Authelia instances if needed
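Once registered, the service is resolvable through Consul's DNS interface from any node running a Consul agent. A quick check, assuming Consul's default DNS port of 8600:
# Resolve the Authelia service via Consul DNS (SRV records include port 9091)
dig @127.0.0.1 -p 8600 authelia.service.consul SRV +short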
User Management and Policies
Default User Configuration
The deployment creates essential user accounts:
users:
admin:
displayname: "Administrator"
password: "$argon2id$v=19$m=65536,t=3,p=4$..."
email: admin@goldentooth.net
groups:
- admins
- users
mcp-service:
displayname: "MCP Service Account"
password: "$argon2id$v=19$m=65536,t=3,p=4$..."
email: mcp-service@goldentooth.net
groups:
- services
Access Control Policies
Authelia implements fine-grained access control:
access_control:
default_policy: one_factor
rules:
# Public access to health checks
- domain: "*.{{ authelia_domain }}"
policy: bypass
resources:
- "^/api/health$"
# Admin resources require two-factor
- domain: "*.{{ authelia_domain }}"
policy: two_factor
subject:
- "group:admins"
# Regular user access
- domain: "*.{{ authelia_domain }}"
policy: one_factor
Multi-Factor Authentication
Authelia supports multiple 2FA methods out of the box:
TOTP (Time-based One-Time Password):
- Compatible with Google Authenticator, Authy, 1Password
- 6-digit codes with 30-second rotation
- QR code enrollment process
WebAuthn/FIDO2:
- Hardware security keys (YubiKey, SoloKey)
- Platform authenticators (TouchID, Windows Hello)
- Phishing-resistant authentication
Push Notifications (planned):
- Integration with Duo Security for push-based 2FA
- SMS fallback for environments without smartphone access
Deployment and Management
Installation Command
Deploy Authelia across the cluster with a single command:
# Deploy to default Authelia nodes
goldentooth setup_authelia
# Deploy to specific node
goldentooth setup_authelia --limit jast
Service Management
Monitor and manage Authelia using familiar systemd commands:
# Check service status
goldentooth command authelia "systemctl status authelia"
# View logs
goldentooth command authelia "journalctl -u authelia -f"
# Restart service
goldentooth command_root authelia "systemctl restart authelia"
# Validate configuration
goldentooth command authelia "/usr/local/bin/authelia validate-config --config /etc/authelia/configuration.yml"
Health Monitoring
Authelia exposes health and metrics endpoints:
- Health Check: https://auth.goldentooth.net/api/health
- Metrics: http://auth.goldentooth.net:9959/metrics (Prometheus format)
These endpoints integrate with our monitoring stack (Prometheus, Grafana) for observability.
Security Considerations
Threat Mitigation
Authelia addresses multiple attack vectors:
Session Security:
- Secure, HTTP-only cookies
- CSRF protection via state parameters
- Session timeout and inactivity limits
Rate Limiting:
- Failed login attempt throttling
- IP-based temporary bans
- Progressive delays for repeated failures
Password Security:
- Argon2id hashing (memory-hard, side-channel resistant)
- Configurable complexity requirements
- Protection against timing attacks
Network Security
All Authelia communication is secured:
- TLS 1.3: All external communications encrypted
- Certificate Validation: Mutual TLS with cluster CA
- HSTS: HTTP Strict Transport Security headers
- Secure Headers: Complete security header suite
Integration with MCP Server
The MCP server integrates with Authelia through standard OAuth 2.0 flows:
OAuth Discovery
The MCP server exposes OAuth discovery endpoints that delegate to Authelia:
// In http_server.rs
async fn handle_oauth_metadata() -> Result<Response<Full<Bytes>>, Infallible> {
    let discovery = auth.discover_oidc_config().await?;
    let metadata = serde_json::json!({
        "issuer": discovery.issuer,
        "authorization_endpoint": discovery.authorization_endpoint,
        "token_endpoint": discovery.token_endpoint,
        "jwks_uri": discovery.jwks_uri,
        // ... additional OAuth metadata
    });
    Ok(Response::builder()
        .status(StatusCode::OK)
        .header("Content-Type", "application/json")
        .body(Full::new(Bytes::from(metadata.to_string())))
        .unwrap())
}
Token Validation
The MCP server validates tokens using both JWT verification and OAuth token introspection:
async fn validate_token(&self, token: &str) -> AuthResult<Claims> {
    if self.is_jwt_token(token) {
        // JWT validation for ID tokens
        self.validate_jwt_token(token).await
    } else {
        // Token introspection for opaque access tokens
        self.introspect_access_token(token).await
    }
}
This dual approach supports both JWT ID tokens and opaque access tokens that Authelia issues.
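For opaque tokens, the introspection call can be reproduced with curl. This sketch resolves the endpoint from the discovery document rather than hard-coding a path; CLIENT_SECRET and ACCESS_TOKEN are placeholders, and the client must be permitted to introspect tokens in Authelia's configuration:
# Find the introspection endpoint advertised by Authelia
INTROSPECT_URL=$(curl -s https://auth.services.goldentooth.net/.well-known/openid-configuration \
  | jq -r '.introspection_endpoint')
# Introspect an opaque access token (RFC 7662), authenticating as the MCP client
curl -s -u "goldentooth-mcp:${CLIENT_SECRET}" \
  --data-urlencode "token=${ACCESS_TOKEN}" \
  "${INTROSPECT_URL}" | jq '{active, sub, scope, exp}'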
Performance and Scalability
Resource Utilization
Authelia runs efficiently on Raspberry Pi hardware:
- Memory: ~50MB RSS under normal load
- CPU: <1% utilization during authentication flows
- Storage: SQLite database grows slowly (~10MB for hundreds of users)
- Network: Minimal bandwidth requirements
Scaling Strategies
For high-availability deployments:
- Multiple Instances: Deploy Authelia on multiple nodes with shared database
- PostgreSQL Backend: Replace SQLite with PostgreSQL for concurrent access
- Redis Sessions: Use Redis for distributed session storage
- Load Balancing: HAProxy or similar for request distribution
SeaweedFS Distributed Storage Implementation
With Ceph providing robust block storage for Kubernetes, Goldentooth needed an object storage solution optimized for file-based workloads. SeaweedFS emerged as the perfect complement: a simple, fast distributed file system that excels at handling large numbers of files with minimal operational overhead.
The Architecture Decision
SeaweedFS follows a different philosophy from traditional distributed storage systems. Instead of complex replication schemes, it uses a simple master-volume architecture inspired by Google's Colossus and Facebook's Haystack:
- Master servers: Coordinate volume assignments with HashiCorp Raft consensus
- Volume servers: Store actual file data in append-only volumes
- HA consensus: Raft-based leadership election with automatic failover
Target Deployment
I implemented a high availability cluster using fenn and karstark with true HA clustering:
- Storage capacity: ~1TB total (491GB + 515GB across dedicated SSDs)
- Fault tolerance: Automatic failover with zero-downtime leadership transitions
- Consensus protocol: HashiCorp Raft for distributed coordination
- Architecture support: Native ARM64 and x86_64 binaries
- Version: SeaweedFS 3.66 with HA clustering capabilities
Storage Foundation
The SeaweedFS deployment builds on the existing goldentooth.bootstrap_seaweedfs infrastructure:
SSD Preparation
Each storage node gets a dedicated SSD mounted at /mnt/seaweedfs-ssd/:
- name: Format SSD with ext4 filesystem
ansible.builtin.filesystem:
fstype: "{{ seaweedfs.filesystem_type }}"
dev: "{{ seaweedfs.device }}"
force: true
- name: Set proper ownership on SSD mount
ansible.builtin.file:
path: "{{ seaweedfs.mount_path }}"
owner: "{{ seaweedfs.uid }}"
group: "{{ seaweedfs.gid }}"
mode: '0755'
recurse: true
Directory Structure
The bootstrap creates organized storage directories:
- /mnt/seaweedfs-ssd/data/ - Volume server storage
- /mnt/seaweedfs-ssd/master/ - Master server metadata
- /mnt/seaweedfs-ssd/index/ - Volume indexing
- /mnt/seaweedfs-ssd/filer/ - Future filer service data
Service Implementation
The goldentooth.setup_seaweedfs role handles the complete service deployment:
Binary Management
Cross-architecture support with automatic download:
- name: Download SeaweedFS binary
ansible.builtin.get_url:
url: "https://github.com/seaweedfs/seaweedfs/releases/download/{{ seaweedfs.version }}/linux_arm64.tar.gz"
dest: "/tmp/seaweedfs-{{ seaweedfs.version }}.tar.gz"
when: ansible_architecture == "aarch64"
- name: Download SeaweedFS binary (x86_64)
ansible.builtin.get_url:
url: "https://github.com/seaweedfs/seaweedfs/releases/download/{{ seaweedfs.version }}/linux_amd64.tar.gz"
dest: "/tmp/seaweedfs-{{ seaweedfs.version }}.tar.gz"
when: ansible_architecture == "x86_64"
High Availability Master Configuration
Each node runs a master server with HashiCorp Raft consensus for true HA clustering:
[Unit]
Description=SeaweedFS Master Server
After=network.target
Wants=network.target
[Service]
Type=simple
User=seaweedfs
Group=seaweedfs
ExecStart=/usr/local/bin/weed master \
-port=9333 \
-mdir=/mnt/seaweedfs-ssd/master \
-ip=10.4.x.x \
-peers=fenn:9333,karstark:9333 \
-raftHashicorp=true \
-defaultReplication=001 \
-volumeSizeLimitMB=1024
Restart=always
RestartSec=5s
# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/mnt/seaweedfs-ssd
Volume Server Configuration
Volume servers automatically track the current cluster leader:
[Unit]
Description=SeaweedFS Volume Server
After=network.target seaweedfs-master.service
Wants=network.target
[Service]
Type=simple
User=seaweedfs
Group=seaweedfs
ExecStart=/usr/local/bin/weed volume \
-port=8080 \
-dir=/mnt/seaweedfs-ssd/data \
-max=64 \
-mserver=fenn:9333,karstark:9333 \
-ip=10.4.x.x
Restart=always
RestartSec=5s
# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/mnt/seaweedfs-ssd
Security Hardening
SeaweedFS services run with comprehensive systemd security constraints:
- User isolation: Dedicated seaweedfs user (UID/GID 985)
- Filesystem protection: ProtectSystem=strict with explicit write paths
- Privilege containment: NoNewPrivileges=yes
- Process isolation: PrivateTmp=yes and ProtectHome=yes
Deployment Process
The deployment uses serial execution to ensure proper cluster formation:
- name: Enable and start SeaweedFS services
ansible.builtin.systemd:
name: "{{ item }}"
enabled: true
state: started
daemon_reload: true
loop:
- seaweedfs-master
- seaweedfs-volume
- name: Wait for SeaweedFS master to be ready
ansible.builtin.uri:
url: "http://{{ ansible_default_ipv4.address }}:9333/cluster/status"
method: GET
register: master_health_check
until: master_health_check.status == 200
retries: 10
delay: 5
Service Verification
Post-deployment health checks confirm proper operation:
HA Cluster Status
curl http://fenn:9333/cluster/status
Returns cluster topology, current leader, and peer status.
Leadership Monitoring
# Watch leadership changes (healthy flapping every 3 seconds)
watch -n 1 'curl -s http://fenn:9333/cluster/status | jq .Leader'
Volume Server Status
curl http://fenn:8080/status
Shows volume allocation and current master server connections.
Volume Assignment Testing
curl -X POST http://fenn:9333/dir/assign
Demonstrates automatic request routing to the current cluster leader.
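The assign/write/read cycle that SeaweedFS documents can be walked through by hand to confirm the data path end to end. The fid and volume URL below are illustrative examples; use whatever the assign call actually returns:
# 1. Ask the master for a file id; the response names the volume server to write to
curl -s -X POST http://fenn:9333/dir/assign
# => {"fid":"3,01637037d6","url":"fenn:8080", ...}   (example values)
# 2. Upload a file to that volume server under the assigned fid
curl -s -F file=@/etc/hostname "http://fenn:8080/3,01637037d6"
# 3. Read it back
curl -s "http://fenn:8080/3,01637037d6"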
High Availability Cluster Status
The SeaweedFS cluster now operates as a true HA system:
- Raft consensus: HashiCorp Raft manages leadership election and state replication
- Automatic failover: Zero-downtime master transitions when nodes fail
- Leadership rotation: Healthy 3-second leadership cycling for load balancing
- Cluster awareness: Volume servers automatically follow leadership changes
- Fault tolerance: Cluster recovers gracefully from network partitions
- Storage capacity: Nearly 1TB with redundancy and automatic replication
Command Integration
SeaweedFS operations integrate with the goldentooth CLI:
# Deploy SeaweedFS cluster
goldentooth setup_seaweedfs
# Check HA cluster status
goldentooth command fenn,karstark "systemctl status seaweedfs-master seaweedfs-volume"
# View cluster leadership and peers
goldentooth command fenn "curl -s http://localhost:9333/cluster/status | jq"
# Monitor leadership changes
goldentooth command fenn "watch -n 1 'curl -s http://localhost:9333/cluster/status | jq .Leader'"
# Monitor storage utilization
goldentooth command fenn,karstark "df -h /mnt/seaweedfs-ssd"
Step-CA Certificate Monitoring Implementation
With the goldentooth cluster now heavily dependent on Step-CA for certificate management across Consul, Vault, Nomad, Grafana, Loki, Vector, HAProxy, Blackbox Exporter, and the newly deployed SeaweedFS distributed storage, we needed comprehensive certificate monitoring to prevent service outages from expired certificates.
The existing certificate monitoring was basic - we had file-based certificate expiry alerts, but lacked the visibility and proactive alerting necessary for enterprise-grade PKI management.
The Monitoring Challenge
Our cluster runs multiple services with Step-CA certificates:
- Consul: Service mesh certificates for all nodes
- Vault: Secrets management with HA cluster
- Nomad: Workload orchestration across the cluster
- Grafana: Observability dashboard access
- Loki: Log aggregation infrastructure
- Vector: Log shipping to Loki
- HAProxy: Load balancer with TLS termination
- Blackbox Exporter: Synthetic monitoring service
- SeaweedFS: Distributed storage with master/volume servers
Each service has automated certificate renewal via cert-renewer@.service
systemd timers, but we needed comprehensive monitoring to ensure the renewal system itself was healthy and catch any failures before they caused outages.
Enhanced Blackbox Monitoring
The first enhancement expanded our synthetic monitoring to include comprehensive TLS validation for all Step-CA services.
SeaweedFS Integration
With SeaweedFS newly deployed as a high-availability distributed storage system, I added its endpoints to blackbox monitoring:
# SeaweedFS Master servers (HA cluster)
- targets:
- "https://fenn:9333"
- "https://karstark:9333"
labels:
service: "seaweedfs-master"
# SeaweedFS Volume servers
- targets:
- "https://fenn:8080"
- "https://karstark:8080"
labels:
service: "seaweedfs-volume"
Comprehensive TLS Endpoint Monitoring
Every Step-CA managed service now has synthetic TLS validation:
blackbox_https_internal_targets:
- "https://consul.goldentooth.net:8501"
- "https://vault.goldentooth.net:8200"
- "https://nomad.goldentooth.net:4646"
- "https://grafana.goldentooth.net:3000"
- "https://loki.goldentooth.net:3100"
- "https://vector.goldentooth.net:8686"
- "https://fenn:9115" # blackbox exporter itself
- "https://fenn:9333" # seaweedfs master
- "https://karstark:9333"
- "https://fenn:8080" # seaweedfs volume
- "https://karstark:8080"
The blackbox exporter validates not just connectivity, but certificate chain validity, expiry dates, and proper TLS negotiation for each endpoint.
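Individual probes can also be run ad hoc against the blackbox exporter, which is handy when a target starts failing. The module name (https_2xx) is an assumption about what the exporter's configuration defines; substitute whatever module the blackbox config actually uses:
# Probe one TLS endpoint through the blackbox exporter and pull out the relevant metrics
curl -sk "https://fenn:9115/probe?module=https_2xx&target=https://vault.goldentooth.net:8200" \
  | grep -E 'probe_success|probe_ssl_earliest_cert_expiry'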
Advanced Prometheus Alert System
The core enhancement was implementing a comprehensive multi-tier alerting system for certificate management.
Certificate Expiry Alerts
I implemented three tiers of certificate expiry warnings:
# 30-day advance warning
- alert: CertificateExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 1h
labels:
severity: warning
annotations:
summary: "Certificate expiring in 30 days"
description: "Certificate for {{ $labels.instance }} expires in 30 days. Plan renewal."
# 7-day critical alert
- alert: CertificateExpiringCritical
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
for: 5m
labels:
severity: critical
annotations:
summary: "Certificate expiring in 7 days"
description: "Certificate for {{ $labels.instance }} expires in 7 days. Immediate attention required."
# 2-day emergency alert
- alert: CertificateExpiringEmergency
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 2
for: 1m
labels:
severity: critical
annotations:
summary: "Certificate expiring in 2 days"
description: "Certificate for {{ $labels.instance }} expires in 2 days. Emergency action required."
Certificate Renewal System Monitoring
Beyond expiry monitoring, I added alerts for certificate renewal system health:
# File-based certificate monitoring
- alert: CertificateFileExpiring
expr: (file_certificate_expiry_seconds - time()) / 86400 < 7
for: 5m
labels:
severity: warning
annotations:
summary: "Certificate file expiring soon"
description: "Certificate file {{ $labels.path }} expires in less than 7 days"
# Certificate renewal timer failure
- alert: CertificateRenewalTimerFailed
expr: systemd_timer_last_trigger_seconds{name=~"cert-renewer@.*"} < time() - 86400 * 8
for: 10m
labels:
severity: critical
annotations:
summary: "Certificate renewal timer failed"
description: "Certificate renewal timer {{ $labels.name }} hasn't run in over 8 days"
Step-CA Server Health
Critical infrastructure monitoring for the Step-CA service itself:
# Step-CA service availability
- alert: StepCADown
expr: up{job="step-ca"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Step-CA server is down"
description: "Step-CA certificate authority is unreachable"
# TLS endpoint failures
- alert: TLSEndpointDown
expr: probe_success{job=~"blackbox-https.*"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "TLS endpoint unreachable"
description: "TLS endpoint {{ $labels.instance }} is unreachable via HTTPS"
Comprehensive Certificate Dashboard
The monitoring enhancement includes a dedicated Grafana dashboard providing complete PKI visibility.
Dashboard Features
The Step-CA Certificate Dashboard displays:
- Certificate Expiry Timeline: Color-coded visualization showing all certificates with expiry thresholds (green > 30 days, yellow 7-30 days, red < 7 days)
- TLS Endpoint Status: Real-time status of all HTTPS endpoints monitored via blackbox probes
- Certificate Renewal Health: Status of systemd renewal timers across all services
- Step-CA Server Status: Availability and responsiveness of the certificate authority
- Certificate Inventory: Table showing all managed certificates with expiry dates and renewal status
Dashboard Implementation
- name: Deploy Step-CA certificate monitoring dashboard
ansible.builtin.copy:
src: "{{ playbook_dir }}/../grafana-dashboards/step-ca-certificate-dashboard.json"
dest: "/var/lib/grafana/dashboards/"
owner: grafana
group: grafana
mode: '0644'
notify: restart grafana
The dashboard provides at-a-glance visibility into the health of the entire PKI infrastructure, with drill-down capabilities to investigate specific certificate issues.
Infrastructure Integration
Enhanced Grafana Role
The Grafana setup role now includes automated dashboard deployment:
- name: Create dashboards directory
ansible.builtin.file:
path: "/var/lib/grafana/dashboards"
state: directory
owner: grafana
group: grafana
mode: '0755'
- name: Deploy certificate monitoring dashboard
ansible.builtin.copy:
src: "step-ca-certificate-dashboard.json"
dest: "/var/lib/grafana/dashboards/"
owner: grafana
group: grafana
mode: '0644'
notify: restart grafana
Prometheus Configuration Updates
The Prometheus alerting rules required careful template escaping for proper alert message formatting:
# Proper Prometheus alert template escaping
annotations:
summary: "Certificate for {{ "{{ $labels.instance }}" }} expires in 30 days"
description: "Certificate renewal required for {{ "{{ $labels.instance }}" }}"
Service Targets Configuration
All Step-CA certificate endpoints are now systematically monitored:
blackbox_targets:
https_internal:
# Core HashiCorp services
- "https://consul.goldentooth.net:8501"
- "https://vault.goldentooth.net:8200"
- "https://nomad.goldentooth.net:4646"
# Observability stack
- "https://grafana.goldentooth.net:3000"
- "https://loki.goldentooth.net:3100"
- "https://vector.goldentooth.net:8686"
# Infrastructure services
- "https://fenn:9115" # blackbox exporter
# SeaweedFS distributed storage
- "https://fenn:9333" # seaweedfs master
- "https://karstark:9333"
- "https://fenn:8080" # seaweedfs volume
- "https://karstark:8080"
Deployment Results
Monitoring Coverage
The enhanced certificate monitoring now provides:
- Complete PKI visibility: All 20+ Step-CA certificates monitored
- Proactive alerting: 30/7/2 day advance warnings prevent surprises
- System health monitoring: Renewal timer and Step-CA service health tracking
- Synthetic validation: Real TLS endpoint testing via blackbox probes
- Centralized dashboard: Single pane of glass for certificate infrastructure
Alert Integration
The alert system provides:
- Early warning system: 30-day alerts allow planned certificate maintenance
- Escalating severity: 7-day critical and 2-day emergency alerts ensure attention
- Renewal system monitoring: Catches failures in automated renewal timers
- Infrastructure monitoring: Step-CA server availability tracking
Operational Impact
Before this enhancement:
- Basic file-based certificate expiry alerts
- Limited visibility into certificate health
- Potential for service outages from unnoticed certificate expiry
- Manual certificate status checking required
After implementation:
- Enterprise-grade certificate lifecycle monitoring
- Proactive alerting preventing service disruptions
- Complete synthetic validation of certificate-dependent services
- Real-time visibility into PKI infrastructure health
- Automated dashboard providing immediate certificate status overview
Repository Integration
Multi-Repository Changes
The implementation spans two repositories:
goldentooth/ansible: Core infrastructure implementation
- Enhanced blackbox exporter role with SeaweedFS targets
- Comprehensive Prometheus alerting rules
- Improved Grafana role with dashboard deployment
- Certificate monitoring integration across all Step-CA services
goldentooth/grafana-dashboards: Dashboard repository
- New Step-CA Certificate Dashboard with complete PKI visibility
- Dashboard committed for reuse across environments
- JSON format compatible with Grafana provisioning
Command Integration
Certificate monitoring integrates with goldentooth CLI:
# Deploy enhanced certificate monitoring
goldentooth setup_blackbox_exporter
goldentooth setup_grafana
goldentooth setup_prometheus
# Check certificate monitoring status
goldentooth command allyrion "systemctl status blackbox-exporter"
# View certificate expiry alerts
goldentooth command allyrion "curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname | contains(\"Certificate\"))'"
# Monitor renewal timers
goldentooth command_all "systemctl list-timers 'cert-renewer@*'"
This comprehensive Step-CA certificate monitoring implementation transforms goldentooth from basic certificate management to enterprise-grade PKI infrastructure with complete lifecycle visibility, proactive alerting, and automated health monitoring. The system now prevents certificate-related service outages through early warning and comprehensive synthetic validation of all certificate-dependent services.
HAProxy Dataplane API Integration
With the cluster's load balancing infrastructure established through our initial HAProxy setup and subsequent revisiting, the next evolution was to enable dynamic configuration management. HAProxy's traditional configuration model requires service restarts for changes, which creates service disruption and doesn't align well with modern infrastructure automation needs.
The HAProxy Dataplane API provides a RESTful interface for runtime configuration management, allowing backend server manipulation, health check configuration, and statistics collection without HAProxy restarts. This capability is essential for automated deployment pipelines and dynamic infrastructure management.
Implementation Strategy
The implementation focused on integrating HAProxy Dataplane API v3.2.1 into the existing goldentooth.setup_haproxy
Ansible role while maintaining the cluster's security and operational standards.
Configuration Architecture
The API requires a specific YAML v2 configuration format with a nested structure significantly different from HAProxy's traditional flat configuration:
config_version: 2
haproxy:
config_file: /etc/haproxy/haproxy.cfg
userlist: controller
reload:
reload_cmd: systemctl reload haproxy
reload_delay: 5
restart_cmd: systemctl restart haproxy
name: dataplaneapi
mode: single
resources:
maps_dir: /etc/haproxy/maps
ssl_certs_dir: /etc/haproxy/ssl
general_storage_dir: /etc/haproxy/general
spoe_dir: /etc/haproxy/spoe
spoe_transaction_dir: /tmp/spoe-haproxy
backups_dir: /etc/haproxy/backups
config_snippets_dir: /etc/haproxy/config_snippets
acl_dir: /etc/haproxy/acl
transactions_dir: /etc/haproxy/transactions
user:
insecure: false
username: "{{ vault.cluster_credentials.username }}"
password: "{{ vault.cluster_credentials.password }}"
advertised:
api_address: 0.0.0.0
api_port: 5555
This configuration structure enables the API to manage HAProxy through systemd reload commands rather than requiring full restarts, maintaining service availability during configuration changes.
Directory Structure Implementation
The API requires an extensive directory hierarchy for storing various configuration components:
# Primary API configuration
/etc/haproxy-dataplane/
# HAProxy configuration storage
/etc/haproxy/dataplane/
/etc/haproxy/maps/
/etc/haproxy/ssl/
/etc/haproxy/general/
/etc/haproxy/spoe/
/etc/haproxy/acl/
/etc/haproxy/transactions/
/etc/haproxy/config_snippets/
/etc/haproxy/backups/
# Temporary processing
/tmp/spoe-haproxy/
All directories are created with proper ownership (haproxy:haproxy) and permissions to ensure the API service can read and write configuration data while maintaining security boundaries.
HAProxy Configuration Integration
The implementation required specific HAProxy configuration changes to enable API communication:
Master-Worker Mode
global
master-worker
# Admin socket with proper group permissions
stats socket /run/haproxy/admin.sock mode 660 level admin group haproxy
# User authentication for API access
userlist controller
user {{ vault.cluster_credentials.username }} password {{ vault.cluster_credentials.password }}
The master-worker mode enables the API to communicate with HAProxy's runtime through the admin socket, while the userlist provides authentication for API requests.
Backend Configuration
backend haproxy-dataplane-api
server dataplane 127.0.0.1:5555 check
This backend configuration allows external access to the API through the existing reverse proxy infrastructure, integrating seamlessly with the cluster's routing patterns.
Systemd Service Implementation
The service configuration prioritizes security while providing necessary filesystem access:
[Unit]
Description=HAProxy Dataplane API
After=network.target haproxy.service
Requires=haproxy.service
[Service]
Type=exec
User=haproxy
Group=haproxy
ExecStart=/usr/local/bin/dataplaneapi --config-file=/etc/haproxy-dataplane/dataplaneapi.yaml
Restart=always
RestartSec=5
# Security settings
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
# Required filesystem access
ReadWritePaths=/etc/haproxy
ReadWritePaths=/etc/haproxy-dataplane
ReadWritePaths=/var/lib/haproxy
ReadWritePaths=/run/haproxy
ReadWritePaths=/tmp/spoe-haproxy
[Install]
WantedBy=multi-user.target
The security-focused configuration uses ProtectSystem=strict with explicit ReadWritePaths declarations, ensuring the service has access only to required directories while maintaining system protection.
Problem Resolution Process
The implementation encountered several configuration challenges that required systematic debugging:
YAML Configuration Format Issues
Problem: Initial configuration used HAProxy's flat format rather than the required nested YAML v2 structure.
Solution: Implemented proper config_version: 2 with nested haproxy: sections and structured resource directories.
Socket Permission Problems
Problem: HAProxy admin socket was inaccessible to the dataplane API service.
ERRO[0000] error fetching configuration: dial unix /run/haproxy/admin.sock: connect: permission denied
Solution: Added group haproxy to the HAProxy socket configuration, allowing the dataplane API service running as the haproxy user to access the socket.
Directory Permission Resolution
Problem: Multiple permission denied errors for various storage directories.
ERRO[0000] Cannot create dir /etc/haproxy/maps: mkdir /etc/haproxy/maps: permission denied
Solution: Systematically created all required directories with proper ownership:
- name: Create HAProxy dataplane directories
file:
path: "{{ item }}"
state: directory
owner: haproxy
group: haproxy
mode: '0755'
loop:
- /etc/haproxy/dataplane
- /etc/haproxy/maps
- /etc/haproxy/ssl
- /etc/haproxy/general
- /etc/haproxy/spoe
- /etc/haproxy/acl
- /etc/haproxy/transactions
- /etc/haproxy/config_snippets
- /etc/haproxy/backups
- /tmp/spoe-haproxy
Filesystem Write Access
Problem: The /etc/haproxy directory was read-only for the haproxy user, preventing configuration updates.
Solution: Modified directory ownership and permissions to allow write access while maintaining security:
chgrp haproxy /etc/haproxy
chmod g+w /etc/haproxy
Service Integration and Accessibility
The API integrates with the cluster's existing infrastructure patterns:
- Service Discovery: Available at https://haproxy-api.services.goldentooth.net
- Authentication: Uses cluster credentials for API access
- Monitoring: Integrated with existing health check patterns
- Security: TLS termination through existing certificate management
Operational Capabilities
The successful implementation enables several advanced load balancer management capabilities:
Dynamic Backend Management
# Add backend servers without HAProxy restart
curl -X POST https://haproxy-api.services.goldentooth.net/v3/services/haproxy/configuration/servers \
-d '{"name": "new-server", "address": "10.4.1.10", "port": 8080}'
# Modify server weights for traffic distribution
curl -X PUT https://haproxy-api.services.goldentooth.net/v3/services/haproxy/configuration/servers/web1 \
-d '{"weight": 150}'
Health Check Configuration
# Configure health checks dynamically
curl -X PUT https://haproxy-api.services.goldentooth.net/v3/services/haproxy/configuration/backends/web \
-d '{"health_check": {"uri": "/health", "interval": "5s"}}'
Runtime Statistics and Monitoring
The API provides comprehensive runtime statistics and configuration state information, enabling advanced monitoring and automated decision-making for infrastructure management.
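Statistics can be pulled over the same authenticated API. The exact v3 stats path here is an assumption (it mirrors the documented v2 layout), and DATAPLANE_USER/DATAPLANE_PASS are placeholders for the cluster credentials, so treat this as a sketch:
# Pull native HAProxy statistics through the Dataplane API
curl -s -u "${DATAPLANE_USER}:${DATAPLANE_PASS}" \
  https://haproxy-api.services.goldentooth.net/v3/services/haproxy/stats/native | jq '.[0]'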
Current Status and Integration
The HAProxy Dataplane API is now:
- Active and stable on the allyrion load balancer node
- Listening on port 5555 with proper systemd management
- Responding to HTTP API requests with full functionality
- Integrated with HAProxy through the admin socket interface
- Accessible externally via the configured domain endpoint
- Authenticated using cluster credential standards
This implementation represents a significant enhancement to the cluster's load balancing capabilities, moving from static configuration management to dynamic, API-driven infrastructure control. The systematic approach to troubleshooting configuration issues demonstrates the methodical problem-solving required for complex infrastructure integration while maintaining operational security and reliability standards.
Dynamic Service Discovery with Consul + HAProxy Dataplane API
Building upon our HAProxy Dataplane API integration, the next architectural evolution was implementing dynamic service discovery. This transformation moved the cluster away from static backend configurations toward a fully dynamic, Consul-driven service mesh architecture where services can relocate between nodes without manual load balancer reconfiguration.
The Static Configuration Problem
Traditional HAProxy configurations require explicit backend server definitions:
backend grafana-backend
server grafana1 10.4.1.15:3000 check ssl verify none
server grafana2 10.4.1.16:3000 check ssl verify none backup
This approach creates several operational challenges:
- Manual Updates: Adding or removing services requires HAProxy configuration changes
- Node Dependencies: Services tied to specific IP addresses can't migrate freely
- Health Check Duplication: Both HAProxy and service discovery systems monitor health
- Configuration Drift: Static configurations become outdated as infrastructure evolves
Dynamic Service Discovery Architecture
The new implementation leverages Consul's service registry with HAProxy Dataplane API's dynamic backend creation:
Service Registration → Consul Service Registry → HAProxy Dataplane API → Dynamic Backends
Core Components
- Consul Service Registry: Central service discovery database
- Service Registration Template: Reusable Ansible template for consistent service registration
- HAProxy Dataplane API: Dynamic backend management interface
- Service-to-Backend Mappings: Configuration linking Consul services to HAProxy backends
Implementation: Reusable Service Registration Template
The foundation of dynamic service discovery is the consul-service-registration.json.j2 template in the goldentooth.setup_consul role:
{
"name": "{{ consul_service_name }}",
"id": "{{ consul_service_name }}-{{ ansible_hostname }}",
"address": "{{ consul_service_address | default(ipv4_address) }}",
"port": {{ consul_service_port }},
"tags": {{ consul_service_tags | default(['goldentooth']) | to_json }},
"meta": {
"version": "{{ consul_service_version | default('unknown') }}",
"environment": "{{ consul_service_environment | default('production') }}",
"service_type": "{{ consul_service_type | default('application') }}",
"cluster": "goldentooth",
"hostname": "{{ ansible_hostname }}",
"protocol": "{{ consul_service_protocol | default('http') }}",
"path": "{{ consul_service_health_path | default('/') }}"
},
"checks": [
{
"id": "{{ consul_service_name }}-http-health",
"name": "{{ consul_service_name | title }} HTTP Health Check",
"http": "{{ consul_service_health_http }}",
"method": "{{ consul_service_health_method | default('GET') }}",
"interval": "{{ consul_service_health_interval | default('30s') }}",
"timeout": "{{ consul_service_health_timeout | default('10s') }}",
"status": "passing"
}
]
}
This template provides:
- Standardized Service Registration: Consistent metadata and health check patterns
- Flexible Health Checks: HTTP and TCP checks with configurable endpoints
- Rich Metadata: Protocol, version, and environment information for routing decisions
- Health Check Integration: Native Consul health monitoring replacing static HAProxy checks
Service Integration Patterns
Grafana Service Registration
The goldentooth.setup_grafana role demonstrates the integration pattern:
- name: Register Grafana with Consul
include_role:
name: goldentooth.setup_consul
tasks_from: register_service
vars:
consul_service_name: grafana
consul_service_port: 3000
consul_service_tags:
- monitoring
- dashboard
- goldentooth
- https
consul_service_type: monitoring
consul_service_protocol: https
consul_service_health_path: /api/health
consul_service_health_http: "https://{{ ipv4_address }}:3000/api/health"
consul_service_health_tls_skip_verify: true
This registration creates a Grafana service entry in Consul with:
- HTTPS Health Checks: Direct validation of Grafana's API endpoint
- Service Metadata: Rich tagging for service discovery and routing
- TLS Configuration: Proper SSL handling for encrypted services
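After the role runs, the registration can be verified directly against the local Consul agent. This is a sketch using the standard health API; in this cluster the agent may require HTTPS on port 8501 rather than plain HTTP on 8500, depending on agent configuration:
# Show Grafana's registration and the status of its HTTP health check
curl -s http://127.0.0.1:8500/v1/health/service/grafana \
  | jq '.[] | {node: .Node.Node, address: .Service.Address, port: .Service.Port, checks: [.Checks[].Status]}'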
Service-Specific Health Check Endpoints
Each service uses appropriate health check endpoints:
- Grafana: /api/health - Grafana's built-in health endpoint
- Prometheus: /-/healthy - Standard Prometheus health check
- Loki: /ready - Loki readiness endpoint
- MCP Server: /health - Custom health endpoint
HAProxy Dataplane API Configuration
The dataplaneapi.yaml.j2 template defines service-to-backend mappings:
service_discovery:
consuls:
- address: 127.0.0.1:8500
enabled: true
services:
- name: grafana
backend_name: consul-grafana
mode: http
balance: roundrobin
check: enabled
check_ssl: enabled
check_path: /api/health
ssl: enabled
ssl_verify: none
- name: prometheus
backend_name: consul-prometheus
mode: http
balance: roundrobin
check: enabled
check_path: /-/healthy
- name: loki
backend_name: consul-loki
mode: http
balance: roundrobin
check: enabled
check_ssl: enabled
check_path: /ready
ssl: enabled
ssl_verify: none
This configuration:
- Maps Consul Services: Links service registry entries to HAProxy backends
- Configures SSL Settings: Handles HTTPS services with appropriate SSL verification
- Defines Load Balancing: Sets algorithm and health check behavior per service
- Creates Dynamic Backends: Automatically generates consul-* backend names
Frontend Routing Transformation
HAProxy frontend configuration transitioned from static to dynamic backends:
Before: Static Backend References
frontend goldentooth-services
use_backend grafana-backend if { hdr(host) -i grafana.services.goldentooth.net }
use_backend prometheus-backend if { hdr(host) -i prometheus.services.goldentooth.net }
After: Dynamic Backend References
frontend goldentooth-services
use_backend consul-grafana if { hdr(host) -i grafana.services.goldentooth.net }
use_backend consul-prometheus if { hdr(host) -i prometheus.services.goldentooth.net }
use_backend consul-loki if { hdr(host) -i loki.services.goldentooth.net }
use_backend consul-mcp-server if { hdr(host) -i mcp.services.goldentooth.net }
The consul-* naming convention distinguishes dynamically managed backends from static ones.
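Whether the dynamic backends actually materialized can be checked straight from HAProxy's runtime socket, the same admin socket the Dataplane API uses. A quick sketch, assuming socat is installed on the load balancer node:
# List all backends; the consul-* entries are the dynamically managed ones
echo "show backend" | socat stdio /run/haproxy/admin.sock
# Inspect the servers Consul discovery has populated for one backend
echo "show servers state consul-grafana" | socat stdio /run/haproxy/admin.sock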
Multi-Service Role Implementation
Each service role now includes Consul registration:
Prometheus Registration
- name: Register Prometheus with Consul
include_role:
name: goldentooth.setup_consul
tasks_from: register_service
vars:
consul_service_name: prometheus
consul_service_port: 9090
consul_service_health_http: "http://{{ ipv4_address }}:9090/-/healthy"
Loki Registration
- name: Register Loki with Consul
include_role:
name: goldentooth.setup_consul
tasks_from: register_service
vars:
consul_service_name: loki
consul_service_port: 3100
consul_service_health_http: "https://{{ ipv4_address }}:3100/ready"
consul_service_health_tls_skip_verify: true
MCP Server Registration
- name: Register MCP Server with Consul
include_role:
name: goldentooth.setup_consul
tasks_from: register_service
vars:
consul_service_name: mcp-server
consul_service_port: 3001
consul_service_health_http: "http://{{ ipv4_address }}:3001/health"
Technical Benefits
Service Mobility
Services can now migrate between nodes without load balancer reconfiguration. When a service starts on a different node, it registers with Consul, and HAProxy automatically updates backend server lists.
Health Check Integration
Consul's health checking replaces static HAProxy health checks, providing:
- Centralized Health Monitoring: Single source of truth for service health
- Rich Health Check Types: HTTP, TCP, script-based, and TTL checks
- Health Check Inheritance: HAProxy backends inherit health status from Consul
Configuration Simplification
Static backend definitions are eliminated, reducing HAProxy configuration complexity and maintenance overhead.
Service Discovery Foundation
The implementation establishes patterns for:
- Service Registration: Standardized across all cluster services
- Health Check Consistency: Uniform health monitoring approaches
- Metadata Management: Rich service information for advanced routing
- Dynamic Backend Naming: Clear separation between static and dynamic backends
Operational Impact
Deployment Flexibility
Services can be deployed to any cluster node without infrastructure configuration changes. The service registers itself with Consul, and HAProxy automatically includes it in load balancing.
Zero-Downtime Updates
Service updates can leverage Consul's health checking for gradual rollouts. Unhealthy instances are automatically removed from load balancing until they pass health checks.
Monitoring Integration
Consul's web UI provides real-time service health visualization, complementing existing Prometheus/Grafana monitoring infrastructure.
Future Service Mesh Evolution
This implementation represents the foundation for comprehensive service mesh architecture:
- Additional Service Registration: Extending dynamic discovery to all cluster services
- Advanced Routing: Consul metadata-based traffic routing and service versioning
- Security Integration: Service-to-service authentication and authorization
- Circuit Breaking: Automated failure handling and traffic management
The transformation from static to dynamic service discovery fundamentally changes how the Goldentooth cluster manages service routing, establishing patterns that will support continued infrastructure evolution and automation.
SeaweedFS Pi 5 Migration and CSI Integration
After the successful initial SeaweedFS deployment on the Pi 4B nodes (fenn and karstark), a significant hardware upgrade opportunity arose. Four new Raspberry Pi 5 nodes with 1TB NVMe SSDs had joined the cluster: Manderly, Norcross, Oakheart, and Payne. This chapter chronicles the complete migration of the SeaweedFS distributed storage system to these more powerful nodes and the resolution of critical clustering issues that enabled full Kubernetes CSI integration.
The New Hardware Foundation
Meet the Storage Powerhouses
The four new Pi 5 nodes represent a massive upgrade in storage capacity and performance:
- Manderly (10.4.0.22) - 1TB NVMe SSD via PCIe
- Norcross (10.4.0.23) - 1TB NVMe SSD via PCIe
- Oakheart (10.4.0.24) - 1TB NVMe SSD via PCIe
- Payne (10.4.0.25) - 1TB NVMe SSD via PCIe
Total Raw Capacity: 4TB across four nodes (vs. ~1TB across two Pi 4B nodes)
Performance Characteristics
The Pi 5 + NVMe combination delivers substantial improvements:
- Storage Interface: PCIe NVMe vs. USB 3.0 SSD
- Sequential Read/Write: ~400MB/s vs. ~100MB/s
- Random IOPS: 10x improvement for small file operations
- CPU Performance: Cortex-A76 vs. Cortex-A72 cores
- Memory: 8GB LPDDR4X vs. 4GB on old nodes
Migration Strategy
Cluster Topology Decision
Rather than attempt in-place migration, the decision was made to completely rebuild the SeaweedFS cluster on the new hardware. This approach provided:
- Clean Architecture: No legacy configuration artifacts
- Improved Topology: Optimize for 4-node distributed storage
- Zero Downtime: Keep old cluster running during migration
- Rollback Safety: Ability to revert if issues arose
Node Role Assignment
The four Pi 5 nodes were configured with hybrid roles to maximize both performance and fault tolerance:
- Masters: Manderly, Norcross, Oakheart (3-node Raft consensus)
- Volume Servers: All four nodes (maximizing storage capacity)
This design provides proper Raft consensus with an odd number of masters while utilizing all available storage capacity.
The Critical Discovery: Raft Consensus Requirements
The Leadership Election Problem
The initial migration attempt using all four nodes as masters immediately revealed a fundamental issue:
F0804 21:16:33.246267 master.go:285 Only odd number of masters are supported:
[10.4.0.22:9333 10.4.0.23:9333 10.4.0.24:9333 10.4.0.25:9333]
SeaweedFS requires an odd number of masters for Raft consensus. This is a fundamental requirement of distributed consensus algorithms to avoid split-brain scenarios where no majority can be established.
The Mathematics of Consensus
With 4 masters:
- Split scenarios: 2-2 splits prevent majority formation
- Leadership impossible: No node can achieve >50% votes
- Cluster paralysis: "Leader not selected yet" errors continuously
With 3 masters:
- Majority possible: 2 out of 3 can form majority
- Fault tolerance: 1 node failure still allows operation
- Clear leadership: Proper Raft election cycles
Infrastructure Template Updates
Fixing Hardcoded Configurations
The migration revealed template issues that needed correction:
Dynamic Peer Discovery
# Before (hardcoded)
-peers=fenn:9333,karstark:9333
# After (dynamic)
-peers={% for host in groups['seaweedfs'] %}{{ host }}:9333{% if not loop.last %},{% endif %}{% endfor %}
Consul Service Template Fix
{
"peer_addresses": "{% for host in groups['seaweedfs'] %}{{ host }}:9333{% if not loop.last %},{% endif %}{% endfor %}"
}
Removing Problematic Parameters
The -ip= parameter in master service templates was causing duplicate peer entries:
# Problematic configuration
ExecStart=/usr/local/bin/weed master \
-port=9333 \
-mdir=/mnt/seaweedfs-nvme/master \
-peers=manderly:9333,norcross:9333,oakheart:9333 \
-ip=10.4.0.22 \ # <-- This caused duplicates
-raftHashicorp=true
# Clean configuration
ExecStart=/usr/local/bin/weed master \
-port=9333 \
-mdir=/mnt/seaweedfs-nvme/master \
-peers=manderly:9333,norcross:9333,oakheart:9333 \
-raftHashicorp=true
Kubernetes CSI Integration Challenge
The DNS Resolution Problem
With the SeaweedFS cluster running on bare metal and Kubernetes CSI components running in pods, a networking challenge emerged:
Problem: Kubernetes pods couldn't resolve SeaweedFS node hostnames because they exist outside the cluster DNS.
Solution: Kubernetes Services with explicit Endpoints to bridge the DNS gap.
Service-Based DNS Resolution
# Headless service for each SeaweedFS node
apiVersion: v1
kind: Service
metadata:
  name: manderly
  namespace: default
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: master
      port: 9333
    - name: volume
      port: 8080
---
# Explicit endpoint mapping
apiVersion: v1
kind: Endpoints
metadata:
  name: manderly
  namespace: default
subsets:
  - addresses:
      - ip: 10.4.0.22
    ports:
      - name: master
        port: 9333
      - name: volume
        port: 8080
This approach allows the SeaweedFS filer (running in Kubernetes) to connect to the bare-metal cluster using service names like manderly:9333.
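A quick way to confirm the bridge behaves as intended is to resolve and probe one of the bridged names from a throwaway pod (busybox image, and the manderly service defined above; adjust names for your cluster):

# Resolve the bridged hostname from inside the cluster
kubectl run dnscheck --rm -it --restart=Never --image=busybox -- \
  nslookup manderly.default.svc.cluster.local

# Hit the master status endpoint through the Endpoints object
kubectl run statuscheck --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://manderly.default.svc.cluster.local:9333/cluster/status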
Migration Execution
Phase 1: Infrastructure Preparation
# Update inventory to reflect new nodes
goldentooth edit_vault
# Configure new SeaweedFS group with Pi 5 nodes
# Clean deployment of storage infrastructure
goldentooth cleanup_old_storage
goldentooth setup_seaweedfs
Phase 2: Cluster Formation with Proper Topology
# Deploy 3-master configuration
goldentooth command_root manderly,norcross,oakheart "systemctl start seaweedfs-master"
# Verify leadership election
curl http://10.4.0.22:9333/dir/status
# Start volume servers on all nodes
goldentooth command_root manderly,norcross,oakheart,payne "systemctl start seaweedfs-volume"
Phase 3: Kubernetes Integration
# Deploy DNS bridge services
kubectl apply -f seaweedfs-services-endpoints.yaml
# Deploy and verify filer
kubectl get pods -l app=seaweedfs-filer
kubectl logs seaweedfs-filer-xxx | grep "Start Seaweed Filer"
Verification and Testing
Cluster Health Verification
# Leadership confirmation
curl http://10.4.0.22:9333/cluster/status
# Returns proper topology with elected leader
# Service status across all nodes
goldentooth command manderly,norcross,oakheart,payne "systemctl status seaweedfs-master seaweedfs-volume"
CSI Integration Testing
# Test PVC creation
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: seaweedfs-test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
  storageClassName: seaweedfs-storage
Result: Successful dynamic volume provisioning with NFS-style mounting via seaweedfs-filer:8888:/buckets/pvc-xxx.
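The test pod used in the end-to-end checks below was along these lines; a minimal sketch that mounts the claim at /data (pod name and image are incidental choices, not the exact manifest used):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: seaweedfs-test-pvc
EOF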
End-to-End Functionality
# Pod with mounted SeaweedFS volume
kubectl exec test-pod -- df -h /data
# Filesystem: seaweedfs-filer:8888:/buckets/pvc-xxx Size: 512M
# File I/O verification
kubectl exec test-pod -- touch /data/test-file
kubectl exec test-pod -- ls -la /data/
# Files persist across pod restarts via distributed storage
Final Architecture
Cluster Topology
- Masters: 3 nodes (Manderly, Norcross, Oakheart) with Raft consensus
- Volume Servers: 4 nodes (all Pi 5s) with 1TB NVMe each
- Total Capacity: ~3.6TB usable distributed storage
- Fault Tolerance: Can survive 1 master failure + multiple volume server failures
- Performance: NVMe speeds with distributed redundancy
Integration Status
- ✅ Kubernetes CSI: Dynamic volume provisioning working
- ✅ DNS Resolution: Service-based hostname resolution
- ✅ Leadership Election: Stable Raft consensus
- ✅ Filer Services: HTTP/gRPC endpoints operational
- ✅ Volume Mounting: NFS-style filesystem access
- ✅ High Availability: Multi-node fault tolerance
Monitoring Integration
SeaweedFS metrics integrate with the existing Goldentooth observability stack:
- Prometheus: Master and volume server metrics collection
- Grafana: Storage capacity and performance dashboards
- Consul: Service discovery and health monitoring
- Step-CA: TLS certificate management for secure communications
Performance Impact
Storage Capacity Comparison
| Metric | Old Cluster (Pi 4B) | New Cluster (Pi 5) | Improvement |
|---|---|---|---|
| Total Capacity | ~1TB | ~3.6TB | 3.6x |
| Node Count | 2 | 4 | 2x |
| Per-Node Storage | 500GB | 1TB | 2x |
| Storage Interface | USB 3.0 SSD | PCIe NVMe | PCIe speeds |
| Fault Tolerance | Single failure | Multi-failure | Higher |
Architectural Benefits
- Proper Consensus: 3-master Raft eliminates split-brain scenarios
- Expanded Capacity: 3.6TB enables larger workloads and datasets
- Performance Scaling: NVMe storage handles high-IOPS workloads
- Kubernetes Native: CSI integration enables GitOps storage workflows
- Future Ready: Foundation for S3 gateway and advanced SeaweedFS features
P5.js Creative Coding Platform
Goldentooth's journey into creative computing required a platform for hosting and showcasing interactive p5.js sketches. The p5js-sketches project emerged as a Kubernetes-native solution that combines modern DevOps practices with artistic expression, providing a robust foundation for creative coding experiments and demonstrations.
Project Overview
Vision and Purpose
The p5js-sketches platform serves multiple purposes within the Goldentooth ecosystem:
- Creative Expression: A canvas for computational art and interactive visualizations
- Educational Demos: Showcase machine learning algorithms and mathematical concepts
- Technical Exhibition: Demonstrate Kubernetes deployment patterns for static content
- Community Sharing: Provide a gallery format for browsing and discovering sketches
Architecture Philosophy
The platform embraces cloud-native principles while optimizing for the unique constraints of a Raspberry Pi cluster:
- Container-Native: Docker-based deployments with multi-architecture support
- GitOps Workflow: Code-to-deployment automation via Argo CD
- Edge-Optimized: Resource limits tailored for ARM64 Pi hardware
- Automated Content: CI/CD pipeline for preview generation and deployment
Technical Architecture
Core Components
The platform consists of several integrated components:
Static File Server
- Base: nginx optimized for ARM64 Raspberry Pi hardware
- Content: p5.js sketches with HTML, JavaScript, and assets
- Security: Non-root container with read-only filesystem
- Performance: Tuned for low-memory Pi environments
Storage Foundation
- Backend: local-path storage provisioner
- Capacity: 10Gi persistent volume for sketch content
- Limitation: Single-replica deployment (ReadWriteOnce constraint)
- Future: Ready for migration to SeaweedFS distributed storage
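That storage layer amounts to a single ReadWriteOnce claim against the local-path provisioner; a minimal sketch (claim name and namespace are assumptions, not the chart's actual values):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: p5js-sketches-content
  namespace: p5js-sketches
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
EOF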
Networking Integration
- Load Balancer: MetalLB for external access
- DNS: external-dns automatic hostname management
- SSL: Future integration with cert-manager and Step-CA
Container Configuration
The deployment leverages advanced Kubernetes security features:
# Security hardening
security:
  runAsNonRoot: true
  runAsUser: 101   # nginx user
  runAsGroup: 101
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true

# Resource optimization for Pi hardware
resources:
  requests:
    memory: "32Mi"
    cpu: "50m"
  limits:
    memory: "64Mi"
    cpu: "100m"
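Once deployed, the values that actually landed on the running workload can be spot-checked (the namespace and deployment names here are assumptions):

# Inspect the effective resource limits on the running container
kubectl -n p5js-sketches get deployment p5js-sketches \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'

# Confirm the container security context was applied
kubectl -n p5js-sketches get deployment p5js-sketches \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}{"\n"}'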
Deployment Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ GitHub Repo │───▶│ Argo CD │───▶│ Kubernetes │
│ p5js-sketches │ │ GitOps │ │ Deployment │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ GitHub Actions │ │ nginx Pod │
│ Preview Gen │ │ serving static │
└─────────────────┘ │ content │
└─────────────────┘
Automated Preview Generation System
The Challenge
p5.js sketches are interactive, often animated programs, so a gallery can't rely on hand-captured screenshots staying current. The platform needed a way to automatically generate preview images that capture the essence of each sketch's visual output.
Solution: Headless Browser Automation
The preview generation system uses Puppeteer for sophisticated browser automation:
Technology Stack
- Puppeteer v21.5.0: Headless Chrome automation
- GitHub Actions: CI/CD execution environment
- Node.js: Runtime environment for capture scripts
- Canvas Capture: Direct p5.js canvas element extraction
Capture Process
const CONFIG = {
  sketches_dir: './sketches',
  capture_delay: 10000,      // Wait for sketch initialization
  animation_duration: 3000,  // Record animation period
  viewport: { width: 600, height: 600 },
  screenshot_options: {
    type: 'png',
    clip: { x: 0, y: 0, width: 400, height: 400 } // Crop to canvas
  }
};
Advanced Capture Features
Sketch Lifecycle Awareness
- Initialization Delay: Configurable per-sketch startup time
- Animation Sampling: Capture representative frames from animations
- Canvas Detection: Automatic identification of p5.js canvas elements
- Error Handling: Graceful fallback for problematic sketches
GitHub Actions Integration
on:
  push:
    paths:
      - 'sketches/**'     # Trigger on sketch modifications
  workflow_dispatch:      # Manual execution capability
    inputs:
      force_regenerate:   # Regenerate all previews
      capture_delay:      # Configurable timing
Automated Workflow
- Trigger Detection: Sketch files modified or manual dispatch
- Environment Setup: Node.js, Puppeteer browser installation
- Dependency Caching: Optimize build times with browser cache
- Preview Generation: Execute capture script across all sketches
- Change Detection: Identify new or modified preview images
- Auto-Commit: Commit generated images back to repository
- Artifact Upload: Preserve previews for debugging and archives
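Stripped of the Actions boilerplate, the job boils down to a few shell steps; roughly the following (the capture script path and commit message are illustrative, not the repository's exact layout):

# Install pinned dependencies, including Puppeteer's bundled Chromium
npm ci

# Walk sketches/, render each one headlessly, and write preview.png files
node scripts/generate-previews.js

# Commit any previews that changed back to the repository
git add sketches/*/preview.png
git diff --cached --quiet || git commit -m "chore: regenerate sketch previews"
git push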
Sketch Organization and Metadata
Directory Structure
Each sketch follows a standardized organization pattern:
sketches/
├── linear-regression/
│ ├── index.html # Entry point with p5.js setup
│ ├── sketch.js # Main p5.js code
│ ├── style.css # Styling and layout
│ ├── metadata.json # Sketch configuration
│ ├── preview.png # Auto-generated preview (400x400)
│ └── libraries/ # p5.js and extensions
│ ├── p5.min.js
│ └── p5.sound.min.js
└── robbie-the-robot/
├── index.html
├── main.js # Entry point
├── robot.js # Agent implementation
├── simulation.js # GA evolution logic
├── world.js # Environment simulation
├── ga-worker.js # Web Worker for GA
├── metadata.json
├── preview.png
└── libraries/
Metadata Configuration
Each sketch includes rich metadata for gallery display and capture configuration:
{
  "title": "Robby GA with Worker",
  "description": "Genetic algorithm simulation where robots learn to collect cans in a grid world using neural network evolution",
  "isAnimated": true,
  "captureDelay": 30000,
  "lastUpdated": "2025-08-04T19:06:01.506Z"
}
Metadata Fields
- title: Display name for gallery
- description: Detailed explanation of the sketch concept
- isAnimated: Indicates dynamic content requiring longer capture
- captureDelay: Custom initialization time in milliseconds
- lastUpdated: Automatic timestamp for version tracking
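Because every sketch carries the same metadata shape, quick audits are easy. For example, listing each sketch's animation flag and capture delay (assuming jq is installed; the 10-second fallback mirrors the capture script's default):

# Print "<sketch> <isAnimated> <captureDelay>" for every sketch
for m in sketches/*/metadata.json; do
  printf '%s ' "$(dirname "$m")"
  jq -r '"\(.isAnimated)\t\(.captureDelay // 10000)"' "$m"
done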
Example Sketches
Linear Regression Visualization
An educational demonstration of machine learning fundamentals:
Purpose: Interactive visualization of gradient descent optimization
Features:
- Real-time data point plotting
- Animated regression line fitting
- Loss function visualization
- Parameter adjustment controls
Technical Implementation:
- Single-file sketch with mathematical calculations
- Real-time chart updates using p5.js drawing primitives
- Interactive mouse controls for data manipulation
Robbie the Robot - Genetic Algorithm
A sophisticated multi-agent simulation demonstrating evolutionary computation:
Purpose: Showcase genetic algorithms learning optimal can-collection strategies
Features:
- Multi-generational population evolution
- Neural network-based agent decision making
- Web Worker-based GA computation for performance
- Real-time fitness and generation statistics
Technical Architecture:
- Main Thread: p5.js rendering and user interaction
- Web Worker: Genetic algorithm computation (ga-worker.js)
- Modular Design: Separate files for robot, simulation, and world logic
- Performance Optimization: Efficient canvas rendering for multiple agents
Deployment Integration
Helm Chart Configuration
The platform uses Helm for templated Kubernetes deployments:
# Chart.yaml
apiVersion: 'v2'
name: 'p5js-sketches'
description: 'P5.js Sketch Server - Static file server for hosting p5.js sketches'
type: 'application'
version: '0.0.1'
Key Templates:
- Deployment: nginx container with security hardening
- Service: LoadBalancer with MetalLB integration
- ConfigMap: nginx configuration optimization
- Namespace: Isolated environment for sketch server
- ServiceAccount: RBAC configuration for security
Argo CD GitOps Integration
The platform deploys automatically via Argo CD:
Repository Structure:
- Source: github.com/goldentooth/p5js-sketches
- Target: p5js-sketches namespace
- Sync Policy: Automatic deployment on git changes
- Health Checks: Kubernetes-native readiness and liveness probes
Deployment URL: https://p5js-sketches.services.k8s.goldentooth.net/
Gallery and User Experience
Automated Gallery Generation
The platform includes sophisticated gallery generation:
Features:
- Responsive Grid: CSS Grid layout optimized for various screen sizes
- Preview Integration: Auto-generated preview images with fallbacks
- Metadata Display: Title, description, and technical details
- Interactive Navigation: Direct links to individual sketches
- Search and Filter: Future enhancement for large sketch collections
Template System:
<!-- Gallery template with dynamic sketch injection -->
<div class="gallery-grid">
  {{#each sketches}}
  <div class="sketch-card">
    <img src="{{preview}}" alt="{{title}}" loading="lazy">
    <h3>{{title}}</h3>
    <p>{{description}}</p>
    <a href="{{url}}" class="sketch-link">View Sketch</a>
  </div>
  {{/each}}
</div>
CLI Ergonomics
The Goldentooth CLI was reworked from a verbose, Ansible-heavy interface into a leaner command suite that serves both human operators and programmatic consumers. The overhaul introduced direct SSH operations, a smarter MOTD system, distributed computing integration, and roughly 3x faster execution for multi-host operations.
The Transformation
From Ansible-Heavy to SSH-Native Operations
The original CLI relied exclusively on Ansible playbooks for every operation, creating unnecessary overhead for simple tasks. The new architecture introduces direct SSH operations that bypass Ansible entirely for appropriate use cases:
Before: Every command required Ansible overhead
# Old approach - always through Ansible
goldentooth command all "systemctl status consul" # ~10-15 seconds
After: Direct SSH with intelligent routing
# New approach - direct SSH operations
goldentooth shell bettley # Instant interactive session
goldentooth command all "systemctl status consul" # ~3-5 seconds with parallel
Revolutionary SSH-Based Command Suite
Interactive Shell Sessions
The shell command provides seamless access to cluster nodes with intelligent behavior:
# Single node - direct SSH session with beautiful MOTD
goldentooth shell bettley
# Multiple nodes - broadcast mode with synchronized output
goldentooth shell all
Smart Behavior:
- Single node: Interactive SSH session with full MOTD display
- Multiple nodes: Broadcast mode with synchronized command execution
- Automatic host resolution from Ansible inventory groups
Stream Processing with Pipe
The pipe command transforms stdin into distributed execution:
# Stream commands to multiple nodes
echo "df -h" | goldentooth pipe storage_nodes
echo "systemctl status consul" | goldentooth pipe consul_server
Advanced Features:
- Comment filtering (lines starting with # are ignored)
- Empty line skipping for clean script processing
- Parallel execution across multiple hosts
- Clean error handling and output formatting
File Transfer with CP
Node-aware file transfer using intuitive syntax:
# Copy from cluster to local
goldentooth cp bettley:/var/log/consul.log ./logs/
# Copy from local to cluster
goldentooth cp ./config.yaml allyrion:/etc/myapp/
# Inter-node transfers
goldentooth cp allyrion:/tmp/data.json bettley:/opt/processing/
Batch Script Execution
Execute shell scripts across the cluster:
# Run maintenance script on storage nodes
goldentooth batch maintenance.sh storage_nodes
# Execute deployment script on all nodes
goldentooth batch deploy.sh all
Multi-line Command Execution
The heredoc command enables complex multi-line operations:
goldentooth heredoc consul_server <<'EOF'
consul kv put config/database/host "db.goldentooth.net"
consul kv put config/database/port "5432"
systemctl reload myapp
EOF
Performance Architecture
GNU Parallel Integration
The CLI intelligently detects and leverages GNU parallel for concurrent operations:
Automatic Parallelization:
- Single host: Direct SSH connection
- Multiple hosts: GNU parallel with job control (-j0 for optimal concurrency)
- Fallback: Sequential execution if parallel is unavailable
Performance Improvements:
- 3x faster execution for multi-host operations
- Optimal resource utilization across cluster nodes
- Tagged output for clear host identification
Intelligent SSH Configuration
Optimized SSH behavior for different use cases:
Clean Command Output:
ssh_opts="-T -o StrictHostKeyChecking=no -o LogLevel=ERROR -q"
Features:
- The -T flag disables pseudo-terminal allocation (suppresses MOTD for commands)
- Error suppression for clean programmatic consumption
- Connection optimization for repeated operations
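Put together, the dispatch path amounts to something like the following sketch (simplified and hypothetical; the real CLI adds host resolution from the inventory, richer error handling, and nicer output formatting):

ssh_opts="-T -o StrictHostKeyChecking=no -o LogLevel=ERROR -q"
hosts="$1"   # e.g. "allyrion bettley cargyll", already resolved from the inventory
cmd="$2"

if [ "$(echo "$hosts" | wc -w)" -eq 1 ]; then
  # Single host: plain SSH, no parallel overhead
  ssh $ssh_opts "$hosts" -- "$cmd"
else
  # Multiple hosts: fan out with GNU parallel, tagging each line with its host
  echo "$hosts" | tr ' ' '\n' | parallel -j0 --tag "ssh $ssh_opts {} -- '$cmd'"
fi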
MOTD System Overhaul
Visual Node Identification
Each cluster node features unique ASCII art MOTD for instant visual recognition:
Implementation:
- Node-specific colorized ASCII artwork stored in /etc/motd
- Beautiful visual identification during interactive SSH sessions
- SSH configured with PrintMotd yes for proper display
Examples:
- bettley: Distinctive golden-colored ASCII art design
- allyrion: Unique visual signature for immediate recognition
- Each node: Custom artwork matching cluster theme and node personality
Smart MOTD Behavior
The system provides context-appropriate MOTD display:
- Interactive Sessions: Full MOTD display with ASCII art
- Command Execution: Suppressed MOTD for clean output
- Programmatic Access: No visual interference with data processing
Technical Implementation:
- Removed complex PAM-based conditional MOTD system
- Leveraged SSH's built-in PrintMotd behavior
- Clean separation between interactive and programmatic access
Inventory Integration System
Ansible Group Compatibility
The CLI seamlessly integrates Ansible inventory definitions with SSH operations:
Inventory Parsing:
# parse-inventory.py converts YAML inventory to bash functions
def generate_bash_variables(groups):
    # Creates the goldentooth:resolve_hosts() function
    # Generates case statements for each group
    # Maintains compatibility with existing Ansible workflows
    ...
Generated Functions:
function goldentooth:resolve_hosts() {
  local expression="$1"   # host or group name passed on the command line
  case "$expression" in
    "consul_server")
      echo "allyrion bettley cargyll"
      ;;
    "storage_nodes")
      echo "jast karstark lipps"
      ;;
    # ... all inventory groups
  esac
}
Installation Integration:
- Inventory parsing during CLI installation (make install)
- Automatic generation of /usr/local/bin/goldentooth-inventory.sh
- Dynamic loading of inventory groups into the CLI
Distributed LLaMA Integration
Cross-Platform Compilation
Advanced cross-compilation support for ARM64 distributed computing:
Architecture:
- x86_64 Velaryon node: Cross-compilation host
- ARM64 Pi nodes: Deployment targets
- Automated binary distribution and service management
Commands:
# Model management
goldentooth dllama_download_model meta-llama/Llama-3.2-1B
# Service lifecycle
goldentooth dllama_start_workers
goldentooth dllama_stop
# Cluster status
goldentooth dllama_status
# Distributed inference
goldentooth dllama_inference "Explain quantum computing"
Technical Features:
- Automatic model download and conversion
- Distributed worker node management
- Cross-architecture binary deployment
- Performance monitoring and status reporting
Command Line Interface Enhancements
Bash Completion System
Comprehensive tab completion for all operations:
Features:
- Command completion for all CLI functions
- Host and group name completion
- Context-aware parameter suggestions
- Integration with existing shell environments
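The wiring follows the standard bash-completion pattern; a minimal hypothetical sketch (the shipped completion script covers more commands and edge cases):

_goldentooth_complete() {
  local cur="${COMP_WORDS[COMP_CWORD]}"
  if [ "$COMP_CWORD" -eq 1 ]; then
    # First argument: complete against CLI subcommands
    COMPREPLY=( $(compgen -W "shell command pipe cp batch heredoc help list_groups" -- "$cur") )
  else
    # Later arguments: complete against inventory groups reported by the CLI
    COMPREPLY=( $(compgen -W "$(goldentooth list_groups 2>/dev/null)" -- "$cur") )
  fi
}
complete -F _goldentooth_complete goldentooth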
Error Handling and Output Management
Professional error management with proper stream handling:
Implementation:
- Error messages directed to stderr
- Clean stdout for programmatic consumption
- Consistent exit codes for automation integration
- Detailed error reporting with actionable suggestions
Help and Documentation
Built-in documentation system:
# List available commands
goldentooth help
# Command-specific help
goldentooth help shell
goldentooth help dllama_inference
# Show available inventory groups
goldentooth list_groups
Integration with Existing Infrastructure
Ansible Compatibility
The new CLI maintains full compatibility with existing Ansible workflows:
Hybrid Approach:
- SSH operations for simple, fast tasks
- Ansible playbooks for complex configuration management
- Seamless switching between approaches based on task requirements
Examples:
# Quick status check - SSH
goldentooth command all "uptime"
# Complex configuration - Ansible
goldentooth setup_consul
Monitoring and Observability
CLI operations integrate with existing monitoring systems:
Features:
- Command execution logging
- Performance metrics collection
- Integration with Prometheus/Grafana monitoring
- Audit trail for security compliance
User Experience Improvements
Intuitive Command Syntax
Natural, memorable command patterns:
# Intuitive file operations
goldentooth cp source destination
# Clear service management
goldentooth dllama_start_workers
# Obvious interactive access
goldentooth shell hostname