Prometheus Node Exporter Migration: From Kubernetes to Native

The Problem

While working on Grafana dashboard configuration, I discovered that the node exporter dashboard was completely empty - no metrics, no data, just a sad dashboard that looked like it had given up on life.

The issue? Our Prometheus Node Exporter was deployed via Kubernetes and Argo CD, but Prometheus itself was running as a systemd service on allyrion. The Kubernetes deployment created a ClusterIP service at 172.16.12.161:9100, but Prometheus (running outside the cluster) couldn't reach this internal Kubernetes service.

Meanwhile, Prometheus was configured to scrape node exporters directly at each node's IP on port 9100 (e.g., 10.4.0.11:9100), but nothing was listening at those addresses because the actual exporters were only reachable through Kubernetes' cluster-internal Service networking.
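
The scrape job on allyrion looked roughly like the sketch below (the job name and second target are illustrative assumptions; the real config lists every node):

# /etc/prometheus/prometheus.yml (sketch)
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - '10.4.0.11:9100'   # one entry per cluster node
          - '10.4.0.12:9100'   # shown for illustration only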

The Solution: Raw-dogging Node Exporter

Time to embrace the chaos and deploy node exporter directly on the nodes as systemd services. Sometimes the simplest solution is the best solution.

Step 1: Create the Ansible Playbook

First, I created a new playbook to deploy node exporter cluster-wide, using the same prometheus.prometheus.node_exporter role that the HAProxy setup was already using:

# ansible/playbooks/setup_node_exporter.yaml
# Description: Setup Prometheus Node Exporter on all cluster nodes.

- name: 'Setup Prometheus Node Exporter.'
  hosts: 'all'
  remote_user: 'root'
  roles:
    - { role: 'prometheus.prometheus.node_exporter' }
  handlers:
    - name: 'Restart Node Exporter.'
      ansible.builtin.service:
        name: 'node_exporter'
        state: 'restarted'
        enabled: true
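
The role comes from the prometheus.prometheus collection on Ansible Galaxy. It was already installed here (the HAProxy setup pulls it in), but on a fresh control node it's a one-liner:

ansible-galaxy collection install prometheus.prometheus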

Step 2: Deploy via Goldentooth CLI

Thanks to the goldentooth CLI's fallback behavior (it automatically runs Ansible playbooks with matching names), deployment was as simple as:

goldentooth setup_node_exporter
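
Under the hood this is roughly equivalent to invoking ansible-playbook directly (the inventory path below is illustrative, not necessarily the repo's actual layout):

ansible-playbook -i ansible/inventory/hosts ansible/playbooks/setup_node_exporter.yaml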

This installed node exporter on all 13 cluster nodes, creating:

  • node-exp system user and group
  • /usr/local/bin/node_exporter binary
  • /etc/systemd/system/node_exporter.service systemd service
  • /var/lib/node_exporter textfile collector directory
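
The generated unit looks roughly like this (a sketch; the exact ExecStart flags depend on the role's version and variables):

# /etc/systemd/system/node_exporter.service (sketch)
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
User=node-exp
Group=node-exp
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=0.0.0.0:9100 \
    --collector.textfile.directory=/var/lib/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target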

Step 3: Handle Port Conflicts

The deployment initially failed on most nodes with "address already in use" errors. The Kubernetes node exporter pods were still running and, since they bind directly to the host's port, had already claimed 9100 on each node.

Investigation revealed the conflict:

goldentooth command bettley "journalctl -u node_exporter --no-pager -n 10"
# Error: listen tcp 0.0.0.0:9100: bind: address already in use
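
To see exactly which process owned the port (ss needs root for the process column, which the root remote user covers):

goldentooth command bettley "ss -tlnp | grep 9100"
# the Kubernetes node exporter pod's process shows up as the listener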

Step 4: Clean Up Kubernetes Deployment

I removed the Kubernetes deployment entirely:

# Delete the daemonset and namespace
kubectl delete daemonset prometheus-node-exporter -n prometheus-node-exporter
kubectl delete namespace prometheus-node-exporter

# Delete the Argo CD applications managing this
kubectl delete application prometheus-node-exporter gitops-repo-prometheus-node-exporter -n argocd

# Delete the GitHub repository (to prevent ApplicationSet from recreating it)
gh repo delete goldentooth/prometheus-node-exporter --yes
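
A couple of sanity checks to confirm nothing respawns (same resource names as above):

kubectl get pods -A | grep -i node-exporter   # should return nothing
kubectl get applications -n argocd            # the deleted apps should no longer be listed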

Step 5: Restart Failed Services

With the port conflicts resolved, I restarted the systemd services:

goldentooth command bettley,dalt "systemctl restart node_exporter"

All nodes now showed healthy node exporter services:

● node_exporter.service - Prometheus Node Exporter
     Loaded: loaded (/etc/systemd/system/node_exporter.service; enabled)
     Active: active (running) since Wed 2025-07-23 19:36:30 EDT; 7s ago
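
A fleet-wide spot check (assuming the CLI accepts 'all' as a host pattern, mirroring the playbook's hosts: 'all'):

goldentooth command all "systemctl is-active node_exporter"
# every node should report: active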

Step 6: Reload Prometheus

With native node exporters now listening on port 9100 on all nodes, I reloaded Prometheus to pick up the new targets:

goldentooth command allyrion "systemctl reload prometheus"

Verified metrics were accessible:

goldentooth command allyrion "curl -s http://10.4.0.11:9100/metrics | grep node_cpu_seconds_total | head -3"
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1.42238869e+06
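
The Prometheus targets API gives the same confirmation from the scraper's side (run on allyrion; port 9090 is the default and assumed here, and jq must be available):

curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'
# each node exporter target should report "health": "up"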

The Result

Within minutes, the Grafana node exporter dashboard came alive with beautiful metrics from all cluster nodes. CPU usage, memory consumption, disk I/O, network statistics - everything was flowing perfectly.