Prometheus
Way back in Chapter 19, I set up a Prometheus Node Exporter "app" for Argo CD, but I never actually set up Prometheus itself.
That's fairly odd for me, since I'm normally super twitchy about metrics, logging, and observability. I guess I put it off because I was dealing with some existential questions (where would Prometheus live, how would it communicate, etc.) and then kinda ran out of steam before I answered them.
So, better late than never, I'm going to work on setting up Prometheus in a nice, decentralized kind of way.
Implementation Architecture
Installation Method
I'm using the official prometheus.prometheus.prometheus Ansible role from the Prometheus community. The depth of Prometheus is, after all, in configuring and using it, not merely in installing it.
The installation is managed through:
- Playbook: setup_prometheus.yaml
- Custom role: goldentooth.setup_prometheus (wraps the community role)
- CLI command: goldentooth setup_prometheus
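For reference, the wrapper playbook is tiny; this is a sketch of its shape (the hosts group is an assumption), not the literal file:

- name: Setup Prometheus.
  hosts: prometheus  # assumed inventory group; adjust to the actual inventory
  become: true
  roles:
    - goldentooth.setup_prometheus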
Deployment Location
Prometheus runs on allyrion (10.4.0.10), which consolidates multiple infrastructure services:
- HAProxy load balancer
- NFS server
- Prometheus monitoring server
This placement provides several advantages:
- Central location for cluster-wide monitoring
- Proximity to load balancer for HAProxy metrics
- Reduced resource usage on Kubernetes worker nodes
Service Configuration
Core Settings
The Prometheus service is configured with production-ready settings:
# Storage and retention
prometheus_storage_retention_time: "15d"
prometheus_storage_retention_size: "5GB"
prometheus_storage_tsdb_path: "/var/lib/prometheus"
# Network and performance
prometheus_web_listen_address: "0.0.0.0:9090"
prometheus_config_global_scrape_interval: "60s"
prometheus_config_global_evaluation_interval: "15s"
prometheus_config_global_scrape_timeout: "15s"
Security Hardening
The service implements comprehensive security measures:
- Dedicated user: Runs as the prometheus user/group
- Systemd hardening: NoNewPrivileges, PrivateDevices, ProtectSystem=strict
- Capability restrictions: Limited to CAP_SET_UID only
- Resource limits: GOMAXPROCS=4 to prevent CPU exhaustion
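In systemd terms, that hardening boils down to unit directives along these lines; a sketch of the settings described above, not the exact file the role templates out:

[Service]
User=prometheus
Group=prometheus
NoNewPrivileges=true
PrivateDevices=true
ProtectSystem=strict
# keep the TSDB path writable despite ProtectSystem=strict
ReadWritePaths=/var/lib/prometheus
CapabilityBoundingSet=CAP_SET_UID
# cap Go's scheduler threads so Prometheus can't starve the node
Environment=GOMAXPROCS=4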
External Labels
Cluster identification through external labels:
external_labels:
  environment: goldentooth
  cluster: goldentooth
  domain: goldentooth.net
Service Discovery Implementation
File-Based Service Discovery
Rather than relying on complex auto-discovery, I implement file-based service discovery for reliability and explicit control:
Target Generation (/etc/prometheus/file_sd/node.yaml):
{% for host in groups['all'] %}
- targets:
    - "{{ hostvars[host]['ipv4_address'] }}:9100"
  labels:
    instance: "{{ host }}"
    job: 'node'
{% endfor %}
This approach:
- Auto-generates targets from Ansible inventory
- Covers all 13 cluster nodes (12 Raspberry Pis + 1 x86 GPU node)
- Provides consistent labeling with instance and job labels
- Updates automatically when nodes are added/removed
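Rendered, the generated file ends up looking something like this (one stanza per inventory host; allyrion shown, the rest follow the same pattern):

- targets:
    - "10.4.0.10:9100"
  labels:
    instance: "allyrion"
    job: 'node'
# ...and so on for every other host in the inventory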
Scrape Configurations
Core Infrastructure Monitoring
Prometheus Self-Monitoring:
- job_name: 'prometheus'
  static_configs:
    - targets: ['allyrion:9090']
HAProxy Load Balancer:
- job_name: 'haproxy'
  static_configs:
    - targets: ['allyrion:8405']
HAProxy includes a built-in Prometheus exporter accessible at /metrics
on port 8405, providing load balancer performance and health metrics.
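Exposing that endpoint is a small addition to the HAProxy configuration; roughly something like the following sketch, which assumes a dedicated stats frontend bound to port 8405:

frontend stats
    bind *:8405
    # serve Prometheus metrics from HAProxy's built-in exporter
    http-request use-service prometheus-exporter if { path /metrics }
    stats enable
    stats uri /stats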
Nginx Reverse Proxy:
- job_name: 'nginx'
  static_configs:
    - targets: ['allyrion:9113']
Node Monitoring
File Service Discovery for all cluster nodes:
- job_name: "unknown"
file_sd_configs:
- files:
- "/etc/prometheus/file_sd/*.yaml"
- "/etc/prometheus/file_sd/*.json"
This targets all Node Exporter instances across the cluster, providing comprehensive infrastructure metrics. (The "unknown" job_name is just a fallback; the job label attached to each file_sd entry takes precedence, so these targets all report as job="node".)
Advanced Integrations
Loki Log Aggregation:
- job_name: 'loki'
  static_configs:
    - targets: ['inchfield:3100']
  scheme: 'https'
  tls_config:
    ca_file: /etc/ssl/certs/goldentooth.pem
This integration uses the Step-CA certificate authority for secure communication with the Loki log aggregation service.
Exporter Ecosystem
Node Exporter Deployment
Kubernetes Nodes (via Argo CD):
- Helm Chart: prometheus-node-exporter v4.46.1
- Namespace: prometheus-node-exporter
- Extra Collectors: --collector.systemd, --collector.processes
- Management: Automated GitOps deployment with auto-sync
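In the chart values, those extra collectors are just additional command-line flags passed to the exporter; something along these lines, assuming the chart's extraArgs field:

extraArgs:
  - --collector.systemd
  - --collector.processes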
Infrastructure Node (allyrion):
- Installation: Via the prometheus.prometheus.node_exporter role
- Enabled Collectors: systemd for service monitoring
- Integration: Direct scraping by local Prometheus instance
Application Exporters
I also configured several application-specific exporters:
HAProxy Built-in Exporter: Provides load balancer metrics including backend health, response times, and traffic distribution
Nginx Exporter: Monitors reverse proxy performance and request patterns
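The Nginx exporter doesn't produce metrics on its own; it scrapes Nginx's stub_status endpoint and translates the counters. A minimal sketch of that wiring (the local port and the exporter flag are assumptions):

server {
    # local-only status endpoint for the exporter to scrape
    listen 127.0.0.1:8080;

    location /stub_status {
        stub_status;
    }
}

The exporter is then pointed at that endpoint (something like --nginx.scrape-uri=http://127.0.0.1:8080/stub_status) and serves the translated metrics on port 9113 for Prometheus to scrape.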
Network Access and Security
Nginx Reverse Proxy
To provide secure external access to Prometheus, I configured an Nginx reverse proxy:
server {
    listen 8081;

    location / {
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Application prometheus;
    }
}
This provides:
- Network isolation (Prometheus only accessible locally)
- Header injection for request identification
- Potential for future authentication layer
Certificate Integration
The cluster uses Step-CA for comprehensive certificate management. Prometheus leverages this infrastructure for:
- Secure scraping of TLS-enabled services (Loki)
- Potential future TLS termination
- Integration with the broader security model
Alerting Configuration
Basic Alert Rules
The installation includes foundational alerting rules in /etc/prometheus/rules/ansible_managed.yml:
Watchdog Alert: Always-firing alert to verify the alerting pipeline is functional
Instance Down Alert: Critical alert when up == 0 for 5 minutes, indicating node or service failure
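As a rough sketch, those two rules look something like the following; the expressions and durations match the description above, but the group name and annotations are illustrative:

groups:
  - name: ansible managed alert rules
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: warning
        annotations:
          description: Always firing; proves the alerting pipeline works end to end.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.instance }} has been down for more than 5 minutes."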
Future Expansion
The alert rule framework is prepared for expansion with application-specific alerts, SLA monitoring, and capacity planning alerts.
Integration with Monitoring Stack
Grafana Integration
Prometheus serves as the primary data source for Grafana dashboards:
datasources:
  - name: prometheus
    type: prometheus
    url: http://allyrion:9090
    access: proxy
This enables rich visualization of cluster metrics through pre-configured and custom dashboards.
Storage and Performance
TSDB Configuration:
- Retention: 15 days (time) and 5GB (size) for appropriate data lifecycle
- Storage: Local disk at /var/lib/prometheus
- Compaction: Automatic TSDB compaction for optimal query performance
The scrape configuration was fairly straightforward, and the result is a solid monitoring foundation: every infrastructure component is covered, with room to grow into application-specific monitoring later.