Prometheus Slurm Exporter
Overview
Following the Slurm refactoring work, the next logical step was to add comprehensive monitoring for the HPC workload manager. This chapter documents the implementation of prometheus-slurm-exporter to provide real-time visibility into cluster utilization, job queues, and resource allocation.
The Challenge
While Slurm was operational with all 9 nodes in the idle state, it had no integration with the existing Prometheus/Grafana observability stack. Key missing capabilities:
- No Cluster Metrics: Unable to monitor CPU/memory utilization across nodes
- No Job Visibility: No insight into job queues, completion rates, or resource consumption
- No Historical Data: No way to track cluster usage patterns over time
- Limited Alerting: No proactive monitoring of cluster health or resource exhaustion
Implementation Approach
Exporter Selection
Initially attempted the original vpenso/prometheus-slurm-exporter, but discovered it was unmaintained and lacked modern features. Switched to the rivosinc/prometheus-slurm-exporter fork, which provided:
- Active Maintenance: 87 commits, regular releases through v1.6.10
- Pre-built Binaries: ARM64 support via GitHub releases
- Enhanced Features: Job tracing, CLI fallback modes, throttling support
- Better Performance: Optimized for multiple Prometheus instances
Architecture Design
Deployed the exporter following goldentooth cluster patterns:
# Deployment Strategy
Target Nodes: slurm_controller (bettley, cargyll, dalt)
Service Port: 9092 (HTTP)
Protocol: HTTP with Prometheus file-based service discovery
Integration: Full Step-CA certificate management ready
User Management: Dedicated slurm-exporter service user
Role Structure
Created goldentooth.setup_slurm_exporter following established conventions:
roles/goldentooth.setup_slurm_exporter/
├── CLAUDE.md                               # Comprehensive documentation
├── tasks/main.yaml                         # Main deployment tasks
├── templates/
│   ├── slurm-exporter.service.j2           # Systemd service
│   ├── slurm_targets.yaml.j2               # Prometheus targets
│   └── cert-renewer@slurm-exporter.conf.j2 # Certificate renewal
└── handlers/main.yaml                      # Service management handlers
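The handlers file is a standard systemd restart handler. A minimal sketch of what it contains (the exact handler name in the role is an assumption):

# handlers/main.yaml (sketch; actual handler name may differ)
- name: 'Restart slurm-exporter'
  ansible.builtin.systemd:
    name: 'slurm-exporter'
    state: 'restarted'
    daemon_reload: true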
Technical Implementation
Binary Installation
- name: 'Download prometheus-slurm-exporter from rivosinc fork'
  ansible.builtin.get_url:
    url: 'https://github.com/rivosinc/prometheus-slurm-exporter/releases/download/v{{ prometheus_slurm_exporter.version }}/prometheus-slurm-exporter_linux_{{ host.architecture }}.tar.gz'
    dest: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}.tar.gz'
    mode: '0644'
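The download is followed by extraction and installation of the binary into /usr/local/bin. A sketch of those follow-up tasks (task names and the archive layout are assumptions, and the notified handler is the sketch shown earlier):

# Sketch: extract the release tarball and install the binary (archive layout assumed).
- name: 'Create extraction directory'
  ansible.builtin.file:
    path: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}'
    state: 'directory'
    mode: '0755'

- name: 'Extract prometheus-slurm-exporter'
  ansible.builtin.unarchive:
    src: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}.tar.gz'
    dest: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}'
    remote_src: true

- name: 'Install binary to /usr/local/bin'
  ansible.builtin.copy:
    src: '/tmp/prometheus-slurm-exporter-{{ prometheus_slurm_exporter.version }}/prometheus-slurm-exporter'
    dest: '/usr/local/bin/prometheus-slurm-exporter'
    remote_src: true
    owner: 'root'
    group: 'root'
    mode: '0755'
  notify: 'Restart slurm-exporter'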
Service Configuration
[Service]
Type=simple
User=slurm-exporter
Group=slurm-exporter
ExecStart=/usr/local/bin/prometheus-slurm-exporter \
  -web.listen-address={{ ansible_default_ipv4.address }}:{{ prometheus_slurm_exporter.port }} \
  -web.log-level=info
Prometheus Integration
Added to the existing scrape configuration:
prometheus_scrape_configs:
  - job_name: 'slurm'
    file_sd_configs:
      - files:
          - "/etc/prometheus/file_sd/slurm_targets.yaml"
    relabel_configs:
      - source_labels: [instance]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'
Service Discovery
Dynamic target generation for all controller nodes:
- targets:
    - "{{ hostvars[slurm_controller].ansible_default_ipv4.address }}:{{ prometheus_slurm_exporter.port }}"
  labels:
    job: 'slurm'
    instance: '{{ slurm_controller }}'
    cluster: '{{ cluster_name }}'
    role: 'slurm-controller'
Metrics Exposed
The rivosinc exporter provides comprehensive cluster visibility:
Core Cluster Metrics
slurm_cpus_total 36 # Total CPU cores (9 nodes × 4 cores)
slurm_cpus_idle 36 # Available CPU cores
slurm_cpus_per_state{state="idle"} 36
slurm_node_count_per_state{state="idle"} 9
Memory Utilization
slurm_mem_real 7.0281e+10 # Total cluster memory (bytes)
slurm_mem_alloc 6.0797e+10 # Allocated memory (bytes)
slurm_mem_free 9.484e+09 # Available memory (bytes)
Job Queue Metrics
slurm_jobs_pending 0 # Jobs waiting in queue
slurm_jobs_running 0 # Currently executing jobs
slurm_job_scrape_duration 29 # Metric collection performance
Performance Monitoring
slurm_cpu_load 5.83 # Current CPU load average
slurm_node_scrape_duration 35 # Node data collection time
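These gauges compose directly in PromQL. For example, cluster CPU and memory utilization can be derived from the totals above; a sketch of recording rules built on the metric names shown (rule names are illustrative, not deployed configuration):

# Sketch: derived utilization ratios from the exporter's gauges.
groups:
  - name: slurm_utilization
    rules:
      - record: slurm:cpu_utilization:ratio
        expr: (slurm_cpus_total - slurm_cpus_idle) / slurm_cpus_total
      - record: slurm:mem_utilization:ratio
        expr: slurm_mem_alloc / slurm_mem_real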
Deployment Results
Service Health
All three controller nodes running successfully:
● slurm-exporter.service - Prometheus Slurm Exporter
     Loaded: loaded (/etc/systemd/system/slurm-exporter.service; enabled)
     Active: active (running)
   Main PID: 3692156 (prometheus-slur)
      Tasks: 5 (limit: 8737)
     Memory: 1.5M (max: 128.0M available)
Metrics Validation
curl http://10.4.0.11:9092/metrics | grep '^slurm_'
slurm_cpu_load 5.83
slurm_cpus_idle 36
slurm_cpus_per_state{state="idle"} 36
slurm_cpus_total 36
slurm_node_count_per_state{state="idle"} 9
Prometheus Integration
Targets automatically discovered and scraped:
- bettley:9092 - Controller node metrics
- cargyll:9092 - Controller node metrics
- dalt:9092 - Controller node metrics
Configuration Management
Variables Structure
# Prometheus Slurm Exporter configuration (rivosinc fork)
prometheus_slurm_exporter:
  version: "1.6.10"
  port: 9092
  user: "slurm-exporter"
  group: "slurm-exporter"
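The host.architecture value interpolated into the download URL has to match the release asset names (e.g. arm64), not Ansible's raw ansible_architecture fact (aarch64 on the Pi nodes). An illustrative mapping of the kind involved; the actual variable definition in the cluster inventory may differ:

# Illustrative only: map Ansible architecture facts to release asset suffixes.
architecture_map:
  aarch64: 'arm64'
  x86_64: 'amd64'

host:
  architecture: "{{ architecture_map[ansible_architecture] }}"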
Command Interface
# Deploy exporter
goldentooth setup_slurm_exporter
# Verify deployment
goldentooth command slurm_controller "systemctl status slurm-exporter"
# Check metrics
goldentooth command bettley "curl -s http://localhost:9092/metrics | head -10"
Troubleshooting Lessons
Initial Issues Encountered
- Wrong Repository: Started with the unmaintained vpenso fork
  - Solution: Switched to the actively maintained rivosinc fork
- TLS Configuration: Attempted HTTPS, but the exporter doesn't support TLS flags
  - Solution: Used HTTP, with plans for a TLS-terminating proxy later if needed
- Binary Availability: No pre-built ARM64 binaries in the original project
  - Solution: The rivosinc fork provides comprehensive release assets
- Port Conflicts: Initially used port 8080
  - Solution: Switched to the exporter's default of 9092 to avoid conflicts
Debugging Process
Service logs were key to identifying configuration issues:
journalctl -u slurm-exporter --no-pager -l
Metrics endpoint testing confirmed functionality:
curl -s http://localhost:9092/metrics | grep -E '^slurm_'
Integration with Existing Stack
The exporter integrates seamlessly with the goldentooth monitoring infrastructure:
Prometheus Configuration
- File-based Service Discovery: Automatic target management
- Label Strategy: Consistent with existing exporters
- Scrape Intervals: Standard 60-second collection
Certificate Management
- Step-CA Ready: Templates prepared for future TLS implementation
- Automatic Renewal: Systemd timer configuration included
- Service User: Dedicated account with minimal permissions
Observability Pipeline
- Prometheus: Metrics collection and storage
- Grafana: Dashboard visualization (ready for implementation)
- Alerting: Rule definition for cluster health monitoring
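For the alerting piece, the exporter's metrics map onto simple health rules. A sketch of what such rules could look like once defined (alert names and thresholds are illustrative, not deployed configuration):

# Sketch: candidate alerting rules for the slurm job (thresholds illustrative).
groups:
  - name: slurm_health
    rules:
      - alert: SlurmExporterDown
        expr: up{job="slurm"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Slurm exporter on {{ $labels.instance }} is unreachable'
      - alert: SlurmNoIdleNodes
        expr: slurm_node_count_per_state{state="idle"} == 0
        for: 15m
        labels:
          severity: info
        annotations:
          summary: 'No idle Slurm nodes are available for new jobs'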
Performance Impact
Resource Usage
- Memory: ~1.5MB RSS per exporter instance
- CPU: Minimal impact during scraping
- Network: Standard HTTP metrics collection
- Slurm Load: Read-only operations with built-in throttling
Scalability Considerations
- Multiple Controllers: Distributed across all controller nodes
- High Availability: No single point of failure
- Data Consistency: Each exporter provides complete cluster view
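One consequence of that design: Prometheus stores three copies of every cluster-wide gauge, distinguished only by the instance label, so dashboards and rules should aggregate across instances rather than sum them. An example of the kind of expression this implies (not taken from an existing dashboard):

# Sketch: collapse the per-controller duplicates into a single series.
groups:
  - name: slurm_dedup
    rules:
      - record: slurm:cpus_idle:max
        expr: max without (instance) (slurm_cpus_idle)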