Slurm Refactoring and Improvements
Overview
After the initial Slurm deployment (documented in chapter 032), the cluster faced performance and reliability challenges that required significant refactoring. The monolithic setup role was taking 10+ minutes to execute and had idempotency issues, while memory configuration mismatches caused node validation failures.
It's my fault - it's because of my laziness. So this chapter is essentially me saying "yeah, I did a shitty thing, and so now I have to fix it."
Problems Identified
Performance Issues
- Setup Duration: The original `goldentooth.setup_slurm` role took over 10 minutes
- Non-idempotent: Re-running the role would repeat expensive operations
- Monolithic Design: Single role handled everything from basic Slurm to complex HPC software stacks
Node Validation Failures
- Memory Mismatch: karstark and lipps nodes (4GB Pi 4s) were configured with 4096MB but only had ~3797MB available (a quick check for this is sketched after this list)
- Invalid State: These nodes showed as "inval" in `sinfo` output
- Authentication Issues: MUNGE key synchronization problems across nodes
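The quickest way to see what each node actually reports is to ask Ansible for the same fact the refactored template ends up using. This is only a sketch (the play and task names are made up; the `slurm_compute` group and `ansible_memtotal_mb` fact are the ones referenced elsewhere in this chapter):

```yaml
# Sketch: print the memory each compute node actually detects.
- name: 'Report detected memory per compute node'
  hosts: 'slurm_compute'
  gather_facts: true
  tasks:
    - name: 'Show ansible_memtotal_mb'
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: {{ ansible_memtotal_mb }} MB detected"
```

On the 4GB Pi 4s this comes out around 3797MB, which is why the hardcoded 4096MB pushed karstark and lipps into an invalid state.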
Configuration Management
- Static Memory Values: All nodes hardcoded to 4096MB regardless of actual capacity
- Limited Flexibility: Single configuration approach didn't account for hardware variations
Refactoring Solution
Modular Role Architecture
Split the monolithic role into focused components:
Core Components (`goldentooth.setup_slurm_core`)
- Purpose: Essential Slurm and MUNGE setup only
- Duration: Reduced from 10+ minutes to ~50 seconds
- Scope: Package installation, basic configuration, service management
- Features: MUNGE key synchronization, systemd PID file fixes
Specialized Modules
- `goldentooth.setup_lmod`: Environment module system
- `goldentooth.setup_hpc_software`: HPC software stack (OpenMPI, Singularity, Conda)
- `goldentooth.setup_slurm_modules`: Module files for installed software
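The playbook wiring isn't reproduced in this chapter, so here's a rough sketch of how the split roles compose. Only the role names are real; the play name, host groups, and tags are illustrative:

```yaml
# Illustrative composition of the split roles; tags and play layout are assumptions.
- name: 'Set up Slurm and optional HPC tooling'
  hosts: 'slurm_controller:slurm_compute'
  become: true
  roles:
    - role: 'goldentooth.setup_slurm_core'     # fast, idempotent essentials
      tags: ['slurm_core']
    - role: 'goldentooth.setup_lmod'           # optional: Lmod
      tags: ['lmod']
    - role: 'goldentooth.setup_hpc_software'   # optional: OpenMPI, Singularity, Conda
      tags: ['hpc_software']
    - role: 'goldentooth.setup_slurm_modules'  # optional: module files
      tags: ['slurm_modules']
```

Day-to-day runs can stick to the ~50-second core path (e.g. `--tags slurm_core`) and only pull in the heavy software stack when it's actually wanted.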
Dynamic Memory Detection
Replaced static memory configuration with dynamic detection:
```jinja
# Before: Static configuration
NodeName=DEFAULT CPUs=4 RealMemory=4096 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN

# After: Dynamic per-node configuration
NodeName=DEFAULT CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
{% for slurm_compute_name in groups['slurm_compute'] %}
NodeName={{ slurm_compute_name }} NodeAddr={{ hostvars[slurm_compute_name].ipv4_address }} RealMemory={{ hostvars[slurm_compute_name].ansible_memtotal_mb }}
{% endfor %}
```
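One wrinkle with this approach: the controller's template reads `ansible_memtotal_mb` out of `hostvars` for every compute node, so facts have to be gathered (or cached) for those nodes before `slurm.conf` is rendered. A minimal sketch of that ordering, with the template filename and destination path assumed rather than copied from the role:

```yaml
# Sketch: gather facts first, then render slurm.conf on the controllers.
- name: 'Gather facts needed for per-node RealMemory'
  hosts: 'slurm_controller:slurm_compute'
  gather_facts: true

- name: 'Render slurm.conf on the controllers'
  hosts: 'slurm_controller'
  become: true
  tasks:
    - name: 'Template slurm.conf with dynamic memory values'
      ansible.builtin.template:
        src: 'slurm.conf.j2'           # assumed template name
        dest: '/etc/slurm/slurm.conf'  # assumed install path
        mode: '0644'
```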
Node Exclusion Strategy
For nodes with insufficient memory (karstark, lipps):
- Inventory Update: Removed from the `slurm_compute` group
- Service Cleanup: Stopped and disabled the slurmd and munge services (see the sketch after this list)
- Package Removal: Uninstalled Slurm packages to prevent conflicts
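None of that cleanup is shown as code above, so here's a hedged sketch of what the equivalent tasks look like. The service names come from the bullets; the Debian package names are assumptions:

```yaml
# Illustrative decommission tasks for karstark and lipps.
- name: 'Stop and disable Slurm and MUNGE services'
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: 'stopped'
    enabled: false
  loop:
    - 'slurmd'
    - 'munge'
  failed_when: false   # fine if a service is already gone

- name: 'Remove Slurm packages to prevent conflicts'
  ansible.builtin.apt:
    name:
      - 'slurmd'         # assumed Debian package name
      - 'slurm-client'   # assumed Debian package name
    state: 'absent'
    purge: true
```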
Implementation Details
MUNGE Key Synchronization
Added a permanent solution to the MUNGE authentication issues:
```yaml
- name: 'Synchronize MUNGE keys across cluster'
  block:
    - name: 'Retrieve MUNGE key from first controller'
      ansible.builtin.slurp:
        src: '/etc/munge/munge.key'
      register: 'controller_munge_key'
      run_once: true
      delegate_to: "{{ groups['slurm_controller'] | first }}"

    - name: 'Distribute MUNGE key to all nodes'
      ansible.builtin.copy:
        content: "{{ controller_munge_key.content | b64decode }}"
        dest: '/etc/munge/munge.key'
        owner: 'munge'
        group: 'munge'
        mode: '0400'
        backup: yes
      when: inventory_hostname != groups['slurm_controller'] | first
      notify: 'Restart MUNGE'
```
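The `notify` assumes a 'Restart MUNGE' handler, which isn't reproduced above; a minimal sketch:

```yaml
# Minimal handler sketch matching the notify above.
- name: 'Restart MUNGE'
  ansible.builtin.systemd:
    name: 'munge'
    state: 'restarted'
```

Restarting munged after a key change matters: slurmd and slurmctld can't authenticate until every node is using the same key.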
SystemD Integration Fixes
Resolved PID file path mismatches:
```yaml
- name: 'Fix slurmctld pidfile path mismatch'
  ansible.builtin.copy:
    content: |
      [Service]
      PIDFile=/var/run/slurm/slurmctld.pid
    dest: '/etc/systemd/system/slurmctld.service.d/override.conf'
    mode: '0644'
  when: inventory_hostname in groups['slurm_controller']
  notify: 'Reload systemd and restart slurmctld'
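```

Two supporting pieces aren't shown: the drop-in directory `/etc/systemd/system/slurmctld.service.d/` has to exist before the copy runs (an `ansible.builtin.file` task with `state: directory` covers that), and the handler needs a daemon reload so systemd actually picks up the override. A sketch of that handler, assuming the name used above:

```yaml
# Handler sketch matching the notify above; daemon_reload makes systemd re-read the override.
- name: 'Reload systemd and restart slurmctld'
  ansible.builtin.systemd:
    name: 'slurmctld'
    state: 'restarted'
    daemon_reload: true
```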
NFS Permission Resolution
Fixed directory permissions that prevented slurm user access:
```bash
# Fixed root directory permissions on NFS server
chmod 755 /mnt/usb1  # Was 700, preventing slurm user access
```
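To keep that one-off `chmod` from quietly regressing on a rebuild, the same fix can be expressed in Ansible. A small sketch, using the export path above and targeting whichever host serves the NFS export:

```yaml
# Sketch: persist the permission fix so a rebuild doesn't reintroduce mode 0700.
- name: 'Ensure the NFS export root is traversable by the slurm user'
  ansible.builtin.file:
    path: '/mnt/usb1'
    state: 'directory'
    mode: '0755'
```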
Results
Performance Improvements
- Setup Time: Reduced from 10+ minutes to ~50 seconds for core functionality
- Idempotency: Role can be safely re-run without expensive operations
- Modularity: Users can choose which components to install
Cluster Health
- Node Status: 9 nodes operational and idle
- Authentication: MUNGE working consistently across all nodes (a quick smoke test is sketched below)
- Resource Detection: Accurate memory reporting per node
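A quick, read-only way to spot-check the MUNGE claim is to round-trip a credential on every node. This is a sketch, not a task from the repo, and it only proves each node's munged is up with a usable key; cross-node trust ultimately shows up as slurmd registering cleanly.

```yaml
# Read-only smoke test: encode and immediately decode a MUNGE credential on each node.
- name: 'Verify munged responds on each Slurm node'
  hosts: 'slurm_controller:slurm_compute'
  gather_facts: false
  tasks:
    - name: 'Round-trip a MUNGE credential locally'
      ansible.builtin.shell: 'munge -n | unmunge'
      changed_when: false
```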
Final Cluster State
```
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
general*  up    infinite  9     idle  bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
debug     up    infinite  9     idle  bettley,cargyll,dalt,erenford,fenn,gardener,harlton,inchfield,jast
```