Certificate Renewal Debugging Odyssey
Some time after I set up the certificate renewal system, the cluster was humming along nicely with 24-hour certificate lifetimes and an automatic renewal check every 5 minutes. Or so I thought.
One morning, I discovered that Vault certificates had mysteriously expired overnight, despite the renewal system supposedly working. This kicked off a multi-day investigation that would lead to significant improvements in our certificate management and monitoring infrastructure.
The Mystery: Why Didn't Vault Certificates Renew?
The first clue was puzzling: some services (Consul, Nomad) had renewed their certificates successfully, while Vault had failed silently. The cert-renewer systemd service showed no errors, and the timers were running on schedule.
$ goldentooth command_root jast 'systemctl status cert-renewer@vault.timer'
● cert-renewer@vault.timer - Timer for certificate renewal of vault
     Loaded: loaded (/etc/systemd/system/cert-renewer@.timer; enabled)
     Active: active (waiting) since Wed 2025-07-23 14:05:12 EDT; 3h ago
The timer was active, but the certificates were still expired. Something was fundamentally wrong with our renewal logic.
Building a Certificate Renewal Canary
Rather than guessing at the problem, I decided to build proper test infrastructure. The solution was a "canary" service - a minimal certificate renewal setup with extremely short lifetimes that would fail fast and give us rapid feedback.
Creating the Canary Service
I created a new Ansible role, goldentooth.setup_cert_renewer_canary, that:
- Creates a dedicated user and service: cert-canary user with its own systemd service
- Uses 15-minute certificate lifetimes: Fast enough to debug quickly
- Runs on a 5-minute timer: Same schedule as production services
- Provides comprehensive logging: Detailed output for debugging
# roles/goldentooth.setup_cert_renewer_canary/defaults/main.yaml
cert_canary:
  username: cert-canary
  group: cert-canary
  cert_lifetime: 15m
  cert_path: /opt/cert-canary/certs/tls.crt
  key_path: /opt/cert-canary/certs/tls.key
The canary service template includes detailed logging:
[Unit]
Description=Certificate Canary Service
After=network-online.target
[Service]
Type=oneshot
User=cert-canary
WorkingDirectory=/opt/cert-canary
ExecStart=/bin/echo "Certificate canary service executed successfully"
Discovering the Root Cause
With the canary in place, I could observe the renewal process in real time. The breakthrough came when I examined the step certificate needs-renewal command more carefully.
The 66% Threshold Problem
The default cert-renewer configuration uses a 66% threshold for renewal: a certificate becomes eligible for renewal once 66% of its lifetime has elapsed, that is, when roughly a third of it remains. For 24-hour certificates, this means renewal starts when there are about 8 hours left.
But here's the critical issue: with a 5-minute timer interval, there's only a narrow window for successful renewal. If the renewal fails during that window (due to network issues, service restarts, etc.), the next attempt won't occur until the timer fires again.
The math was unforgiving:
- 24-hour certificate: 66% threshold = ~8 hour renewal window
- 5-minute timer: 12 attempts per hour
- Network/service instability: Occasional renewal failures
- Result: Certificates could expire if multiple renewal attempts failed in succession
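The arithmetic above can be checked with a small sketch. It assumes the threshold is read as the elapsed share of the certificate lifetime (which is what gives the ~8 hour figure), and the function names are hypothetical helpers, not part of the cert-renewer setup.

```python
from datetime import timedelta


def renewal_window(lifetime: timedelta, threshold_pct: int) -> timedelta:
    """Time left on a certificate once threshold_pct of its lifetime has elapsed."""
    # Integer arithmetic keeps the result exact for whole-percent thresholds.
    return lifetime * (100 - threshold_pct) / 100


def attempts_in_window(window: timedelta, timer_interval: timedelta) -> int:
    """Number of complete timer intervals that fit inside the renewal window."""
    return window // timer_interval


timer = timedelta(minutes=5)

# Production services: 24-hour certificates, 66% threshold.
production = renewal_window(timedelta(hours=24), 66)
print("production window:", production, "attempts:", attempts_in_window(production, timer))

# Canary: 15-minute certificates, same threshold and timer.
canary = renewal_window(timedelta(minutes=15), 66)
print("canary window:", canary, "attempts:", attempts_in_window(canary, timer))
```

Under these assumptions the canary's window is barely longer than one timer interval, so a single failed attempt shows up as an expired certificate within minutes. That is what makes it useful as a fail-fast signal.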
The Solution: Environment Variable Configuration
The fix involved making the cert-renewer system more configurable and robust. I updated the base cert-renewer@.service template to support environment variable overrides:
[Unit]
Description=Certificate renewer for %I
After=network-online.target
Documentation=https://smallstep.com/docs/step-ca/certificate-authority-server-production
StartLimitIntervalSec=0
[Service]
Type=oneshot
User=root
Environment=STEPPATH=/etc/step-ca
Environment=CERT_LOCATION=/etc/step/certs/%i.crt
Environment=KEY_LOCATION=/etc/step/certs/%i.key
Environment=EXPIRES_IN_THRESHOLD=66%
ExecCondition=/usr/bin/step certificate needs-renewal ${CERT_LOCATION} --expires-in ${EXPIRES_IN_THRESHOLD}
ExecStart=/usr/bin/step ca renew --force ${CERT_LOCATION} ${KEY_LOCATION}
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active %i.service || systemctl try-reload-or-restart %i.service"
[Install]
WantedBy=multi-user.target
Service-Specific Overrides
Each service now gets its own override configuration that specifies the exact certificate paths and renewal behavior:
# /etc/systemd/system/cert-renewer@vault.service.d/override.conf
[Service]
Environment=CERT_LOCATION=/opt/vault/tls/tls.crt
Environment=KEY_LOCATION=/opt/vault/tls/tls.key
WorkingDirectory=/opt/vault/tls
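# Reset the ExecStartPost list inherited from the base template first;
# otherwise systemd appends the command below and runs both.
ExecStartPost=
ExecStartPost=/usr/bin/env sh -c "! systemctl --quiet is-active vault.service || systemctl try-reload-or-restart vault.service"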
The beauty of this approach is that we can now tune renewal behavior per service without modifying the base template.
Comprehensive Monitoring Infrastructure
While debugging the certificate issue, I also built comprehensive monitoring dashboards and alerting to prevent future incidents.
New Grafana Dashboards
I created three major monitoring dashboards:
- Slurm Cluster Overview: Job queue metrics, resource utilization, historical trends
- HashiCorp Services Overview: Consul health, Vault status, Nomad allocation monitoring
- Infrastructure Health Overview: Node uptime, storage capacity, network metrics
Enhanced Metrics Collection
The monitoring improvements included:
- Vector Internal Metrics: Enabled Vector's internal metrics and Prometheus exporter
- Certificate Expiration Tracking: Automated monitoring of certificate days-remaining
- Service Health Indicators: Real-time status for all critical cluster services
- Alert Rules: Proactive notifications for certificate expiration and service failures
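As an illustration of the days-remaining idea, here is a minimal sketch that turns a certificate's notAfter timestamp into a Prometheus-style sample line. It is not the collection pipeline described above: the metric name cert_days_remaining and both helper functions are hypothetical, and the timestamp format is assumed to be the one printed by openssl x509 -enddate.

```python
import ssl
import time
from typing import Optional


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry for a notAfter string such as 'Jul 24 18:05:12 2025 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    current = time.time() if now is None else now
    return (expiry - current) / 86400


def prometheus_line(service: str, days: float) -> str:
    """Format one sample in the Prometheus text exposition format."""
    # cert_days_remaining is a made-up metric name used only for this sketch.
    return f'cert_days_remaining{{service="{service}"}} {days:.3f}'


# Example: a certificate that expires exactly one day after the reference time.
reference = ssl.cert_time_to_seconds("Jul 23 18:05:12 2025 GMT")
remaining = days_remaining("Jul 24 18:05:12 2025 GMT", now=reference)
print(prometheus_line("vault", remaining))
```

A negative value means the certificate has already expired, so an alert rule could fire on anything below a chosen floor rather than waiting for zero.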
Testing Infrastructure Improvements
The certificate renewal investigation led to significant improvements in our testing infrastructure.
Certificate-Aware Test Suite
I created a comprehensive test_certificate_renewal role that provides:
- Node-Specific Testing: Only tests certificates for services actually deployed on each node
- Multi-Layered Validation: Certificate presence, validity, timer status, renewal capability
- Chain Validation: Verifies certificates against the cluster CA
- Canary Health Monitoring: Tracks the certificate canary's renewal cycles
Smart Service Filtering
The test improvements included "intelligent" service filtering:
# Filter services to only those deployed on this node.
# Jinja2 for-loops have their own scope, so the include flag lives in a
# namespace() object; a plain `set` inside the inner loop would be lost.
- name: Filter services for current node
  set_fact:
    node_certificate_services: |-
      {%- set filtered_services = [] -%}
      {%- for service in certificate_services -%}
        {%- set ns = namespace(should_include=false) -%}
        {%- if service.get('specific_hosts') -%}
          {%- if inventory_hostname in service.specific_hosts -%}
            {%- set ns.should_include = true -%}
          {%- endif -%}
        {%- elif service.get('host_groups') -%}
          {%- for group in service.host_groups -%}
            {%- if inventory_hostname in groups.get(group, []) -%}
              {%- set ns.should_include = true -%}
            {%- endif -%}
          {%- endfor -%}
        {%- endif -%}
        {%- if ns.should_include -%}
          {%- set _ = filtered_services.append(service) -%}
        {%- endif -%}
      {%- endfor -%}
      {{ filtered_services }}
This eliminated false positives where tests were failing for missing certificates on nodes where services weren't supposed to be running.
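The selection rule itself is easy to restate outside of Ansible, which helps when reasoning about edge cases. The sketch below mirrors the intended logic in plain Python: an explicit specific_hosts list wins, otherwise membership in any of the listed host_groups decides. The function name, the group names, and every host except jast are hypothetical.

```python
def services_for_node(hostname, services, groups):
    """Return the services whose certificates should exist on the given host."""
    selected = []
    for service in services:
        specific_hosts = service.get("specific_hosts")
        if specific_hosts:
            # An explicit host list takes precedence over group membership.
            include = hostname in specific_hosts
        else:
            include = any(
                hostname in groups.get(group, [])
                for group in service.get("host_groups", [])
            )
        if include:
            selected.append(service)
    return selected


# Hypothetical inventory and service definitions, for illustration only.
groups = {"vault": ["jast", "node-b"], "consul": ["node-b", "node-c"]}
services = [
    {"name": "vault", "host_groups": ["vault"]},
    {"name": "consul", "host_groups": ["consul"]},
    {"name": "cert-canary", "specific_hosts": ["jast"]},
]

names = [service["name"] for service in services_for_node("jast", services, groups)]
print(names)
```

With this sample data, jast is expected to carry the vault and cert-canary certificates but not the consul one, so a missing consul certificate on that node would not count as a test failure.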