Step-CA Certificate Monitoring Implementation

With the goldentooth cluster now heavily dependent on Step-CA for certificate management across Consul, Vault, Nomad, Grafana, Loki, Vector, HAProxy, Blackbox Exporter, and the newly deployed SeaweedFS distributed storage, we needed comprehensive certificate monitoring to prevent service outages from expired certificates.

The existing certificate monitoring was basic: we had file-based certificate expiry alerts, but lacked the visibility and proactive alerting necessary for enterprise-grade PKI management.

The Monitoring Challenge

Our cluster runs multiple services with Step-CA certificates:

Consul: Service mesh certificates for all nodes
Vault: Secrets management with HA cluster
Nomad: Workload orchestration across the cluster
Grafana: Observability dashboard access
Loki: Log aggregation infrastructure
Vector: Log shipping to Loki
HAProxy: Load balancer with TLS termination
Blackbox Exporter: Synthetic monitoring service
SeaweedFS: Distributed storage with master/volume servers

Each service has automated certificate renewal via cert-renewer@.service systemd timers, but we needed comprehensive monitoring to ensure the renewal system itself was healthy and catch any failures before they caused outages.

Enhanced Blackbox Monitoring

The first enhancement expanded our synthetic monitoring to include comprehensive TLS validation for all Step-CA services.

SeaweedFS Integration

With SeaweedFS newly deployed as a high-availability distributed storage system, I added its endpoints to blackbox monitoring:

# SeaweedFS Master servers (HA cluster)
- targets:
  - "https://fenn:9333"
  - "https://karstark:9333" 
  labels:
    service: "seaweedfs-master"

# SeaweedFS Volume servers  
- targets:
  - "https://fenn:8080"
  - "https://karstark:8080"
  labels:
    service: "seaweedfs-volume"

Comprehensive TLS Endpoint Monitoring

Every Step-CA managed service now has synthetic TLS validation:

blackbox_https_internal_targets:
  - "https://consul.goldentooth.net:8501"
  - "https://vault.goldentooth.net:8200" 
  - "https://nomad.goldentooth.net:4646"
  - "https://grafana.goldentooth.net:3000"
  - "https://loki.goldentooth.net:3100"
  - "https://vector.goldentooth.net:8686"
  - "https://fenn:9115"  # blackbox exporter itself
  - "https://fenn:9333"  # seaweedfs master
  - "https://karstark:9333"
  - "https://fenn:8080"  # seaweedfs volume
  - "https://karstark:8080"

The blackbox exporter validates not just connectivity, but certificate chain validity, expiry dates, and proper TLS negotiation for each endpoint.
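
The checks described above map onto two pieces of configuration: a blackbox exporter HTTP module, and a Prometheus scrape job that routes each target through the exporter. The following is a minimal sketch rather than the cluster's actual configuration. The module name https_internal, the job name blackbox-https-internal, and the CA bundle path are assumptions made for illustration; only the target URLs and the exporter address fenn:9115 come from the lists above.

```yaml
# blackbox.yml (exporter side): module name and CA path are hypothetical
modules:
  https_internal:
    prober: http
    timeout: 10s
    http:
      method: GET
      fail_if_not_ssl: true          # a plain-HTTP response counts as a failure
      tls_config:
        ca_file: /etc/ssl/certs/step-ca-root.pem   # assumed location of the Step-CA root
---
# prometheus.yml (scrape side): job name chosen to match blackbox-https.*
scrape_configs:
  - job_name: blackbox-https-internal
    metrics_path: /probe
    scheme: https                     # the exporter itself is served over TLS
    tls_config:
      ca_file: /etc/ssl/certs/step-ca-root.pem
    params:
      module: [https_internal]
    static_configs:
      - targets:
          - "https://consul.goldentooth.net:8501"
          - "https://vault.goldentooth.net:8200"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # pass the URL to the exporter as ?target=
      - source_labels: [__param_target]
        target_label: instance        # keep the probed URL as the instance label
      - target_label: __address__
        replacement: fenn:9115        # scrape the exporter, not the target
```

Note that the http prober accepts only 2xx responses by default, so an endpoint that answers with an error status fails the probe even when its certificate is healthy.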

Advanced Prometheus Alert System

The core enhancement was implementing a comprehensive multi-tier alerting system for certificate management.

Certificate Expiry Alerts

I implemented three tiers of certificate expiry warnings:

# 30-day advance warning
- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate expires in less than 30 days"
    description: "Certificate for {{ $labels.instance }} expires in less than 30 days. Plan renewal."

# 7-day critical alert  
- alert: CertificateExpiringCritical
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Certificate expires in less than 7 days"
    description: "Certificate for {{ $labels.instance }} expires in less than 7 days. Immediate attention required."

# 2-day emergency alert
- alert: CertificateExpiringEmergency  
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 2
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Certificate expires in less than 2 days"
    description: "Certificate for {{ $labels.instance }} expires in less than 2 days. Emergency action required."
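
Because each tier uses a plain less-than comparison, a certificate within 2 days of expiry satisfies all three expressions and raises all three alerts at once. One way to keep only the most severe notification is Alertmanager inhibition. This is a sketch under the assumption that alerts are routed through Alertmanager (0.22 or later, for the matchers syntax); it is not taken from the goldentooth configuration.

```yaml
# alertmanager.yml fragment (hypothetical): mute lower tiers per instance
inhibit_rules:
  # The 2-day alert silences the 30-day and 7-day alerts for the same endpoint
  - source_matchers:
      - alertname = CertificateExpiringEmergency
    target_matchers:
      - 'alertname =~ "CertificateExpiring(Soon|Critical)"'
    equal: [instance]
  # The 7-day alert silences the 30-day alert for the same endpoint
  - source_matchers:
      - alertname = CertificateExpiringCritical
    target_matchers:
      - alertname = CertificateExpiringSoon
    equal: [instance]
```

The equal list restricts inhibition to alerts that share the same instance label, so an expiring certificate on one endpoint does not hide warnings for another.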

Certificate Renewal System Monitoring

Beyond expiry monitoring, I added alerts for certificate renewal system health:

# File-based certificate monitoring
- alert: CertificateFileExpiring
  expr: (file_certificate_expiry_seconds - time()) / 86400 < 7
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Certificate file expiring soon"
    description: "Certificate file {{ $labels.path }} expires in less than 7 days"

# Certificate renewal timer failure
- alert: CertificateRenewalTimerFailed
  expr: systemd_timer_last_trigger_seconds{name=~"cert-renewer@.*"} < time() - 86400 * 8
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "Certificate renewal timer failed"
    description: "Certificate renewal timer {{ $labels.name }} hasn't run in over 8 days"
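
The timer alert above catches a timer that stops firing, but not a timer that fires while the renewal unit itself fails. A companion rule could cover that case. This sketch assumes the exporter that provides systemd_timer_last_trigger_seconds also exposes a systemd_unit_state metric with name and state labels (prometheus-community's systemd_exporter does); if the metric comes from elsewhere, the name will differ.

```yaml
# Sketch only: assumes systemd_unit_state is scraped alongside the timer metric
- alert: CertificateRenewalUnitFailed
  expr: systemd_unit_state{name=~"cert-renewer@.*\\.service", state="failed"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Certificate renewal unit failed"
    description: "Renewal unit {{ $labels.name }} is in the failed state"
```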

Step-CA Server Health

Critical infrastructure monitoring for the Step-CA service itself:

# Step-CA service availability
- alert: StepCADown
  expr: up{job="step-ca"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Step-CA server is down"
    description: "Step-CA certificate authority is unreachable"

# TLS endpoint failures
- alert: TLSEndpointDown
  expr: probe_success{job=~"blackbox-https.*"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "TLS endpoint unreachable"
    description: "TLS endpoint {{ $labels.instance }} is unreachable via HTTPS"
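
One caveat with up{job="step-ca"} == 0 is that it returns nothing when the series does not exist at all, for example if the scrape target is removed from the configuration. A hedged sketch of a companion rule using absent() follows; the alert name and the 5m hold time are illustrative choices, not part of the deployed rules.

```yaml
# Fires when no up series exists for the step-ca job at all
- alert: StepCAMetricsAbsent
  expr: absent(up{job="step-ca"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Step-CA scrape target is missing"
    description: "No up series exists for job step-ca; the target may have been dropped from the scrape configuration"
```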

Comprehensive Certificate Dashboard

The monitoring enhancement includes a dedicated Grafana dashboard providing complete PKI visibility.

Dashboard Features

The Step-CA Certificate Dashboard displays:

Certificate Expiry Timeline: Color-coded visualization showing all certificates with expiry thresholds (green > 30 days, yellow 7-30 days, red < 7 days)
TLS Endpoint Status: Real-time status of all HTTPS endpoints monitored via blackbox probes
Certificate Renewal Health: Status of systemd renewal timers across all services
Step-CA Server Status: Availability and responsiveness of the certificate authority
Certificate Inventory: Table showing all managed certificates with expiry dates and renewal status

Dashboard Implementation

- name: Deploy Step-CA certificate monitoring dashboard
  ansible.builtin.copy:
    src: "{{ playbook_dir }}/../grafana-dashboards/step-ca-certificate-dashboard.json"
    dest: "/var/lib/grafana/dashboards/"
    owner: grafana
    group: grafana
    mode: '0644'
  notify: restart grafana

The dashboard provides at-a-glance visibility into the health of the entire PKI infrastructure, with drill-down capabilities to investigate specific certificate issues.

Infrastructure Integration

Enhanced Grafana Role

The Grafana setup role now includes automated dashboard deployment:

- name: Create dashboards directory
  ansible.builtin.file:
    path: "/var/lib/grafana/dashboards"
    state: directory
    owner: grafana
    group: grafana
    mode: '0755'

- name: Deploy certificate monitoring dashboard
  ansible.builtin.copy:
    src: "step-ca-certificate-dashboard.json"
    dest: "/var/lib/grafana/dashboards/"
    owner: grafana
    group: grafana
    mode: '0644'
  notify: restart grafana
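
For Grafana to load JSON files from that directory, a dashboard provider has to point at it. The following is a minimal sketch of such a provider file, assuming it is placed in Grafana's provisioning directory (typically /etc/grafana/provisioning/dashboards/ on package installs). The provider name is hypothetical.

```yaml
# Hypothetical provider file, e.g. /etc/grafana/provisioning/dashboards/goldentooth.yml
apiVersion: 1
providers:
  - name: goldentooth-dashboards      # illustrative name
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30         # how often Grafana rescans the directory
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
```

With a file provider in place, Grafana rescans the path on the configured interval, so updated dashboard JSON is picked up without a restart. The restart handler matters mainly when the provider file itself changes.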

Prometheus Configuration Updates

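The Prometheus alerting rules are rendered through Ansible's Jinja2 templating, which uses the same {{ ... }} delimiters as Prometheus alert templates. Each Prometheus expression therefore has to be escaped so that it reaches the rules file intact:

# Jinja2 escaping: Prometheus receives {{ $labels.instance }} literally
annotations:
  summary: "Certificate for {{ "{{ $labels.instance }}" }} expires in less than 30 days"
  description: "Certificate renewal required for {{ "{{ $labels.instance }}" }}"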
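
When a rule contains several Prometheus expressions, wrapping it in a Jinja2 raw block is an alternative to escaping each one. This sketch assumes the rules file is rendered with Ansible's template module. The use of $value with humanizeDuration is an illustrative addition, relying on the fact that the value of a filtering comparison is its left-hand side (here, the remaining seconds).

```yaml
{% raw %}
# Everything inside the raw block is passed through to Prometheus unchanged
- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Certificate for {{ $labels.instance }} expires in less than 30 days"
    description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
{% endraw %}
```

The trade-off is that Ansible variables cannot be interpolated inside the raw block, so this form suits rules that need no per-environment values.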

Service Targets Configuration

All Step-CA certificate endpoints are now systematically monitored:

blackbox_targets:
  https_internal:
    # Core HashiCorp services
    - "https://consul.goldentooth.net:8501"
    - "https://vault.goldentooth.net:8200"
    - "https://nomad.goldentooth.net:4646"
    
    # Observability stack
    - "https://grafana.goldentooth.net:3000"
    - "https://loki.goldentooth.net:3100"
    - "https://vector.goldentooth.net:8686"
    
    # Infrastructure services
    - "https://fenn:9115"  # blackbox exporter
    
    # SeaweedFS distributed storage
    - "https://fenn:9333"   # seaweedfs master
    - "https://karstark:9333"
    - "https://fenn:8080"   # seaweedfs volume  
    - "https://karstark:8080"

Deployment Results

Monitoring Coverage

The enhanced certificate monitoring now provides:

Complete PKI visibility: All 20+ Step-CA certificates monitored
Proactive alerting: 30-, 7-, and 2-day advance warnings prevent surprises
System health monitoring: Renewal timer and Step-CA service health tracking
Synthetic validation: Real TLS endpoint testing via blackbox probes
Centralized dashboard: Single pane of glass for certificate infrastructure

Alert Integration

The alert system provides:

Early warning system: 30-day alerts allow planned certificate maintenance
Escalating severity: 7-day critical and 2-day emergency alerts ensure attention
Renewal system monitoring: Catches failures in automated renewal timers
Infrastructure monitoring: Step-CA server availability tracking

Operational Impact

Before this enhancement:

Basic file-based certificate expiry alerts
Limited visibility into certificate health
Potential for service outages from unnoticed certificate expiry
Manual certificate status checking required

After implementation:

Enterprise-grade certificate lifecycle monitoring
Proactive alerting preventing service disruptions
Complete synthetic validation of certificate-dependent services
Real-time visibility into PKI infrastructure health
Automated dashboard providing immediate certificate status overview

Repository Integration

Multi-Repository Changes

The implementation spans two repositories:

goldentooth/ansible: Core infrastructure implementation

Enhanced blackbox exporter role with SeaweedFS targets
Comprehensive Prometheus alerting rules
Improved Grafana role with dashboard deployment
Certificate monitoring integration across all Step-CA services

goldentooth/grafana-dashboards: Dashboard repository

New Step-CA Certificate Dashboard with complete PKI visibility
Dashboard committed for reuse across environments
JSON format compatible with Grafana provisioning

Command Integration

Certificate monitoring integrates with goldentooth CLI:

# Deploy enhanced certificate monitoring
goldentooth setup_blackbox_exporter
goldentooth setup_grafana  
goldentooth setup_prometheus

# Check certificate monitoring status
goldentooth command allyrion "systemctl status blackbox-exporter"

# View certificate expiry alerts
goldentooth command allyrion "curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname | contains(\"Certificate\"))'"

# Monitor renewal timers
goldentooth command_all "systemctl list-timers 'cert-renewer@*'"

This Step-CA certificate monitoring implementation moves goldentooth from basic certificate expiry checks to enterprise-grade PKI monitoring with full lifecycle visibility, proactive alerting, and automated health checks. Early warnings and synthetic validation of every certificate-dependent service are intended to catch certificate problems before they cause outages.
