Prometheus Blackbox Exporter

The Observability Gap

Our Goldentooth cluster has comprehensive infrastructure monitoring through Prometheus, node exporters, and application metrics. But we've been missing a crucial piece: synthetic monitoring. We can see if our servers are running, but can we actually reach our services? Are our web UIs accessible? Can we connect to our APIs?

Enter the Prometheus Blackbox Exporter - our eyes and ears for service availability across the entire cluster.

What is Blackbox Monitoring?

Blackbox monitoring tests services from the outside, just like your users would. Instead of looking at internal metrics, it:

  • Probes HTTP/HTTPS endpoints - "Is the Consul web UI actually working?"
  • Tests TCP connectivity - "Can I connect to the Vault API port?"
  • Validates DNS resolution - "Do our cluster domains resolve correctly?"
  • Checks ICMP reachability - "Are all nodes responding to ping?"

It's called "blackbox" because we don't peek inside the service - we just test if it works from the outside.

Planning the Implementation

I needed to design a comprehensive monitoring strategy that would cover:

Service Categories

  • HashiCorp Stack: Consul, Vault, Nomad web UIs and APIs
  • Kubernetes Services: API server health, Argo CD, LoadBalancer services
  • Observability Stack: Prometheus, Grafana, Loki endpoints
  • Infrastructure: All 13 node homepages, HAProxy stats
  • External Services: CloudFront distributions
  • Network Health: DNS resolution for all cluster domains

Intelligent Probe Types

  • Internal HTTPS: Uses Step-CA certificates for cluster services
  • External HTTPS: Uses public CAs for external services
  • HTTP: Plain HTTP for internal services
  • TCP: Port connectivity for APIs and cluster communication
  • DNS: Domain resolution for cluster services
  • ICMP: Basic network connectivity for all nodes
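
Each probe type maps to a module in the exporter's blackbox.yml. Here is a sketch of how that module set might be defined (https_2xx_internal matches the module name used in the Prometheus config later; the other names, timeouts, and certificate paths are assumptions):

# blackbox.yml sketch - one module per probe type (paths are assumptions)
modules:
  https_2xx_internal:
    prober: http
    timeout: 10s
    http:
      tls_config:
        ca_file: /etc/blackbox_exporter/tls/step-ca-root.crt  # trust Step-CA
  https_2xx_external:
    prober: http
    timeout: 10s  # public CAs come from the system trust store
  http_2xx:
    prober: http
  tcp_connect:
    prober: tcp
  dns_internal:
    prober: dns
    dns:
      query_name: "consul.goldentooth.net"
      query_type: A
  icmp:
    prober: icmp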

The Ansible Implementation

I created an Ansible role, goldentooth.setup_blackbox_exporter, that handles:

Core Deployment

# Install blackbox exporter v0.25.0
# Deploy on allyrion (same node as Prometheus)
# Configure systemd service with security hardening
# Set up TLS certificates via Step-CA
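
In outline, those tasks look something like the sketch below (the architecture, install path, and service name are assumptions - the real role is more thorough):

# Hedged sketch of the core install tasks (arch, paths, names are assumptions)
- name: Download and unpack blackbox exporter v0.25.0
  ansible.builtin.unarchive:
    src: "https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-arm64.tar.gz"
    dest: /opt
    remote_src: true

- name: Link the binary onto the PATH
  ansible.builtin.file:
    src: /opt/blackbox_exporter-0.25.0.linux-arm64/blackbox_exporter
    dest: /usr/local/bin/blackbox_exporter
    state: link

- name: Enable and start the systemd service
  ansible.builtin.systemd:
    name: blackbox-exporter
    enabled: true
    state: started
    daemon_reload: true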

Security Integration

The blackbox exporter integrates seamlessly with our Step-CA infrastructure:

  • Client certificates for secure communication
  • CA validation for internal services
  • Automatic renewal via systemd timers
  • Proper certificate ownership for the service user
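
One concrete piece of that integration: renewed certificates have to stay readable by the exporter after every rotation. A sketch of what such a task might look like in the role (the user name and paths are assumptions):

# Hedged sketch - keep TLS material owned by the service user
# (user, group, and paths are assumptions)
- name: Set ownership on blackbox exporter TLS material
  ansible.builtin.file:
    path: "{{ item }}"
    owner: blackbox_exporter
    group: blackbox_exporter
    mode: "0600"
  loop:
    - /etc/blackbox_exporter/tls/tls.crt
    - /etc/blackbox_exporter/tls/tls.key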

Service Discovery Magic

Instead of hardcoding targets, I implemented dynamic service discovery:

# Generate targets from Ansible inventory variables
blackbox_https_internal_targets:
  - "https://consul.goldentooth.net:8501"
  - "https://vault.goldentooth.net:8200"
  - "https://nomad.goldentooth.net:4646"
  # ... and many more

# Auto-generate ICMP targets for all cluster nodes
{% for host in groups['all'] %}
- targets:
    - "{{ hostvars[host]['ipv4_address'] }}"
  labels:
    job: 'blackbox-icmp'
    node: "{{ host }}"
{% endfor %}
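
The rendered template lands in a file_sd target file that Prometheus watches for changes, so nodes joining or leaving the inventory are picked up without a restart. A sketch of the scrape side (the file path is an assumption):

# Prometheus consumes the rendered targets via file-based service discovery
- job_name: 'blackbox-icmp'
  metrics_path: '/probe'
  params:
    module: ['icmp']
  file_sd_configs:
    - files:
        - /etc/prometheus/file_sd/blackbox-icmp.yaml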

Prometheus Integration

The trickiest part was configuring Prometheus to scrape blackbox targets properly. The blackbox exporter works differently from normal exporters:

# Instead of scraping each target directly,
# Prometheus scrapes the blackbox exporter with the target as a parameter
- job_name: 'blackbox-https-internal'
  metrics_path: '/probe'
  params:
    module: ['https_2xx_internal']
  static_configs:
    - targets:
        - "https://consul.goldentooth.net:8501"  # from the inventory lists above
  relabel_configs:
    # Pass the original target address as the ?target= query parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Preserve the original target as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # Redirect the actual scrape to the blackbox exporter
    - target_label: __address__
      replacement: "allyrion:9115"
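
A handy side effect of this scheme: any probe can be exercised by hand. Requesting http://allyrion:9115/probe?module=https_2xx_internal&target=https://consul.goldentooth.net:8501 returns the probe metrics directly, which makes debugging a misbehaving module much faster than waiting for the next scrape cycle.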

Deployment Day

The deployment was mostly smooth, with a few interesting challenges:

Certificate Duration Drama

# First attempt failed:
# "requested duration of 8760h is more than authorized maximum of 168h"

# Solution: request a lifetime within the Step-CA policy
# (subject and file paths here are illustrative)
step ca certificate "blackbox.goldentooth.net" tls.crt tls.key \
  --not-after=168h  # 1 week instead of 1 year

DNS Resolution Reality Check

Many of our internal domains (*.goldentooth.net) don't actually resolve yet, so those probes report probe_success=0 (the exporter itself stays up - only the probe fails). This is expected and actually valuable - it shows us exactly which infrastructure we still need to set up!

Relabel Configuration Complexity

Getting the Prometheus relabel configs right for blackbox took several iterations. The key insight: relabel rules run in order, so the original target has to be copied into __param_target before __address__ is overwritten to point at the exporter itself.

What We're Monitoring Now

The blackbox exporter is now actively monitoring 40+ endpoints across our cluster:

Web UIs and APIs

  • Consul Web UI (https://consul.goldentooth.net:8501)
  • Vault Web UI (https://vault.goldentooth.net:8200)
  • Nomad Web UI (https://nomad.goldentooth.net:4646)
  • Grafana Dashboard (https://grafana.goldentooth.net:3000)
  • Argo CD Interface (https://argocd.goldentooth.net)

Infrastructure Endpoints

  • All 13 node homepages (http://[node].nodes.goldentooth.net)
  • HAProxy statistics page (with basic auth)
  • Prometheus web interface
  • Loki API endpoints

Network Connectivity

  • TCP connectivity to all critical service ports
  • DNS resolution for all cluster domains
  • ICMP ping for every cluster node
  • External CloudFront distributions

The Power of Synthetic Monitoring

Now when something breaks, we'll know immediately:

  • probe_success tells us if the service is reachable
  • probe_duration_seconds shows response times
  • probe_http_status_code reveals HTTP errors
  • probe_ssl_earliest_cert_expiry warns about certificate expiration

This complements our existing infrastructure monitoring perfectly. We can see both "the server is running" (node exporter) and "the service actually works" (blackbox exporter).
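
These metrics feed naturally into alerting. A minimal rule sketch built on them (rule names and thresholds here are illustrative, not tuned values from our cluster):

# Hedged sketch of alert rules over blackbox metrics
groups:
  - name: blackbox
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} failed its blackbox probe"
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate on {{ $labels.instance }} expires within 7 days"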