Prometheus Blackbox Exporter
The Observability Gap
Our Goldentooth cluster has comprehensive infrastructure monitoring through Prometheus, node exporters, and application metrics. But we've been missing a crucial piece: synthetic monitoring. We can see if our servers are running, but can we actually reach our services? Are our web UIs accessible? Can we connect to our APIs?
Enter the Prometheus Blackbox Exporter - our eyes and ears for service availability across the entire cluster.
What is Blackbox Monitoring?
Blackbox monitoring tests services from the outside, just like your users would. Instead of looking at internal metrics, it:
- Probes HTTP/HTTPS endpoints - "Is the Consul web UI actually working?"
- Tests TCP connectivity - "Can I connect to the Vault API port?"
- Validates DNS resolution - "Do our cluster domains resolve correctly?"
- Checks ICMP reachability - "Are all nodes responding to ping?"
It's called "blackbox" because we don't peek inside the service - we just test if it works from the outside.
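To make this concrete: you can ask the exporter to run a probe by hand. A quick sketch, assuming an exporter listening on its default port 9115 and the stock http_2xx module:

```bash
# The exporter, not Prometheus, performs the actual probe
curl 'http://localhost:9115/probe?target=https://example.com&module=http_2xx'
```

The response is a small set of metrics - probe_success, probe_duration_seconds, and friends - which Prometheus then scrapes on a schedule.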
Planning the Implementation
I needed to design a comprehensive monitoring strategy that would cover:
Service Categories
- HashiCorp Stack: Consul, Vault, Nomad web UIs and APIs
- Kubernetes Services: API server health, Argo CD, LoadBalancer services
- Observability Stack: Prometheus, Grafana, Loki endpoints
- Infrastructure: All 13 node homepages, HAProxy stats
- External Services: CloudFront distributions
- Network Health: DNS resolution for all cluster domains
Intelligent Probe Types
- Internal HTTPS: Uses Step-CA certificates for cluster services
- External HTTPS: Uses public CAs for external services
- HTTP: Plain HTTP for internal services
- TCP: Port connectivity for APIs and cluster communication
- DNS: Domain resolution for cluster services
- ICMP: Basic network connectivity for all nodes
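Each of these maps to a module in the exporter's blackbox.yml. Here's a sketch of what those module definitions might look like - apart from https_2xx_internal, which our Prometheus config references below, the module names and paths are assumptions:

```yaml
modules:
  https_2xx_internal:
    prober: http
    timeout: 5s
    http:
      tls_config:
        ca_file: /etc/blackbox_exporter/step_ca_root.crt  # assumed path to the Step-CA root
  http_2xx:
    prober: http
    timeout: 5s
  tcp_connect:
    prober: tcp
    timeout: 5s
  dns_lookup:
    prober: dns
    timeout: 5s
    dns:
      query_name: "consul.goldentooth.net"  # the name to resolve; the probe target is the DNS server
      query_type: "A"
  icmp:
    prober: icmp
    timeout: 5s
```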
The Ansible Implementation
I created a comprehensive Ansible role, `goldentooth.setup_blackbox_exporter`, that handles:
Core Deployment
```
# Install blackbox exporter v0.25.0
# Deploy on allyrion (same node as Prometheus)
# Configure systemd service with security hardening
# Set up TLS certificates via Step-CA
```
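The hardening piece deserves a concrete example. A sketch of the kind of unit the role installs - the unit name, user, and paths are assumptions, but the CAP_NET_RAW capability is what lets a non-root process send ICMP probes:

```ini
[Unit]
Description=Prometheus Blackbox Exporter
After=network-online.target

[Service]
User=blackbox_exporter
ExecStart=/usr/local/bin/blackbox_exporter \
    --config.file=/etc/blackbox_exporter/blackbox.yml
# Hardening: no privilege escalation, read-only view of the filesystem
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
# Raw sockets for ICMP probes without running as root
AmbientCapabilities=CAP_NET_RAW
CapabilityBoundingSet=CAP_NET_RAW
Restart=on-failure

[Install]
WantedBy=multi-user.target
```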
Security Integration
The blackbox exporter integrates seamlessly with our Step-CA infrastructure:
- Client certificates for secure communication
- CA validation for internal services
- Automatic renewal via systemd timers
- Proper certificate ownership for the service user
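The renewal piece boils down to re-running the step CLI on a schedule. A sketch of what the timer fires - the certificate paths are assumptions:

```bash
# Renew the certificate in place, then restart so the exporter picks it up
step ca renew --force \
    /etc/blackbox_exporter/tls.crt \
    /etc/blackbox_exporter/tls.key
systemctl try-reload-or-restart blackbox-exporter
```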
Service Discovery Magic
Instead of hardcoding targets, I implemented dynamic service discovery:
```yaml
# Generate targets from Ansible inventory variables
blackbox_https_internal_targets:
  - "https://consul.goldentooth.net:8501"
  - "https://vault.goldentooth.net:8200"
  - "https://nomad.goldentooth.net:4646"
  # ... and many more

# Auto-generate ICMP targets for all cluster nodes
{% for host in groups['all'] %}
- targets:
    - "{{ hostvars[host]['ipv4_address'] }}"
  labels:
    job: 'blackbox-icmp'
    node: "{{ host }}"
{% endfor %}
```
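Rendered, that loop emits one file_sd-style entry per node, along these lines (the address is illustrative):

```yaml
- targets:
    - "10.0.0.10"  # illustrative address
  labels:
    job: 'blackbox-icmp'
    node: 'allyrion'
```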
Prometheus Integration
The trickiest part was configuring Prometheus to scrape blackbox targets properly. The blackbox exporter works differently from normal exporters:
```yaml
# Instead of scraping the target directly, Prometheus scrapes the
# blackbox exporter with the real target passed as a URL parameter.
- job_name: 'blackbox-https-internal'
  metrics_path: '/probe'
  params:
    module: ['https_2xx_internal']
  relabel_configs:
    # Pass the original target as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Preserve the original target as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # Redirect the actual scrape to the blackbox exporter
    - target_label: __address__
      replacement: "allyrion:9115"
```
Deployment Day
The deployment was mostly smooth with a few interesting challenges:
Certificate Duration Drama
```bash
# First attempt failed:
#   "requested duration of 8760h is more than authorized maximum of 168h"
# Solution: Match Step-CA policy
--not-after=168h  # 1 week instead of 1 year
```
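For context, here's roughly where that flag lands - a sketch in which the subject and output paths are assumptions:

```bash
step ca certificate "blackbox.goldentooth.net" \
    /etc/blackbox_exporter/tls.crt \
    /etc/blackbox_exporter/tls.key \
    --not-after=168h
```

With one-week certificates, the renewal timers from the security section aren't a nicety - they're load-bearing.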
DNS Resolution Reality Check
Many of our internal domains (`*.goldentooth.net`) don't actually resolve yet, so those probes report `probe_success=0`. This is expected and actually valuable - it shows us what infrastructure we still need to set up!
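Spotting those gaps is a one-line query in Prometheus (assuming the job labels shown earlier):

```promql
# Every target whose most recent probe failed
probe_success == 0
```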
Relabel Configuration Complexity
Getting the Prometheus relabel configs right for blackbox took several iterations. The key insight: blackbox targets need to be "redirected" through the exporter itself, and rule order matters - the original `__address__` must be copied into `__param_target` before `__address__` is overwritten with the exporter's address.
What We're Monitoring Now
The blackbox exporter is now actively monitoring 40+ endpoints across our cluster:
Web UIs and APIs
- Consul Web UI (https://consul.goldentooth.net:8501)
- Vault Web UI (https://vault.goldentooth.net:8200)
- Nomad Web UI (https://nomad.goldentooth.net:4646)
- Grafana Dashboard (https://grafana.goldentooth.net:3000)
- Argo CD Interface (https://argocd.goldentooth.net)
Infrastructure Endpoints
- All 13 node homepages (http://[node].nodes.goldentooth.net)
- HAProxy statistics page (with basic auth - see the module sketch below)
- Prometheus web interface
- Loki API endpoints
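Probing a page behind basic auth just means giving the exporter credentials in its module definition. A sketch - the username and password file are assumptions:

```yaml
http_2xx_haproxy:
  prober: http
  timeout: 5s
  http:
    basic_auth:
      username: admin                                     # assumed
      password_file: /etc/blackbox_exporter/haproxy.pass  # assumed
```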
Network Connectivity
- TCP connectivity to all critical service ports
- DNS resolution for all cluster domains
- ICMP ping for every cluster node
- External CloudFront distributions
The Power of Synthetic Monitoring
Now when something breaks, we'll know immediately:
- `probe_success` tells us if the service is reachable
- `probe_duration_seconds` shows response times
- `probe_http_status_code` reveals HTTP errors
- `probe_ssl_earliest_cert_expiry` warns about certificate expiration
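Those metrics feed straight into alerting. A sketch of the kind of rules we can now write - the names and thresholds here are my own choices:

```yaml
groups:
  - name: blackbox
    rules:
      - alert: ServiceProbeFailing
        expr: probe_success == 0
        for: 5m
        annotations:
          summary: "Probe for {{ $labels.instance }} has failed for 5 minutes"
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires within a week"
```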
This complements our existing infrastructure monitoring perfectly. We can see both "the server is running" (node exporter) and "the service actually works" (blackbox exporter).