Comprehensive Metrics Collection

After establishing the foundation of our observability stack with Prometheus, Grafana, and the blackbox exporter, it's time to ensure we're collecting metrics from every critical component in our cluster. This chapter covers the addition of Nomad telemetry and Kubernetes object metrics to our monitoring infrastructure.

The Metrics Audit

A comprehensive audit of our cluster revealed which services were already exposing metrics:

Already Configured:

  • ✅ Kubernetes API server, controller manager, scheduler (via control plane endpoints)
  • ✅ HAProxy (custom exporter on port 8405)
  • ✅ Prometheus (self-monitoring)
  • ✅ Grafana (internal metrics)
  • ✅ Loki (log aggregation metrics)
  • ✅ Consul (built-in Prometheus endpoint)
  • ✅ Vault (telemetry endpoint)

Missing:

  • ❌ Nomad (no telemetry configuration)
  • ❌ Kubernetes object state (deployments, pods, services)

Enabling Nomad Telemetry

Nomad has built-in Prometheus support but requires explicit configuration. We added the telemetry block to our Nomad configuration template:

telemetry {
  collection_interval        = "1s"
  disable_hostname           = true
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

This configuration:

  • Enables Prometheus-compatible metrics on /v1/metrics?format=prometheus
  • Publishes detailed allocation and node metrics
  • Stops Nomad from prefixing gauge metric names with the local hostname (we identify nodes with our own labels)
  • Sets a 1-second collection interval for fine-grained data
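
Before pointing Prometheus at the endpoint, it's worth a quick spot check from one of the nodes. This is a minimal sketch, assuming a recent Nomad CLI and that the NOMAD_ADDR/NOMAD_CACERT/NOMAD_CLIENT_CERT/NOMAD_CLIENT_KEY environment variables already carry the mTLS settings described in the next section:

# Dump the agent's metrics through the CLI (equivalent to
# GET /v1/metrics?format=prometheus) and spot-check a few series.
nomad operator metrics -format prometheus | grep -m 5 '^nomad_'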

Certificate-Based Authentication

Unlike some services that expose metrics without authentication, Nomad requires mutual TLS for metrics access. We leveraged our Step-CA infrastructure to generate proper client certificates:

- name: 'Generate Prometheus client certificate for Nomad metrics.'
  ansible.builtin.shell:
    cmd: |
      {{ step_ca.executable }} ca certificate \
        "prometheus.client.nomad" \
        "/etc/prometheus/certs/nomad-client.crt" \
        "/etc/prometheus/certs/nomad-client.key" \
        --provisioner="{{ step_ca.default_provisioner.name }}" \
        --password-file="{{ step_ca.default_provisioner.password_path }}" \
        --san="prometheus.client.nomad" \
        --san="prometheus" \
        --san="{{ clean_hostname }}" \
        --san="{{ ipv4_address }}" \
        --not-after='24h' \
        --console \
        --force

This approach ensures:

  • Certificates are properly signed by our cluster CA
  • Client identity is clearly established
  • Automatic renewal via systemd timers
  • Consistency with our security model
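
Once the task has run, the issued certificate can be inspected to confirm the SANs and the 24-hour lifetime. The root CA path below is a placeholder for wherever {{ step_ca.root_cert_path }} resolves on our hosts:

# Show subject, SANs, and the validity window of the issued certificate.
step certificate inspect /etc/prometheus/certs/nomad-client.crt --short

# Confirm it chains up to the cluster root CA (path is a placeholder).
step certificate verify /etc/prometheus/certs/nomad-client.crt \
  --roots /etc/step-ca/certs/root_ca.crt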

Prometheus Scrape Configuration

With certificates in place, we configured Prometheus to scrape all Nomad nodes:

- job_name: 'nomad'
  metrics_path: '/v1/metrics'
  params:
    format: ['prometheus']
  static_configs:
    - targets:
        - "10.4.0.11:4646"  # bettley (server)
        - "10.4.0.12:4646"  # cargyll (server)
        - "10.4.0.13:4646"  # dalt (server)
        # ... all client nodes
  scheme: 'https'
  tls_config:
    ca_file: "{{ step_ca.root_cert_path }}"
    cert_file: "/etc/prometheus/certs/nomad-client.crt"
    key_file: "/etc/prometheus/certs/nomad-client.key"
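
After reloading Prometheus, the targets API gives a quick health readout for the new job. This assumes Prometheus answers on localhost:9090 and that jq is available:

# Report health and last scrape error for every target in the nomad job.
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq '.data.activeTargets[]
        | select(.labels.job == "nomad")
        | {instance: .labels.instance, health: .health, lastError: .lastError}'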

Kubernetes Object Metrics with kube-state-metrics

While node-level metrics tell us about resource usage, we also need visibility into Kubernetes objects themselves. Enter kube-state-metrics, which exposes metrics about:

  • Deployment replica counts and rollout status
  • Pod phases and container states
  • Service endpoints and readiness
  • PersistentVolume claims and capacity
  • Job completion status
  • And much more
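
To illustrate why these series matter, here are two example queries against the Prometheus HTTP API; the metric names are standard kube-state-metrics series, and the Prometheus address is assumed:

# Deployments whose observed replica count has drifted from the spec.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_spec_replicas != kube_deployment_status_replicas'

# Pods currently stuck in a non-Running, non-Succeeded phase.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_status_phase{phase=~"Pending|Failed|Unknown"} == 1'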

GitOps Deployment Pattern

Following our established patterns, we created a dedicated GitOps repository for kube-state-metrics:

# Create the repository
gh repo create goldentooth/kube-state-metrics --public

# Clone into our organization structure
cd ~/Projects/goldentooth
git clone https://github.com/goldentooth/kube-state-metrics.git

# Add the required topic for Argo CD discovery
gh repo edit goldentooth/kube-state-metrics --add-topic gitops-repo

The key insight here is that our Argo CD ApplicationSet automatically discovers repositories tagged with the gitops-repo topic, eliminating manual application creation.
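
Once the ApplicationSet has synced, the generated Application can be checked from the CLI. That the Application name mirrors the repository name is an assumption about our ApplicationSet template:

# Confirm Argo CD generated an Application for the new repository and
# report its sync and health status.
kubectl get application kube-state-metrics -n argocd \
  -o jsonpath='{.status.sync.status}{"\t"}{.status.health.status}{"\n"}'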

kube-state-metrics Configuration

The deployment includes comprehensive RBAC permissions to read all Kubernetes objects:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
# ... additional resources

We discovered that some resources like resourcequotas, replicationcontrollers, and limitranges were missing from the initial configuration, causing permission errors. A quick update to the ClusterRole resolved these issues.
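
kubectl can impersonate the service account to confirm the fix before redeploying; the service account name here is an assumption based on the standard kube-state-metrics manifests:

# Verify the updated ClusterRole grants list access to the resources
# that were previously failing, impersonating the service account.
for r in resourcequotas replicationcontrollers limitranges; do
  kubectl auth can-i list "$r" \
    --as=system:serviceaccount:kube-state-metrics:kube-state-metrics
done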

Security Hardening

The kube-state-metrics deployment follows security best practices:

securityContext:
  fsGroup: 65534
  runAsGroup: 65534
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

The container-level security context adds further restrictions:

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
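
Since the image is distroless, there is no shell to exec into; the effective settings are easiest to confirm from the pod spec. The label selector is an assumption based on the standard manifests:

# Print the pod- and container-level security contexts actually applied.
kubectl get pods -n kube-state-metrics \
  -l app.kubernetes.io/name=kube-state-metrics \
  -o jsonpath='{.items[0].spec.securityContext}{"\n"}{.items[0].spec.containers[0].securityContext}{"\n"}'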

Prometheus Auto-Discovery

The service includes annotations for automatic Prometheus discovery:

annotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '8080'
  prometheus.io/path: '/metrics'

Because our Prometheus server runs a Kubernetes service-discovery job that honors these annotations, no manual scrape configuration is needed; the metrics are discovered and scraped automatically.
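
To confirm discovery actually happened, the targets API can be filtered on the service's metadata. The job name depends on how the kubernetes_sd_configs job is defined, so the filter keys on the Kubernetes service name instead:

# Locate the auto-discovered kube-state-metrics target and report its health.
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq '.data.activeTargets[]
        | select(.discoveredLabels["__meta_kubernetes_service_name"] == "kube-state-metrics")
        | {job: .labels.job, health: .health}'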

Verifying the Deployment

After deployment, we can verify metrics are being exposed:

# Port-forward to test locally
kubectl port-forward -n kube-state-metrics service/kube-state-metrics 8080:8080

# Check deployment metrics
curl -s http://localhost:8080/metrics | grep kube_deployment_status_replicas

Example output:

kube_deployment_status_replicas{namespace="argocd",deployment="argocd-redis-ha-haproxy"} 3
kube_deployment_status_replicas{namespace="kube-system",deployment="coredns"} 2