Infrastructure Test Framework Improvements
After running comprehensive tests across the cluster, we discovered several critical issues with our test framework that were masking real infrastructure problems. This chapter documents the systematic fixes we implemented to ensure our automated testing provides accurate health monitoring.
The Initial Problem
When running goldentooth test all
, multiple test failures appeared across different nodes:
PLAY RECAP *********************************************************************
bettley : ok=47 changed=0 unreachable=0 failed=1 skipped=3 rescued=0 ignored=2
cargyll : ok=47 changed=0 unreachable=0 failed=1 skipped=3 rescued=0 ignored=1
dalt : ok=47 changed=0 unreachable=0 failed=1 skipped=3 rescued=0 ignored=1
The challenge was determining whether these failures indicated real infrastructure issues or problems with the test framework itself.
Root Cause Analysis
1. Kubernetes API Server Connectivity Issues
The most critical failure was the Kubernetes API server health check consistently failing on the bettley
control plane node:
Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
url: https://10.4.0.11:6443/healthz
Initial investigation revealed that while kubelet was running, both etcd and kube-apiserver pods were in CrashLoopBackOff state. This led us to discover that Kubernetes certificates had expired on June 20, 2025, but we were running tests in July 2025.
2. Test Framework Configuration Issues
Several test framework bugs were identified:
- Vault decryption errors: Tests couldn't access encrypted vault secrets
- Wrong certificate paths: Tests checking CA certificates instead of service certificates
- Undefined variables: JMESPath dependencies and variable reference errors
- Localhost binding assumptions: Services bound to specific IPs, not localhost
Infrastructure Fixes
Kubernetes Certificate Renewal
The most significant infrastructure issue was expired Kubernetes certificates. We resolved this using kubeadm:
# Backup existing certificates
ansible -i inventory/hosts bettley -m shell -a "cp -r /etc/kubernetes/pki /etc/kubernetes/pki.backup.$(date +%Y%m%d_%H%M%S)" --become
# Renew all certificates
ansible -i inventory/hosts bettley -m shell -a "kubeadm certs renew all" --become
# Restart control plane components by moving manifests temporarily
cd /etc/kubernetes/manifests
mv kube-apiserver.yaml kube-apiserver.yaml.tmp
mv etcd.yaml etcd.yaml.tmp
mv kube-controller-manager.yaml kube-controller-manager.yaml.tmp
mv kube-scheduler.yaml kube-scheduler.yaml.tmp
# Wait 10 seconds, then restore manifests
sleep 10
mv kube-apiserver.yaml.tmp kube-apiserver.yaml
mv etcd.yaml.tmp etcd.yaml
mv kube-controller-manager.yaml.tmp kube-controller-manager.yaml
mv kube-scheduler.yaml.tmp kube-scheduler.yaml
After renewal, certificates were valid until July 2026:
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
apiserver Jul 23, 2026 00:01 UTC 364d ca no
etcd-peer Jul 23, 2026 00:01 UTC 364d etcd-ca no
etcd-server Jul 23, 2026 00:01 UTC 364d etcd-ca no
Test Framework Fixes
1. Vault Authentication
Fixed missing vault password configuration in test environment:
# /Users/nathan/Projects/goldentooth/ansible/tests/ansible.cfg
[defaults]
vault_password_file = ~/.goldentooth_vault_password
2. Certificate Path Corrections
Updated tests to check actual service certificates instead of CA certificates:
# Before: Checking CA certificates (5-year lifespan)
path: /etc/consul.d/tls/consul-agent-ca.pem
# After: Checking service certificates (24-hour rotation)
path: /etc/consul.d/certs/tls.crt
3. API Connectivity Fixes
Fixed hardcoded localhost assumptions to use actual node IP addresses:
# Before: Assuming localhost binding
url: "https://127.0.0.1:8501/v1/status/leader"
# After: Using actual node IP
url: "http://{{ ansible_default_ipv4.address }}:8500/v1/status/leader"
4. Consul Members Command
Enhanced Consul connectivity testing with proper address specification:
- name: Check if consul command exists
stat:
path: /usr/bin/consul
register: consul_command_stat
- name: Check Consul members
command: consul members -status=alive -http-addr={{ ansible_default_ipv4.address }}:8500
when:
- consul_service.status.ActiveState == "active"
- consul_command_stat.stat.exists
5. Kubernetes Test Improvements
Simplified Kubernetes tests to avoid JMESPath dependencies and fixed variable scoping:
# Simplified node readiness test
- name: Record node readiness test (simplified)
set_fact:
k8s_tests: "{{ k8s_tests + [{'name': 'k8s_cluster_accessible', 'category': 'kubernetes', 'success': (k8s_nodes_raw is defined and k8s_nodes_raw is succeeded) | bool, 'duration': 0.5}] }}"
# Fixed API health test scoping
- name: Record API health test
set_fact:
k8s_tests: "{{ k8s_tests + [{'name': 'k8s_api_healthy', 'category': 'kubernetes', 'success': (k8s_api.status == 200 and k8s_api.content | default('') == 'ok') | bool, 'duration': 0.2}] }}"
when:
- k8s_api is defined
- inventory_hostname in groups['k8s_control_plane']
6. Step-CA Variable References
Fixed undefined variable references in Step-CA connectivity tests:
# Fixed IP address lookup
elif step ca health --ca-url https://{{ hostvars[groups['step_ca'] | first]['ipv4_address'] }}:9443 --root /etc/ssl/certs/goldentooth.pem; then
7. Localhost Aggregation Task
Fixed the test summary task that was failing due to missing facts:
- name: Aggregate test results
hosts: localhost
gather_facts: true # Changed from false
Test Design Philosophy
We adopted a principle of separating certificate presence from validity testing:
# Test 1: Certificate exists
- name: Check Consul certificate exists
stat:
path: /etc/consul.d/certs/tls.crt
register: consul_cert
- name: Record certificate presence test
set_fact:
consul_tests: "{{ consul_tests + [{'name': 'consul_certificate_present', 'category': 'consul', 'success': consul_cert.stat.exists | bool, 'duration': 0.1}] }}"
# Test 2: Certificate is valid (separate test)
- name: Check if certificate needs renewal
command: step certificate needs-renewal /etc/consul.d/certs/tls.crt
register: cert_needs_renewal
when: consul_cert.stat.exists
- name: Record certificate validity test
set_fact:
consul_tests: "{{ consul_tests + [{'name': 'consul_certificate_valid', 'category': 'consul', 'success': (cert_needs_renewal.rc != 0) | bool, 'duration': 0.1}] }}"
This approach provides better debugging information and clearer failure isolation.