Where Do We Go From Here?
We have a functioning cluster now, which is to say that I've spent many hours of my life that I'm not going to get back doing the same thing that the official documentation manages to convey in a few lines.
Or that Jeff Geerling's geerlingguy.kubernetes has already managed to do.
And it's not a tenth of a percent as much as Kubespray can do.
Not much to be proud of, but again, this is a personal learning journey. I'm just trying to build a cluster thoughtfully, limiting the black boxes and the magic as much as practical.
The Foundation is Set
What we've accomplished so far represents the essential foundation of any production Kubernetes cluster:
Core Infrastructure ✅
- High availability control plane with 3 nodes and etcd quorum
- Load balanced API access through HAProxy for reliability
- Container runtime (containerd) with proper CRI integration
- Pod networking with Calico CNI providing cluster-wide connectivity
- Worker node pool with heterogeneous hardware (ARM64 + x86_64)
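The "3 nodes and etcd quorum" point comes down to simple majority arithmetic, so here is a minimal sketch of it. The function names are my own for illustration and are not part of etcd or any Kubernetes tooling; the only fact assumed is that etcd needs a majority of its voting members to be reachable.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare a few cluster sizes side by side.
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

With three control plane nodes the quorum is two, so the cluster survives the loss of one node. A fourth node raises the quorum to three without raising the number of tolerated failures, which is why odd member counts are the usual choice.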
Automation and Reproducibility ✅
- Infrastructure as Code with comprehensive Ansible automation
- Idempotent operations ensuring consistent cluster state
- Version-pinned packages preventing unexpected upgrades
- Goldentooth CLI providing unified cluster management interface
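To make "idempotent" concrete, here is a small sketch of the idea in Python: an operation that describes a desired state, changes the system only if it has to, and reports whether it changed anything. This is an illustration of the pattern, not how Ansible implements its modules, and the `ensure_line` helper, the file name, and the sysctl line are all hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed and False if it was already in
    the desired state, so running it a second time is a no-op.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical target file and setting, used only for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "99-kubernetes.conf"
        print(ensure_line(target, "net.ipv4.ip_forward = 1"))  # first run changes the file
        print(ensure_line(target, "net.ipv4.ip_forward = 1"))  # second run finds nothing to do
```

The changed/unchanged return value is the useful part: it lets a run against an already-configured node finish without touching anything, which is what makes repeated runs safe.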
But a bare Kubernetes cluster, while functional, is just the beginning. Real production workloads require additional platform services and operational capabilities.
The Platform Journey Ahead
The following phases will transform our basic cluster into a comprehensive container platform:
Phase 1: Application Platform Services
The next immediate priorities focus on making the cluster useful for application deployment:
GitOps and Application Management:
- Helm package management for standardized application packaging
- Argo CD for GitOps-based continuous deployment
- ApplicationSets for managing applications across environments
- Sealed Secrets for secure secret management in Git repositories
Ingress and Load Balancing:
- MetalLB for LoadBalancer service implementation
- BGP configuration for dynamic route advertisement
- External DNS for automatic DNS record management
- TLS certificate automation with cert-manager
Phase 2: Observability and Operations
Production clusters require comprehensive observability:
Metrics and Monitoring:
- Prometheus for metrics collection and alerting
- Grafana for visualization and dashboards
- Node exporters for hardware and OS metrics
- Custom metrics for application-specific monitoring
Logging and Troubleshooting:
- Loki for centralized log aggregation
- Vector for log collection and routing
- Distributed tracing for complex application debugging
- Alert routing for operational incident response
Phase 3: Storage and Data Management
Stateful applications require sophisticated storage solutions:
Distributed Storage:
- NFS exports for shared storage across the cluster
- Ceph cluster for distributed block and object storage
- ZFS replication for data durability and snapshots
- SeaweedFS for scalable object storage
Backup and Recovery:
- Velero for cluster backup and disaster recovery
- Database backup automation for stateful workloads
- Cross-datacenter replication for business continuity
Phase 4: Security and Compliance
Enterprise-grade security requires multiple layers:
PKI and Certificate Management:
- Step-CA for internal certificate authority
- Automatic certificate rotation for all cluster services
- SSH certificate authentication for secure node access
- mTLS everywhere for service-to-service communication
Secrets and Access Control:
- HashiCorp Vault for enterprise secret management
- AWS KMS integration for encryption key management
- RBAC policies for fine-grained access control
- Pod security standards for workload isolation
Phase 5: Multi-Orchestrator Hybrid Cloud
The final phase explores advanced orchestration patterns:
Service Mesh and Discovery:
- Consul service mesh for advanced networking and security
- Cross-platform service discovery between Kubernetes and Nomad
- Traffic management and circuit breaking patterns
Workload Distribution:
- Nomad integration for specialized workloads and batch jobs
- Ray cluster for distributed machine learning workloads
- GPU acceleration for AI/ML and scientific computing
Learning Philosophy
This journey prioritizes understanding over convenience:
Transparency Over Magic:
- Each component is manually configured to understand its purpose
- Ansible automation makes every configuration decision explicit
- Documentation captures the reasoning behind each choice
Production Patterns from Day One:
- High availability configurations even in the homelab
- Security-first approach with proper PKI and encryption
- Monitoring and observability built into every service
Platform Engineering Mindset:
- Reusable patterns that could scale to enterprise environments
- GitOps workflows that support team collaboration
- Self-service capabilities for application developers
The Road Ahead
The following chapters will implement these platform services systematically, building up the cluster's capabilities layer by layer. Each addition will:
- Solve a real operational problem (not just add complexity)
- Follow production best practices (high availability, security, monitoring)
- Integrate with existing services (leveraging our PKI, service discovery, etc.)
- Document the implementation (including failure modes and troubleshooting)
This methodical approach ensures that when we're done, we'll have not just a working cluster, but a deep understanding of how modern container platforms are built and operated.
In the following sections, I'll add more functionality.