Step-CA (Again)

After implementing Cilium for networking (see 076_cilium), the cluster needed a proper internal Public Key Infrastructure (PKI) for managing TLS certificates. We'd set this up before, so it was time to get Step-CA working with the Talos cluster.

Why Internal PKI?

An internal PKI provides several advantages for cluster services:

  • Automated certificate issuance: Services can request certificates declaratively
  • Short-lived certificates: 24-hour lifetimes reduce blast radius of compromised keys
  • Centralized trust: Single root CA for all internal services
  • GitOps-managed: Entire PKI configuration version-controlled and automated
  • No external dependencies: Full control over certificate issuance and revocation
  • Learning opportunity: Deep dive into PKI, X.509, and certificate automation

For a learning cluster, building a proper PKI demonstrates production-grade security practices that are often hidden behind managed services in cloud environments.

Architecture Overview

The PKI infrastructure consists of three layers, deployed via Flux with dependency ordering:

Layer 00: Foundations
  ├── cert-manager (certificate lifecycle management)
  ├── cert-manager-approver-policy (approval automation)
  ├── step-ca (Certificate Authority)
  └── step-issuer (cert-manager ↔ step-ca bridge)

Layer 01: Issuers
  ├── StepClusterIssuer (cluster-wide issuer)
  ├── CertificateRequestPolicy (approval rules)
  └── RBAC bindings (allow cert-manager to use policy)

Layer 02: Tests
  ├── Test certificate (24h lifetime)
  └── Canary certificate (2h lifetime, renews hourly)

Each layer depends on the previous layer being healthy before deploying, ensuring correct startup ordering.

Component Details

step-ca: The Certificate Authority

step-ca is Smallstep's open-source CA server. It's designed for automated certificate issuance with:

  • JWK provisioners: Authenticate with JSON Web Keys
  • Short default lifetimes: Encourages certificate rotation
  • REST API: Easy integration with automation tools
  • Flexible configuration: Claims, provisioners, and extensions

The deployment uses bootstrap mode to auto-generate a root CA and JWK provisioner on first run:

ca:
  name: Goldentooth CA
  address: :9000
  dns: step-ca.goldentooth.net,step-ca.step-ca.svc.cluster.local
  url: https://step-ca.goldentooth.net
  db:
    enabled: true
    persistent: false  # emptyDir for homelab
  claims:
    defaultTLSCertDuration: 24h
    maxTLSCertDuration: 24h
    minTLSCertDuration: 5m

bootstrap:
  enabled: true
  configmaps: true  # Export CA cert to ConfigMap
  secrets: true     # Store CA keys in Secrets

The persistent: false setting uses emptyDir storage. This means the CA's database (its record of issued certificates) is lost on pod restart, but the root CA itself is preserved in ConfigMaps and Secrets. Losing that record should not invalidate certificates that were already issued, since they still chain to the same root, and cert-manager keeps requesting renewals automatically on its normal schedule.

cert-manager: Certificate Lifecycle Management

cert-manager is the de facto standard for Kubernetes certificate automation. It:

  • Watches Certificate resources and creates certificate requests
  • Coordinates with external CAs to sign requests
  • Stores issued certificates in Kubernetes Secrets
  • Automatically renews certificates before expiration

The deployment uses the official Helm chart with minimal customization:

values:
  crds:
    enabled: true
  resources:
    requests:
      cpu: 10m
      memory: 64Mi

step-issuer: The Bridge

step-issuer is a custom cert-manager Issuer that speaks step-ca's API. It translates cert-manager's CertificateRequest resources into step-ca API calls using JWK authentication.

The critical configuration challenge was finding the correct chart version. The step-issuer project underwent a major version jump from 0.8.x directly to 1.8.x, skipping the 0.10.x range entirely. Using version: "1.9.x" resolved the "chart not found" errors.
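For illustration, a Flux HelmRelease that pins the chart to the 1.9.x range might look like the sketch below. Only the chart name and the version constraint come from this deployment; the API version, the HelmRepository name (smallstep), the namespaces, and the interval are assumptions.

```yaml
# Hypothetical 00-foundations/step-issuer/release.yaml
apiVersion: helm.toolkit.fluxcd.io/v2    # assumption: older Flux releases use v2beta1/v2beta2
kind: HelmRelease
metadata:
  name: step-issuer
  namespace: flux-system                 # assumption
spec:
  interval: 30m                          # assumption
  targetNamespace: step-ca               # assumption
  chart:
    spec:
      chart: step-issuer
      version: "1.9.x"                   # semver range from the text above
      sourceRef:
        kind: HelmRepository
        name: smallstep                  # assumption: the repository defined under step-ca/
        namespace: flux-system           # assumption
```

Using a range rather than an exact version lets Flux pick up patch releases within 1.9 without another commit.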

cert-manager-approver-policy: Automated Approval

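A subtle but critical component! cert-manager's built-in approver only auto-approves requests for its own built-in issuer types (such as CA, SelfSigned, Vault, Venafi, and ACME). Requests that reference an external issuer like step-issuer stay unapproved unless something else approves them, which is the job approver-policy takes on here.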

Without approver-policy, certificate requests would sit in "pending approval" state forever:

NAME               APPROVED   DENIED   READY   ISSUER
test-certificate-1                     step-ca
                   ^-- stuck here

The approver-policy controller watches for CertificateRequestPolicy resources that define approval rules. But here's the catch: policies must be bound via RBAC to the requester!

GitOps Structure

The deployment follows a layered GitOps approach with explicit dependencies:

gitops/infrastructure/pki/
├── kustomization.yaml (orchestrates all layers)
├── 00-foundations/
│   ├── flux-kustomization.yaml
│   ├── cert-manager/
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── release.yaml
│   │   └── repository.yaml
│   ├── cert-manager-approver-policy/
│   │   ├── kustomization.yaml
│   │   └── release.yaml
│   ├── step-ca/
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── release.yaml
│   │   ├── repository.yaml
│   │   └── service-alias.yaml
│   └── step-issuer/
│       ├── kustomization.yaml
│       └── release.yaml
├── 01-issuers/
│   ├── flux-kustomization.yaml (depends on 00-foundations)
│   ├── kustomization.yaml
│   ├── step-cluster-issuer.yaml
│   ├── certificate-request-policy.yaml
│   └── policy-rbac.yaml
└── 02-tests/
    ├── flux-kustomization.yaml (depends on 01-issuers)
    ├── kustomization.yaml
    ├── test-certificate.yaml
    └── canary-certificate.yaml

Each Flux Kustomization waits for the previous layer's health checks before proceeding:

# 01-issuers/flux-kustomization.yaml
spec:
  dependsOn:
    - name: pki-foundations
  healthChecks:
    - apiVersion: certmanager.step.sm/v1beta1
      kind: StepClusterIssuer
      name: step-ca
      namespace: ""
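The same pattern extends to the last layer. A minimal sketch of what 02-tests/flux-kustomization.yaml could look like, assuming the layer 01 Kustomization is named pki-issuers and the Git source is the default flux-system repository (both names, the interval, and the timeout are assumptions, not taken from the actual repo):

```yaml
# Hypothetical 02-tests/flux-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: pki-tests              # assumption
  namespace: flux-system       # assumption
spec:
  interval: 10m                # assumption
  timeout: 5m                  # assumption: how long to wait for health checks
  path: ./gitops/infrastructure/pki/02-tests
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system          # assumption
  dependsOn:
    - name: pki-issuers        # assumption: name of the layer 01 Kustomization
  wait: true                   # health-check every resource this Kustomization applies
```

With wait: true, Flux health-checks all applied resources instead of an explicit healthChecks list, which suits a layer whose only job is to prove that Certificates become Ready.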

Deployment Process

The deployment was a journey through several layers of abstraction and error messages:

1. Initial Bootstrap Issues

Early attempts used inject.enabled: true to provide configuration to step-ca. This conflicted with bootstrap.enabled: true, causing the CA to fail initialization. The chart expects either bootstrap (auto-generate config) OR inject (provide pre-existing config), not both.

Solution: Use bootstrap mode exclusively, let step-ca auto-generate its CA and provisioner.

2. Persistence Problems

The default persistence.enabled: true setting tried to create a PersistentVolumeClaim. However, the correct field for disabling persistence in the step-certificates chart is ca.db.persistent: false, not the top-level persistence field.

Solution: Explicitly set ca.db.persistent: false to use emptyDir storage.

3. Service DNS Naming Mismatch

The Helm chart created a Service named step-ca-step-ca-step-certificates (following Helm's naming pattern), but the certificate that step-ca serves on its API only carried SANs for the names listed in ca.dns, including step-ca.step-ca.svc.cluster.local. When the StepClusterIssuer connected through the Helm-generated Service name, TLS verification failed:

certificate is valid for step-ca.goldentooth.net, step-ca.step-ca.svc.cluster.local,
not step-ca-step-ca-step-certificates.step-ca.svc.cluster.local

Solution: Create a Service alias with the short name that matches the certificate SAN:

apiVersion: v1
kind: Service
metadata:
  name: step-ca
  namespace: step-ca
spec:
  type: ClusterIP
  ports:
    - name: https
      port: 443
      targetPort: 9000
  selector:
    app.kubernetes.io/name: step-certificates
    app.kubernetes.io/instance: step-ca-step-ca

This provides step-ca.step-ca.svc.cluster.local DNS resolution while preserving the Helm-generated service name.
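With the alias in place, the issuer can point at the short name. The following is a sketch of a StepClusterIssuer using the field layout from the step-issuer project; the provisioner name, key ID, Secret name, and Secret key are placeholders or assumptions, since the bootstrapped values are not shown in this post.

```yaml
# Hypothetical 01-issuers/step-cluster-issuer.yaml
apiVersion: certmanager.step.sm/v1beta1
kind: StepClusterIssuer
metadata:
  name: step-ca
spec:
  # Short Service name on port 443; must match a SAN on step-ca's serving certificate.
  url: https://step-ca.step-ca.svc.cluster.local
  # Placeholder: base64-encoded PEM of the root CA certificate.
  caBundle: <base64-encoded root CA certificate>
  provisioner:
    name: admin                                  # assumption: bootstrapped JWK provisioner name
    kid: <JWK provisioner key ID>                # placeholder
    passwordRef:
      name: <provisioner password Secret name>   # placeholder
      namespace: step-ca
      key: password                              # assumption
```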

4. Certificate Request Approval Deadlock

After resolving connectivity, certificate requests were created but never approved:

$ kubectl get certificaterequest -n cert-test
NAME               APPROVED   DENIED   READY   ISSUER
test-certificate-1                     step-ca

The cert-manager logs showed a helpful message:

Request is not applicable for any policy so ignoring

This revealed two missing pieces:

Missing Component 1: cert-manager-approver-policy

The approver-policy controller wasn't installed. cert-manager's built-in approver only handles internal issuers, so external issuers like step-issuer need the policy controller.

Missing Component 2: RBAC Bindings

Even after creating a CertificateRequestPolicy, requests were still ignored! The policy controller requires RBAC bindings to allow requesters to "use" policies:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: approve-step-ca-requests:use
rules:
  - apiGroups: [policy.cert-manager.io]
    resources: [certificaterequestpolicies]
    verbs: [use]
    resourceNames: [approve-step-ca-requests]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cert-manager:approve-step-ca-requests
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: approve-step-ca-requests:use
subjects:
  - kind: ServiceAccount
    name: cert-manager-cert-manager
    namespace: cert-manager

This pattern follows Kubernetes' principle of explicit authorization: just because a policy exists doesn't mean anyone can use it.
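The ClusterRoleBinding above lets cert-manager use the policy for requests in every namespace. As a narrower variant, approver-policy also evaluates namespaced RBAC, so the same use permission can be granted with a Role and RoleBinding, in which case the policy should only apply to requests in that namespace. A minimal sketch, assuming the cert-test namespace from this post; the Role and RoleBinding names are hypothetical:

```yaml
# Hypothetical namespaced alternative to the ClusterRole/ClusterRoleBinding pair
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: approve-step-ca-requests:use      # hypothetical name
  namespace: cert-test
rules:
  - apiGroups: [policy.cert-manager.io]
    resources: [certificaterequestpolicies]
    verbs: [use]
    resourceNames: [approve-step-ca-requests]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cert-manager:approve-step-ca-requests   # hypothetical name
  namespace: cert-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: approve-step-ca-requests:use
subjects:
  # The requester is cert-manager itself, because it creates the
  # CertificateRequest on behalf of each Certificate resource.
  - kind: ServiceAccount
    name: cert-manager-cert-manager
    namespace: cert-manager
```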

5. Policy Rejection: Missing Organization Field

Once RBAC was in place, certificate requests were finally being evaluated—but denied!

No policy approved this request: [approve-step-ca-requests:
spec.allowed.subject.organizations: Invalid value: []string{"Goldentooth"}: no allowed values]

The test certificate included subject.organizations: ["Goldentooth"], but the policy's allowed section didn't include an organizations field. In cert-manager-approver-policy, the allowed section works as a whitelist: any field in the certificate request must be explicitly allowed, or the request is denied.

Solution: Add organization to the allowed fields:

allowed:
  subject:
    organizations:
      values:
        - "Goldentooth"

This security-by-default behavior prevents privilege escalation via unexpected certificate fields.
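For context, a Certificate along these lines would trigger the denial until organizations is allowed. This is a sketch rather than the actual test-certificate.yaml: the resource name, namespace, Secret name, and issuer reference appear elsewhere in this post, while the DNS name is an assumption.

```yaml
# Sketch of a test Certificate that sets subject.organizations
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-certificate
  namespace: cert-test
spec:
  secretName: test-tls-secret
  duration: 24h
  commonName: test.goldentooth.local     # assumption
  dnsNames:
    - test.goldentooth.local             # assumption
  subject:
    organizations:
      - Goldentooth    # must be covered by allowed.subject.organizations in the policy
  issuerRef:
    name: step-ca
    kind: StepClusterIssuer
    group: certmanager.step.sm
```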

Security Hardening

After basic functionality was working, several security improvements were implemented:

1. Tighten Approval Policy

The initial policy was far more permissive than it needed to be:

# Before: Overly permissive
allowed:
  commonName: { value: "*" }
  dnsNames: { values: ["*.goldentooth.local", "*.goldentooth.net", "*.svc.cluster.local"] }
  ipAddresses: { values: ["*"] }
  subject:
    organizations: { values: ["*"] }

This was refined to actual cluster requirements:

# After: Restricted to actual needs
allowed:
  commonName: { value: "*" }  # OK for internal CA
  dnsNames:
    values:
      - "*.goldentooth.local"
      - "*.goldentooth.net"
      - "*.svc.cluster.local"
      - "*.*.svc.cluster.local"  # Namespaced services
  ipAddresses:
    values:
      - "10.*"  # Cluster IP range only
  subject:
    organizations:
      values:
        - "Goldentooth"  # Specific org only

IP addresses are limited to the internal 10.* range (cluster IPs). This excludes the home network's 192.168.* range, which should never receive cluster-issued certificates.

2. Enforce Certificate Duration Limits

Two layers of duration enforcement provide defense-in-depth:

Policy Layer (primary enforcement):

constraints:
  maxDuration: 24h
  minDuration: 5m

The policy rejects any certificate request outside these bounds before it reaches the CA.

CA Layer (backup enforcement):

ca:
  claims:
    defaultTLSCertDuration: 24h
    maxTLSCertDuration: 24h
    minTLSCertDuration: 5m

While the CA claims are configured in the HelmRelease, they may not apply in bootstrap mode (the ConfigMap shows claims: null). However, step-ca's default claims already enforce 24h maximum, providing a reasonable baseline even without explicit configuration.

The policy layer is the primary defense and is cleanly expressed in GitOps. This follows the principle of enforcing security at the earliest possible point in the request flow.
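Putting the allowed rules and the duration constraints together, the complete policy might look like the sketch below. The allowed and constraints values are the ones shown above; the selector block is an assumption about how the policy is scoped to the step-ca issuer.

```yaml
# Sketch of 01-issuers/certificate-request-policy.yaml
apiVersion: policy.cert-manager.io/v1alpha1
kind: CertificateRequestPolicy
metadata:
  name: approve-step-ca-requests
spec:
  allowed:
    commonName:
      value: "*"
    dnsNames:
      values:
        - "*.goldentooth.local"
        - "*.goldentooth.net"
        - "*.svc.cluster.local"
        - "*.*.svc.cluster.local"
    ipAddresses:
      values:
        - "10.*"
    subject:
      organizations:
        values:
          - "Goldentooth"
  constraints:
    minDuration: 5m
    maxDuration: 24h
  # Assumption: only requests that reference the step-ca issuer are evaluated.
  selector:
    issuerRef:
      name: step-ca
      kind: StepClusterIssuer
      group: certmanager.step.sm
```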

Continuous Validation with Canary Certificates

To ensure certificate renewal continues working, a canary certificate with aggressive rotation was added:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: canary-certificate
  namespace: cert-test
spec:
  secretName: canary-tls-secret
  duration: 2h
  renewBefore: 1h  # Renews when 50% lifetime remains
  commonName: canary.goldentooth.local
  dnsNames:
    - canary.goldentooth.local
  issuerRef:
    name: step-ca
    kind: StepClusterIssuer
    group: certmanager.step.sm

This certificate:

  • Has a 2-hour lifetime
  • Renews when 1 hour remains (50% threshold)
  • Therefore renews approximately every hour

If any component of the PKI stack breaks (step-ca unavailable, policy misconfigured, StepClusterIssuer invalid), the canary's next renewal will fail within 1-2 hours, rather than the problem staying hidden for most of a day until a normal 24h certificate comes up for renewal. This provides early warning of PKI issues.

The canary pattern is purely declarative—just another Certificate resource—requiring no additional infrastructure. It continuously validates the entire certificate request → approval → issuance → renewal path.

I might reduce this to e.g. 15 minutes or something, but this should be fine.
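If the interval is tightened later, note that cert-manager does not accept a Certificate duration under 1 hour (or a renewBefore under 5 minutes), so a 15-minute cadence would have to come from a longer renewBefore rather than a shorter lifetime. A sketch under that assumption, with hypothetical resource, Secret, and DNS names:

```yaml
# Hypothetical faster canary: 1h lifetime, renewed roughly every 15 minutes
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: canary-certificate-fast          # hypothetical
  namespace: cert-test
spec:
  secretName: canary-fast-tls-secret     # hypothetical
  duration: 1h          # cert-manager's minimum accepted duration
  renewBefore: 45m      # renew once 45 minutes remain, i.e. about 15 minutes after issuance
  commonName: canary-fast.goldentooth.local   # hypothetical
  dnsNames:
    - canary-fast.goldentooth.local           # hypothetical
  issuerRef:
    name: step-ca
    kind: StepClusterIssuer
    group: certmanager.step.sm
```

The trade-off is more signing traffic against step-ca in exchange for faster detection.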

Verification

After all components were deployed, verification confirmed the PKI was fully operational:

$ kubectl get certificate -n cert-test
NAME                 READY   SECRET              AGE
canary-certificate   True    canary-tls-secret   1h
test-certificate     True    test-tls-secret     5h

Both certificates show READY=True, meaning they were successfully approved, issued, and stored in Secrets.

Checking the certificate requests:

$ kubectl get certificaterequest -n cert-test
NAME                   APPROVED   DENIED   READY   ISSUER
canary-certificate-1   True                True    step-ca
test-certificate-1     True                True    step-ca

Both show APPROVED=True and READY=True, confirming the approval policy and RBAC bindings are working correctly.

Inspecting the canary certificate shows the short lifetime and renewal timing:

$ kubectl get secret -n cert-test canary-tls-secret -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A 2 "Validity"
        Validity
            Not Before: Nov 16 03:25:27 2025 GMT
            Not After : Nov 16 05:26:27 2025 GMT

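The validity window is 2 hours plus one extra minute (03:25 → 05:26). The extra minute is consistent with step-ca backdating Not Before slightly (one minute by default) to tolerate clock skew.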

Checking the renewal time:

$ kubectl describe certificate -n cert-test canary-certificate | grep "Renewal Time"
Renewal Time: 2025-11-16T04:26:27Z

The certificate will renew at 04:26 (1 hour after issuance), confirming the renewBefore: 1h configuration.

Defense-in-Depth: Multiple Enforcement Layers

The final architecture implements three overlapping security controls:

  1. CertificateRequestPolicy constraints (Layer 01)

    • Enforces max duration: 24h
    • Enforces min duration: 5m
    • Restricts DNS names to known patterns
    • Restricts IPs to internal cluster range
    • Restricts organization field
    • Primary enforcement point (cleanest GitOps expression)

  2. step-ca claims (Layer 00)

    • Configured in HelmRelease values
    • May not apply due to bootstrap mode
    • Provides backup enforcement if requests bypass policy
    • step-ca defaults are reasonable (24h max)

  3. Certificate resource defaults (Layer 02)

    • Applications request sensible values (24h)
    • Well-behaved clients don't try to abuse the system

This layered approach means even if one layer is misconfigured, the others provide protection. The policy layer is the most important because:

  • It's declarative and version-controlled in Git
  • It catches bad requests before they reach the CA
  • It's easy to audit and modify via pull requests
  • It follows the Kubernetes admission controller pattern

Similar to how Pod Security admission controls prevent privileged pods, CertificateRequestPolicy prevents unauthorized certificates.

Lessons Learned

1. Bootstrap vs Inject Modes Are Mutually Exclusive

step-ca's Helm chart supports either auto-generating configuration (bootstrap) or injecting pre-existing configuration (inject), but not both. For a new deployment, bootstrap mode is simpler and more GitOps-friendly.

2. External Issuers Need Explicit Approval Infrastructure

cert-manager's built-in approver only handles cert-manager's own issuers. External issuers require:

  • The approver-policy controller installed
  • A CertificateRequestPolicy defining rules
  • RBAC bindings allowing requesters to use the policy

This three-part requirement isn't obvious from documentation and was discovered through error messages.

3. Approval Policies Are Whitelists, Not Filters

Any field in a certificate request must be explicitly allowed in the policy's allowed section. Missing fields result in denial, not omission. This security-by-default behavior prevents privilege escalation but requires careful policy authoring.

4. Service Naming Matters for TLS Verification

When the Helm chart's service name doesn't match the CA's certificate SANs, TLS verification fails. Creating a service alias that matches the expected DNS name solves this without modifying chart defaults.

5. Canary Certificates Provide Continuous Validation

Rather than waiting for the first production certificate renewal to discover issues, a short-lived canary certificate provides hourly validation of the entire PKI stack. This is a pure GitOps pattern requiring no additional tooling.

6. Policy Enforcement > CA Enforcement for GitOps

While configuring step-ca's claims seems like the "right" place for enforcement, policies provide better GitOps expressiveness, earlier validation, and easier auditability. The CA provides defense-in-depth, but the policy is the primary control.