MCP Server: From Static JSON to Cluster Eyes

The Starting Point

The MCP server was running on the cluster, exposed at mcp.goldentooth.net, and Claude Code could talk to it. But all it could do was return its own version number and a hardcoded list of node names. Useful for proving the pipeline worked, not useful for actually managing the bramble.

The goal: give the MCP server real access to the cluster so it could answer questions about what's actually happening — node health, pod status, workload state, cert expiry, active alerts, logs, metrics. The full observability stack, queryable through natural language via Claude Code.

In-Cluster Kubernetes API Access

RBAC

Kubernetes doesn't let pods talk to the API server by default — you need a ServiceAccount with explicit permissions. Created a read-only ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: goldentooth-mcp-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "services", "namespaces", "events", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["cert-manager.io"]
    resources: ["certificates", "clusterissuers", "issuers", "certificaterequests"]
    verbs: ["get", "list", "watch"]

Read-only across the entire cluster. The MCP server can see everything but touch nothing. Bound to a ServiceAccount in the goldentooth-mcp namespace via a ClusterRoleBinding.
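The binding itself is boilerplate. A sketch of what it looks like — the ClusterRole name comes from the manifest above, but the ServiceAccount name is an assumption based on the namespace:

```yaml
# Sketch: binds the read-only role to the MCP server's ServiceAccount.
# ServiceAccount name is assumed; adjust to match the actual deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: goldentooth-mcp-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: goldentooth-mcp-reader
subjects:
  - kind: ServiceAccount
    name: goldentooth-mcp
    namespace: goldentooth-mcp
```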

One important ordering detail: I pushed the RBAC changes to the gitops repo before pushing the MCP code that depends on it. Flux reconciles gitops in ~5 minutes, and the MCP build takes ~9 minutes through the full CI pipeline. So the ServiceAccount and token are guaranteed to exist by the time the new pod starts. If you get this backwards, kube::Client::try_default() fails because there's no service account token mounted yet.

kube-rs

Added kube and k8s-openapi to the Rust dependencies. The kube crate's Client::try_default() automatically detects the in-cluster environment — it reads the service account token from /var/run/secrets/kubernetes.io/serviceaccount/token and the API server CA cert from the same directory. No configuration needed.

let kube_client = match Client::try_default().await {
    Ok(client) => {
        tracing::info!("Kubernetes client initialized (in-cluster)");
        Some(client)
    }
    Err(e) => {
        tracing::warn!("Kubernetes client not available: {e}");
        None
    }
};

The client is Option<Client> so the server degrades gracefully when running locally for development — the cluster tools return a clear error instead of panicking.

rmcp Parameter Handling

Discovered that rmcp's #[tool] macro uses a Parameters<T> wrapper for tool inputs, not a #[tool(param)] attribute:

#[tool(description = "List pods. Optionally filter by namespace.")]
async fn get_pods(
    &self,
    Parameters(input): Parameters<NamespaceFilter>,
) -> Result<CallToolResult, McpError> {
    cluster::get_pods(self.require_kube()?, input.namespace.as_deref()).await
}

The NamespaceFilter struct derives both Deserialize and schemars::JsonSchema, which lets the MCP protocol auto-generate the parameter schema that Claude Code uses for tool discovery.
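The input type itself is tiny. A sketch — the field name follows the input.namespace usage above; the doc comment (which schemars surfaces as the parameter description) is mine:

```rust
// Sketch of the tool input type; derives are from the text,
// the doc comment is illustrative.
#[derive(serde::Deserialize, schemars::JsonSchema)]
pub struct NamespaceFilter {
    /// Namespace to filter by; omit to list pods in all namespaces.
    pub namespace: Option<String>,
}
```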

k8s-openapi Version Dance

Ran into a build error because kube 1.x depends on k8s-openapi 0.25 but I initially specified 0.24 with features = ["latest"]. The latest feature only exists on 0.24, and having two versions of k8s-openapi in the dependency tree causes the build script to panic. Fixed by using k8s-openapi 0.25 with an explicit features = ["v1_32"].
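The working combination in Cargo.toml looks something like this — version numbers are from the text above; any extra feature flags on kube are assumptions:

```toml
# k8s-openapi must appear exactly once in the dependency tree,
# pinned to a single Kubernetes version feature.
kube = "1"
k8s-openapi = { version = "0.25", features = ["v1_32"] }
```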

Observability Tools

The Kubernetes API gives us infrastructure state, but the cluster has a full monitoring stack — Prometheus, Loki, Alertmanager, and cert-manager — with their own APIs. Added reqwest for HTTP client calls to these in-cluster services.

cert-manager Certificates

cert-manager CRDs aren't in k8s-openapi (they're custom resources), so I used kube's dynamic API:

let certs_api = kube::Api::<kube::api::DynamicObject>::all_with(
    client.clone(),
    &kube::discovery::ApiResource {
        group: "cert-manager.io".into(),
        version: "v1".into(),
        api_version: "cert-manager.io/v1".into(),
        kind: "Certificate".into(),
        plural: "certificates".into(),
    },
);

Returns every Certificate resource with its ready status, expiry time, renewal time, issuer, and DNS names. Now I can ask "are any certs about to expire?" and get a real answer.

Alertmanager Alerts

Simple REST call to the Alertmanager v2 API:

http://monitoring-kube-prometheus-alertmanager.monitoring.svc:9093/api/v2/alerts

Returns active alerts with severity, status, summary, and description. If something's on fire, this is how I'll know.

Loki Log Queries

LogQL queries against Loki:

http://monitoring-loki.monitoring.svc.cluster.local:3100/loki/api/v1/query_range

Accepts arbitrary LogQL — {namespace="forgejo"} |= "error", {job="systemd-journal"} |= "OOM", whatever. Returns up to 500 log lines with timestamps and stream labels. This is the "what just happened" tool.

Prometheus Metrics

PromQL queries against Prometheus:

http://monitoring-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query

Instant queries for any metric — up, node_memory_MemAvailable_bytes, rate(container_cpu_usage_seconds_total[5m]). This is the "how's the cluster doing right now" tool.

Build Time

The binary got noticeably bigger with kube-rs, k8s-openapi, reqwest, rustls, and all their transitive dependencies. Build time went from ~6 minutes to ~9 minutes in the Forgejo Actions pipeline. Still native compilation on the Pi (ARM64 on ARM64), but Rust release builds with a heavy dependency tree and musl static linking are just slow on a Raspberry Pi. It is what it is.

Also bumped the pod memory limits from 64Mi to 128Mi — the TLS stack in kube-rs and reqwest needs more headroom than a bare HTTP server.
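In manifest terms that's a one-line change. A sketch of the relevant container fragment — only the limit value comes from the text; the surrounding manifest is assumed:

```yaml
# Container resources fragment: 64Mi was fine for a bare HTTP server,
# the TLS stack in kube-rs and reqwest needs 128Mi.
resources:
  limits:
    memory: 128Mi
```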

The Full Tool Set

After this work, the MCP server exposes ten tools:

Tool | Source | What it does
get_version | Static | Server version and build SHA
get_node_status | K8s API | Node readiness, CPU, memory, OS
get_pods | K8s API | Pod phase, restarts, node placement
get_namespaces | K8s API | Namespace listing
get_events | K8s API | Recent cluster events
get_workloads | K8s API | Deployment/StatefulSet/DaemonSet replicas
get_certificates | K8s API | cert-manager certificate status and expiry
get_alerts | Alertmanager | Active alerts with severity
query_logs | Loki | LogQL log search
query_metrics | Prometheus | PromQL metric queries

The server went from "hello world" to "full cluster observability" in two deploys. The next obvious additions are Flux reconciliation status (is everything in sync?) and maybe ntfy notification history. But this is already enough to do a real cluster health check without touching kubectl.