GPU Gaming Containers: The Quest Begins
After getting JupyterLab working with GPU support (see 082_jupyterlab_gpu), I had a taste of what GPU workloads could do in Kubernetes. Neural networks are cool, but you know what would be cooler? Playing Baldur's Gate 3 streamed from a containerized Steam instance running on my cluster.
The plan: Use Packer to build a gaming container image with Steam and Proton, run it on velaryon's RTX 2070 Super, and stream games via Steam Remote Play. This is definitely overkill for playing video games, but that's never stopped me before.
The Architecture: CUDA vs OpenGL
Before diving in, I needed to understand what makes gaming different from machine learning workloads. JupyterLab uses CUDA—NVIDIA's framework for general-purpose parallel computing. You write kernels that run math operations across thousands of GPU cores. Matrix multiplication, neural network training, that sort of thing.
Gaming uses OpenGL (or Vulkan), which is specifically designed for 3D graphics rendering. It's a fixed pipeline:
Vertices → Vertex Shader → Rasterization → Fragment Shader → Pixels
Both use the same GPU hardware, but they're different programming models. CUDA is "here's data and a function, run it everywhere." OpenGL is "here's a 3D scene, transform and render it through these specific stages."
The RTX 2070 Super has both capabilities. JupyterLab already proved CUDA works. Now I needed to verify the graphics path worked too.
Stage 1: Verifying the Foundation
First, I checked that the NVIDIA device plugin was still running:
$ kubectl get pods -n nvidia
NAME READY STATUS RESTARTS AGE
nvidia-nvidia-device-plugin-wflzh 2/2 Running 0 3d23h
Good. The device plugin is what advertises GPU resources to Kubernetes. Without it, you can't request nvidia.com/gpu: 1 in your pod spec.
Next, I verified velaryon was advertising GPU capacity:
$ kubectl get node velaryon -o jsonpath='{.status.capacity}' | jq
{
"cpu": "24",
"memory": "32786032Ki",
"nvidia.com/gpu": "4",
...
}
Four GPUs! Well, one physical GPU time-sliced into four virtual ones. The time-slicing config from the JupyterLab setup advertises 1 physical RTX 2070 Super as 4 resources via context-switching. JupyterLab is using one slot, leaving three free.
Testing CUDA Access
I created a simple test pod to verify CUDA still worked:
# gpu-test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
runtimeClassName: nvidia # Use NVIDIA container runtime
nodeSelector:
role: gpu
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: gpu-test
image: nvidia/cuda:12.0.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: "1"
The key pieces:
runtimeClassName: nvidia- Tells Kubernetes to use Talos's NVIDIA container runtime (which mounts GPU device files and driver libraries)resources.limits.nvidia.com/gpu: 1- Requests one GPU from the device pluginnodeSelectorandtolerations- Ensures it schedules on velaryon
$ kubectl apply -f gpu-test-pod.yaml
$ kubectl logs gpu-test
Thu Nov 27 16:58:24 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01 Driver Version: 535.247.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2070 ... On | 00000000:2D:00.0 Off | N/A |
| 0% 46C P8 19W / 215W | 749MiB / 8192MiB | 0% Default |
+---------------------------------------------------------------------------------------+
Perfect. CUDA access works. The container sees the GPU, the driver, everything. The 749MiB of used VRAM is JupyterLab sitting idle.
Testing OpenGL: The llvmpipe Problem
CUDA working doesn't guarantee OpenGL works. Time to test graphics rendering. I created a test that runs glxgears (a simple OpenGL demo) in a container:
# gpu-opengl-test-pod.yaml (first attempt)
apiVersion: v1
kind: Pod
metadata:
name: gpu-opengl-test
spec:
runtimeClassName: nvidia
nodeSelector:
role: gpu
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: opengl-test
image: nvidia/opengl:1.2-glvnd-runtime-ubuntu22.04
command:
- /bin/bash
- -c
- |
apt-get update -qq
apt-get install -y -qq mesa-utils xvfb
Xvfb :99 -screen 0 1024x768x24 &
export DISPLAY=:99
sleep 2
glxinfo | grep -E "OpenGL renderer|OpenGL version"
timeout 5s glxgears -info
resources:
limits:
nvidia.com/gpu: "1"
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_DRIVER_CAPABILITIES
value: "all"
The script:
- Installs
mesa-utils(OpenGL testing tools) andxvfb(virtual X server) - Starts Xvfb on display :99 (OpenGL needs an X server to create a rendering context)
- Runs
glxinfoto see what OpenGL renderer is being used
The result:
OpenGL renderer string: llvmpipe (LLVM 15.0.7, 256 bits)
OpenGL version string: 4.5 (Compatibility Profile) Mesa 23.2.1
llvmpipe. That's Mesa's CPU-based software renderer. OpenGL was rendering on the CPU, not the GPU. Games would run at maybe 2 FPS with this.
The Debugging Journey
This was frustrating. nvidia-smi worked perfectly, proving the GPU was accessible. But OpenGL couldn't see it. Something was broken in the graphics path specifically.
I deployed a long-running debug pod to investigate:
apiVersion: v1
kind: Pod
metadata:
name: gpu-debug
spec:
runtimeClassName: nvidia
nodeSelector:
role: gpu
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: debug
image: nvidia/opengl:1.2-glvnd-runtime-ubuntu22.04
command: ["sleep", "infinity"]
resources:
limits:
nvidia.com/gpu: "1"
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_DRIVER_CAPABILITIES
value: "all"
Then I exec'd in and started poking around:
$ kubectl exec -it gpu-debug -- bash
# Check if NVIDIA libraries are mounted
root@gpu-debug:/# ls -la /usr/lib/x86_64-linux-gnu/libnvidia* | head -20
lrwxrwxrwx. 1 root root 33 Nov 27 17:28 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.247.01
-rwxr-xr-x. 1 root root 160552 Jun 1 2019 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.535.247.01
...
-rwxr-xr-x. 1 root root 45959040 Jun 1 2019 /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.535.247.01
-rwxr-xr-x. 1 root root 656472 Jun 1 2019 /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.535.247.01
The NVIDIA graphics libraries were there! libnvidia-glcore, libnvidia-glsi, all the OpenGL stuff. The NVIDIA runtime was doing its job and mounting the driver libraries.
# Check GLX libraries
root@gpu-debug:/# ls -la /usr/lib/x86_64-linux-gnu/libGL*
lrwxrwxrwx. 1 root root 14 Jan 4 2022 /usr/lib/x86_64-linux-gnu/libGL.so.1 -> libGL.so.1.7.0
-rw-r--r--. 1 root root 543056 Jan 4 2022 /usr/lib/x86_64-linux-gnu/libGL.so.1.7.0
...
lrwxrwxrwx. 1 root root 27 Nov 27 17:28 /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 -> libGLX_nvidia.so.535.247.01
-rwxr-xr-x. 1 root root 1195552 Jun 1 2019 /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.535.247.01
lrwxrwxrwx. 1 root root 20 Dec 15 2022 /usr/lib/x86_64-linux-gnu/libGLX_mesa.so.0 -> libGLX_mesa.so.0.0.0
-rw-r--r--. 1 root root 459672 Dec 15 2022 /usr/lib/x86_64-linux-gnu/libGLX_mesa.so.0.0.0
Interesting. Both NVIDIA and Mesa GLX libraries existed. libGLX_nvidia.so was the NVIDIA implementation, libGLX_mesa.so was the software renderer. OpenGL was choosing Mesa.
# Check ICD (Installable Client Driver) configs
root@gpu-debug:/# ls -la /usr/share/glvnd/egl_vendor.d/
total 5
drwxr-xr-x. 1 root root 28 May 17 2023 .
drwxr-xr-x. 1 root root 26 May 17 2023 ..
-rw-r--r--. 1 root root 107 Jun 1 2019 10_nvidia.json
-rw-r--r--. 1 root root 105 Dec 15 2022 50_mesa.json
ICD config files tell GLVND (the OpenGL loader) which vendor libraries to use. The files are numbered—lower numbers have higher priority. 10_nvidia.json should be preferred over 50_mesa.json.
root@gpu-debug:/# cat /usr/share/glvnd/egl_vendor.d/10_nvidia.json
{
"file_format_version" : "1.0.0",
"ICD" : {
"library_path" : "libEGL_nvidia.so.0"
}
}
Wait. This ICD config is for EGL (libEGL_nvidia.so.0), not GLX.
EGL vs GLX: The Missing Link
EGL and GLX are two different OpenGL windowing systems:
- GLX (GL + X11): Traditional Linux OpenGL, works with X11 servers like Xvfb
- EGL (Embedded GL): Modern, platform-agnostic OpenGL for Wayland or headless contexts
When I ran glxinfo, I was testing GLX. But the NVIDIA ICD config only told GLVND about EGL libraries. There was no /usr/share/glvnd/glx_vendor.d/ directory with GLX configs.
I tried setting an environment variable that explicitly tells GLVND to use NVIDIA for GLX:
root@gpu-debug:/# export __GLX_VENDOR_LIBRARY_NAME=nvidia
root@gpu-debug:/# glxinfo | grep "OpenGL renderer"
OpenGL renderer string: NVIDIA GeForce RTX 2070 SUPER/PCIe/SSE2
There it is. With __GLX_VENDOR_LIBRARY_NAME=nvidia, GLVND used the NVIDIA libraries instead of Mesa. GPU-accelerated rendering worked.
The Fix
The solution was adding that environment variable to the pod spec:
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_DRIVER_CAPABILITIES
value: "all"
# Tell GLVND to use NVIDIA for GLX (not Mesa)
- name: __GLX_VENDOR_LIBRARY_NAME
value: "nvidia"
After updating the test pod and rerunning:
OpenGL renderer string: NVIDIA GeForce RTX 2070 SUPER/PCIe/SSE2
OpenGL version string: 4.6.0 NVIDIA 535.247.01
GL_VENDOR = NVIDIA Corporation
Perfect. OpenGL 4.6 support with full GPU acceleration. More than enough for modern games.
What I Learned
CUDA vs OpenGL: CUDA is for general parallel compute (neural networks, matrix math). OpenGL is for 3D graphics rendering (games, visualization). Both use the same GPU hardware but through different programming models.
Container Runtime Classes: The runtimeClassName: nvidia tells Kubernetes to use Talos's NVIDIA container runtime, which mounts GPU device files (/dev/nvidia*) and driver libraries into the container. Without this, containers can't access the GPU even if the device plugin allocated a GPU resource.
NVIDIA Environment Variables:
NVIDIA_VISIBLE_DEVICES=all- Makes all GPUs visible to the containerNVIDIA_DRIVER_CAPABILITIES=all- Enables graphics, compute, video encoding/decoding, etc.__GLX_VENDOR_LIBRARY_NAME=nvidia- Tells GLVND to use NVIDIA's GLX implementation instead of Mesa
GLVND Vendor Dispatch: GLVND (GL Vendor Neutral Dispatch) is the modern OpenGL loader that supports multiple GPU vendors on the same system. It uses ICD config files to discover which vendor libraries to load. In containers, the EGL configs are present but the GLX configs are missing, so you need the __GLX_VENDOR_LIBRARY_NAME env var for explicit vendor selection.
The Debugging Process: When something doesn't work, trace the path systematically:
- Verify the hardware is accessible (nvidia-smi)
- Check if libraries are mounted (
ls /usr/lib/.../libnvidia*) - Check ICD loader configs (
/usr/share/glvnd/) - Use environment variables to force specific behavior
- Test in an interactive shell (
kubectl exec -it) before building automated solutions
Next Steps
Stage 1 is complete: GPU-accelerated OpenGL rendering works in containers. The foundation is solid.
Stage 2 will be building a full GUI container with VNC so I can actually see and interact with applications. That means:
- Setting up a proper X server (not just Xvfb)
- Installing a lightweight window manager
- Configuring TigerVNC for remote access
- Making it accessible via Kubernetes Service/Ingress
Then Stage 3: Getting Steam and Proton running in that container.
This is going to be fun. Or frustrating. Probably both.