In 2024, a subtle bug was found and fixed in the Linux kernel’s net/mlx5 driver for Mellanox (NVIDIA) ConnectX devices. Tracked as CVE-2024-50136, the flaw left a notifier callback registered when eswitch initialization failed, which could lead to repeated kernel warnings and, possibly, resource leakage and unexpected kernel behavior.
This article explains the vulnerability, lists reference sources, provides relevant code snippets, and describes how the bug could be reproduced or triggered in practice, all in simple, clear language.
## What is net/mlx5 and Why Should You Care?
net/mlx5 is the Linux kernel driver (loaded as the mlx5_core module) for NVIDIA/Mellanox ConnectX network hardware, which is common in high-speed data centers and cloud infrastructure. These devices support advanced features such as hardware SR-IOV and eswitching, which are essential for virtualization and performance isolation.
## The Bug: Notifier Still Registered!
The problem arises when an attempt to enable the "eswitch" (a sort of virtual switch for network traffic inside the card) *fails*. Because of missing cleanup, the driver does not unregister the corresponding notifier callback when eswitch initialization fails. As a result, a later attempt to enable the eswitch tries to register the same callback again, and the kernel warns that it is already registered.
Kernel warning signature:
```
[ 682.589148] ------------[ cut here ]------------
[ 682.590204] notifier callback eswitch_vport_event [mlx5_core] already registered
[ 682.590256] WARNING: CPU: 13 PID: 266 at kernel/notifier.c:31 notifier_chain_register+0x3e/0x90
```
and stack traces similar to:
```
notifier_chain_register+0x3e/0x90
atomic_notifier_chain_register+0x25/0x40
mlx5_eswitch_enable_locked+0x1d4/0x3b0 [mlx5_core]
...
```
When setting up the eswitch, the net/mlx5 code registers a notifier:
```c
// Simplified snippet
ret = atomic_notifier_chain_register(&some_chain, &esw_notifier);
if (ret)
    return ret;
```
Now, if something fails immediately after registration (for example, because the hardware is misconfigured), the code exits but forgets to unregister the notifier.
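The failure mode can be modeled in user space. The following is a minimal sketch, not kernel code: `toy_nb`, `toy_chain`, `toy_register`, and `toy_enable_buggy` are hypothetical names invented for illustration, and the `hw_ok` flag simply stands in for "a later initialization step succeeded". It assumes, as the warning text suggests, that registering a callback already on the chain is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a notifier block: only its identity matters here. */
struct toy_nb {
    struct toy_nb *next;
};

/* Toy stand-in for a notifier chain (singly linked list). */
struct toy_chain {
    struct toy_nb *head;
};

/* Register nb, refusing a block that is already on the chain. */
static int toy_register(struct toy_chain *c, struct toy_nb *nb)
{
    assert(c && nb);
    for (struct toy_nb *p = c->head; p; p = p->next)
        if (p == nb)
            return -EEXIST; /* the real kernel also prints a warning here */
    nb->next = c->head;
    c->head = nb;
    return 0;
}

/* Buggy enable path: registers, then bails out on failure without cleanup. */
static int toy_enable_buggy(struct toy_chain *c, struct toy_nb *nb, int hw_ok)
{
    int err = toy_register(c, nb);
    if (err)
        return err;
    if (!hw_ok)
        return -EIO; /* BUG: nb is still on the chain */
    return 0;
}
```

In this model, a first call with `hw_ok == 0` fails with `-EIO` and leaves the block on the chain, so a second call fails with `-EEXIST` even if the hardware is now healthy.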
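## The Fix
The fix, shown here in simplified form, adds the missing unregistration to the error path:
```c
err = mlx5_eswitch_enable_locked(esw);
if (err) {
    // Previously, this error path returned without any cleanup
    goto err_unreg_notifier;
}
...
err_unreg_notifier:
    mlx5_nb_unregister(&esw->nb);
    return err;
```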
In simple English:
When enabling the eswitch fails, the new code makes sure to unregister (remove) the notifier callback, preventing it from being doubly registered in future attempts.
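The same cleanup idea can be tried out in user space. This is a hedged sketch rather than the actual patch: `toy_esw`, `toy_nb_register`, `toy_nb_unregister`, and `toy_enable_fixed` are hypothetical names, a single boolean stands in for membership on the notifier chain, and `hw_ok` stands in for the outcome of the remaining initialization steps.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy eswitch state: the flag models "our callback is on the chain". */
struct toy_esw {
    bool nb_registered;
};

/* Register the callback; a second registration is rejected. */
static int toy_nb_register(struct toy_esw *esw)
{
    assert(esw);
    if (esw->nb_registered)
        return -EEXIST;
    esw->nb_registered = true;
    return 0;
}

/* Remove the callback from the (modeled) chain. */
static void toy_nb_unregister(struct toy_esw *esw)
{
    assert(esw);
    esw->nb_registered = false;
}

/* Fixed enable path: a failure after registration unwinds it. */
static int toy_enable_fixed(struct toy_esw *esw, bool hw_ok)
{
    int err = toy_nb_register(esw);
    if (err)
        return err;

    if (!hw_ok) {
        err = -EIO;
        goto err_unreg_notifier;
    }
    return 0;

err_unreg_notifier:
    toy_nb_unregister(esw);
    return err;
}
```

With the unwind in place, a failed attempt leaves no registration behind, so a later attempt with working hardware succeeds instead of hitting the duplicate-registration check.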
## Reference: Original Advisory and Patch Links
- NVD CVE-2024-50136 Record
- Linux Kernel Patch
- Commit Discussion
## Exploiting the Vulnerability: Practical Steps
This bug is mostly about kernel internal consistency rather than direct user compromise, but denial of service and system instability are possible. Here’s how it could be triggered:
Setup:
Use a server with Mellanox/NVIDIA ConnectX hardware and load the mlx5_core driver.
Intentionally fail eswitch initialization:
This can happen with a misconfigured or physically broken device, or possibly by tweaking driver module parameters.
Then try to enable SR-IOV again, which enables the eswitch once more, and check the kernel log:
```bash
# echo 4 > /sys/class/net/<interface>/device/sriov_numvfs
...
WARNING: CPU: XX PID: XXXX at kernel/notifier.c:31 notifier_chain_register+0x3e/0x90
```
Possible effects:
- Flooded kernel log (dmesg)
- A stale notifier entry left on the chain after the failed initialization (a possible resource leak)
- Unreliable eswitch/virtualization setup
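To check whether a machine has already hit this condition, the kernel log can be searched for the warning signature. The function below is a minimal sketch under stated assumptions: `count_dup_notifier_warnings` is a hypothetical helper, the log is assumed to be already loaded into a NUL-terminated buffer (for example, from saved `dmesg` output), and the matched strings are taken from the warning shown earlier in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count log lines that report the eswitch_vport_event callback as
 * "already registered". Both phrases must appear on the same line.
 */
static size_t count_dup_notifier_warnings(const char *log)
{
    static const char cb[] = "notifier callback eswitch_vport_event";
    static const char tail[] = "already registered";
    size_t hits = 0;

    assert(log);
    for (const char *p = strstr(log, cb); p; p = strstr(p + 1, cb)) {
        const char *eol = strchr(p, '\n');  /* end of the current line, if any */
        const char *t = strstr(p, tail);    /* next "already registered" */
        if (t && (!eol || t < eol))
            hits++;
    }
    return hits;
}
```

A nonzero count suggests that an eswitch enable attempt was retried after an earlier failure on an unpatched kernel. It does not prove this, because the log may have rotated or the message format may differ between kernel versions.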
## How to Stay Safe
If you run servers with NVIDIA/Mellanox hardware, upgrade your kernel!
This was fixed in upstream Linux in 2024, and the CVE record was published in November 2024. Major distributions will backport the fix, but if in doubt, contact your vendor.
If you can’t upgrade, at least avoid repeatedly enabling/disabling SR-IOV or eswitching until you have a patched kernel.
## Conclusion
While CVE-2024-50136 doesn’t let attackers run code or access secret data directly, it’s a great example of why careful cleanup in kernel drivers is essential. Failing to unregister handlers and notifiers can, over time, destabilize even the most robust Linux servers.
For sysadmins: Watch for strange warnings in your logs if using Mellanox/NVIDIA devices, and patch promptly.
For developers: Always clean up your registrations on *every* error or exit path!
## Timeline
Published on: 11/05/2024 18:15:16 UTC
Last modified on: 11/08/2024 14:31:09 UTC