The Linux kernel is at the heart of countless devices worldwide, managing everything from personal laptops to enterprise servers. Its modularity and the rapid pace of development mean new features are always on the horizon – but sometimes, even small changes can have wide-reaching impact.

Recently, a serious bug was uncovered and fixed in the kernel’s NVMe over Fabrics (NVMe-oF) code, tracked as CVE-2024-53169. In this post, we’ll break down what went wrong, how this bug could crash your system, and how the fix works – all explained simply, with exclusive code snippets and references you can use to learn more.

What Is NVMe-oF?

NVMe over Fabrics is a protocol that lets you connect to remote storage devices over a network with high performance and low latency, as if the drives were local. It’s widely used in data centers and high-end systems.

The Linux kernel’s NVMe Fabric driver manages connections (“controllers”) to remote drives and uses a process called “keep-alive” (sending periodic messages) to ensure these controllers are still responsive.

The Bug: Race to the Crash

The issue: When a user or system is *shutting down* an NVMe fabric controller, the keep-alive mechanism could “sneak in” and interfere. This can trigger a race condition between two kernel code paths:

1. Controller Shutdown: Removes the admin queue (admin_q), which manages requests to the remote device.
2. Keep-Alive Request: Tries to use the admin queue to send a periodic "keep-alive" command.

If the keep-alive request fires just as the controller is shutting down, one path uses the admin queue while the other tears it down. The keep-alive runs as deferred work on a kernel worker thread, so the two paths can execute concurrently on different CPUs. The result is a use-after-free: the keep-alive work touches the queue after it has already been destroyed.

Here’s how the problem appears in the kernel’s logs:

Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

Why Did This Happen?

A recent change to the kernel (commit a54a93d0e359) moved the call that *stops* keep-alive messages from early in the shutdown process (before the admin queue is destroyed) to nvme_uninit_ctrl(), which runs after the queue is gone. That left a window in which the keep-alive work could run after its resources were freed.

How Could This Be Exploited?

While this is not a security bug in the classic sense (such as allowing a remote attacker to inject code), it can be abused for Denial of Service (DoS) by triggering a kernel crash. If an attacker can force very frequent NVMe controller shutdowns (for example, by rapidly toggling connection states), they may hit the race window often enough to crash the system.

Here’s a simplified pseudocode example showing the race:

// Pseudocode for the problematic sequence
void shutdown_controller(struct nvme_ctrl *ctrl) {
    // Old (buggy) sequence:
    // Stops keep-alive AFTER removing admin queue!
    remove_admin_queue(ctrl->admin_q); // <-- Admin queue resources deleted
    stop_keep_alive(ctrl);             // <-- Too late!
}

void nvme_keep_alive_work(struct nvme_ctrl *ctrl) {
    // Runs periodically
    if (ctrl->admin_q) {
        send_keep_alive(ctrl->admin_q); // <-- Might use freed memory!
    }
}
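
The pseudocode above is not runnable, so here is a minimal user-space sketch in C that models the same ordering problem deterministically. Everything in it (struct model_ctrl, count_unsafe_windows, the step enum, the two order arrays) is a hypothetical name made up for illustration, not kernel API. It replaces real concurrency with a keep-alive tick that is allowed to fire after every teardown step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical teardown steps; illustrative names, not kernel API. */
enum step { STOP_KEEP_ALIVE, REMOVE_ADMIN_QUEUE };

struct model_ctrl {
    bool queue_alive;      /* admin queue is still allocated */
    bool keep_alive_armed; /* keep-alive work may still fire */
};

/* A tick is unsafe if keep-alive can still fire but the queue is gone. */
static bool tick_is_unsafe(const struct model_ctrl *c)
{
    return c->keep_alive_armed && !c->queue_alive;
}

/*
 * Walks a teardown order and lets a keep-alive tick fire after every step.
 * Returns the number of points at which a tick would touch a destroyed queue.
 */
static int count_unsafe_windows(const enum step *order, size_t n)
{
    struct model_ctrl c = { .queue_alive = true, .keep_alive_armed = true };
    int unsafe = 0;

    assert(order != NULL);
    for (size_t i = 0; i < n; i++) {
        if (order[i] == STOP_KEEP_ALIVE)
            c.keep_alive_armed = false;
        else
            c.queue_alive = false;
        if (tick_is_unsafe(&c))
            unsafe++;
    }
    return unsafe;
}

/* Old sequence: queue removed first, keep-alive stopped afterwards. */
static const enum step buggy_order[] = { REMOVE_ADMIN_QUEUE, STOP_KEEP_ALIVE };
/* Safe sequence: keep-alive stopped before the queue goes away. */
static const enum step fixed_order[] = { STOP_KEEP_ALIVE, REMOVE_ADMIN_QUEUE };
```

Under these assumptions, count_unsafe_windows(buggy_order, 2) returns 1 (one point where a tick would touch a destroyed queue), while count_unsafe_windows(fixed_order, 2) returns 0. The model says nothing about how likely the window is to be hit on real hardware, only that it exists in one ordering and not in the other.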

The Fix: Moving the Stop

To close this race, kernel developers moved the call that stops keep-alive back to *before* admin queue removal, a simple yet effective fix.

Here’s a simplified diff of the patch:

- void nvme_uninit_ctrl(struct nvme_ctrl *ctrl) {
-     nvme_stop_keep_alive(ctrl); // Called too late, after admin queue removed
- }

+ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl) {
+     nvme_stop_keep_alive(ctrl); // Now called at the right time
+     // ...now safe to delete admin queue and tagset
+ }

Official Patch

The real patch is available on the Linux kernel mailing list and on GitHub.

Relevant code snippet

void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
{
    nvme_stop_keep_alive(ctrl); // <-- The added safe stop
    // ... admin queue teardown elided ...
    blk_mq_free_tag_set(ctrl->admin_tagset); // admin_tagset is already a pointer
}
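
The same ordering rule can be tried out in user space. The sketch below is an analogy, not the kernel code: a pthread stands in for the keep-alive work, a heap allocation stands in for the admin queue, and setting a flag followed by pthread_join plays the role of a synchronous stop. All demo_* names and keep_alive_worker are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* User-space stand-ins; none of these names are kernel API. */
struct demo_queue { atomic_int submitted; };

struct demo_ctrl {
    struct demo_queue *admin_q;
    pthread_t ka_thread;
    atomic_bool ka_stop;
};

/* Keep-alive worker: keeps submitting to the admin queue until told to stop. */
static void *keep_alive_worker(void *arg)
{
    struct demo_ctrl *ctrl = arg;

    while (!atomic_load(&ctrl->ka_stop)) {
        atomic_fetch_add(&ctrl->admin_q->submitted, 1);
        sched_yield();
    }
    return NULL;
}

/* Allocates the queue and starts the worker. Returns 0 on success. */
static int demo_start(struct demo_ctrl *ctrl)
{
    ctrl->admin_q = calloc(1, sizeof(*ctrl->admin_q));
    if (!ctrl->admin_q)
        return -1;
    atomic_init(&ctrl->admin_q->submitted, 0);
    atomic_init(&ctrl->ka_stop, false);
    if (pthread_create(&ctrl->ka_thread, NULL, keep_alive_worker, ctrl) != 0) {
        free(ctrl->admin_q);
        ctrl->admin_q = NULL;
        return -1;
    }
    return 0;
}

/* Synchronous stop: returns only after the worker has fully exited. */
static void demo_stop_keep_alive(struct demo_ctrl *ctrl)
{
    atomic_store(&ctrl->ka_stop, true);
    pthread_join(ctrl->ka_thread, NULL);
}

/* Safe order: stop the worker first, then free what it was using. */
static int demo_remove_admin_queue(struct demo_ctrl *ctrl)
{
    int total;

    assert(ctrl->admin_q != NULL);
    demo_stop_keep_alive(ctrl);
    total = atomic_load(&ctrl->admin_q->submitted);
    free(ctrl->admin_q);
    ctrl->admin_q = NULL;
    return total;
}
```

To try it, call demo_start() on a struct demo_ctrl, let the worker run briefly, then call demo_remove_admin_queue(); build with -pthread. Swapping the free() ahead of demo_stop_keep_alive() would recreate the use-after-free in miniature, which is why the stop has to be synchronous and come first.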

- Upgrade your Linux kernel to a version that includes the fix described above.
- If you maintain a custom kernel or distribution, port the patch: see the upstream commit.
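
As a rough aid when checking whether an upgrade is needed, the following sketch compares the running kernel’s release string with a version you supply. This post does not name the fixed release, so fixed_in is an assumption you need to fill in from your distribution’s advisory. Distributions also backport fixes without changing the upstream version number, so treat the result as a hint only. The helper names (parse_kver, kver_cmp, running_kernel_at_least) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

struct kver { int major, minor, patch; };

/*
 * Parses the leading "X.Y.Z" (or "X.Y") of a release string such as
 * "6.11.4-generic". A missing patch level is treated as 0.
 * Returns 0 on success, -1 if the string does not start with a version.
 */
static int parse_kver(const char *release, struct kver *out)
{
    assert(release != NULL && out != NULL);
    out->major = out->minor = out->patch = 0;
    if (sscanf(release, "%d.%d.%d", &out->major, &out->minor, &out->patch) < 2)
        return -1;
    return 0;
}

/* Returns <0, 0 or >0 if a is older than, equal to or newer than b. */
static int kver_cmp(struct kver a, struct kver b)
{
    if (a.major != b.major)
        return a.major - b.major;
    if (a.minor != b.minor)
        return a.minor - b.minor;
    return a.patch - b.patch;
}

/*
 * Returns 1 if the running kernel is at least fixed_in, 0 if it is older,
 * and -1 if either version could not be determined.
 * fixed_in is caller-supplied; this sketch does not know the fixed release.
 */
static int running_kernel_at_least(const char *fixed_in)
{
    struct utsname u;
    struct kver running, wanted;

    if (uname(&u) != 0 || parse_kver(u.release, &running) != 0 ||
        parse_kver(fixed_in, &wanted) != 0)
        return -1;
    return kver_cmp(running, wanted) >= 0;
}
```

A caller would pass the version string taken from the vendor advisory to running_kernel_at_least() and treat a result of 0 as a prompt to check for updates.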

References and Further Reading

- Upstream bug discussion and patch (kernel.org)
- Problematic commit a54a93d0e359 on GitHub
- CVE-2024-53169 Entry (mitre.org)
- Linux NVMe Subsystem
- Understanding Kernel Race Conditions

Final Thoughts

While this vulnerability might seem technical, its root cause – a subtle ordering mistake – is a classic example of how complex and challenging kernel development can be. Even simple changes can open unexpected races with big consequences. Thanks to the vigilance of Linux contributors, CVE-2024-53169 has been patched and future kernels are safer for everyone.

Timeline

Published on: 12/27/2024 14:15:24 UTC
Last modified on: 05/04/2025 13:00:38 UTC