If you work with NVMe over TCP on Linux or run storage systems with both nvmet-tcp (the target/server-side driver) and nvme-tcp (the initiator/client-side driver), this deep-dive is for you. In early 2021, a kernel bug led to potential deadlocks and instability when these modules interacted, meaning a crucial part of your fast storage stack could simply freeze. Below, we break down CVE-2021-47041 in the Linux kernel, explain the cause of the bug, show what a kernel deadlock trace looks like, and detail the patch that fixed it, all in plain English.
What is CVE-2021-47041?
CVE-2021-47041 describes a bug in the Linux kernel's nvmet-tcp driver, which provides NVMe storage services over TCP. The bug was incorrect locking inside a network socket callback, which could cause deadlocks: two code paths waiting on each other so that neither makes progress, hanging the system under certain conditions.
Where did the bug live?
The issue was in the state_change callback function for nvmet-tcp, which reacts when the kernel network stack signals that a TCP connection changes state (for example, when a connection closes). Here's a simplified version of what happened:
- Instead of acquiring a read lock on the socket's sk_callback_lock (which allows multiple concurrent holders), the code acquired a write lock (which is exclusive).
- Since the callback itself was not actually changing the connection state, a write lock was stricter than necessary; a short illustration of the difference follows the snippets below.
- If you run both nvme-tcp and nvmet-tcp on the same machine (common during dev/testing), the two drivers take the same lock class under different rules, and lockdep flags the combination as a potential deadlock: one path can end up waiting on a lock whose holder can never run again.
Let's look at the relevant spot in the source code before the fix:
// Bad: take an exclusive write lock in a state_change callback
write_lock_bh(&sk->sk_callback_lock);
// ... inspect the socket state, look up the queue ...
write_unlock_bh(&sk->sk_callback_lock);
But it should have used:
// Correct: take a shared read lock, since nothing is being modified
read_lock_bh(&sk->sk_callback_lock);
// ... inspect the socket state, look up the queue ...
read_unlock_bh(&sk->sk_callback_lock);
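For readers less familiar with kernel reader-writer locks, here is a minimal, hypothetical illustration of the semantics involved (example code, not from the driver): any number of readers may hold an rwlock_t at the same time, but a writer requires exclusive ownership.

#include <linux/spinlock.h>

/* Hypothetical lock and state, for illustration only. */
static DEFINE_RWLOCK(example_lock);
static int example_state;

/* Readers only inspect state, so many may hold the lock at once. */
static int reader_path(void)
{
	int snapshot;

	read_lock(&example_lock);   /* shared: other readers may also enter */
	snapshot = example_state;
	read_unlock(&example_lock);
	return snapshot;
}

/* A writer changes state, so it must exclude readers and writers alike. */
static void writer_path(int value)
{
	write_lock(&example_lock);  /* exclusive: waits until every reader leaves */
	example_state = value;
	write_unlock(&example_lock);
}

A state_change callback that only inspects the socket belongs on the reader side; taking the exclusive lock there is what set up the dangerous interaction with the other driver's callbacks.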
What Was the Impact?
This bug was most likely to hit dev/test setups, or production machines with both initiator and target stacks loaded. Affected systems might:
- Hang or stall all NVMe-over-TCP I/O, leaving processes stuck.
- Need a hard reboot to recover.
- Log lockdep warnings about inconsistent lock state.
The classic indicator was a lockdep warning in dmesg or your kernel log, accompanied by a very long stack trace referencing nvme_tcp_state_change or nvmet_tcp_state_change.
Here's a trimmed sample of what you'd see if this bug hit (the full trace is in the original report):
WARNING: inconsistent lock state
5.12.0-rc3 #1 Tainted: G I
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-R} usage.
nvme/1324 [...] takes:
ffff888363151000 (clock-AF_INET){++-?}-{2:2}, at: nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
{IN-SOFTIRQ-W} state registered at:
__lock_acquire+0x79b/0x18d
... (more frames)
*** DEADLOCK ***
stack backtrace:
CPU: 26 PID: 1324 Comm: nvme Tainted: G I 5.12.0-rc3 #1
Call Trace:
dump_stack+0x93/0xc2
mark_lock_irq.cold+0x2c/0xb3
...
nvme_tcp_state_change+0x21/0x150 [nvme_tcp]
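To make that lockdep complaint concrete, here is the unsafe interleaving it warns about, sketched as a hypothetical timeline in comment form (not actual driver code). A lock that is read-held with softirqs still enabled ({SOFTIRQ-ON-R}) can be interrupted by a softirq that wants the same lock for writing ({IN-SOFTIRQ-W}), and on the same CPU that write can never succeed:

// process context (softirqs enabled)    softirq arriving on the same CPU
// ----------------------------------    --------------------------------
// read_lock(&sk->sk_callback_lock);
//   ...still inside the read-side
//      critical section...
//                                       <softirq preempts the reader>
//                                       write_lock(&sk->sk_callback_lock);
//                                       // spins forever: the writer waits
//                                       // for the reader, but the reader
//                                       // cannot resume until the softirq
//                                       // returns. Deadlock.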
Why Was It Dangerous?
- Denial of Service (system hang): a deadlock stops all I/O through the affected driver.
- Potential for Data Loss: storage drivers were involved, so an abrupt hang or forced reboot could lose in-flight data.
- Hard to Debug: it only showed up with advanced usage, test suites, or when both drivers were enabled at once.
The Patch / Fix
The fix (first merged in Linux v5.12-rc4) was simple and elegant:
- Replace the write_lock_bh()/write_unlock_bh() pair with read_lock_bh()/read_unlock_bh() in the nvmet_tcp_state_change() function.
Here's the patch in code:
- write_lock_bh(&sk->sk_callback_lock);
+ read_lock_bh(&sk->sk_callback_lock);
// ... existing callback code ...
- write_unlock_bh(&sk->sk_callback_lock);
+ read_unlock_bh(&sk->sk_callback_lock);
That's it! Downgrading to a shared read lock, which is sufficient because the callback doesn't modify anything, removes the deadlock.
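For context, here is a condensed sketch of the fixed callback. This is a simplified reconstruction of nvmet_tcp_state_change() rather than a verbatim copy, so treat details beyond the locking calls as approximate:

// Simplified reconstruction of drivers/nvme/target/tcp.c, for illustration
static void nvmet_tcp_state_change(struct sock *sk)
{
	struct nvmet_tcp_queue *queue;

	read_lock_bh(&sk->sk_callback_lock);  /* shared: we only inspect state */
	queue = sk->sk_user_data;             /* per-connection driver context */
	if (!queue)
		goto done;

	switch (sk->sk_state) {
	case TCP_FIN_WAIT1:
	case TCP_CLOSE_WAIT:
	case TCP_LAST_ACK:
	case TCP_CLOSE:
		/* The peer is going away: tear the queue down elsewhere */
		nvmet_tcp_schedule_release_queue(queue);
		break;
	default:
		pr_warn("queue %d unhandled state %d\n",
			queue->idx, sk->sk_state);
	}
done:
	read_unlock_bh(&sk->sk_callback_lock);
}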
Patch commit: nvmet-tcp: fix incorrect locking in state_change sk callback
References
- CVE database entry: CVE-2021-47041
- Linux kernel fix commit: nvmet-tcp: fix incorrect locking in state_change sk callback
- Original report (lists.kernel.org): PATCH and discussion thread
- blktests (the test suite where it was caught): blktests repo
Exploitation Details
Is there a remote exploit?
No.
This is a denial-of-service condition that may be triggered by specific local operations, especially test suites (like blktests), or if a sysadmin is running both nvme-tcp and nvmet-tcp.
Can an attacker exploit it remotely?
Not directly. But in rare setups where end users can load both drivers and initiate connections from both ends (think: a multi-tenant cloud SAN), triggering the hang is possible.
How can I test for it?
If you run both nvmet-tcp and nvme-tcp, running blktests or aggressively connecting and disconnecting NVMe targets and hosts could trigger it. On old (unpatched) kernels, watch your logs for the WARNING: inconsistent lock state shown above. Otherwise, update your kernel!
Proof-of-Concept (omits an actual exploit, for safety; the configfs paths below follow the standard nvmet layout, and testnqn is an arbitrary example name)
# (On a system with both drivers available; run as root)
modprobe nvmet-tcp
modprobe nvme-tcp
# Set up a loopback NVMe-over-TCP target via configfs
cd /sys/kernel/config/nvmet
mkdir -p subsystems/testnqn ports/1
echo 1 > subsystems/testnqn/attr_allow_any_host
echo tcp > ports/1/addr_trtype && echo ipv4 > ports/1/addr_adrfam
echo 127.0.0.1 > ports/1/addr_traddr && echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/testnqn ports/1/subsystems/testnqn
# Connect from the host side, then run blktests or rapid connect/disconnect cycles
nvme connect -t tcp -a 127.0.0.1 -s 4420 -n testnqn
nvme disconnect -n testnqn
# Observe for hangs or lockdep warnings in dmesg
Conclusion
CVE-2021-47041 is an example of how even a small mistake in kernel code, such as taking a write lock where a read lock suffices, can have major effects. Always keep up to date with kernel fixes if you handle storage! For production, use Linux kernel 5.12-rc4 or later (or the corresponding distro patches).
If you’re developing or testing NVMe over TCP, make sure your kernel is patched!
Have questions or want to discuss this bug? Join the conversation on the Linux Kernel Mailing List!
Author:
YourLinuxBuddy (for exclusive, simple English technical explainers)
Timeline
Published on: 02/28/2024 09:15:40 UTC
Last modified on: 12/06/2024 18:41:12 UTC