CVE-2024-50079 is a Linux kernel vulnerability that was patched and then published as a CVE in October 2024. The issue affects the SQPOLL thread model of the io_uring subsystem, specifically the way task work is handled during thread exit or request cancellation. In this easy-to-follow article, we’ll walk through:
Links to references and upstream patches
This article is designed for sysadmins, developers, and anyone interested in Linux internals.
What is io_uring and SQPOLL?
io_uring is a high-performance async I/O API in the Linux kernel, designed to offer huge speedups for I/O-bound apps. SQPOLL (Submission Queue Polling) allows a dedicated kernel thread to poll and submit I/O requests on behalf of user threads—helpful for reducing syscalls and context switches.
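To make the polling idea concrete, here is a minimal single-threaded sketch of a submission queue that is filled by plain memory writes and drained by a polling pass. Everything in it (`toy_sq`, `toy_sq_submit`, `toy_sq_poll`, the ring size of 8) is a hypothetical stand-in rather than the kernel's data structures, and it leaves out the memory barriers and the separate kernel thread that the real SQPOLL design relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SQ_ENTRIES 8u   /* hypothetical ring size; must be a power of two */

/* Hypothetical stand-in for a submission queue shared with a poller. */
struct toy_sq {
    uint32_t head;                    /* advanced by the poller (consumer) */
    uint32_t tail;                    /* advanced by the application (producer) */
    int      entries[TOY_SQ_ENTRIES]; /* request payloads, here just ints */
};

/* Application side: queue a request by writing memory only, no syscall. */
static bool toy_sq_submit(struct toy_sq *sq, int request)
{
    assert(sq != NULL);
    if (sq->tail - sq->head == TOY_SQ_ENTRIES)
        return false;                 /* ring is full */
    sq->entries[sq->tail & (TOY_SQ_ENTRIES - 1)] = request;
    sq->tail++;
    return true;
}

/* Poller side: one polling pass that drains whatever has been queued.
 * Returns the number of requests picked up and adds their values to *sum. */
static unsigned int toy_sq_poll(struct toy_sq *sq, int *sum)
{
    unsigned int seen = 0;

    assert(sq != NULL && sum != NULL);
    while (sq->head != sq->tail) {
        *sum += sq->entries[sq->head & (TOY_SQ_ENTRIES - 1)];
        sq->head++;
        seen++;
    }
    return seen;
}
```

In this model the application only ever touches `tail` and the poller only ever touches `head`, which is the property that lets a dedicated thread pick up work without the submitter entering the kernel.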
But sometimes performance tweaks have subtle side effects…
Short Summary
When the SQPOLL thread (iou-sqp-*) is shutting down, it may need to run pending "task_work". If this happens while canceling in-flight I/O (e.g., via io_uring_cancel_generic()), the kernel can end up running that task work, including operations that may block, while the thread’s state is not TASK_RUNNING.
If the thread is not in the proper state (TASK_RUNNING), deep and rare kernel bugs can happen—even deadlocks or security problems if exploited carefully.
Kernel users saw crashes and warnings like:
WARNING: CPU: 6 PID: 59939 at kernel/sched/core.c:8561 __might_sleep+0xf4/0x140
do not call blocking ops when !TASK_RUNNING; state=1 set at [<...>] prepare_to_wait+0x88/0x2fc
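Translation: The kernel tried to do something that can block (here, taking a mutex) while the thread had already marked itself as about to sleep (state=1, TASK_INTERRUPTIBLE) instead of TASK_RUNNING. A blocking call in that window can overwrite the sleep state and lead to missed wakeups, which is why the scheduler warns about it.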
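As a small aid for reading such a warning, the sketch below pulls the `state=` field out of the message and names the base task state. The numeric values match the definitions in the kernel's include/linux/sched.h, and the field is parsed as hexadecimal on the assumption that the kernel prints it that way; the helper names `toy_parse_state` and `toy_task_state_name` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Base task states, with the values used in include/linux/sched.h. */
#define TASK_RUNNING         0x0
#define TASK_INTERRUPTIBLE   0x1
#define TASK_UNINTERRUPTIBLE 0x2

/* Hypothetical helper: pull N out of "state=N" in a warning line.
 * Returns 0 on success, -1 if the field is missing or malformed. */
static int toy_parse_state(const char *line, unsigned int *state)
{
    const char *field;

    assert(line != NULL && state != NULL);
    field = strstr(line, "state=");
    if (field == NULL)
        return -1;
    /* Assumption: the value is printed in hex, without a 0x prefix. */
    return sscanf(field, "state=%x", state) == 1 ? 0 : -1;
}

/* Hypothetical helper: name the base state; anything else is "other". */
static const char *toy_task_state_name(unsigned int state)
{
    if (state == TASK_RUNNING)
        return "TASK_RUNNING";
    if (state & TASK_INTERRUPTIBLE)
        return "TASK_INTERRUPTIBLE";
    if (state & TASK_UNINTERRUPTIBLE)
        return "TASK_UNINTERRUPTIBLE";
    return "other";
}
```

Feeding the second line of the warning above through `toy_parse_state()` and then `toy_task_state_name()` maps `state=1` to TASK_INTERRUPTIBLE.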
Stack Trace Example
Call trace:
 __might_sleep+0xf4/0x140
 mutex_lock+0x84/0x124
 io_handle_tw_list+0xf4/0x260
 tctx_task_work_run+0x94/0x340
 io_run_task_work+0x1ec/0x3c
 io_uring_cancel_generic+0x364/0x524
 io_sq_thread+0x820/0x124c
 ret_from_fork+0x10/0x20
Technical Reason
A thread’s state (TASK_RUNNING, TASK_INTERRUPTIBLE, etc) controls what operations it’s allowed to do safely:
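- TASK_RUNNING: The thread is runnable (running or waiting for a CPU); only in this state may it call operations that can sleep, such as mutex_lock().
- TASK_INTERRUPTIBLE: The thread has marked itself as about to sleep until a wakeup or signal arrives; it should NOT call blocking operations until it is back in TASK_RUNNING.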
The SQPOLL cancel path didn’t always restore TASK_RUNNING before running task work. This caused the kernel to call blocking ops while in TASK_INTERRUPTIBLE, which the scheduler flagged.
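The reason a blocking call in TASK_INTERRUPTIBLE is a problem can be sketched with a toy model: a blocking primitive leaves the task in TASK_RUNNING when it returns, so the sleep state announced earlier is silently lost. The struct and functions below (`toy_task`, `toy_might_sleep`, `toy_mutex_lock`, `toy_wait_with_lock`) are hypothetical illustrations of that idea, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TASK_RUNNING       0x0
#define TASK_INTERRUPTIBLE 0x1

/* Hypothetical stand-in for the current task; not the kernel's task_struct. */
struct toy_task {
    unsigned int state;
    unsigned int warnings;   /* how often the sleep check fired */
};

/* Mirrors the idea behind __might_sleep(): complain if we may block while
 * the task has already declared a sleep state. */
static void toy_might_sleep(struct toy_task *t)
{
    if (t->state != TASK_RUNNING)
        t->warnings++;
}

/* A blocking primitive: it may sleep internally, and when it returns the
 * task is running again, so any state set earlier is overwritten. */
static void toy_mutex_lock(struct toy_task *t)
{
    toy_might_sleep(t);
    t->state = TASK_RUNNING;
}

/* A waiter that announces its sleep, then takes a lock before sleeping.
 * Returns true if the announced state survived until the sleep point. */
static bool toy_wait_with_lock(struct toy_task *t)
{
    assert(t != NULL);
    t->state = TASK_INTERRUPTIBLE;   /* like prepare_to_wait() */
    toy_mutex_lock(t);               /* blocking op in the wrong state */
    return t->state == TASK_INTERRUPTIBLE;
}
```

In this model `toy_wait_with_lock()` returns false and records one warning, which corresponds to the situation the scheduler check is designed to catch.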
In principle, an attacker could use the bug as an infoleak or denial-of-service in tightly controlled workloads.
In practice, most real-world impact is kernel warnings and crashes; privilege escalation is *not* easy from this bug alone.
The Patch
The fix is simple but critical: make sure the thread’s state is set to TASK_RUNNING before running task_work in the cancel path—just as it’s done in other places.
Bad (pre-patch)
// This could run while state != TASK_RUNNING
io_run_task_work(ctx->sqo_task);
Good (post-patch)
// Make sure task state is correct before running task_work
set_current_state(TASK_RUNNING);
io_run_task_work(ctx->sqo_task);
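The effect of that one added line can be sketched in a userspace toy model of the cancel path, run once without and once with the state reset. All names here (`toy_task`, `toy_tw_needs_mutex`, `toy_run_task_work`, `toy_cancel_path`) and the `patched` flag are hypothetical; the model only mirrors the ordering of the steps, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TASK_RUNNING       0x0
#define TASK_INTERRUPTIBLE 0x1

/* Hypothetical stand-in for the SQPOLL thread's task. */
struct toy_task {
    unsigned int state;
    unsigned int warnings;
};

typedef void (*toy_tw_fn)(struct toy_task *);

/* A queued work item that needs a sleeping lock, like the mutex_lock()
 * frame in the trace. It trips the check if the task is not running. */
static void toy_tw_needs_mutex(struct toy_task *t)
{
    if (t->state != TASK_RUNNING)
        t->warnings++;
    t->state = TASK_RUNNING;       /* a blocking call returns in this state */
}

/* Run all pending work items; the patched variant resets the state first. */
static void toy_run_task_work(struct toy_task *t, const toy_tw_fn *list,
                              size_t n, bool patched)
{
    assert(t != NULL && (list != NULL || n == 0));
    if (patched)
        t->state = TASK_RUNNING;   /* the step the fix adds */
    for (size_t i = 0; i < n; i++)
        list[i](t);
}

/* Model of the cancel loop: announce a wait, then run pending task work.
 * Returns how many times the sleep check fired. */
static unsigned int toy_cancel_path(bool patched)
{
    struct toy_task t = { TASK_RUNNING, 0 };
    const toy_tw_fn pending[] = { toy_tw_needs_mutex, toy_tw_needs_mutex };

    t.state = TASK_INTERRUPTIBLE;  /* like prepare_to_wait() */
    toy_run_task_work(&t, pending, 2, patched);
    return t.warnings;
}
```

Here `toy_cancel_path(false)` reports one warning and `toy_cancel_path(true)` reports none, which is the behavioral difference the fix is meant to produce.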
Real Upstream Patch
- Link to mainline patch
- Linux stable tree commit
Commit diff summary (simplified; refer to the upstream commit for the exact change)
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ ... @@
if (task->state != TASK_RUNNING)
- io_run_task_work(task);
+ set_current_state(TASK_RUNNING);
+ io_run_task_work(task);
How to Reproduce (and Test the Fix)
To trigger this, you would set up io_uring with SQPOLL, submit several async operations, and try to cancel/close the ring from another thread simultaneously. This is easiest to do from C, using the liburing library.
Warning: Do not run on production systems!
#include <liburing.h>
#include <pthread.h>
#include <stdio.h>

static void *cancel_thread(void *ring_ptr) {
    struct io_uring *ring = (struct io_uring *) ring_ptr;
    // Close ring from another thread, potentially racing the SQPOLL exit
    io_uring_queue_exit(ring);
    return NULL;
}

int main(void) {
    struct io_uring ring;
    // io_uring_queue_init() returns 0 on success or -errno on failure
    int ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQPOLL);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init failed: %d\n", ret);
        return 1;
    }
    // Submit some I/O
    // ...
    pthread_t tid;
    if (pthread_create(&tid, NULL, cancel_thread, &ring) != 0)
        return 1;
    // Main thread does more ring operations / cancels
    // ...
    pthread_join(tid, NULL);
    return 0;
}
On a patched kernel, running this should not produce the warning shown earlier, or a panic, in the kernel log.
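One way to check the outcome is to scan captured kernel log lines for the warning text. The sketch below assumes the log has already been read into an array of strings (for example from saved `dmesg` output); the helper name `toy_log_has_warning` is hypothetical, and the signature string is the message quoted earlier in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Warning text from the report quoted earlier in this article. */
static const char TOY_SIGNATURE[] =
    "do not call blocking ops when !TASK_RUNNING";

/* Hypothetical helper: returns true if any of the n log lines contains
 * the warning text. NULL entries are skipped. */
static bool toy_log_has_warning(const char *const *lines, size_t n)
{
    assert(lines != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (lines[i] != NULL && strstr(lines[i], TOY_SIGNATURE) != NULL)
            return true;
    }
    return false;
}
```

A match only shows that the scheduler check fired somewhere; the accompanying call trace still needs to be compared with the one above to confirm it came from the SQPOLL cancel path.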
Conclusion
CVE-2024-50079 was a deep Linux kernel bug affecting the task-state handling behind async I/O with io_uring’s SQPOLL. Thanks to the Linux kernel community, a fix was available in the mainline and stable trees by the time the CVE was published in October 2024.
References
- Mainline patch commit
- Patch on lore.kernel.org
- io_uring documentation on kernel.org
- liburing userspace library
Timeline
Published on: 10/29/2024 01:15:04 UTC
Last modified on: 10/30/2024 17:05:40 UTC