The Linux kernel developers recently fixed a bug in the io_uring subsystem that could deadlock a task submitting writes while a filesystem freeze was in progress. This post looks at the vulnerability (CVE-2024-53052), walks through the kernel stack traces from the report, and explains how the issue was fixed.
Vulnerability Description
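This vulnerability resides in the io_uring subsystem of the Linux kernel, specifically in how its write path calls kiocb_start_write() to take the super block rwsem. When io_uring starts a write, kiocb_start_write() takes that rwsem for reading, which prevents a freeze from completing while the write is in flight. The freeze side takes the same rwsem for writing: it excludes new writers and waits for existing writes to finish. Before the fix, io_uring called kiocb_start_write() unconditionally, so the call blocked whenever a freeze of the mount point was in progress. The result is a deadlock: the freeze waits for earlier writes to complete, but they cannot complete because the task that is supposed to complete them is itself blocked trying to start a new write.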
The stack traces below show the two tasks caught in this deadlock. The first, fio, is blocked starting a write; the second, fsfreeze, is blocked freezing the filesystem.
task:fio state:D stack:0 pid:886 tgid:886 ppid:876
Call trace:
__switch_to+0x1d8/0x348
__schedule+0x8e8/0x2248
schedule+0x110/0x3f
percpu_rwsem_wait+0x1e8/0x3f8
__percpu_down_read+0xe8/0x500
io_write+0xbb8/0xff8
io_issue_sqe+0x10c/0x102
io_submit_sqes+0x614/0x211
__arm64_sys_io_uring_enter+0x524/0x1038
invoke_syscall+0x74/0x268
el0_svc_common.constprop.0+0x160/0x238
do_el0_svc+0x44/0x60
el0_svc+0x44/0xb
el0t_64_sync_handler+0x118/0x128
el0t_64_sync+0x168/0x170
INFO: task fsfreeze:7364 blocked for more than 15 seconds.
Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963
task:fsfreeze state:D stack:0 pid:7364 tgid:7364 ppid:995
Call trace:
__switch_to+0x1d8/0x348
__schedule+0x8e8/0x2248
schedule+0x110/0x3f
percpu_down_write+0x2b/0x680
freeze_super+0x248/0x8a8
do_vfs_ioctl+0x149c/0x1b18
__arm64_sys_ioctl+0xd/0x1a
invoke_syscall+0x74/0x268
el0_svc_common.constprop.0+0x160/0x238
do_el0_svc+0x44/0x60
el0_svc+0x44/0xb
el0t_64_sync_handler+0x118/0x128
el0t_64_sync+0x168/0x170
The root cause is that io_uring's write path did not take the IOCB_NOWAIT flag into account when calling kiocb_start_write(): it always made a blocking attempt, even for requests issued in non-blocking mode. As the fio trace shows, the submitting task ends up asleep in __percpu_down_read() under io_write(), so it can no longer complete the earlier writes that the freeze is waiting on.
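The circular wait can be illustrated with a small Python model. This is a toy, single-threaded sketch and not kernel code: SuperBlock, try_start_write, begin_freeze, and unpatched_submit_is_stuck are hypothetical names invented for illustration, and the model assumes the submitting task is the only one able to complete the in-flight write.

```python
from dataclasses import dataclass


@dataclass
class SuperBlock:
    """Toy stand-in for the super block's freeze protection (not kernel code)."""

    writes_in_flight: int = 0
    freeze_pending: bool = False

    def try_start_write(self) -> bool:
        """Register a write without sleeping; fails once a freeze is pending."""
        if self.freeze_pending:
            return False
        self.writes_in_flight += 1
        return True

    def end_write(self) -> None:
        """Mark one in-flight write as completed."""
        self.writes_in_flight -= 1

    def begin_freeze(self) -> None:
        """Exclude new writers; the freezer then waits for in-flight writes."""
        self.freeze_pending = True

    def freeze_can_finish(self) -> bool:
        """The freeze completes only when no writes remain in flight."""
        return self.freeze_pending and self.writes_in_flight == 0


def unpatched_submit_is_stuck(sb: SuperBlock) -> bool:
    """Model a submit that insists on a blocking start-write.

    Assumption of this model: the submitting task is the only one that can
    complete earlier writes, so once it sleeps nothing calls end_write().
    """
    if sb.try_start_write():
        return False  # no freeze pending, the write simply starts
    # The task would now sleep until the freeze is over, but the freeze is
    # itself waiting for writes that only this task can complete.
    return not sb.freeze_can_finish()


if __name__ == "__main__":
    sb = SuperBlock()
    sb.try_start_write()   # write A is in flight
    sb.begin_freeze()      # the freezer excludes new writers and waits for A
    print("stuck:", unpatched_submit_is_stuck(sb))  # task tries to start write B
```

In this model the submit is stuck only when a freeze is pending and at least one earlier write is still in flight, which matches the dependency visible in the two traces.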
The Fix
The fix makes the io_uring write path honor the IOCB_NOWAIT flag: it only attempts a blocking grab of the super block rwsem if the flag is not set. For normal issue, where IOCB_NOWAIT is always set, a grab that would otherwise have to wait now returns -EAGAIN instead of sleeping. That error prompts the io_uring core to issue a blocking attempt of the write, which in turn also gets completions run and so ensures forward progress.
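The patched behaviour can be sketched in the same spirit. The Python below is a hedged userspace model, not the kernel patch: SuperBlock, start_write, and submit are hypothetical names, the retry queue is a simplification of the later blocking attempt, and the blocking path raises instead of sleeping so the example never hangs.

```python
import errno


class SuperBlock:
    """Toy freeze protection: counts in-flight writes, tracks a pending freeze."""

    def __init__(self) -> None:
        self.writes_in_flight = 0
        self.freeze_pending = False

    def start_write_trylock(self) -> bool:
        """Non-sleeping attempt to register a write."""
        if self.freeze_pending:
            return False
        self.writes_in_flight += 1
        return True

    def end_write(self) -> None:
        """Mark one in-flight write as completed."""
        self.writes_in_flight -= 1


def start_write(sb: SuperBlock, nowait: bool) -> int:
    """Return 0 on success, or -EAGAIN if nowait is set and a freeze is pending.

    Hypothetical helper that mirrors the behaviour described in the post.
    """
    if sb.start_write_trylock():
        return 0
    if nowait:
        return -errno.EAGAIN
    # A real blocking path would sleep here until the filesystem is thawed.
    raise RuntimeError("would block until the filesystem is thawed")


def submit(sb: SuperBlock, retry_queue: list) -> int:
    """Normal issue: try without blocking, and defer the write on -EAGAIN."""
    ret = start_write(sb, nowait=True)
    if ret == -errno.EAGAIN:
        retry_queue.append("write")  # blocking attempt happens later
    return ret


if __name__ == "__main__":
    sb = SuperBlock()
    start_write(sb, nowait=True)   # write A is in flight
    sb.freeze_pending = True       # a freeze begins and waits for A
    retries = []
    print("submit returned:", submit(sb, retries))
    sb.end_write()                 # the task is free to complete write A
    print("writes in flight:", sb.writes_in_flight, "deferred:", retries)
```

The point of the sketch is that the submitting task gets control back instead of sleeping, so it can still complete write A and let the freeze finish.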
Regular users are unlikely to be able to trigger this bug, because freezing a filesystem requires CAP_SYS_ADMIN privileges. The fix still matters for system administrators and others who manage Linux servers, since a freeze issued during io_uring writes can hang both tasks.
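To check whether a given process holds the capability mentioned above, one option is to inspect the CapEff mask in its /proc status file. The snippet below is a minimal sketch: has_cap_sys_admin is a hypothetical helper name, and it assumes the usual "CapEff:" line with a hexadecimal mask, in which CAP_SYS_ADMIN is bit 21.

```python
CAP_SYS_ADMIN = 21  # capability number defined in linux/capability.h


def has_cap_sys_admin(status_text: str) -> bool:
    """Return True if the CapEff mask in /proc/<pid>/status text has bit 21 set."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # the mask is printed in hex
            return bool(mask & (1 << CAP_SYS_ADMIN))
    return False  # no CapEff line found; treat as not privileged
```

On a Linux system it could be called with the contents of /proc/self/status to test the current process.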
For more information about the io_uring subsystem, refer to the official documentation. The complete patch for this vulnerability is available on the Linux Kernel Mailing List (LKML).
In conclusion, this vulnerability highlights the importance of proper flag checks and deadlock avoidance in concurrent systems programming. Fortunately, the Linux kernel community was quick to identify and patch the issue, thereby preventing any potential fallout from this bug in io_uring.
Timeline
Published on: 11/19/2024 18:15:25 UTC
Last modified on: 12/19/2024 09:38:04 UTC