The Linux kernel developers recently resolved a vulnerability in the io_uring subsystem that can deadlock a task writing through io_uring against a concurrent filesystem freeze. This post looks at the vulnerability (CVE-2024-53052), walks through the stack traces of the two stuck tasks, and explains how the issue was fixed.

Vulnerability Description

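This vulnerability resides in the io_uring subsystem of the Linux kernel, specifically in how it uses kiocb_start_write() to take the super block rwsem. When io_uring starts a write, it takes that rwsem for reading, which keeps the filesystem from being frozen while the write is in flight. A freeze takes the same rwsem for writing: it shuts out new writers and then waits for existing writes to finish. If a freeze begins while io_uring writes are in flight on the mount point, the next kiocb_start_write() call blocks. The freeze is then waiting for the earlier writes to complete, but they cannot complete, because the task that is supposed to complete them is blocked waiting to start a new write.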

Here are the stack traces of the two tasks caught in this deadlock, one attempting to write and the other attempting to freeze:

task:fio             state:D stack:0     pid:886   tgid:886   ppid:876
Call trace:
 __switch_to+0x1d8/0x348
 __schedule+0x8e8/0x2248
 schedule+0x110/0x3f0
 percpu_rwsem_wait+0x1e8/0x3f8
 __percpu_down_read+0xe8/0x500
 io_write+0xbb8/0xff8
 io_issue_sqe+0x10c/0x1020
 io_submit_sqes+0x614/0x2110
 __arm64_sys_io_uring_enter+0x524/0x1038
 invoke_syscall+0x74/0x268
 el0_svc_common.constprop.0+0x160/0x238
 do_el0_svc+0x44/0x60
 el0_svc+0x44/0xb0
 el0t_64_sync_handler+0x118/0x128
 el0t_64_sync+0x168/0x170
INFO: task fsfreeze:7364 blocked for more than 15 seconds.
      Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963

task:fsfreeze        state:D stack:0     pid:7364  tgid:7364  ppid:995
Call trace:
 __switch_to+0x1d8/0x348
 __schedule+0x8e8/0x2248
 schedule+0x110/0x3f0
 percpu_down_write+0x2b0/0x680
 freeze_super+0x248/0x8a8
 do_vfs_ioctl+0x149c/0x1b18
 __arm64_sys_ioctl+0xd0/0x1a0
 invoke_syscall+0x74/0x268
 el0_svc_common.constprop.0+0x160/0x238
 do_el0_svc+0x44/0x60
 el0_svc+0x44/0xb0
 el0t_64_sync_handler+0x118/0x128
 el0t_64_sync+0x168/0x170

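The root cause is that io_uring called kiocb_start_write() unconditionally, without taking the IOCB_NOWAIT flag into account. kiocb_start_write() blocks whenever a freeze is in progress, so even a write issued as a nonblocking attempt could put the submitting task to sleep. In the traces above, the fio task sleeps in __percpu_down_read() (called from io_write()) while fsfreeze sleeps in percpu_down_write() (called from freeze_super()), each waiting on the other.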
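
The wait cycle can be sketched with a small userspace model. This is only an illustration under simplifying assumptions, not kernel code: the struct, its fields, and every function name below are hypothetical, and the model assumes that in-flight writes are completed only by the submitting task.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one mount point; every name here is hypothetical. */
struct mount_model {
    int  inflight_writes;   /* writes that already hold freeze protection */
    bool freeze_pending;    /* a freezer is waiting for inflight_writes == 0 */
    bool submitter_asleep;  /* submitter is parked in the blocking start-write */
};

/* Pre-fix behaviour: starting a write always blocks while a freeze is pending. */
static void submit_write_blocking(struct mount_model *m)
{
    if (m->freeze_pending)
        m->submitter_asleep = true;   /* sleeps until the filesystem is thawed */
    else
        m->inflight_writes++;         /* protection taken, write is in flight */
}

/* In this model, only an awake submitter can complete in-flight writes. */
static void run_completions(struct mount_model *m)
{
    assert(m->inflight_writes >= 0);
    if (!m->submitter_asleep)
        m->inflight_writes = 0;
}

/* The freezer can proceed only once every in-flight write has completed. */
static bool freeze_can_finish(const struct mount_model *m)
{
    return m->freeze_pending && m->inflight_writes == 0;
}

/* Each side waits on the other, so neither condition can ever change. */
static bool is_deadlocked(const struct mount_model *m)
{
    return m->freeze_pending && m->submitter_asleep && m->inflight_writes > 0;
}
```

Submitting one write, marking a freeze as pending, and then submitting a second write leaves the model with one in-flight write, a sleeping submitter, and a freezer that can never finish, which mirrors the two traces above.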

The Fix

The fix makes io_uring honor the IOCB_NOWAIT flag when it starts a write. It does a blocking grab of the super block rwsem only if IOCB_NOWAIT is not set. When the flag is set (as it is for a normal issue attempt), the patched code tries to take the rwsem without blocking and returns -EAGAIN if that fails. The -EAGAIN prompts the io_uring core to issue a blocking attempt of the write, which in turn gets completions run. This circumvents the deadlock and ensures forward progress.
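
The decision the fix introduces can be sketched in userspace C. This is a minimal model under stated assumptions, not the actual patch: MODEL_NOWAIT stands in for IOCB_NOWAIT, struct sb_model stands in for the super block's freeze state, and the function names and the START_* result codes are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NOWAIT 0x1u   /* hypothetical stand-in for IOCB_NOWAIT */

/* Hypothetical model of the super block freeze state. */
struct sb_model {
    bool freezing;  /* a freezer holds, or is waiting for, the write side */
    int  writers;   /* writes currently holding freeze protection */
};

enum start_status {
    START_OK,       /* protection taken, the write may proceed */
    START_EAGAIN,   /* nonblocking attempt failed, caller reports -EAGAIN */
    START_SLEEPS    /* blocking attempt, caller sleeps until thaw */
};

/* Nonblocking attempt: succeeds only when no freeze is in progress. */
static bool model_start_write_trylock(struct sb_model *sb)
{
    if (sb->freezing)
        return false;
    sb->writers++;
    return true;
}

/* Patched logic: block only when the caller did not ask for NOWAIT. */
static enum start_status model_start_write(struct sb_model *sb,
                                           unsigned int flags)
{
    if (!(flags & MODEL_NOWAIT)) {
        if (sb->freezing)
            return START_SLEEPS;  /* acceptable: caller may block here */
        sb->writers++;
        return START_OK;
    }
    return model_start_write_trylock(sb) ? START_OK : START_EAGAIN;
}

/* Drop freeze protection when a write completes. */
static void model_end_write(struct sb_model *sb)
{
    assert(sb->writers > 0);
    sb->writers--;
}
```

With a freeze in progress, a MODEL_NOWAIT call returns START_EAGAIN without touching the writer count, so the submitting task stays awake. Only a call without the flag reports that it would sleep.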

Note that a regular user cannot trigger this vulnerability, because freezing a filesystem requires CAP_SYS_ADMIN privileges. The fix still matters to system administrators and anyone who manages Linux servers, since a privileged freeze that coincides with io_uring writes can hang both tasks.

For more information about the io_uring subsystem, please refer to the official documentation. The complete patch, titled "io_uring/rw: fix missing NOWAIT check for O_DIRECT start write", is available on the Linux Kernel Mailing List (LKML).

In conclusion, this vulnerability highlights the importance of proper flag checks and deadlock avoidance in concurrent systems programming. Fortunately, the Linux kernel community was quick to identify and patch the issue, thereby preventing any potential fallout from this bug in io_uring.

Timeline

Published on: 11/19/2024 18:15:25 UTC
Last modified on: 12/19/2024 09:38:04 UTC