CVE-2022-49931 - Kernel Crash in Linux hfi1 Driver Due to Incorrect List Handling

Summary
CVE-2022-49931 is a vulnerability in the Linux kernel's hfi1 driver for InfiniBand devices. It causes a kernel crash when a link goes down while there are waiters for a send to complete. The root cause is a bug in how a linked list is moved inside the sc_disable() function, introduced by a previous commit meant to fix locking issues. This post explains the vulnerability, includes an example kernel crash log, shows the relevant buggy code, describes how the bug can be triggered, and points to the fix.

What is hfi1?

hfi1 is the kernel driver for the Intel Omni-Path Host Fabric Interface (HFI), which is used in high-performance computing to provide low-latency networking. The bug discussed here affects systems using this driver.

The Bug

Commit 13bac861952a ("IB/hfi1: Fix abba locking issue with sc_disable()") addressed a deadlock (ABBA locking) problem in the sc_disable() routine by moving the waiters from the send context's list onto a local list. Unfortunately, the wrong kernel list operation was used, which corrupts the list and leads to a kernel crash.
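
The locking pattern behind that earlier patch can be sketched in userspace. This is only an illustration under stated assumptions: struct send_ctx, struct waiter, struct toy_lock, and wake_waiter() are hypothetical stand-ins, not the driver's types or functions, and the toy lock merely records whether it is held. The sketch shows the idea of detaching all waiters while the lock is held and waking them only after the lock is dropped, so the wakeup path never runs under the wait-list lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: only records whether it is held, so the sketch needs no threads. */
struct toy_lock {
    bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Hypothetical waiter, chained through a plain singly linked list. */
struct waiter {
    struct waiter *next;
    bool woken;
    bool woken_with_lock_held; /* true would mean the risky lock ordering */
};

/* Hypothetical stand-in, NOT the driver's struct send_context. */
struct send_ctx {
    struct toy_lock waitlock;
    struct waiter *waiters;
};

/* Stand-in for a wakeup routine that may take other locks internally. */
static void wake_waiter(const struct send_ctx *sc, struct waiter *w)
{
    w->woken = true;
    w->woken_with_lock_held = sc->waitlock.held;
}

/*
 * Detach every waiter while holding the lock, then wake them after the
 * lock has been released. Returns the number of waiters woken.
 */
static size_t disable_and_wake(struct send_ctx *sc)
{
    toy_lock_acquire(&sc->waitlock);
    struct waiter *local = sc->waiters; /* take the whole chain */
    sc->waiters = NULL;                 /* shared list is now empty */
    toy_lock_release(&sc->waitlock);

    size_t woken = 0;
    while (local) {
        struct waiter *w = local;
        local = w->next;
        w->next = NULL;
        wake_waiter(sc, w);
        woken++;
    }
    return woken;
}
```

The important property is that the shared list is left empty and well formed once the lock is released, and that the private list contains exactly the waiters that were queued. The crash discussed in this post comes from getting that transfer step wrong.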

The kernel log includes this snippet:

BUG: kernel NULL pointer dereference, address: 0000000000000030
...
Call Trace:
 sc_disable+0x1ba/0x240 [hfi1]
 pio_freeze+0x3d/0x60 [hfi1]
 handle_freeze+0x27/0x1b0 [hfi1]
 process_one_work+0x1b0/0x380
 worker_thread+0x30/0x360
 kthread+0xd7/0x100
 ret_from_fork+0x1f/0x30

Here is the faulty call in sc_disable(), in drivers/infiniband/hw/hfi1/pio.c, as introduced by commit 13bac861952a:

// Incorrect: list_move() relinks a single entry, but &sc->piowait is a list head
if (!list_empty(&sc->piowait))
        list_move(&sc->piowait, &wake_list);
// wake_list now holds the piowait head itself, not the waiters

list_move(entry, head) unlinks one entry from its current list and inserts it right after head. Passing the head of sc->piowait as the entry detaches that head from its own waiters and places the head node on wake_list. The wakeup loop that follows then treats the head as if it were embedded in a waiter structure and follows a bogus pointer, which is consistent with the NULL pointer dereference in the trace above.

Impact

- Systems using the hfi1 driver can crash when a link goes down while senders are waiting.
- If your cluster has high availability requirements, or your jobs are latency-sensitive, such crashes can cause outages or data loss.

Exploit Scenario

While this bug does not allow remote code execution, it is a denial of service (DoS) vector. An attacker with control of the network link or the ability to queue work on the hfi1 device could deliberately cause a link-down event, triggering the crash.

The Fix

The correct function to use is list_splice_init(), which moves every entry from the source list onto the destination list and reinitializes the source head, so both lists remain well formed.

Fixed code

// Correct: transfer all waiters to the local list and leave sc->piowait empty
list_splice_init(&sc->piowait, &wake_list);

This change was published in kernel commit 5d77b642e75320.
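
The difference between relinking a single node and transferring a whole list is easy to reproduce outside the kernel. The following userspace sketch re-implements simplified versions of a circular doubly linked list and of helpers modeled on the kernel's list_move() and list_splice_init(). The function bodies, struct waiter, and waiters_reachable_after() are assumptions made for illustration, not kernel or driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Relink ONE entry: unlink it, then insert it right after head. */
static void list_move(struct list_head *entry, struct list_head *head)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Transfer ALL entries of src to the front of dst, then reinitialize src. */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
    if (list_is_empty(src))
        return;
    struct list_head *first = src->next, *last = src->prev, *at = dst->next;
    first->prev = dst;
    dst->next = first;
    last->next = at;
    at->prev = last;
    init_list_head(src);
}

/* Hypothetical waiter with an embedded list node. */
struct waiter {
    int id;
    struct list_head node;
};

typedef void (*transfer_fn)(struct list_head *src, struct list_head *dst);

/*
 * Queue two waiters on a pending list, hand the pending HEAD to the given
 * transfer helper, and count how many real waiters the wake list reaches.
 */
static size_t waiters_reachable_after(transfer_fn transfer)
{
    struct list_head pending, wake;
    struct waiter a = { .id = 1 }, b = { .id = 2 };

    init_list_head(&pending);
    init_list_head(&wake);
    list_add_tail(&a.node, &pending);
    list_add_tail(&b.node, &pending);
    assert(list_is_empty(&wake));

    transfer(&pending, &wake);

    size_t found = 0;
    for (const struct list_head *p = wake.next; p != &wake; p = p->next)
        if (p == &a.node || p == &b.node)
            found++;
    return found;
}
```

Under these assumptions, waiters_reachable_after(list_move) returns 0, because the only node on the destination list is the old source head, while waiters_reachable_after(list_splice_init) returns 2. Kernel code that walks such a list and converts each node back to its containing structure would, in the first case, compute a pointer to something that is not a waiter at all.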

References

- CVE-2022-49931 at MITRE
- Kernel Patch Fix (LKML)
- Commit Fix on GitHub
- Kernel List API Documentation

Conclusion

If you rely on the hfi1 driver, update to a kernel that includes this fix, either an upstream release that carries it or the backported fix from your distro, to avoid system crashes caused by this list handling bug. This is a classic example of how subtle mistakes in low-level linked list operations can have catastrophic effects in the kernel.

For cluster admins: apply kernel updates as soon as practicable and monitor your system crash logs for signs of this vulnerability.

Stay safe and keep your clusters running!

*This writeup is exclusive and based on kernel commit and CVE research for educational purposes. Distributions such as RHEL and Ubuntu may have their own patches; check with your vendor.*

Timeline

Published on: 05/01/2025 15:16:19 UTC
Last modified on: 05/07/2025 13:29:02 UTC