---

Overview

CVE-2024-44995 is a deadlock bug in the Linux kernel’s hns3 Ethernet driver that was discovered and patched in 2024. The deadlock could occur when traffic control (TC) rules were configured while the device was being reset. Here, we’ll walk through what caused the bug, how attackers might have exploited it, and how kernel developers fixed it. We’ll also run through some simplified code snippets to help visualize what was going on.

What Is the hns3 Driver?

The hns3 driver supports HiSilicon’s HNS3 Ethernet controllers, which appear mainly in Huawei server hardware, so it runs on many large-scale Linux servers. The kernel’s traffic control (TC) subsystem handles network traffic shaping and policing, which is standard practice in big data centers.

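The bug involves two operations running at the same time:

TC Setup: An administrator or management tool changes the TC configuration (setup tc), which makes the driver disable and re-enable NAPI on the port.

Device Reset: The system kicks off a reset (hardware or software, e.g., after a fault or upgrade).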

If these two operations happened at just the wrong times, a classic race condition occurred, sometimes causing a deadlock. This froze parts of the kernel network stack, possibly leading to dropped connections, unresponsive machines, or crashes.

Here’s how the sequence looked under the hood:

                           pf reset start
                              |
                              ...
setup tc                      |
    |                         |
napi_disable()(skip)          |
    |                         ...
napi_enable()                 |
                         UINIT: netif_napi_del()
                              ...
                         INIT: netif_napi_add()
                              ...
                         UP: napi_enable()(skip)
                              ...

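Reading the diagram from top to bottom:

1. A PF (physical function) reset starts.

2. While the reset is still in progress, *setup tc* runs. Its napi_disable() is skipped, presumably because the reset has already taken the port down, but its napi_enable() still runs.

3. The reset continues with *UINIT* (shutting things down) and *INIT*, *deleting* and then *adding* the "napi" objects.

4. Late in the chain, the reset’s *UP* stage skips its own napi_enable() because the driver’s state flags no longer match reality. The driver’s view of the port and the actual NAPI state are now out of sync.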
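To see why this ordering is dangerous, the sequence can be replayed in a small user-space model. The sketch below is not kernel code: the `toy_*` names and the two boolean flags are hypothetical stand-ins. It assumes that a freshly added NAPI instance starts out disabled, that disabling an already-disabled instance would wait forever, and that the reset takes the port down before setup tc runs (which would explain the skipped napi_disable() in the diagram). The final DOWN, for example from a later reset, is also an assumption about where the hang finally shows up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these are NOT the kernel's NAPI functions. */
struct toy_port {
    bool napi_disabled;  /* true after "add" or "disable", false after "enable" */
    bool port_down;      /* the driver's own idea of whether the port is down */
};

/* Returns false where a real disable would wait forever. */
static bool toy_napi_disable(struct toy_port *p)
{
    if (p->napi_disabled)
        return false;
    p->napi_disabled = true;
    return true;
}

static void toy_napi_enable(struct toy_port *p)
{
    assert(p->napi_disabled);   /* enabling an enabled instance is a bug */
    p->napi_disabled = false;
}

/* Assumption: a newly added instance starts out disabled. */
static void toy_napi_add(struct toy_port *p) { p->napi_disabled = true; }

/* DOWN step: skipped when the driver already believes the port is down. */
static bool toy_down(struct toy_port *p)
{
    if (p->port_down)
        return true;            /* skip */
    if (!toy_napi_disable(p))
        return false;           /* would hang here */
    p->port_down = true;
    return true;
}

/* UP step: skipped when the driver believes the port is already up. */
static void toy_up(struct toy_port *p)
{
    if (!p->port_down)
        return;                 /* skip */
    toy_napi_enable(p);
    p->port_down = false;
}

/* Returns true if a DOWN issued after the reset would hang. */
bool later_down_would_hang(bool uinit_calls_down)
{
    struct toy_port p = { .napi_disabled = false, .port_down = false };

    toy_down(&p);               /* reset: port taken down (assumed) */
    toy_down(&p);               /* setup tc: napi_disable() skipped */
    toy_up(&p);                 /* setup tc: napi_enable() runs */
    if (uinit_calls_down)
        toy_down(&p);           /* patched UINIT: DOWN first */
    /* UINIT: netif_napi_del() has no state to model here */
    toy_napi_add(&p);           /* INIT: netif_napi_add() */
    toy_up(&p);                 /* UP: skipped in the buggy ordering */
    return !toy_down(&p);       /* a later DOWN, e.g. from another reset */
}
```

In this model, `later_down_would_hang(false)` reports a hang because the port is flagged as up while its NAPI instance was never re-enabled. `later_down_would_hang(true)` does not, because the extra DOWN brings the two flags back into agreement before the objects are deleted.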

What Could Attackers Do?

This bug’s impact was mostly denial of service (DoS) — not direct code execution or privilege escalation (as far as we know). However, if an attacker could repeatedly force a reset (say, through crafted commands or exploiting other issues in a multi-tenant cloud system), they could potentially crash servers or take down network connectivity.

The Fix

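The fix adds one step to keep the driver’s state and the NAPI state in sync. Specifically, developers ensured that the DOWN process is triggered during UINIT, so the port is always brought down before its NAPI objects are deleted, even if setup tc brought it back up in the middle of the reset.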

Code Snippet

Here’s a simplified pseudocode of the buggy vs. fixed part. (For real commits, see mainline patch and lkml reference.)

Old:

void hns3_nic_reset_handle(struct hns3_nic_priv *priv) {
    // ...do reset stuff...
    hns3_nic_uninit(priv);
    // ...more reset...
}

void hns3_nic_uninit(struct hns3_nic_priv *priv) {
    netif_napi_del(&priv->napi);
    // (no further DOWN process!)
}

Patched:

void hns3_nic_uninit(struct hns3_nic_priv *priv) {
    hns3_nic_down(priv);           // <-- DOWN process added: stop the port first
    netif_napi_del(&priv->napi);   // NAPI is now disabled before it is deleted
}
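A DOWN step that is added to UINIT also runs on the normal path, where the reset has already taken the port down, so it has to be harmless when repeated. One common way to get that property is to guard the stop routine with an atomic flag. The following is a minimal user-space sketch of that idea using C11 atomics. The `toy_nic` type and its functions are hypothetical, and nothing here is taken from the actual driver source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for a NIC's private state. */
struct toy_nic {
    atomic_bool down;   /* true once the port has been stopped */
    int stop_calls;     /* how many times the real stop work ran */
};

void toy_nic_init(struct toy_nic *nic)
{
    atomic_init(&nic->down, false);
    nic->stop_calls = 0;
}

/* Stop the port at most once; returns true if this call did the work. */
bool toy_nic_stop_once(struct toy_nic *nic)
{
    /* atomic_exchange returns the previous value: true means the port
     * was already stopped, so this call becomes a no-op. */
    if (atomic_exchange(&nic->down, true))
        return false;
    nic->stop_calls++;  /* stand-in for disabling NAPI, stopping queues, ... */
    return true;
}

/* Call the guarded stop `attempts` times and report how often it really ran. */
int count_real_stops(int attempts)
{
    struct toy_nic nic;

    assert(attempts >= 0);
    toy_nic_init(&nic);
    for (int i = 0; i < attempts; i++)
        toy_nic_stop_once(&nic);
    return nic.stop_calls;
}
```

With a guard like this, calling the stop routine from both the DOWN stage and the UINIT stage costs nothing when the port is already down, and still catches the case where another path brought it back up in between.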

How to Protect Yourself

1. Upgrade Kernel: Make sure your Linux systems run a kernel that includes the fix for CVE-2024-44995. Check your distribution’s Errata or Security Updates page for the exact package version.
2. Mitigation: If you’re stuck on an unpatched kernel, avoid changing TC configuration while a device reset is in progress.
3. Monitor for DoS: If you see unexplained network freezes on hns3 hardware, check your kernel version and look through dmesg for reset activity or hung-task warnings around the time of the freeze.
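For the upgrade check, a fleet-wide script usually needs to compare the running kernel release against some minimum. The helper below is a rough sketch of that comparison. The minimum version is deliberately left to the caller, because the fixed version differs per stable branch and per distribution, and distributions often backport fixes without changing the upstream version number. Treat the result as a hint and confirm against your vendor’s advisory. All function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

struct kver { int major, minor, patch; };

/* Parse the leading "X.Y" or "X.Y.Z" of a release string such as
 * "6.6.48-generic". Returns 0 on success, -1 if it does not look
 * like a version. A missing patch level is treated as 0. */
int kver_parse(const char *release, struct kver *out)
{
    assert(release != NULL && out != NULL);
    out->major = out->minor = out->patch = 0;
    int n = sscanf(release, "%d.%d.%d", &out->major, &out->minor, &out->patch);
    return n >= 2 ? 0 : -1;
}

/* Negative, zero, or positive, in the style of strcmp(). */
int kver_cmp(const struct kver *a, const struct kver *b)
{
    if (a->major != b->major)
        return a->major - b->major;
    if (a->minor != b->minor)
        return a->minor - b->minor;
    return a->patch - b->patch;
}

/* 1 if release >= minimum, 0 if it is older, -1 on a parse error. */
int kver_at_least(const char *release, const char *minimum)
{
    struct kver r, m;

    if (kver_parse(release, &r) != 0 || kver_parse(minimum, &m) != 0)
        return -1;
    return kver_cmp(&r, &m) >= 0 ? 1 : 0;
}

/* Same check against the kernel this process is running on. */
int running_kernel_at_least(const char *minimum)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return kver_at_least(u.release, minimum);
}
```

A caller would pass the minimum version taken from the distribution’s advisory to `running_kernel_at_least()` and flag any host where the result is 0 or -1 for manual review.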

References

- Linux Kernel Patch (git.kernel.org)
- LKML Discussion
- CVE-2024-44995 NVD Entry
- Linux TC Documentation (kernel.org)

Conclusion

CVE-2024-44995 is a textbook example of how messy multi-threaded state can get, even in seasoned kernel code. Thanks to fast reporting and a clear patch, this deadlock has been resolved. If you’re a sysadmin or kernel tinkerer using the hns3 driver, updating now is your best bet to prevent outages and enjoy smoother networking.

*Stay secure — and remember, even tiny race conditions can be big headaches!*

Timeline

Published on: 09/04/2024 20:15:08 UTC
Last modified on: 09/06/2024 16:28:37 UTC