In 2024, a bug was found and patched in the Linux kernel’s PCIe endpoint driver for NVIDIA Tegra194 system-on-chips (SoCs). The issue, designated CVE-2024-53152, could crash the endpoint when the host asserts the PERST# reset signal, because the driver ran cleanup routines that touch hardware registers just as the host was removing the reference clock. This article explains the bug, how it could be triggered, what an exploit might look like, how it was fixed, and where you can read more.

Understanding the Vulnerability: What Went Wrong?

PCIe endpoints (devices) must react to the host’s signals to know when to shut down or clean up resources. In the Tegra194 endpoint driver, the controller cleanup (for example, the eDMA cleanup performed inside dw_pcie_ep_cleanup()) ran immediately when the host asserted the PERST# (reset) signal.

The problem? Tegra194 endpoints rely on the host to provide the “refclk” reference clock that keeps the controller operational. Shortly after asserting reset, the host turns refclk off. Any code that accesses the controller’s hardware registers after that point no longer reaches working hardware, and the result is a crash of the whole endpoint.

In short:

- Old behavior: Cleanup runs as soon as PERST# is asserted, risking register access after refclk is gone.
- New behavior: Cleanup is deferred until the host deasserts PERST#, when refclk is guaranteed to be active.

In the Linux kernel, the following sequence (simplified) was the culprit:

static void pex_ep_event_pex_rst_assert(...)
{
    // Called when host asserts PERST#
    dw_pcie_ep_cleanup(); // ACCESS REGISTERS HERE
    pci_epc_deinit_notify(); // NOTIFY EPF
    // ...other code...
    // Shortly afterwards, host turns off refclk
}

Any register access after the clock goes away causes a controller hang or crash on Tegra194 endpoints.
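
To make the race concrete, here is a minimal Python sketch. It is not kernel code. It assumes a toy model in which the cleanup needs a fixed number of register accesses, each taking a fixed time, and the host removes refclk a fixed delay after asserting PERST#. The function name `cleanup_survives`, its parameters, and all the numbers are illustrative assumptions, not values taken from the Tegra194 driver.

```python
def cleanup_survives(num_accesses: int, access_us: int, refclk_off_us: int) -> bool:
    """Return True if every register access completes before refclk is removed.

    Time zero is the moment the host asserts PERST#. Access i (0-based)
    finishes at (i + 1) * access_us microseconds. In this toy model, an
    access that finishes after refclk_off_us hangs the controller.
    """
    for i in range(num_accesses):
        finish_us = (i + 1) * access_us
        if finish_us > refclk_off_us:
            return False  # register touched without refclk: endpoint hangs
    return True


# Illustrative numbers only: 200 accesses at 100 us each need 20 ms in total.
print(cleanup_survives(200, 100, 10_000))  # refclk cut after 10 ms -> False
print(cleanup_survives(200, 100, 50_000))  # refclk cut after 50 ms -> True
```

The point of the sketch is that the old handler was only safe when the endpoint won this race, and the endpoint cannot control the outcome because the host decides when refclk goes away.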

Exploit Details: How Could This Be Abused?

This is not a traditional remote exploit, but it is a denial-of-service vector for devices that use Tegra194 as a PCIe endpoint. Possible practical impacts:

- A buggy or malicious host asserts PERST# and cuts refclk while the endpoint is still running its cleanup. The endpoint’s kernel then crashes, making the device unavailable or forcing a hard reboot.

- In multi-device setups, the same host behavior could crash every vulnerable endpoint attached to it, risking serious downtime.

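Proof of concept for a host controller (Python-like pseudocode):

# Imagine a host-side testbench that can drive PERST# and refclk.
# `pci` and `endpoint_id` are hypothetical; this is not a real library.
import time

pci.assert_perst(endpoint_id)    # host asserts PERST#
time.sleep(0.01)                 # 10 ms, very fast!
pci.disable_refclk(endpoint_id)  # host removes the reference clock
# Endpoint may now crash if running a vulnerable kernel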

The Fix: Safe Cleanup Under Refclk

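The Linux kernel patch moves the cleanup calls out of the PERST# assert handler and into the start of the deassert handler. They now run _after_ the host deasserts PERST# and _after_ the endpoint has enabled its resources, at a time when refclk is guaranteed to be available. Once cleanup finishes, the rest of the deassert handling continues as usual.

Key changes (pseudocode):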

static void pex_ep_event_pex_rst_deassert(...)
{
    // Called when host releases PERST#
    enable_resources(); // Enable clocks, power, etc.
    dw_pcie_ep_cleanup(); // Do cleanup now -- refclk is valid!
    pci_epc_deinit_notify(); // Notify kernel subsystems
    // Continue with usual setup...
}
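
The reordering can also be illustrated with a small runnable Python model. This is a sketch under stated assumptions, not the driver: `ToyEndpoint`, `RefclkLost`, and the logged step names are hypothetical, and the real handlers do considerably more work than shown here. The model only captures the ordering described above: nothing is cleaned up on PERST# assert, and cleanup runs at the start of the deassert handler after resources are enabled.

```python
class RefclkLost(Exception):
    """Models the hang caused by touching a register while refclk is off."""


class ToyEndpoint:
    """Hypothetical stand-in for the endpoint controller; not the real driver."""

    def __init__(self):
        self.refclk_on = True   # True while the host supplies refclk
        self.enabled = False    # True once the controller is set up
        self.log = []           # order of operations, for inspection

    def _touch_registers(self, what):
        if not self.refclk_on:
            raise RefclkLost(what)
        self.log.append(what)

    def on_perst_assert(self):
        # Patched ordering: no controller cleanup here, because the host
        # may remove refclk right after asserting PERST#.
        self.enabled = False

    def on_perst_deassert(self):
        if self.enabled:
            return
        self.log.append("enable_resources")  # clocks, power, etc.
        self._touch_registers("cleanup")     # safe: refclk is active again
        self._touch_registers("setup")       # continue with usual setup
        self.enabled = True


ep = ToyEndpoint()
ep.on_perst_deassert()  # initial bring-up
ep.on_perst_assert()    # host asserts PERST# ...
ep.refclk_on = False    # ... and quickly removes refclk
ep.refclk_on = True     # later the host restores refclk ...
ep.on_perst_deassert()  # ... and releases PERST#
print(ep.log)
```

In this model the cleanup step is only ever reached while `refclk_on` is true, so `RefclkLost` is never raised during the reset cycle, no matter how quickly the host removes the clock after asserting PERST#.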

References

- Linux Kernel Patch Discussion (lore.kernel.org)
- Linux Kernel pci-tegra194.c file
- NVIDIA Tegra194 SoC Technical Reference Manual
- Linux PCI Endpoint Subsystem Documentation

Conclusion: What Should You Do?

- Device makers: If you use NVIDIA Tegra194 as a PCIe endpoint and run Linux, update your kernel to include this fix.
- System integrators: Do not work around the problem by adjusting refclk timing on the host; let the patched kernel sequence the cleanup as intended.
- PCIe devs/testers: Retest endpoints while your host repeatedly asserts and deasserts PERST#. Watch for crashes and verify the patched behavior.
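
For the retesting step, a reset-cycle loop could be structured roughly as follows. This is only a sketch: every hook passed to `stress_perst_cycles` is a hypothetical callable that a real bench would have to implement against its own hardware controls, and the stand-in hooks at the bottom do nothing but record calls.

```python
from typing import Callable


def stress_perst_cycles(
    assert_perst: Callable[[], None],
    disable_refclk: Callable[[], None],
    enable_refclk: Callable[[], None],
    deassert_perst: Callable[[], None],
    endpoint_alive: Callable[[], bool],
    cycles: int,
) -> int:
    """Run PERST# assert/deassert cycles against an endpoint.

    Returns the number of cycles completed before the endpoint stopped
    responding, or `cycles` if it survived all of them.
    """
    for done in range(cycles):
        assert_perst()
        disable_refclk()   # remove refclk right after asserting reset
        enable_refclk()
        deassert_perst()
        if not endpoint_alive():
            return done
    return cycles


# Stand-in hooks that only record what was called; a real bench would
# drive hardware and probe the endpoint here.
events = []
completed = stress_perst_cycles(
    assert_perst=lambda: events.append("assert"),
    disable_refclk=lambda: events.append("refclk_off"),
    enable_refclk=lambda: events.append("refclk_on"),
    deassert_perst=lambda: events.append("deassert"),
    endpoint_alive=lambda: True,
    cycles=3,
)
print(completed)  # 3
```

Passing the hardware actions in as callables keeps the loop independent of any particular bench setup, so the same function can be exercised with stand-ins before it is pointed at a real host.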

Summary:
CVE-2024-53152 is a classic example of how tight timing in a hardware handshake can crash software. The patch defers controller cleanup until refclk is guaranteed to be present, protecting endpoints from denial of service by buggy or malicious hosts.


> If you found this useful, check Linux kernel release notes and security lists for all endpoint CVEs.
> Follow up-to-date guidance at kernel.org and lore.kernel.org.

Timeline

Published on: 12/24/2024 12:15:23 UTC
Last modified on: 05/04/2025 09:54:22 UTC