CVE-2024-26986 - Understanding the Linux Kernel Memory Leak in AMD KFD (Exploit Details, Patch Review, and How to Stay Safe)

A security issue, CVE-2024-26986, was recently patched in the Linux kernel. The bug affects amdkfd, the AMD Kernel Fusion Driver, which is the part of the amdgpu driver in the DRM (Direct Rendering Manager) subsystem that provides GPU compute support. The vulnerability is a memory leak on an error path that is taken when a KFD process is created while a GPU reset is in progress.

If you want to know what went wrong, how it could be abused, and what the fix looks like in code, you’re in the right place.

- Affected kernel: Linux (various versions with AMD KFD support)
- Component: DRM/AMDKFD (amdgpu)
- Severity: Medium (denial of service by a local attacker)
- Fixed in: mainline Linux commit 1c7b883dab0399
- Discovery: AMD/Kernel maintainers

Summary:
When a user process tries to access a GPU while the device is undergoing a reset, the kernel “leaks” (fails to release) a reference to the process’s memory descriptor, the mm_struct, that it took with mmget(). If this happens many times, system memory can be exhausted, potentially crashing the system or letting local users degrade performance.

How the Bug Happens

The memory leak happens inside the function that creates a process context for GPU work. If an error occurs during certain device states (like a GPU reset), the error-handling code returns without dropping the reference it took on the process’s memory descriptor (mm_struct).

In code terms, every call to mmget() should pair with an eventual mmput(). In this bug, the cleanup was incomplete.
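To make the pairing rule concrete, here is a toy model in Python. It is not kernel code: `MmModel`, `create_process_buggy`, `create_process_fixed`, and `pinned_after_exit` are hypothetical names invented for this sketch, and the counter only imitates the idea of a user count on an mm. It assumes each simulated process tries to create its GPU context once during a reset and then exits.

```python
EIO = 5  # errno value returned by the article's simplified snippet


class MmModel:
    """Toy stand-in for an mm's user count; purely illustrative."""

    def __init__(self):
        self.users = 1  # reference held by the owning task

    def mmget(self):
        self.users += 1

    def mmput(self):
        if self.users <= 0:
            raise RuntimeError("mmput() without a matching mmget()")
        self.users -= 1


def create_process_buggy(mm, resetting):
    mm.mmget()
    if resetting:
        return -EIO  # error path returns without mmput(): one reference leaks
    mm.mmput()  # normal-path release, collapsed into one call in this model
    return 0


def create_process_fixed(mm, resetting):
    mm.mmget()
    if resetting:
        mm.mmput()  # the matching release that the buggy version lacks
        return -EIO
    mm.mmput()
    return 0


def pinned_after_exit(create, processes):
    """Simulate tasks that each try once during a reset, then exit."""
    pinned = 0
    for _ in range(processes):
        mm = MmModel()
        create(mm, resetting=True)
        mm.mmput()  # the task exits and drops its own reference
        if mm.users > 0:  # a leaked reference keeps this mm alive
            pinned += 1
    return pinned


print(pinned_after_exit(create_process_buggy, 1000))  # 1000
print(pinned_after_exit(create_process_fixed, 1000))  # 0
```

In the model, every process that goes through the buggy path leaves behind an mm that can never be released, which is why the damage grows with the number of processes rather than with time.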

Vulnerable Code Snippet (Before Patch)

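/*
 * Simplified illustration of the bug, not the verbatim kernel source.
 * Real code: linux/drivers/gpu/drm/amd/amdkfd/kfd_process.c
 */
int kfd_create_process(...)
{
    struct kfd_process *process;

    process = kzalloc(sizeof(*process), GFP_KERNEL);
    if (!process)
        return -ENOMEM;

    /* ... */
    process->mm = current->mm;
    if (!process->mm) {
        kfree(process);
        return -EINVAL;
    }

    /* Take a reference so the mm stays alive while KFD uses it. */
    mmget(process->mm);

    /* Something goes wrong due to GPU reset... */
    if (device_is_resetting()) {    /* placeholder for the driver's reset check */
        pr_err("GPU reset in progress");
        /* Bug: no mmput(process->mm) here, so the reference leaks */
        kfree(process);
        return -EIO;
    }

    /* ... */
}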

> *Notice*: mmget() increments the mm reference, but in the error case (device_is_resetting()), there's no matching mmput() call.

Risk

This is not a remote code execution bug, but it can be triggered by a local user, such as an unprivileged account that can open /dev/kfd to run GPU workloads through ROCm/HIP or OpenCL.

Each process that hits the error path leaks a reference to its mm, which keeps the kernel from freeing that process’s address space after it exits.

Eventually, the system's memory or per-process limits are exhausted, resulting in a crash (denial of service).

Pseudo-Exploit

import subprocess

# Placeholder for any program that opens /dev/kfd,
# e.g., a ROCm/HIP or OpenCL workload.
GPU_PROGRAM = ["some-gpu-program"]

def gpu_workload():
    # Each run is a fresh process, so each failed attempt
    # leaks a reference to a different mm.
    subprocess.run(GPU_PROGRAM, check=False)

def stress_kernel_leak():
    while True:
        gpu_workload()

# The error path is only reached while a GPU reset is in progress
# (for example, during recovery from a GPU hang). Outside that
# window this loop leaks nothing.

if __name__ == "__main__":
    # During a reset, this loop piles up leaked mmget() references.
    stress_kernel_leak()

*This is a demonstration. Do not use it maliciously.*

The Fix

The patch added the missing mmput() call in the error path. Here’s what the correct code looks like:

Fixed Code Snippet (After Patch)

if (device_is_resetting()) {
    pr_err("GPU reset in progress");
    mmput(process->mm);    // <--- Added
    kfree(process);
    return -EIO;
}
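A common way to make this kind of leak harder to write is to route every exit through a single cleanup point; in kernel C that is usually a `goto` to a shared label. The sketch below shows the same idea in Python with a context manager. It is an analogy under stated assumptions, not the actual patch: `MmModel`, `mm_reference`, and `create_process` are hypothetical names used only for illustration.

```python
from contextlib import contextmanager

EIO = 5  # errno value returned by the article's simplified snippet


class MmModel:
    """Toy stand-in for an mm's user count; purely illustrative."""

    def __init__(self):
        self.users = 1  # reference held by the owning task

    def mmget(self):
        self.users += 1

    def mmput(self):
        self.users -= 1


@contextmanager
def mm_reference(mm):
    """Hold a reference for the duration of the block, whatever the exit."""
    mm.mmget()
    try:
        yield mm
    finally:
        mm.mmput()  # runs on normal return, early return, and exceptions


def create_process(mm, resetting):
    with mm_reference(mm):
        if resetting:
            return -EIO  # early return still goes through mmput()
        return 0


mm = MmModel()
result = create_process(mm, resetting=True)
print(result, mm.users)  # -5 1
```

Because the release lives in one place, adding another early return later cannot reintroduce the leak.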

The full patch is the mainline commit listed under “Fixed in” above.

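How to Stay Safe

Upgrade to a kernel that contains the fix. Kernels released after April 2024 generally include it, but the release date alone is not a guarantee: confirm that your kernel, including any distribution backport, actually carries the patch.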

- Distributions: Check your vendor advisories (Red Hat, Debian, Ubuntu).
- Cloud environments: Make sure GPU-using tenants or containers cannot exhaust memory by patching hosts.
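Before chasing advisories, it can help to check whether a host exposes the KFD interface at all. The following is a minimal sketch, assuming the usual locations `/dev/kfd` for the device node and `/sys/module/amdgpu` for the loaded driver; `kfd_exposure` is a hypothetical helper name. It only reports what it finds and does not tell you whether the running kernel is patched, so the kernel release it prints still has to be compared against your vendor's advisory.

```python
import os
import platform


def kfd_exposure(dev_path="/dev/kfd", module_path="/sys/module/amdgpu"):
    """Report whether this host appears to expose the AMD KFD interface."""
    return {
        "kernel_release": platform.release(),
        "kfd_device_present": os.path.exists(dev_path),
        "amdgpu_in_sysfs": os.path.isdir(module_path),
    }


if __name__ == "__main__":
    for key, value in kfd_exposure().items():
        print(f"{key}: {value}")
```

If neither the device node nor the driver is present, the vulnerable code path is not reachable on that host, though updating the kernel is still the safer choice.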

Conclusion

*CVE-2024-26986* is a classic example of how a simple error in kernel resource management can have security and stability effects, even if it doesn't allow code execution. It also reminds us how critical good cleanup in error handling is.

If you rely on AMD GPUs or provide GPU resources to users (on-prem or cloud), upgrade your kernel to stay safe.

Timeline

Published on: 05/01/2024 06:15:16 UTC
Last modified on: 08/02/2024 00:21:05 UTC