A Linux kernel issue tracked as CVE-2024-57888 recently drew attention in the open-source and security communities. It involves a spurious warning from the workqueue core, seen in the AMD GPU driver stack when a worker on a memory-reclaim (WQ_MEM_RECLAIM) workqueue cancelled a work item queued on a regular, non-reclaim workqueue. In this deep-dive, we’ll unpack the background, explain why the warning appeared and how it was resolved, and cover what it means for you as a developer or system administrator.

What Happened?

After commit 746ae46c1113 ("drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM"), users started to see kernel warnings related to AMD GPUs. The warnings looked like this:

workqueue: WQ_MEM_RECLAIM sdma:drm_sched_run_job_work [gpu_sched] is flushing !WQ_MEM_RECLAIM events:amdgpu_device_delay_enable_gfx_off [amdgpu]
...
Call Trace:
 <TASK>
...
  ? check_flush_dependency+0xf5/0x110
...
  cancel_delayed_work_sync+0x6e/0x80
  amdgpu_gfx_off_ctrl+0xab/0x140 [amdgpu]
  amdgpu_ring_alloc+0x40/0x50 [amdgpu]
  amdgpu_ib_schedule+0xf4/0x810 [amdgpu]
  ? drm_sched_run_job_work+0x22c/0x430 [gpu_sched]
  amdgpu_job_run+0xaa/0x1f0 [amdgpu]
  drm_sched_run_job_work+0x257/0x430 [gpu_sched]
  process_one_work+0x217/0x720
...
 </TASK>

At first glance, this seemed to flag a logic bug or dangerous kernel workflow, but further investigation showed the warning was more bark than bite.

The Core of the Vulnerability

Background:
Linux uses *workqueues* for asynchronous processing. Workqueues created with WQ_MEM_RECLAIM are guaranteed to make forward progress under memory pressure, so they are safe to rely on during memory reclaim. If work running on such a queue flushes (waits for) work on a normal workqueue that has no such guarantee, it can deadlock, and the kernel rightfully warns about this.

The Problem:
After the GPU scheduler workqueues were marked WQ_MEM_RECLAIM, scheduler work running on them called cancel_delayed_work_sync() on a delayed work item queued on the regular events workqueue, which is not WQ_MEM_RECLAIM. This led check_flush_dependency() to warn, because it saw a WQ_MEM_RECLAIM worker waiting on a !WQ_MEM_RECLAIM work item. In the case of cancelling, however, this is actually safe.

In simple terms:
The kernel was warning about something that isn’t dangerous: cancelling work is different from flushing it, and doesn’t risk deadlocks.

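Here is a simplified view of the check in check_flush_dependency() that produced the warning:

/* Simplified from check_flush_dependency() in kernel/workqueue.c */
worker = current_wq_worker();
WARN_ONCE(worker && (worker->current_pwq->wq->flags & WQ_MEM_RECLAIM),
          "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps",
          worker->current_pwq->wq->name, worker->current_func,
          target_wq->name, target_func);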

And here’s where the false alarm happened: cancel_delayed_work_sync() was triggering this path even though cancelling is safe.
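To make the rule concrete, here is a minimal userspace sketch of the dependency check as described above. It is a model, not kernel code: the struct, the flag macro, and the function name are all hypothetical, and the kernel's additional check for tasks that are themselves in memory reclaim is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's WQ_MEM_RECLAIM flag bit. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical, stripped-down workqueue descriptor for this model. */
struct model_wq {
    const char *name;
    unsigned int flags;
};

/*
 * Returns true when waiting on target_wq would trip the warning in
 * this model: the caller runs on a reclaim-safe workqueue, but the
 * target workqueue has no forward-progress guarantee.
 */
static bool model_flush_warns(const struct model_wq *current_wq,
                              const struct model_wq *target_wq)
{
    /* Waiting on a reclaim-safe target is always fine. */
    if (target_wq->flags & MODEL_WQ_MEM_RECLAIM)
        return false;

    /* current_wq is NULL when the caller is not a workqueue worker. */
    return current_wq && (current_wq->flags & MODEL_WQ_MEM_RECLAIM);
}
```

With a reclaim-flagged "sdma" queue as the caller and an unflagged "events" queue as the target, the model reports a warning, which mirrors the message in the log above. Swapping the two roles, or calling from outside any worker, does not.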

How Was It Fixed?

The fix relaxes the warning in check_flush_dependency() when the caller is cancelling rather than flushing. On the cancel path, the work item is either already running or will not run at all, so cancelling it does not create the kind of memory reclaim deadlock the original check was guarding against.

Patch Overview

- check_flush_dependency(wq, work);
+ check_flush_dependency(wq, work, from_cancel);

The helper now takes the calling context as an extra argument and returns early on the cancel path, so the warning is no longer triggered there.
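The effect of the extra argument can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the kernel implementation: the struct, the flag macro, and the function name are hypothetical, and the boolean is called from_cancel here purely as this sketch's own naming.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's WQ_MEM_RECLAIM flag bit. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical, stripped-down workqueue descriptor for this model. */
struct model_wq {
    const char *name;
    unsigned int flags;
};

/*
 * Model of the relaxed check: returns true if a warning would fire.
 * from_cancel is true when the caller is cancelling a work item
 * rather than flushing it.
 */
static bool model_check_warns(const struct model_wq *current_wq,
                              const struct model_wq *target_wq,
                              bool from_cancel)
{
    /* Cancel paths, and reclaim-safe targets, skip the check. */
    if (from_cancel || (target_wq->flags & MODEL_WQ_MEM_RECLAIM))
        return false;

    /* Otherwise warn if a reclaim-safe worker waits on a normal queue. */
    return current_wq && (current_wq->flags & MODEL_WQ_MEM_RECLAIM);
}
```

In this model a genuine flush from a reclaim-flagged queue to an unflagged one still warns, while the same pair of queues with from_cancel set to true stays silent. That is the behavioral difference the patch is after.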

For the full change, see the upstream commit "workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker".

How This Could Be Exploited

Let’s be clear:
This is not a direct code execution or privilege escalation bug, but the noisy warning could mask *real* deadlock or reclaim issues, and might mislead system administrators. In rare production environments, this might cause monitoring systems to flag kernel logs, resulting in confusion or unplanned interventions.

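If a bad actor could trigger this warning (e.g., via GPU workloads in VMs or containers), the impact is limited: the check uses a once-only warning, so it fires at most once per boot and is a poor tool for flooding logs. On systems configured with panic_on_warn, however, any kernel warning, including this false positive, brings the machine down, which makes this a minor denial-of-service concern rather than a full-blown vulnerability.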
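For monitoring pipelines that flag kernel warnings, one option is to annotate this specific message rather than alert on it. The sketch below is a hypothetical helper, not part of any existing tool: the function name is made up, and the two substrings are taken from the warning quoted earlier in this article, so it only recognizes the amdgpu variant of the message.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if a kernel log line looks like the amdgpu false
 * positive discussed in this article. Both substrings must match,
 * so other "is flushing !WQ_MEM_RECLAIM" warnings are still treated
 * as potentially real.
 */
static bool is_known_false_positive(const char *line)
{
    return strstr(line, "is flushing !WQ_MEM_RECLAIM") != NULL &&
           strstr(line, "amdgpu_device_delay_enable_gfx_off") != NULL;
}
```

Matching on the cancelled function name keeps the filter narrow on purpose. A warning that names a different work function may point at a genuine flush dependency and should still reach a human.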

Here’s simplified pseudo-code to show the problem:

// Runs inside drm_sched_run_job_work(), i.e. on the WQ_MEM_RECLAIM scheduler workqueue
cancel_delayed_work_sync(&my_delayed_work); // my_delayed_work (placeholder name) is queued on the !WQ_MEM_RECLAIM events workqueue

Before the fix, this provoked the warning even though it’s not a problem.

References

- Commit 746ae46c1113 – drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM
- Full Patch Resolving the Warning
- Linux Kernel Workqueue Documentation

Final Thoughts

CVE-2024-57888 is a great example of how defense-in-depth can sometimes cause confusion, but also how quickly the Linux community patches up minor misfires. Staying updated and understanding what’s a *warning* versus a *critical exploit* is key for everyone running production Linux systems.

If you’re a kernel developer or sysadmin, keep an eye out for this patch in your next security update!

Timeline

Published on: 01/15/2025 13:15:13 UTC
Last modified on: 05/04/2025 10:05:56 UTC