The Linux kernel's memory controller (memcg) lets administrators and developers limit, account for, and isolate the memory usage of groups of processes. This is crucial in environments where multiple applications or tenants share system resources, such as containers or virtualized systems. However, complex resource accounting mechanisms have historically introduced challenging bugs. One such problem, now resolved, involved wrongly held references to memory cgroups ("memcgs"): some kernel memory allocations could keep these references alive even after an administrator had deleted the cgroup, leading to resource leaks and system slowdowns.
This post explains CVE-2021-47011 in depth: its cause, how it was detected, its security implications, and how the vulnerability was ultimately patched.
Background: How Memory Cgroups Hold References
When the Linux kernel allocates memory on behalf of a task (a kernel stack, for example), the memory controller needs to "charge" that allocation to a specific memcg so the correct group is billed for its RAM usage. For slab allocations (the many small kernel objects), recent kernels charge through a structure called obj_cgroup, an intermediate object that records who should be charged. Because such allocations reference the obj_cgroup rather than the memcg itself, long-lived objects do not pin the memcg, and a deleted cgroup can actually be freed.
However, before this patch series there was a class of allocations that was not charged this way: large ones, such as kernel stacks bigger than 2 pages (e.g., 16KB stacks on arm64/x86_64). Instead, they were accounted as "kmem pages," each holding a direct reference to the memcg. If you migrate (move) a thread from one cgroup (A) to another (B), the stack allocation continues to pin cgroup A in memory.
So when cgroup A is deleted administratively, its resources aren’t released until all references are dropped. Under some circumstances, the kernel can accumulate thousands of "dying" cgroups that have been removed but never freed, each consuming kernel memory.
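The difference between the two charging styles can be sketched with a small user-space model. This is only an illustration under simplifying assumptions: the `toy_` structures and functions are hypothetical stand-ins, not the kernel's definitions, and the reference count is a plain integer rather than the kernel's per-CPU counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; these are NOT the real definitions. */
struct toy_memcg {
	int refs;                  /* references keeping the group alive */
	struct toy_memcg *parent;
};

struct toy_objcg {
	struct toy_memcg *memcg;   /* may be redirected to the parent later */
};

/* "kmem page" style charge: the allocation pins the memcg itself. */
void toy_charge_direct(struct toy_memcg *memcg)
{
	memcg->refs++;
}

/* obj_cgroup style charge: the allocation remembers only the objcg. */
struct toy_objcg *toy_charge_objcg(struct toy_objcg *objcg)
{
	return objcg;              /* no memcg reference is taken */
}

/* Model of rmdir: repoint the objcg at the parent, drop the base reference. */
void toy_rmdir(struct toy_memcg *memcg, struct toy_objcg *objcg)
{
	assert(memcg->parent != NULL && memcg->refs > 0);
	objcg->memcg = memcg->parent;
	memcg->refs--;
}

/* After rmdir, a group with remaining references lingers as "dying". */
bool toy_is_dying(const struct toy_memcg *memcg)
{
	return memcg->refs > 0;
}
```

In this model, a group that received a direct charge still has a nonzero count after `toy_rmdir()` and lingers, while a group charged only through its `toy_objcg` drops to zero and the allocation now resolves to the parent.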
Demonstration Script
Here’s a simple (abbreviated for clarity) script to expose the issue (do not use in production):
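#!/bin/bash
# Requires root and a cgroup v1 memory hierarchy.
cd /sys/fs/cgroup/memory
echo 1 > memory.move_charge_at_immigrate
for i in {1..500}
do
    mkdir kmem_test
    # Move this shell into the new cgroup so its child is charged there.
    echo $$ > kmem_test/cgroup.procs
    sleep 360 &
    # Move the shell, then the sleeping child, back to the parent cgroup.
    echo $$ > cgroup.procs
    echo $(cat kmem_test/cgroup.procs) > cgroup.procs
    rmdir kmem_test
done
# The num_cgroups column for "memory" now includes the dying cgroups.
grep memory /proc/cgroups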
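*Result*: after the script runs, the num_cgroups count that /proc/cgroups reports for the memory controller is about 500 higher than before, because the removed cgroups linger as dying cgroups.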
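To compare the count before and after a run programmatically, the num_cgroups column can be read from /proc/cgroups. The following is a minimal sketch that assumes the usual four-column layout (subsys_name, hierarchy, num_cgroups, enabled); the function names are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the num_cgroups column for `subsys` from text in /proc/cgroups
 * format, or -1 if the controller is not listed.
 */
int num_cgroups_for(const char *text, const char *subsys)
{
	const char *line = text;

	assert(text != NULL && subsys != NULL);
	while (line && *line) {
		char name[64];
		int hierarchy, num, enabled;

		/* The "#subsys_name ..." header fails the numeric fields and is skipped. */
		if (sscanf(line, "%63s %d %d %d", name, &hierarchy, &num, &enabled) == 4 &&
		    strcmp(name, subsys) == 0)
			return num;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read /proc/cgroups and report the memory controller's count (-1 on error). */
int memory_num_cgroups(void)
{
	char buf[4096];
	size_t n;
	FILE *f = fopen("/proc/cgroups", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return num_cgroups_for(buf, "memory");
}
```

Calling `memory_num_cgroups()` once before and once after the demonstration script gives the two values to compare; a difference close to the number of loop iterations suggests the removed cgroups are still pinned.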
The Root Cause
The bug originates from holding references to memcg from "kmem pages" after moving the thread to another cgroup and subsequent deletion of the old one. The reference doesn't get properly dropped, meaning the old cgroup isn't freed.
Where It Gets Tricky
Internally, freeing a kernel allocation should also drop the memcg reference. But the existing code handled this incorrectly: it relied on rcu_read_lock() to keep the memcg from going away. RCU only guarantees that the memcg structure is not freed; it does not guarantee that taking a new reference with css_get() is still legitimate. When the cached memcg changes, refill_stock() calls css_get(), and if that races with the final reference drop, it can resurrect a reference count that has already reached zero on a memcg that is meant to die.
The pattern looks like this:
/* call chain; indentation shows which call happens inside which */
rcu_read_lock();
memcg = obj_cgroup_memcg(old);
__memcg_kmem_uncharge(memcg);
    refill_stock(memcg);
        if (stock->cached != memcg)
            // css_get can change the ref counter from 0 back to 1.
            css_get(&memcg->css);
rcu_read_unlock();
In effect, the code would incorrectly increase the refcount on a memcg that was supposed to be dying.
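The difference between an unconditional get and a "try" get is easiest to see in a single-threaded model. The sketch below is an assumption-laden illustration: `toy_ref` is a hypothetical plain counter, not the kernel's css reference count, and the race is replayed as a fixed sequence of calls rather than with real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reference counter standing in for a css refcount. */
struct toy_ref {
	long count;
	bool released;   /* set once the count has hit zero and teardown began */
};

/* Like css_get(): unconditional increment, valid only if the caller already holds a reference. */
void toy_get(struct toy_ref *r)
{
	r->count++;
}

/* Like css_tryget(): refuses to take a reference once the count reached zero. */
bool toy_tryget(struct toy_ref *r)
{
	if (r->count == 0)
		return false;
	r->count++;
	return true;
}

/* Drop a reference; the final put starts teardown. */
void toy_put(struct toy_ref *r)
{
	assert(r->count > 0);
	if (--r->count == 0)
		r->released = true;
}

/* True when the object is in the broken "released but referenced again" state. */
bool toy_resurrected(const struct toy_ref *r)
{
	return r->released && r->count > 0;
}
```

Replaying the race as `toy_put()` followed by `toy_get()` leaves the counter in the resurrected state, whereas `toy_tryget()` after the final put simply fails and the caller has to look elsewhere.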
Patch and Remediation
The fix arrived as part of the patch series “Use obj_cgroup APIs to charge kmem pages”, whose overall goal is to make kmem pages drop their memcg reference by charging them through the obj_cgroup APIs. The patch tracked as CVE-2021-47011 addresses the race described above: the kernel now takes its own reference to the memcg before calling __memcg_kmem_uncharge(), so the css_get() inside refill_stock() can no longer resurrect a reference count that has reached zero. This follows the pattern from commit eefbfa7fd678.
Relevant Patch Excerpt
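/* pseudocode for illustration; a sketch of the pattern, not the literal diff */
rcu_read_lock();
do {
    memcg = obj_cgroup_memcg(old);
} while (!css_tryget(&memcg->css)); // Succeeds only on a live memcg
rcu_read_unlock();
__memcg_kmem_uncharge(memcg); // css_get() inside is now safe
css_put(&memcg->css); // Drop the temporary reference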
Result: when you run the problematic script above on a kernel with the patch series applied, dying cgroups no longer accumulate. The references are released properly, and unused memory cgroups are cleaned up as expected.
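The "hold a reference first" idea can be modelled in user space as a retry loop that only ever pins a live group. This is a sketch under stated assumptions: all `toy_` names are hypothetical, and the redirection to the parent, which in the kernel is done concurrently by the cgroup removal path, is performed inside the loop here so the example stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; field and function names are illustrative. */
struct toy_memcg {
	long refs;
	struct toy_memcg *parent;
};

struct toy_objcg {
	struct toy_memcg *memcg;
};

/* Like css_tryget(): fails once the count has reached zero. */
bool toy_tryget(struct toy_memcg *m)
{
	if (m->refs == 0)
		return false;
	m->refs++;
	return true;
}

void toy_put(struct toy_memcg *m)
{
	assert(m->refs > 0);
	m->refs--;
}

/* Retry until a live memcg is pinned, following the objcg to the parent. */
struct toy_memcg *toy_get_memcg_from_objcg(struct toy_objcg *objcg)
{
	for (;;) {
		struct toy_memcg *m = objcg->memcg;

		if (toy_tryget(m))
			return m;
		if (m->parent == NULL)
			return NULL;           /* keeps the toy loop finite */
		objcg->memcg = m->parent;      /* stand-in for reparenting */
	}
}

/* Uncharge with a held reference; returns the memcg that was used. */
struct toy_memcg *toy_drain(struct toy_objcg *objcg)
{
	struct toy_memcg *m = toy_get_memcg_from_objcg(objcg);

	if (m != NULL) {
		/* Uncharge work would go here; taking further references is safe. */
		toy_put(m);
	}
	return m;
}
```

With a group whose count is already zero, `toy_drain()` never touches that count and ends up working against the parent, which is the property the fix is after: a dying group is left to die.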
Security Impacts
- Denial of Service: An attacker or misbehaving app that is allowed to create and delete cgroups can leak kernel memory by doing so repeatedly, which can eventually lead to system slowdown or a kernel panic.
- Resource Isolation Violation: “Dead” cgroups that linger violate the expected isolation and cleanup guarantees of cgroups, interfering with container platforms and automated infrastructure.
- Stale Resource Tracking: System administrators and monitoring frameworks might misreport or misunderstand actual resource usage, causing confusion and misdiagnoses.
Conclusion
The fix for CVE-2021-47011 closes a subtle but impactful kernel reference-counting bug in the handling of memory cgroups. If you run any multi-tenant, container-heavy, or virtualized Linux systems, update to a patched kernel!
References and More Reading
- Linux kernel commit eefbfa7fd678 ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
- Patch series: Use obj_cgroup APIs to charge kmem pages (discussion/patches)
- CVE-2021-47011 at the NVD
Always keep your kernel up to date to protect systems from memory management bugs. Two further takeaways:
- Understand the real-world impact of kernel reference leaks: they can look like mere “bookkeeping” mistakes but lead to exploitable systemic weaknesses.
- Leverage structured kernel APIs for memory charging/accounting, and avoid manual reference-count juggling whenever possible.
Timeline
Published on: 02/28/2024 09:15:38 UTC
Last modified on: 01/08/2025 18:02:38 UTC