CVE-2021-47044 - Understanding and Exploiting the Linux Kernel sched/fair Shift-Out-Of-Bounds Bug in load

In mid-2021, a subtle yet critical bug was discovered and fixed in the Linux kernel’s scheduler, specifically inside the CFS (Completely Fair Scheduler) code responsible for balancing tasks across CPUs. The bug, tracked as CVE-2021-47044, allowed a shift-out-of-bounds operation, which is undefined behavior in C. It could make systems unstable or open the door to exploitation, although any attack would come from a local, unprivileged user and would be tricky to trigger in the wild.

Let’s break down what happened, why it’s dangerous, how it can be exploited (with code!), and how it was fixed.

Background: The Linux Scheduler and load_balance()

The Linux kernel scheduler constantly works to distribute processes between CPUs, so no core sits idle while another is overloaded. The code at fault is in kernel/sched/fair.c, in the load-balancing path driven by load_balance(). When one CPU is too busy and another has spare cycles, this function tries to pull tasks from the busiest CPU over to the less loaded one.

If the scheduler can’t balance the load, it records the failure in a per-domain counter named nr_balance_failed. Once the counter exceeds cache_nice_tries + 2, the scheduler attempts an "active balance", migrating tasks more aggressively.

The Vulnerability: Shift-By-Too-Much

The Bug:
Each time load balancing fails, the nr_balance_failed counter is incremented. After several failures, the kernel uses this value as the count of a bit shift. The problem: in C, shifting by a count equal to or greater than the width of the shifted type (say, shifting a 32-bit int by 32 or more) is undefined behavior. The compiler may assume it never happens, and the result differs between architectures, so the scheduler can end up acting on a garbage value.
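To make the hazard concrete, the sketch below models in portable C what an x86-64 shift instruction does with an oversized count on a 64-bit operand: the hardware keeps only the low 6 bits of the count. This shows one possible outcome, not a guarantee, because a compiler is not obliged to emit that instruction for undefined behavior. Both function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Mathematically expected result: shifting out every bit leaves 0. */
uint64_t shr_expected(uint64_t val, unsigned int count)
{
    return count >= 64 ? 0 : val >> count;
}

/*
 * Model of the x86-64 SHR instruction on a 64-bit operand.
 * Assumption: the CPU masks the count to its low 6 bits (count & 63).
 */
uint64_t shr_x86_64_model(uint64_t val, unsigned int count)
{
    return val >> (count & 63);
}
```

With val = 1 << 30 and a count of 86, shr_expected() returns 0 while the model returns 256, because 86 & 63 is 22. A comparison that should have seen a tiny value sees a much larger one instead.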

How bad can the counter get?
Under certain real-world and fuzzed conditions (as discovered by syzbot), the counter could reach surprisingly high values, far beyond what was originally expected: syzbot reported counts such as 86 and 149.

Why?
The automatic “reset” logic could be skipped when a candidate task is not allowed to run on the destination CPU. That meant the counter could be incremented, again and again, with no upper bound.
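The runaway growth can be sketched with a small userspace model. This is not kernel code: struct sd_model, failed_balance() and run_failures() are hypothetical names, and the threshold (cache_nice_tries + 2) and the reset value (cache_nice_tries + 1) are simplifying assumptions about how active balancing pulls the counter back down.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the per-domain counter. */
struct sd_model {
    unsigned int nr_balance_failed;
    unsigned int cache_nice_tries;
};

/*
 * One failed balance attempt. dst_allowed says whether the candidate
 * task's affinity mask permits the destination CPU.
 */
void failed_balance(struct sd_model *sd, bool dst_allowed)
{
    sd->nr_balance_failed++;

    /* Past the threshold, an active balance is attempted. */
    if (sd->nr_balance_failed > sd->cache_nice_tries + 2 && dst_allowed) {
        /* Active balance kicked off: the counter is pulled back down. */
        sd->nr_balance_failed = sd->cache_nice_tries + 1;
    }
    /* If the task may not run on the destination, nothing resets the counter. */
}

/* Apply n consecutive failures and return the final counter value. */
unsigned int run_failures(unsigned int tries, unsigned int n, bool dst_allowed)
{
    struct sd_model sd = { .nr_balance_failed = 0, .cache_nice_tries = tries };

    for (unsigned int i = 0; i < n; i++)
        failed_balance(&sd, dst_allowed);
    return sd.nr_balance_failed;
}
```

In this model, run_failures(1, 150, false) returns 150, while the same call with dst_allowed set to true never returns more than cache_nice_tries + 2.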

Here's a simplified snippet similar to the vulnerable pattern:

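// pseudo-structure (illustrative, not the real kernel definition)
struct sched_domain {
    unsigned int nr_balance_failed;
    unsigned int cache_nice_tries;
};

void load_balance(struct sched_domain *sd, int balance_failed) {
    unsigned int mask;

    // ... (lots of code skipped)
    if (balance_failed) {
        sd->nr_balance_failed++; // grows without an upper bound
    }
    // Sometime later, used as a shift count:
    mask = (1U << sd->nr_balance_failed) - 1;  // undefined once the count reaches 32
    (void)mask; // illustration only: the value is not used further
}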

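If nr_balance_failed goes above 31, the shift here is undefined, because int is 32 bits wide on Linux for both 32-bit and 64-bit kernels. In the upstream code the affected expression is a right shift of a task's load in the load-balancing path, but the hazard is the same. An attacker could potentially abuse this to degrade scheduling or cause a denial of service; anything beyond that would depend heavily on the compiler and architecture.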

How an Exploit Could Work

While the bug alone doesn’t allow arbitrary code execution, a denial of service (DoS) may be possible by deliberately keeping CPUs imbalanced, for instance via CPU affinity, cgroups, or creative scheduling; the exact details depend on your privileges. One could repeatedly trigger load balancing failures, pushing nr_balance_failed ever higher.

Example (Pseudo-Exploit Flow)

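# Conceptual sketch only: it creates an imbalance, it is not a working exploit.
import multiprocessing
import os
import time

def hog_cpu(affinity_cpu):
    # pid 0 means "the calling process"
    os.sched_setaffinity(0, {affinity_cpu})
    while True:
        pass

if __name__ == "__main__":
    # Create N busy processes all pinned to CPU 0, leaving the others idle.
    # Processes are used instead of threads so the GIL does not serialize them.
    for i in range(os.cpu_count() - 1):
        p = multiprocessing.Process(target=hog_cpu, args=(0,), daemon=True)
        p.start()

    # Meanwhile, repeatedly create processes on other idle CPUs with restrictive
    # affinity to trigger scheduler balance failures.
    time.sleep(10)  # daemon processes are terminated when the parent exits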

Tools like syzkaller automate this process and can even find non-obvious bugs via fuzzing.

The Fix: Capping the Shift

To guard against undefined behavior, the kernel maintainers capped the shift count at the bit width of the shifted type minus one. In plain English: if you’re going to shift a 32-bit int, never let the shift count exceed 31.

Fixed code (simplified)

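// Cap the count by the width of the value being shifted (here: unsigned int)
unsigned int safe_shift = min_t(unsigned int, sd->nr_balance_failed, BITS_PER_TYPE(unsigned int) - 1);
mask = (1U << safe_shift) - 1; // safe: the count never exceeds 31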

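BITS_PER_TYPE() yields the width of a type in bits, and min_t() clamps the count to that width minus one, so the shift is always valid for the type in question. Upstream, the patch wraps this logic in a small helper macro, shr_bound(val, shift), in kernel/sched/sched.h.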
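The same clamping idea can be tried outside the kernel. The sketch below is a userspace approximation for right shifts, assuming an unsigned long operand; bounded_shr() is a hypothetical name, and CHAR_BIT with sizeof stands in for the kernel's BITS_PER_TYPE().

```c
#include <limits.h>

/* Userspace sketch of a bounded right shift; not the kernel macro itself. */
unsigned long bounded_shr(unsigned long val, unsigned int shift)
{
    /* Largest count that is still defined for this operand type. */
    const unsigned int max_shift = sizeof(val) * CHAR_BIT - 1;

    if (shift > max_shift)
        shift = max_shift;
    return val >> shift;
}
```

For example, bounded_shr(1024UL, 3) is 128 and bounded_shr(1024UL, 86) is 0. Note that clamping to width minus one leaves the top bit reachable, so bounded_shr(ULONG_MAX, 149) is 1 rather than 0. For a threshold comparison like the scheduler's, that difference is negligible.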

Code Audit: How to Spot More

The kernel maintainers used Coccinelle scripts to scan for similar patterns. For example, below is a pattern you can use:

@expr@
position pos;
expression E1;
expression E2;
@@
(
E1 >> E2@pos
|
E1 << E2@pos
)

The audit showed this bug was isolated—other uses in the scheduler area (like rq_clock_thermal()) already capped the shift.

Links & References

- Linux kernel commit fixing CVE-2021-47044
- Syzbot bug report
- CVE-2021-47044 on NVD
- CVE-2021-47044 Red Hat Security Advisory
- Coccinelle: Program matching and transformation for C code
- syzkaller: kernel fuzzer project

Conclusion

CVE-2021-47044 is a classic example of how a small arithmetic oversight in kernel code (a shift beyond type-width) can have big security consequences. While in practice exploitation might be tricky and usually limited to DoS, it’s a reminder of the complexities lurking in core OS code. The fix is now in all major kernels—so patch up if you haven’t!

Timeline

Published on: 02/28/2024 09:15:40 UTC
Last modified on: 11/04/2024 17:35:01 UTC