CVE-2024-12720 - Deep Dive into a ReDoS Flaw in huggingface/transformers (v4.46.3)

The world of machine learning relies heavily on open source libraries for productivity and performance. Hugging Face's transformers is one of the most popular Python packages for natural language processing (NLP). But even the best-maintained projects aren’t immune to security issues. Today, we explore CVE-2024-12720, a Regular Expression Denial of Service (ReDoS) vulnerability that affects v4.46.3 of transformers, specifically a regex in the tokenization_nougat_fast.py file.

What Is CVE-2024-12720?

This vulnerability arises from a specific implementation detail in the post_process_single() function inside tokenization_nougat_fast.py. Here, a regular expression attempts to sanitize or process input text, but under certain crafted input patterns, the regex causes exponential backtracking. That means as input size grows, processing time sharply escalates, monopolizing CPU resources and causing unresponsiveness—the classic symptom of a Denial of Service (DoS) attack.

Summary Table

| Identifier | CVE-2024-12720 |
|--------------------|--------------------------------|
| Library | huggingface/transformers |
| Affected File | tokenization_nougat_fast.py |
| Vulnerable Version | v4.46.3 |
| Attack Type | ReDoS (Regular Expression Denial of Service) |
| Impact | High CPU usage, potential downtime |
| Severity | HIGH (per NVD) |

The Problematic Function

Located within tokenization_nougat_fast.py, the function post_process_single() is supposed to perform quick regex-based text cleanups. However, speed becomes a problem if patterns are inefficient.

Snippet from Vulnerable Code

import re

def post_process_single(text):
    # Regex pattern meant to fix whitespace issues
    text_new = re.sub(r'( +\n)+', '\n', text)
    return text_new

*Note: The actual regex in the file may be more complex, but the vulnerability arises from the same principle—a regex that permits catastrophic backtracking given certain crafted inputs.*

How Attackers Exploit It

Attackers can trigger excessive backtracking by feeding the function specially crafted strings. For the simplified pattern above, the slow case is a long run of spaces that is not followed by a newline, so every attempt to match ultimately fails:

malicious_input = " " * 10000 + "X"  # 10,000 spaces followed by a character that is not a newline

# The regex retries from every space in the run, and each retry rescans the rest of the run.
post_process_single(malicious_input)

Or, in shell terms (e.g., if calling from an API; the endpoint below is a placeholder):

curl --data-urlencode "text=$(python3 -c "print(' ' * 10000 + 'X')")" \
     http://vulnerable-transformers-api/process
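
One way to check whether a given input really is slow is to time the substitution at several input sizes. The sketch below is only an illustration: it uses the simplified pattern from this article rather than the regex shipped in transformers, and the helper names time_substitution and growth_table are made up for the example. The test input is a run of spaces followed by a non-newline character, which forces the pattern to fail at every starting position.

```python
import re
import time

# Simplified pattern from this article, not the one shipped in transformers.
PATTERN = re.compile(r'( +\n)+')


def time_substitution(text: str) -> float:
    """Return the seconds spent running the substitution on text."""
    start = time.perf_counter()
    PATTERN.sub('\n', text)
    return time.perf_counter() - start


def growth_table(sizes):
    """Time a run of spaces followed by a non-newline character for each size."""
    return [(n, time_substitution(' ' * n + 'X')) for n in sizes]


if __name__ == '__main__':
    for n, seconds in growth_table([1000, 2000, 4000, 8000]):
        print(f'{n:>6} spaces: {seconds:.4f} s')
```

If the measured time roughly quadruples each time the size doubles, the cost is growing quadratically. Exact numbers depend on the machine and the Python version.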

Why Does It Happen? (Technical Deep Dive)

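The pattern r'( +\n)+' matches one or more spaces followed by a newline, with that group repeated one or more times. The trouble starts when the input almost matches: given a long run of spaces with no newline after it, the engine consumes the whole run, fails to find the newline, gives the spaces back one at a time, and then starts over from the next position in the run. For this simplified pattern the work grows roughly quadratically with the length of the run. Patterns that nest one ambiguous quantifier inside another, such as (a+)+, can degrade exponentially, and that is the behavior reported for the regex in post_process_single().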

Backtracking Visualization

Given input: "     X" (a run of spaces followed by X)
Because X is not a newline, no match is possible. The regex engine still tries every starting space and, from each one, every possible length for the run before giving up. For a run of n spaces that is about n(n+1)/2 attempts, so 10,000 spaces already cost about 50 million steps.
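
To get a feel for how the work grows with the simplified pattern shown earlier, the sketch below counts the (start, length) candidates that a naive backtracking matcher would try on a run of spaces that is never followed by a newline. This is a deliberately simplified model, not a trace of CPython's actual regex engine, and the function names are made up for the example.

```python
def count_attempts(run_length: int) -> int:
    """Count candidate matches a naive backtracking engine would try.

    Models the ' +' part of the pattern on a run of spaces that is
    never followed by a newline, so every candidate fails.
    """
    attempts = 0
    for start in range(run_length):
        # From this start, ' +' can take 1..(run_length - start) spaces,
        # and each choice ends with a failed check for the newline.
        for _length in range(1, run_length - start + 1):
            attempts += 1
    return attempts


def closed_form(run_length: int) -> int:
    """Same count as count_attempts, computed as n(n+1)/2."""
    return run_length * (run_length + 1) // 2


if __name__ == '__main__':
    for n in (10, 100, 1000):
        assert count_attempts(n) == closed_form(n)
        print(n, closed_form(n))
```

Under this model, doubling the length of the run roughly quadruples the number of attempts.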

For more about catastrophic backtracking, see the references at the end of this article.

The practical impact: the service locks up or crashes, denying access to other users.

In multi-tenant platforms, this could even be leveraged to disrupt co-hosted services.


- Look for the upstream patch or upgrade to a patched version as soon as it's released.
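
Until an upgrade is possible, a stopgap is to bound the input and to rewrite the pattern so that it cannot retry from every position inside a run of spaces. The sketch below applies both ideas to the simplified pattern from this article. It is not the upstream patch: the ANCHORED pattern, the MAX_INPUT_CHARS limit, and the post_process_guarded name are all assumptions made for illustration.

```python
import re

# Simplified pattern from this article, kept for comparison.
ORIGINAL = re.compile(r'( +\n)+')

# Hypothetical rewrite: the negative lookbehind only lets a match begin at
# the first space of a run, so the engine does not retry from every space
# inside the same run.
ANCHORED = re.compile(r'(?:(?<! ) +\n)+')

# Assumed limit; tune it for your own workload.
MAX_INPUT_CHARS = 100_000


def post_process_guarded(text: str) -> str:
    """Collapse runs of 'spaces + newline' into one newline, with a size cap."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError('input too long for post-processing')
    return ANCHORED.sub('\n', text)


if __name__ == '__main__':
    sample = 'a \n \nb'
    # Both patterns should produce the same result on ordinary input.
    assert post_process_guarded(sample) == ORIGINAL.sub('\n', sample)
    print(repr(post_process_guarded(sample)))
```

The length cap limits the damage from any remaining slow path, while the lookbehind removes the repeated rescanning for this particular pattern. Any rewrite of a production regex should be checked against the original on representative inputs before it is deployed.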

References and Further Reading

- CVE-2024-12720 – NVD Page
- huggingface/transformers GitHub
- OWASP: Regular expression Denial of Service - ReDoS
- Catastrophic Backtracking Explained
- Sample Patch Discussion

Conclusion

While regular expressions are powerful tools, careless use can open the door to subtle yet devastating vulnerabilities like ReDoS. If you rely on Hugging Face transformers in any user-facing context, audit your usage, test using adversarial inputs, and update frequently. And remember: always be skeptical of regular expressions in the wild!

Timeline

Published on: 03/20/2025 10:15:29 UTC
Last modified on: 03/20/2025 14:15:18 UTC