lxml_html_clean is an HTML cleaning project derived from lxml.html.clean and used primarily to sanitize untrusted HTML content. A vulnerability (CVE-2024-52595) in versions earlier than 0.4.0 lets crafted markup slip past the cleaner, putting applications that rely on it in security-sensitive contexts at risk.

The Vulnerability

The HTML parser in lxml does not properly handle context switching for special HTML tags such as <svg>, <math>, and <noscript>, so the tree it builds can differ from the one a web browser builds for the same markup. Versions of lxml_html_clean before 0.4.0 do not account for this difference.

In particular, lxml_html_clean ignores content inside CSS comments, while a web browser may interpret that same content differently. This mismatch allows malicious scripts to evade the cleaning process and can lead to Cross-Site Scripting (XSS) attacks.
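Before choosing a mitigation, it can help to know how much of your stored content actually uses these tags. The following is a rough audit sketch built on the standard library's html.parser; the class and function names are hypothetical. It is only a reporting aid, not a sanitizer: html.parser is yet another parser that does not match browser behavior, and it does not report tags that appear inside raw-text elements such as <script> and <style>.

```python
from html.parser import HTMLParser

# Tags named in the advisory as triggering a parsing-context switch.
CONTEXT_SWITCHING_TAGS = frozenset({"svg", "math", "noscript"})


class ContextSwitchAudit(HTMLParser):
    """Collects which context-switching tags occur in a document."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already lowercased.
        if tag in CONTEXT_SWITCHING_TAGS:
            self.found.add(tag)


def find_context_switching_tags(html):
    """Return the set of context-switching tag names seen in `html`."""
    audit = ContextSwitchAudit()
    audit.feed(html)
    audit.close()
    return audit.found


if __name__ == "__main__":
    sample = "<div><svg><style>/* comment */</style></svg></div>"
    print(sorted(find_context_switching_tags(sample)))
```

An empty result across your data suggests that removing these tags outright would cost little; a non-empty one tells you which legitimate content a mitigation would affect.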

Anyone using lxml_html_clean with its default settings to sanitize untrusted HTML content is at risk and should upgrade to version 0.4.0, which addresses this issue.
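To check whether an environment already has the fixed release, a small script along these lines may help. It is a minimal sketch: the helper names are hypothetical, the distribution name "lxml_html_clean" is assumed to match how the package is installed, and the version parsing is deliberately simplified (it handles plain X.Y.Z strings and ignores pre-release suffixes).

```python
from importlib import metadata

# First release that addresses CVE-2024-52595.
FIXED_VERSION = (0, 4, 0)


def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of three ints.

    Simplification: anything after the leading digits of each part
    (for example an 'rc1' suffix) is ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_patched(version_text):
    """True if the given version is at or above the fixed release."""
    return parse_version(version_text) >= FIXED_VERSION


def installed_version(dist_name="lxml_html_clean"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version()
    if current is None:
        print("lxml_html_clean is not installed")
    elif is_patched(current):
        print(f"lxml_html_clean {current} includes the fix")
    else:
        print(f"lxml_html_clean {current} is affected; upgrade to 0.4.0 or later")
```

A check like this could run at application startup or in CI so that an affected version does not go unnoticed.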

Temporary Mitigation

Until users can upgrade to lxml_html_clean 0.4.0, they can configure the cleaner with the following settings to prevent exploitation of this vulnerability:

Via remove_tags, specify tags to remove; their content is moved into their parent tags.

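# Import from lxml_html_clean directly so the standalone package is the one in use.
from lxml_html_clean import Cleaner

# Strip the listed tags themselves but keep their content in the parent element.
cleaner = Cleaner(remove_tags=['svg', 'math', 'noscript'])
cleaned_html = cleaner.clean_html("<html><body><svg>...</svg></body></html>")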

Via kill_tags, specify tags to be removed entirely, together with their content.

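# Import from lxml_html_clean directly so the standalone package is the one in use.
from lxml_html_clean import Cleaner

# Drop the listed tags along with everything nested inside them.
cleaner = Cleaner(kill_tags=['svg', 'math', 'noscript'])
cleaned_html = cleaner.clean_html("<html><body><svg>...</svg></body></html>")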

Via allow_tags, restrict the set of permitted tags, excluding context-switching tags like <svg>, <math>, and <noscript>.

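# Import from lxml_html_clean directly so the standalone package is the one in use.
from lxml_html_clean import Cleaner

# Allow-list of permitted tags. Extend it as your application needs,
# but never add 'svg', 'math', or 'noscript'.
allowed_tags = ['a', 'p', 'div']

# Tags outside the allow-list are removed from the output.
cleaner = Cleaner(allow_tags=allowed_tags)
cleaned_html = cleaner.clean_html("<html><body><svg>...</svg></body></html>")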
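When the allow-list comes from configuration or from several teams, it is easy for a context-switching tag to creep back in. The sketch below shows one way to guard against that; the helper names are hypothetical and the code uses only the standard library, so it makes no assumptions about the cleaner itself.

```python
# Tags named in the advisory that must stay out of any allow-list.
CONTEXT_SWITCHING_TAGS = frozenset({"svg", "math", "noscript"})


def _normalize(candidates):
    """Lowercase and trim tag names so 'SVG' and ' svg ' are caught too."""
    return {tag.strip().lower() for tag in candidates}


def safe_allow_tags(candidates):
    """Return the candidate tags with context-switching tags removed."""
    return frozenset(_normalize(candidates) - CONTEXT_SWITCHING_TAGS)


def rejected_tags(candidates):
    """Return the context-switching tags found in the candidates, sorted."""
    return sorted(_normalize(candidates) & CONTEXT_SWITCHING_TAGS)


if __name__ == "__main__":
    requested = ["a", "p", "div", "SVG"]
    print("allowed:", sorted(safe_allow_tags(requested)))
    print("rejected:", rejected_tags(requested))
```

The filtered set can then be passed as the allow_tags argument in place of a hand-maintained list, and a non-empty rejected_tags result is worth logging or failing on during configuration loading.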

For more information, refer to the following references:

- lxml release notes: https://lxml.de/4./changes-4...html
- CVE details: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2024-52595
- lxml documentation: https://lxml.de/2./C.html

Conclusion

Versions of lxml_html_clean prior to 0.4.0 have a vulnerability (CVE-2024-52595) that enables potential XSS attacks. Users who sanitize HTML content in security-sensitive contexts should upgrade to lxml_html_clean 0.4.0 or apply the temporary mitigations described above.

Timeline

Published on: 11/19/2024 22:15:21 UTC
Last modified on: 11/25/2024 14:27:38 UTC