Nokogiri is one of Ruby’s most trusted gems for parsing HTML and XML. Developers from all over use it to scrape and transform web data. But in 2022, Nokogiri faced a surprising security problem affecting versions before 1.13.4: its HTML encoding detection relied on a problematic regular expression. This could let attackers freeze your app with a specially crafted document—a serious danger for web services and automation.

In this article, let’s break down what CVE-2022-24836 means, see how the vulnerable regex works, look at the details behind the exploit, and learn what you should do if you’re running affected Nokogiri versions.

Fixed in: Nokogiri 1.13.4

- Original advisory: GitHub Security Advisory GHSA-crjr-9rc5-ghw8

Nokogiri used a regular expression to detect the encoding of HTML documents. If someone supplies a specially crafted HTML file, this regex could eat up massive CPU time, causing your Ruby process to hang.

How Does the Vulnerable Code Work?

In Nokogiri < 1.13.4, a part of the code tries to find the encoding declared in an HTML document by matching against a regex. The vulnerable regex looked for things like <meta charset="UTF-8">. Here’s a simplified version of what was happening:

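# Simplified illustration of charset sniffing (not Nokogiri's actual source)
html_doc = '<meta charset="UTF-8">'
encoding = html_doc[/<meta\s+[^>]*charset=["']?([^"'>\s]*)/i, 1]
# encoding is "UTF-8" for this well-formed input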

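Why is this an issue? The regex engine can fall into _excessive backtracking_ when it has to check a large or maliciously crafted HTML string that nearly, but not quite, matches the expected pattern. Each failed attempt rescans a long stretch of input, so the total work grows much faster than the input size.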

Suppose someone sends you HTML like this:

<meta  chaaaaaaaaaaaaaaaaaaaaa <meta  chaaaaaaaaaaaaaaa ...

The regex retries from every <meta opener, scanning ahead and then backtracking each time, which burns a large amount of CPU time.
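To get a feel for how that work adds up, here is a rough cost model in Ruby. It is a sketch under a simplifying assumption, not a trace of the real regex engine: it assumes every `<meta` opener followed by whitespace is a start position, and that a failed attempt from there consumes and then gives back every character up to the next `>` or the end of the input. The helper name `estimated_scan_steps` is hypothetical.

```ruby
# Rough cost model (an assumption, not what the regex engine literally does):
# sum, over every "<meta" + whitespace opener, the number of characters that
# a greedy [^>]* would consume before having to backtrack.
def estimated_scan_steps(html)
  steps = 0
  html.scan(/<meta\s/i) do
    tail_start = Regexp.last_match.end(0)             # first char after the opener
    stop = html.index(">", tail_start) || html.length # where [^>]* has to stop
    steps += stop - tail_start
  end
  steps
end

small = estimated_scan_steps("<meta " * 1_000)
large = estimated_scan_steps("<meta " * 2_000)
puts "1000 openers: #{small} estimated steps"
puts "2000 openers: #{large} estimated steps (#{(large.to_f / small).round(2)}x)"
```

Under this model, doubling the number of unclosed openers roughly quadruples the estimated work. That quadratic growth is what makes such a payload cheap to send and expensive to process.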

Python Demo (the issue is similar in Ruby)

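import re
import time
# Unclosed "<meta " openers: each one is a new start position that scans to the end
bad_html = '<meta ' * 10000
pattern = r'<meta\s+[^>]*charset=["\']?([^"\'>\s]*)'
start = time.perf_counter()
re.search(pattern, bad_html, re.IGNORECASE)
print(f'search took {time.perf_counter() - start:.2f}s')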

Ruby’s regex engine will behave similarly, taking an extremely long time to process the malicious input.
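For a Ruby-side comparison, the sketch below times the same simplified pattern against growing payloads. It uses the illustrative regex from this article, not Nokogiri's actual source, and `build_payload` and `timed_charset_lookup` are hypothetical helper names. Timings depend heavily on the Ruby version: Ruby 3.2 and later include regex engine optimizations that may flatten the curve, so treat the numbers as indicative only.

```ruby
require "benchmark"

# Simplified pattern from this article, not Nokogiri's actual source.
CHARSET_PATTERN = /<meta\s+[^>]*charset=["']?([^"'>\s]*)/i

# Hypothetical helper: a payload made only of unclosed "<meta " openers.
def build_payload(openers)
  "<meta " * openers
end

# Returns [captured charset or nil, elapsed seconds] for one lookup.
def timed_charset_lookup(html)
  charset = nil
  seconds = Benchmark.realtime { charset = html[CHARSET_PATTERN, 1] }
  [charset, seconds]
end

[1_000, 2_000, 4_000].each do |openers|
  charset, seconds = timed_charset_lookup(build_payload(openers))
  puts format("%5d openers: charset=%p, %.3fs", openers, charset, seconds)
end
```

If the elapsed time grows much faster than the payload size on your Ruby version, you are seeing the backtracking cost described above. The payload never matches, so the captured charset stays `nil` throughout.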

Exploit Details

The issue is not about leaking data, but about consuming all your CPU—your app appears to freeze or becomes non-responsive. Any Nokogiri-powered app that parses user-supplied HTML is at risk, especially web crawlers, data importers, and servers that let users upload or submit HTML content.

Prepare Malicious HTML

The attacker builds a document with a <meta ...> tag containing tons of filler before any charset= appears.

Watch It Hang

The server’s CPU spikes. Each regex operation that tries to find the charset slows to a crawl, and requests may time out until the stuck process is restarted.

The Fix: Nokogiri >= 1.13.4

Nokogiri’s maintainers fixed the regex and refactored the code to avoid catastrophic backtracking for input like the above. They also added a microbenchmark and tests to catch future performance issues.

No easy workaround exists: you have to upgrade.

Update RubyGems and bump your Gemfile, or install manually:

# Install the latest release directly
gem install nokogiri
# or, with Bundler, require the fixed version in your Gemfile:
#   gem 'nokogiri', '>= 1.13.4'
# then update the lockfile
bundle update nokogiri

You are affected if you run Nokogiri < 1.13.4 and parse or process untrusted HTML (user uploads, web scrapes, etc.).

You are not affected if you only process trusted data, but upgrading is always good security hygiene.
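To check whether a given installation falls in the affected range, you can compare its version string against 1.13.4. The sketch below uses RubyGems' `Gem::Version` for the comparison. The helper name `affected_nokogiri?` is hypothetical, and it takes the version as a string so that it runs without Nokogiri installed.

```ruby
require "rubygems"

# First release that contains the fix, per the advisory.
FIXED_VERSION = Gem::Version.new("1.13.4")

# Hypothetical helper: true when the given version predates the fix.
def affected_nokogiri?(version_string)
  Gem::Version.new(version_string) < FIXED_VERSION
end

%w[1.12.5 1.13.3 1.13.4 1.14.0].each do |version|
  status = affected_nokogiri?(version) ? "affected" : "fixed"
  puts "#{version}: #{status}"
end
```

In an application that already loads Nokogiri, you could pass `Nokogiri::VERSION` to the helper, for example in a boot-time check or a CI step.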

Resources & References

- GitHub advisory: GHSA-crjr-9rc5-ghw8
- Nokogiri’s Changelog
- CVE Details for CVE-2022-24836

Conclusion

CVE-2022-24836 is a reminder that even simple things like regular expressions can introduce security risks—especially for popular tools like Nokogiri. While the bug doesn’t allow data theft, it can take down critical services with a well-timed bad document.

If you use Nokogiri, upgrade today to 1.13.4 or later. Don’t leave your Ruby apps open to a denial-of-service, even if it comes from something as ordinary as a regex!

Timeline

Published on: 04/11/2022 22:15:00 UTC
Last modified on: 08/15/2022 11:18:00 UTC