In the fast-moving world of open source, patching security holes is an ongoing challenge—especially when older branches and non-standard features are involved. Let’s dive into the story behind CVE-2022-30973, a denial of service (DoS) vulnerability that appeared in Apache Tika due to a missed fix for a previous issue (CVE-2022-30126) in the 1.x branch during the 1.28.2 release.

This post breaks down the bug, explains the fix, and shows how a small, overlooked piece of code created a window of risk. If you’re still running Tika 1.28.2 (or earlier in the 1.x branch) with StandardsExtractingContentHandler, it’s time to double-check your setup.

Background: What Is Apache Tika?

Apache Tika is a toolkit for detecting and extracting metadata and text from over a thousand different file types. It’s widely used for search indexing, content analysis, and more.

Tika works by using "content handlers" to process the extracted data. Most users rely on the standard handlers—but there are also specialized ones, like StandardsExtractingContentHandler, which are less common but powerful.

The Problem: CVE-2022-30973 Explained

A regular expression (regex) in Tika’s StandardsText class was not written defensively. A specially crafted file could drive it into excessive backtracking, pinning the CPU and stalling extraction for as long as the match kept running. This is a classic ReDoS (Regular Expression Denial of Service).

Here’s the catch:  
The vulnerability only affected those using the StandardsExtractingContentHandler (a non-standard handler); ordinary users were NOT exposed.

How Did This Happen?

This bug was actually supposed to have been fixed when CVE-2022-30126 was addressed. That patch landed cleanly in Tika’s 2.x branch *and* the 1.28.2 release—*or so everyone thought*.

But:  
The actual fix for the regex in the StandardsText class wasn't properly applied to the 1.x branch during the 1.28.2 release process. This meant the bug lived on for another cycle in 1.x, and only finally got fixed in 1.28.3, after someone realized the mistake.

Why Is Regex Dangerous?

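Regexes are powerful, but badly constructed patterns can introduce “catastrophic backtracking.” That happens when nested or overlapping quantifiers let the engine split the same text in many different ways: on input that almost matches, the number of attempts can grow exponentially with input length, slowing your app to a crawl.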

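Here’s a made-up (but similar) example:

String regex = "(a+)+b";
String input = "aaaaaaaaaaaaaaaaaaaaaaa"; // long string of a's, no 'b'
// The match must fail; a backtracking engine without safeguards
// tries exponentially many ways to group the a's before giving up.
boolean matched = input.matches(regex);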

If the StandardsText class used something like this for parsing, a crafted input could hog all server resources—leading to a DoS.
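To see why the failing case is so expensive, it helps to count the candidate groupings. The sketch below is not a regex engine and does not measure Java's matcher; it only counts the ways n consecutive 'a' characters can be cut into non-empty runs, which is the set of groupings a naive backtracking matcher may have to rule out for (a+)+ before reporting failure. The class and method names are made up for illustration.

```java
// Illustrative only: counts candidate groupings, does not run a regex.
class SplitCounter {

    // Number of ways to cut n identical characters into consecutive,
    // non-empty runs. Each way is one grouping that (a+)+ could try.
    static long countSplits(int n) {
        long[] ways = new long[n + 1];
        ways[0] = 1; // an empty remainder can be "split" in exactly one way
        for (int len = 1; len <= n; len++) {
            // Choose the length of the first run, then split the rest.
            for (int first = 1; first <= len; first++) {
                ways[len] += ways[len - first];
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 10, 20, 23}) {
            System.out.println(n + " chars -> " + countSplits(n) + " groupings");
        }
    }
}
```

The count doubles with every extra character (2^(n-1) groupings for n characters), so the 23-character input above already allows about 4.2 million groupings. Real engines may prune some of this work, so treat the number as intuition for the growth rate rather than a measurement.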

Vulnerable Regex (simplified illustration of the StandardsText pattern)

private static final Pattern BAD_PATTERN = Pattern.compile(
    "(\\d+\\.\\d+\\s*)+"
);


*Note: The exact regex was more involved, matching "standards" patterns, but this gets across the gist.*

What Happened?

Someone realized that carefully constructed "standards" text would make this regex backtrack enormously.

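A safer approach is to stop the engine from backtracking into the repeated group, using possessive quantifiers or an atomic group, or to rewrite the pattern altogether. Simply making the group non-capturing with (?:...) does not help, because it changes nothing about how the engine backtracks:

private static final Pattern SAFER_PATTERN = Pattern.compile(
    "(?:\\d++\\.\\d++\\s*+)++"
); // possessive quantifiers never give characters back; a rewrite with less nesting also works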

For a real fix, even more defensive coding might be used—see the patch pull request.
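One extra layer of defense that is often paired with a pattern rewrite is capping how much text the regex ever sees. The sketch below combines both ideas on the simplified pattern from this post. It is not Tika's actual patch: the class name, the MAX_CHARS limit, and the helper method are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
class BoundedStandardsMatcher {

    // Assumed limit; pick a value that fits your documents.
    static final int MAX_CHARS = 10_000;

    // Possessive quantifiers (++ and *+) never give characters back,
    // so a failed match cannot re-try alternative groupings.
    static final Pattern SECTION_NUMBERS =
            Pattern.compile("(?:\\d++\\.\\d++\\s*+)++");

    // Returns true if the (possibly truncated) text consists only of
    // section-like numbers such as "1.1 2.3 10.4".
    static boolean looksLikeSectionNumbers(CharSequence text) {
        CharSequence bounded = text.length() > MAX_CHARS
                ? text.subSequence(0, MAX_CHARS)
                : text;
        return SECTION_NUMBERS.matcher(bounded).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeSectionNumbers("1.1 2.3 10.4"));
        System.out.println(looksLikeSectionNumbers("1.1 2.3 x"));
    }
}
```

Both choices trade some accuracy for predictability. The possessive pattern rejects run-together input such as "1.11.1", which the backtracking version would accept by splitting it into "1.1" and "1.1", and truncation can cut a number in half at the limit. Whether that is acceptable depends on what the matches are used for.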

How to Exploit It

If you send a specially crafted file (e.g., a Word or PDF with "standards" lines that match the vulnerable regex), and it gets processed with StandardsExtractingContentHandler, then you could hang the Tika process and tie up system resources.

Simple PoC (conceptual)

String evilInput = "1.1 1.1 1.1 1.1 ... (many times, crafted just right)";
Matcher m = Pattern.compile("(\\d+\\.\\d+\\s*)+").matcher(evilInput);
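// Conceptual only: this simplified pattern returns quickly here.
// With the vulnerable StandardsText pattern, the equivalent call could hang or take a very long time.
boolean found = m.find();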

In practice, a file containing this pattern would trigger the DoS during content extraction.

The Official Fix

The Tika team patched this in 1.28.3 (see also GitHub PR #655). The regex in question was rewritten to eliminate the ReDoS vector.

Upgrade to 1.28.3 or later if you’re on the 1.x branch.
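If you track dependency versions in a script or inventory tool, the check for the 1.x branch can be sketched as below. This is a hypothetical helper, not a Tika API. It assumes plain numeric versions such as "1.28.2" (a suffix like "-SNAPSHOT" would make it throw), and it only answers the question for 1.x; it says nothing about 2.x, which falls under the earlier CVE-2022-30126.

```java
// Hypothetical helper for inventory scripts; not a Tika API.
class TikaOneXCheck {

    // True if the version is on the 1.x branch and older than 1.28.3,
    // the release this post names as carrying the fix.
    static boolean needsOneXUpgrade(String version) {
        String[] parts = version.trim().split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        int patch = parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
        if (major != 1) {
            return false; // other branches are out of scope for this check
        }
        if (minor != 28) {
            return minor < 28;
        }
        return patch < 3;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.27", "1.28.2", "1.28.3"}) {
            System.out.println(v + " needs upgrade: " + needsOneXUpgrade(v));
        }
    }
}
```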

- If you use Tika’s non-default content handlers—or allow untrusted input—review your setup for regexes with similar patterns.
- Regularly audit any code (yours or open source) that processes untrusted data with regular expressions.
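For regexes you cannot easily rewrite, one commonly used guard in Java is to wrap the input in a CharSequence that checks a deadline each time the engine reads a character. The sketch below is a general technique, not something taken from Tika; the class name, the matchesWithin helper, and the choice of exception are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical guard: wraps the input so that a runaway match is aborted.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    // The regex engine reads its input through charAt, so each character
    // read during a long-running match passes through this check.
    @Override
    public char charAt(int index) {
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new IllegalStateException("regex match exceeded its time budget");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return inner.toString();
    }

    // Runs matches() against the text with a time budget in milliseconds.
    static boolean matchesWithin(Pattern pattern, CharSequence text, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(text, deadline)).matches();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(\\d+\\.\\d+\\s*)+");
        System.out.println(matchesWithin(p, "1.1 2.3 10.4", 1_000));
    }
}
```

The deadline check adds a small cost to every character read, and callers have to catch the unchecked exception and decide what a timed-out match means for them. It limits the damage of a bad pattern; it does not replace fixing the pattern.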

References

- CVE-2022-30973 at NVD
- CVE-2022-30126 at MITRE
- Tika 1.28.3 Release Notes
- Patch on GitHub
- OWASP ReDoS Explanation

Conclusion

CVE-2022-30973 shows how quickly a missed patch—especially in a less maintained branch—can leave users exposed, even if a fix exists upstream. If you run Tika with custom handlers, don’t assume upstream fixes reach you automatically. And always be wary of regexes over untrusted input!

Timeline

Published on: 05/31/2022 14:15:00 UTC
Last modified on: 07/22/2022 19:15:00 UTC