Default to ZSTD compression when writing Parquet #981
timsaucer merged 16 commits into apache:main
Conversation
… write_parquet method
python/datafusion/dataframe.py
Outdated
| """ | ||
| path (str | pathlib.Path): The file path to write the Parquet file. | ||
| compression (str): The compression algorithm to use. Default is "ZSTD". | ||
| compression_level (int | None): The compression level to use. For ZSTD, the |
We should document that the compression level is different per algorithm. It's only zstd that has a 1-22 range IIRC.
Do you mean like
compression_level (int | None): The compression level to use. For ZSTD, the
recommended range is 1 to 22, with the default being 3. Higher levels
provide better compression but slower speed.
python/datafusion/dataframe.py
Outdated
# default compression level to 3 for ZSTD
if compression == "ZSTD":
    if compression_level is None:
        compression_level = 3
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files
> 3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.

I used the default compression level from the zstd manual by Facebook (the author of zstd) - https://facebook.github.io/zstd/zstd_manual.html
I could not find a default in DuckDB's documentation.
hi @kylebarron ,
Shall we adopt delta-rs' default, and use 4 as the default ZSTD compression level?
Sure, that sounds good to me.
Thanks.
I have amended the default to 4.
In delta-rs the default is "snappy" compression, except for our optimize operation, which uses ZSTD(4).
python/datafusion/dataframe.py
Outdated
recommended range is 1 to 22, with the default being 4. Higher levels
provide better compression but slower speed.
"""
if compression == "ZSTD":
You might want to reuse this code I added in deltalake a while ago:
Thanks @ion-elgreco ,
I added the Compression enum but omitted the check_valid_levels because these are already implemented in the Rust DataFrame, e.g.
datafusion-python/src/dataframe.rs
Lines 490 to 493 in 63b13da
Compression levels are tested in:
datafusion-python/python/tests/test_dataframe.py
Lines 1093 to 1106 in 63b13da
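(For context, a rough sketch of what such a Compression enum with a level helper could look like. The names Compression, from_str, and get_default_level follow this thread; the member list and the GZIP/BROTLI defaults below are assumptions for illustration, not the merged implementation — only the ZSTD default of 4 comes from the discussion above.)

```python
from __future__ import annotations

from enum import Enum


class Compression(Enum):
    """Illustrative sketch of a Parquet compression enum (not the merged code)."""

    UNCOMPRESSED = "uncompressed"
    SNAPPY = "snappy"
    GZIP = "gzip"
    BROTLI = "brotli"
    LZ4 = "lz4"
    ZSTD = "zstd"

    @classmethod
    def from_str(cls, value: str) -> Compression:
        # Accept case-insensitive strings such as "ZSTD" or "zstd".
        try:
            return cls(value.lower())
        except ValueError as err:
            raise ValueError(f"{value!r} is not a valid Compression") from err

    def get_default_level(self) -> int | None:
        # Only some codecs take a level; GZIP/BROTLI defaults here are assumptions,
        # while ZSTD's 4 follows the agreement earlier in this thread.
        defaults = {Compression.GZIP: 6, Compression.BROTLI: 1, Compression.ZSTD: 4}
        return defaults.get(self)
```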
…s' write_parquet method
…t writing method doc
…t compression levels
4d3fd8d to e7ec09b
e7ec09b to 41e1742
timsaucer left a comment
Overall this is a very nice addition. It looks like there is a slight adjustment that ruff is complaining about which you'll need to fix for CI. My comments here are all minor.
python/datafusion/dataframe.py
Outdated
self,
path: str | pathlib.Path,
- compression: str = "uncompressed",
+ compression: str = Compression.ZSTD.value,
It would be nice to have compression take the type str | Compression, do a quick check, and convert the passed value to a Compression.
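(A minimal sketch of that suggestion, reusing the Compression sketch above and shown as a free function for brevity; the signature and body are illustrative, not the merged method:)

```python
from __future__ import annotations

import pathlib


def write_parquet(
    path: str | pathlib.Path,
    compression: str | Compression = Compression.ZSTD,
    compression_level: int | None = None,
) -> None:
    """Sketch of the suggested signature; the real writer lives on DataFrame."""
    # Normalize a plain string such as "zstd" into a Compression member before
    # handing its .value and the level on to the Rust writer.
    if isinstance(compression, str):
        compression = Compression.from_str(compression)
    # ... pass compression.value and compression_level to the Rust layer ...
```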
python/datafusion/dataframe.py
Outdated
if compression_enum in {Compression.GZIP, Compression.BROTLI, Compression.ZSTD}:
    if compression_level is None:
        compression_level = compression_enum.get_default_level()
Rather than doing the checking here, it would be slightly more ergonomic to just call compression_enum.get_default_level() and have it return None rather than raise an error. But I could also see how some would see calling get_default_level on the others as invalid. I'm not married to this idea.
This passes the None handling to Rust.
No tests broken, so this is a good ergonomic suggestion.
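(In other words, the call site can collapse to something like the following sketch, assuming get_default_level returns None for codecs without a level and that the Rust side treats None as "apply its own default":)

```python
# No membership check needed: codecs without a level simply return None,
# and None is passed through to the Rust writer unchanged.
if compression_level is None:
    compression_level = compression_enum.get_default_level()
```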
python/datafusion/dataframe.py
Outdated
| """Convert a string to a Compression enum value. | ||
|
|
||
| Args: | ||
| value (str): The string representation of the compression type. |
nit: since the type hint indicates a str, you shouldn't have to repeat it here, per the Google Python style guide.
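(For example, under Google-style docstrings the parameter description drops the repeated type — a hypothetical rendering, not the exact merged docstring:)

```python
def from_str(value: str) -> "Compression":
    """Convert a string to a Compression enum value.

    Args:
        value: The string representation of the compression type.

    Returns:
        The corresponding Compression enum value.
    """
```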
python/datafusion/dataframe.py
Outdated
| """Get the default compression level for the compression type. | ||
|
|
||
| Returns: | ||
| int: The default compression level. |
nit: int not required since it's in the hint
It looks like some minor difference in ruff versions probably caused yours to pass and the CI to fail. I pushed a correction to this branch.
Thank you for another great addition, @kosiew!

Which issue does this PR close?
Closes #978.
Rationale for this change
Currently, the write_parquet method defaults to "uncompressed" Parquet files, which can lead to inefficient storage and slower performance during I/O operations. This change sets the default compression method to "ZSTD", a modern compression algorithm that provides an excellent balance of compression speed and ratio. Additionally, it introduces a default compression level of 3 for ZSTD, which is optimal for many use cases.
What changes are included in this PR?
Updated the default compression parameter in the write_parquet method from "uncompressed" to "ZSTD".
Introduced a default compression level of 3 for ZSTD if no level is specified.
Added validation to ensure the compression level for ZSTD falls within the valid range (1 to 22) and raises a ValueError otherwise.
Updated the docstring to clarify the default values and provide guidance for users on compression levels.
Are there any user-facing changes?
Yes:
The default behavior of write_parquet now compresses output files using ZSTD with a default compression level of 3, instead of leaving files uncompressed.
Users specifying an invalid compression level for ZSTD will now encounter a ValueError.
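For illustration, a short sketch of the new behaviour. The file names are placeholders, and the exact set of accepted compression strings and the error message come from the underlying Rust writer rather than this PR text:

```python
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS a")

# Writes ZSTD-compressed Parquet by default instead of uncompressed files.
df.write_parquet("out_zstd.parquet")

# Opting out or choosing another codec explicitly is still possible.
df.write_parquet("out_uncompressed.parquet", compression="uncompressed")

# An out-of-range ZSTD level (valid range is 1 to 22) raises ValueError.
df.write_parquet("out_invalid.parquet", compression="zstd", compression_level=99)
```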