Default to ZSTD compression when writing Parquet #981
timsaucer merged 16 commits into apache:main
Conversation
… write_parquet method
python/datafusion/dataframe.py
Outdated
| """ | ||
| path (str | pathlib.Path): The file path to write the Parquet file. | ||
| compression (str): The compression algorithm to use. Default is "ZSTD". | ||
| compression_level (int | None): The compression level to use. For ZSTD, the |
We should document that the compression level is different per algorithm. It's only zstd that has a 1-22 range IIRC.
Do you mean like
compression_level (int | None): The compression level to use. For ZSTD, the
recommended range is 1 to 22, with the default being 3. Higher levels
provide better compression but slower speed.
python/datafusion/dataframe.py
Outdated
# default compression level to 3 for ZSTD
if compression == "ZSTD":
    if compression_level is None:
        compression_level = 3
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files
> 3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.

I used the default compression level from the zstd manual by Facebook (the author of zstd) - https://facebook.github.io/zstd/zstd_manual.html
I could not find a default in DuckDB's documentation.
hi @kylebarron ,
Shall we adopt delta-rs' default, and use 4 as the default ZSTD compression level?
Sure, that sounds good to me.
Thanks.
I have amended the default to 4.
In delta-rs the default is "snappy" compression, except for our optimize operation, which uses ZSTD(4).
python/datafusion/dataframe.py
Outdated
recommended range is 1 to 22, with the default being 4. Higher levels
provide better compression but slower speed.
"""
if compression == "ZSTD":
You might want to reuse this code I added in deltalake a while ago:
Thanks @ion-elgreco ,
I added the Compression enum but omitted the check_valid_levels because these are already implemented in the Rust DataFrame, e.g.
datafusion-python/src/dataframe.rs
Lines 490 to 493 in 63b13da
Compression levels are tested in:
datafusion-python/python/tests/test_dataframe.py
Lines 1093 to 1106 in 63b13da
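(For context, a rough sketch of what such a Compression enum with a level helper could look like. The names Compression, from_str, and get_default_level follow this thread; the member list and the GZIP/BROTLI defaults below are assumptions for illustration, not the merged implementation — only the ZSTD default of 4 comes from the discussion above.)

```python
from __future__ import annotations

from enum import Enum


class Compression(Enum):
    """Illustrative sketch of a Parquet compression enum (not the merged code)."""

    UNCOMPRESSED = "uncompressed"
    SNAPPY = "snappy"
    GZIP = "gzip"
    BROTLI = "brotli"
    LZ4 = "lz4"
    ZSTD = "zstd"

    @classmethod
    def from_str(cls, value: str) -> Compression:
        # Accept case-insensitive strings such as "ZSTD" or "zstd".
        try:
            return cls(value.lower())
        except ValueError as err:
            raise ValueError(f"{value!r} is not a valid Compression") from err

    def get_default_level(self) -> int | None:
        # Only some codecs take a level; GZIP/BROTLI defaults here are assumptions,
        # while ZSTD's 4 follows the agreement earlier in this thread.
        defaults = {Compression.GZIP: 6, Compression.BROTLI: 1, Compression.ZSTD: 4}
        return defaults.get(self)
```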
…s' write_parquet method
…t writing method doc
…t compression levels
4d3fd8d to e7ec09b
e7ec09b to 41e1742
timsaucer left a comment
Overall this is a very nice addition. It looks like there is a slight adjustment that ruff is complaining about which you'll need to fix for CI. My comments here are all minor.
python/datafusion/dataframe.py
Outdated
self,
path: str | pathlib.Path,
- compression: str = "uncompressed",
+ compression: str = Compression.ZSTD.value,
It would be nice to have compression take the type str | Compression, do a quick check, and convert the passed value to a Compression.
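(A minimal sketch of that suggestion, reusing the Compression sketch above and shown as a free function for brevity; the signature and body are illustrative, not the merged method:)

```python
from __future__ import annotations

import pathlib


def write_parquet(
    path: str | pathlib.Path,
    compression: str | Compression = Compression.ZSTD,
    compression_level: int | None = None,
) -> None:
    """Sketch of the suggested signature; the real writer lives on DataFrame."""
    # Normalize a plain string such as "zstd" into a Compression member before
    # handing its .value and the level on to the Rust writer.
    if isinstance(compression, str):
        compression = Compression.from_str(compression)
    # ... pass compression.value and compression_level to the Rust layer ...
```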
python/datafusion/dataframe.py
Outdated
if compression_enum in {Compression.GZIP, Compression.BROTLI, Compression.ZSTD}:
    if compression_level is None:
        compression_level = compression_enum.get_default_level()
Rather than doing the checking here, it would be slightly more ergonomic to just call compression_enum.get_default_level() and have it return None rather than raise an error. But I could also see how some would see calling get_default_level on the others as invalid. I'm not married to this idea.
This passes the None handling to Rust.
No tests broken, so this is a good ergonomic suggestion.
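(In other words, the call site can collapse to something like the following sketch, assuming get_default_level returns None for codecs without a level and that the Rust side treats None as "apply its own default":)

```python
# No membership check needed: codecs without a level simply return None,
# and None is passed through to the Rust writer unchanged.
if compression_level is None:
    compression_level = compression_enum.get_default_level()
```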
python/datafusion/dataframe.py
Outdated
| """Convert a string to a Compression enum value. | ||
|
|
||
| Args: | ||
| value (str): The string representation of the compression type. |
nit: since the type hint indicates a str, you shouldn't have to repeat it here, per the Google Python style guide.
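(For example, under Google-style docstrings the parameter description drops the repeated type — a hypothetical rendering, not the exact merged docstring:)

```python
def from_str(value: str) -> "Compression":
    """Convert a string to a Compression enum value.

    Args:
        value: The string representation of the compression type.

    Returns:
        The corresponding Compression enum value.
    """
```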
python/datafusion/dataframe.py
Outdated
| """Get the default compression level for the compression type. | ||
|
|
||
| Returns: | ||
| int: The default compression level. |
nit: int not required since it's in the hint
It looks like some minor difference in ruff versions probably caused yours to pass and the CI to fail. I pushed a correction to this branch.
Thank you for another great addition, @kosiew!

Which issue does this PR close?
Closes #978.
Rationale for this change
Currently, the write_parquet method defaults to "uncompressed" Parquet files, which can lead to inefficient storage and slower performance during I/O operations. This change sets the default compression method to "ZSTD", a modern compression algorithm that provides an excellent balance of compression speed and ratio. Additionally, it introduces a default compression level of 3 for ZSTD, which is optimal for many use cases.
What changes are included in this PR?
Updated the default compression parameter in the write_parquet method from "uncompressed" to "ZSTD".
Introduced a default compression level of 3 for ZSTD if no level is specified.
Added validation to ensure the compression level for ZSTD falls within the valid range (1 to 22) and raises a ValueError otherwise.
Updated the docstring to clarify the default values and provide guidance for users on compression levels.
Are there any user-facing changes?
Yes:
The default behavior of write_parquet now compresses output files using ZSTD with a default compression level of 3, instead of leaving files uncompressed.
Users specifying an invalid compression level for ZSTD will now encounter a ValueError.
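For illustration, a short sketch of the new behaviour. The file names are placeholders, and the exact set of accepted compression strings and the error message come from the underlying Rust writer rather than this PR text:

```python
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT 1 AS a")

# Writes ZSTD-compressed Parquet by default instead of uncompressed files.
df.write_parquet("out_zstd.parquet")

# Opting out or choosing another codec explicitly is still possible.
df.write_parquet("out_uncompressed.parquet", compression="uncompressed")

# An out-of-range ZSTD level (valid range is 1 to 22) raises ValueError.
df.write_parquet("out_invalid.parquet", compression="zstd", compression_level=99)
```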