feat: Support Parquet writer options #1123
timsaucer
left a comment
Overall, I really like the idea. Right now this includes a breaking change to a very popular user-facing function. I think if we take the suggestion to allow two function signatures, we'll be able to include this in the next release.
timsaucer
left a comment
Very nice. Thank you!
I might add a follow-on PR that would overload write_parquet so it detects whether it was passed these options or the old signature. I don't think that's blocking for what you have here.
I think write_parquet_with_options would be a slightly more explicit function name, but also not blocking for this PR.
If you can resolve the merge conflicts, I'll rerun CI and if all goes through I can merge it in soon.
Thank you again!
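The overload idea mentioned above could be sketched roughly as follows. This is a minimal illustration and not the datafusion-python implementation: `WriterOptions` and `resolve_writer_options` are hypothetical names, and the old-style arguments (a compression name plus an optional level) are assumed for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class WriterOptions:
    """Hypothetical stand-in for an options object; not the real class."""
    compression: str = "zstd(3)"


def resolve_writer_options(
    compression: Union[str, WriterOptions] = "zstd",
    compression_level: Optional[int] = None,
) -> WriterOptions:
    """Accept either an options object or old-style arguments (assumed shape)."""
    if isinstance(compression, WriterOptions):
        # New-style call: the options object already carries everything.
        if compression_level is not None:
            raise ValueError("compression_level cannot be combined with an options object")
        return compression
    # Old-style call: fold the separate level into a single string.
    if compression_level is not None:
        compression = f"{compression}({compression_level})"
    return WriterOptions(compression=compression)
```

Dispatching on the argument type keeps existing call sites working while letting new callers pass a single options object.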
Conflicts are resolved. I also renamed to |
Looks great. There are some minor ruff errors. After that it looks good to merge!
Which issue does this PR close?
N/A.
Rationale for this change
Supporting all Parquet writer options allows us more flexibility when creating data directly from datafusion-python.
For consistency, it supports all writer options defined by ParquetOptions in datafusion, using the same defaults: https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423.
What changes are included in this PR?
write_parquet with all writer options, including column-specific options.
pyarrow does not expose page-level information, so some options could not be directly tested, like enabling bloom filters (an external tool confirmed that this option works). For this specific case, there is a test that compares the file sizes.
Are there any user-facing changes?
The main difference relates to the existing compression field, which now uses a str like datafusion, instead of a custom enum. The main advantage is that future algorithms will not require updating the Python-side code.
Additionally, the default compression was changed from zstd(4) to zstd(3), the same as datafusion.
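To illustrate the string form of the compression setting, here is a small sketch that splits a value such as zstd(3) into a codec name and an optional level. The helper `parse_compression` is hypothetical and only shows the shape of the string; it is not part of datafusion-python and does not validate codec names.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of datafusion-python: matches a codec name
# optionally followed by a numeric level in parentheses, e.g. "zstd(3)".
_COMPRESSION_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_compression(spec: str) -> Tuple[str, Optional[int]]:
    """Split a compression string into (codec, level); level may be None."""
    match = _COMPRESSION_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized compression spec: {spec!r}")
    codec = match.group(1).lower()
    level = int(match.group(2)) if match.group(2) is not None else None
    return codec, level


print(parse_compression("zstd(3)"))
print(parse_compression("snappy"))
```

Because the value is a plain string, a codec added later needs no new Python-side enum member, which is the advantage described above.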