Add benchmark script and documentation for maximizing CPU usage in DataFusion Python #1216
Merged
timsaucer merged 3 commits into apache:main on Aug 31, 2025
timsaucer (Member) reviewed on Aug 30, 2025 and left a comment:
This is an excellent addition!
I think it could benefit from a little extra text, in the online documentation or in the script itself, telling users that this benchmark is an example of one type of operation. The actual performance they see can be affected by a variety of factors, including the types of table providers they are using, the I/O their setup requires, and the operations they are performing. Users are encouraged to build a similar benchmark for themselves and evaluate it on their own hardware and workloads.
timsaucer approved these changes on Aug 31, 2025
Which issue does this PR close?
Rationale for this change
This change provides users with practical guidance and examples for tuning DataFusion’s parallelism to maximize CPU utilization. By documenting configuration options and including a benchmark script, users can better understand how to configure partitions and repartitioning to improve query performance.
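As a rough illustration of the kind of tuning this rationale refers to, the sketch below collects partition and repartitioning options into a plain dictionary. It is not the code added by this PR. The helper names `parallelism_settings` and `make_context` are hypothetical, and the sketch assumes that the `datafusion` package's `SessionConfig` accepts a dictionary of string options, so treat `make_context` as an assumption to check against the installed version.

```python
import os


def parallelism_settings(target_partitions=None):
    """Return DataFusion option strings aimed at spreading work across cores.

    Hypothetical helper for illustration; the option keys are DataFusion
    configuration settings, and the values are passed as strings.
    """
    # Fall back to 1 if the core count cannot be determined.
    n = target_partitions or os.cpu_count() or 1
    return {
        "datafusion.execution.target_partitions": str(n),
        "datafusion.optimizer.repartition_joins": "true",
        "datafusion.optimizer.repartition_aggregations": "true",
        "datafusion.optimizer.repartition_windows": "true",
    }


def make_context(settings):
    """Build a session from the settings (assumes `pip install datafusion`)."""
    # Imported lazily so parallelism_settings() works without the package.
    from datafusion import SessionConfig, SessionContext

    return SessionContext(SessionConfig(settings))


settings = parallelism_settings(target_partitions=8)
for key, value in settings.items():
    print(f"{key} = {value}")
```

Whether a higher partition count actually helps depends on the workload and hardware, so the values here are starting points rather than recommendations.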
What changes are included in this PR?
Added a new benchmark script, benchmarks/max_cpu_usage.py, showing how to configure DataFusion for optimal parallelism and measure the performance impact.
Updated README.md with a reference to the new documentation section.
Expanded the user guide (docs/source/user-guide/configuration.rst) with a new section, Maximizing CPU Usage, including: SessionConfig for higher partition counts.
Are these changes tested?
The new benchmarks/max_cpu_usage.py script serves as a functional test and demonstration of the configuration options. It generates synthetic data and measures query performance, showing the impact of partitioning. While not a formal unit test, it validates correct behavior of the partitioning and parallelism features.
Are there any user-facing changes?
Yes:
benchmarks/ for users to run and test parallelism configuration.
No breaking API changes are introduced.
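Following the review suggestion that users build a similar benchmark for their own hardware and workloads, here is a minimal timing harness sketch. It is not the `benchmarks/max_cpu_usage.py` script from this PR. The names `time_runs` and `compare` are hypothetical, and the workloads are stand-in pure-Python callables. In practice each callable would wrap a query executed under a different session configuration.

```python
import statistics
import time


def time_runs(run, repeats=5, warmup=1):
    """Time a zero-argument callable and summarize the wall-clock samples."""
    # Warm-up runs are executed but not recorded.
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return {
        "best": min(samples),
        "median": statistics.median(samples),
        "runs": len(samples),
    }


def compare(workloads, repeats=5):
    """Time each named workload with the same number of repeats."""
    return {name: time_runs(run, repeats) for name, run in workloads.items()}


# Stand-in workloads; replace with callables that execute real queries.
results = compare(
    {
        "small": lambda: sum(range(10_000)),
        "large": lambda: sum(range(100_000)),
    }
)
for name, stats in results.items():
    print(f"{name}: best={stats['best']:.6f}s median={stats['median']:.6f}s")
```

Reporting both the best and the median time makes it easier to spot noisy runs; as the review notes, results from one machine and one type of operation should not be generalized.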