a6fd8e4 to
f7af294
python/datafusion/io.py
Outdated
from datafusion.dataframe import DataFrame
from datafusion.expr import Expr
import pyarrow
Side note: it would be great to use ruff (https://stackoverflow.com/a/77876298) or isort to deterministically and programmatically sort python imports, and validate that in CI. I think isort/ruff would have a newline here between the third-party and first-party imports.
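To make the suggested layout concrete, here is a toy sketch of the grouping described above, applied to the three imports in the diff. It is not how isort or ruff work internally; the `FIRST_PARTY` set and both helper functions are hypothetical names made up for this illustration, and the sketch ignores standard-library sections, multi-line imports, and comments.

```python
# Toy illustration only; real tools (isort, ruff's import-sorting rules)
# handle many more cases than this.
FIRST_PARTY = {"datafusion"}  # assumption: the package treated as first-party


def top_level_module(line: str) -> str:
    """Return the top-level package named by a single-line import statement."""
    # "from a.b import c" -> "a"; "import a.b" -> "a"
    return line.split()[1].split(".")[0]


def group_imports(lines: list[str]) -> str:
    """Place third-party imports first, then a blank line, then first-party."""
    third_party = sorted(l for l in lines if top_level_module(l) not in FIRST_PARTY)
    first_party = sorted(l for l in lines if top_level_module(l) in FIRST_PARTY)
    return "\n".join(third_party + [""] + first_party)


snippet = [
    "from datafusion.dataframe import DataFrame",
    "from datafusion.expr import Expr",
    "import pyarrow",
]
grouped = group_imports(snippet)
print(grouped)
```

The printed result puts `import pyarrow` in its own group, followed by a blank line and the two `datafusion` imports.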
There is a pre-commit config for the ruff linter and formatter:
datafusion-python/.pre-commit-config.yaml
Lines 23 to 30 in 79c22d6
As the SO answer above explains, import sorting isn't currently part of the default ruff-format behavior. We'd need to opt-in by adding an I element here:
datafusion-python/pyproject.toml
Line 66 in 79c22d6
timsaucer
left a comment
I'm not opposed to this addition, but there is a potential source of confusion that we can mitigate with documentation. If a new user creates a session context themselves and registers functions, and then creates a dataframe using this method, the functions they registered will not be available. I think it could lead to a fair amount of confusion.
I think this is easily mitigated by adding documentation to these functions that describes that it uses a default global session context and if the user needs a custom context they need to use the functions .
ba61c8f to
b166bdd
I added a few lines to the documentation, rebased, and applied updated ruff formatting.
Key Points:
Details: After replacing all
Just for reference, these are the scripts I used to generate and test the functions:
```
#### test.py
from datafusion import SessionContext

# Create a new session
ctx = SessionContext()

# Read different file formats
df1 = ctx.read_csv("data.csv")  # Accepts str or Path
print(df1)

#### create.py - to create the data files
# Sample data as a dictionary
data = {...}  # entries elided in the source

# Create DataFrame
df = pd.DataFrame(data)

# Save as Parquet
df.to_parquet('data.parquet')

# Save as JSON (line-delimited)
with open('data.json', 'w') as f:
    ...  # body elided in the source

# Save as Avro
schema = {...}  # entries elided in the source
records = df.to_dict('records')
```
There needs to be an initial import. There are ruff rules to check this, and we should turn them on.
Thanks, Kyle. More generally, I'll see about the impact of turning on all the rules and then removing a few specifically as needed.
Which issue does this PR close?