Introduce Table wrapper, unify table registration via register_table; deprecate legacy APIs #1243
kosiew wants to merge 23 commits into apache:main
docs/tests, add DataFrame view support, and improve Send/concurrency support. migrates the codebase from using `Table` to a `TableProvider`-based API, refactors registration and access paths to simplify catalog/context interactions, and updates documentation and examples. DataFrame view handling is improved (`into_view` is now public), the test-suite is expanded to cover new registration and async SQL scenarios, and `TableProvider` now supports the `Send` trait across modules for safer concurrency. Minor import cleanup and utility adjustments (including a refined `pyany_to_table_provider`) are included.
DataFrame→TableProvider conversion, plus tests and FFI/pycapsule improvements.
-- Registration logic & API
* Refactor of table provider registration logic for improved clarity and simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces surprising side effects).
* Update table registration method to call `register_table` instead of `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more provider types can be registered uniformly.
* Improve error messages related to registration failures (missing PyCapsule name and DataFrame registration errors).
-- DataFrame ↔ TableProvider conversions
* Introduce utility functions to simplify table provider conversions and centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.
-- FFI / PyCapsule handling
* Update `FFI_TableProvider` initialization to accept an optional parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help debugging.
-- DeltaTable & specific integrations
* Update TableProvider registration for `DeltaTable` to use the correct registration method (matches the new API surface).
-- Tests, docs & minor fixes
* Add tests for registering a `TableProvider` from a `DataFrame` and from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility; review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.
Notes:
* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
utilities, docs, and robustness fixes
* Normalized table-provider handling and simplified registration flow across the codebase; multiple commits centralize provider coercion and normalization.
* Introduced utility helpers (`coerce_table_provider`, `extract_table_provider`, `_normalize_table_provider`) to centralize extraction, error handling, and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer implementations, direct returns of DataFrame views where appropriate, and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for invalid registrations; added handling for `DataFrameWrapper` by converting it to a view.
* Added tests, including a schema registration test using a PyArrow dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with parameter details, added usage examples for `into_view`, and documented deprecations (e.g., `register_table_provider` → `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so warnings point to caller code; improved `__dir__` to return sorted, unique attributes.
* Cleanup: removed unused imports (including an unused error import from `utils.rs`) and other dead code to reduce noise.
…dating method calls
c47b0f1 to ea2973c
ea2973c to 1872a7f
…d avoid documentation duplication
timsaucer left a comment
This is an incredible start!
From a naming perspective I think it's more intuitive to just call these Table instead of TableProvider. I know we have a Table class in datafusion.catalog. It feels like this is a real opportunity to give the user an even more unified experience.
If we are going to be making big changes like this and deprecating some functions, then I really want to make sure we give an extremely pleasant end user experience.
dev/changelog/49.0.0.md
Outdated
    **Deprecations:**

    - Document that `SessionContext.register_table_provider` is deprecated in favor of `SessionContext.register_table`.
These changelogs are automatically generated, so I don't think we want to make changes here. Regardless, these would go into the 51.0.0 release.
I will revert this change.
docs/source/conf.py
Outdated
    # Skip private members that start with underscore to avoid duplication
    if name.split(".")[-1].startswith("_") and what in ("data", "variable"):
        skip = True
So I can understand better, why do we need both this rule and the one above in lines 86-88?
- The explicit skip_contents list handles targeted, known problem cases (re-exports, specific deprecated APIs, or particular items that cause duplication or confusion). It’s precise and intentional.
- The private-name filter is a broad rule to remove many small implementation details (module-level private constants) without listing them all manually. This prevents the docs from listing every private variable.
I'll also add clarifying comments in autoapi_skip_member_fn
    provider = TableProvider.from_capsule(delta_table.__datafusion_table_provider__())
    ctx.register_table("my_delta_table", provider)
This feels like a worse experience than before. Why can we not just call register_table("my_delta_table", delta_table)?
    capsule = provider.__datafusion_table_provider__()
    capsule_provider = TableProvider.from_capsule(capsule)

    df = ctx.from_pydict({"a": [1]})
    view_provider = TableProvider.from_dataframe(df)
    # or: view_provider = df.into_view()

    ctx.register_table("capsule_table", capsule_provider)
    ctx.register_table("view_table", view_provider)

    ctx.table("capsule_table").show()
    ctx.table("view_table").show()
This example takes a bit of cognitive load to understand what we're demonstrating.
First off, similar to my comments above, I don't think we want our users to have to think about whether they're using something that comes from a PyCapsule interface or not. Suppose I am a library user and I get a delta table object that implements PyCapsule. As a user of that library, I shouldn't have to understand how the interfacing works. I should just be able to use it directly. So I want to be able to pass those objects directly to TableProvider or register_table without having to think about or understand the mechanics behind the scenes.
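One way to read this request is that `register_table` should detect the `__datafusion_table_provider__` method itself. The sketch below shows that idea with pure-Python stand-ins: `Table`, `Context`, and `FakeDeltaTable` are hypothetical illustrations, not the real datafusion classes, and the string returned in place of a PyCapsule is a placeholder.

```python
class Table:
    """Hypothetical stand-in for a table wrapper; not the real datafusion.Table."""

    def __init__(self, source):
        self.source = source


class FakeDeltaTable:
    """Stand-in for a third-party object that exports a table provider."""

    def __datafusion_table_provider__(self):
        # A real implementation would return a PyCapsule here.
        return "capsule-placeholder"


class Context:
    """Hypothetical stand-in for a session context that stores tables by name."""

    def __init__(self):
        self._tables = {}

    def register_table(self, name, table):
        if isinstance(table, Table):
            wrapped = table
        elif hasattr(table, "__datafusion_table_provider__"):
            # The caller never has to touch the capsule machinery.
            wrapped = Table(table)
        else:
            raise TypeError(f"unsupported table object: {type(table).__name__}")
        self._tables[name] = wrapped


ctx = Context()
ctx.register_table("my_delta_table", FakeDeltaTable())
```

With this shape the call site is the single line the review asks for, and the wrapping step stays an internal detail.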
python/datafusion/__init__.py
Outdated
    # isort: skip_file # Prevent import-sorting linter errors (I001)
    # ruff: noqa: I001
Is this ruff lint causing a problem?
python/datafusion/dataframe.py
Outdated
    This is the preferred way to obtain a view for
    :py:meth:`~datafusion.context.SessionContext.register_table`.
I don't understand this statement.
Here are the reasons:
1. Direct API: Most efficient path - directly calls the underlying Rust
DataFrame.into_view() method without intermediate delegations.
2. Clear semantics: The into_ prefix follows Rust conventions,
indicating conversion from one type to another.
3. Canonical method: Other approaches like TableProvider.from_dataframe
delegate to this method internally, making this the single source of truth.
4. Deprecated alternatives: The older TableProvider.from_view helper
is deprecated and issues warnings when used.
I will add the above to the comment in def to_view too
python/datafusion/dataframe.py
Outdated
    >>> from datafusion import SessionContext
    >>> ctx = SessionContext()
    >>> df = ctx.sql("SELECT 1 AS value")
    >>> provider = df.into_view()
From an end user's perspective, they turn a dataframe into a view, which they then register so they can use it later. I don't think this end user needs to understand the concept of TableProvider at all. In the example I would change the variable name provider to view
This makes sense, given that we're moving away from 'provider'
…ge and advantages
…age of Table instead
245e89f to 918b1ce
I removed the TableProvider class in Python.
TableProvider wrapper & unified register_table API; deprecate register_table_provider
Closed with #1256
Which issue does this PR close?
Rationale for this change
This change consolidates and modernizes table provider registration in DataFusion's Python bindings. Previously, there were multiple ad-hoc mechanisms (`register_table_provider`, `Table.from_view()`, direct `Table` or pycapsule usage) that led to confusing APIs, inconsistent behaviors, and fragmented documentation.
This PR introduces a clean, centralized approach using the high-level `Table` wrapper class and a normalization layer that supports multiple table provider inputs, including:
- `Table` objects
- Datasets
- `DataFrame` views
By consolidating registration into `SessionContext.register_table()` and extending `Schema.register_table()` to match, we simplify the user experience, reduce internal complexity, and align the API more closely with Pythonic expectations.
What changes are included in this PR?
High-level Summary
- Introduces a new high-level Python API: `datafusion.Table.from_capsule()`, `.from_dataframe()`, and `.from_dataset()`
- Deprecates `SessionContext.register_table_provider()` in favor of `register_table()`
- Deprecates `Table.from_view()` in favor of `DataFrame.into_view()` and `Table.from_dataframe()`
- Updates `Schema.register_table()` to support any object implementing `__datafusion_table_provider__` and `pyarrow.dataset.Dataset`
- Adds `_normalize_table_provider` utility to coerce supported input types
- Centralizes coercion logic in Rust with `coerce_table_provider` and `table_provider_from_pycapsule()`
- Enhances documentation and examples to reflect modern registration idioms
- Improves test coverage for new usage patterns and coercion logic
- Introduces `datafusion.EXPECTED_PROVIDER_MSG` for stable, testable error messages
Are these changes tested?
Yes. This PR includes comprehensive test coverage:
- Unit tests for new `Table` methods and error handling
- Integration tests verifying:
  - `Table.from_dataframe()`, `from_capsule()`, and `into_view()`
  - … `pyarrow.dataset.Dataset` objects
  - … `DeprecationWarning`
  - `SessionContext` and `Schema` registration paths behave identically
  - … `__datafusion_table_provider__` can be used directly
  - … `DataFrame` without conversion)
Are there any user-facing changes?
✅ Additions
New public API:
- `datafusion.Table`
- `Table.from_dataframe(df)`
- `Table.from_capsule(capsule)`
- `Table.from_dataset(dataset)`
- `DataFrame.into_view()`: recommended way to convert to a table provider
- `datafusion.EXPECTED_PROVIDER_MSG`: stable constant for validation errors
- `Schema.register_table(...)` now accepts all supported inputs (like `SessionContext.register_table`)

- `SessionContext.register_table_provider(...)` is deprecated
  - … `register_table`
- `Table.from_view()` is deprecated
  - … `DeprecationWarning`; use `into_view()` or `from_dataframe()` instead
📋 Documentation & Examples
- … `Table` and `register_table`
- … `DataFrame` objects
🔁 Compatibility
- Fully backwards compatible
- Existing table registration logic continues to work as expected
- Encourages migration to the new `Table` API for consistency and future-proofing
Breaking changes?
No. This is a non-breaking refactor that preserves all existing behaviors through shims and deprecation paths. However, users relying on internal or undocumented APIs (e.g., raw table objects or bypassing coercion) may encounter changes.
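As a rough illustration of the "shims and deprecation paths" mentioned here, the sketch below shows a deprecated helper that warns and then delegates to the canonical method. The `DataFrame` and `Table` classes are hypothetical stand-ins rather than the real datafusion types, and the warning text is an assumption; only `warnings.warn` with `stacklevel=2` is standard-library behavior.

```python
import warnings


class DataFrame:
    """Hypothetical stand-in; not the real datafusion.DataFrame."""

    def into_view(self):
        # Canonical conversion: every other helper funnels through here.
        return ("view", self)


class Table:
    """Hypothetical stand-in; not the real datafusion.Table."""

    @staticmethod
    def from_dataframe(df):
        # Delegates to the single source of truth.
        return df.into_view()

    @staticmethod
    def from_view(df):
        # Deprecated shim: old call sites keep working but get a warning.
        # stacklevel=2 attributes the warning to the caller's line,
        # not to this helper.
        warnings.warn(
            "Table.from_view is deprecated; use DataFrame.into_view "
            "or Table.from_dataframe instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return Table.from_dataframe(df)


df = DataFrame()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_view = Table.from_view(df)
```

Because the shim returns exactly what the new path returns, callers can migrate one call site at a time without any change in behavior.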