Add user documentation for the FFI approach #1031
Conversation
Fantastic! No edits from me. I learned more about FFI and the approach reading that too. That is the exact doc I would have needed and wanted when encountering this issue organically.

Thinking out loud: do we want to have an FAQ of some sort at the top level? Unsure what other things you could touch on with the FAQ, but it could be useful, I think.

@kevinjqliu Since you've been working with this code closely recently, would you mind doing a review?
kevinjqliu
left a comment
Thank you so much for writing this up! This is a wealth of information 🙏
I added a few comments. Overall, this is amazing.
> The Primary Issue
> -----------------
>
> Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
Just as a callout: what's so special about this method is that the abstraction is at the "data provider" ("TableProvider") layer. Most other projects are integrated with the Arrow ecosystem but require passing Arrow tables or batches around.
This allows DataFusion to integrate with a data source and produce the necessary data at the root level, which allows further optimizations and custom handling.
For example, pyiceberg currently passes data to other engines by materializing the Arrow table first and then registering the table with the engine:
https://github.com/apache/iceberg-python/blob/b95e792d86f551d3705c3ea6b7e9985a2a0aaf3b/pyiceberg/table/__init__.py#L1785-L1828
> as performant as possible and to utilize the features of DataFusion, you may decide to write
> your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.
>
> At first glance, it may appear the best way to do this is to add the ``datafusion-python``
The datafusion-python crate is interesting. What is it used for outside of creating the Python bindings for the datafusion Python library?
Before we started doing the FFI work, re-exporting the datafusion-python crate was basically the only way you could add extensions.
Thanks! I see this is called out in the Alternative Approach section. Was just wondering if there was another reason I missed.
> .. code-block:: rust
>
>    let my_provider = MyTableProvider::default();
>    let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
nit: since this is the implementation details section, I think it's worth calling out the ability to pass a tokio runtime and why we would want to do that.
apache/datafusion#13937
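As a sketch of what that callout could show (hedged: this assumes the current ``datafusion-ffi`` signature, where the third argument to ``FFI_TableProvider::new`` is an ``Option<tokio::runtime::Handle>``, and ``MyTableProvider`` is a hypothetical user-defined ``TableProvider``):

```rust
use std::sync::Arc;

use datafusion_ffi::table_provider::FFI_TableProvider;
use tokio::runtime::Runtime;

fn build_ffi_provider(my_provider: MyTableProvider) -> (Runtime, FFI_TableProvider) {
    // Create a dedicated runtime; the consuming process (e.g. a Python
    // interpreter) may not have any tokio runtime of its own.
    let runtime = Runtime::new().expect("failed to create tokio runtime");

    // Passing Some(handle) pins the provider's async work to this runtime
    // rather than whatever context the foreign caller invokes it from.
    let ffi_provider = FFI_TableProvider::new(
        Arc::new(my_provider),
        false, // can_support_pushdown_filters
        Some(runtime.handle().clone()),
    );

    // Keep the runtime alive alongside the provider: dropping it would
    // shut down the executor the provider depends on.
    (runtime, ffi_provider)
}
```

This is only a sketch under the stated assumptions; see apache/datafusion#13937 for the motivating discussion of why a caller-supplied runtime matters here.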
> The FFI Approach
> ----------------
>
> Rust supports interacting with other programming languages through its Foreign Function
> In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow``
> goes one step further to make sharing the interfaces easier in Python. They do this
> by exposing PyCapsules that contain the expected functionality.
It would be nice to note that this isn't just a pyarrow implementation detail; it's a full specification for Python, of which pyarrow is just one implementor https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
In particular, pyarrow didn't always use capsules; it used to use raw integers in the _export_to_c and _import_from_c methods, where those integers refer to data pointers.
But this can be unsafe and lead to memory leaks if those pointers aren't always consumed, because then the Arrow release callback may never be called.
The added benefit and safety of the capsule comes specifically from the fact that a capsule lets you define a destructor, which the Python GC will call if the capsule is never used. So you can't get memory leaks when using capsules.
> let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
>
> If you were interfacing with a library that provided the above ``FFI_TableProvider`` and
> you needed to turn it back into a ``TableProvider``, you can turn it into a
> .. code-block:: rust
>
>    let name = CString::new("datafusion_table_provider")?;
>    let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?;
I think you really want to use PyCapsule::new_with_destructor so that you don't get memory leaks if the capsule isn't used.
Or maybe PyCapsule::new will automatically call the Drop impl and you don't need to manually call new_with_destructor? I can't remember
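Either way, the underlying concern can be sketched with plain Rust: cleanup tied to a destructor runs even when the value is never explicitly consumed, which is exactly what a capsule destructor buys you if Python garbage-collects an unused capsule. A minimal stdlib-only illustration (with a stand-in type, not the real ``FFI_TableProvider``):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static RELEASED: AtomicBool = AtomicBool::new(false);

// Stand-in for an FFI struct holding resources that must be released.
struct FfiProviderStub;

impl Drop for FfiProviderStub {
    fn drop(&mut self) {
        // In a real capsule destructor, this is where the FFI release
        // callback would run, preventing a leak when the capsule is
        // collected without ever being opened.
        RELEASED.store(true, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _provider = FfiProviderStub;
        assert!(!RELEASED.load(Ordering::SeqCst));
    } // _provider dropped here, analogous to the destructor firing on GC

    assert!(RELEASED.load(Ordering::SeqCst));
    println!("release callback ran");
}
```

The analogy is only to the cleanup guarantee; whether ``PyCapsule::new`` already hooks the value's ``Drop`` impl is the open question raised above.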
> .. code-block:: rust
>
>    let capsule = capsule.downcast::<PyCapsule>()?;
>    let provider = unsafe { capsule.reference::<FFI_TableProvider>() };
IMO you should always check the name of the capsule before opening it. https://github.com/kylebarron/arro3/blob/a041de6bae703a2c8fa3e3b31cd863d4801101d2/pyo3-arrow/src/ffi/from_python/utils.rs#L48
Your FFI interface should also define a name that all implementations should attach to the pycapsule.
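To make that concrete, here is a minimal stdlib-only sketch of the name check (hedged: ``capsule_name_matches`` is a hypothetical helper; in real code the name would come from pyo3's ``PyCapsule::name()``, and the check would happen before calling ``capsule.reference::<FFI_TableProvider>()``):

```rust
use std::ffi::CStr;

// Hypothetical helper: only trust the capsule's contents when its name
// matches the identifier your FFI interface specifies.
fn capsule_name_matches(capsule_name: Option<&CStr>, expected: &str) -> bool {
    capsule_name
        .and_then(|n| n.to_str().ok())
        .map_or(false, |s| s == expected)
}

fn main() {
    let good = CStr::from_bytes_with_nul(b"datafusion_table_provider\0").unwrap();
    let bad = CStr::from_bytes_with_nul(b"arrow_array_stream\0").unwrap();

    assert!(capsule_name_matches(Some(good), "datafusion_table_provider"));
    assert!(!capsule_name_matches(Some(bad), "datafusion_table_provider"));
    // An anonymous capsule should be rejected rather than dereferenced.
    assert!(!capsule_name_matches(None, "datafusion_table_provider"));
    println!("name checks passed");
}
```

Rejecting a mismatched or missing name up front turns a would-be undefined-behavior dereference into an ordinary error.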
> As we discussed, this is not guaranteed to work across different compiler versions and
> optimization levels. If you wish to go down this route, there are two approaches we
> have identified you can use.
IMO I wouldn't even suggest this because this is explicitly recommended against by the pyo3 maintainers: PyO3/pyo3#1444
* Initial commit for FFI user documentation
* Update readme to point to the online documentation. Fix a small typo.
* Small text adjustments for clarity and formatting
Which issue does this PR close?
Closes #1027
Rationale for this change
User requested.
What changes are included in this PR?
Adds a page to the online documentation.
Are there any user-facing changes?
None