Add user documentation for the FFI approach #1031
Conversation
Fantastic! No edits from me. I learned more about FFI and the approach reading that too. That is the exact doc I would have needed and wanted when encountering this issue organically.

Thinking out loud: do we want to have an FAQ of some sort at the top level? Unsure what other things you could touch on with the FAQ, but it could be useful, I think.

@kevinjqliu Since you've been working with this code closely recently, would you mind doing a review?
kevinjqliu
left a comment
Thank you so much for writing this up! This is a wealth of information 🙏
I added a few comments. Overall, this is amazing.
> The Primary Issue
> -----------------
>
> Suppose you wish to use DataFusion and you have a custom data source that can produce tables that
Just as a callout: what's so special about this method is that the abstraction is at the "data provider" ("TableProvider") layer. Most other projects are integrated with the Arrow ecosystem but require passing Arrow tables or batches around.
This allows DataFusion to integrate with a data source and produce the necessary data at the root level, which allows further optimizations and custom handling.
For example, pyiceberg currently passes data to other engines by materializing the Arrow table first and then registering the table with the engine:
https://github.com/apache/iceberg-python/blob/b95e792d86f551d3705c3ea6b7e9985a2a0aaf3b/pyiceberg/table/__init__.py#L1785-L1828
> as performant as possible and to utilize the features of DataFusion, you may decide to write
> your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.
>
> At first glance, it may appear the best way to do this is to add the ``datafusion-python``
The datafusion-python crate is interesting. What is it used for outside of creating the Python bindings for the datafusion Python library?
Before we started doing the FFI work, re-exporting the datafusion-python crate was basically the only way you could add extensions.
Thanks! I see this is called out in the Alternative Approach section. Was just wondering if there was another reason I missed.
> .. code-block:: rust
>
>    let my_provider = MyTableProvider::default();
>    let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
nit: since this is the implementation details section, I think it's worth calling out the ability to pass a tokio runtime and why we would want to do that.
apache/datafusion#13937
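As a sketch of what that callout could show (hedged: this assumes the current ``datafusion-ffi`` signature, where the third argument to ``FFI_TableProvider::new`` is an ``Option<tokio::runtime::Handle>``, and ``MyTableProvider`` is a hypothetical user-defined ``TableProvider``):

```rust
use std::sync::Arc;

use datafusion_ffi::table_provider::FFI_TableProvider;
use tokio::runtime::Runtime;

fn build_ffi_provider(my_provider: MyTableProvider) -> (Runtime, FFI_TableProvider) {
    // Create a dedicated runtime; the consuming process (e.g. a Python
    // interpreter) may not have any tokio runtime of its own.
    let runtime = Runtime::new().expect("failed to create tokio runtime");

    // Passing Some(handle) pins the provider's async work to this runtime
    // rather than whatever context the foreign caller invokes it from.
    let ffi_provider = FFI_TableProvider::new(
        Arc::new(my_provider),
        false, // can_support_pushdown_filters
        Some(runtime.handle().clone()),
    );

    // Keep the runtime alive alongside the provider: dropping it would
    // shut down the executor the provider depends on.
    (runtime, ffi_provider)
}
```

This is only a sketch under the stated assumptions; see apache/datafusion#13937 for the motivating discussion of why a caller-supplied runtime matters here.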
> The FFI Approach
> ----------------
>
> Rust supports interacting with other programming languages through its Foreign Function
> In addition to using these interfaces to transfer Arrow data between libraries, ``pyarrow``
> goes one step further to make sharing the interfaces easier in Python. They do this
> by exposing PyCapsules that contain the expected functionality.
It would be nice to note that this isn't just a pyarrow implementation detail; it's a full specification for Python, of which pyarrow is just one implementor https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
In particular, pyarrow didn't always use capsules; it used to use raw integers in the _export_to_c and _import_from_c methods, where those integers refer to data pointers.
But this can be unsafe and lead to memory leaks if those pointers aren't always consumed, because then the Arrow release callback may never be called.
The added benefit and safety of the capsule comes specifically from the fact that a capsule lets you define a destructor, which the Python GC will call if the capsule is never used. So you can't get memory leaks when using capsules.
> let ffi_provider = FFI_TableProvider::new(Arc::new(my_provider), false, None);
>
> If you were interfacing with a library that provided the above ``FFI_TableProvider`` and
> you needed to turn it back into a ``TableProvider``, you can turn it into a
> .. code-block:: rust
>
>    let name = CString::new("datafusion_table_provider")?;
>    let my_capsule = PyCapsule::new_bound(py, provider, Some(name))?;
I think you really want to use PyCapsule::new_with_destructor so that you don't get memory leaks if the capsule isn't used.
Or maybe PyCapsule::new will automatically call the Drop impl and you don't need to manually call new_with_destructor? I can't remember
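Either way, the underlying concern can be sketched with plain Rust: cleanup tied to a destructor runs even when the value is never explicitly consumed, which is exactly what a capsule destructor buys you if Python garbage-collects an unused capsule. A minimal stdlib-only illustration (with a stand-in type, not the real ``FFI_TableProvider``):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static RELEASED: AtomicBool = AtomicBool::new(false);

// Stand-in for an FFI struct holding resources that must be released.
struct FfiProviderStub;

impl Drop for FfiProviderStub {
    fn drop(&mut self) {
        // In a real capsule destructor, this is where the FFI release
        // callback would run, preventing a leak when the capsule is
        // collected without ever being opened.
        RELEASED.store(true, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _provider = FfiProviderStub;
        assert!(!RELEASED.load(Ordering::SeqCst));
    } // _provider dropped here, analogous to the destructor firing on GC

    assert!(RELEASED.load(Ordering::SeqCst));
    println!("release callback ran");
}
```

The analogy is only to the cleanup guarantee; whether ``PyCapsule::new`` already hooks the value's ``Drop`` impl is the open question raised above.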
> .. code-block:: rust
>
>    let capsule = capsule.downcast::<PyCapsule>()?;
>    let provider = unsafe { capsule.reference::<FFI_TableProvider>() };
IMO you should always check the name of the capsule before opening it. https://github.com/kylebarron/arro3/blob/a041de6bae703a2c8fa3e3b31cd863d4801101d2/pyo3-arrow/src/ffi/from_python/utils.rs#L48
Your FFI interface should also define a name that all implementations should attach to the pycapsule.
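To make that concrete, here is a minimal stdlib-only sketch of the name check (hedged: ``capsule_name_matches`` is a hypothetical helper; in real code the name would come from pyo3's ``PyCapsule::name()``, and the check would happen before calling ``capsule.reference::<FFI_TableProvider>()``):

```rust
use std::ffi::CStr;

// Hypothetical helper: only trust the capsule's contents when its name
// matches the identifier your FFI interface specifies.
fn capsule_name_matches(capsule_name: Option<&CStr>, expected: &str) -> bool {
    capsule_name
        .and_then(|n| n.to_str().ok())
        .map_or(false, |s| s == expected)
}

fn main() {
    let good = CStr::from_bytes_with_nul(b"datafusion_table_provider\0").unwrap();
    let bad = CStr::from_bytes_with_nul(b"arrow_array_stream\0").unwrap();

    assert!(capsule_name_matches(Some(good), "datafusion_table_provider"));
    assert!(!capsule_name_matches(Some(bad), "datafusion_table_provider"));
    // An anonymous capsule should be rejected rather than dereferenced.
    assert!(!capsule_name_matches(None, "datafusion_table_provider"));
    println!("name checks passed");
}
```

Rejecting a mismatched or missing name up front turns a would-be undefined-behavior dereference into an ordinary error.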
> As we discussed, this is not guaranteed to work across different compiler versions and
> optimization levels. If you wish to go down this route, there are two approaches we
> have identified you can use.
IMO I wouldn't even suggest this because this is explicitly recommended against by the pyo3 maintainers: PyO3/pyo3#1444
* Initial commit for FFI user documentation
* Update readme to point to the online documentation. Fix a small typo.
* Small text adjustments for clarity and formatting
Which issue does this PR close?
Closes #1027
Rationale for this change
User requested.
What changes are included in this PR?
Adds a page to the online documentation.
Are there any user-facing changes?
None