Zero-copy Python inputs using arrow
#224
Comments
Thanks for this detailed analysis! Curious about when you say:
This is because we can't listen on the Python arrow itself, right? This is why you mentioned that we could eventually listen for a drop of the event object itself (with a one-shot Tokio channel, for example). In the case we listen for a drop of the event object, wouldn't we be able to know when the data is no longer used?
I have to check again but from what I remember, the unsafe slice gives you a
Dropping the event is not enough since the user can take the data out of input events and store them somewhere else (e.g. in some list). So we really have to wait until the data is dropped too. With Arrow, this is signaled through the
The issue with the
I see your point. Thanks for the clarification! I'll try to do some tests as well next week.
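The point made in this exchange, that dropping the event is not the same as dropping the data, can be sketched in plain Python. This is only an illustration under stated assumptions: `InputData`, `Event`, `make_event`, and `drop_reports` are hypothetical names, not dora APIs, and the timing relies on CPython's reference counting.

```python
import weakref


class InputData:
    """Hypothetical stand-in for zero-copy input data."""

    def __init__(self, payload):
        self.payload = payload


class Event:
    """Hypothetical stand-in for an input event that carries data."""

    def __init__(self, data):
        self.data = data


drop_reports = []  # hypothetical stand-in for drop events sent to the sender


def make_event(payload, token):
    data = InputData(payload)
    # The finalizer is tied to the data object, not to the event,
    # so it runs only once the data itself is garbage-collected.
    weakref.finalize(data, drop_reports.append, token)
    return Event(data)


event = make_event(b"large message", token="drop-1")
stored = [event.data]  # user code takes the data out of the event

del event  # dropping the event alone reports nothing
reported_after_event_drop = list(drop_reports)

stored.clear()  # the last reference to the data goes away
reported_after_data_drop = list(drop_reports)
```

Tying the notification to the data object is what makes it safe for user code to keep the data around after the event is gone.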
To give some more details:
Could you clarify why you chose the Also, it looks like there are some proposals to merging
Yeah, I thought that In many ways the arrow2 crate seemed easier to work with
In any case, it should be simple to change from arrow2 to arrow, as both can read C pointers as input to make an array. I can add some comments on how to go from one to the other if you need. Also, isn't the deallocation method of an array that we have built called at the end of its lifetime, which in our case is when we export it to a Python arrow array, and not when it is no longer used within Python?
Yeah, I think it shouldn't be too difficult to switch from
This depends on whether the created arrow array only borrows the data or whether it takes ownership. The When exporting the array to C through the
Implemented in #228.
Motivation
While we already have support for zero-copy inputs in Rust nodes, we still require copying the data for Python nodes. This can lead to decreased performance when using messages with large data. To avoid these slowdowns, we want to support zero-copy inputs for Python nodes too.
Challenges
The fundamental challenge is that Python normally operates on owned data. So we have to use a special type such as the memoryview object to make the data accessible to Python without copying. What makes things more complex is that we need custom freeing logic because we need to unmap the shared memory and report a drop event back to the sender node.

Using a non-standard data type makes it more complex to interact with the data. For example, special functions might be needed for reading, cloning, and slicing the data. Also, it is common to convert the data to other types (e.g. numpy arrays), which requires special conversion functions.
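As a minimal standard-library sketch of the zero-copy access described here: a bytearray stands in for the shared-memory region (an assumption for illustration, dora's actual shared memory handling is not shown), and the variable names are hypothetical.

```python
# A bytearray stands in for a shared-memory region (illustrative assumption).
shared = bytearray(8)

view = memoryview(shared)   # no copy: the view references the same memory
window = view[2:6]          # slicing a memoryview is also zero-copy
copy = bytes(shared[2:6])   # slicing the bytearray itself copies the data

shared[2] = 0xFF            # mutate the underlying buffer

view_sees_change = window[0] == 0xFF  # the view observes the write
copy_sees_change = copy[0] == 0xFF    # the copy still holds the old value

# release() only drops the views' hold on the buffer; it cannot run custom
# logic such as unmapping memory or notifying a sender node.
window.release()
view.release()
```

The last comment is the limitation the paragraph above points at: a plain memoryview gives zero-copy reads, but not a hook for custom freeing.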
Apache Arrow
To keep the Python API simple and easy to use, it's a good idea to use some mature existing data format rather than creating our own custom data format. The Apache Arrow project provides such a data format that supports zero-copy data transfer across language boundaries. It's already quite popular and used in many projects and tools, so it looks like a good candidate for dora.
Arrow vs PyO3
arrow provides two useful features for us
The Arrow Array datatype
release callback.
private_data pointer field
Implementation
pyarrow:
ArrowArray from Rust with custom drop semantics.
The arrow2 crate
PrimitiveArray and the arrow2::ffi::mmap::slice function.
arrow2::array::Array trait for a custom type
arrow2::ffi::export_array_to_c function to create the Arrow array
arrow2 library automatically fills in a release implementation that calls the drop handler of our custom type
export_array_to_c function only works with the predefined array types (via downcasting)
Arrow array with custom drop logic using the arrow2 crate
Manual Creation
Arrow arrays is specified, so we could do the creation manually
The official arrow crate
Buffer::from_custom_allocation looks quite promising for our use case
Buffer can then be converted to ArrayData using new_unchecked:
ArrayData struct can then be used to construct an FFI_ArrowArray through the new constructor.
FFI_ArrowSchema type: FFI_ArrowSchema::try_from(data.data_type())?