[Proposal] Combination of arrow-rs and arrow2, deprecation of arrow2 repository #1429
Comments
I wonder if @jorgecarleitao @ritchie46 @b41sh or others can share their perspectives on this proposal |
Since Arrow2 had many improvements and features that |
Do nothing !!! |
Thanks for the overview @alamb. I am in favor of proposal 3. Where the core buffers are in I believe that this will hit the most important pain point in the community, that is that currently there is no way to interop with arrow-rs/arrow2 without paying a huge compilation cost. |
Would you be willing to receive PRs to polars to accelerate this transition along, or do you see there being fundamental blockers to this? I'm very motivated to get everyone working on the same implementation if possible, rather than splitting our efforts, the interop in my mind is just a means to this end |
Specifically, perhaps we can help show how to use |
Just checking in here. Has there been any further progress on the merger? We on the nushell team are hoping we can move onto the resulting library once it's ready. |
I am not aware of any further work under this issue, the state of play as I understand it is that:
My understanding is that polars-arrow is intended to serve the needs of polars, and not as a general purpose arrow library (although @ritchie46 please correct me if I am wrong), and therefore workloads should look to switch over to using arrow-rs. Edit: unless of course those workloads are integrated with polars, in which case I guess they switch to using polars-arrow?? |
@tustvold Thanks for your quick response ! |
@tustvold - Thanks for the quick and thorough response. It's much appreciated. We'll reach out to @ritchie46 to chat about compatibility concerns we need to be aware of if we go with arrow-rs. Thanks again for the help. |
It appears that databend, another of the historical major users of arrow2 has also switched to arrow-rs https://github.com/search?q=repo%3Adatafuselabs%2Fdatabend+arrow+language%3ATOML&type=code&l=TOML |
Maybe it makes sense to add a note to the readme of this repo explaining the status as well? |
We (GreptimeDB) have also switched to arrow-rs from arrow2. |
Here is a proposal to add a note to the readme: #1606 |
This library consistently outperforms (almost 2x) the decode time of Also it has completely separate decoding/IO parts for parquet, which is very useful |
I agree it would be great to get some additional maintainers (or maybe figure out how to port whatever is working well for you to the parquet crate) |
Actually, sorry about the performance claim, I was parallelizing with arrow2. I'll try to parallelize I'll try to separate decoding and IO with |
Just to add a little clarity on this document. Ideally there would be some sort of stronger declaration about this crate -- either that it was deprecated and urging people to move to a maintained crate, or that someone / a group was rallying a community to maintain it going forward. However, given the current lack of community engagement / maintenance, from my perspective, the challenge is that it is not at all clear who would make such a decision and no one seems willing to put the time in to chart a path forward. I am not sure if @jorgecarleitao, as the original author, has any thoughts to share on this matter. |
Would marking this repo as archived be a reasonable action (besides the note in the README)? |
Personally, I think it'd be helpful for many visitors that besides the repo being archived there is a short note in the README with what users are expected to use instead of this repo. And links to the official Arrow-rs, maybe the Polars copy, and a link to this issue too. |
If we want to adjust the notice at the beginning of the readme, perhaps something like: Important This repository has been superseded by the official Apache Arrow repository, arrow-rs, and is no longer maintained. Polars, which was the original motivation for this project, is now maintaining its own pared-down clone of arrow2 in-tree.
|
Is there still a plan to include arrow2 arrays as a part of the arrow-rs library like it was discussed earlier in the form of something like PhysicalArray? I was working on a PR to add the newly introduced binary_view and utf8_view (in arrow spec and c ABI) to arrow2. I have 90% of the code ready. I saw the deprecation notice and I am not sure on how to proceed further. I feel that there is not enough clarity on what features have been added from arrow2 to arrow-rs, apart from the interoperability PRs for buffers, arrays and schema. Since the original plan was to merge arrow-rs with arrow2, I am not sure if we are reaping the benefits of arrow2 in terms of performance, safety and intuitive API usage, after the recent commits. Can we summarize the final changes and additions? @jorgecarleitao @tustvold @alamb Thanks to you and everyone else who contributed to this repo! As an arrow2 user, I would want to know how to go about with my existing code without sacrificing performance and the safe & intuitive design of arrow2 over arrow-rs. |
There is no one actively working on this that I know of
I don't know of any summary of the additions made to arrow-rs but I would also be interested in any summary you are able to provide -- perhaps you can look at the current API in arrow-rs and judge for yourself if it suits your needs.
Again, I would encourage you to look at arrow-rs and if there are particular things you find lacking in terms of performance or design, make proposals and work with us to improve its design. For example, perhaps you would be interested in helping implement string view in arrow_rs -- I filed a ticket to track the work here: apache/arrow-rs#5374 |
I am actively looking at both the libraries and will try to compile my insights into a document. I am happy to provide some kind of summary for the API difference as well. However, what I am trying to say is that the note in the README is not enough for users to understand what to do with their existing arrow2 code. Perhaps @jorgecarleitao and @tustvold will be able to provide more info on the most recent changes that they have incorporated into both the repositories for interoperability. Also, does this mean that everyone should transition to arrow-rs, since there are talks to archive arrow2 and mark it as read-only? |
At this point the arrow-rs arrays contain strongly typed buffers, with zero-copy interoperability with There are differences in the broader APIs provided, but in terms of functionality, the only thing I'm aware of that isn't supported by arrow-rs (yet) but is supported by arrow2, is avro.
I think in lieu of a long-term maintenance story for arrow2, this would be my recommendation. |
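For readers unfamiliar with the term, the "zero-copy interoperability" mentioned above (and the `From<..>` impls listed under Milestone 1) can be illustrated with a minimal sketch. It assumes two array types that both hold their values in the same reference-counted buffer. `ArrayA`, `ArrayB`, and `shares_buffer` are hypothetical names made up for this illustration; they are not the actual arrow-rs or arrow2 types or APIs.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an array type from one library.
#[derive(Debug, Clone)]
pub struct ArrayA {
    values: Arc<[i32]>,
}

/// Hypothetical stand-in for the equivalent array type from another library.
#[derive(Debug, Clone)]
pub struct ArrayB {
    values: Arc<[i32]>,
}

impl ArrayA {
    /// Builds an array from a Vec. This construction step allocates the
    /// shared buffer once; later conversions do not copy it again.
    pub fn from_vec(v: Vec<i32>) -> Self {
        Self { values: Arc::from(v) }
    }

    pub fn values(&self) -> &[i32] {
        &self.values
    }
}

impl ArrayB {
    pub fn values(&self) -> &[i32] {
        &self.values
    }
}

impl From<ArrayA> for ArrayB {
    fn from(a: ArrayA) -> Self {
        // Moves the reference-counted pointer; no element is copied.
        Self { values: a.values }
    }
}

impl From<ArrayB> for ArrayA {
    fn from(b: ArrayB) -> Self {
        // Same idea in the other direction.
        Self { values: b.values }
    }
}

/// Returns true if both arrays point at the same allocation.
pub fn shares_buffer(a: &ArrayA, b: &ArrayB) -> bool {
    Arc::ptr_eq(&a.values, &b.values)
}
```

Converting with `let b: ArrayB = a.into();` only moves a pointer, so the cost does not depend on the array length. The real libraries have richer buffer types (offsets, validity bitmaps, data types), so this only shows the general shape of the idea.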
I think each user should make this decision based on what their needs are. Ideally we would help the decision by providing the type of information you are describing @urvishdesai. |
This proposal is a response to the seeming desire to unify Rust Arrow development communities going forward and my summary of the discussion on apache/arrow-rs#1176 (specifically apache/arrow-rs#1176 (comment)) . There is a summary with some diagrams that may be helpful in https://docs.google.com/presentation/d/1cqQEpC-kJES2Mng152r_qZyaOqHjtb5YFuseSTWyulU/edit
I am trying to help the community with this proposal. Please provide your feedback -- I don't expect to proceed unless there is consensus that this is a desired path.
Current state
Proposal 1 End State (alternate options are listed below)
This will necessitate users of arrow2 to either change their code, or find alternate maintainers with sufficient capacity for arrow2 if they want a maintained version of arrow going forward.
Prior to the Deprecation Period
After the Deprecation Period
Proposed Milestones
Milestone 1: Ergonomic (and Zero runtime cost) conversion
Buffer/ForeignVec with PrimitiveArrayData, as described in Discussion: relationship / unification of arrow-rs and arrow2 going forward apache/arrow-rs#1176 (comment)
This will result in:
array-data rather than foreign_vec
From<..> impls) between the arrow-rs and arrow2 Array types
Milestone 2: Deprecation period
Milestone 3: Deprecation is over, arrow2 is archived
Alternatives Considered
Proposal 2: deprecate arrow2 but do not run (explicit) IP clearance
Pros: less work
Cons: potentially complicates the 'porting code from arrow2 to arrow' process as the IP provenance isn't as clear. However, since the entire arrow2 repo is apache2 licensed, it may be ok.
Proposal 3: arrow2 continues development outside the ASF, with
Pros: Users of arrow2 can continue to use it without any migration or effort
Cons: Unclear where the resources will come from
Proposal 4: Do nothing