Use Arrow PyCapsule Interface instead of Dataframe Interchange Protocol #3756

WillAyd · 2024-08-29T12:25:17Z

This is something I've chatted with @MarcoGorelli offline about. At the time it was implemented in seaborn, the Dataframe Interchange Protocol was the best option for exchanging dataframe-like data. Since then, however, the Arrow PyCapsule Interface has come along and solved many of the issues that the Dataframe Interchange Protocol left open.

Without knowing the current state of the interchange implementation in seaborn, switching to the Arrow PyCapsule Interface should solve at least the following issues:

- It will add support for polars and other dataframe libraries ("Feature Request for Polars library Support on Seaborn" #3277 and "Support for Pola.rs DF" #3188)
- It will use the Arrow type system, which supports aggregate types ("Try to_pandas rather than erroring if interchanging to pandas doesn't work?" #3533)
- The wonkiness of pandas' type system won't be inherited by seaborn (potentially solving "size parameter of scatterplot does not accept Float64 type" #3519)

The interface has already been adopted by a good number of projects, some of which are being tracked in apache/arrow#39195


mwaskom · 2024-08-29T13:26:29Z

Thanks for flagging. Just skimmed your link but it looks like it's operating at a very different level from the dataframe interchange protocol? The relevance to seaborn (i.e., is there a simple way to be more agnostic about input data structure types) isn't super obvious.

WillAyd · 2024-08-29T14:14:26Z

Apologies as I should have been more clear - the technical documentation I provided was just a reference, not something I'd expect seaborn to have to implement from scratch. The dataframe libraries that you would interact with should do most of the heavy lifting for that.

@MarcoGorelli probably knows best here, but from a cursory glance of the seaborn source code, I think you could adopt the Arrow PyCapsule interface in a piece-wise fashion:

1. Maintain the dependency on pandas, but just swap out checks for the __dataframe__ dunder with checks for __arrow_c_schema__
2. Drop the dependency on pandas, using a more generic library like narwhals for any internal dataframe operations

Step 1 I think would be pretty easy, and would immediately open up seaborn for use from polars, excluding any data types that polars has which pandas does not (most likely Decimal / aggregate types)
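A minimal sketch of what the check in Step 1 might look like. One caveat: in the Arrow PyCapsule Interface, `__arrow_c_schema__` is exported by schema and type objects, while tabular data is exported through `__arrow_c_stream__`, so this sketch assumes the stream dunder is the one to test on dataframe-like inputs. The helper name `classify_input` and the two stand-in classes are hypothetical and are not part of seaborn.

```python
def classify_input(data):
    """Report which exchange mechanism `data` advertises, preferring Arrow.

    Hypothetical helper for illustration only; not seaborn's actual code.
    """
    # Tabular producers in the Arrow PyCapsule Interface expose this dunder.
    if hasattr(data, "__arrow_c_stream__"):
        return "arrow_pycapsule"
    # Older path: the Dataframe Interchange Protocol.
    if hasattr(data, "__dataframe__"):
        return "interchange_protocol"
    return "unsupported"


class FakeArrowFrame:
    """Stand-in for a library object that exports an Arrow stream."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("stand-in only; no real capsule is produced")


class FakeInterchangeFrame:
    """Stand-in for a library object that only supports the interchange protocol."""

    def __dataframe__(self, allow_copy=True):
        raise NotImplementedError("stand-in only")
```

Because the check is plain duck typing, it needs no import of pandas, polars, or PyArrow at the point where the input is inspected.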

Step 2 would take a little more time. I'm not sure if narwhals is even fully capable of abstracting all of the dataframe operations that seaborn needs today, but in theory this would make your dependencies more lightweight by dropping pandas

Overall, rather than seaborn having to customize solutions towards the various dataframe type systems, the ecosystem would converge on just the Arrow type system. Assuming seaborn still requires NumPy types for interactivity with matplotlib, there will still be a gap where Arrow types don't have a plottable equivalent, but I think that's better than the status quo where seaborn is tied to the pandas type system, given Arrow is better documented and more stable.

mwaskom · 2024-08-30T01:26:43Z

Thanks for elaborating.

> and would immediately open up seaborn for use from polars

To be clear, this is already the case:

import polars as pl
import seaborn as sns
df = pl.DataFrame({"cat": ["x", "y", "z"], "val": [1, 2, 3]})
sns.barplot(df, x="cat", y="val")

> using a more generic library like narwhals for any internal dataframe operations

This is a complete non-starter.

> Assuming seaborn still requires NumPy types for interactivity with matplotlib

I can't see that changing any time soon, but I don't know what specifically is on matplotlib's roadmap.

MarcoGorelli · 2024-08-30T08:47:05Z

Thanks for the ping, and thanks both for comments! 🙏

It's true that Seaborn accepts Polars objects, but plotting fails if the object contains data types not recognised by the interchange protocol (#3533). (I think we all find this frustrating, and feel at least slightly let down by the interchange protocol, but that's a different story...)

Seaborn currently uses

pd.api.interchange.from_dataframe(data)

and that's what fails when the interchange protocol falls short. But if in pandas we first tried using the (superior, better maintained, less fallible) PyCapsule interface, then Seaborn's current code could "just work".
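The "try the PyCapsule Interface first, then fall back to the interchange protocol" idea can be sketched roughly as below. This is not the actual pandas or seaborn implementation: the function `to_pandas` and its injectable converter arguments are hypothetical, and the default Arrow converter assumes PyArrow 14 or later, where `pyarrow.table` accepts objects exposing `__arrow_c_stream__`.

```python
def _via_capsule(data):
    # Assumes pyarrow >= 14, where pa.table() consumes __arrow_c_stream__ producers.
    import pyarrow as pa  # imported lazily so the sketch loads without pyarrow

    return pa.table(data).to_pandas()


def _via_interchange(data):
    import pandas as pd  # imported lazily for the same reason

    return pd.api.interchange.from_dataframe(data)


def to_pandas(data, via_capsule=_via_capsule, via_interchange=_via_interchange):
    """Convert a dataframe-like object to pandas, preferring the Arrow route.

    Hypothetical helper; the converters are parameters so the dispatch logic
    can be exercised without either library installed.
    """
    if hasattr(data, "__arrow_c_stream__"):
        return via_capsule(data)
    if hasattr(data, "__dataframe__"):
        return via_interchange(data)
    raise TypeError(f"Cannot convert {type(data).__name__} to a pandas.DataFrame")
```

With this ordering, an object that implements both mechanisms is routed through Arrow, so the interchange protocol is only used by producers that offer nothing else.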

> > using a more generic library like narwhals for any internal dataframe operations

> This is a complete non-starter.

😆 fair enough

So, in summary, there might not be anything actionable on Seaborn's side here (though I hope the fallback in #3534 makes it into the next release). Still, good to catch up and hear your opinion on the topic 🙌

MarcoGorelli · 2024-11-09T17:45:16Z

> But if in pandas we first tried using the (superior, better maintained, less fallible) PyCapsule interface, then Seaborn's current code could "just work"

This would require waiting for pandas 3.0, which might take quite some more time

I've opened #3782 to suggest going via PyArrow's PyCapsule Interface, which is widely used and robust

I'm hoping this can be considered - if not, I'm hoping we can discuss some other alternatives, because in any case, if Seaborn ends up being the only (!) project still using the Interchange Protocol, then that'll introduce further risk

mwaskom · 2025-01-26T15:25:44Z

Based on the resolution to #3782, it sounds like this will now happen by default within pandas, which is a great solution.

WillAyd · 2025-01-27T18:46:06Z

I would still advise keeping this open. What was done for pandas is just a backwards compat shim, but long term the future of the DataFrame Interchange Protocol is sadly very bleak. At some point, I would think pandas deprecates and removes it

mwaskom · 2025-01-27T19:00:44Z

Maybe I misunderstood the resolution to the linked PR, but I took it to mean that pandas is moving the details of how to convert arbitrary dataframe-thingies into pandas.DataFrame into pandas. That sounds like the right solution to me! Why should non-dataframe libraries need to be updating how they do that every 3 months when there's a hot new solution, instead of calling into a pandas API?

MarcoGorelli · 2025-01-27T19:12:49Z

Agree, I don't think there's any need to remove pd.api.interchange.from_dataframe

@WillAyd even if all the __dataframe__ code in pandas were removed, I don't think there'd be a need to remove pd.api.interchange.from_dataframe and just have it keep calling the PyCapsule Interface. It's pretty much 10 lines of code, it's not like it's going to be a maintenance burden, and like that we can keep things working for Seaborn (and for consumers of say Plotly who upgrade pandas but don't upgrade Plotly)

WillAyd · 2025-01-27T20:15:00Z

> @WillAyd even if all the __dataframe__ code in pandas were removed, I don't think there'd be a need to remove pd.api.interchange.from_dataframe and just have it keep calling the PyCapsule Interface

Ah OK I didn't realize that was the path you were thinking of going down. There is already an open issue to discuss how to best import capsule data into pandas and I don't think that was discussed as an option: pandas-dev/pandas#59631

Ultimately what I'm trying to avoid is having a canonical way of importing data, alongside a repurposed API from the now defunct dataframe interchange protocol that is solely maintained for backwards compat. It would be better to just align on one approach, not even just for pandas but for the entire ecosystem

mwaskom · 2025-01-27T23:15:41Z

It just really isn't obvious to people who don't think about this stuff all the time that the canonical way to get a library-specific representation of a DataFrame would be through pyarrow, an "internals" library. The "interchange API" approach of having pandas offer an API that consumed an arbitrary dataframe made plenty of sense, even if something wasn't ideal about the implementation. Whether it's some function in the pd namespace or a DataFrame.from_whatever method doesn't matter too much (although I'd find the latter more obvious).

Maybe I'm missing something though. It really just feels like the "pyarrow" part of this is an implementation detail, and should remain such.

WillAyd · 2025-01-28T00:40:22Z

The Arrow PyCapsule interface does not require PyArrow at all, so you are correct in that it is just an implementation detail of pandas. Other libraries (e.g. polars) do not need PyArrow for this type of interchange.

Maybe the terminology is just getting confused between Arrow PyCapsule Interface and the PyArrow library? I think the former is the interchange API you have in mind, as it standardizes the methods that need to be called as well as the memory management semantics of libraries that want to exchange dataframe-like data.
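To illustrate that the interface is a set of standardized method names plus capsule semantics, rather than a PyArrow API, here is a sketch of a wrapper class that takes part in the protocol without importing PyArrow. The class name `PlotData` is hypothetical. It only forwards the call to whatever object it wraps, and the `requested_schema=None` parameter follows the method signature given in the Arrow PyCapsule Interface specification.

```python
class PlotData:
    """Hypothetical wrapper that re-exports the wrapped object's Arrow stream."""

    def __init__(self, inner):
        # Fail early if the wrapped object cannot export an Arrow stream.
        if not hasattr(inner, "__arrow_c_stream__"):
            raise TypeError("wrapped object does not export an Arrow C stream")
        self._inner = inner

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to the producer; the returned PyCapsule owns the stream,
        # so no PyArrow import is needed in this class.
        return self._inner.__arrow_c_stream__(requested_schema=requested_schema)
```

Any consumer that understands the interface could accept a `PlotData` instance exactly as it would accept the wrapped object.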

Unfortunately, the dataframe API that pandas adopted first is at this point abandonware.

mwaskom · 2025-01-28T01:14:13Z

Again, I really can't stress enough how much I, a maintainer of a data visualization library, don't care about the nuances of converting between different dataframe representations. I just want a black box that can consume something advertising itself as "a dataframe" and give me a pandas.DataFrame in return. That is what I understood the pd.api.interchange.from_dataframe function to provide. If it's doing a better job now because it's using Arrow or PyArrow or whatever, that's great.

MarcoGorelli mentioned this issue Nov 9, 2024

fix: use PyCapsule Interface instead of Dataframe Interchange Protocol #3782

Closed

kylebarron mentioned this issue Nov 21, 2024

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

8 tasks

mwaskom closed this as completed Jan 26, 2025
