What methods does Column have? #107

Closed
MarcoGorelli opened this issue Mar 14, 2023 · 7 comments
@MarcoGorelli
Contributor

Currently, nothing is defined

Just for my understanding, is this because it hasn't been done yet, or because it would be the same as from the Protocol?

https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py

@honno
Member

honno commented Mar 14, 2023

> Just for my understanding, is this because it hasn't been done yet, or because it would be the same as from the Protocol?
>
> https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py

Hasn't been done yet. It's worth noting that the df interchange protocol specifies the columns used in interchange (say df.__dataframe__().get_column(...)), and these don't have to be supported by a library's "top-level" column object (i.e. pd.Series).

@rgommers
Member

> Hasn't been done yet.

Not quite. From gh-50, which is quite relevant here: We had (/have) a pretty strong consensus that there should not be a separate Series-like object, but only a DataFrame object with a single column. IIRC the key issue is that Series and DataFrame have so much API duplication, for little benefit. And that statement goes all the way back to gh-6, which contains a lot of the early discussion. The duplication pandas & co have with essentially duplicate APIs on the dataframe and series objects seemed undesirable to most folks.

That said, when actually implementing the interchange protocol we figured out that it's not 100% practical to not have a Column object at all. So the remaining question is what needs adding to a Column for an API, and where a single-column dataframe will do just fine.

@jorisvandenbossche
Member

While those older GitHub issues indeed discussed not having a separate 1D / column object, I seem to remember that in one of the more recent meetings where we discussed this, we landed on the opposite conclusion?

@MarcoGorelli
Contributor Author

Is it enough to return a 1-column dataframe?

Say someone has the following (very common) pattern in pandas:

In [33]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [34]: mask = df['a'] > 1

In [35]: df.loc[mask, ]
Out[35]:
   a  b
1  2  5
2  3  6

How could this be written with the standard? If Column supported __gt__, say, then the following would work:

df_standard = dataframe_standard(df)
mask = df_standard.get_column_by_name('a') > 1
df_standard.get_rows_by_mask(mask)

Without a Column, or without Column.__gt__, what would the alternative be?
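To make the question concrete, here is a minimal pure-Python sketch of what the snippet above would need from the standard. The classes `Column` and `StandardDataFrame` and their internals are hypothetical stand-ins written for illustration, not the standard's actual definitions; only the method names `get_column_by_name` and `get_rows_by_mask` and the use of `__gt__` come from the snippet.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Column:
    """Hypothetical 1-D column; not the standard's actual definition."""

    values: list[Any]

    def __gt__(self, other: Any) -> Column:
        # Element-wise comparison against a scalar yields a boolean Column.
        return Column([v > other for v in self.values])


@dataclass
class StandardDataFrame:
    """Hypothetical wrapper exposing the two methods used in the snippet."""

    data: dict[str, list[Any]]

    def get_column_by_name(self, name: str) -> Column:
        return Column(list(self.data[name]))

    def get_rows_by_mask(self, mask: Column) -> StandardDataFrame:
        # Keep only the rows where the boolean mask is True.
        return StandardDataFrame(
            {
                name: [v for v, keep in zip(col, mask.values) if keep]
                for name, col in self.data.items()
            }
        )


df_standard = StandardDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
mask = df_standard.get_column_by_name("a") > 1
filtered = df_standard.get_rows_by_mask(mask)
print(filtered.data)  # {'a': [2, 3], 'b': [5, 6]}
```

The point of the sketch is that the mask has to be some 1-D object that supports comparisons; whether that object is a dedicated Column or a 1-column dataframe is exactly what is being asked here.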

@rgommers
Member

rgommers commented Mar 15, 2023

> While those older GitHub issues indeed discussed not having a separate 1D / column object, I seem to remember that in one of the more recent meetings where we discussed this, we landed on the opposite conclusion?

Right, indeed - thank you for finding that discussion. I think we need to update gh-50 with a clear summary of that. The tl;dr would be "so column is array-like and then has special behavior for missing values in reductions (with uniform skip_nulls kwarg across all functions), etc."
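As a rough illustration of that tl;dr, the sketch below shows a column whose reductions share one `skip_nulls` keyword. Only the keyword name comes from the summary above. The class, the choice of `sum` and `mean`, the use of `None` as the null marker, the default `skip_nulls=True`, and returning a null when nulls are not skipped are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    """Hypothetical column; None stands in for a null value."""

    values: list[Optional[float]]

    def _present(self) -> list[float]:
        return [v for v in self.values if v is not None]

    def sum(self, *, skip_nulls: bool = True) -> Optional[float]:
        present = self._present()
        # Assumption: when nulls are not skipped, any null makes the result null.
        if not skip_nulls and len(present) != len(self.values):
            return None
        return sum(present)

    def mean(self, *, skip_nulls: bool = True) -> Optional[float]:
        present = self._present()
        if not skip_nulls and len(present) != len(self.values):
            return None
        if not present:
            # Assumption: the mean of zero non-null values is null.
            return None
        return sum(present) / len(present)


col = Column([1.0, None, 3.0])
print(col.sum())                   # 4.0
print(col.sum(skip_nulls=False))   # None
print(col.mean())                  # 2.0
```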

@rgommers
Member

> I think we need to update gh-50 with a clear summary of that.

This is done now.

I believe this particular question is resolved, since we now have a good collection of methods on Column. So I'll close this issue.

4 participants