
Separate object for a dataframe column? (is Series needed?) #6

Closed
datapythonista opened this issue May 18, 2020 · 39 comments

@datapythonista

Probably a bit early to bring this up, but I think it will need to be discussed eventually.

Is a separate object representing a single column needed? That is, having a Series instead of just using a one-column DataFrame.

Having two separate objects adds, IMO, a decent amount of complexity, both in the implementation and for the user. Whether this complexity is worth it or not, I don't know. But I think it shouldn't be replicated from pandas without a discussion.

@TomAugspurger

Agreed. I think we should focus on tabular (2D) objects.

If there's no disagreement, I recommend that we add this to the list we're collecting in #4.

@TomAugspurger

Question for the group: Is there anyone who wants / needs more background on this topic? Trying to gauge the level of familiarity with pandas.

@datapythonista

I'm certainly happy to add this to #4. The name "deficiencies" in that issue put me off adding it there directly, but I'll add it now.

@maartenbreddels

Some more background would be appreciated.

My thought on this is that we could first say that columns are opaque objects that follow the Array API, if that makes sense.

@TomAugspurger

A small bit of background (Marc can add more).

A Series is a 1-D object with an index. The index of row labels (and perhaps the name) differentiates it from a NumPy ndarray.

They primarily arise in

  • DataFrame indexing when selecting a single column
  • DataFrame reductions (.mean()). In pandas, this returns a Series.
In [2]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

In [3]: df
Out[3]:
   A  B
0  1  3
1  2  4

In [4]: df['A']
Out[4]:
0    1
1    2
Name: A, dtype: int64

In [5]: df.sum()
Out[5]:
A    3
B    7
dtype: int64

As alternatives: Out[4] would return the same type as the input df, i.e. a 2-D dataframe.

Out[5] might be similar to NumPy's reductions with keepdims=True: a 2-D object with a single row, but the same columns as the input (the proper row label for that single row is a bit unclear).
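
For comparison, a minimal NumPy sketch of the keepdims=True behavior being referenced (not a dataframe API, just to show the shape-preserving reduction):

import numpy as np

a = np.array([[1, 3], [2, 4]])

# Default reduction drops a dimension: shape (2,)
a.sum(axis=0)                  # array([3, 7])

# keepdims=True preserves the number of dimensions: shape (1, 2),
# analogous to a one-row dataframe with the same columns as the input.
a.sum(axis=0, keepdims=True)   # array([[3, 7]])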

@datapythonista

Some more background would be appreciated.

From my side, it's more about the user API than about what a column is.

Not sure if the example is very meaningful, but consider these three cases:

>>> df[['col1', 'col2']].sum()
>>> df[['col1']].sum()
>>> df['col1'].sum()

Is it worth having the third option, considering that it requires two separate classes with very similar (but not identical) APIs, with all the implied code complexity and complexity for the user? There is also some extra complexity, like the example that @TomAugspurger mentions.

Think of SQL (even if it's not the same) as an API that doesn't have the concept of "one column".

@devin-petersohn

I agree with @datapythonista here that we should not specify a column abstraction as part of the APIs. Specifying a "column" abstraction would mean that every implementation must have one. It would also mean we either need multiple output types depending on the arguments, like how pandas handles __getitem__ (see @datapythonista's comment), or we would need get_column and get_columns, which is explicit but bloated.

The APIs for a pandas.Series object are very similar to the APIs in pandas.DataFrame, with a few extra numpy-like utilities in pandas.Series that only really make sense on 1-D objects. Likely the API we define for dataframes will be the same as the API for columns, unless we go with what @maartenbreddels suggests, which is the array API.

Perhaps the solution to this is to have an API that explicitly creates a 1-D array?

@maartenbreddels

Perhaps the solution to this is to have an API that explicitly creates a 1-D array?

I think df['x'] very naturally would be the candidate for that, if a DataFrame can be seen as a dict of arrays.

So we could guarantee:

x = df['x']  # or some other API, but this feels natural to me
y = x**2  # we only know that y follows the Array spec
y_numpy = np.asarray(y)  # if we explicitly want it as numpy
y_arrow = pa.array(y)  # if we explicitly want it as arrow array

We could later on also say that this object has additional methods, if we believe that really adds something (say we reach dataframe spec 'level-X' and want a df['x'].value_counts()). I think they can be the same class in an implementation, but they would follow the Array API plus whatever extra API we think is needed, if any. I doubt that having both an API to get an Array and an API to get a 'Column' would add anything for users.

>>> df[['col1', 'col2']].sum()
>>> df[['col1']].sum()
>>> df['col1'].sum()

I think we could 'outsource' whatever the last line does to the Array spec for level-0, while the DataFrame spec would specify the first two.

Think of SQL (even if it's not the same) as an API that doesn't have the concept of "one column".

Actually, I think that's a good distinction between SQL and a DataFrame.

@devin-petersohn

@maartenbreddels Using something like __getitem__ might be problematic for systems that already have an implementation (e.g. pandas). I would suggest that we try our best to avoid reusing APIs that are common in other libraries.

I think we should also be careful about returning objects of different dimensions or types depending on the types of the input parameters. df['a'] vs df[['a']] is perhaps too subtle. It should probably be a bit more explicit, so that deep prior understanding of the API is not required.

Something like this:

# Projections/Selecting a column (maybe not the best API name)
df.project("col1") # dataframe return type
df.project(["col1"]) # dataframe return type
df.project(["col1", "col2"]) # dataframe return type

# Getting an array
df.project("col1").asarray(squeeze=True) # 1D array
df.project(["col1"]).asarray(squeeze=True) # 1D array
df.project("col1").asarray(squeeze=False) # 2D array

The idea is to be explicit rather than implicit.

@amueller

amueller commented May 28, 2020

If we don't have a Series equivalent, how, if at all, would we implement the methods that only make sense on particular types or on 1d objects? Say you want string matching or lower-casing. Do we not want to include them in the API? Or do we want to require them on the dataframe level?

also FYI:

import pandas as pd
print([x for x in dir(pd.Series) if not x.startswith("_") and x not in dir(pd.DataFrame)])
['argmax', 'argmin', 'argsort', 'array', 'autocorr', 'between', 'cat', 'divmod', 'dt', 'dtype', 'factorize', 'hasnans', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'item', 'map', 'name', 'nbytes', 'ravel', 'rdivmod', 'repeat', 'searchsorted', 'str', 'to_frame', 'to_list', 'tolist', 'unique', 'value_counts', 'view']

(str, cat and dt are basically accessor namespaces)

@datapythonista

I would also be careful about using __getitem__. I think the main reason to keep it is to be able to assign to a column or set of columns, df[col] = 1. Not sure whether I would keep that or not, but I would probably avoid having df[condition] in favor of df.where(condition) or df.filter(condition).

Regarding @amueller's comment, my opinion is that operations like .str.lower() should be at the DataFrame level. See these two examples:

df['width'] = df['width'].astype(float)
df[['width', 'length']] = df[['width', 'length']].astype(float)

df['first_name'] = df['first_name'].str.lower()
df[['first_name', 'last_name']] = df[['first_name', 'last_name']].str.lower()  # <- This fails

This behavior feels arbitrary to me; I don't think we should keep it. I think it's much simpler to assume that operations can be applied to N columns. There are a few special cases where this may be tricky, but when a specific number of columns is needed and a different number is selected, raising an exception seems like a good solution.
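
For illustration: the DataFrame-level .str.lower() in the last line above doesn't exist in pandas today, so as a rough stand-in for what the proposed N-column behavior would compute (the column names and values here are just made up):

import pandas as pd

df = pd.DataFrame({"first_name": ["Ada", "Grace"], "last_name": ["Lovelace", "Hopper"]})

# DataFrame has no .str accessor in pandas, so the closest equivalent today
# is applying the string operation column by column.
df[["first_name", "last_name"]] = df[["first_name", "last_name"]].apply(
    lambda col: col.str.lower()
)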

@TomAugspurger

my opinion is that operations like .str.lower() should be at DataFrame level.

Just to confirm, you're proposing we raise an exception when there is a mixture of data types, some of which don't support .str operations? (This overlaps with the discussion in #11 on how to handle various "nuisance" columns.)


Another place this comes up is boolean filtering. NumPy assigns different meanings to 1-D and 2-D boolean masks.

In [3]: a = np.ones((4, 2))

In [4]: m = np.array([True, False, False, True])

In [5]: a[m]
Out[5]:
array([[1., 1.],
       [1., 1.]])

In [6]: a[np.atleast_2d(m)]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-4cd960bb8565> in <module>
----> 1 a[np.atleast_2d(m)]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 1

I think we want to support boolean masking by some condition on a column. Do we require that the mask just be of shape (N, 1)?

@datapythonista

Just to confirm, you're proposing we raise an exception when there are a mixture of data types

Yes, that looks like the most reasonable thing to me. It could be consistent with pandas' current behavior:

>>> df['population'].str.upper()
AttributeError: Can only use .str accessor with string values!

But when allowing it to be applied to N columns, it would be required that all of them are strings.

Of course another option is to ignore the columns, like pandas does in:

>>> df[['name', 'population']].mean()
population    2.729748e+07
dtype: float64

I personally don't like that happening. If I select two columns, I would like to know in advance that I'm getting two values (or an exception). But I guess that also depends on the use case, as discussed in #5. If I'm writing production code with my software engineer hat on, ignoring columns that I selected feels hacky. If I'm in a notebook with a data analyst hat on, it seems reasonable.

I guess a parameter like the ones in the reductions (string_only, numeric_only, bool_only) could be used. But since I think half of the API could require it, maybe it would be better to have an option to control that behavior? Or maybe, instead of df[['name', 'population']].mean(), something of the kind df.select('name', 'population', fail_for_dtype=False).mean() could make sense?

I'll create a new issue for this, since I think it's complex, and as you mentioned, it's the same discussion for reductions.

I think we want to support boolean masking by some condition on a column. Do we require that the mask just be of shape (N, 1)?

That's a very good point. Yes, I think requiring one-column dataframes for filters is probably best.

Maybe there are other options in both cases that can be considered. But IMHO these two solutions would be better (much simpler) than having separate Series and DataFrame structures.

@TomAugspurger

Are people generally OK with "special-casing" (n_rows, 1)-shape dataframes for operations like boolean masking? I can't quite put my finger on why, but it feels a bit strange to me. But it's not a deal-breaker for me.

One other nice thing this side-steps is pandas' awkward behavior when trying to mimic NumPy's broadcasting:

In [2]: df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=['a', 'b', 'c'])

In [3]: df
Out[3]:
   A  B
a  1  3
b  2  4
c  3  5

In [4]: df + df.A
Out[4]:
    A   B   a   b   c
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN

In [5]: df + df[['A']]
Out[5]:
   A   B
a  2 NaN
b  4 NaN
c  6 NaN

@datapythonista

Are people generally OK with "special-casing" (n_rows, 1)-shape dataframes for operations like boolean masking? I can't quite put my finger on why, but it feels a bit strange to me. But it's not a deal-breaker for me.

Just to be clear, it also feels a bit strange to me. But I think there is a trade-off, and how much having a single structure simplifies things, both for the user and in the code base, makes it worth it IMO.

I think boolean masks may require a bit more thinking. Having an internal method to_boolean_mask or check_boolean_mask that is called by all operations that require one could make sense. Or we could consider a specific type for boolean masks, instead of a fully featured Series object, which IMHO is not worth the extra complexity compared to 1-D dataframes.

What are your thoughts on avoiding Series, but considering whether a public boolean mask object makes sense?

@TomAugspurger

TomAugspurger commented Jun 19, 2020 via email

@datapythonista

Can you clarify "internal method"? I was assuming we couldn't really have any of those, since we're only specifying the user-facing API and leaving implementation details up to each library. Are you thinking we could have a small utility library that projects can depend on / vendor, with things like these checks?

Good point. I was talking about a private method, not something I want to include in the API or in a common package; more about what this approach could mean in terms of implementation. I just wanted to point out that I don't think anything more than a function that validates/converts the dataframe is needed. And your second comment is a good point: besides validating that it's 1-D, it's also required to validate that it's the same length as the object being filtered, so I guess it doesn't make a big difference having a 1-D structure that needs to be validated for shape anyway.
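
For illustration, a minimal sketch of such a private validation/conversion helper, written against a pandas-like API (the name _to_boolean_mask and the exact checks are assumptions, not anything agreed here):

import numpy as np

def _to_boolean_mask(df, mask_df):
    """Hypothetical private helper: validate a one-column boolean dataframe
    and return it as a flat mask usable for filtering `df`."""
    n_rows, n_cols = mask_df.shape
    if n_cols != 1:
        raise ValueError("boolean mask must have exactly one column")
    if n_rows != len(df):
        raise ValueError("boolean mask length does not match the dataframe")
    mask = np.asarray(mask_df.iloc[:, 0])
    if mask.dtype != np.bool_:
        raise TypeError("mask column must be of boolean dtype")
    return mask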

@SimonHeybrock

SimonHeybrock commented Sep 6, 2021

Sorry for being late to the party: my comment #59 (comment) is related to this. Similar to xarray, scipp supports what we can think of as a higher-dimensional generalization of Series (DataArray), with more explicit indices (coordinates). I think that when working with higher-dimensional data, working with these Series-like objects is more common than working with a "dict-of-series" (DataFrame).

I also note some discussion about mask-columns above. Scipp does this explicitly by attaching not just a coords dict to an array of data values, but also a masks dict. See https://scipp.github.io/user-guide/masking.html for an introduction to the concept.

If there is interest I'd be happy to talk in more detail about these aspects.

@rgommers

rgommers commented Sep 6, 2021

It looks to me like there may be a need for a separate column object (in the end that's what we needed in the interchange protocol as well); however, the trick will then be to figure out how to add that without having a large set of duplicate APIs between column and dataframe objects.

@jbrockmendel

What would df.iloc[n] give in this scenario? It might avoid casting in mixed-dtype cases, which could be nice.

@rgommers

@jbrockmendel in what scenario? What @SimonHeybrock said about scipp, or the simpler case of with/without a column object?

@jbrockmendel

in what scenario?

df = pd.DataFrame({"A": [1], "B": [pd.Timestamp.now()]})

ATM df.iloc[0] gives an object-dtype Series. IIUC, in the proposal here df.iloc[0] would just return a view on df, so it would not do any casting.
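
For reference, the casting behavior being described, as pandas behaves today:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1], "B": [pd.Timestamp.now()]})

In [3]: df.iloc[0].dtype      # row selection upcasts mixed dtypes to object
Out[3]: dtype('O')

In [4]: df.iloc[[0]].dtypes   # a one-row DataFrame keeps per-column dtypes
Out[4]:
A             int64
B    datetime64[ns]
dtype: object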

@jbrockmendel

Is there a .dtypes? And if so, what does it return if there is no Series?

@kkraus14

Is there a .dtypes? and if so, what does it return if there is no Series?

Please do not return a dataframe column for this information. This is a source of inconsistency for C++- and GPU-backed dataframe implementations, where the dtypes are usually stored in the host-side structure. In cuDF today, for example, if someone accesses .dtypes it returns a pandas Series rather than a cuDF Series, since it doesn't make sense for that information to be copied to the GPU.

@rgommers

rgommers commented Sep 1, 2022

A standard Python tuple or other container would probably make more sense than a Series object, but those also can't live on a GPU (and anyway, dtype is already stored on the host)? @kkraus14 is the API inherently problematic, or are you fine with it being a tuple and there needing to be synchronizations (I think?) when dtypes are used in control flow for example?

@jbrockmendel

What if we had a Series but required very little of it API-wise? Really just enough so that frame.dtypes[col] or frame.max()[col] makes sense.

@rgommers

rgommers commented Sep 9, 2022

Isn't frame.dtypes[col] better/equivalently spelled as frame[col].dtype?

For frame.max()[col] I may be missing your point, can you elaborate? It's a reduction; any reduction is still giving a dataframe back, so why is this different from frame[col]?

@jbrockmendel

Isn't frame.dtypes[col] better/equivalently spelled as frame[col].dtype?

That would require a discussion about what type frame[col] returns, which I'm trying to side-step for now.

For frame.max()[col] I may be missing your point, can you elaborate? It's a reduction; any reduction is still giving a dataframe back, so why is this different from frame[col]?

It isn't obvious to me that any reduction gives a dataframe back.

@kkraus14

@kkraus14 is the API inherently problematic, or are you fine with it being a tuple and there needing to be synchronizations (I think?) when dtypes are used in control flow for example?

A tuple is good. Sorry, my point was that the information is already on the host in Python objects, and you don't want to try to shove it into columns that may or may not be backed by device memory.

@jorisvandenbossche

If we want to consider this, I think it would be good if someone could investigate and list the potential impact on broadcasting and scalar behaviour in various situations, so we can think through those cases.

Boolean filtering was already mentioned above (in numpy, boolean indexing with a 2D array instead of 1D doesn't work as you would expect for dataframes), but two other cases I am currently thinking of:

  • Series is 1d, which impacts broadcasting rules in operations. Now, since standard broadcasting tries to match on the last dimension, in our case the columns, that might not be too problematic? (df + s also already matches on columns of df)
    • However, pandas complicates things doing column name alignment, which is another discussion, but might impact this one as well
    • If something like df.mean() gives a (1, n)-shaped DataFrame, then df / df.mean() would work according to numpy broadcasting rules and column label alignment, but in pandas we also align on row labels. So for pandas this would only work if we implement optional indexes, so that a reduction can give a length-1 DataFrame without a row index (see the sketch after this list).
  • Related to the frame.max()[col] or frame[col].max() that Brock mentioned above: under the "everything DataFrame" idea, df[col].max() would give a 1-row 1-col DataFrame as result instead of a scalar? How would such a dataframe behave as a scalar? Or would the user be expected to explicitly convert this into a scalar?
    • For example, a user doing if df["col"].max() > 0: ..., how would this work?
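
A small pandas illustration of the row-label alignment issue mentioned above (the .to_frame().T step is only there to simulate a (1, n)-shaped reduction result; it is not a proposed API):

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Today: df.mean() is a Series indexed by column labels, so this aligns on
# columns and works as expected.
df / df.mean()

# Simulating a (1, n)-shaped DataFrame as the reduction result:
mean_2d = df.mean().to_frame().T   # index [0], columns A and B

# pandas also aligns on row labels, so only the row labelled 0 gets values;
# every other row becomes NaN.
df / mean_2d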

@shwina

shwina commented Sep 13, 2022

Are we opposed to df["col"] returning a Column (instead of an indexed (named) "Series")?

@rgommers

@shwina I think to answer that, we should have some idea of the methods on a new Column. I believe the main issue with Series was/is the large amount of duplication in methods/properties between it and DataFrame.

@jbrockmendel

I believe the main issue with Series was/is the large amount of duplication in methods/properties between it and DataFrame.

I don't understand why that's a problem, especially if a lot of code can be shared.

@shwina

shwina commented Sep 13, 2022

@rgommers - yup, I totally understand we'd need to go through the work of defining an API for Column separate from DataFrame, but ultimately I really do believe that would be the best user experience.

I believe the main issue with Series was/is the large amount of duplication in methods/properties between it and DataFrame.

I think the major issue with Series is that it needs to be both array-like and DataFrame-like at the same time. What we're calling Column here would be devoid of DataFrame semantics. Unfortunately, it can't be a true data-apis compliant ndarray either, as the latter does not support all the data types we want to support here.

@rgommers

I don't understand why that's a problem, especially if a lot of code can be shared.

API surface size does matter, even if there are no differences in signatures or semantics (and I'm not sure that's even true?), IMHO: from a user perspective, a library maintainer perspective, and from a "this is a lot of work for us to standardize for little gain" perspective. That said, @shwina's point below is more important.

I think the major issue with Series is that it needs to be both array-like and DataFrame-like at the same time.

Good point - yes, that sounds about right to me, that is the more important conceptual issue.

Unfortunately, it can't be a true data-apis compliant ndarray either, as the latter does not support all the data types we want to support here.

There is nothing in the array API standard that says one cannot support extra dtypes, or add some extra functions/methods to support such dtypes. I think what you are getting at is more of an implementation issue than an API one. Do you see anything API-wise that would not work? The main thing I can think of is the semantics for nullable dtypes, which are unspecified.

Making Column adhere to the array API standard is related to gh-50. It would have a lot of benefits, like making it easy to support column input in libraries like SciPy.

@jbrockmendel

My impression from today's call is that we are leaning towards having a Column object that would be 1D but not have a .index or .name, making it more array-like than Series-like. There was discussion of missing-handling that sounded like that would distinguish it from an array-api object, but I didn't totally follow.

@rgommers

There was discussion of missing-handling that sounded like that would distinguish it from an array-api object, but I didn't totally follow.

I'm not sure it would necessarily prevent following the array API standard. However, when missing data is present, any functions that are not element-wise will have to have their semantics specified. That is outside the scope of the array API standard (because no array implementations actually support missing data). It's a tractable problem though; there are not that many functions that need extra semantics. I think these are:

  • reductions (max, min, mean, prod, std, sum, var)
  • sorting, set and search functions (sort, argsort, unique*, argmax, argmin, nonzero, where)
  • all, any
  • linear algebra functions
  • Fourier transform functions

@rgommers

rgommers commented Sep 29, 2022

My sense is that for reductions, sort/set/search and any/all, the missing value semantics are mostly "ignore". Plus some corner cases like:

  • if all values in a column being reduced over are NA, the result is NA (?),
  • for sorting the output size should equal the input size, so missing values should be sorted to the beginning or end of the column (similar to nan handling)

And linear algebra and Fourier transforms should probably simply raise exceptions.

I'm curious if that kind of approach would work. EDIT: that would basically address gh-50.
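
For illustration only, here is how those "ignore" semantics look with pandas' nullable dtypes (pandas is just a stand-in; the spec wording would be implementation-neutral):

import pandas as pd

s = pd.Series([1.0, None, 3.0], dtype="Float64")

# Reductions skip missing values ("ignore" semantics).
s.sum()    # 4.0
s.mean()   # 2.0

# If every value being reduced over is missing, the result is missing.
pd.Series([None, None], dtype="Float64").sum(min_count=1)   # <NA>

# Sorting keeps the output the same length as the input and pushes missing
# values to the end (or the beginning, if requested).
s.sort_values(na_position="last")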

@MarcoGorelli

We've gone ahead with adding Column to the spec, so I think this can now be closed, right?

Closing then, but please do let me know if I've misunderstood
