
Separate object for a dataframe column? (is Series needed?) #6

Closed
datapythonista opened this issue May 18, 2020 · 39 comments

@datapythonista

Probably a bit early to bring this up, but I think it will need to be discussed eventually.

Is a separate object representing a single column needed? That is, having a Series instead of just using a one-column DataFrame.

Having two separate objects adds, IMO, a decent amount of complexity, both in the implementation and for the user. Whether this complexity is worth it or not, I don't know. But I think it shouldn't be replicated from pandas without a discussion.

@TomAugspurger

Agreed. I think we should focus on tabular (2D) objects.

If there's no disagreement, I recommend that we add this to the list we're collecting in #4.

@TomAugspurger

Question for the group: Is there anyone who wants / needs more background on this topic? Trying to gauge the level of familiarity with pandas.

@datapythonista

I'm certainly happy to add this to #4. The name "deficiencies" in that issue put me off adding it there directly, but I'll add it now.

@maartenbreddels

Some more background would be appreciated.

My thought on this is that we could first say that columns are opaque objects that follow the Array API, if that makes sense.

@TomAugspurger

A small bit of background (Marc can add more).

A Series is a 1-D object with an index. The index of row labels (and perhaps the name) differentiates it from a NumPy ndarray.

They primarily arise in

  • DataFrame indexing when selecting a single column
  • DataFrame reductions (.mean()). In pandas, this returns a Series.
In [2]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

In [3]: df
Out[3]:
   A  B
0  1  3
1  2  4

In [4]: df['A']
Out[4]:
0    1
1    2
Name: A, dtype: int64

In [5]: df.sum()
Out[5]:
A    3
B    7
dtype: int64

As alternatives: Out[4] would return the same type as the input df, i.e. a 2-D dataframe.

Out[5] might be similar to NumPy's reductions with keepdims=True: a 2-D object with a single row, but the same columns as the input (the proper row label for that single row is a bit unclear).
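
For comparison, a minimal NumPy sketch of the keepdims=True behavior being referenced (not a dataframe API, just to show the shape-preserving reduction):

import numpy as np

a = np.array([[1, 3], [2, 4]])

# Default reduction drops a dimension: shape (2,)
a.sum(axis=0)                  # array([3, 7])

# keepdims=True preserves the number of dimensions: shape (1, 2),
# analogous to a one-row dataframe with the same columns as the input.
a.sum(axis=0, keepdims=True)   # array([[3, 7]])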

@datapythonista

Some more background would be appreciated.

From my side, it's more about the user API than about what a column is.

Not sure if the example is very meaningful, but consider these three cases:

>>> df[['col1', 'col2']].sum()
>>> df[['col1']].sum()
>>> df['col1'].sum()

Is it worth having the third option, considering that it requires two separate classes with very similar (but not identical) APIs, with all the implied code complexity and complexity for the user? There is also some extra complexity, like the example that @TomAugspurger mentions.

Think of SQL (even if it's not the same) as an API that doesn't have the concept of "one column".

@devin-petersohn

I agree with @datapythonista here that we should not specify a column abstraction as part of the APIs. Specifying a "column" abstraction would mean that every implementation must have one. It would also mean we either need multiple output types depending on the arguments, like how pandas handles __getitem__ (see @datapythonista's comment), or we would need get_column and get_columns, which is explicit but bloated.

The APIs for a pandas.Series object are very similar to the APIs in pandas.DataFrame, with a few extra numpy-like utilities in pandas.Series that only really make sense on 1-D objects. Likely the API we define for dataframes will be the same as the API for columns, unless we go with what @maartenbreddels suggests, which is the array API.

Perhaps the solution to this is to have an API that explicitly creates a 1-D array?

@maartenbreddels

Perhaps the solution to this is to have an API that explicitly creates a 1-D array?

I think df['x'] very naturally would be the candidate for that, if a DataFrame can be seen as a dict of arrays.

So we could guarantee:

x = df['x']  # or some other API, but this feels natural to me
y = x**2  # we only know that y follows the Array spec
y_numpy = np.asarray(y)  # if we explicitly want it as numpy
y_arrow = pa.array(y)  # if we explicitly want it as arrow array

We could later on also say that this object has additional methods, if we believe that really adds something (say we reach dataframe spec 'level-X' and want a df['x'].value_counts()). I think they can be the same class in an implementation, but they would follow the Array API plus whatever extra API we think is needed, if any. I doubt that having both an API to get an Array and an API to get a 'Column' would add anything for users.

>>> df[['col1', 'col2']].sum()
>>> df[['col1']].sum()
>>> df['col1'].sum()

I think we could 'outsource' whatever the last line does to the Array spec for level-0, while the DataFrame spec would specify the first two.

Think of SQL (even if it's not the same) as an API that doesn't have the concept of "one column".

Actually, I think that's a good distinction between SQL and a DataFrame.

@devin-petersohn

@maartenbreddels Using something like __getitem__ might be problematic for systems that already have an implementation (e.g. pandas). I would suggest that we try our best to avoid reusing APIs that are common in other libraries.

I think we should also be careful about returning objects of different dimensions or types depending on the types of the input parameters. df['a'] vs df[['a']] is perhaps too subtle. It should probably be a bit more explicit, so that deep prior understanding of the API is not required.

Something like this:

# Projections/Selecting a column (maybe not the best API name)
df.project("col1") # dataframe return type
df.project(["col1"]) # dataframe return type
df.project(["col1", "col2"]) # dataframe return type

# Getting an array
df.project("col1").asarray(squeeze=True) # 1D array
df.project(["col1"]).asarray(squeeze=True) # 1D array
df.project("col1").asarray(squeeze=False) # 2D array

The idea is to be explicit rather than implicit.

@amueller

amueller commented May 28, 2020

If we don't have a Series equivalent, how, if at all, would we implement the methods that only make sense on particular types or on 1d objects? Say you want string matching or lower-casing. Do we not want to include them in the API? Or do we want to require them on the dataframe level?

also FYI:

import pandas as pd
print([x for x in dir(pd.Series) if not x.startswith("_") and x not in dir(pd.DataFrame)])
['argmax', 'argmin', 'argsort', 'array', 'autocorr', 'between', 'cat', 'divmod', 'dt', 'dtype', 'factorize', 'hasnans', 'is_monotonic', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'item', 'map', 'name', 'nbytes', 'ravel', 'rdivmod', 'repeat', 'searchsorted', 'str', 'to_frame', 'to_list', 'tolist', 'unique', 'value_counts', 'view']

(str, cat and dt are basically accessor namespaces)

@datapythonista

I would also be careful about using __getitem__. I think the main reason to keep it is to be able to assign to a column or set of columns, df[col] = 1. Not sure whether I would keep that or not, but I would probably avoid having df[condition] in favor of df.where(condition) or df.filter(condition).

Regarding @amueller's comment, my opinion is that operations like .str.lower() should be at the DataFrame level. See these two examples:

df['width'] = df['width'].astype(float)
df[['width', 'length']] = df[['width', 'length']].astype(float)

df['first_name'] = df['first_name'].str.lower()
df[['first_name', 'last_name']] = df[['first_name', 'last_name']].str.lower()  # <- This fails

This behavior feels arbitrary to me; I don't think we should keep it. I think it's much simpler to assume that operations can be applied to N columns. There are a few special cases where this may be tricky, but when a specific number of columns is needed and a different number is selected, raising an exception seems like a good solution.
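
For illustration: the DataFrame-level .str.lower() in the last line above doesn't exist in pandas today, so as a rough stand-in for what the proposed N-column behavior would compute (the column names and values here are just made up):

import pandas as pd

df = pd.DataFrame({"first_name": ["Ada", "Grace"], "last_name": ["Lovelace", "Hopper"]})

# DataFrame has no .str accessor in pandas, so the closest equivalent today
# is applying the string operation column by column.
df[["first_name", "last_name"]] = df[["first_name", "last_name"]].apply(
    lambda col: col.str.lower()
)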

@TomAugspurger

my opinion is that operations like .str.lower() should be at DataFrame level.

Just to confirm, you're proposing we raise an exception when there is a mixture of data types, some of which don't support .str operations? (This overlaps with the discussion in #11 on how to handle various "nuisance" columns.)


Another place this comes up is boolean filtering. NumPy assigns different meanings to 1-D and 2-D boolean masks.

In [3]: a = np.ones((4, 2))

In [4]: m = np.array([True, False, False, True])

In [5]: a[m]
Out[5]:
array([[1., 1.],
       [1., 1.]])

In [6]: a[np.atleast_2d(m)]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-4cd960bb8565> in <module>
----> 1 a[np.atleast_2d(m)]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 1

I think we want to support boolean masking by some condition on a column. Do we require that the mask just be of shape (N, 1)?

@datapythonista

Just to confirm, you're proposing we raise an exception when there are a mixture of data types

Yes, that looks like the most reasonable thing to me. It could be consistent with pandas' current behavior:

>>> df['population'].str.upper()
AttributeError: Can only use .str accessor with string values!

But when allowing it to be applied to N columns, it would be required that all of them are strings.

Of course another option is to ignore the columns, like pandas does in:

>>> df[['name', 'population']].mean()
population    2.729748e+07
dtype: float64

I personally don't like that happening. If I select two columns, I would like to know in advance that I'm getting two values (or an exception). But I guess that also depends on the use case, as discussed in #5. If I'm writing production code with my software engineer hat on, ignoring columns that I selected feels hacky. If I'm in a notebook with a data analyst hat on, it seems reasonable.

I guess a parameter like the ones in the reductions (string_only, numeric_only, bool_only) could be used. But since I think half of the API could require it, maybe it would be better to have an option to control that behavior? Or maybe, instead of df[['name', 'population']].mean(), something of the kind df.select('name', 'population', fail_for_dtype=False).mean() could make sense?

I'll create a new issue for this, since I think it's complex, and as you mentioned, it's the same discussion for reductions.

I think we want to support boolean masking by some condition on a column. Do we require that the mask just be of shape (N, 1)?

That's a very good point. Yes, I think requiring one-column dataframes for filters is probably best.

Maybe there are other options in both cases that can be considered. But IMHO these two solutions would be better (much simpler) than having separate Series and DataFrame structures.

@TomAugspurger

Are people generally OK with "special-casing" (n_rows, 1)-shape dataframes for operations like boolean masking? I can't quite put my finger on why, but it feels a bit strange to me. But it's not a deal-breaker for me.

One other nice thing this side-steps is pandas' awkward behavior when trying to mimic NumPy's broadcasting:

In [2]: df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=['a', 'b', 'c'])

In [3]: df
Out[3]:
   A  B
a  1  3
b  2  4
c  3  5

In [4]: df + df.A
Out[4]:
    A   B   a   b   c
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN

In [5]: df + df[['A']]
Out[5]:
   A   B
a  2 NaN
b  4 NaN
c  6 NaN

@datapythonista

Are people generally OK with "special-casing" (n_rows, 1)-shape dataframes for operations like boolean masking? I can't quite put my finger on why, but it feels a bit strange to me. But it's not a deal-breaker for me.

Just to be clear, it also feels a bit strange to me. But I think there is a trade-off, and how much having a single structure simplifies things, both for the user and in the code base, makes it worth it IMO.

I think boolean masks may require a bit more thinking. Having an internal method to_boolean_mask or check_boolean_mask that is called by all operations that require one could make sense. Or we could consider a specific type for boolean masks, instead of a fully featured Series object, which IMHO is not worth the extra complexity compared to 1-D dataframes.

What are your thoughts on avoiding Series, but considering whether a public boolean mask object makes sense?

@TomAugspurger

TomAugspurger commented Jun 19, 2020 via email

@datapythonista

Can you clarify "internal method"? I was assuming we couldn't really have any of those, since we're only specifying the user-facing API and leaving implementation details up to each library. Are you thinking we could have a small utility library that projects can depend on / vendor, with things like these checks?

Good point. I was talking about a private method, not something I want to include in the API or in a common package; more about what this approach could mean in terms of implementation. I just wanted to point out that I don't think anything more than a function that validates/converts the dataframe is needed. And your second comment is a good point: besides validating that it's 1-D, it's also required to validate that it's the same length as the object being filtered, so I guess it doesn't make a big difference having a 1-D structure that needs to be validated for shape anyway.
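
For illustration, a minimal sketch of such a private validation/conversion helper, written against a pandas-like API (the name _to_boolean_mask and the exact checks are assumptions, not anything agreed here):

import numpy as np

def _to_boolean_mask(df, mask_df):
    """Hypothetical private helper: validate a one-column boolean dataframe
    and return it as a flat mask usable for filtering `df`."""
    n_rows, n_cols = mask_df.shape
    if n_cols != 1:
        raise ValueError("boolean mask must have exactly one column")
    if n_rows != len(df):
        raise ValueError("boolean mask length does not match the dataframe")
    mask = np.asarray(mask_df.iloc[:, 0])
    if mask.dtype != np.bool_:
        raise TypeError("mask column must be of boolean dtype")
    return mask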

@SimonHeybrock

SimonHeybrock commented Sep 6, 2021

Sorry for being late to the party: my comment #59 (comment) is related to this. Similar to xarray, scipp supports what we can think of as a higher-dimensional generalization of Series (DataArray), with more explicit indices (coordinates). I think that when working with higher-dimensional data, working with these Series-like objects is more common than working with a "dict-of-series" (DataFrame).

I also note some discussion about mask-columns above. Scipp does this explicitly by attaching not just a coords dict to an array of data values, but also a masks dict. See https://scipp.github.io/user-guide/masking.html for an introduction to the concept.

If there is interest I'd be happy to talk in more detail about these aspects.

@rgommers

rgommers commented Sep 6, 2021

It looks to me like there may be a need for a separate column object (in the end that's what we needed in the interchange protocol as well); however, the trick will then be to figure out how to add that without having a large set of duplicate APIs between column and dataframe objects.

@jbrockmendel

What would df.iloc[n] give in this scenario? It might avoid casting in mixed-dtype cases, which could be nice.

@rgommers

@jbrockmendel in what scenario? What @SimonHeybrock said about scipp, or the simpler case of with/without a column object?

@jbrockmendel

in what scenario?

df = pd.DataFrame({"A": [1], "B": [pd.Timestamp.now()]})

ATM df.iloc[0] gives an object-dtype Series. IIUC, in the proposal here df.iloc[0] would just return a view on df, so it would not do any casting.
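
For reference, the casting behavior being described, as pandas behaves today:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1], "B": [pd.Timestamp.now()]})

In [3]: df.iloc[0].dtype      # row selection upcasts mixed dtypes to object
Out[3]: dtype('O')

In [4]: df.iloc[[0]].dtypes   # a one-row DataFrame keeps per-column dtypes
Out[4]:
A             int64
B    datetime64[ns]
dtype: object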

@jbrockmendel

Is there a .dtypes? And if so, what does it return if there is no Series?

@kkraus14

Is there a .dtypes? and if so, what does it return if there is no Series?

Please do not return a dataframe column for this information. This is a source of inconsistency for C++- and GPU-backed dataframe implementations, where the dtypes are usually stored in the host-side structure. In cuDF today, for example, if someone accesses .dtypes it returns a pandas Series rather than a cuDF Series, since it doesn't make sense for that information to be copied to the GPU.

@rgommers

rgommers commented Sep 1, 2022

A standard Python tuple or other container would probably make more sense than a Series object, but those also can't live on a GPU (and anyway, dtype is already stored on the host)? @kkraus14 is the API inherently problematic, or are you fine with it being a tuple and there needing to be synchronizations (I think?) when dtypes are used in control flow for example?

@jbrockmendel

What if we had a Series but required very little of it API-wise? Really just enough so that frame.dtypes[col] or frame.max()[col] makes sense.

@rgommers

rgommers commented Sep 9, 2022

Isn't frame.dtypes[col] better/equivalently spelled as frame[col].dtype?

For frame.max()[col] I may be missing your point, can you elaborate? It's a reduction; any reduction is still giving a dataframe back, so why is this different from frame[col]?

@jbrockmendel

Isn't frame.dtypes[col] better/equivalently spelled as frame[col].dtype?

That would require a discussion about what type frame[col] returns, which I'm trying to side-step for now.

For frame.max()[col] I may be missing your point, can you elaborate? It's a reduction; any reduction is still giving a dataframe back, so why is this different from frame[col]?

It isn't obvious to me that any reduction gives a dataframe back.

@kkraus14

@kkraus14 is the API inherently problematic, or are you fine with it being a tuple and there needing to be synchronizations (I think?) when dtypes are used in control flow for example?

A tuple is good. Sorry, my point was that the information is already on the host in Python objects, and you don't want to try to shove it into columns that may or may not be backed by device memory.

@jorisvandenbossche

If we want to consider this, I think it would be good if someone could investigate and list the potential impact on broadcasting and scalar behaviour in various situations, so we can think through those cases.

Boolean filtering was already mentioned above (in numpy, boolean indexing with a 2D array instead of 1D doesn't work as you would expect for dataframes), but two other cases I am currently thinking of:

  • Series is 1d, which impacts broadcasting rules in operations. Now, since standard broadcasting tries to match on the last dimension, in our case the columns, that might not be too problematic? (df + s also already matches on columns of df)
    • However, pandas complicates things doing column name alignment, which is another discussion, but might impact this one as well
    • If something like df.mean() gives a (1, n)-shaped DataFrame, then df / df.mean() would work according to numpy broadcasting rules and column label alignment, but in pandas we also align on row labels. So for pandas this would only work if we implement optional indexes, so that a reduction can give a length-1 DataFrame without a row index (see the sketch after this list).
  • Related to the frame.max()[col] or frame[col].max() that Brock mentioned above: under the "everything DataFrame" idea, df[col].max() would give a 1-row 1-col DataFrame as result instead of a scalar? How would such a dataframe behave as a scalar? Or would the user be expected to explicitly convert this into a scalar?
    • For example, a user doing if df["col"].max() > 0: ..., how would this work?
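
A small pandas illustration of the row-label alignment issue mentioned above (the .to_frame().T step is only there to simulate a (1, n)-shaped reduction result; it is not a proposed API):

import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Today: df.mean() is a Series indexed by column labels, so this aligns on
# columns and works as expected.
df / df.mean()

# Simulating a (1, n)-shaped DataFrame as the reduction result:
mean_2d = df.mean().to_frame().T   # index [0], columns A and B

# pandas also aligns on row labels, so only the row labelled 0 gets values;
# every other row becomes NaN.
df / mean_2d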

@shwina

shwina commented Sep 13, 2022

Are we opposed to df["col"] returning a Column (instead of an indexed (named) "Series")?

@rgommers

@shwina I think to answer that, we should have some idea of the methods on a new Column. I believe the main issue with Series was/is the large amount of duplication in methods/properties between it and DataFrame.

@jbrockmendel

I believe the main issue with Series was/is the large amount of duplication in methods/properties between it and DataFrame.

I don't understand why that's a problem, especially if a lot of code can be shared.

@shwina

shwina commented Sep 13, 2022

@rgommers - yup, I totally understand we'd need to go through the work of defining an API for Column separate from DataFrame, but ultimately I really do believe that would be the best user experience.

I believe the main issue with Series was/is the large amount of duplication in methods/properties between it and DataFrame.

I think the major issue with Series is that it needs to be both array-like and DataFrame-like at the same time. What we're calling Column here would be devoid of DataFrame semantics. Unfortunately, it can't be a true data-apis compliant ndarray either, as the latter does not support all the data types we want to support here.

@rgommers

I don't understand why that's a problem, especially if a lot of code can be shared.

API surface size does matter, even if there are no differences in signatures or semantics (and I'm not sure that's even true?), IMHO: from a user perspective, a library maintainer perspective, and from a "this is a lot of work for us to standardize for little gain" perspective. That said, @shwina's point below is more important.

I think the major issue with Series is that it needs to be both array-like and DataFrame-like at the same time.

Good point - yes, that sounds about right to me, that is the more important conceptual issue.

Unfortunately, it can't be a true data-apis compliant ndarray either, as the latter does not support all the data types we want to support here.

There is nothing in the array API standard that says one cannot support extra dtypes, or add some extra functions/methods to support such dtypes. I think what you are getting at is more of an implementation issue than an API one. Do you see anything API-wise that would not work? The main thing I can think of is the semantics for nullable dtypes, which are unspecified.

Making Column adhere to the array API standard is related to gh-50. It would have a lot of benefits, like making it easy to support column input in libraries like SciPy.

@jbrockmendel

My impression from today's call is that we are leaning towards having a Column object that would be 1D but not have a .index or .name, making it more array-like than Series-like. There was discussion of missing-handling that sounded like that would distinguish it from an array-api object, but I didn't totally follow.

@rgommers

There was discussion of missing-handling that sounded like that would distinguish it from an array-api object, but I didn't totally follow.

I'm not sure it would necessarily prevent following the array API standard. However, when missing data is present, any functions that are not element-wise will have to have their semantics specified. That is outside the scope of the array API standard (because no array implementations actually support missing data). It's a tractable problem though; there are not that many functions that need extra semantics. I think these are:

  • reductions (max, min, mean, prod, std, sum, var)
  • sorting, set and search functions (sort, argsort, unique*, argmax, argmin, nonzero, where)
  • all, any
  • linear algebra functions
  • Fourier transform functions

@rgommers

rgommers commented Sep 29, 2022

My sense is that for reductions, sort/set/search and any/all, the missing value semantics are mostly "ignore". Plus some corner cases like:

  • if all values in a column being reduced over are NA, the result is NA (?),
  • for sorting the output size should equal the input size, so missing values should be sorted to the beginning or end of the column (similar to nan handling)

And linear algebra and Fourier transforms should probably simply raise exceptions.

I'm curious if that kind of approach would work. EDIT: that would basically address gh-50.
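
For illustration only, here is how those "ignore" semantics look with pandas' nullable dtypes (pandas is just a stand-in; the spec wording would be implementation-neutral):

import pandas as pd

s = pd.Series([1.0, None, 3.0], dtype="Float64")

# Reductions skip missing values ("ignore" semantics).
s.sum()    # 4.0
s.mean()   # 2.0

# If every value being reduced over is missing, the result is missing.
pd.Series([None, None], dtype="Float64").sum(min_count=1)   # <NA>

# Sorting keeps the output the same length as the input and pushes missing
# values to the end (or the beginning, if requested).
s.sort_values(na_position="last")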

@MarcoGorelli

We've gone ahead with adding Column to the spec, so I think this can now be closed, right?

Closing then, but please do let me know if I've misunderstood
