Separate object for a dataframe column? (is Series needed?) #6
Agreed. I think we should focus on tabular (2D) objects. If there's no disagreement, I recommend that we add this to the list we're collecting in #4. |
Question for the group: Is there anyone who wants / needs more background on this topic? Trying to gauge the level of familiarity with pandas. |
I'm surely happy to add this to #4. The name "deficiencies" in the issue intimidated me from adding it there directly. But will add it now. |
Some more background would be appreciated. My thoughts on this are that we can first say that the columns are opaque objects that follow the Array-API, if that makes sense. |
A small bit of background (Marc can add more). Series are 1-D objects with an index. The index of row labels (and perhaps a name) differentiates them from a NumPy ndarray. They primarily arise when selecting a single column and as the output of reductions:

```python
In [2]: df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

In [3]: df
Out[3]:
   A  B
0  1  3
1  2  4

In [4]: df['A']
Out[4]:
0    1
1    2
Name: A, dtype: int64

In [5]: df.sum()
Out[5]:
A    3
B    7
dtype: int64
```

As alternatives, these operations could return one-column dataframes instead.
|
From my side, it's more about the user API than about what a column is. Not sure if the example is very meaningful, but consider the three cases:

```python
>>> df[['col1', 'col2']].sum()
>>> df[['col1']].sum()
>>> df['col1'].sum()
```

Is it worth having the third option, considering that it requires having two separate classes, with very similar (but not identical) APIs, with all the implied code complexity, and complexity for the user? Also some extra complexity, like the example that @TomAugspurger mentions. Think of SQL (even if it's not the same) as an API that doesn't have the concept of "one column". |
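For reference, here is what pandas itself returns today in each of those three cases (a runnable illustration; the tiny example frame is my own, not from the original comment):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print(df[["col1", "col2"]].sum())  # Series with two entries (col1: 3, col2: 7)
print(df[["col1"]].sum())          # Series with one entry (col1: 3)
print(df["col1"].sum())            # a plain scalar: 3
```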
I agree with @datapythonista here in that we should not specify a column abstraction as part of the APIs. Specifying a "column" abstraction would mean that every implementation must have a column abstraction. It would also mean we either need multiple output types depending on the arguments (like how pandas handles `df['col']` vs `df[['col']]`), or separate, near-duplicate APIs for a column and for a dataframe. Perhaps the solution to this is to have an API that explicitly creates a 1-D array? |
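To make the "multiple output types" point concrete: in pandas, the same label selects objects of different types depending only on how it is wrapped (my own runnable example):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

print(type(df["col1"]))    # <class 'pandas.core.series.Series'>    -> 1-D
print(type(df[["col1"]]))  # <class 'pandas.core.frame.DataFrame'>  -> 2-D
```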
I think a column could be an opaque object that follows the Array spec. So we could guarantee:

```python
x = df['x']              # or some other API, but this feels natural to me
y = x**2                 # we only know that y follows the Array spec
y_numpy = np.asarray(y)  # if we explicitly want it as numpy
y_arrow = pa.array(y)    # if we explicitly want it as arrow array
```

We could later on also say that this object has additional methods if we believe it really adds something (say we reach dataframe spec 'level-X' and want a richer column API). I think we could 'outsource' whatever the last line does to the Array spec for level-0, while the DataFrame spec would specify the first two.

I think actually it's a good distinction between SQL and a DataFrame. |
@maartenbreddels Using something like that could work, but I think we should also be careful about returning different dimensions or types of objects given different types of the input parameters. Something like this:

```python
# Projections / selecting a column (maybe not the best API name)
df.project("col1")            # dataframe return type
df.project(["col1"])          # dataframe return type
df.project(["col1", "col2"])  # dataframe return type

# Getting an array
df.project("col1").asarray(squeeze=True)    # 1D array
df.project(["col1"]).asarray(squeeze=True)  # 1D array
df.project("col1").asarray(squeeze=False)   # 2D array
```

The idea is to be explicit rather than implicit. |
If we don't have a Series equivalent, how, if at all, would we implement the methods that only make sense on particular types or on 1-D objects? Say you want string matching or lower-casing. Do we not want to include them in the API? Or do we want to require them on the dataframe level?

Also FYI:

```python
import pandas as pd
print([x for x in dir(pd.Series) if not x.startswith("_") and x not in dir(pd.DataFrame)])
```

(`str`, `cat` and `dt` are basically namespaces) |
I also would be careful with that. Regarding @amueller's comment, my opinion is that operations should work the same whether they are applied to one column or to N columns. In pandas today:

```python
df['width'] = df['width'].astype(float)
df[['width', 'length']] = df[['width', 'length']].astype(float)

df['first_name'] = df['first_name'].str.lower()
df[['first_name', 'last_name']] = df[['first_name', 'last_name']].str.lower()  # <- This fails
```

This behavior feels arbitrary to me, I don't think we should keep it. I think it's much simpler to assume that operations can be applied to any number of columns. |
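For concreteness, a runnable version of that asymmetry in current pandas (my own example, not from the original comment):

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada"], "last_name": ["Lovelace"]})

# Works: .str is defined on Series.
print(df["first_name"].str.lower())

# Fails: DataFrame has no .str accessor, so the N-column version raises.
try:
    df[["first_name", "last_name"]].str.lower()
except AttributeError as exc:
    print(exc)  # 'DataFrame' object has no attribute 'str'
```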
Just to confirm, you're proposing we raise an exception when there's a mixture of data types, some of which don't support the operation?

Another place this comes up is in boolean filtering. NumPy assigns different meaning to 1D and 2D boolean masks:

```python
In [3]: a = np.ones((4, 2))

In [4]: m = np.array([True, False, False, True])

In [5]: a[m]
Out[5]:
array([[1., 1.],
       [1., 1.]])

In [6]: a[np.atleast_2d(m)]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-4cd960bb8565> in <module>
----> 1 a[np.atleast_2d(m)]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 1
```

I think we want to support boolean masking by some condition on a column. Do we require that the mask just be of shape `(n_rows, 1)`? |
Yes, that looks like the most reasonable thing to me. It could be consistent with pandas' current behavior:

```python
>>> df['population'].str.upper()
AttributeError: Can only use .str accessor with string values!
```

But to allow performing it on N columns, it would be required that all of them are strings. Of course, another option is to ignore the non-matching columns, like pandas does in:

```python
>>> df[['name', 'population']].mean()
population    2.729748e+07
dtype: float64
```

I personally don't like that happening. If I select two columns, I would like to know in advance that I'm getting two values (or an exception). But I guess that also depends on the use case, as discussed in #5. If I'm writing production code with the software engineer hat on, ignoring columns that I'm selecting feels hacky. If I'm in a notebook with a data analyst hat on, it seems reasonable. I guess we could use a parameter like in the reductions (`numeric_only`).

I'll create a new issue for this, since I think it's complex, and as you mentioned, it's the same discussion as for reductions.
That's a very good point. Yes, I think requiring one-column dataframes for filters is probably the best. Maybe there are other options in both cases that can be considered. But IMHO, these two solutions would be better (much simpler) than having a separate `Series` class. |
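As a data point for the "one-column dataframes as filters" idea: pandas today gives a boolean `Series` and a boolean one-column `DataFrame` completely different filtering semantics (my own runnable example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Boolean Series: selects matching rows.
print(df[df["A"] > 1])     # one row: A=2, B=4

# Boolean one-column DataFrame: element-wise masking with NaN fill,
# not row selection -- column B becomes all-NaN.
print(df[df[["A"]] > 1])
```

A standard that accepts `(n_rows, 1)` masks would therefore have to pick the row-selection meaning explicitly.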
Are people generally OK with "special-casing" `(n_rows, 1)`-shape dataframes for operations like boolean masking? I can't quite put my finger on why, but it feels a bit strange to me. But it's not a deal-breaker for me.

One other nice thing this side-steps is pandas' awkward behavior with trying to mimic NumPy's broadcasting:

```python
In [2]: df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=['a', 'b', 'c'])

In [3]: df
Out[3]:
   A  B
a  1  3
b  2  4
c  3  5

In [4]: df + df.A
Out[4]:
    A   B   a   b   c
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN

In [5]: df + df[['A']]
Out[5]:
     A   B
a    2 NaN
b    4 NaN
c    6 NaN
```
|
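(For completeness: pandas' axis-aware method sidesteps the alignment surprise above, at the cost of knowing about `axis`. This note is mine, not from the original comment.)

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]}, index=["a", "b", "c"])

# Broadcast column A along the rows explicitly instead of relying on alignment:
print(df.add(df["A"], axis=0))
#    A  B
# a  2  4
# b  4  6
# c  6  8
```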
Just to be clear, it also feels a bit strange to me. But I think there is a trade-off, and how having a single structure simplifies things, for the user and in the code base, makes it worth it IMO.

I think boolean masks may require a bit more thinking. I think having an internal method `to_boolean_mask` or `check_boolean_mask` that is called for all operations that require one can make sense. Or we could consider a specific type for boolean masks, instead of a fully featured Series object, which IMHO is not worth the extra complexity, compared to 1-D dataframes.

What are your thoughts on avoiding `Series`, but considering whether a public boolean mask object makes sense? |
> I think having an internal method `to_boolean_mask` or `check_boolean_mask` that is called for all operations that require one can make sense.

Can you clarify "internal method"? I was assuming we couldn't really have any of those, since we're only specifying the user-facing API and leaving implementation things up to each library. Are you thinking we could have a small utility library that projects can depend on / vendor, with things like these checks?

> What are your thoughts on avoiding Series, but consider if a public boolean mask object makes sense?

Thinking on this more, I'm more comfortable with "special casing" `(n_rows, 1)` dataframes to allow them as boolean masks. It's just a requirement on the shape that I think is *very* similar to the requirement that, in the expression `numpy_array[boolean_mask]`, the length of the mask match the length of the array being masked. |
Good point. I was talking about a private method. Not that I want to include it in the API, or in a common package; more about what this approach could mean in terms of implementation. I just wanted to point out that I don't think anything other than a function that validates/converts the dataframe is needed.

And your second comment is a good point. Besides validating that the mask is 1-D, it's also required to validate that it's the same length as the object being filtered, so I guess it doesn't make a big difference having a 1-D structure that needs to be validated for shape anyway. |
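A minimal sketch of the kind of private validation helper being described, assuming only that the mask object exposes `.shape` and supports `np.asarray`; the name and signature are hypothetical, not part of any spec:

```python
import numpy as np

def _check_boolean_mask(mask, n_rows):
    """Validate an (n_rows, 1) dataframe-like boolean mask and flatten it."""
    n, m = mask.shape
    if m != 1:
        raise ValueError(f"boolean mask must have exactly 1 column, got {m}")
    if n != n_rows:
        raise ValueError(f"mask has {n} rows, expected {n_rows}")
    values = np.asarray(mask).reshape(-1)
    if values.dtype != np.bool_:
        raise TypeError(f"mask must be boolean, got {values.dtype}")
    return values
```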
Sorry for being late to the party: my comment in #59 (comment) is related to this. Similar to xarray, scipp supports what we can think of as a higher-dimensional generalization of `Series`.

I also note some discussion about mask-columns above. Scipp handles this explicitly: masks are attached alongside the data, rather than being stored as ordinary boolean columns. If there is interest I'd be happy to talk in more detail about these aspects. |
It looks to me like there may be a need for a separate column object - in the end that's what we needed in the interchange protocol as well - however the trick will then be to figure out how to add that without having a large set of duplicate APIs between column and dataframe objects. |
What would …? |
@jbrockmendel in what scenario? What @SimonHeybrock said about … |
ATM … |
Is there a `.dtypes`? And if so, what does it return if there is no Series? |
Please do not return a dataframe column for this information. This is a source of inconsistency for C++- and GPU-backed dataframe implementations, where the dtypes are usually stored in the host-side structure. In cuDF today, for example, if someone calls the `.dtypes` attribute, the result has to come from host-side information rather than from device memory. |
A standard Python tuple or other container would probably make more sense than a Series object, but those also can't live on a GPU (and anyway, dtype is already stored on the host)? @kkraus14 is the API inherently problematic, or are you fine with it being a tuple and there needing to be synchronizations (I think?) when dtypes are used in control flow for example? |
What if we had a Series but required very little of it API-wise? Really just enough so that … |
Isn't …? For … |
That would require a discussion about what type reductions like `df.sum()` return. It isn't obvious to me that any reduction gives a dataframe back. |
A tuple is good. Sorry, my point was that the information is already on the host in Python objects and you don't want to try to shove those into columns that may or may not typically be backed by device memory. |
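To illustrate the host-side-container point: pandas returns `.dtypes` as a `Series`, while a plain tuple would carry the same information without implying column storage (my own example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": ["x"]})

print(type(df.dtypes))   # <class 'pandas.core.series.Series'>
print(tuple(df.dtypes))  # (dtype('int64'), dtype('O')) -- same info as a host-side tuple
```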
If we want to consider this, I think it would be good if someone could investigate and list the potential impact on broadcasting and scalar behaviour in various situations, so we can think through those cases. Boolean filtering was already mentioned above (in numpy, boolean indexing with a 2D array instead of 1D doesn't work as you would expect for dataframes), but two other cases I am currently thinking of: …
|
Are we opposed to …? |
@shwina I think to answer that, we should have some idea of the methods a new `Series`/column object would have. |
I don't understand why that's a problem, especially if a lot of code can be shared. |
@rgommers - yup, I totally understand we'd need to go through the work of defining an API for `Series`.

I think the major issue with making a `Series` follow the array API standard is that dataframe columns need dtypes (strings, categoricals, nullable types) that the array API standard doesn't specify. |
API surface size does matter even if there are no differences in signatures or semantics (and I'm not sure that's even true?), IMHO: from a user perspective, a library maintainer perspective, and from a "this is a lot of work for us to standardize for little gain" perspective. That said, @shwina's point below is more important.
Good point - yes, that sounds about right to me, that is the more important conceptual issue.
There is nothing in the array API standard that says one cannot support extra dtypes, or add some extra functions/methods to support such dtypes. I think what you are getting at is more of an implementation issue, rather than an API one. Do you see anything API-wise that would not work? The main thing I can think of is semantics for nullable dtypes, which is unspecified. Making … |
My impression from today's call is that we are leaning towards having a Column object that would be 1D but not have a .index or .name, making it more array-like than Series-like. There was discussion of missing-handling that sounded like that would distinguish it from an array-api object, but I didn't totally follow. |
I'm not sure it would necessarily prevent following the array API standard. However, when missing data is present, any functions that are not element-wise will have to have their semantics specified. That is something that's outside the scope of the array API standard (because no array implementations actually support missing data). It's a tractable problem though; there are not that many functions that need extra semantics. I think these are: the reductions and statistical functions, the sort/set/search functions, and the linear algebra and Fourier transform functions. |
My sense is that for reductions, sort/set/search and similar functions we can specify sensible missing-data semantics (e.g., skipping missing values, as pandas does by default). And linear algebra and Fourier transforms should probably simply raise exceptions. I'm curious if that kind of approach would work.

EDIT: that would basically address gh-50. |
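To make the "extra semantics" concrete: pandas reductions skip missing values by default, while NumPy propagates NaN unless you opt in (my own comparison, not from the original comment):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, None, 3.0])
a = np.array([1.0, np.nan, 3.0])

print(s.sum())       # 4.0  -- pandas skips missing values by default
print(a.sum())       # nan  -- numpy propagates
print(np.nansum(a))  # 4.0  -- numpy needs an explicit opt-in
```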
We've gone ahead with adding a `Column` object to the standard. Closing then, but please do let me know if I've misunderstood. |
Probably a bit early in the discussion, but I think this will need to be discussed eventually.
Is a separate object representing a single column needed? Like having `Series`, instead of just using a one-column `DataFrame`.
Having two separate objects IMO adds a decent amount of complexity, both in the implementation and for the user. Whether this complexity is worth it or not, I don't know. But I think this shouldn't be replicated from pandas without a discussion.