DISC: concensus primitives #84

Closed
jbrockmendel opened this issue Sep 7, 2022 · 7 comments

@jbrockmendel
Contributor

jbrockmendel commented Sep 7, 2022

In writing up proposals for arithmetic I find myself referencing other methods/characteristics that might not be well-defined within the spec, but that I think should be clear from context. Before spending much more time on this, I want to double-check that everyone is a) in agreement about what these mean informally and b) OK with these being used informally until something more formal is available.

  1. "concat", analogous to pd.concat([left, right], axis=1). In the cases of interest len(left) == len(right).
    Example: "Scalar arithmetic commutes with concatenation, so concat([X, Y], axis=1) + Z matches concat([X+Z, Y+Z], axis=1)"
  2. "matches", analogous to pd.testing.assert_frame_equal. Similar but not identical to pd.DataFrame.equals
  3. "indexed-like", analogous to pd.DataFrame._indexed_same. In cases without a row-index, would require columns match and lengths match
    Example: "DataFrame arithmetic X+Y is defined so long as X is indexed-like Y"
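As a rough illustration of the three informal terms above, here is a minimal pandas sketch. The helpers `matches` and `indexed_like` are hypothetical names introduced only for this example (they are not part of pandas or of any spec), and `matches` assumes the exact-comparison reading of `assert_frame_equal`.

```python
import pandas as pd


def matches(left, right):
    """Hypothetical helper: True if assert_frame_equal passes with exact comparison."""
    try:
        pd.testing.assert_frame_equal(left, right, check_exact=True)
    except AssertionError:
        return False
    return True


def indexed_like(left, right):
    """Hypothetical helper: same row labels and same column labels."""
    return left.index.equals(right.index) and left.columns.equals(right.columns)


X = pd.DataFrame({"a": [1, 2, 3]})
Y = pd.DataFrame({"b": [4.0, 5.0, 6.0]})
Z = 10  # a scalar, as in the example statement

# "Scalar arithmetic commutes with concatenation"
lhs = pd.concat([X, Y], axis=1) + Z
rhs = pd.concat([X + Z, Y + Z], axis=1)

commutes = matches(lhs, rhs)
same_labels = indexed_like(lhs, rhs)
```

Here `X` and `Y` have equal length and the default row index, which is the case of interest described in item 1.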

Update:
  4. "extract_array", analogous to pd.core.construction.extract_array. For a single-column dataframe df, get the array backing it. In pandas we would usually do this on a Series with ser._values. This may be making assumptions about dataframe internals that we don't want to make.
    Example: "This operation wraps the array behavior op(X) matches DataFrame(op(extract_array(X)))"
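One way to picture the "extract_array" statement is the sketch below. The `extract_array` function here is a hypothetical stand-in built on the public `to_numpy()` method, not pandas' internal `pd.core.construction.extract_array`, and the row and column labels are passed back explicitly when re-wrapping so that the two sides can be compared.

```python
import numpy as np
import pandas as pd


def extract_array(df):
    """Hypothetical stand-in: return the values of a single-column dataframe."""
    if df.shape[1] != 1:
        raise ValueError("expected a single-column dataframe")
    return df.iloc[:, 0].to_numpy()


X = pd.DataFrame({"a": [1.0, 4.0, 9.0]})
arr = extract_array(X)

# op applied to the dataframe directly ...
via_frame = np.sqrt(X)
# ... versus op applied to the extracted array and wrapped again
via_array = pd.DataFrame(np.sqrt(arr), index=X.index, columns=X.columns)

pd.testing.assert_frame_equal(via_frame, via_array, check_exact=True)
```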

@rgommers
Member

(1) and (3) seem clear to me, and sound good.

Regarding (2), pd.testing.assert_frame_equal seems a bit loose on floating-point tolerance. I assume that is not what you meant, but to be sure: I'd expect floating-point numbers to be almost equal to the expected precision for array operations, so typically O(1 nulp) when comparing, for example, commuting operations in the same library, and typical tolerances (e.g. <1e-12) when comparing 64-bit precision operations between different libraries, rather than the rtol=1e-5, atol=1e-8 of assert_frame_equal.
For things like check_column_type, "equivalent" rather than "identical" sounds fine to me.
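To make the tolerance concern concrete, the sketch below compares two frames that differ by a relative error of about 1e-7. The helper name `passes` and the specific values are assumptions chosen for illustration; `rtol`, `atol` and `check_exact` are existing keyword arguments of `pd.testing.assert_frame_equal`.

```python
import pandas as pd

expected = pd.DataFrame({"x": [1.0]})
computed = pd.DataFrame({"x": [1.0 + 1e-7]})  # relative error of about 1e-7


def passes(**kwargs):
    """Hypothetical helper: does assert_frame_equal accept the pair?"""
    try:
        pd.testing.assert_frame_equal(computed, expected, **kwargs)
    except AssertionError:
        return False
    return True


loose_ok = passes(rtol=1e-5, atol=1e-8)   # the tolerances mentioned above
tight_ok = passes(rtol=1e-12, atol=0.0)   # closer to 64-bit expectations
exact_ok = passes(check_exact=True)       # zero tolerance
```

With these inputs only the loose comparison accepts the pair, which is the sense in which the stated tolerances are loose for 64-bit arithmetic.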

Regarding (4): sounds fine, as long as we can ignore that arrays may not support every dtype. Also, DataFrame(op(extract_array(X))) may lose metadata like column names. So I assume you mean something like "apply op to the data values contained in the dataframe, while leaving all metadata of the dataframe unchanged (except if op is a shape-changing operation, in that case the dataframe shape changes)".
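The metadata point can be seen in a small sketch: re-wrapping the raw values without passing labels drops the column name and row index, while passing them back keeps the dataframe's metadata unchanged. The frame contents and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"price": [1.5, -2.0, 3.25]}, index=["r0", "r1", "r2"])
values = X["price"].to_numpy()

# Naive re-wrap: the values are right, but the labels are replaced by defaults
naive = pd.DataFrame(np.abs(values))

# Re-wrap with the original labels: only the data values change
preserved = pd.DataFrame(np.abs(values), index=X.index, columns=X.columns)

naive_columns = list(naive.columns)
preserved_columns = list(preserved.columns)
```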

@rgommers
Member

@jbrockmendel regarding numerical operations, gh-50 may be relevant.

@jorisvandenbossche
Member

  1. "concat", analogous to pd.concat([left, right], axis=1). In the cases of interest len(left) == len(right).
    Example: "Scalar arithmetic commutes with concatenation, so concat([X, Y], axis=1) + Z matches concat([X+Z, Y+Z], axis=1)"

Maybe not exactly what you wanted to discuss here (rather terminology in general, not specific examples), but I don't really understand this example (at least in pandas this doesn't hold because of alignment?).
On the terminology front, "concat" on axis 1 and on axis 0 are quite different operations. Do you mean to imply that you mostly consider axis=1 for "concat"? (I would personally typically first think of axis 0 for generic "concat", since axis 1 is more like a "join" operation)
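One possible reading of the alignment concern, sketched below under the assumption that Z is itself a dataframe rather than a scalar: pandas aligns on column labels before adding, so each piece picks up the other piece's columns and the piecewise result no longer has the same shape as the whole-frame result. The frames here are made up for illustration.

```python
import pandas as pd

X = pd.DataFrame({"a": [1, 2]})
Y = pd.DataFrame({"b": [3, 4]})
Z = pd.DataFrame({"a": [10, 20], "b": [30, 40]})  # a dataframe, not a scalar

whole = pd.concat([X, Y], axis=1) + Z

# X + Z aligns on columns, so it gains an all-NaN "b" column (and Y + Z an "a")
piecewise = pd.concat([X + Z, Y + Z], axis=1)

whole_columns = list(whole.columns)
piecewise_columns = list(piecewise.columns)
```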

@jbrockmendel
Contributor Author

w/r/t matches/equals/assert_frame_equal I was thinking of exact matches, i.e. 0 tolerance for floating points, but I don't know if that's necessary when it comes time to put it into practice. The point in this context is to be able to make meaningful statements about commutativity and column-wise operations.

w/r/t concat I'm only interested in axis=1 in this context. Again, the motivating case is to be able to describe frame+scalar in terms of the column-wise operation.

@jorisvandenbossche
Member

Again, the motivating case is to be able to describe frame+scalar in terms of the column-wise operation.

OK, but then I don't understand how "concat" is useful to describe those operations (or are the X and Y in your example columns, not dataframes?)

@jbrockmendel
Contributor Author

OK, but then I don't understand how "concat" is useful to describe those operations (or are the X and Y in your example columns, not dataframes?)

It should hold for any decomposition of a DataFrame along the lines of X = df.iloc[:, :N], Y = df.iloc[:, N:]. Is there a non-concat way you'd suggest to express "equivalent to operating column-by-column"?
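The decomposition claim can be checked in pandas for a scalar operand with a sketch like the following. The helper `holds_for_split` and the sample frame are hypothetical, and the comparison assumes the exact-match reading of "matches".

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5], "c": [5, 6]})
Z = 2  # scalar operand


def holds_for_split(frame, n, scalar):
    """Hypothetical check: frame + scalar matches the column-split version."""
    X, Y = frame.iloc[:, :n], frame.iloc[:, n:]
    whole = frame + scalar
    piecewise = pd.concat([X + scalar, Y + scalar], axis=1)
    try:
        pd.testing.assert_frame_equal(whole, piecewise, check_exact=True)
    except AssertionError:
        return False
    return True


# Try every split point that leaves both halves non-empty
results = [holds_for_split(df, n, Z) for n in range(1, df.shape[1])]
```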

@rgommers
Member

@jbrockmendel it seems like we don't need these primitives (at least in an API, they may be useful for an implementer, but Marco's MVP code doesn't use them I believe). I think this can be closed now - is that okay with you?
