DISC: consensus primitives #84

jbrockmendel · 2022-09-07T17:17:20Z

In writing up proposals for arithmetic I find myself referencing other methods/characteristics that might not be well-defined within the spec, but that I think should be clear from context. Before spending much more time on this, I want to double-check that everyone is a) in agreement about what these mean informally and b) OK with these being used informally until something more formal is available.

"concat", analogous to pd.concat([left, right], axis=1). In the cases of interest len(left) == len(right).
Example: "Scalar arithmetic commutes with concatenation, so concat([X, Y], axis=1) + Z matches concat([X+Z, Y+Z], axis=1)"
"matches", analogous to pd.testing.assert_frame_equal. Similar but not identical to pd.DataFrame.equals
"indexed-like", analogous to pd.DataFrame._indexed_same. In cases without a row-index, would require columns match and lengths match
Example: "DataFrame arithmetic X+Y is defined so long as X is indexed-like Y"
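
For concreteness, here is a minimal pandas sketch of the first three terms. The helpers `matches` and `indexed_like` are hypothetical names introduced only for illustration. They use public pandas API rather than the private `_indexed_same`, and `matches` assumes exact comparison, which is one possible reading rather than an agreed definition.

```python
import pandas as pd


def matches(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Hypothetical helper: True if assert_frame_equal passes with exact comparison."""
    try:
        pd.testing.assert_frame_equal(left, right, check_exact=True)
    except AssertionError:
        return False
    return True


def indexed_like(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Hypothetical helper: same row labels and same column labels."""
    return left.index.equals(right.index) and left.columns.equals(right.columns)


X = pd.DataFrame({"a": [1, 2, 3]})
Y = pd.DataFrame({"b": [4.0, 5.0, 6.0]})
Z = 10  # a scalar

# "Scalar arithmetic commutes with concatenation"
lhs = pd.concat([X, Y], axis=1) + Z
rhs = pd.concat([X + Z, Y + Z], axis=1)

print(indexed_like(lhs, rhs))
print(matches(lhs, rhs))
```

Here X and Y share the default row index and have equal length, which is the case of interest described in (1).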

Update:
4) "extract_array", analogous to pd.core.construction.extract_array. For a single-column dataframe df, get the array backing it. In pandas we would usually do this on a Series with ser._values. This may be making assumptions about dataframe internals that we don't want to make.
Example: "This operation wraps the array behavior op(X) matches DataFrame(op(extract_array(X)))"
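
A rough sketch of what (4) could look like using only public pandas API. Taking the single column's values with `to_numpy()` is an assumption standing in for the internal `extract_array`, and `np.abs` is just an arbitrary example of `op`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [-1.5, 2.0, -3.25]})

# Public-API stand-in for extract_array: the values of the only column.
values = df.iloc[:, 0].to_numpy()

# Apply the operation to the array, then wrap the result back into a frame,
# passing the labels explicitly so they are kept.
rewrapped = pd.DataFrame({"a": np.abs(values)}, index=df.index)

# "op(X) matches DataFrame(op(extract_array(X)))"
pd.testing.assert_frame_equal(df.abs(), rewrapped, check_exact=True)
```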


rgommers · 2022-09-13T11:27:06Z

(1) and (3) seem clear to me, and sound good.

Regarding (2), pd.testing.assert_frame_equal seems a bit loose on floating-point tolerance. I assume that is not what you meant, but to be sure: I'd expect floating-point numbers to be almost equal to the expected precision for array operations, so typically O(1 nulp) when comparing, for example, commuting operations in the same library, and typical numbers (e.g. <1e-12) when comparing 64-bit precision operations between different libraries, rather than the default rtol=1e-5, atol=1e-8 of assert_frame_equal.
For things like check_column_type, "equivalent" rather than "identical" sounds fine to me.
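
To illustrate the tolerance point, a small sketch: a relative error of about 1e-7 is accepted by the defaults of `assert_frame_equal` but rejected by a tighter tolerance or by exact comparison. The helper `passes` and the specific tight tolerance are assumptions chosen for the example.

```python
import pandas as pd

expected = pd.DataFrame({"a": [1.0, 2.0]})
actual = expected * (1 + 1e-7)  # relative error of about 1e-7


def passes(**kwargs) -> bool:
    """Return True if assert_frame_equal accepts the pair with the given options."""
    try:
        pd.testing.assert_frame_equal(actual, expected, **kwargs)
    except AssertionError:
        return False
    return True


default_ok = passes()                    # defaults: rtol=1e-5, atol=1e-8
tight_ok = passes(rtol=1e-12, atol=0.0)  # closer to the cross-library figure above
exact_ok = passes(check_exact=True)      # zero tolerance

print(default_ok, tight_ok, exact_ok)
```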

Regarding (4): sounds fine, as long as we can ignore that arrays may not support every dtype. Also, DataFrame(op(extract_array(X))) may lose metadata like column names. So I assume you mean something like "apply op to the data values contained in the dataframe, while leaving all metadata of the dataframe unchanged (except if op is a shape-changing operation, in that case the dataframe shape changes)".
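
A short sketch of the metadata concern, assuming `np.sqrt` as the example `op`: rewrapping the raw values without passing labels replaces them with default integer labels, while passing `index` and `columns` keeps them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1.0, 4.0, 9.0]}, index=["x", "y", "z"])
values = df.to_numpy()

# Naive rewrap: row and column labels are replaced by default integer labels.
naive = pd.DataFrame(np.sqrt(values))

# Rewrap that carries the original row and column labels over.
preserved = pd.DataFrame(np.sqrt(values), index=df.index, columns=df.columns)

print(list(naive.columns), list(naive.index))
print(list(preserved.columns), list(preserved.index))
```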

rgommers · 2022-09-13T11:30:41Z

@jbrockmendel regarding numerical operations, gh-50 may be relevant.

jorisvandenbossche · 2022-09-13T14:28:39Z

"concat", analogous to pd.concat([left, right], axis=1). In the cases of interest len(left) == len(right).
Example: "Scalar arithmetic commutes with concatenation, so concat([X, Y], axis=1) + Z matches concat([X+Z, Y+Z], axis=1)"

Maybe not exactly what you wanted to discuss here (terminology in general rather than specific examples), but I don't really understand this example (at least in pandas this doesn't hold because of alignment?).
On the terminology front, "concat" on axis 1 and on axis 0 are quite different operations. Do you mean to imply with the above that you mostly consider axis=1 for "concat"? (Personally I would first think of axis 0 for a generic "concat", since axis 1 is more like a "join" operation.)
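
A small sketch of how the two axes differ in pandas when the row labels only partly overlap. The frames are made up for illustration, and this is one reading of the alignment concern rather than a statement about the proposal.

```python
import pandas as pd

X = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
Y = pd.DataFrame({"b": [3, 4]}, index=[1, 2])

stacked = pd.concat([X, Y], axis=0)  # rows appended, columns unioned
joined = pd.concat([X, Y], axis=1)   # columns appended, rows aligned on the index

print(stacked.shape)
print(joined.shape)
print(joined)
```

With axis=1 the rows are aligned by label, so non-matching labels produce missing values, which is why it behaves like a join. When both frames share the same row index, that alignment step has no visible effect.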

jbrockmendel · 2022-09-13T15:37:58Z

w/r/t matches/equals/assert_frame_equal I was thinking of exact matches, i.e. 0 tolerance for floating points, but I don't know if that's necessary when it comes time to instrumentalize it. The point in this context is to be able to make meaningful statements about commutativity and column-wise operations.

w/r/t concat I'm only interested in axis=1 in this context. Again, the motivating case is to be able to describe frame+scalar in terms of the column-wise operation.

jorisvandenbossche · 2022-09-13T16:07:39Z

> Again, the motivating case is to be able to describe frame+scalar in terms of the column-wise operation.

OK, but then I don't understand how "concat" is useful to describe those operations (or are the X and Y in your example columns, not dataframes?)

jbrockmendel · 2022-09-13T16:31:36Z

> OK, but then I don't understand how "concat" is useful to describe those operations (or are the X and Y in your example columns, not dataframes?)

It should hold for any decomposition of a DataFrame along the lines of X = df.iloc[:, :N], Y = df.iloc[:, N:]. Is there a non-concat way you'd suggest to express "equivalent to operating column-by-column"?
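
As a sketch of that claim, the check below splits a made-up frame at every interior column position N and compares the piecewise result with operating on the whole frame. The frame, the scalar, and the choice of addition are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": [7, 8, 9]})
z = 10  # scalar operand

checked = []
for n in range(1, df.shape[1]):
    X = df.iloc[:, :n]
    Y = df.iloc[:, n:]
    piecewise = pd.concat([X + z, Y + z], axis=1)
    # Raises AssertionError if the decomposition does not reproduce df + z exactly.
    pd.testing.assert_frame_equal(piecewise, df + z, check_exact=True)
    checked.append(n)

print(checked)
```

Splitting down to single columns (n = 1 and repeated) is the "operating column-by-column" case.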

rgommers · 2023-04-27T23:21:23Z

@jbrockmendel it seems like we don't need these primitives, at least in an API. They may be useful for an implementer, but I believe Marco's MVP code doesn't use them. I think this can be closed now - is that okay with you?

rgommers added the API design label Sep 13, 2022

jbrockmendel closed this as completed Apr 27, 2023
