-
Notifications
You must be signed in to change notification settings - Fork 910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make column_empty
mask buffer creation consistent with libcudf
#16715
Make column_empty
mask buffer creation consistent with libcudf
#16715
Conversation
@@ -1486,7 +1454,12 @@ def _has_any_nan(arbitrary: pd.Series | np.ndarray) -> bool: | |||
def column_empty( | |||
row_count: int, dtype: Dtype = "object", masked: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the masked
parameter no longer used?
I also want to raise a concern: the nullability of a column is conveyed by the presence/absence of a mask. While a column with zero rows has no data, the nullability is moreso a part of its dtype
which is independent of the row count. We have to track nullability carefully for things like round-tripping Parquet data (where nullability is part of the schema). I would check carefully if this changes any I/O behaviors for zero-row files -- we may not have adequate test coverage today. libcudf/cudf are notoriously weak/imprecise in this area and I want to get stricter, not less strict, over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the masked parameter no longer used?
Correct.
I would check carefully if this changes any I/O behaviors for zero-row files -- we may not have adequate test coverage today.
I added a small, zero row roundtrip parquet unit test in 34fbc8a to validate this was unaffected. Agreed that it would be good to make libcudf/cudf stricter w.r.t. mask presence, and at least this PR makes it more consistent for zero row vs all null columns generated in cudf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the masked
parameter now that it is unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to remove this parameter in a follow up since there are some usages of masked
in other RAPIDS libraries. I'll address those first (after this PR) before opening up a PR to remove it here
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
…schke/cudf into ref/column_empty/consistent
@bdice could you take another look here? |
Just putting this in draft to not merge while I'm out (might, but not intended to, require some related fixed in other RAPIDS projects), but this PR is fully ready to review. |
Opened back up now that I'm back from PTO. Could you take another look when you have the time @bdice? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked a potentially breaking question about removing the masked
parameter. Also this feels like a potentially disruptive change. This should probably target 25.02, to avoid breaking 24.12 just before freeze.
@@ -1486,7 +1454,12 @@ def _has_any_nan(arbitrary: pd.Series | np.ndarray) -> bool: | |||
def column_empty( | |||
row_count: int, dtype: Dtype = "object", masked: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the masked
parameter now that it is unused?
@@ -154,7 +154,10 @@ def jit_groupby_apply(offsets, grouped_values, function, *args): | |||
offsets = cp.asarray(offsets) | |||
ngroups = len(offsets) - 1 | |||
|
|||
output = cudf.core.column.column_empty(ngroups, dtype=return_type) | |||
# numba doesn't support masks | |||
output = cudf.core.column.column_empty( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like we're losing performance by allocating a null mask with all nulls, and then dropping it. Is there a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I added a for_numba
argument in this function to avoid creating a mask in these cases
Thanks for taking a look Bradley! Moving to 25.02 |
…schke/cudf into ref/column_empty/consistent
@mroeschke sorry for not getting this merged earlier, unfortunately there are conflicts now. Could you please resolve? |
…schke/cudf into ref/column_empty/consistent
/merge |
rapidsai/cudf#16715 makes size 0 cudf columns have no mask buffer which I think aligns with the intention in `empty_geometry_column`. Additionally, it makes the `masked` keyword unnecessary and unused so removing it here. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) - Michael Wang (https://github.com/isVoid) URL: #1496
Description
Based on offline discussions, this PR makes
column_empty
consistent with libcudf whereAdditionally removes
column_empty_like
which can be subsumed bycolumn_empty
(I didn't find any active usage of this method across RAPIDS https://github.com/search?q=org%3Arapidsai%20column_empty_like&type=code)column_empty
will have an unusedmasked
argument, but since there is usage of this method across RAPIDS I'll need to adjust them before removing that keyword here (https://github.com/search?q=org%3Arapidsai%20column_empty&type=code)Checklist