GH-29642: [R] Support for .keep_all = TRUE with distinct() #44652
base: main
Thanks for this! Mostly questions about messaging + conveying some of the nuances
# Drop factor because of #44661:
# NotImplemented: Function 'hash_one' has no kernel matching input types
# (dictionary<values=string, indices=int8, ordered=0>, uint8)
Is 110-111 the error that someone would get if they tried distinct(..., .keep_all = TRUE) with a factor in the table/data.frame?
We might want to make that a bit nicer / more grokable for folks who might not have the dictionary -> factor knowledge top of mind
Yeah that's the error message. I'd have to think about how/where best to catch that and translate that to R-speak. As it turns out, dictionary isn't the only unsupported type, it's just the only one we have in this test data frame. I think list types and other non-simple types are also not supported, IIRC from RTFS.
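One way the "translate to R-speak" idea could look is an up-front check on column types, sketched below in base R. This is only an illustration: `unsupported_keep_all_cols()` and `check_keep_all()` are hypothetical names, not part of arrow, and the set of unsupported types (factor and list here) is an assumption drawn from the comment above rather than a verified list.

```r
# Hypothetical helpers, NOT part of the arrow package.
# Assumption: factor (dictionary) and list columns are the unsupported ones.
unsupported_keep_all_cols <- function(df) {
  is_unsupported <- vapply(
    df,
    function(col) is.factor(col) || is.list(col),
    logical(1)
  )
  names(df)[is_unsupported]
}

check_keep_all <- function(df) {
  bad <- unsupported_keep_all_cols(df)
  if (length(bad) > 0) {
    # Report the problem in R terms (factor/list) instead of Arrow type names.
    stop(
      "`.keep_all = TRUE` is not supported for column(s): ",
      paste(bad, collapse = ", "),
      " (factor or list type). Convert or drop them first.",
      call. = FALSE
    )
  }
  invisible(df)
}

# Toy data with one supported and two unsupported columns.
df <- data.frame(id = c(1, 1, 2), f = factor(c("x", "y", "z")))
df$l <- I(list(1, 2, 3))

unsupported_keep_all_cols(df)
```

A check like this would fire before the query reaches Acero, so the user sees column names and R type names rather than `dictionary<values=string, ...>`.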
# Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
# However, Acero's `hash_one` function prefers returning non-null values.
# So, you'll get the same shape of data, but the values may differ.
This behavior change is probably either not-impactful, or if folks are relying on it, that is actually a bug in their code. Though it does seem like something we should mention (in docs at least?).
Or maybe with a one-time warning?
It is documented on the acero man page; that's the change to arrow-package.R. I'd rather not add a one-time warning; that's a slippery slope if we were going to be chatty about every subtle difference between how Acero works and how dplyr works on data.frames.
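To make the difference discussed in this thread concrete, here is a small base R sketch. It does not call arrow or Acero; `first_non_na()` is a hypothetical stand-in that emulates the "prefers non-null values" behavior described in the code comment, and the toy data is invented for illustration.

```r
# Toy data: group "a" has NA in its first row.
df <- data.frame(
  g = c("a", "a", "b", "b"),
  v = c(NA, 1, 2, 3)
)

# dplyr-style semantics: keep the first row of each group, NA or not.
first_row <- df[!duplicated(df$g), ]

# Emulation of "prefer non-null" (an assumption about the behavior,
# not Acero's actual implementation): first non-NA value per group,
# falling back to the first value if the whole group is NA.
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0) keep[1] else x[1]
}
vals <- tapply(df$v, df$g, first_non_na)
prefer_non_null <- data.frame(g = names(vals), v = as.vector(vals))

first_row
prefer_non_null
```

Both results have one row per group, so the shape matches, but for group "a" the first-row version keeps the `NA` while the non-null-preferring version keeps `1`.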
Rationale for this change
Support a missing feature, just wiring up some stuff from R to Acero, then adding docs and tests.
This is mostly picking up where #13934 started and finishing it out. Thanks @mopcup for the initial lift.
What changes are included in this PR?
An aggregation binding, some symbol manipulation, and tests. I also cleaned up some dplyr test shims from 2022.
Are these changes tested?
Yes, though if anyone knows of odd corners in distinct() that aren't covered by this, we can add more.
Are there any user-facing changes?
Yes indeed.