GH-29642: [R] Support for .keep_all = TRUE with distinct() #44652

Open · wants to merge 6 commits into base: main


@nealrichardson (Member) commented Nov 5, 2024

Rationale for this change

This adds a missing feature: support for `.keep_all = TRUE` in `distinct()`. It mostly wires up existing functionality from R to Acero, then adds docs and tests.

This is mostly picking up where #13934 started and finishing it out. Thanks @mopcup for the initial lift.

What changes are included in this PR?

An aggregation binding, some symbol manipulation, and tests. I also cleaned up some dplyr test shims from 2022.

Are these changes tested?

Yes, though if anyone knows of odd corners in distinct() that aren't covered by this, we can add more tests.

Are there any user-facing changes?

Yes: `distinct()` now supports `.keep_all = TRUE`.

github-actions bot commented Nov 5, 2024

⚠️ GitHub issue #29642 has been automatically assigned in GitHub to PR creator.

@jonkeane (Member) left a comment

Thanks for this! Mostly questions about messaging + conveying some of the nuances

Comment on lines +109 to +111
# Drop factor because of #44661:
# NotImplemented: Function 'hash_one' has no kernel matching input types
# (dictionary<values=string, indices=int8, ordered=0>, uint8)
@jonkeane (Member):

Is the message on lines 110-111 the error that someone would get if they tried distinct(..., .keep_all = TRUE) with a factor in the table/data.frame?

We might want to make that a bit nicer / more grokable for folks who might not have the dictionary -> factor knowledge top of mind

@nealrichardson (Member, Author):

Yeah, that's the error message. I'd have to think about how/where best to catch that and translate it to R-speak. As it turns out, dictionary isn't the only unsupported type; it's just the only one we have in this test data frame. I think list types and other non-simple types are also not supported, IIRC from RTFS.
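
One way the friendlier message discussed above could look is sketched below. This is only an illustration in base R: `check_keep_all_types()` is a hypothetical helper, not part of the arrow package, and the set of column types it rejects (factors and list columns) is an assumption taken from the comment above rather than a verified list of what `hash_one` supports.

```r
# Hypothetical helper (not in arrow): fail early with an R-flavored message
# instead of surfacing the raw "no kernel matching input types" error.
check_keep_all_types <- function(df) {
  # Assumption: factor (Arrow dictionary) and list columns are the
  # unsupported ones; the real set may be larger.
  is_unsupported <- vapply(
    df,
    function(col) is.factor(col) || is.list(col),
    logical(1)
  )
  unsupported <- names(df)[is_unsupported]
  if (length(unsupported) > 0) {
    stop(
      "distinct(.keep_all = TRUE) does not support factor or list columns: ",
      paste(unsupported, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

ok <- data.frame(g = c("a", "b"))
bad <- data.frame(g = c("a", "b"), f = factor(c("u", "v")))

check_keep_all_types(ok)
msg <- tryCatch(
  check_keep_all_types(bad),
  error = function(e) conditionMessage(e)
)
```

Where such a check would live (in the dplyr binding, or by translating the error after Acero raises it) is the open question in this thread; the sketch only shows the shape of the message.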

Comment on lines +31 to +33
# Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
# However, Acero's `hash_one` function prefers returning non-null values.
# So, you'll get the same shape of data, but the values may differ.
@jonkeane (Member):

This behavior change is probably either not impactful or, if folks are relying on the first-row behavior, that is actually a bug in their code. It does seem like something we should mention, though (in docs at least?).

Or maybe with a one-time warning?

@nealrichardson (Member, Author):

It is documented on the acero man page; that's the change to arrow-package.R. I'd rather not add a one-time warning; that's a slippery slope if we were going to be chatty about every subtle difference between how Acero works and how dplyr works on data.frames.
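
To make the difference under discussion concrete, here is a minimal base R sketch. It does not call arrow or dplyr; `pick_non_null()` is a hypothetical stand-in that only mimics a preference for non-null values, and it assumes the first non-NA value is the one chosen, which Acero does not necessarily guarantee.

```r
# Toy data: group "a" has NA in its first row.
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(NA, 1, 2)
)

# First-row semantics (what dplyr does on a data.frame with .keep_all = TRUE):
# keep the first row seen for each distinct value of g.
first_row <- df[!duplicated(df$g), ]

# Illustrative "prefer non-null" pick: take the first non-NA value in the
# group, falling back to the first value when every value is NA.
pick_non_null <- function(v) {
  non_na <- v[!is.na(v)]
  if (length(non_na) > 0) non_na[1] else v[1]
}
non_null <- vapply(split(df$x, df$g), pick_non_null, numeric(1))
```

Both results have one entry per group, so the shape matches, but for group "a" the first-row version carries `NA` while the non-null-preferring version carries `1`.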

2 participants