Replies: 2 comments · 9 replies
- I suggest a
- @easystats/core-team WDYT?
In psychology, we often have to deal with duplicates, perhaps as a result of merging several datasets, technical difficulties, or participants completing a survey twice. Consider the following example:
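A minimal sketch of such data, with illustrative values: participants 1 and 3 each appear twice; participant 1's first entry is mostly missing, while participant 3's two entries have the same number of missing values.

```r
# Illustrative data: participants 1 and 3 each appear twice.
# Participant 1's first row is mostly missing; participant 3's
# two rows are tied on the number of missing values.
df <- data.frame(
  id    = c(1, 2, 3, 1, 3),
  item1 = c(NA, 2, 4, 5, 6),
  item2 = c(NA, 2, NA, 5, NA),
  item3 = c(1, 2, 4, 5, 6)
)
df
#>   id item1 item2 item3
#> 1  1    NA    NA     1
#> 2  2     2     2     2
#> 3  3     4    NA     4
#> 4  1     5     5     5
#> 5  3     6    NA     6
```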
We have two duplicates: participants 1 and 3. To deal with this, we could attempt to use an existing function, `dplyr::distinct`, but its behaviour is to keep only the first duplicate. This is problematic because the first case will often contain incomplete or partial answers. Attempts to use base R `duplicated` yield the same result.

Therefore, we actually want to keep the rows with the fewest missing values. Furthermore, in the case of ties (e.g., a participant completed the survey twice and both entries have the same number of missing values), we might want to prioritize the first duplicate (to minimize practice effects, etc.). This is what `rempsyc::best_duplicate` does: for the first duplicate (participant 1), it keeps the row with the fewest NAs; for the second duplicate (participant 3, with an equal number of NAs), it keeps the first row.
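A minimal sketch of this comparison on the illustrative data above, assuming `best_duplicate()` takes the data and the name of the `id` column (check the rempsyc documentation for the exact interface):

```r
# Both of these keep only the *first* row for each id,
# regardless of how incomplete that row is:
dplyr::distinct(df, id, .keep_all = TRUE)
df[!duplicated(df$id), ]

# Sketch of the proposed behaviour (the `id` argument name is an
# assumption, not taken from the rempsyc documentation):
rempsyc::best_duplicate(df, id = "id")
# Expected on the data above: participant 1 -> row 4 (fewest NAs);
# participant 3 -> row 3 (tied NAs, so the first occurrence wins).
```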
It might also be recommended to manually inspect the duplicates before proceeding to their automatic removal. That’s where `rempsyc::extract_duplicates` comes in. The data frame it returns includes the original row number as well as the number of NAs per row, to help you make your own decision. If you are not fully satisfied with the automatic approach of `rempsyc::best_duplicate`, you can use the row numbers provided to make a manual subselection.
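A minimal sketch of that inspection workflow on the same data, again assuming an `id` argument by column name; the rows dropped at the end are a purely hypothetical choice:

```r
# Inspect the duplicated entries before deciding what to drop
# (argument name assumed, as above):
dups <- rempsyc::extract_duplicates(df, id = "id")
dups  # described as reporting the original row numbers and per-row NA counts

# Manual subselection using the reported row numbers
# (these particular rows are a hypothetical decision):
rows_to_drop <- c(1, 5)
df_clean <- df[-rows_to_drop, ]
```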
Do you think this would make a good addition to datawizard’s armamentarium?

I did a small social experiment on Twitter and it got good traction, with 25 likes and 9 retweets; that’s my most “viral” tweet so far (considering my mere 100 followers).
I am open to finding a better name for `best_duplicate`. I was thinking `keep_best_duplicate`, but that’s a bit long. An alternative could be `filter_duplicates`.

Created on 2022-10-21 with reprex v2.0.2