Replies: 2 comments · 9 replies
- I suggest a
- @easystats/core-team WDYT?
In psychology, we often have to deal with duplicates, perhaps as a result of merging several datasets, technical difficulties, or participants completing a survey twice. Consider the following example:
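A minimal sketch of such data, with illustrative values: participants 1 and 3 each appear twice; participant 1's first entry is mostly missing, while participant 3's two entries have the same number of missing values.

```r
# Illustrative data: participants 1 and 3 each appear twice.
# Participant 1's first row is mostly missing; participant 3's
# two rows are tied on the number of missing values.
df <- data.frame(
  id    = c(1, 2, 3, 1, 3),
  item1 = c(NA, 2, 4, 5, 6),
  item2 = c(NA, 2, NA, 5, NA),
  item3 = c(1, 2, 4, 5, 6)
)
df
#>   id item1 item2 item3
#> 1  1    NA    NA     1
#> 2  2     2     2     2
#> 3  3     4    NA     4
#> 4  1     5     5     5
#> 5  3     6    NA     6
```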
We have two duplicates: participants 1 and 3. To deal with this, we could attempt to use an existing function, `dplyr::distinct`, but its behaviour is to keep only the first duplicate. This is problematic because the first case will often contain incomplete or partial answers. Attempts to use base R `duplicated` yield the same result.

Therefore, we actually want to keep the rows with the fewest missing values. Furthermore, in the case of ties (e.g., a participant completed the survey twice and both entries have the same number of missing values), we might want to prioritize the first duplicate (to minimize practice effects, etc.). This is what `rempsyc::best_duplicate` does: for the first duplicate (participant 1), it keeps the row with the fewest NAs; for the second duplicate (participant 3, with an equal number of NAs), it keeps the first row.
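A minimal sketch of this comparison on the illustrative data above, assuming `best_duplicate()` takes the data and the name of the `id` column (check the rempsyc documentation for the exact interface):

```r
# Both of these keep only the *first* row for each id,
# regardless of how incomplete that row is:
dplyr::distinct(df, id, .keep_all = TRUE)
df[!duplicated(df$id), ]

# Sketch of the proposed behaviour (the `id` argument name is an
# assumption, not taken from the rempsyc documentation):
rempsyc::best_duplicate(df, id = "id")
# Expected on the data above: participant 1 -> row 4 (fewest NAs);
# participant 3 -> row 3 (tied NAs, so the first occurrence wins).
```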
It might also be recommended to manually inspect the duplicates before proceeding to their automatic removal. That’s where `rempsyc::extract_duplicates` comes in. The data frame it returns includes the original row number as well as the number of NAs per row, to help you make your own decision. If you are not fully satisfied with the automatic approach of `rempsyc::best_duplicate`, you can use the row numbers provided to make a manual subselection.
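A minimal sketch of that inspection workflow on the same data, again assuming an `id` argument by column name; the rows dropped at the end are a purely hypothetical choice:

```r
# Inspect the duplicated entries before deciding what to drop
# (argument name assumed, as above):
dups <- rempsyc::extract_duplicates(df, id = "id")
dups  # described as reporting the original row numbers and per-row NA counts

# Manual subselection using the reported row numbers
# (these particular rows are a hypothetical decision):
rows_to_drop <- c(1, 5)
df_clean <- df[-rows_to_drop, ]
```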
Do you think this would make a good addition to datawizard’s armamentarium?

I did a small social experiment on Twitter and it got good traction, with 25 likes and 9 retweets; that’s my most “viral” tweet so far (considering my mere 100 followers).
I am open to finding a better name for `best_duplicate`. I was thinking `keep_best_duplicate`, but that’s a bit long. An alternative could be `filter_duplicates`.

Created on 2022-10-21 with reprex v2.0.2