Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge.data.table() output is wrong #6804

Open
zhizhongpu opened this issue Feb 9, 2025 · 6 comments
Open

merge.data.table() output is wrong #6804

zhizhongpu opened this issue Feb 9, 2025 · 6 comments

Comments

@zhizhongpu
Copy link

zhizhongpu commented Feb 9, 2025

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

library(dplyr)
library(data.table)

set.seed(42)
edmt <- data.table(
    id = 1:20,
    value = runif(20, 50, 500)
)

ic <- data.table(
    id = sample(1:20, 10)
)

spl <- edmt |> 
    merge(ic, by = "id") |> 
    arrange(-value) |> 
    head(10)

edmt_insample_A = edmt |> 
    merge.data.table(spl, by = "id", all = FALSE)
edmt_insample_A |> dim()
# [1] 1 3

edmt_insample_B = edmt |> 
    inner_join(spl, by = "id")
edmt_insample_B |> dim()
# [1] 10  3

Replacing arrange() with setorder resolves the issue.
Removing any line among these 3 resolves the issue.

  merge(ic, by = "id") |> 
  arrange(-value) |> 
  head(10)

# Output of sessionInfo()

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] eventstudyr_1.1.3  stringi_1.8.4      readxl_1.4.3       tidytab_0.0.0.9000 lubridate_1.9.4    forcats_1.0.0      stringr_1.5.1     
 [8] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
[15] janitor_2.2.1      data.table_1.16.4 

loaded via a namespace (and not attached):
 [1] generics_0.1.3    pracma_2.4.4      hms_1.1.3         magrittr_2.0.3    grid_4.4.0        timechange_0.3.0  cellranger_1.1.0 
 [8] Formula_1.2-5     httr_1.4.7        scales_1.3.0      abind_1.4-8       cli_3.6.3         crayon_1.5.3      rlang_1.1.5      
[15] texreg_1.39.4     munsell_0.5.1     withr_3.0.2       tools_4.4.0       tzdb_0.4.0        colorspace_2.1-1  estimatr_1.0.4   
[22] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   snakecase_0.11.1  car_3.1-3         MASS_7.3-64       pkgconfig_2.0.3  
[29] pillar_1.10.1     gtable_0.3.6      glue_1.8.0        Rcpp_1.0.14       tidyselect_1.2.1  rstudioapi_0.17.1 farver_2.1.2     
[36] carData_3.0-5     labeling_0.4.3    compiler_4.4.0 
@aitap
Copy link
Contributor

aitap commented Feb 9, 2025

Can confirm with both the CRAN version on R-4.2.2 and the current master branch on R-devel. Somehow merge(edmt, spl, by = "id", all = FALSE) has much fewer rows than merge.data.frame(edmt, spl, by = "id", all = FALSE):

> merge(edmt, spl, by = "id", all = FALSE)
i.id has same type (integer) as x.id. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
forder.c received 20 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
bmerge: looping bmerge_r took 0.000s
bmerge: took 0.000s
bmerge done in 0.001s elapsed (0.001s cpu)
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Key: <id>
      id  value.x  value.y
   <int>    <num>    <num>
1:    15 258.0318 258.0318

Edited your MRE a bit to (1) make it easier to copy&paste and (2) make sure all the relevant packages are attached.

Part of the problem must be due to the data.table retaining the key after its rows are reordered by arrange, but then why it works without the head(10) after arrange(...)?

@tdhock
Copy link
Member

tdhock commented Feb 10, 2025

for a work-around please use [.data.table instead of merge

@zhizhongpu
Copy link
Author

for a work-around please use [.data.table instead of merge

I think I explicitly stated merge.data.table and the problem persisted - sorry it's not reflected in the mwe @tdhock

@aitap
Copy link
Contributor

aitap commented Feb 10, 2025

[.data.table gives the same wrong answer:

> spl[edmt, on = 'id', nomatch = NULL]
i.id has same type (integer) as x.id. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
forder.c received 20 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
bmerge: looping bmerge_r took 0.000s
bmerge: took 0.000s
bmerge done in 0.000s elapsed (0.000s cpu)
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
Key: <id>
      id    value  i.value
   <int>    <num>    <num>
1:    15 258.0318 258.0318

Evidently, a data.table that had its values reordered and then the sorted attribute grafted on is invalid:

edmt |> merge(ic, by = "id") |> setorder(-value) |> setattr('sorted', 'id') -> spl3
merge(edmt, spl3, by = "id", all = FALSE) # same wrong output as in other cases

The difference between edmt |> merge(ic, by = "id") |> arrange(-value) |> merge(edmt, y=_, by = "id") (which seems to work) and edmt |> merge(ic, by = "id") |> arrange(-value) |> head(10) |> merge(edmt, y=_, by = "id") (which visibly breaks) is that the value returned by arrange(_, -value) is not an over-allocated data.table. When data.table:::bmerge(i,x) calls shallow(x), the function checks .selfrefok(x) before deciding whether to retain the key and correctly throws it away. On the other hand, when head(_, 10) is applied to the return value of arrange(_, -value), [.data.table makes sure to both preserve the sorted attribue and to over-allocate the vector of column pointers with internal attributes so that .selfrefok() would pass.

While the cause of the problem is evident (dplyr::arrange made a broken data.table and head() made the situation worse), it's hard to say how to avoid it.

@MichaelChirico
Copy link
Member

Possibly related: #5084.

I haven't read carefully, but I wonder if adding an Enchances: dependency on {dplyr} so we can provide an appropriate method makes sense? Or possibly this could live with {dplyr}? I am not sure what the best practice is for Enhances: generally.

@aitap
Copy link
Contributor

aitap commented Feb 13, 2025

Either package can Enhance the other, providing arrange.data.table. We'd need delayed registration (like we do with knitr::knit_print) because they own the generic.

#5084 is very relevant (and almost a palindrome of this issue number). We could try to make [.data.tablesetalloccol() do the same thing as the R function shallow() and throw away the keys and indices, but it's not easy. The following breaks a lot of tests that expect a key to be automatically assigned:

--- a/src/assign.c
+++ b/src/assign.c
@@ -256,8 +256,15 @@ SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose)
   // names may be NULL when null.data.table() passes list() to alloccol for example.
   // So, careful to use length() on names, not LENGTH().
   if (length(names)!=l) internal_error(__func__, "length of names (%d) is not length of dt (%d)", length(names),l); // # nocov
-  if (!selfrefok(dt,verbose))
-    return shallow(dt,R_NilValue,(n>l) ? n : l);  // e.g. test 848 and 851 in R > 3.0.2
+  if (!selfrefok(dt,verbose)) {
+    SEXP newdt = PROTECT(shallow(dt,R_NilValue,(n>l) ? n : l));  // e.g. test 848 and 851 in R > 3.0.2
+    // The attributes of non-intact data.tables are not to be trusted, so behave similarly
+    // to .shallow(x, retain.key = selfrefok(x)), like R shallow() does. #5084 #6804
+    setAttrib(newdt, sym_index, R_NilValue);
+    setAttrib(newdt, sym_sorted, R_NilValue);
+    UNPROTECT(1);
+    return newdt;
+  }
     // added (n>l) ? ... for #970, see test 1481.
   // TO DO:  test realloc names if selfrefnamesok (users can setattr(x,"name") themselves for example.
   // if (TRUELENGTH(getAttrib(dt,R_NamesSymbol))!=tl)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants