merge.data.table() output is wrong #6804

zhizhongpu · 2025-02-09T03:40:20Z

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

library(dplyr)
library(data.table)

set.seed(42)
edmt <- data.table(
    id = 1:20,
    value = runif(20, 50, 500)
)

ic <- data.table(
    id = sample(1:20, 10)
)

spl <- edmt |> 
    merge(ic, by = "id") |> 
    arrange(-value) |> 
    head(10)

edmt_insample_A = edmt |> 
    merge.data.table(spl, by = "id", all = FALSE)
edmt_insample_A |> dim()
# [1] 1 3

edmt_insample_B = edmt |> 
    inner_join(spl, by = "id")
edmt_insample_B |> dim()
# [1] 10  3

Replacing arrange() with setorder resolves the issue.
Removing any line among these 3 resolves the issue.

  merge(ic, by = "id") |> 
  arrange(-value) |> 
  head(10)

# Output of sessionInfo()

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] eventstudyr_1.1.3  stringi_1.8.4      readxl_1.4.3       tidytab_0.0.0.9000 lubridate_1.9.4    forcats_1.0.0      stringr_1.5.1     
 [8] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   
[15] janitor_2.2.1      data.table_1.16.4 

loaded via a namespace (and not attached):
 [1] generics_0.1.3    pracma_2.4.4      hms_1.1.3         magrittr_2.0.3    grid_4.4.0        timechange_0.3.0  cellranger_1.1.0 
 [8] Formula_1.2-5     httr_1.4.7        scales_1.3.0      abind_1.4-8       cli_3.6.3         crayon_1.5.3      rlang_1.1.5      
[15] texreg_1.39.4     munsell_0.5.1     withr_3.0.2       tools_4.4.0       tzdb_0.4.0        colorspace_2.1-1  estimatr_1.0.4   
[22] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   snakecase_0.11.1  car_3.1-3         MASS_7.3-64       pkgconfig_2.0.3  
[29] pillar_1.10.1     gtable_0.3.6      glue_1.8.0        Rcpp_1.0.14       tidyselect_1.2.1  rstudioapi_0.17.1 farver_2.1.2     
[36] carData_3.0-5     labeling_0.4.3    compiler_4.4.0

The text was updated successfully, but these errors were encountered:

aitap · 2025-02-09T21:40:30Z

Can confirm with both the CRAN version on R-4.2.2 and the current master branch on R-devel. Somehow merge(edmt, spl, by = "id", all = FALSE) has much fewer rows than merge.data.frame(edmt, spl, by = "id", all = FALSE):

> merge(edmt, spl, by = "id", all = FALSE)
i.id has same type (integer) as x.id. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
forder.c received 20 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
bmerge: looping bmerge_r took 0.000s
bmerge: took 0.000s
bmerge done in 0.001s elapsed (0.001s cpu)
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Key: <id>
      id  value.x  value.y
   <int>    <num>    <num>
1:    15 258.0318 258.0318

Edited your MRE a bit to (1) make it easier to copy&paste and (2) make sure all the relevant packages are attached.

Part of the problem must be due to the data.table retaining the key after its rows are reordered by arrange, but then why it works without the head(10) after arrange(...)?

tdhock · 2025-02-10T12:54:08Z

for a work-around please use [.data.table instead of merge

zhizhongpu · 2025-02-10T13:12:36Z

for a work-around please use [.data.table instead of merge

I think I explicitly stated merge.data.table and the problem persisted - sorry it's not reflected in the mwe @tdhock

aitap · 2025-02-10T16:39:14Z

[.data.table gives the same wrong answer:

> spl[edmt, on = 'id', nomatch = NULL]
i.id has same type (integer) as x.id. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
forder.c received 20 rows and 2 columns
forderReuseSorting: opt=-1, took 0.000s
bmerge: looping bmerge_r took 0.000s
bmerge: took 0.000s
bmerge done in 0.000s elapsed (0.000s cpu)
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
Key: <id>
      id    value  i.value
   <int>    <num>    <num>
1:    15 258.0318 258.0318

Evidently, a data.table that had its values reordered and then the sorted attribute grafted on is invalid:

edmt |> merge(ic, by = "id") |> setorder(-value) |> setattr('sorted', 'id') -> spl3
merge(edmt, spl3, by = "id", all = FALSE) # same wrong output as in other cases

The difference between edmt |> merge(ic, by = "id") |> arrange(-value) |> merge(edmt, y=_, by = "id") (which seems to work) and edmt |> merge(ic, by = "id") |> arrange(-value) |> head(10) |> merge(edmt, y=_, by = "id") (which visibly breaks) is that the value returned by arrange(_, -value) is not an over-allocated data.table. When data.table:::bmerge(i,x) calls shallow(x), the function checks .selfrefok(x) before deciding whether to retain the key and correctly throws it away. On the other hand, when head(_, 10) is applied to the return value of arrange(_, -value), [.data.table makes sure to both preserve the sorted attribue and to over-allocate the vector of column pointers with internal attributes so that .selfrefok() would pass.

While the cause of the problem is evident (dplyr::arrange made a broken data.table and head() made the situation worse), it's hard to say how to avoid it.

MichaelChirico · 2025-02-10T17:41:47Z

Possibly related: #5084.

I haven't read carefully, but I wonder if adding an Enchances: dependency on {dplyr} so we can provide an appropriate method makes sense? Or possibly this could live with {dplyr}? I am not sure what the best practice is for Enhances: generally.

aitap · 2025-02-13T11:41:06Z

Either package can Enhance the other, providing arrange.data.table. We'd need delayed registration (like we do with knitr::knit_print) because they own the generic.

#5084 is very relevant (and almost a palindrome of this issue number). We could try to make [.data.table → setalloccol() do the same thing as the R function shallow() and throw away the keys and indices, but it's not easy. The following breaks a lot of tests that expect a key to be automatically assigned:

--- a/src/assign.c
+++ b/src/assign.c
@@ -256,8 +256,15 @@ SEXP alloccol(SEXP dt, R_len_t n, Rboolean verbose)
   // names may be NULL when null.data.table() passes list() to alloccol for example.
   // So, careful to use length() on names, not LENGTH().
   if (length(names)!=l) internal_error(__func__, "length of names (%d) is not length of dt (%d)", length(names),l); // # nocov
-  if (!selfrefok(dt,verbose))
-    return shallow(dt,R_NilValue,(n>l) ? n : l);  // e.g. test 848 and 851 in R > 3.0.2
+  if (!selfrefok(dt,verbose)) {
+    SEXP newdt = PROTECT(shallow(dt,R_NilValue,(n>l) ? n : l));  // e.g. test 848 and 851 in R > 3.0.2
+    // The attributes of non-intact data.tables are not to be trusted, so behave similarly
+    // to .shallow(x, retain.key = selfrefok(x)), like R shallow() does. #5084 #6804
+    setAttrib(newdt, sym_index, R_NilValue);
+    setAttrib(newdt, sym_sorted, R_NilValue);
+    UNPROTECT(1);
+    return newdt;
+  }
     // added (n>l) ? ... for #970, see test 1481.
   // TO DO:  test realloc names if selfrefnamesok (users can setattr(x,"name") themselves for example.
   // if (TRUELENGTH(getAttrib(dt,R_NamesSymbol))!=tl)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge.data.table() output is wrong #6804

merge.data.table() output is wrong #6804

zhizhongpu commented Feb 9, 2025 •

edited by aitap

Loading

aitap commented Feb 9, 2025 •

edited

Loading

tdhock commented Feb 10, 2025

zhizhongpu commented Feb 10, 2025

aitap commented Feb 10, 2025

MichaelChirico commented Feb 10, 2025

aitap commented Feb 13, 2025

merge.data.table() output is wrong #6804

merge.data.table() output is wrong #6804

Comments

zhizhongpu commented Feb 9, 2025 • edited by aitap Loading

aitap commented Feb 9, 2025 • edited Loading

tdhock commented Feb 10, 2025

zhizhongpu commented Feb 10, 2025

aitap commented Feb 10, 2025

MichaelChirico commented Feb 10, 2025

aitap commented Feb 13, 2025

zhizhongpu commented Feb 9, 2025 •

edited by aitap

Loading

aitap commented Feb 9, 2025 •

edited

Loading