Filter by a column in another table #1476

Jiefei-Wang · 2024-03-14T20:42:30Z

This is an amazing package that helps me a lot in my work! I'm not sure if this is just a limitation in my backend database(I'm using Oracle SQL) or a feature that hasn't been developed. Suppose I have a database with two tables A and B where both tables have an id column, what I want to do is to filter the rows in table A where the id is presented in B. Here is a minimum example in dplyr:

library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))
A |> filter(id %in% B$id)

This, however, cannot be easily converted to dbplyr if both A and B are tables in a database. Subsetting the object B by column name will give a NULL value(e.g. B$id = NULL) and therefore lead to an error. As a workaround, I have to preprocess the table B to create distinct ids, then perform an inner join. It is still fine, but not that elegant. I wonder if there is any better way to do the filter.

The text was updated successfully, but these errors were encountered:

fh-mthomson · 2024-03-15T01:55:22Z

Would the following work (translating A and B to tables on Oracle)?

library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))

A %>% 
  semi_join(B, by = "id")

As you mentioned, using x$id to identify a column is intentionally not supported for lazy tibbles, per the error message:

table <- dbplyr::lazy_frame(i = 1, con = dbplyr::simulate_dbi())
table$id
#> Error in `table$id`:
#> ! The `$` method of <tbl_lazy> is for internal use only.
#> ℹ Use `dplyr::pull()` to get the values in a column.

^{Created on 2024-03-15 with reprex v2.1.0}

hadley · 2024-03-15T14:50:53Z

Yeah, a semi join is what you want here. Thanks for the explanation @fh-mthomson!

Jiefei-Wang · 2024-03-15T15:16:23Z

Hi @fh-mthomson , Thanks for your great explanation! I'm just curious about the reason that the $ operation is not supported. I understand collecting a column from the table is costly, but is it possible to return a delayed vector? What is the major concern about it? Thanks.

fh-mthomson · 2024-03-15T21:01:16Z

The x object is a "lazy tibble" which is constructured as a list object, rather than a data frame object. Technically, x$column_name only makes sense for data frames, but not the lazy version. However, you can still readily the values via dplyr::pull(table, id).

See more technical detail in #1056

The same can be said for $ -- even if it doesn't return a column, I would guess that's what the vast majority of people who wind up using $ on a tbl_lazy object will have intended.

hadley · 2024-04-02T13:51:04Z

To add to @fh-mthomson's explanation, I don't think there's an obvious way to translate id %in% B$id to SQL unless you did something like id %in% ('a', 'b', 'c', ...). And that's unnecessarily expensive because you'd have to download that entire vector into R and convert it to SQL and then send it back to the database. With the semi join, all the data stays in one place (the database).

hadley closed this as completed Mar 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filter by a column in another table #1476

Filter by a column in another table #1476

Jiefei-Wang commented Mar 14, 2024 •

edited

Loading

fh-mthomson commented Mar 15, 2024

hadley commented Mar 15, 2024

Jiefei-Wang commented Mar 15, 2024

fh-mthomson commented Mar 15, 2024 •

edited

Loading

hadley commented Apr 2, 2024

Filter by a column in another table #1476

Filter by a column in another table #1476

Comments

Jiefei-Wang commented Mar 14, 2024 • edited Loading

fh-mthomson commented Mar 15, 2024

hadley commented Mar 15, 2024

Jiefei-Wang commented Mar 15, 2024

fh-mthomson commented Mar 15, 2024 • edited Loading

hadley commented Apr 2, 2024

Jiefei-Wang commented Mar 14, 2024 •

edited

Loading

fh-mthomson commented Mar 15, 2024 •

edited

Loading