Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter by a column in another table #1476

Closed
Jiefei-Wang opened this issue Mar 14, 2024 · 5 comments
Closed

Filter by a column in another table #1476

Jiefei-Wang opened this issue Mar 14, 2024 · 5 comments

Comments

@Jiefei-Wang
Copy link

Jiefei-Wang commented Mar 14, 2024

This is an amazing package that helps me a lot in my work! I'm not sure if this is just a limitation in my backend database(I'm using Oracle SQL) or a feature that hasn't been developed. Suppose I have a database with two tables A and B where both tables have an id column, what I want to do is to filter the rows in table A where the id is presented in B. Here is a minimum example in dplyr:

library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))
A |> filter(id %in% B$id)

This, however, cannot be easily converted to dbplyr if both A and B are tables in a database. Subsetting the object B by column name will give a NULL value(e.g. B$id = NULL) and therefore lead to an error. As a workaround, I have to preprocess the table B to create distinct ids, then perform an inner join. It is still fine, but not that elegant. I wonder if there is any better way to do the filter.

@fh-mthomson
Copy link
Contributor

Would the following work (translating A and B to tables on Oracle)?

library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))

A %>% 
  semi_join(B, by = "id")

As you mentioned, using x$id to identify a column is intentionally not supported for lazy tibbles, per the error message:

table <- dbplyr::lazy_frame(i = 1, con = dbplyr::simulate_dbi())
table$id
#> Error in `table$id`:
#> ! The `$` method of <tbl_lazy> is for internal use only.
#> ℹ Use `dplyr::pull()` to get the values in a column.

Created on 2024-03-15 with reprex v2.1.0

@hadley
Copy link
Member

hadley commented Mar 15, 2024

Yeah, a semi join is what you want here. Thanks for the explanation @fh-mthomson!

@hadley hadley closed this as completed Mar 15, 2024
@Jiefei-Wang
Copy link
Author

Hi @fh-mthomson , Thanks for your great explanation! I'm just curious about the reason that the $ operation is not supported. I understand collecting a column from the table is costly, but is it possible to return a delayed vector? What is the major concern about it? Thanks.

@fh-mthomson
Copy link
Contributor

fh-mthomson commented Mar 15, 2024

The x object is a "lazy tibble" which is constructured as a list object, rather than a data frame object. Technically, x$column_name only makes sense for data frames, but not the lazy version. However, you can still readily the values via dplyr::pull(table, id).

See more technical detail in #1056

The same can be said for $ -- even if it doesn't return a column, I would guess that's what the vast majority of people who wind up using $ on a tbl_lazy object will have intended.

@hadley
Copy link
Member

hadley commented Apr 2, 2024

To add to @fh-mthomson's explanation, I don't think there's an obvious way to translate id %in% B$id to SQL unless you did something like id %in% ('a', 'b', 'c', ...). And that's unnecessarily expensive because you'd have to download that entire vector into R and convert it to SQL and then send it back to the database. With the semi join, all the data stays in one place (the database).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants