-
Notifications
You must be signed in to change notification settings - Fork 12
/
Copy pathREADME.Rmd
138 lines (100 loc) · 6.32 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tidyquery <img src="man/figures/logo.png" align="right" width="120" />
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/tidyquery)](https://cran.r-project.org/package=tidyquery)
[![GitHub Actions build status](https://github.com/ianmcook/tidyquery/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/ianmcook/tidyquery/actions/workflows/check-standard.yaml)
[![Codecov test coverage](https://codecov.io/gh/ianmcook/tidyquery/branch/master/graph/badge.svg)](https://codecov.io/gh/ianmcook/tidyquery?branch=master)
<!-- badges: end -->
**tidyquery** runs SQL queries on R data frames.
It uses [queryparser](https://github.com/ianmcook/queryparser) to translate SQL queries into R expressions, then it uses [dplyr](https://dplyr.tidyverse.org) to evaluate these expressions and return results. **tidyquery** does not load data frames into a database; it queries them in place.
For an introduction to **tidyquery** and **queryparser**, watch the recording of the talk ["Bridging the Gap between SQL and R"](https://www.youtube.com/watch?v=JwP5KdWSgqE) from rstudio::conf(2020).
## Installation
Install the released version of **tidyquery** from [CRAN](https://CRAN.R-project.org/package=tidyquery) with:
``` r
install.packages("tidyquery")
```
Or install the development version from [GitHub](https://github.com/ianmcook/tidyquery) with:
``` r
# install.packages("remotes")
remotes::install_github("ianmcook/tidyquery")
```
## Usage
**tidyquery** exports two functions: `query()` and `show_dplyr()`.
### Using `query()`
To run a SQL query on an R data frame, call the function `query()`, passing a `SELECT` statement enclosed in quotes as the first argument. The table names in the `FROM` clause should match the names of data frames in your current R session:
```{r}
library(tidyquery)
library(nycflights13)
query(
" SELECT origin, dest,
COUNT(flight) AS num_flts,
round(SUM(seats)) AS num_seats,
round(AVG(arr_delay)) AS avg_delay
FROM flights f LEFT OUTER JOIN planes p
ON f.tailnum = p.tailnum
WHERE distance BETWEEN 200 AND 300
AND air_time IS NOT NULL
GROUP BY origin, dest
HAVING num_flts > 3000
ORDER BY num_seats DESC, avg_delay ASC
LIMIT 2;"
)
```
Alternatively, for single-table queries, you can pass a data frame as the first argument and a `SELECT` statement as the second argument, omitting the `FROM` clause. This allows `query()` to function like a dplyr verb:
```{r, message=FALSE}
library(dplyr)
airports %>%
query("SELECT name, lat, lon ORDER BY lat DESC LIMIT 5")
```
You can chain dplyr verbs before and after `query()`:
```{r}
planes %>%
filter(engine == "Turbo-fan") %>%
query("SELECT manufacturer AS maker, COUNT(*) AS num_planes GROUP BY maker") %>%
arrange(desc(num_planes)) %>%
head(5)
```
In the `SELECT` statement, the names of data frames and columns are case-sensitive (like in R) but keywords and function names are case-insensitive (like in SQL).
In addition to R data frames and tibbles (`tbl_df` objects), `query()` can be used to query other data frame-like objects, including:
- `dtplyr_step` objects created with [dtplyr](https://dtplyr.tidyverse.org), a [data.table](https://r-datatable.com/) backend for dplyr
- Apache Arrow `Table`, `RecordBatch`, `Dataset`, and `arrow_dplyr_query` objects created with [arrow](https://arrow.apache.org/docs/r/)
- `tbl_sql` objects created with [dbplyr](https://dbplyr.tidyverse.org) or a dbplyr backend package, enabling you to write SQL which is translated to dplyr then translated back to SQL and run in a database (a fun party trick!)
### Using `show_dplyr()`
**tidyquery** works by generating dplyr code. To print the dplyr code instead of running it, use `show_dplyr()`:
```{r}
show_dplyr(
" SELECT manufacturer,
COUNT(*) AS num_planes
FROM planes
WHERE engine = 'Turbo-fan'
GROUP BY manufacturer
ORDER BY num_planes DESC;"
)
```
## Current Limitations
**tidyquery** is subject to the current limitations of the queryparser package. Please see the **Current Limitations** section of the queryparser README on [CRAN](https://cran.r-project.org/package=queryparser/readme/README.html#current-limitations) or [GitHub](https://github.com/ianmcook/queryparser#current-limitations).
**tidyquery** also has the following additional limitations:
- Joins involving three or more tables are not supported.
- Because joins in dplyr currently work in a fundamentally different way than joins in SQL, some other types of join queries are not supported. Examples of unsupported join queries include non-equijoin queries and outer join queries with qualified references to the join column(s). Planned changes in dplyr will enable future versions of tidyquery to support more types of joins.
- In the code printed by `show_dplyr()`, calls to functions with more than five arguments might be truncated, with arguments after the fifth replaced with `...`.
## Related Work
The **sqldf** package ([CRAN](https://cran.r-project.org/package=sqldf), [GitHub](https://github.com/ggrothendieck/sqldf)) runs SQL queries on R data frames by transparently setting up a database, loading data from R data frames into the database, running SQL queries in the database, and returning results as R data frames.
The **duckdb** package ([CRAN](https://cran.r-project.org/package=duckdb), [GitHub](https://github.com/duckdb/duckdb/tree/master/tools/rpkg)) includes the function `duckdb_register()` which registers an R data frame as a virtual table in a [DuckDB](https://duckdb.org) database, enabling you to run SQL queries on the data frame with `DBI::dbGetQuery()`.
The **[dbplyr](https://dbplyr.tidyverse.org)** package ([CRAN](https://cran.r-project.org/package=dbplyr), [GitHub](https://github.com/tidyverse/dbplyr)) is like tidyquery in reverse: it converts dplyr code into SQL, allowing you to use dplyr to work with data in a database.
In **Python**, the
[**dataframe_sql**](https://github.com/zbrookle/dataframe_sql)
package (targeting [**pandas**](https://pandas.pydata.org)) and the
[**sql_to_ibis**](https://github.com/zbrookle/sql_to_ibis) package
(targeting [**Ibis**](https://ibis-project.org)) are analogous to
tidyquery.