-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy pathREADME.Rmd
371 lines (292 loc) · 11.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# RPresto
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/RPresto)](https://CRAN.R-project.org/package=RPresto)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![test-coverage](https://github.com/prestodb/RPresto/actions/workflows/test-coverage.yaml/badge.svg)](https://github.com/prestodb/RPresto/actions/workflows/test-coverage.yaml)
[![check-standard](https://github.com/prestodb/RPresto/actions/workflows/check-standard.yaml/badge.svg)](https://github.com/prestodb/RPresto/actions/workflows/check-standard.yaml)
<!-- badges: end -->
RPresto is a [DBI](https://dbi.r-dbi.org/)-based adapter for
the open source distributed SQL query engine [Presto](https://prestodb.io/)
for running interactive analytic queries.
## Installation
RPresto is both on [CRAN](https://cran.r-project.org/package=RPresto)
and [github](https://github.com/prestodb/RPresto).
For the CRAN version, you can use
```{r install_cran, eval = FALSE}
install.packages("RPresto")
```
You can install the development version of RPresto from
[GitHub](https://github.com/) with:
```{r install_github, eval = FALSE}
# install.packages("devtools")
devtools::install_github("prestodb/RPresto")
```
## Usage
**The following examples assume that you have a in-memory Presto server set up
locally.** It's the simplest server which stores all data and metadata in RAM on
workers and both are discarded when Presto restarts. If you don't have one set
up, please refer to the [memory connector documentation][1].
```{r setup, eval = TRUE}
# Load libaries and connect to Presto
library(RPresto)
library(DBI)
con <- DBI::dbConnect(
drv = RPresto::Presto(),
host = "http://localhost",
port = 8080,
user = Sys.getenv("USER"),
catalog = "memory",
schema = "default"
)
```
There are two levels of APIs: `DBI` and `dplyr`.
### `DBI` APIs
The easiest and most flexible way of executing a `SELECT` query is using a
[`dbGetQuery()`][2] call. It returns the query result in a [`tibble`][3].
```{r dbGetQuery, eval=TRUE}
DBI::dbGetQuery(con, "SELECT CAST(3.14 AS DOUBLE) AS pi")
```
[`dbWriteTable()`][4] can be used to write a small data frame into a Presto
table.
```{r dbWriteTable_pre_hook, eval = TRUE, echo = FALSE}
if (DBI::dbExistsTable(con, "iris")) {
DBI::dbRemoveTable(con, "iris")
}
```
```{r dbWriteTable, eval = TRUE}
# Writing iris data frame into Presto
DBI::dbWriteTable(con, "iris", iris)
```
[`dbExistsTable()`][5] checks if a table exists.
```{r dbExistsTable, eval = TRUE}
DBI::dbExistsTable(con, "iris")
```
[`dbReadTable()`][6] reads the entire table into R. It's essentially a `SELECT *`
query on the table.
```{r dbReadTable, eval = TRUE}
DBI::dbReadTable(con, "iris")
```
[`dbRemoveTable()`][7] drops the table from Presto.
```{r dbRemoveTable, eval = TRUE}
DBI::dbRemoveTable(con, "iris")
```
You can execute a statement and returns the number of rows affected using
[`dbExecute()`][8].
```{r dbExecute_1_pre_hook, eval = TRUE, echo = FALSE}
if (DBI::dbExistsTable(con, "testing_table")) {
DBI::dbRemoveTable(con, "testing_table")
}
```
```{r dbExecute_1, eval = TRUE}
# Create an empty table using CREATE TABLE
DBI::dbExecute(
con, "CREATE TABLE testing_table (field1 BIGINT, field2 VARCHAR)"
)
```
`dbExecute()` returns the number of rows affected by the statement. Since a
`CREATE TABLE` statement creates an empty table, it returns 0.
```{r dbExecute_2, eval = TRUE}
DBI::dbExecute(
con,
"INSERT INTO testing_table VALUES (1, 'abc'), (2, 'xyz')"
)
```
Since 2 rows are inserted into the table, it returns 2.
```{r check_dbExecute, eval = TRUE}
# Check the previous INSERT statment works
DBI::dbReadTable(con, "testing_table")
```
### `dplyr` APIs
We also include `dplyr` database backend integration (which is mainly
implemented using the [`dbplyr` package][9]).
```{r dplyr_setup, eval = TRUE, message = FALSE}
# Load packages
library(dplyr)
library(dbplyr)
```
We can use [`dplyr::copy_to()`][11] to write a local data frame to a
`PrestoConnection` and immediately create a remote table on it.
```{r dplyr_copyto, eval = TRUE}
# Add mtcars to Presto
if (DBI::dbExistsTable(con, "mtcars")) {
DBI::dbRemoveTable(con, "mtcars")
}
tbl.mtcars <- dplyr::copy_to(dest = con, df = mtcars, name = "mtcars")
# colnames() gives the column names
tbl.mtcars %>% colnames()
```
[`dplyr::tbl()`][10] also work directly on the `PrestoConnection`.
```{r dplyr_tbl, eval = TRUE}
# Treat "iris" in Presto as a remote data source that dplyr can now manipulate
if (!DBI::dbExistsTable(con, "iris")) {
DBI::dbWriteTable(con, "iris", iris)
}
tbl.iris <- dplyr::tbl(con, "iris")
tbl.iris %>% colnames()
# dplyr verbs can be applied onto the remote data source
tbl.iris %>%
group_by(species) %>%
summarize(
mean_sepal_length = mean(sepal.length, na.rm = TRUE)
) %>%
arrange(species) %>%
collect()
```
### `BIGINT` handling
RPresto's handling of `BIGINT` (i.e. 64-bit integers) is similar to other DBI
packages (e.g. [`bigrquery`][12], [`RPostgres`][13]). We provide a `bigint`
argument that users can use in multiple interfaces to specify how they want
`BIGINT` typed data to be translated into R.
The `bigint` argument takes one of the following 4 possible values.
1. `bigint = "integer"` is the default setting. It translates `BIGINT` to R's
native integer type (i.e. 32-bit integer). The range of 32-bit integer is
`[-2,147,483,648, 2,147,483,647]` which should cover most integer use cases.
1. In case that you need to represent integer values outside of the 32-bit
integer range, you have 2 options: `bigint = "numeric"` which translates the
number into a `double` floating-point type in R; and `bigint = "integer64"`
which packages the number using the `bit64::integer64` class. Note that both of
those two approaches actually the same precision-preservation range:
`+/-(2^53-1) = +/-9,007,199,254,740,991`, due to the fact that the Presto REST
API uses JSON to encode the number and JSON has a limit at 53 bits (rather than
64 bits).
1. `bigint = "character"` casts the number into a string. This is most useful
when `BIGINT` is used to represent an ID rather than a real arithmetic number.
#### Where to use `bigint`
The DBI interface function `dbGetQuery()` is the most fundamental interface
whereby `bigint` can be specified. All other interfaces are either built on top
of `dbGetQuery()` or only take effect when used with `dbGetQuery()`.
```{r dbgetquery_bigint, eval = TRUE}
# BIGINT within the 32-bit integer range is simply translated into integer
DBI::dbGetQuery(con, "SELECT CAST(1 AS BIGINT) AS small_bigint")
# BIGINT outside of the 32-bit integer range generates a warning and returns NA
# when bigint is not specified
DBI::dbGetQuery(con, "SELECT CAST(POW(2, 31) AS BIGINT) AS overflow_bigint")
# Using bigint to specify numeric or integer64 translations
DBI::dbGetQuery(
con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_numeric",
bigint = "numeric"
)
DBI::dbGetQuery(
con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_integer64",
bigint = "integer64"
)
```
When used with the `dplyr` interface, `bigint` can be specified in two places.
1. Users can pass the `bigint` argument to `dbConnect()` when creating the
`PrestoConnection`. All queries that use the connection later will use the
specified `bigint` setting.
```{r dbconnect_bigint, eval = TRUE}
con.bigint <- DBI::dbConnect(
drv = RPresto::Presto(),
host = "http://localhost",
port = 8080,
user = Sys.getenv("USER"),
catalog = "memory",
schema = "default",
# bigint can be specified in dbConnect
bigint = "integer64"
)
# BIGINT outside of the 32-bit integer range is automatically translated to
# integer64, per the connection setting earlier
DBI::dbGetQuery(
con.bigint, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_integer64"
)
```
1. If you only want to specify `bigint` for a particular query when using the
`dplyr` interface without affecting other queries, you can pass `bigint` to the
`collect()` call.
```{r collect_bigint, eval = TRUE}
tbl.bigint <- dplyr::tbl(
con, sql("SELECT CAST(POW(2, 31) AS BIGINT) AS bigint")
)
# Default collect() generates a warning and returns NA
dplyr::collect(tbl.bigint)
# Passing bigint to collect() specifies BIGINT treatment
dplyr::collect(tbl.bigint, bigint = "integer64")
```
## Connecting to Trino
To connect to Trino you must set the `use.trino.headers` parameter so `RPresto`
knows to send the correct headers to the server. Otherwise all the same
functionality is supported.
```{r, eval=FALSE}
con.trino <- DBI::dbConnect(
RPresto::Presto(),
use.trino.headers = TRUE,
host = "http://localhost",
port = 8080,
user = Sys.getenv("USER"),
schema = "<schema>",
catalog = "<catalog>",
source = "<source>"
)
```
## Passing extra credentials to the connector
To pass extraCredentials that gets added to the `X-Presto-Extra-Credential`
header use the `extra.credentials` parameter so `RPresto` will add that to the
header while creating the `PrestoConnection`.
Set `use.trino.headers` if you want to pass extraCredentials through the
`X-Trino-Extra-Credential` header.
```{r, eval=FALSE}
con <- DBI::dbConnect(
RPresto::Presto(),
host = "http://localhost",
port = 7777,
user = Sys.getenv("USER"),
schema = "<schema>",
catalog = "<catalog>",
source = "<source>",
extra.credentials = "test.token.foo=bar",
)
```
## Use RPresto with Kerberos
Assuming that you have Kerberos already set up with the Presto coordinator, you
can use `RPresto` with Kerberos by passing a Kerberos config header to the
`request.config` argument of `dbConnect()`. We provide a convenient wrapper
`kerberos_configs()` to further simplify the workflow.
```{r, eval = FALSE}
con <- DBI::dbConnect(
RPresto::Presto(),
host = "http://localhost",
port = 7777,
user = Sys.getenv("USER"),
schema = "<schema>",
catalog = "<catalog>",
source = "<source>",
request.config = kerberos_configs()
)
```
## How RPresto works
Presto exposes its interface via a REST based API[^1]. We utilize the
[httr](https://github.com/r-lib/httr) package to make the API calls and
use [jsonlite](https://github.com/jeroen/jsonlite) to reshape the
data into a `tibble`.
RPresto has been tested on Presto 0.100.
## License
RPresto is BSD-licensed.
[^1]: See <https://github.com/prestodb/presto/wiki/HTTP-Protocol> for a
description of the API.
[1]: https://prestodb.io/docs/current/connector/memory.html
[2]: https://dbi.r-dbi.org/reference/dbGetQuery
[3]: https://tibble.tidyverse.org/
[4]: https://dbi.r-dbi.org/reference/dbWriteTable
[5]: https://dbi.r-dbi.org/reference/dbExistsTable
[6]: https://dbi.r-dbi.org/reference/dbReadTable
[7]: https://dbi.r-dbi.org/reference/dbRemoveTable
[8]: https://dbi.r-dbi.org/reference/dbExecute
[9]: https://dbplyr.tidyverse.org/
[10]: https://dplyr.tidyverse.org/reference/tbl.html
[11]: https://dplyr.tidyverse.org/reference/copy_to.html
[12]: https://bigrquery.r-dbi.org/reference/bigquery.html
[13]: https://rpostgres.r-dbi.org/reference/Postgres