by Steph Hazlitt & Nic Crane
This repository contains materials for the ~90-minute Introduction to Arrow in R workshop delivered to NEDS (18th July 2024).
Slides: https://tinyurl.com/introtoarrowneds/
This workshop focuses on using the arrow R package, a mature R interface to Apache Arrow, to process larger-than-memory files and multi-file datasets with familiar dplyr syntax. You'll learn to create and use Parquet, an interoperable data file format, for efficient data storage and access. The workshop provides a foundation for using Arrow, giving you access to a powerful suite of tools for performant analysis of larger-than-memory data in R.
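As a brief preview of what this looks like, here is a minimal sketch of the kind of workflow the workshop covers; the file path and column names below are illustrative placeholders, not files shipped with this repository:

library(arrow)
library(dplyr)

# Open a CSV lazily as an Arrow Dataset; no data is read into memory yet
# (the path and columns here are hypothetical examples)
checkouts <- open_dataset("data/checkouts.csv", format = "csv")

# Familiar dplyr verbs build a query that Arrow executes outside of R's memory;
# collect() pulls only the small summarised result into an R data frame
checkouts |>
  filter(CheckoutYear == 2021) |>
  group_by(CheckoutType) |>
  summarise(n = n()) |>
  collect()

# Write the same data out as Parquet for smaller, faster storage
write_dataset(checkouts, "data/checkouts-parquet", format = "parquet")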
Detailed instructions for software requirements and data sources are shown below.
Set up a new project in RStudio from this repository, so you have a local copy of the slides and code.
Repository URL: https://github.com/thisisnic/introtoarrowneds
To install the required core packages for the workshop, run the following:
install.packages(c("arrow", "dplyr"))
Please note, macOS users only: at the time of writing (2nd April 2024), the version of arrow that can be installed from CRAN doesn't have all of its features enabled. Instead, please install from an alternative repository such as R-universe:
install.packages('arrow', repos = c('https://apache.r-universe.dev', 'https://cloud.r-project.org'))
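After installing, an optional way to check your build (not part of the official setup steps) is to print arrow's build information, which lists the optional capabilities that were compiled in:

# Shows the installed arrow version and which optional features
# (e.g. S3 support, compression codecs) are enabled in this build
arrow::arrow_info()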
This is the data we will use in the workshop. It's a good-sized single CSV file (9GB on disk in total), which can be downloaded from an AWS S3 bucket over HTTPS:
options(timeout = 1800)  # allow up to 30 minutes for the large download

dir.create("data")

download.file(
  url = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  destfile = "./data/seattle-library-checkouts.csv"
)
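Once the download finishes, an optional sanity check (not part of the workshop instructions) is to confirm the file size and row count without reading the whole file into memory:

library(arrow)
library(dplyr)

# File size on disk in GB; it should be roughly 9
file.size("./data/seattle-library-checkouts.csv") / 1e9

# Open the CSV lazily as an Arrow Dataset and count rows;
# Arrow scans the file rather than loading it all into R's memory
open_dataset("./data/seattle-library-checkouts.csv", format = "csv") |>
  summarise(n_rows = n()) |>
  collect()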
If you don't have the time or disk space to download the 9GB dataset (and still have room left over for the exercises), you can run the workshop code with a "tiny" version of this data. Although the focus of this course is working with larger-than-memory data, you can still learn the concepts and workflows with smaller data; note, however, that you may not see the same performance improvements you would get when working with larger data.
options(timeout = 1800)

dir.create("data")

download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv",
  destfile = "./data/seattle-library-checkouts-tiny.csv"
)
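The workshop code can then point at the tiny file instead of the full one. For example, a quick sketch (the object name is just a placeholder) of opening it and peeking at the first few rows:

library(arrow)
library(dplyr)

# The same lazy Dataset workflow applies; only the file path changes
seattle_tiny <- open_dataset("./data/seattle-library-checkouts-tiny.csv",
                             format = "csv")

# Pull just the first few rows into an R data frame to check the columns
seattle_tiny |>
  head() |>
  collect()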
If you want to participate in the coding exercises or follow along, please do your best to arrive at the workshop with the required software and packages installed and the data downloaded onto your laptop.
This work is licensed under a Creative Commons Attribution 4.0 International License.