by Steph Hazlitt & Nic Crane
This repository contains materials for the ~90-minute Introduction to Arrow in R workshop delivered to NEDS (18th July 2024).
Slides: https://tinyurl.com/introtoarrowneds/
This workshop focuses on using the arrow R package, a mature R interface to Apache Arrow, to process larger-than-memory files and multi-file datasets with familiar dplyr syntax. You'll learn to create and use Parquet, an interoperable data file format, for efficient data storage and access. The workshop provides a foundation for using Arrow, giving you access to a powerful suite of tools for performant analysis of larger-than-memory data in R.
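As a brief preview of what this looks like, here is a minimal sketch of the kind of workflow the workshop covers; the file path and column names below are illustrative placeholders, not files shipped with this repository:

library(arrow)
library(dplyr)

# Open a CSV lazily as an Arrow Dataset; no data is read into memory yet
# (the path and columns here are hypothetical examples)
checkouts <- open_dataset("data/checkouts.csv", format = "csv")

# Familiar dplyr verbs build a query that Arrow executes outside of R's memory;
# collect() pulls only the small summarised result into an R data frame
checkouts |>
  filter(CheckoutYear == 2021) |>
  group_by(CheckoutType) |>
  summarise(n = n()) |>
  collect()

# Write the same data out as Parquet for smaller, faster storage
write_dataset(checkouts, "data/checkouts-parquet", format = "parquet")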
Detailed instructions for software requirements and data sources are shown below.
Set up a new project in RStudio from this repository, so you have a local copy of the slides and code.
Repository URL: https://github.com/thisisnic/introtoarrowneds
To install the required core packages for the workshop, run the following:
install.packages(c("arrow", "dplyr"))
Please note, macOS users only: at the time of writing (2nd April 2024), the version of arrow that can be installed from CRAN doesn't have all of its features enabled. Instead, please install from an alternative repository such as R-universe:
install.packages('arrow', repos = c('https://apache.r-universe.dev', 'https://cloud.r-project.org'))
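After installing, an optional way to check your build (not part of the official setup steps) is to print arrow's build information, which lists the optional capabilities that were compiled in:

# Shows the installed arrow version and which optional features
# (e.g. S3 support, compression codecs) are enabled in this build
arrow::arrow_info()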
This is the data we will use in the workshop. It's a good-sized single CSV file (9GB on disk in total), which can be downloaded from an AWS S3 bucket over HTTPS:
options(timeout = 1800)  # allow up to 30 minutes for the large download

dir.create("data")

download.file(
  url = "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  destfile = "./data/seattle-library-checkouts.csv"
)
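Once the download finishes, an optional sanity check (not part of the workshop instructions) is to confirm the file size and row count without reading the whole file into memory:

library(arrow)
library(dplyr)

# File size on disk in GB; it should be roughly 9
file.size("./data/seattle-library-checkouts.csv") / 1e9

# Open the CSV lazily as an Arrow Dataset and count rows;
# Arrow scans the file rather than loading it all into R's memory
open_dataset("./data/seattle-library-checkouts.csv", format = "csv") |>
  summarise(n_rows = n()) |>
  collect()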
If you don't have the time or disk space to download the 9GB dataset (and still have room left over for the exercises), you can run the workshop code with a "tiny" version of this data. Although the focus of this course is working with larger-than-memory data, you can still learn the concepts and workflows with smaller data; note, however, that you may not see the same performance improvements you would get when working with larger data.
options(timeout = 1800)

dir.create("data")

download.file(
  url = "https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/seattle-library-checkouts-tiny.csv",
  destfile = "./data/seattle-library-checkouts-tiny.csv"
)
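The workshop code can then point at the tiny file instead of the full one. For example, a quick sketch (the object name is just a placeholder) of opening it and peeking at the first few rows:

library(arrow)
library(dplyr)

# The same lazy Dataset workflow applies; only the file path changes
seattle_tiny <- open_dataset("./data/seattle-library-checkouts-tiny.csv",
                             format = "csv")

# Pull just the first few rows into an R data frame to check the columns
seattle_tiny |>
  head() |>
  collect()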
If you want to participate in the coding exercises or follow along, please do your best to arrive at the workshop with the required software and packages installed and the data downloaded onto your laptop.
This work is licensed under a Creative Commons Attribution 4.0 International License.