Save output as CSV, and read and write using data.table #262

Open
nmdefries opened this issue Apr 17, 2023 · 5 comments · May be fixed by #308
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

nmdefries (Collaborator):

Using RDS to read/write/store data is slow and not portable. We may want to switch the data pipeline or dashboard to being written in Python in the future, so using a format that Python can easily read would be preferable.

Looking at a comparison of different formats and packages, storing as a CSV and reading/writing using data.table via fread and fwrite seems pretty good. feather could also be an option (is supported in both R and Python) but it's not clear if standard dataframe and dplyr procedures work with it. data.table data can be seamlessly processed either with dplyr or with faster data.table syntax.
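For reference, a minimal round-trip benchmark along these lines (not from the issue; file names and data are made up) looks like:

```r
library(data.table)

# Synthetic data roughly standing in for the pipeline output.
dt <- data.table(id = 1:1e6, value = rnorm(1e6))

# RDS round-trip (base R)
system.time(saveRDS(dt, "data.rds"))
system.time(readRDS("data.rds"))

# CSV round-trip with data.table
system.time(fwrite(dt, "data.csv"))
system.time(fread("data.csv"))
```

The result of `fread` is a `data.table`, which also inherits from `data.frame`, so it can be handed to dplyr verbs or processed with data.table syntax directly.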

@nmdefries nmdefries added the good first issue Good for newcomers label Apr 17, 2023
nmdefries (Collaborator):

We're currently using a special AWS readRDS function; fread should be able to read directly from the S3 bucket, since fread can read from URLs and the bucket is publicly accessible.
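A sketch of what that direct read could look like (the bucket URL here is hypothetical, since the real bucket path isn't given in this issue; this also requires network access):

```r
library(data.table)

# fread accepts http(s) URLs and transparently handles .gz files,
# so a publicly readable S3 object can be read without aws.s3 helpers.
dt <- fread("https://example-bucket.s3.amazonaws.com/output/data.csv.gz")
```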

@nmdefries nmdefries added the enhancement New feature or request label Apr 26, 2023

nmdefries commented Aug 24, 2023

My preference is to store the data as CSV and read/write using data.table, both for speed now and for the potential of further speedups later. One consideration: RDS stores column types, while CSV does not, so we will need to watch for issues like date timezones being dropped or strings-as-factors causing odd behavior after the switch.
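One way to guard against the type-loss issue is to be explicit about column classes on read. A small self-contained sketch (column names are hypothetical):

```r
library(data.table)

# Tiny demo file standing in for real pipeline output.
fwrite(data.table(date = "2023-04-17", geo = "us", value = 1.5), "demo.csv")

# CSV does not preserve column classes the way RDS does, so declare them.
# fread defaults to stringsAsFactors = FALSE, which avoids the factor surprise.
dt <- fread("demo.csv", colClasses = c(date = "Date", geo = "character"))
str(dt)
```

Timezone-sensitive timestamps would still need an explicit `as.POSIXct(..., tz = "UTC")` step after reading, since the zone isn't stored in the CSV.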


nmdefries commented Sep 22, 2023

data.table is definitely faster at writing, but only equally fast at reading from the S3 bucket. The cause seems to be that compressed RDS files are smaller than gzipped CSVs. Since reading from the S3 bucket involves downloading an object to a temporary file, downloading the larger gzipped CSVs is slower, and the extra download time eats into the speedup of fread over readRDS.

A potential improvement would be to download objects from S3 to permanent files on disk (either manually when we detect a change in the bucket state, or with aws.s3::s3sync to keep files up to date), so that most users can read from disk rather than re-download objects. Do files persist on the dashboard hosts?
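A sketch of the on-disk cache idea, assuming the aws.s3 package and a hypothetical bucket name (running this for real needs AWS access to the bucket):

```r
library(aws.s3)
library(data.table)

# Local mirror of the bucket; s3sync() downloads objects so the
# directory tracks the bucket contents between runs.
cache_dir <- "cache/s3"
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
s3sync(path = cache_dir, bucket = "example-dashboard-bucket", direction = "download")

# Subsequent reads hit the local cache instead of a fresh temp download.
dt <- fread(file.path(cache_dir, "data.csv.gz"))
```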

Download time would also matter less if we split the data into more, smaller files, so that only small subsets of the data need to be loaded at a time.

nmdefries (Collaborator):

I had originally wanted to store the data in uncompressed CSVs, with the thought that a future optimization could append new data to the end of an existing file without reading in the whole thing (assuming deduplication isn't needed). However, reading uncompressed CSVs directly from the S3 bucket into the dashboard is too slow to be feasible, due to the temp-download step. They could work if we kept an on-disk cache of the S3 bucket.
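The append idea itself is cheap with data.table; a minimal sketch (file name and columns hypothetical):

```r
library(data.table)

# fwrite(append = TRUE) writes the header only when the file does not
# already exist, so repeated calls grow the file without rereading it.
new_rows <- data.table(date = Sys.Date(), value = 1.0)
fwrite(new_rows, "output.csv", append = TRUE)
```

This only works for uncompressed CSVs; gzipped files can't be safely extended in place this way without re-reading.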

nmdefries (Collaborator):

Running into memory issues, even when only processing hospitalizations.
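Not from the issue, but two fread knobs that may help with memory, as a sketch (demo file and column names are made up):

```r
library(data.table)

# Demo file standing in for a large signal CSV.
fwrite(data.table(date = "2023-04-17", signal = "hosp", value = 1:200000),
       "demo_big.csv")

# Read only the columns the dashboard actually needs:
dt <- fread("demo_big.csv", select = c("date", "value"))

# Or process fixed-size row chunks instead of loading the whole file:
chunk1 <- fread("demo_big.csv", nrows = 100000)
```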

@nmdefries nmdefries self-assigned this Jan 25, 2024