Save output as CSV, and read and write using data.table #262

Open
nmdefries opened this issue Apr 17, 2023 · 5 comments · May be fixed by #308
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

nmdefries (Collaborator):

Using RDS to read/write/store data is slow and not portable. We may want to switch the data pipeline or dashboard to being written in Python in the future, so using a format that Python can easily read would be preferable.

Looking at a comparison of different formats and packages, storing as a CSV and reading/writing using data.table via fread and fwrite seems pretty good. feather could also be an option (is supported in both R and Python) but it's not clear if standard dataframe and dplyr procedures work with it. data.table data can be seamlessly processed either with dplyr or with faster data.table syntax.
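For reference, a minimal round-trip benchmark along these lines (not from the issue; file names and data are made up) looks like:

```r
library(data.table)

# Synthetic data roughly standing in for the pipeline output.
dt <- data.table(id = 1:1e6, value = rnorm(1e6))

# RDS round-trip (base R)
system.time(saveRDS(dt, "data.rds"))
system.time(readRDS("data.rds"))

# CSV round-trip with data.table
system.time(fwrite(dt, "data.csv"))
system.time(fread("data.csv"))
```

The result of `fread` is a `data.table`, which also inherits from `data.frame`, so it can be handed to dplyr verbs or processed with data.table syntax directly.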

@nmdefries nmdefries added the good first issue Good for newcomers label Apr 17, 2023
nmdefries (Collaborator):

We're currently using a special AWS readRDS function; fread should be able to read directly from the S3 bucket, since fread can read from URLs and the bucket is publicly accessible.
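A sketch of what that direct read could look like (the bucket URL here is hypothetical, since the real bucket path isn't given in this issue; this also requires network access):

```r
library(data.table)

# fread accepts http(s) URLs and transparently handles .gz files,
# so a publicly readable S3 object can be read without aws.s3 helpers.
dt <- fread("https://example-bucket.s3.amazonaws.com/output/data.csv.gz")
```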

@nmdefries nmdefries added the enhancement New feature or request label Apr 26, 2023

nmdefries commented Aug 24, 2023

My preference is to store the data as CSV and read/write using data.table, both for speed now and for the potential of further speedups later. One consideration: RDS stores column types, while CSV does not, so we will need to watch for issues like date timezones being dropped or strings-as-factors causing odd behavior after the switch.
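One way to guard against the type-loss issue is to be explicit about column classes on read. A small self-contained sketch (column names are hypothetical):

```r
library(data.table)

# Tiny demo file standing in for real pipeline output.
fwrite(data.table(date = "2023-04-17", geo = "us", value = 1.5), "demo.csv")

# CSV does not preserve column classes the way RDS does, so declare them.
# fread defaults to stringsAsFactors = FALSE, which avoids the factor surprise.
dt <- fread("demo.csv", colClasses = c(date = "Date", geo = "character"))
str(dt)
```

Timezone-sensitive timestamps would still need an explicit `as.POSIXct(..., tz = "UTC")` step after reading, since the zone isn't stored in the CSV.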


nmdefries commented Sep 22, 2023

data.table is definitely faster at writing, but only equally fast at reading from the S3 bucket. The cause seems to be that compressed RDS files are smaller than gzipped CSVs. Since reading from the S3 bucket involves downloading an object to a temporary file, downloading the larger gzipped CSVs is slower, and the extra download time eats into the speedup of fread over readRDS.

A potential improvement would be to download objects from S3 to permanent files on disk (either manually when we detect a change in the bucket state, or with aws.s3::s3sync to keep files up to date), so that most users can read from disk rather than re-download objects. Do files persist on the dashboard hosts?
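A sketch of the on-disk cache idea, assuming the aws.s3 package and a hypothetical bucket name (running this for real needs AWS access to the bucket):

```r
library(aws.s3)
library(data.table)

# Local mirror of the bucket; s3sync() downloads objects so the
# directory tracks the bucket contents between runs.
cache_dir <- "cache/s3"
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
s3sync(path = cache_dir, bucket = "example-dashboard-bucket", direction = "download")

# Subsequent reads hit the local cache instead of a fresh temp download.
dt <- fread(file.path(cache_dir, "data.csv.gz"))
```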

Download time would also matter less if we split the data into more, smaller files, so that only small subsets of the data need to be loaded at a time.

nmdefries (Collaborator):

I had originally wanted to store the data in uncompressed CSVs, with the thought that a future optimization could append new data to the end of an existing file without reading in the whole thing (assuming deduplication isn't needed). However, reading uncompressed CSVs directly from the S3 bucket into the dashboard is too slow to be feasible, due to the temp-download step. They could work if we kept an on-disk cache of the S3 bucket.
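The append idea itself is cheap with data.table; a minimal sketch (file name and columns hypothetical):

```r
library(data.table)

# fwrite(append = TRUE) writes the header only when the file does not
# already exist, so repeated calls grow the file without rereading it.
new_rows <- data.table(date = Sys.Date(), value = 1.0)
fwrite(new_rows, "output.csv", append = TRUE)
```

This only works for uncompressed CSVs; gzipped files can't be safely extended in place this way without re-reading.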

nmdefries (Collaborator):

Running into memory issues, even when only processing hospitalizations.
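Not from the issue, but two fread knobs that may help with memory, as a sketch (demo file and column names are made up):

```r
library(data.table)

# Demo file standing in for a large signal CSV.
fwrite(data.table(date = "2023-04-17", signal = "hosp", value = 1:200000),
       "demo_big.csv")

# Read only the columns the dashboard actually needs:
dt <- fread("demo_big.csv", select = c("date", "value"))

# Or process fixed-size row chunks instead of loading the whole file:
chunk1 <- fread("demo_big.csv", nrows = 100000)
```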

@nmdefries nmdefries self-assigned this Jan 25, 2024