Save output as CSV, and read and write using data.table
#262
We're currently using a special AWS
My preference is to store the data as CSV, and read/write using `data.table`.
A potential improvement to this would be to download objects from s3 to permanent files on disk (we can do this manually when we detect a change in the bucket state, or use …). Download time would also be less significant if we had more and smaller files, so that we only have to load small subsets of the data.
I had originally wanted to store the data in uncompressed CSVs, with the thought that a future optimization could append data to the end of an existing file without reading in the whole thing (if deduplication weren't needed). However, reading uncompressed CSVs directly from the s3 bucket into the dashboard is too slow to be feasible because of the temp download step. They could work if we kept an on-disk cache of the s3 bucket.
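Something like the following R sketch could combine the on-disk cache with append-only writes; the `aws.s3` package, the bucket name, and the object key here are illustrative assumptions, not part of the current pipeline:

```r
library(data.table)
library(aws.s3)  # assumed S3 client; any download mechanism would work

bucket <- "example-forecast-bucket"   # hypothetical bucket name
key    <- "hospitalizations.csv"      # hypothetical object key
cache  <- file.path("s3-cache", key)  # permanent on-disk copy of the object

dir.create(dirname(cache), recursive = TRUE, showWarnings = FALSE)

# Refresh the local copy only when it is missing (or, in practice, when we
# detect a change in the bucket state), instead of re-downloading on every load.
if (!file.exists(cache)) {
  save_object(object = key, bucket = bucket, file = cache)
}

# Reading the cached uncompressed CSV skips the temp-download step entirely.
dat <- fread(cache)

# Appending new rows avoids rewriting the whole file (only workable if
# deduplication isn't needed); here we just re-append the first row to show
# the mechanics.
fwrite(dat[1], cache, append = TRUE)
```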
Running into memory issues, even when only processing hospitalizations.
Using RDS to read/write/store data is slow and not portable. We may want to rewrite the data pipeline or dashboard in Python in the future, so using a format that Python can easily read would be preferable.
Looking at a comparison of different formats and packages, storing as a CSV and reading/writing using `data.table` via `fread` and `fwrite` seems pretty good. `feather` could also be an option (it is supported in both R and Python), but it's not clear whether standard dataframe and `dplyr` procedures work with it. `data.table` data can be seamlessly processed either with `dplyr` or with the faster `data.table` syntax.
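As a rough illustration of that round trip (the example data and file name below are made up):

```r
library(data.table)
library(dplyr)

# Made-up example data standing in for the pipeline output.
dat <- data.table(
  location    = c("US", "US", "CA", "CA"),
  target_date = as.IDate("2021-01-01") + 0:3,
  value       = c(100, 120, 40, 35)
)

# Write as a plain CSV with fwrite (fast, and readable from Python as well).
fwrite(dat, "forecasts.csv")

# Read it back with fread.
dat2 <- fread("forecasts.csv")

# The same object can be processed with dplyr verbs...
dat2 %>%
  group_by(location) %>%
  summarise(mean_value = mean(value))

# ...or with the (typically faster) data.table syntax.
dat2[, .(mean_value = mean(value)), by = location]
```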