-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05a-pulling-raw-data.qmd
44 lines (28 loc) · 6.24 KB
/
05a-pulling-raw-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
# Reproducibly Pulling Raw Data
The first step in data munging is ensuring your raw data files are placed within the standardized folder structure and reproducibly pulling your raw data into R.
## Where to put your raw data
Any raw data you generated should live in the following place in our standardized project repository:
./00-data/a-raw
Any data in raw data that is not your own and IS NOT findable by DOI or in an existing long term data repository should either be in an /external-sources subdirectory of the ./00-data/a-raw/ directory, which contains a metadata file that details the source of each of the external data sources. Raw data that IS findable in an existing long-term data repository by DOI should be reproducibly pulled in by script (see below).
## How to reproducibly pull raw data into your workflow
The way in which you reproducibly pull raw data into your workflow depends on whether: 1) you generated it or it is an external datasource that is NOT openly finadable and accessible by DOI or 2) it IS an external datasource that is openly accessible and findable by DOI.
### Reproducibly pulling data that is your own or is not accessible by DOI
Even though the project directory template already contains a .Rproj file, you should always use the here::here function in the [here package in R](https://here.r-lib.org/) when referencing file locations in your project repo - there are excellent reasons for using both .Rproj and here::here - when combined these are a powerful way to guarantee there won't be any issues with other users running your code due to directory issues. [Malcom Barrett](https://www.rstudio.com/authors/malcolm-barrett/) gave an an excellent talk at the [useR! 2020 conference](https://user2020.r-project.org/program/contributed/) entitled ["Why use the *here* package when you're already using projects?"](https://m.youtube.com/watch?v=QYrdsjBvZN4). The .Rproj file contained in this tenplate is pre-configured to start with a *completely clean workspace EVERY TIME* by selecting in *Project Options \> General* "Restore .RData into workspace at startup" = NO and "Save workspace to .RData on exit" = NO, "Disable .Pprofile execution on session start/resume" = CHECKED, "Quit child processes on exit" = CHECKED. These two options combined will also ensure that others running your code (or even you at a later date) don't experience errors or conflictions due to [workspace-specific background](https://alexd106.github.io/intro2R/project_setup.html) - this is often the cause of the following issue: "I swear...my code worked the last time I ran it and I didn't change anything!". This will also avoid [Jenny Bryan](https://jennybryan.org/) coming into your office and [setting your computer on fire](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/).
### Reproducibly pulling data that is not your own or is accessible by DOI
Any data sources required for your workflow that are open and accessible by DOI should be pulled into your workflow reproducibly by using the [download.file](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/download.file) and if necessary the [unzip](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/unzip) functions in R utils. This data should be placed in either the ./00-data/a-raw folder if it needs to be further munged or in the ./00-data/b-prepared/ folder if it is ready to be used immediately in your workflow.
### Tidyverse Tibbles or Dataframes when importing non-spatial data?
Our standard format for non-spatial data is a .csv. There are two ways to import .csv files into your workflow: [base::read.csv](http://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv) and [tidyverse::readr::read_csv]( Although many folks have converted completely to tibbles I'm not quite sure there is a black and white answer to this and it depends a bit on your workflow and previous R experience. I still use dataframes but might be rapidly becoming a dinosaur. There are a lot of [advantages to tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), however one potential disadvantage is some of their workability related to [arithmetic operations and assigning variable types](https://jtr13.github.io/cc21fall1/tibble-vs.-dataframe.html#:~:text=4%203.-,Unlike%20data%20frames%2C%20tibbles%20don%27t%20show%20the%20entire%20dataset,column%2C%20but%20data%20frames%20can). However, Tidyverse and Tibble adoption has increased dramatically and Tibbles, strictly speaking provide more transparent and stricter controls over data formats than data frames. Tibbles can easily be [converted to data frames and vice versa](https://bookdown.org/ajohnso6/r_training_public/data.html). And there are very few if any downsides to using tibbles instead of dataframes (mostly because tibbles are actually still dataframes) - the only real downsides are that some older packages may not [play well with tibbles](https://www.jumpingrivers.com/blog/the-trouble-with-tibbles/), but then you can always just convert them if necessary!
## Additional Resources
[Amy Johnson::R training for Stanford Library Software Services and Data Science (SSDS) Group - Data](https://bookdown.org/ajohnso6/r_training_public/data.html)
[Jingfei Fang::Tibble vs Data Frame](https://jtr13.github.io/cc21fall1/tibble-vs.-dataframe.html#:~:text=4%203.-,Unlike%20data%20frames%2C%20tibbles%20don%27t%20show%20the%20entire%20dataset,column%2C%20but%20data%20frames%20can)
[Josh Gonzales::R Functions:read_csv()](https://medium.com/r-tutorials/r-functions-daily-read-csv-3c418c25cba4)
[CRAN R Project::Tibbles](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
[Jumping Rivers::The Trouble with Tibbles](https://www.jumpingrivers.com/blog/the-trouble-with-tibbles/)
[Hadley Wickham::tibble 1.0.0](https://posit.co/blog/tibble-1-0-0/#:~:text=There%20are%20two%20main%20differences,to%20work%20with%20large%20data.)
[Hadley Wickham::R for Data Science (2e) - Whole Game - Data import](https://r4ds.hadley.nz/data-import.html)
[Software Carpentry::Programming with R - Reading and writing csv files](http://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv)
[tibble.tidyverse.org::tibble 3.2.1](https://tibble.tidyverse.org/reference/tibble.html)
::: callout-tip
## Required Tools - Data Munging
- R/RStudio
:::