# Checklist {#data-checklist .unnumbered}
```{r data-checklist-setup}
#| results: asis
#| echo: false
source("_common.R")
status("complete")
```
## Videos / Chapters {.unnumbered}
- [ ] [Tabular Data](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d80e9045-22e7-4a0e-a0fc-af8100d3e727) (27 min) [[slides]](https://github.com/zakvarty/effective-data-science-slides-2022/raw/main/02-01-tabular-data-and-csvs/02-01-tabular-data.pdf)
- [ ] [Web Scraping](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=239ba39e-8a06-4e7b-a6c1-af7200f91d2b) (22 min) [[slides]](https://github.com/zakvarty/effective-data-science-slides-2022/raw/main/02-02-webscraping/02-02-web-scraping.pdf)
- [ ] [APIs](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=e1ed8e4f-cbaa-40c2-8c44-af7200ee2e9f) (25 min) [[slides]](https://github.com/zakvarty/effective-data-science-slides-2022/raw/main/02-03-apis/02-03-apis.pdf)
## Reading {.unnumbered}
Use the [Acquiring and Sharing Data section](#data-reading) of the reading list to support and guide your exploration of this week's topics. Note that these texts are divided into core reading, reference materials and materials of interest.
## Tasks {.unnumbered}
_Core:_
- Revisit the projects that you explored on GitHub last week. This time, look for any data or documentation files.
  - Are there any file types that are new to you?
  - If so, are there packages or helper functions that would let you read this data into R? (A sketch of some common file readers follows this list.)
  - Why might you not find many data files on GitHub?
- Play [CSS Diner](https://flukeout.github.io/) to familiarise yourself with some CSS selectors. (An rvest example using such selectors follows this list.)
- Identify 3 APIs that give access to data on topics that interest you. Write a post on the discussion forum describing the APIs, and use one of them to load some data into R. (A minimal JSON API example follows this list.)
- Scraping book reviews:
  - Visit the Amazon page for R for Data Science. Write code to scrape the percentage of customers giving each "star" rating (5`r emo::ji("star")`, ..., 1`r emo::ji("star")`). (One possible starting sketch follows this list.)
  - Turn your code into a function that will return a tibble of the form:
```{r amazon-webscraping-example-output}
#| echo: false
example_output <- tibble::tibble(
product = "example_name",
n_reviews = 1000,
percent_5_star = 20,
percent_4_star = 20,
percent_3_star = 20,
percent_2_star = 20,
percent_1_star = 20,
url = "www.example.com"
)
knitr::kable(x = example_output)
```
  - Generalise your function to work for other Amazon products, so that it takes a vector of product names and an associated vector of URLs as input.
  - Use your function to compare the reviews of the following three books: [R for Data Science](https://www.amazon.com/Data-Science-Transform-Visualize-Model/dp/1491910399/ref=sr_1_1?keywords=r+for+data+science&qid=1674145765&s=books&sprefix=R+for+data+%2Cstripbooks-intl-ship%2C157&sr=1-1), [R packages](https://www.amazon.com/Packages-Organize-Test-Document-Share/dp/1491910593/ref=sr_1_1?crid=XWR8O7WPKZS9&keywords=R+packages&qid=1674145743&s=books&sprefix=r+package%2Cstripbooks-intl-ship%2C158&sr=1-1) and [ggplot2](https://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/331924275X/ref=sr_1_1?crid=24WRUZ93PL2E6&keywords=ggplot2&qid=1674145703&s=books&sprefix=ggplot2%2Cstripbooks-intl-ship%2C190&sr=1-1).
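
If you meet an unfamiliar file type in the first task, there is usually a package that can read it. A minimal sketch, assuming the (hypothetical) files below exist in a `data/` directory:

```{r file-reading-sketch}
#| eval: false
library(readr)    # delimited text files
library(readxl)   # Excel workbooks
library(jsonlite) # JSON
library(haven)    # SPSS, Stata and SAS files

# All file paths here are hypothetical placeholders
csv_data  <- readr::read_csv("data/example.csv")
tsv_data  <- readr::read_tsv("data/example.tsv")
xlsx_data <- readxl::read_excel("data/example.xlsx", sheet = 1)
json_data <- jsonlite::fromJSON("data/example.json")
spss_data <- haven::read_sav("data/example.sav")
```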
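
The selector strings that CSS Diner teaches are exactly what `rvest::html_elements()` expects. A small sketch, where the URL and selectors are hypothetical placeholders:

```{r css-selector-sketch}
#| eval: false
library(rvest)

page <- read_html("https://www.example.com")

html_elements(page, "table td")      # every cell of every table
html_elements(page, ".review-text")  # elements with class "review-text"
html_elements(page, "#summary > p")  # paragraphs directly inside id "summary"
```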
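
For the API task, many public APIs return JSON that `jsonlite` can parse directly from a URL. One illustration, using GitHub's public API purely as an example (it should not count as one of your three):

```{r api-example-sketch}
#| eval: false
library(jsonlite)

# fromJSON() downloads the JSON response and parses it to a data frame
repos <- fromJSON("https://api.github.com/users/hadley/repos")
repos[, c("name", "stargazers_count")]
```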
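
For the scraping task, one possible shape for a solution is sketched below. This is a sketch, not a working scraper: the CSS selector is a placeholder, since Amazon's markup changes frequently (and may block automated requests), so inspect the live page and substitute a selector that matches the ratings histogram.

```{r review-scraper-sketch}
#| eval: false
library(rvest)
library(tibble)
library(purrr)
library(readr)

scrape_star_percentages <- function(product, url) {
  page <- read_html(url)

  # ".histogram .percentage" is a placeholder selector; replace it with one
  # that matches the five percentage labels on the live page
  percents <- page |>
    html_elements(".histogram .percentage") |>
    html_text2() |>
    parse_number()

  tibble(
    product = product,
    n_reviews = NA_integer_, # placeholder: scrape the total review count too
    percent_5_star = percents[1],
    percent_4_star = percents[2],
    percent_3_star = percents[3],
    percent_2_star = percents[4],
    percent_1_star = percents[5],
    url = url
  )
}

# Generalised version: one row per product, given matching vectors of
# product names and URLs
scrape_star_percentages_multi <- function(products, urls) {
  map2(products, urls, scrape_star_percentages) |> list_rbind()
}
```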
_Bonus:_
- Add this function to the R package you made last week, remembering to add tests and documentation.
## Live Session {.unnumbered}
In the live session we will begin with a discussion of this week's tasks. We will then work through some examples of how to read data from non-standard sources.
Please come to the live session prepared to discuss the following points:
- Roger Peng states that R objects can be imported and exported using `readRDS()` and `saveRDS()` for fast and space-efficient data storage. What is the downside to doing so? (A minimal round-trip example follows this list.)
- What data types have you come across (that we have not discussed already) and in what context are they used?
- What do you have to give greater consideration to when scraping data than when using an API?
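
As a concrete starting point for the first discussion point, here is a minimal `saveRDS()` / `readRDS()` round trip (`mtcars.rds` is just an illustrative file name):

```{r rds-round-trip}
#| eval: false
# Serialise a single R object to disk, then restore it under a new name
saveRDS(mtcars, file = "mtcars.rds")
mtcars_again <- readRDS("mtcars.rds")
identical(mtcars, mtcars_again)
#> [1] TRUE
```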