# Reading List {#reading-list}
```{r, results='asis', echo=FALSE}
source("_common.R")
status("complete")
```
This reading list is organised by topic, following the weeks of the course. The materials for each topic are split into three categories.
- **Core Materials:** These form a core part of the course activities.
- **Reference Materials:** These will be used extensively in the course, but should be seen as helpful guides, rather than required reading from cover to cover.
- **Materials of Interest:** These will not form a core part of the course, but will give you a deeper understanding of, or an interesting perspective on, the weekly topic. There might be some other fun stuff in here too.
## Effective Data Science Workflows {#workflows-reading}
### Core Materials {.unnumbered}
- The [Tidyverse R Style Guide](https://style.tidyverse.org/) by Hadley Wickham.
<!-- This will be the coding style guide that we will follow in this course. It is a core reading in week 1 and will serve as a reference text in future weeks. -->
- [Wilson et al (2017)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510&ref=https://githubhelp.com). Good Enough Practices in Scientific Computing. PLOS Computational Biology.
<!-- A collection of simple ways to implement good computing practices during a research project. The article is aimed specifically at people who are new to computational research. -->
### Reference Materials {.unnumbered}
- [R for Data Science Chapters 2, 6 and 8](https://r4ds.had.co.nz/index.html) by Hadley Wickham and Garrett Grolemund. These chapters cover R workflow basics and a script- and project-based workflow.
- [Documentation](https://here.r-lib.org/articles/here.html) for the {here} package
- [R Packages Book](https://r-pkgs.org/) (Second Edition) by Hadley Wickham and Jenny Bryan.
<!-- Covers the basics (and much more) of creating your own R package. Will be useful as a reference during the live session in Week 1. The chapter on [function documentation](https://r-pkgs.org/man.html) introduces `{Roxygen2}` and the chapter on [Testing basics](https://r-pkgs.org/testing-basics.html) introduces `{testthat}`. -->
### Materials of Interest {.unnumbered}
- [STAT545, Part 1](https://stat545.com/index.html) by Jennifer Bryan and The STAT 545 TAs
<!-- If this is your first time using R in earnest, you might find part 1 of the STAT545 notes from the University of British Columbia helpful in getting set up. Chapters 1 and 2 cover how to install R and RStudio, basics of programming in R and a bare bones workflow. -->
- [What they forgot to teach you about R, Chapters 2-4](https://rstats.wtf/) by Jennifer Bryan and Jim Hester.
<!--Resources on a project oriented workflow, practising safe paths, and how to name files.-->
- [Broman et al (2017)](https://www.amstat.org/docs/default-source/amstat-documents/pol-reproducibleresearchrecommendations.pdf). Recommendations to Funding Agencies for Supporting Reproducible Research. American Statistical Association.
<!-- Source of reproducibility definition used in lecture slides and a fun read! -->
- [Advanced R](https://adv-r.hadley.nz/) by Hadley Wickham. Section introductions on [functional](https://adv-r.hadley.nz/fp.html) and [object oriented](https://adv-r.hadley.nz/oo.html) approaches to programming.
- [Atlassian Article](https://www.atlassian.com/agile/project-management) on Agile Project Management
<!-- Taking a broader view of organising your work and functioning well in a team, this article provides an introductory guide to agile development.-->
- [The Pragmatic Programmer, 20th Anniversary Edition](https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/) by David Thomas and Andrew Hunt. The section on [DRY coding](https://media.pragprog.com/titles/tpp20/dry.pdf) and a few others are freely available.
<!-- Advice on good practice in programming (language agnostic) and when running projects that involve software. Written in small sections that make the book a joy to dip in and out of. The section on [DRY coding](https://media.pragprog.com/titles/tpp20/dry.pdf) and a few others are freely available. -->
- [Efficient R programming](https://csgillespie.github.io/efficientR/) by Colin Gillespie and Robin Lovelace. Chapter 5, on [Efficient Input/Output](https://csgillespie.github.io/efficientR/input-output.html), is relevant to this week. Chapter 4, on [Efficient Workflows](https://csgillespie.github.io/efficientR/workflow.html), links nicely with last week's topics.
- [Towards A Principled Bayesian Workflow](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#References) by Michael Betancourt.
<!-- A different perspective on workflows, focusing on the statistical rather than coding and management aspects of a Bayesian data science project. -->
- [Happy Git and GitHub for the useR](https://happygitwithr.com/) by Jennifer Bryan
<!-- If you'd like to get started using version control with your R projects this book will be your guiding light. -->
- [Make Tutorial](https://monashbioinformaticsplatform.github.io/2017-11-16-open-science-training/topics/automation.html) by the Monash Bioinformatics Platform.
<!-- Motivates makefiles via compiling Rmd to multiple output formats -->
- [Makefiles for R and LaTeX projects](https://robjhyndman.com/hyndsight/makefiles/) blog post by Rob Hyndman
<!-- Old, but motivates the use of make files and separating code and writing well. -->
- [Makefile tutorial](https://makefiletutorial.com/#getting-started) by Chase Lambert
<!-- Pretty and current but very extensive and with a strong CS focus -->
----
## Acquiring and Sharing Data {#data-reading}
### Core Materials {.unnumbered}
- [R for Data Science Chapters 9-12](https://r4ds.had.co.nz/tidy-data.html) by Hadley Wickham and Garrett Grolemund. These chapters introduce tibbles as a data structure, how to import data into R and how to wrangle that data into tidy format.
- [Efficient R programming](https://csgillespie.github.io/efficientR/) by Colin Gillespie and Robin Lovelace. Chapter 5, on [Efficient Input/Output](https://csgillespie.github.io/efficientR/input-output.html), is relevant to this week.
- [Wickham (2014)](https://vita.had.co.nz/papers/tidy-data.html). Tidy Data. Journal of Statistical Software. The paper that brought tidy data to the mainstream.
### Reference Materials {.unnumbered}
- The {readr} [documentation](https://readr.tidyverse.org/)
- The {data.table} [documentation](https://cran.r-project.org/web/packages/data.table/data.table.pdf) and [vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)
- The {rvest} [documentation](https://rvest.tidyverse.org/)
- The {tidyr} [documentation](https://tidyr.tidyverse.org/)
- MDN Web Docs on [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML) and [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS)
### Materials of Interest {.unnumbered}
- [Introduction to APIs](https://zapier.com/learn/apis/chapter-1-introduction-to-apis/) by Brian Cooksey
- [R for Data Science (Second Edition)](https://r4ds.hadley.nz/). Chapters within the [Import](https://r4ds.hadley.nz/import.html) section.
These cover importing data from spreadsheets and databases, using Apache Arrow, importing hierarchical data and web scraping.
## Data Exploration and Visualisation {#edav-reading}
### Core Materials {.unnumbered}
- [Exploratory Data Analysis with R](https://bookdown.org/rdpeng/exdata/) by Roger Peng.
Chapters 3 and 4 are core reading, respectively introducing [data frame manipulation with {dplyr}](https://bookdown.org/rdpeng/exdata/managing-data-frames-with-the-dplyr-package.html) and an example [workflow for exploratory data analysis](https://bookdown.org/rdpeng/exdata/exploratory-data-analysis-checklist.html). Other chapters may be useful as references.
- [Flexible Imputation of Missing Data](https://stefvanbuuren.name/fimd/) by Stef van Buuren. [Sections 1.1-1.4](https://stefvanbuuren.name/fimd/ch-introduction.html) give a thorough introduction to missing data problems.
### Reference Materials {.unnumbered}
- [A ggplot2 Tutorial for Beautiful Plotting in R](https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/) by Cédric Scherer.
- The {dplyr} [documentation](https://dplyr.tidyverse.org/)
- [RStudio Data Transformation Cheat Sheet](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)
- [R for Data Science (First Edition)](https://r4ds.had.co.nz/index.html) Chapters on [Data Transformations](https://r4ds.had.co.nz/transform.html), [Exploratory Data Analysis](https://r4ds.had.co.nz/exploratory-data-analysis.html) and [Relational Data](https://r4ds.had.co.nz/relational-data.html).
- Equivalent sections in R for Data Science [Second Edition](https://r4ds.hadley.nz/)
### Materials of Interest {.unnumbered}
- [Wickham, H. (2010)](https://library-search.imperial.ac.uk/discovery/fulldisplay?docid=cdi_informaworld_taylorfrancis_310_1198_jcgs_2009_07098&context=PC&vid=44IMP_INST:ICL_VU1&lang=en&search_scope=MyInst_and_CI&adaptor=Primo%20Central&tab=Everything&query=any,contains,layered%20grammar%20of%20graphics&offset=0). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics.
- [Better Data Visualisations](https://library-search.imperial.ac.uk/discovery/fulldisplay?docid=alma991000664639501591&context=L&vid=44IMP_INST:ICL_VU1&lang=en&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,better%20data%20visualisations&offset=0) by Jonathan Schwabish
<!-- Strategies to create more effective data visualizations, presented in a way that is agnostic to the software you use to construct your visualisations. -->
- [Data Visualization: A Practical Introduction](https://library-search.imperial.ac.uk/discovery/fulldisplay?docid=alma991000211295101591&context=L&vid=44IMP_INST:ICL_VU1&lang=en&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,Data%20Visualization%20%E2%80%93%20A%20Practical%20Introduction&offset=0) by Kieran Healy
## Preparing for Production {#production-reading}
### Core Materials {.unnumbered}
- [The Ethical Algorithm](https://library-search.imperial.ac.uk/discovery/fulldisplay?docid=alma991000531083101591&context=L&vid=44IMP_INST:ICL_VU1&lang=en&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,kearns%20and%20roth&mode=Basic) by M Kearns and A Roth (Chapter 4).
- [Ribeiro et al (2016)](https://arxiv.org/abs/1602.04938). "Why Should I Trust You?": Explaining the Predictions of Any Classifier.
### Reference Materials {.unnumbered}
- The [Docker Curriculum](https://docker-curriculum.com/) by Prakhar Srivastav.
- LIME [package documentation](https://cran.r-project.org/web/packages/lime/index.html) on CRAN.
- [Interpretable Machine Learning: A Guide for Making Black Box Models Explainable](https://christophm.github.io/interpretable-ml-book/) by Christoph Molnar.
- Documentation for [apply()](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/apply), [map()](https://purrr.tidyverse.org/reference/map.html) and [pmap()](https://furrr.futureverse.org/)
- [Advanced R (Second Edition)](https://adv-r.hadley.nz/index.html) by Hadley Wickham. [Chapter 23](https://adv-r.hadley.nz/perf-measure.html) on measuring performance and [Chapter 24](https://adv-r.hadley.nz/perf-improve.html) on improving performance.
### Materials of Interest {.unnumbered}
- [The ASA Statement on $p$-values: Context, Process and Purpose](https://library-search.imperial.ac.uk/discovery/fulldisplay?docid=cdi_informaworld_taylorfrancis_310_1080_00031305_2016_1154108&context=PC&vid=44IMP_INST:ICL_VU1&lang=en&search_scope=MyInst_and_CI&adaptor=Primo%20Central&tab=Everything&query=any,contains,ASA%20p-value&offset=0)
- [The Garden of Forking Paths: Why multiple comparisons can be a problem,
even when there is no "Fishing expedition" or "p-hacking" and the research
hypothesis was posited ahead of time](http://stat.columbia.edu/~gelman/research/unpublished/forking.pdf). A Gelman and E Loken (2013)
- [Understanding LIME tutorial](https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html) by T Pedersen and M Benesty.
- [Advanced R (Second Edition)](https://adv-r.hadley.nz/index.html) by Hadley Wickham. [Chapter 25](https://adv-r.hadley.nz/rcpp.html) on writing R code in C++.
## Data Science Ethics
### Core Materials {.unnumbered}
- [The Ethical Algorithm](https://library-search.imperial.ac.uk/discovery/fulldisplay?docid=alma991000531083101591&context=L&vid=44IMP_INST:ICL_VU1&lang=en&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,kearns%20and%20roth&mode=Basic) by M Kearns and A Roth. Chapters 1 and 2 on Algorithmic Privacy and Algorithmic Fairness.
- [Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification](https://proceedings.mlr.press/v81/buolamwini18a.html) by Joy Buolamwini and Timnit Gebru (2018). Proceedings of the 1st Conference on Fairness, Accountability and Transparency.
- [Robust De-anonymization of Large Sparse Datasets](https://ieeexplore.ieee.org/document/4531148) by Arvind Narayanan and Vitaly Shmatikov (2008). IEEE Symposium on Security and Privacy.
### Reference Materials {.unnumbered}
- [Fairness and Machine Learning: Limitations and Opportunities](https://fairmlbook.org/) by Solon Barocas, Moritz Hardt and Arvind Narayanan.
- Professional Guidelines on Data Ethics from:
    - [The American Mathematical Society](http://www.ams.org/about-us/governance/policy-statements/sec-ethics)
    - [The European Union](https://op.europa.eu/s/sUPP)
    - [UK Government](https://www.gov.uk/guidance/understanding-artificial-intelligence-ethics-and-safety)
    - [Royal Statistical Society](https://rss.org.uk/RSS/media/News-and-publications/Publications/Reports%20and%20guides/A-Guide-for-Ethical-Data-Science-Final-Oct-2019.pdf)
    - [Dutch Government](https://www.government.nl/documents/reports/2021/07/31/impact-assessment-fundamental-rights-and-algorithms)
### Materials of Interest {.unnumbered}
- [Algorithmic Fairness](https://arxiv.org/abs/2001.09784) (2020). Pre-print of review paper by Dana Pessach and Erez Shmueli.