Various fixes #541

Merged
14 commits merged on Sep 28, 2023

3,061 changes: 3,061 additions & 0 deletions data/canada_wiki.html

1 change: 1 addition & 0 deletions data/nasa.json

Binary file added img/reading/NASA-API-Rho-Ophiuchi.png
Binary file added img/reading/NASA-API-limits.png
Binary file added img/reading/NASA-API-parameters.png
Binary file added img/reading/NASA-API-signup.png
Binary file removed img/reading/authorize_question.png
Binary file modified img/reading/sg4.png
Binary file removed img/reading/tidyverse_twitter.png
2 changes: 2 additions & 0 deletions source/acknowledgments.Rmd
@@ -12,6 +12,8 @@ list of topics in this book. We would especially like to thank Matías
Salibián-Barrera for his mentorship during the initial development and roll-out
of both DSCI 100 and this book. His door was always open when
we needed to chat about how to best introduce and teach data science to our first-year students.
+We would also like to thank Gabriela Cohen Freue for her DSCI 561 (Regression I) teaching materials
+from the UBC Master of Data Science program, as some of our linear regression figures were inspired by these materials.

We would also like to thank all those who contributed to the process of
publishing this book. In particular, we would like to thank all of our reviewers for their feedback and suggestions:
4 changes: 2 additions & 2 deletions source/classification1.Rmd
@@ -1438,8 +1438,8 @@ prediction <- predict(knn_fit, new_observation)
prediction
```

-The classifier predicts that the first observation is benign ("B"), while the second is
-malignant ("M"). Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
+The classifier predicts that the first observation is benign, while the second is
+malignant. Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
trained $K$-nearest neighbor model will make on a large range of new observations.
Although you have seen colored prediction map visualizations like this a few times now,
we have not included the code to generate them, as it is a little bit complicated.
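
For context, these predictions come from calling `predict()` on the fitted workflow. A minimal sketch is below; the predictor names and values in `new_observation` are illustrative assumptions rather than values from the book's data set, and `knn_fit` is assumed to be the workflow fitted earlier in the chapter.

```r
library(tidymodels)

# Two hypothetical new observations; the predictor names and values here
# are assumptions for illustration only.
new_observation <- tibble(
  Smoothness = c(0.08, 0.13),
  Concavity = c(0.07, 0.35)
)

# knn_fit is assumed to be the fitted K-NN workflow from earlier in the chapter.
prediction <- predict(knn_fit, new_observation)
prediction
```
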
27 changes: 12 additions & 15 deletions source/classification2.Rmd
@@ -410,8 +410,8 @@ train_prop <- cancer_train |>
```

We can use `group_by` and `summarize` to \index{group\_by}\index{summarize} find the percentage of malignant and benign classes
-in `cancer_train` and we see about `r round(filter(train_prop, Class == "B")$proportion, 2)*100`% of the training
-data are benign and `r round(filter(train_prop, Class == "M")$proportion, 2)*100`%
+in `cancer_train` and we see about `r round(filter(train_prop, Class == "Benign")$proportion, 2)*100`% of the training
+data are benign and `r round(filter(train_prop, Class == "Malignant")$proportion, 2)*100`%
are malignant, indicating that our class proportions were roughly preserved when we split the data.
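
As a rough sketch of the computation described here (the body of the `06-train-proportion` chunk below is collapsed in this diff), the proportions could be obtained along these lines; `cancer_train` and `Class` are taken from the surrounding text:

```r
library(dplyr)

# Proportion of benign and malignant observations in the training set.
train_prop <- cancer_train |>
  group_by(Class) |>
  summarize(proportion = n() / nrow(cancer_train))
train_prop
```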

```{r 06-train-proportion}
@@ -591,7 +591,7 @@ cancer_proportions

```{r 06-proportions-2, echo = FALSE, warning = FALSE}
cancer_propn_1 <- cancer_proportions |>
-filter(Class == 'B') |>
+filter(Class == 'Benign') |>
select(percent)
```

@@ -1332,13 +1332,13 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
However, it becomes very slow when you have even a moderate
number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+a slow process!) for each one. For example, if we have 2 predictors&mdash;let's call
them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
+and B together. If we have 3 predictors&mdash;A, B, and C&mdash;then we have 7
to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
we have to train for $m$ predictors is $2^m-1$; in other words, when we
-get to $10$ predictors we have over *one thousand* models to train, and
-at $20$ predictors we have over *one million* models to train!
+get to 10 predictors we have over *one thousand* models to train, and
+at 20 predictors we have over *one million* models to train!
So although it is a simple method, best subset selection is usually too computationally
expensive to use in practice.

@@ -1360,8 +1360,8 @@ This pattern continues for as many iterations as you want. If you run the method
all the way until you run out of predictors to choose, you will end up training
$\frac{1}{2}m(m+1)$ separate models. This is a *big* improvement from the $2^m-1$
models that best subset selection requires you to train! For example, while best subset selection requires
-training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
-Therefore we will continue the rest of this section using forward selection.
+training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
+Therefore we will continue the rest of this section using forward selection.
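
The model counts quoted in the last two paragraphs are easy to verify; a quick illustrative check (not part of the book's source):

```r
m <- c(2, 3, 10, 20)

# Best subset selection trains 2^m - 1 candidate models.
2^m - 1          # 3 7 1023 1048575

# Forward selection, run until no predictors remain, trains m * (m + 1) / 2 models.
m * (m + 1) / 2  # 3 6 55 210
```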

> **Note:** One word of caution before we move on. Every additional model that you train
> increases the likelihood that you will get unlucky and stumble
@@ -1378,12 +1378,9 @@ training over 1000 candidate models with $m=10$ predictors, forward selection re

We now turn to implementing forward selection in R.
Unfortunately there is no built-in way to do this using the `tidymodels` framework,
-so we will have to code it ourselves. First we will use the `select` function
-to extract the "total" set of predictors that we are willing to work with.
-Here we will load the modified version of the cancer data with irrelevant
-predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-as potential predictors, and the `Class` variable as the label.
-We will also extract the column names for the full set of predictor variables.
+so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors
+to work with in this illustrative example&mdash;`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as
+well as the `Class` variable as the label. We will also extract the column names for the full set of predictors.
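
A minimal sketch of this setup step might look as follows; the data frame name `cancer_irrelevant` and the variable name `predictor_names` are assumptions, since the actual chunk is collapsed in this diff.

```r
library(tidyverse)

# Keep the label and the candidate predictors named in the text.
cancer_subset <- cancer_irrelevant |>
  select(Class, Smoothness, Concavity, Perimeter,
         Irrelevant1, Irrelevant2, Irrelevant3)

# Column names of the full set of candidate predictors.
predictor_names <- colnames(cancer_subset |> select(-Class))
predictor_names
```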

```{r 06-fwdsel-seed, echo = FALSE, warning = FALSE, message = FALSE}
# hidden seed
2 changes: 1 addition & 1 deletion source/clustering.Rmd
@@ -289,7 +289,7 @@ Then we would compute the coordinates, $\mu_x$ and $\mu_y$, of the cluster cente
$$\mu_x = \frac{1}{4}(x_1+x_2+x_3+x_4) \quad \mu_y = \frac{1}{4}(y_1+y_2+y_3+y_4).$$

In the first cluster from the example, there are `r nrow(clus1)` data points. These are shown with their cluster center
(`r paste("flipper_length_standardized =", round(mean(clus1$flipper_length_standardized),2))` and `r paste("bill_length_standardized =", round(mean(clus1$bill_length_standardized),2))`) highlighted
(standardized flipper length `r round(mean(clus1$flipper_length_standardized),2)`, standardized bill length `r round(mean(clus1$bill_length_standardized),2)`) highlighted
in Figure \@ref(fig:10-toy-example-clus1-center).
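
Concretely, the cluster center is just the column-wise mean of each standardized coordinate; a brief sketch, assuming `clus1` holds the two standardized columns named above:

```r
library(dplyr)

# The cluster center is the mean of each standardized coordinate.
clus1_center <- clus1 |>
  summarize(
    flipper_length_standardized = mean(flipper_length_standardized),
    bill_length_standardized = mean(bill_length_standardized)
  )
clus1_center
```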

(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.