Merge pull request #420 from UBC-DSCI/dev
transfer dev to master
trevorcampbell authored Jan 12, 2022
2 parents 4aa2fc0 + 062be9c commit d10e86a
Showing 162 changed files with 1,388 additions and 695 deletions.
2 changes: 2 additions & 0 deletions build_pdf.sh
@@ -3,6 +3,7 @@
# Copy files
cp references.bib pdf/
cp authors.Rmd pdf/
cp foreword-text.Rmd pdf/
cp preface-text.Rmd pdf/
cp acknowledgements.Rmd pdf/
cp intro.Rmd pdf/
@@ -29,6 +30,7 @@ docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsc
# clean files in pdf dir
rm -rf pdf/references.bib
rm -rf pdf/authors.Rmd
rm -rf pdf/foreword-text.Rmd
rm -rf pdf/preface-text.Rmd
rm -rf pdf/acknowledgements.Rmd
rm -rf pdf/intro.Rmd
6 changes: 3 additions & 3 deletions classification1.Rmd
@@ -455,7 +455,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
distance using the formula above: we square the differences between the two observations' perimeter
and concavity coordinates, add the squared differences, and then take the square root.
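
For concreteness, the distance computation described here can be written directly; a minimal sketch, assuming the `cancer` data frame and the `new_point` vector used in the surrounding code:

```r
library(tidyverse)

# distance from each observation in `cancer` to the new point: square the
# differences in perimeter and concavity, add them, and take the square root
cancer |>
  mutate(dist_from_new = sqrt((Perimeter - new_point[1])^2 +
                              (Concavity - new_point[2])^2)) |>
  select(Perimeter, Concavity, dist_from_new)
```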

```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
perim_concav <- bind_rows(cancer,
tibble(Perimeter = new_point[1],
Concavity = new_point[2],
@@ -1096,7 +1096,7 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
set.seed(3)
```

```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data."}
```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Imbalanced data."}
rare_cancer <- bind_rows(
filter(cancer, Class == "B"),
cancer |> filter(Class == "M") |> slice_head(n = 3)
@@ -1255,7 +1255,7 @@ classifier would make. We can see that the decision is more reasonable; when the
to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
closer to the benign tumor observations.
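
The rebalancing behind this plot can be sketched with `step_upsample()` from the `themis` package; this is a hedged example using the `rare_cancer` data frame created above, not the hidden chunk's exact code:

```r
library(tidymodels)
library(themis)  # provides step_upsample()

# oversample the rare malignant class so both classes are equally
# represented before training the classifier
ups_recipe <- recipe(Class ~ Perimeter + Concavity, data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()

upsampled_cancer <- bake(ups_recipe, new_data = NULL)
upsampled_cancer |>
  group_by(Class) |>
  summarize(n = n())
```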

```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
set_engine("kknn") |>
set_mode("classification")
40 changes: 20 additions & 20 deletions classification2.Rmd
@@ -643,7 +643,7 @@ Here, $C=5$ different chunks of the data set are used,
resulting in 5 different choices for the **validation set**; we call this
*5-fold* cross-validation.

```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.retina = 2, out.width = "100%"}
```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/cv.png")
```
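
The folds themselves can be created with `vfold_cv()`; a minimal sketch, assuming a training split named `cancer_train` (name hypothetical here) with the `Class` label used in this chapter:

```r
library(tidymodels)

# split the training data into 5 folds, stratifying on the class label so
# each fold keeps a similar benign/malignant balance
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
cancer_vfold
```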

@@ -863,24 +863,7 @@ regardless of what the new observation looks like. In general, if the model
*isn't influenced enough* by the training data, it is said to **underfit** the
data.

**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
individual data point has a stronger and stronger vote regarding nearby points.
Since the data themselves are noisy, this causes a more "jagged" boundary
corresponding to a *less simple* model. If you take this case to the extreme,
setting $K = 1$, then the classifier is essentially just matching each new
observation to its closest neighbor in the training data set. This is just as
problematic as the large $K$ case, because the classifier becomes unreliable on
new data: if we had a different training set, the predictions would be
completely different. In general, if the model *is influenced too much* by the
training data, it is said to **overfit** the data.

Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to strike
a balance between the two. You can see these two effects in Figure
\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
we set the number of neighbors $K$ to 1, 7, 20, and 300.

```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
ks <- c(1, 7, 20, 300)
plots <- list()
@@ -935,6 +918,23 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
```

**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
individual data point has a stronger and stronger vote regarding nearby points.
Since the data themselves are noisy, this causes a more "jagged" boundary
corresponding to a *less simple* model. If you take this case to the extreme,
setting $K = 1$, then the classifier is essentially just matching each new
observation to its closest neighbor in the training data set. This is just as
problematic as the large $K$ case, because the classifier becomes unreliable on
new data: if we had a different training set, the predictions would be
completely different. In general, if the model *is influenced too much* by the
training data, it is said to **overfit** the data.

Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to strike
a balance between the two. You can see these two effects in Figure
\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
we set the number of neighbors $K$ to 1, 7, 20, and 300.
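
To make the contrast concrete, here is a hedged sketch of the two extremes using the model specification style from this chapter: `neighbors = 1` gives the overly flexible classifier that overfits, while `neighbors = 300` gives the overly rigid one that underfits:

```r
library(tidymodels)

# K = 1: each prediction copies the single nearest training point (overfits)
knn_overfit_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

# K = 300: almost every prediction is the overall majority class (underfits)
knn_underfit_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 300) |>
  set_engine("kknn") |>
  set_mode("classification")
```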

## Summary

Classification algorithms use one or more quantitative variables to predict the
@@ -948,7 +948,7 @@ can tune the classifier (e.g., select the number of neighbors $K$ in $K$-NN)
by maximizing estimated accuracy via cross-validation. The overall
process is summarized in Figure \@ref(fig:06-overview).

```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/train-test-overview.jpeg")
```
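
As a compact reference, the overall process summarized above might look as follows in tidymodels code; this is a sketch only, with `cancer_train` (name hypothetical) and the predictor names assumed from earlier in the chapter:

```r
library(tidymodels)

# preprocess, tune K by 5-fold cross-validation, and inspect accuracy
knn_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = vfold_cv(cancer_train, v = 5, strata = Class),
            grid = tibble(neighbors = seq(1, 15, by = 2))) |>
  collect_metrics()

knn_results |> filter(.metric == "accuracy")
```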

31 changes: 18 additions & 13 deletions clustering.Rmd
@@ -91,6 +91,8 @@ principal component analysis, multidimensional scaling, and more;
see the additional resources section at the end of this chapter
for where to begin learning more about these other methods.

\newpage

> **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
> where only some of the data come with response variable labels/values,
> but the vast majority don't.
@@ -164,11 +166,12 @@ penguin_data <- read_csv("data/penguins_standardized.csv")
penguin_data
```


Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.

```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
\newpage

```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
ggplot(data, aes(x = flipper_length_standardized,
y = bill_length_standardized)) +
geom_point() +
@@ -203,7 +206,7 @@ This procedure will separate the data into groups;
Figure \@ref(fig:10-toy-example-clustering) shows these groups
denoted by colored scatter points.

```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
ggplot(data, aes(y = bill_length_standardized,
x = flipper_length_standardized, color = cluster)) +
geom_point() +
@@ -261,7 +264,7 @@ in Figure \@ref(fig:10-toy-example-clus1-center).

(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.

```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
base <- ggplot(data, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
geom_point() +
xlab("Flipper Length (standardized)") +
@@ -308,7 +311,7 @@ These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-di

(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.

```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
base <- ggplot(clus1) +
geom_point(aes(y = bill_length_standardized,
x = flipper_length_standardized),
@@ -347,7 +350,7 @@ Figure \@ref(fig:10-toy-example-all-clus-dists).

(ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.

```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
all_clusters_base <- data |>
@@ -406,6 +409,8 @@ all_clusters_base <- all_clusters_base +
all_clusters_base
```
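
The quantities shown in these figures can also be computed directly; a minimal sketch of the cluster center and within-cluster sum of squared distances (WSSD) for cluster 1, assuming the `clus1` data frame used in the hidden plotting code above:

```r
library(dplyr)

# center of cluster 1: the mean of each standardized coordinate
clus1_center <- clus1 |>
  summarize(flipper = mean(flipper_length_standardized),
            bill = mean(bill_length_standardized))

# WSSD for cluster 1: sum of squared straight-line distances to that center
clus1 |>
  summarize(wssd = sum((flipper_length_standardized - clus1_center$flipper)^2 +
                       (bill_length_standardized - clus1_center$bill)^2))
```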

\newpage

### The clustering algorithm

We begin the K-means \index{K-means!algorithm} algorithm by picking K,
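
A compact sketch of the K-means loop itself: assign each observation to its nearest center, recompute each center as the mean of its assigned points, and repeat until the assignments stop changing. This is plain R for illustration only (the chapter itself uses `kmeans()`), and it ignores edge cases such as empty clusters:

```r
# `standardized_data` is assumed to hold the two standardized measurements
simple_kmeans <- function(standardized_data, k, max_iter = 100) {
  x <- as.matrix(standardized_data)
  # initialization: pick k observations at random as the starting centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  labels <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # assignment step: label each observation with its closest center
    dists <- sapply(seq_len(k),
                    function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_labels <- max.col(-dists)
    if (all(new_labels == labels)) break  # assignments stable: done
    labels <- new_labels
    # center update step: each center becomes the mean of its assigned points
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[labels == j, , drop = FALSE])
    }
  }
  list(labels = labels, centers = centers)
}
```
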
@@ -597,7 +602,7 @@ These, however, are beyond the scope of this book.
Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.

```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Random initialization of labels."}
```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
penguin_data <- penguin_data |>
mutate(label = as_factor(c(3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)))
@@ -618,7 +623,7 @@ Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means wo

(ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.

```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 6.75, fig.width = 8, fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 6.75, fig.width = 8, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
list_plot_cntrs <- vector(mode = "list", length = 5)
list_plot_lbls <- vector(mode = "list", length = 5)
@@ -776,7 +781,7 @@ Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.

```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
set.seed(3)
kclusts <- tibble(k = 1:9) |>
@@ -840,7 +845,7 @@ decrease the total WSSD, but by only a *diminishing amount*. If we plot the tota
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).

```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.5, fig.width = 4.5, fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
geom_point(size = 2) +
geom_line() +
@@ -931,7 +936,7 @@ clustered_data
Now that we have this information in a tidy data frame, we can make a visualization
of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).

```{r 10-plot-clusters-2, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "The data colored by the cluster assignments returned by K-means."}
```{r 10-plot-clusters-2, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "The data colored by the cluster assignments returned by K-means."}
cluster_plot <- ggplot(clustered_data,
aes(x = flipper_length_mm,
y = bill_length_mm,
@@ -1040,7 +1045,7 @@ clustering_statistics
Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
(Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.

```{r 10-plot-choose-k, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
```{r 10-plot-choose-k, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
@@ -1075,7 +1080,7 @@ but there is a trade-off that doing many clusterings
could take a long time.
So this is something that needs to be balanced.

```{r 10-choose-k-nstart, fig.height = 3.5, fig.width = 4.5, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
```{r 10-choose-k-nstart, fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
penguin_clust_ks <- tibble(k = 1:9) |>
rowwise() |>
mutate(penguin_clusts = list(kmeans(standardized_data, nstart = 10, k)),