Merge pull request #420 from UBC-DSCI/dev
transfer dev to master
trevorcampbell authored Jan 12, 2022
2 parents 4aa2fc0 + 062be9c commit d10e86a
Showing 162 changed files with 1,388 additions and 695 deletions.
2 changes: 2 additions & 0 deletions build_pdf.sh
@@ -3,6 +3,7 @@
# Copy files
cp references.bib pdf/
cp authors.Rmd pdf/
cp foreword-text.Rmd pdf/
cp preface-text.Rmd pdf/
cp acknowledgements.Rmd pdf/
cp intro.Rmd pdf/
@@ -29,6 +30,7 @@ docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsc
# clean files in pdf dir
rm -rf pdf/references.bib
rm -rf pdf/authors.Rmd
rm -rf pdf/foreword-text.Rmd
rm -rf pdf/preface-text.Rmd
rm -rf pdf/acknowledgements.Rmd
rm -rf pdf/intro.Rmd
6 changes: 3 additions & 3 deletions classification1.Rmd
@@ -455,7 +455,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
distance using the formula above: we square the differences between the two observations' perimeter
and concavity coordinates, add the squared differences, and then take the square root.
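
For concreteness, the distance computation described here can be written directly; a minimal sketch, assuming the `cancer` data frame and the `new_point` vector used in the surrounding code:

```r
library(tidyverse)

# distance from each observation in `cancer` to the new point: square the
# differences in perimeter and concavity, add them, and take the square root
cancer |>
  mutate(dist_from_new = sqrt((Perimeter - new_point[1])^2 +
                              (Concavity - new_point[2])^2)) |>
  select(Perimeter, Concavity, dist_from_new)
```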

```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
perim_concav <- bind_rows(cancer,
tibble(Perimeter = new_point[1],
Concavity = new_point[2],
@@ -1096,7 +1096,7 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
set.seed(3)
```

```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data."}
```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Imbalanced data."}
rare_cancer <- bind_rows(
filter(cancer, Class == "B"),
cancer |> filter(Class == "M") |> slice_head(n = 3)
@@ -1255,7 +1255,7 @@ classifier would make. We can see that the decision is more reasonable; when the
to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
closer to the benign tumor observations.
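
The rebalancing behind this plot can be sketched with `step_upsample()` from the `themis` package; this is a hedged example using the `rare_cancer` data frame created above, not the hidden chunk's exact code:

```r
library(tidymodels)
library(themis)  # provides step_upsample()

# oversample the rare malignant class so both classes are equally
# represented before training the classifier
ups_recipe <- recipe(Class ~ Perimeter + Concavity, data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()

upsampled_cancer <- bake(ups_recipe, new_data = NULL)
upsampled_cancer |>
  group_by(Class) |>
  summarize(n = n())
```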

```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
set_engine("kknn") |>
set_mode("classification")
40 changes: 20 additions & 20 deletions classification2.Rmd
@@ -643,7 +643,7 @@ Here, $C=5$ different chunks of the data set are used,
resulting in 5 different choices for the **validation set**; we call this
*5-fold* cross-validation.

```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.retina = 2, out.width = "100%"}
```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/cv.png")
```
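
The folds themselves can be created with `vfold_cv()`; a minimal sketch, assuming a training split named `cancer_train` (name hypothetical here) with the `Class` label used in this chapter:

```r
library(tidymodels)

# split the training data into 5 folds, stratifying on the class label so
# each fold keeps a similar benign/malignant balance
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
cancer_vfold
```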

@@ -863,24 +863,7 @@ regardless of what the new observation looks like. In general, if the model
*isn't influenced enough* by the training data, it is said to **underfit** the
data.

**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
individual data point has a stronger and stronger vote regarding nearby points.
Since the data themselves are noisy, this causes a more "jagged" boundary
corresponding to a *less simple* model. If you take this case to the extreme,
setting $K = 1$, then the classifier is essentially just matching each new
observation to its closest neighbor in the training data set. This is just as
problematic as the large $K$ case, because the classifier becomes unreliable on
new data: if we had a different training set, the predictions would be
completely different. In general, if the model *is influenced too much* by the
training data, it is said to **overfit** the data.

Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to strike
a balance between the two. You can see these two effects in Figure
\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
we set the number of neighbors $K$ to 1, 7, 20, and 300.

```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
ks <- c(1, 7, 20, 300)
plots <- list()
@@ -935,6 +918,23 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
```

**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
individual data point has a stronger and stronger vote regarding nearby points.
Since the data themselves are noisy, this causes a more "jagged" boundary
corresponding to a *less simple* model. If you take this case to the extreme,
setting $K = 1$, then the classifier is essentially just matching each new
observation to its closest neighbor in the training data set. This is just as
problematic as the large $K$ case, because the classifier becomes unreliable on
new data: if we had a different training set, the predictions would be
completely different. In general, if the model *is influenced too much* by the
training data, it is said to **overfit** the data.

Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to strike
a balance between the two. You can see these two effects in Figure
\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
we set the number of neighbors $K$ to 1, 7, 20, and 300.
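
To make the contrast concrete, here is a hedged sketch of the two extremes using the model specification style from this chapter: `neighbors = 1` gives the overly flexible classifier that overfits, while `neighbors = 300` gives the overly rigid one that underfits:

```r
library(tidymodels)

# K = 1: each prediction copies the single nearest training point (overfits)
knn_overfit_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

# K = 300: almost every prediction is the overall majority class (underfits)
knn_underfit_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 300) |>
  set_engine("kknn") |>
  set_mode("classification")
```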

## Summary

Classification algorithms use one or more quantitative variables to predict the
@@ -948,7 +948,7 @@ can tune the classifier (e.g., select the number of neighbors $K$ in $K$-NN)
by maximizing estimated accuracy via cross-validation. The overall
process is summarized in Figure \@ref(fig:06-overview).

```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/train-test-overview.jpeg")
```
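
As a compact reference, the overall process summarized above might look as follows in tidymodels code; this is a sketch only, with `cancer_train` (name hypothetical) and the predictor names assumed from earlier in the chapter:

```r
library(tidymodels)

# preprocess, tune K by 5-fold cross-validation, and inspect accuracy
knn_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = vfold_cv(cancer_train, v = 5, strata = Class),
            grid = tibble(neighbors = seq(1, 15, by = 2))) |>
  collect_metrics()

knn_results |> filter(.metric == "accuracy")
```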

31 changes: 18 additions & 13 deletions clustering.Rmd
@@ -91,6 +91,8 @@ principal component analysis, multidimensional scaling, and more;
see the additional resources section at the end of this chapter
for where to begin learning more about these other methods.

\newpage

> **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
> where only some of the data come with response variable labels/values,
> but the vast majority don't.
@@ -164,11 +166,12 @@ penguin_data <- read_csv("data/penguins_standardized.csv")
penguin_data
```


Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.

```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
\newpage

```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
ggplot(data, aes(x = flipper_length_standardized,
y = bill_length_standardized)) +
geom_point() +
@@ -203,7 +206,7 @@ This procedure will separate the data into groups;
Figure \@ref(fig:10-toy-example-clustering) shows these groups
denoted by colored scatter points.

```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
ggplot(data, aes(y = bill_length_standardized,
x = flipper_length_standardized, color = cluster)) +
geom_point() +
@@ -261,7 +264,7 @@ in Figure \@ref(fig:10-toy-example-clus1-center).

(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.

```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
base <- ggplot(data, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
geom_point() +
xlab("Flipper Length (standardized)") +
@@ -308,7 +311,7 @@ These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-di

(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.

```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
base <- ggplot(clus1) +
geom_point(aes(y = bill_length_standardized,
x = flipper_length_standardized),
@@ -347,7 +350,7 @@ Figure \@ref(fig:10-toy-example-all-clus-dists).

(ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.

```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
all_clusters_base <- data |>
@@ -406,6 +409,8 @@ all_clusters_base <- all_clusters_base +
all_clusters_base
```
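
The quantities shown in these figures can also be computed directly; a minimal sketch of the cluster center and within-cluster sum of squared distances (WSSD) for cluster 1, assuming the `clus1` data frame used in the hidden plotting code above:

```r
library(dplyr)

# center of cluster 1: the mean of each standardized coordinate
clus1_center <- clus1 |>
  summarize(flipper = mean(flipper_length_standardized),
            bill = mean(bill_length_standardized))

# WSSD for cluster 1: sum of squared straight-line distances to that center
clus1 |>
  summarize(wssd = sum((flipper_length_standardized - clus1_center$flipper)^2 +
                       (bill_length_standardized - clus1_center$bill)^2))
```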

\newpage

### The clustering algorithm

We begin the K-means \index{K-means!algorithm} algorithm by picking K,
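
A compact sketch of the K-means loop itself: assign each observation to its nearest center, recompute each center as the mean of its assigned points, and repeat until the assignments stop changing. This is plain R for illustration only (the chapter itself uses `kmeans()`), and it ignores edge cases such as empty clusters:

```r
# `standardized_data` is assumed to hold the two standardized measurements
simple_kmeans <- function(standardized_data, k, max_iter = 100) {
  x <- as.matrix(standardized_data)
  # initialization: pick k observations at random as the starting centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  labels <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # assignment step: label each observation with its closest center
    dists <- sapply(seq_len(k),
                    function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_labels <- max.col(-dists)
    if (all(new_labels == labels)) break  # assignments stable: done
    labels <- new_labels
    # center update step: each center becomes the mean of its assigned points
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[labels == j, , drop = FALSE])
    }
  }
  list(labels = labels, centers = centers)
}
```
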
@@ -597,7 +602,7 @@ These, however, are beyond the scope of this book.
Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.

```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Random initialization of labels."}
```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
penguin_data <- penguin_data |>
mutate(label = as_factor(c(3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)))
@@ -618,7 +623,7 @@ Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means wo

(ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.

```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 6.75, fig.width = 8, fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 6.75, fig.width = 8, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
list_plot_cntrs <- vector(mode = "list", length = 5)
list_plot_lbls <- vector(mode = "list", length = 5)
@@ -776,7 +781,7 @@ Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.

```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
set.seed(3)
kclusts <- tibble(k = 1:9) |>
@@ -840,7 +845,7 @@ decrease the total WSSD, but by only a *diminishing amount*. If we plot the tota
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).

```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.5, fig.width = 4.5, fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
geom_point(size = 2) +
geom_line() +
@@ -931,7 +936,7 @@ clustered_data
Now that we have this information in a tidy data frame, we can make a visualization
of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).

```{r 10-plot-clusters-2, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "The data colored by the cluster assignments returned by K-means."}
```{r 10-plot-clusters-2, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "The data colored by the cluster assignments returned by K-means."}
cluster_plot <- ggplot(clustered_data,
aes(x = flipper_length_mm,
y = bill_length_mm,
@@ -1040,7 +1045,7 @@ clustering_statistics
Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
(Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.

```{r 10-plot-choose-k, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
```{r 10-plot-choose-k, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
@@ -1075,7 +1080,7 @@ but there is a trade-off that doing many clusterings
could take a long time.
So this is something that needs to be balanced.

```{r 10-choose-k-nstart, fig.height = 3.5, fig.width = 4.5, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
```{r 10-choose-k-nstart, fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
penguin_clust_ks <- tibble(k = 1:9) |>
rowwise() |>
mutate(penguin_clusts = list(kmeans(standardized_data, nstart = 10, k)),