Various fixes #541

Merged
14 commits merged on Sep 28, 2023

3,061 changes: 3,061 additions & 0 deletions data/canada_wiki.html

1 change: 1 addition & 0 deletions data/nasa.json

Binary file added img/reading/NASA-API-Rho-Ophiuchi.png
Binary file added img/reading/NASA-API-limits.png
Binary file added img/reading/NASA-API-parameters.png
Binary file added img/reading/NASA-API-signup.png
Binary file removed img/reading/authorize_question.png
Binary file modified img/reading/sg4.png
Binary file removed img/reading/tidyverse_twitter.png
2 changes: 2 additions & 0 deletions source/acknowledgments.Rmd
@@ -12,6 +12,8 @@ list of topics in this book. We would especially like to thank Matías
Salibián-Barrera for his mentorship during the initial development and roll-out
of both DSCI 100 and this book. His door was always open when
we needed to chat about how to best introduce and teach data science to our first-year students.
+We would also like to thank Gabriela Cohen Freue for her DSCI 561 (Regression I) teaching materials
+from the UBC Master of Data Science program, as some of our linear regression figures were inspired by these materials.

We would also like to thank all those who contributed to the process of
publishing this book. In particular, we would like to thank all of our reviewers for their feedback and suggestions:
4 changes: 2 additions & 2 deletions source/classification1.Rmd
@@ -1438,8 +1438,8 @@ prediction <- predict(knn_fit, new_observation)
prediction
```

-The classifier predicts that the first observation is benign ("B"), while the second is
-malignant ("M"). Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
+The classifier predicts that the first observation is benign, while the second is
+malignant. Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
trained $K$-nearest neighbor model will make on a large range of new observations.
Although you have seen colored prediction map visualizations like this a few times now,
we have not included the code to generate them, as it is a little bit complicated.
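
For context, these predictions come from calling `predict()` on the fitted workflow. A minimal sketch is below; the predictor names and values in `new_observation` are illustrative assumptions rather than values from the book's data set, and `knn_fit` is assumed to be the workflow fitted earlier in the chapter.

```r
library(tidymodels)

# Two hypothetical new observations; the predictor names and values here
# are assumptions for illustration only.
new_observation <- tibble(
  Smoothness = c(0.08, 0.13),
  Concavity = c(0.07, 0.35)
)

# knn_fit is assumed to be the fitted K-NN workflow from earlier in the chapter.
prediction <- predict(knn_fit, new_observation)
prediction
```
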
27 changes: 12 additions & 15 deletions source/classification2.Rmd
@@ -410,8 +410,8 @@ train_prop <- cancer_train |>
```

We can use `group_by` and `summarize` to \index{group\_by}\index{summarize} find the percentage of malignant and benign classes
-in `cancer_train` and we see about `r round(filter(train_prop, Class == "B")$proportion, 2)*100`% of the training
-data are benign and `r round(filter(train_prop, Class == "M")$proportion, 2)*100`%
+in `cancer_train` and we see about `r round(filter(train_prop, Class == "Benign")$proportion, 2)*100`% of the training
+data are benign and `r round(filter(train_prop, Class == "Malignant")$proportion, 2)*100`%
are malignant, indicating that our class proportions were roughly preserved when we split the data.
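
As a rough sketch of the computation described here (the body of the `06-train-proportion` chunk below is collapsed in this diff), the proportions could be obtained along these lines; `cancer_train` and `Class` are taken from the surrounding text:

```r
library(dplyr)

# Proportion of benign and malignant observations in the training set.
train_prop <- cancer_train |>
  group_by(Class) |>
  summarize(proportion = n() / nrow(cancer_train))
train_prop
```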

```{r 06-train-proportion}
@@ -591,7 +591,7 @@ cancer_proportions

```{r 06-proportions-2, echo = FALSE, warning = FALSE}
cancer_propn_1 <- cancer_proportions |>
-filter(Class == 'B') |>
+filter(Class == 'Benign') |>
select(percent)
```

@@ -1332,13 +1332,13 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
However, it becomes very slow when you have even a moderate
number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+a slow process!) for each one. For example, if we have 2 predictors&mdash;let's call
them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
+and B together. If we have 3 predictors&mdash;A, B, and C&mdash;then we have 7
to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
we have to train for $m$ predictors is $2^m-1$; in other words, when we
-get to $10$ predictors we have over *one thousand* models to train, and
-at $20$ predictors we have over *one million* models to train!
+get to 10 predictors we have over *one thousand* models to train, and
+at 20 predictors we have over *one million* models to train!
So although it is a simple method, best subset selection is usually too computationally
expensive to use in practice.

@@ -1360,8 +1360,8 @@ This pattern continues for as many iterations as you want. If you run the method
all the way until you run out of predictors to choose, you will end up training
$\frac{1}{2}m(m+1)$ separate models. This is a *big* improvement from the $2^m-1$
models that best subset selection requires you to train! For example, while best subset selection requires
-training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
-Therefore we will continue the rest of this section using forward selection.
+training over 1000 candidate models with 10 predictors, forward selection requires training only 55 candidate models.
+Therefore we will continue the rest of this section using forward selection.
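
The model counts quoted in the last two paragraphs are easy to verify; a quick illustrative check (not part of the book's source):

```r
m <- c(2, 3, 10, 20)

# Best subset selection trains 2^m - 1 candidate models.
2^m - 1          # 3 7 1023 1048575

# Forward selection, run until no predictors remain, trains m * (m + 1) / 2 models.
m * (m + 1) / 2  # 3 6 55 210
```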

> **Note:** One word of caution before we move on. Every additional model that you train
> increases the likelihood that you will get unlucky and stumble
@@ -1378,12 +1378,9 @@ training over 1000 candidate models with $m=10$ predictors, forward selection re

We now turn to implementing forward selection in R.
Unfortunately there is no built-in way to do this using the `tidymodels` framework,
-so we will have to code it ourselves. First we will use the `select` function
-to extract the "total" set of predictors that we are willing to work with.
-Here we will load the modified version of the cancer data with irrelevant
-predictors, and select `Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`
-as potential predictors, and the `Class` variable as the label.
-We will also extract the column names for the full set of predictor variables.
+so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors
+to work with in this illustrative example&mdash;`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as
+well as the `Class` variable as the label. We will also extract the column names for the full set of predictors.
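
A minimal sketch of this setup step might look as follows; the data frame name `cancer_irrelevant` and the variable name `predictor_names` are assumptions, since the actual chunk is collapsed in this diff.

```r
library(tidyverse)

# Keep the label and the candidate predictors named in the text.
cancer_subset <- cancer_irrelevant |>
  select(Class, Smoothness, Concavity, Perimeter,
         Irrelevant1, Irrelevant2, Irrelevant3)

# Column names of the full set of candidate predictors.
predictor_names <- colnames(cancer_subset |> select(-Class))
predictor_names
```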

```{r 06-fwdsel-seed, echo = FALSE, warning = FALSE, message = FALSE}
# hidden seed
2 changes: 1 addition & 1 deletion source/clustering.Rmd
@@ -289,7 +289,7 @@ Then we would compute the coordinates, $\mu_x$ and $\mu_y$, of the cluster cente
$$\mu_x = \frac{1}{4}(x_1+x_2+x_3+x_4) \quad \mu_y = \frac{1}{4}(y_1+y_2+y_3+y_4).$$

In the first cluster from the example, there are `r nrow(clus1)` data points. These are shown with their cluster center
(`r paste("flipper_length_standardized =", round(mean(clus1$flipper_length_standardized),2))` and `r paste("bill_length_standardized =", round(mean(clus1$bill_length_standardized),2))`) highlighted
(standardized flipper length `r round(mean(clus1$flipper_length_standardized),2)`, standardized bill length `r round(mean(clus1$bill_length_standardized),2)`) highlighted
in Figure \@ref(fig:10-toy-example-clus1-center).
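
Concretely, the cluster center is just the column-wise mean of each standardized coordinate; a brief sketch, assuming `clus1` holds the two standardized columns named above:

```r
library(dplyr)

# The cluster center is the mean of each standardized coordinate.
clus1_center <- clus1 |>
  summarize(
    flipper_length_standardized = mean(flipper_length_standardized),
    bill_length_standardized = mean(bill_length_standardized)
  )
clus1_center
```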

(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.