diff --git a/source/acknowledgements.md b/source/acknowledgements.md index 233e18fc..751e2a85 100755 --- a/source/acknowledgements.md +++ b/source/acknowledgements.md @@ -58,7 +58,7 @@ We would like to give special thanks to Navya Dahiya and Gloria Ye for completing the first round of translation of the R material to Python, and to Philip Austin for his leadership and guidance throughout the translation process. We also gratefully acknowledge the UBC Open Educational Resources Fund -and the UBC Department of Statistics for supporting the translation of -the original R textbook and exercises to the python programming language. +and the UBC Department of Statistics for supporting the translation of +the original R textbook and exercises to the Python programming language. diff --git a/source/authors.md b/source/authors.md index e365683c..a622a9f6 100755 --- a/source/authors.md +++ b/source/authors.md @@ -20,6 +20,16 @@ Campbell, and Melissa Lee for the R programming language. The content of the R textbook was adapted to Python by Trevor Campbell, Joel Ostblom, and Lindsey Heagy. +**[Tiffany Timbers](https://www.tiffanytimbers.com/)** is an Associate Professor of Teaching in the Department of +Statistics and Co-Director for the Master of Data Science program (Vancouver +Option) at the University of British Columbia. In these roles she teaches and +develops curriculum around the responsible application of Data Science to solve +real-world problems. One of her favorite courses she teaches is a graduate +course on collaborative software development, which focuses on teaching how to +create R and Python packages using modern tools and workflows. + ++++ + **[Trevor Campbell](https://trevorcampbell.me/)** is an Associate Professor in the Department of Statistics at the University of British Columbia. His research focuses on automated, scalable Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and @@ -32,15 +42,6 @@ program at the University of Toronto. +++ -**[Tiffany Timbers](https://www.tiffanytimbers.com/)** is an Associate Professor of Teaching in the Department of -Statistics and Co-Director for the Master of Data Science program (Vancouver -Option) at the University of British Columbia. In these roles she teaches and -develops curriculum around the responsible application of Data Science to solve -real-world problems. One of her favorite courses she teaches is a graduate -course on collaborative software development, which focuses on teaching how to -create R and Python packages using modern tools and workflows. -+++ - **[Melissa Lee](https://www.stat.ubc.ca/users/melissa-lee)** is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. She teaches and develops curriculum for undergraduate statistics and data science courses. Her work @@ -50,19 +51,8 @@ initiatives. +++ -**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric -Sciences and director of the Geophysical Inversion Facility at the University of British Columbia. -Her research combines computational methods in numerical simulations, inversions, and machine -learning to answer questions about the subsurface of the Earth. Primary applications include -mineral exploration, carbon sequestration, groundwater and environmental studies. 
She -completed her BSc at the University of Alberta, her PhD at the University of British Columbia, -and held a Postdoctoral research position at the University of California Berkeley prior to -starting her current position at UBC. - -+++ - **[Joel Ostblom](https://joelostblom.com/)** is an Assistant Professor of Teaching in the Department of -Statistics at the University of British Columbia. +Statistics at the University of British Columbia. During his PhD, Joel developed a passion for data science and reproducibility through the development of quantitative image analysis pipelines for studying stem cell and developmental biology. He has since co-created or lead the @@ -71,3 +61,15 @@ is now an assistant professor of teaching in the statistics department at the University of British Columbia. Joel cares deeply about spreading data literacy and excitement over programmatic data analysis, which is reflected in his contributions to open source projects and data science learning resources. + ++++ + +**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric +Sciences and director of the Geophysical Inversion Facility at the University of British Columbia. +Her research combines computational methods in numerical simulations, inversions, and machine +learning to answer questions about the subsurface of the Earth. Primary applications include +mineral exploration, carbon sequestration, groundwater and environmental studies. She +completed her BSc at the University of Alberta, her PhD at the University of British Columbia, +and held a Postdoctoral research position at the University of California Berkeley prior to +starting her current position at UBC. + diff --git a/source/classification1.md b/source/classification1.md index 38b14e42..a393f295 100755 --- a/source/classification1.md +++ b/source/classification1.md @@ -25,12 +25,12 @@ import plotly.graph_objects as go (classification1)= # Classification I: training & predicting -## Overview +## Overview In previous chapters, we focused solely on descriptive and exploratory -data analysis questions. +data analysis questions. This chapter and the next together serve as our first foray into answering *predictive* questions about data. In particular, we will -focus on *classification*, i.e., using one or more +focus on *classification*, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make @@ -38,7 +38,7 @@ predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. -## Chapter learning objectives +## Chapter learning objectives By the end of the chapter, readers will be able to do the following: @@ -46,11 +46,10 @@ By the end of the chapter, readers will be able to do the following: - Describe what a training data set is and how it is used in classification. - Interpret the output of a classifier. - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. -- Explain the $K$-nearest neighbor classification algorithm. -- Perform $K$-nearest neighbor classification in Python using `scikit-learn`. 
-- Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled. -- Use `sample` to preprocess data to be balanced. -- Combine preprocessing and model training using `make_pipeline`. +- Explain the K-nearest neighbors classification algorithm. +- Perform K-nearest neighbors classification in Python using `scikit-learn`. +- Use methods from `scikit-learn` to center, scale, balance, and impute data as a preprocessing step. +- Combine preprocessing and model training into a `Pipeline` using `make_pipeline`. +++ @@ -66,7 +65,7 @@ In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor's past experience with patients; an email provider might want to tag a given -email as "spam" or "not spam" based on the email's text and past email text data; +email as "spam" or "not spam" based on the email's text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of **classification**, i.e., predicting a @@ -76,7 +75,7 @@ other variables (sometimes called *features*). ```{index} training set ``` -Generally, a classifier assigns an observation without a known class (e.g., a new patient) +Generally, a classifier assigns an observation without a known class (e.g., a new patient) to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations for which we do know the class (e.g., previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for @@ -89,14 +88,14 @@ the classifier to make predictions on new data for which we do not know the clas There are many possible methods that we could use to predict a categorical class/label for an observation. In this book, we will -focus on the widely used **$K$-nearest neighbors** algorithm {cite:p}`knnfix,knncover`. +focus on the widely used **K-nearest neighbors** algorithm {cite:p}`knnfix,knncover`. In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. It is also worth mentioning that there are many -variations on the basic classification problem. For example, +variations on the basic classification problem. For example, we focus on the setting of **binary classification** where only two -classes are involved (e.g., a diagnosis of either healthy or diseased), but you may +classes are involved (e.g., a diagnosis of either healthy or diseased), but you may also run into multiclass classification problems with more than two categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold). @@ -105,16 +104,16 @@ categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common col ```{index} breast cancer, question; classification ``` -In this chapter and the next, we will study a data set of +In this chapter and the next, we will study a data set of [digitized breast cancer image features](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian {cite:p}`streetbreastcancer`. 
Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). -Diagnosis for each image was conducted by physicians. +Diagnosis for each image was conducted by physicians. As with all data analyses, we first need to formulate a precise question that -we want to answer. Here, the question is *predictive*: can +we want to answer. Here, the question is *predictive*: can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor? Answering this @@ -162,24 +161,24 @@ Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured -for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean +for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is: -1. ID: identification number +1. ID: identification number 2. Class: the diagnosis (M = malignant or B = benign) 3. Radius: the mean of distances from center to points on the perimeter 4. Texture: the standard deviation of gray-scale values -5. Perimeter: the length of the surrounding contour +5. Perimeter: the length of the surrounding contour 6. Area: the area inside the contour 7. Smoothness: the local variation in radius lengths 8. Compactness: the ratio of squared perimeter and area -9. Concavity: severity of concave portions of the contour +9. Concavity: severity of concave portions of the contour 10. Concave Points: the number of concave portions of the contour -11. Symmetry: how similar the nucleus is when mirrored +11. Symmetry: how similar the nucleus is when mirrored 12. Fractal Dimension: a measurement of how "rough" the perimeter is +++ @@ -187,7 +186,7 @@ total set of variables per image in this data set is: ```{index} info ``` -Below we use the `info` method to preview the data frame. This method can +Below we use the `info` method to preview the data frame. This method can make it easier to inspect the data when we have a lot of columns: it prints only the column names down the page (instead of across), as well as their data types and the number of non-missing entries. @@ -211,7 +210,7 @@ cancer["Class"].unique() We will improve the readability of our analysis by renaming `"M"` to `"Malignant"` and `"B"` to `"Benign"` using the `replace` method. The `replace` method takes one argument: a dictionary that maps -previous values to desired new values. +previous values to desired new values. We will verify the result using the `unique` method. ```{index} replace @@ -240,7 +239,7 @@ glue("malignant_pct", "{:0.0f}".format(100*cancer["Class"].value_counts(normaliz ``` Before we start doing any modeling, let's explore our data set. 
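As a quick standalone recap of the relabeling step described above, here is a sketch on a small made-up series (not the `cancer` data frame itself):

```{code}
import pandas as pd

# hypothetical stand-in for the Class column
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="Class")

# replace takes a dictionary mapping previous values to desired new values
diagnosis = diagnosis.replace({"M": "Malignant", "B": "Benign"})

# verify the relabeling with unique
print(diagnosis.unique())
```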
Below we use -the `groupby` and `count` methods to find the number and percentage +the `groupby` and `count` methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with `groupby`, `count` counts the number of observations for each value of the `Class` variable. Then we calculate the percentage in each group by dividing by the total @@ -248,9 +247,9 @@ number of observations and multiplying by 100. The total number of observations equals the number of rows in the data frame, which we can access via the `shape` attribute of the data frame (`shape[0]` is the number of rows and `shape[1]` is the number of columns). -We have +We have {glue:text}`benign_count` ({glue:text}`benign_pct`\%) benign and -{glue:text}`malignant_count` ({glue:text}`malignant_pct`\%) malignant +{glue:text}`malignant_count` ({glue:text}`malignant_pct`\%) malignant tumor observations. ```{code-cell} ipython3 @@ -260,7 +259,7 @@ tumor observations. ```{index} value_counts ``` -The `pandas` package also has a more convenient specialized `value_counts` method for +The `pandas` package also has a more convenient specialized `value_counts` method for counting the number of occurrences of each value in a column. If we pass no arguments to the method, it outputs a series containing the number of occurences of each value. If we instead pass the argument `normalize=True`, it instead prints the fraction @@ -308,17 +307,17 @@ obtain a new observation not in the current data set that has all the variables measured *except* the label (i.e., an image without the physician's diagnosis for the tumor class). We could compute the standardized perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify -that observation as benign or malignant? Based on the scatter plot, how might +that observation as benign or malignant? Based on the scatter plot, how might you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as -malignant. Based on our visualization, it seems like +malignant. Based on our visualization, it seems like it may be possible to make accurate predictions of the `Class` variable (i.e., a diagnosis) for tumor images with unknown diagnoses. +++ -## Classification with $K$-nearest neighbors +## Classification with K-nearest neighbors ```{code-cell} ipython3 :tags: [remove-cell] @@ -342,21 +341,21 @@ my_distances = euclidean_distances(perim_concav_with_new_point_df[attrs])[ ``` In order to actually make predictions for new observations in practice, we -will need a classification algorithm. -In this book, we will use the $K$-nearest neighbors classification algorithm. +will need a classification algorithm. +In this book, we will use the K-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign -or malignant), the $K$-nearest neighbors classifier generally finds the $K$ +or malignant), the K-nearest neighbors classifier generally finds the $K$ "nearest" or "most similar" observations in our training set, and then uses -their diagnoses to make a prediction for the new observation's diagnosis. $K$ +their diagnoses to make a prediction for the new observation's diagnosis. $K$ is a number that we must choose in advance; for now, we will assume that someone has chosen -$K$ for us. 
We will cover how to choose $K$ ourselves in the next chapter. +$K$ for us. We will cover how to choose $K$ ourselves in the next chapter. -To illustrate the concept of $K$-nearest neighbors classification, we +To illustrate the concept of K-nearest neighbors classification, we will walk through an example. Suppose we have a -new observation, with standardized perimeter -of {glue:text}`new_point_1_0` and standardized concavity -of {glue:text}`new_point_1_1`, whose -diagnosis "Class" is unknown. This new observation is +new observation, with standardized perimeter +of {glue:text}`new_point_1_0` and standardized concavity +of {glue:text}`new_point_1_1`, whose +diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in {numref}`fig:05-knn-2`. ```{code-cell} ipython3 @@ -397,7 +396,7 @@ glue("1-neighbor_con", "{:.1f}".format(near_neighbor_df.iloc[0, :]["Concavity"]) {numref}`fig:05-knn-3` shows that the nearest point to this new observation is **malignant** and located at the coordinates ({glue:text}`1-neighbor_per`, {glue:text}`1-neighbor_con`). The idea here is that if a point is close to another -in the scatter plot, then the perimeter and concavity values are similar, +in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis. ```{code-cell} ipython3 @@ -481,7 +480,7 @@ Suppose we have another new observation with standardized perimeter scatter plot in {numref}`fig:05-knn-4`, how would you classify this red, diamond observation? The nearest neighbor to this new point is a **benign** observation at ({glue:text}`2-neighbor_per`, {glue:text}`2-neighbor_con`). -Does this seem like the right prediction to make for this observation? Probably +Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. +++ @@ -561,7 +560,7 @@ Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$. Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$; $b_x$ and $b_y$ have similar definitions for observation $b$. Then the straight-line distance between observation $a$ and -$b$ on the x-y plane can be computed using the following formula: +$b$ on the x-y plane can be computed using the following formula: $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ @@ -569,13 +568,13 @@ $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ To find the $K$ nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the $K$ observations corresponding to the -$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new -observation with perimeter {glue:text}`3-new_point_0` and +$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new +observation with perimeter {glue:text}`3-new_point_0` and concavity {glue:text}`3-new_point_1`, shown as a red diamond in {numref}`fig:05-multiknn-1`. Let's calculate the distances between our new point and each of the observations in the training set to find -the $K=5$ neighbors that are nearest to our new point. +the $K=5$ neighbors that are nearest to our new point. 
You will see in the code below, we compute the straight-line -distance using the formula above: we square the differences between the two observations' perimeter +distance using the formula above: we square the differences between the two observations' perimeter and concavity coordinates, add the squared differences, and then take the square root. In order to find the $K=5$ nearest neighbors, we will use the `nsmallest` function from `pandas`. @@ -633,16 +632,16 @@ cancer["dist_from_new"] = ( + (cancer["Concavity"] - new_obs_Concavity) ** 2 )**(1/2) cancer.nsmallest(5, "dist_from_new")[[ - "Perimeter", - "Concavity", - "Class", + "Perimeter", + "Concavity", + "Class", "dist_from_new" ]] ``` ```{code-cell} ipython3 :tags: [remove-cell] -# code needed to render the latex table with distance calculations +# code needed to render the latex table with distance calculations from IPython.display import Latex five_neighbors = ( cancer @@ -685,7 +684,7 @@ training data. +++ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are -malignant; since this is the majority, we classify our new observation as malignant. +malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in {numref}`fig:05-multiknn-3`. ```{code-cell} ipython3 @@ -714,21 +713,21 @@ Scatter plot of concavity versus perimeter with 5 nearest neighbors circled. +++ -### More than two explanatory variables +### More than two explanatory variables -Although the above description is directed toward two predictor variables, -exactly the same $K$-nearest neighbors algorithm applies when you +Although the above description is directed toward two predictor variables, +exactly the same K-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have $m$ predictor -variables for two observations $a$ and $b$, i.e., +variables for two observations $a$ and $b$, i.e., $a = (a_{1}, a_{2}, \dots, a_{m})$ and $b = (b_{1}, b_{2}, \dots, b_{m})$. ```{index} distance; more than two variables ``` -The distance formula becomes +The distance formula becomes $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.$$ @@ -758,17 +757,17 @@ cancer["dist_from_new"] = ( + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 )**(1/2) cancer.nsmallest(5, "dist_from_new")[[ - "Perimeter", - "Concavity", - "Symmetry", - "Class", + "Perimeter", + "Concavity", + "Symmetry", + "Class", "dist_from_new" ]] ``` -Based on $K=5$ nearest neighbors with these three predictors we would classify -the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. -{numref}`fig:05-more` shows what the data look like when we visualize them +Based on $K=5$ nearest neighbors with these three predictors we would classify +the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. +{numref}`fig:05-more` shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. ```{code-cell} ipython3 @@ -873,9 +872,9 @@ nearest neighbors look like, for learning purposes. 
+++ -### Summary of $K$-nearest neighbors algorithm +### Summary of K-nearest neighbors algorithm -In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following: +In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following: 1. Compute the distance between the new observation and each observation in the training set. 2. Find the $K$ rows corresponding to the $K$ smallest distances. @@ -883,21 +882,21 @@ In order to classify a new observation using a $K$-nearest neighbor classifier, +++ -## $K$-nearest neighbors with `scikit-learn` +## K-nearest neighbors with `scikit-learn` ```{index} scikit-learn ``` -Coding the $K$-nearest neighbors algorithm in Python ourselves can get complicated, +Coding the K-nearest neighbors algorithm in Python ourselves can get complicated, especially if we want to handle multiple classes, more than two variables, or predict the class for multiple new observations. Thankfully, in Python, -the $K$-nearest neighbors algorithm is -implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with -many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions -in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the -less we have to code ourselves, the fewer mistakes we will likely make. -Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package -that we prefer using `pandas` data frames over regular arrays via the `set_config` function. +the K-nearest neighbors algorithm is +implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with +many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions +in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the +less we have to code ourselves, the fewer mistakes we will likely make. +Before getting started with K-nearest neighbors, we need to tell the `sklearn` package +that we prefer using `pandas` data frames over regular arrays via the `set_config` function. ```{note} You will notice a new way of importing functions in the code below: `from ... import ...`. This lets us import *just* `set_config` from `sklearn`, and then call `set_config` without any package prefix. @@ -914,14 +913,14 @@ from sklearn import set_config set_config(transform_output="pandas") ``` -We can now get started with $K$-nearest neighbors. The first step is to +We can now get started with K-nearest neighbors. The first step is to import the `KNeighborsClassifier` from the `sklearn.neighbors` module. ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier ``` -Let's walk through how to use `KNeighborsClassifier` to perform $K$-nearest neighbors classification. +Let's walk through how to use `KNeighborsClassifier` to perform K-nearest neighbors classification. We will use the `cancer` data set from above, with perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. 
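Before continuing with the `scikit-learn` walkthrough, here is a compact standalone sketch of the three-step summary above, written directly with `pandas` on a made-up miniature training set (hypothetical values, not the chapter's data):

```{code}
import pandas as pd

# made-up miniature training set with standardized predictors
train = pd.DataFrame({
    "Perimeter": [0.2, 1.1, -0.5, 1.4, 0.9],
    "Concavity": [0.5, 1.3, -0.8, 1.0, 1.2],
    "Class": ["Benign", "Malignant", "Benign", "Malignant", "Malignant"],
})
new_obs = {"Perimeter": 1.0, "Concavity": 1.0}
K = 3

# 1. compute the distance from the new observation to each training observation
dist = (
    (train["Perimeter"] - new_obs["Perimeter"]) ** 2
    + (train["Concavity"] - new_obs["Concavity"]) ** 2
) ** 0.5

# 2. find the K rows corresponding to the K smallest distances
neighbors = train.loc[dist.nsmallest(K).index]

# 3. classify based on the majority vote among the neighbors' classes
print(neighbors["Class"].value_counts().idxmax())
```

The `KNeighborsClassifier` used below does the same kind of computation for us, with far less room for mistakes.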
Then we will use the classifier to predict the diagnosis label for a new observation with @@ -936,15 +935,15 @@ cancer_train ```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier ``` -Next, we create a *model object* for $K$-nearest neighbors classification +Next, we create a *model object* for K-nearest neighbors classification by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors; we will discuss how to choose $K$ in the next chapter. ```{note} You can specify the `weights` argument in order to control how neighbors vote when classifying a new observation. The default is `"uniform"`, where -each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, -which weigh each neighbor's vote differently, can be found on +each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, +which weigh each neighbor's vote differently, can be found on [the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). ``` @@ -975,7 +974,7 @@ knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); After using the `fit` function, we can make a prediction on a new observation by calling `predict` on the classifier object, passing the new observation -itself. As above, when we ran the $K$-nearest neighbors classification +itself. As above, when we ran the K-nearest neighbors classification algorithm manually, the `knn` model object classifies the new observation as "Malignant". Note that the `predict` function outputs an `array` with the model's prediction; you can actually make multiple predictions at the same @@ -988,8 +987,8 @@ knn.predict(new_obs) Is this predicted malignant label the actual class for this observation? Well, we don't know because we do not have this -observation's diagnosis— that is what we were trying to predict! The -classifier's prediction is not necessarily correct, but in the next chapter, we will +observation's diagnosis— that is what we were trying to predict! The +classifier's prediction is not necessarily correct, but in the next chapter, we will learn ways to quantify how accurate we think our predictions are. +++ @@ -1001,9 +1000,9 @@ learn ways to quantify how accurate we think our predictions are. ```{index} scaling ``` -When using $K$-nearest neighbor classification, the *scale* of each variable +When using K-nearest neighbors classification, the *scale* of each variable (i.e., its size and range of values) matters. Since the classifier predicts -classes by identifying observations nearest to it, any variables with +classes by identifying observations nearest to it, any variables with a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale *doesn't mean* that it is more important for making accurate predictions. For example, suppose you have a @@ -1027,20 +1026,20 @@ degrees Celsius, the two variables would differ by a constant shift of 273 hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. 
Although this doesn't affect the -$K$-nearest neighbor classification algorithm, this large shift can change the +K-nearest neighbors classification algorithm, this large shift can change the outcome of using many other predictive models. ```{index} standardization; K-nearest neighbors ``` To scale and center our data, we need to find -our variables' *mean* (the average, which quantifies the "central" value of a -set of numbers) and *standard deviation* (a number quantifying how spread out values are). -For each observed value of the variable, we subtract the mean (i.e., center the variable) -and divide by the standard deviation (i.e., scale the variable). When we do this, the data -is said to be *standardized*, and all variables in a data set will have a mean of 0 -and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest -neighbor algorithm, we will read in the original, unstandardized Wisconsin breast +our variables' *mean* (the average, which quantifies the "central" value of a +set of numbers) and *standard deviation* (a number quantifying how spread out values are). +For each observed value of the variable, we subtract the mean (i.e., center the variable) +and divide by the standard deviation (i.e., scale the variable). When we do this, the data +is said to be *standardized*, and all variables in a data set will have a mean of 0 +and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest +neighbors algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. We will apply the same initial wrangling steps as we did earlier, and to keep things simple we will just use the `Area`, `Smoothness`, and `Class` @@ -1072,11 +1071,11 @@ The `scikit-learn` framework provides a collection of *preprocessors* used to ma data in the [`preprocessing` module](https://scikit-learn.org/stable/modules/preprocessing.html). Here we will use the `StandardScaler` transformer to standardize the predictor variables in the `unscaled_cancer` data. In order to tell the `StandardScaler` which variables to standardize, -we wrap it in a +we wrap it in a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) object -using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. +using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. `ColumnTransformer` objects also enable the use of multiple preprocessors at -once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. +once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. The primary argument of the `make_column_transformer` function is a sequence of pairs of (1) a preprocessor, and (2) the columns to which you want to apply that preprocessor. In the present case, we just have the one `StandardScaler` preprocessor to apply to the `Area` and `Smoothness` columns. @@ -1101,14 +1100,14 @@ preprocessor ``` You can see that the preprocessor includes a single standardization step -that is applied to the `Area` and `Smoothness` columns. 
-Note that here we specified which columns to apply the preprocessing step to +that is applied to the `Area` and `Smoothness` columns. +Note that here we specified which columns to apply the preprocessing step to by individual names; this approach can become quite difficult, e.g., when we have many predictor variables. Rather than writing out the column names individually, -we can instead use the +we can instead use the [`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For example, if we wanted to standardize all *numerical* predictors, -we would use `make_column_selector` and specify the `dtype_include` argument to be `"number"`. +we would use `make_column_selector` and specify the `dtype_include` argument to be `"number"`. This creates a preprocessor equivalent to the one we created previously. ```{code-cell} ipython3 @@ -1126,10 +1125,10 @@ preprocessor We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame. This happens in two steps. We first use the `fit` function to compute the values necessary to apply the standardization (the mean and standard deviation of each variable), passing the `unscaled_cancer` data as an argument. -Then we use the `transform` function to actually apply the standardization. +Then we use the `transform` function to actually apply the standardization. It may seem a bit unnecessary to use two steps---`fit` *and* `transform`---to standardize the data. -However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. -This enables us to compute the quantities needed to standardize using one data set, and then +However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. +This enables us to compute the quantities needed to standardize using one data set, and then apply that standardization to another data set. ```{code-cell} ipython3 @@ -1145,7 +1144,7 @@ glue("scaled-cancer-column-1", '"'+scaled_cancer.columns[1]+'"') It looks like our `Smoothness` and `Area` variables have been standardized. Woohoo! But there are two important things to notice about the new `scaled_cancer` data frame. First, it only keeps the columns from the input to `transform` (here, `unscaled_cancer`) that had a preprocessing step applied -to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` +to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` is to *drop* the remaining columns. This default behavior works well with the rest of `sklearn` (as we will see below in {numref}`08:puttingittogetherworkflow`), but for visualizing the result of preprocessing it can be useful to keep the other columns in our original data frame, such as the `Class` variable here. @@ -1174,7 +1173,7 @@ scaled_cancer_all You may wonder why we are doing so much work just to center and scale our variables. Can't we just manually scale and center the `Area` and -`Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well, +`Smoothness` variables ourselves before building our K-nearest neighbors model? Well, technically *yes*; but doing so is error-prone. 
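To see why the manual route is risky, consider this standalone sketch (made-up numbers, not the chapter's data), which standardizes a small training set by hand and then scales a new observation two different ways:

```{code}
import pandas as pd

# made-up unscaled training data
train = pd.DataFrame({
    "Area": [1001.0, 520.0, 748.3, 1130.0],
    "Smoothness": [0.118, 0.085, 0.102, 0.097],
})

# manual standardization of the training data (what we would feed to the classifier)
train_means = train.mean()
train_sds = train.std()
train_scaled = (train - train_means) / train_sds

# a new observation we want a prediction for
new_obs = pd.DataFrame({"Area": [500.0], "Smoothness": [0.075]})

# correct: reuse the *training* means and standard deviations
print((new_obs - train_means) / train_sds)

# easy mistake: "standardizing" the new observation on its own, which yields
# NaN here (one row has no spread) and, in general, values on a completely
# different scale than the training data
print((new_obs - new_obs.mean()) / new_obs.std())
```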
In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a *different* centering / scaling than what @@ -1184,7 +1183,7 @@ the preprocessor is required only when you want to inspect the result of the preprocessing steps yourself. You will see further on in {numref}`08:puttingittogetherworkflow` that `scikit-learn` provides tools to -automatically streamline the preprocesser and the model so that you can call `fit` +automatically streamline the preprocesser and the model so that you can call `fit` and `transform` on the `Pipeline` as necessary without additional coding effort. {numref}`fig:05-scaling-plt` shows the two scatter plots side-by-side—one for `unscaled_cancer` and one for @@ -1195,10 +1194,10 @@ well within the cloud of benign observations, and the neighbors are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). {numref}`fig:05-scaling-plt-zoomed` shows a close-up of that region on the unstandardized plot. Here the computation of nearest -neighbors is dominated by the much larger-scale area variable. The plot for standardized data +neighbors is dominated by the much larger-scale area variable. The plot for standardized data on the right in {numref}`fig:05-scaling-plt` shows a much more intuitively reasonable selection of nearest neighbors. Thus, standardizing the data can change things -in an important way when we are using predictive algorithms. +in an important way when we are using predictive algorithms. Standardizing your data should be a part of the preprocessing you do before predictive modeling and you should always think carefully about your problem domain and whether you need to standardize your data. @@ -1399,9 +1398,9 @@ Close-up of three nearest neighbors for unstandardized data. ```{index} balance, imbalance ``` -Another potential issue in a data set for a classifier is *class imbalance*, +Another potential issue in a data set for a classifier is *class imbalance*, i.e., when one label is much more common than another. Since classifiers like -the $K$-nearest neighbor algorithm use the labels of nearby points to predict +the K-nearest neighbors algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the "pattern" of data suggests otherwise). Class imbalance is actually quite a @@ -1410,19 +1409,19 @@ detection, there are many cases in which the "important" class to identify (presence of disease, malicious email) is much rarer than the "unimportant" class (no disease, normal email). -To better illustrate the problem, let's revisit the scaled breast cancer data, +To better illustrate the problem, let's revisit the scaled breast cancer data, `cancer`; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the `.head()` method, which takes the number of rows to select from the top (`n`). 
-We will then use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) +We will then use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function from `pandas` to glue the two resulting filtered data frames back together. The `concat` function *concatenates* data frames along an axis. By default, it concatenates the data frames vertically along `axis=0` yielding a single *taller* data frame, which is what we want to do here. If we instead wanted to concatenate horizontally to produce a *wider* data frame, we would specify `axis=1`. -The new imbalanced data is shown in {numref}`fig:05-unbalanced`, +The new imbalanced data is shown in {numref}`fig:05-unbalanced`, and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 @@ -1452,8 +1451,8 @@ rare_cancer["Class"].value_counts() +++ -Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification. -With only 3 observations of malignant tumors, the classifier +Suppose we now decided to use $K = 7$ in K-nearest neighbors classification. +With only 3 observations of malignant tumors, the classifier will *always predict that the tumor is benign, no matter what its concavity and perimeter are!* This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be @@ -1525,9 +1524,9 @@ Imbalanced data with 7 nearest neighbors to a new observation highlighted. +++ -{numref}`fig:05-upsample-2` shows what happens if we set the background color of -each area of the plot to the predictions the $K$-nearest neighbor -classifier would make. We can see that the decision is +{numref}`fig:05-upsample-2` shows what happens if we set the background color of +each area of the plot to the predictions the K-nearest neighbors +classifier would make. We can see that the decision is always "benign," corresponding to the blue color. ```{code-cell} ipython3 @@ -1609,9 +1608,9 @@ Imbalanced data with background color indicating the decision of the classifier Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. -For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. +For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more -voting power in the $K$-nearest neighbor algorithm. In order to do this, we will +voting power in the K-nearest neighbors algorithm. In order to do this, we will first separate the classes out into their own data frames by filtering. Then, we will use the `sample` method on the rare class data frame to increase the number of `Malignant` observations to be the same as the number @@ -1624,7 +1623,7 @@ in data analysis in {numref}`Chapter %s `. ```{code-cell} ipython3 :tags: [remove-cell] # hidden seed call to make the below resample reproducible -# we haven't taught students about seeds / prngs yet, so +# we haven't taught students about seeds / prngs yet, so # for now just hide this. 
np.random.seed(1) ``` @@ -1639,11 +1638,11 @@ upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer)) upsampled_cancer["Class"].value_counts() ``` -Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data. -{numref}`fig:05-upsample-plot` shows what happens now when we set the background color -of each area of our scatter plot to the decision the $K$-nearest neighbor +Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data. +{numref}`fig:05-upsample-plot` shows what happens now when we set the background color +of each area of our scatter plot to the decision the K-nearest neighbors classifier would make. We can see that the decision is more reasonable; when the points are close -to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are +to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are closer to the benign tumor observations. ```{code-cell} ipython3 @@ -1739,13 +1738,13 @@ missing_cancer["Class"] = missing_cancer["Class"].replace({ missing_cancer ``` -Recall that K-nearest neighbor classification makes predictions by computing +Recall that K-nearest neighbors classification makes predictions by computing the straight-line distance to nearby training observations, and hence requires access to the values of *all* variables for *all* observations in the training -data. So how can we perform K-nearest neighbor classification in the presence +data. So how can we perform K-nearest neighbors classification in the presence of missing data? Well, since there are not too many observations with missing entries, one option is to simply remove those observations prior to building -the K-nearest neighbor classifier. We can accomplish this by using the +the K-nearest neighbors classifier. We can accomplish this by using the `dropna` method prior to working with the data. ```{code-cell} ipython3 @@ -1759,7 +1758,7 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries are filled in using the mean of the present entries in each variable. To perform mean imputation, we -use a `SimpleImputer` transformer with the default arguments, and wrap it in a +use a `SimpleImputer` transformer with the default arguments, and wrap it in a `ColumnTransformer` to indicate which columns need imputation. ```{code-cell} ipython3 @@ -1782,7 +1781,7 @@ imputed_cancer = preprocessor.transform(missing_cancer) imputed_cancer ``` -Many other options for missing data imputation can be found in +Many other options for missing data imputation can be found in [the `scikit-learn` documentation](https://scikit-learn.org/stable/modules/impute.html). However you decide to handle missing data in your data analysis, it is always crucial to think critically about the setting, how the data were collected, and the @@ -1796,7 +1795,7 @@ question you are answering. 
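The `SimpleImputer` above fills in column means by default. As one example of those other options, the sketch below (on a small made-up data frame) swaps in median imputation via the `strategy` argument:

```{code}
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# made-up data frame with a few missing entries
missing_df = pd.DataFrame({
    "Perimeter": [0.2, np.nan, 1.1, 0.5],
    "Concavity": [0.4, 0.9, np.nan, 0.3],
})

# strategy="median" fills each missing entry with that column's median
preprocessor = make_column_transformer(
    (SimpleImputer(strategy="median"), ["Perimeter", "Concavity"]),
)
preprocessor.fit(missing_df)
print(preprocessor.transform(missing_df))
```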
```{index} scikit-learn; pipeline ``` -The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), +The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole workflow, let's start from scratch with the `wdbc_unscaled.csv` data. First we will load the data, create a model, and specify a preprocessor for the data. @@ -1810,7 +1809,7 @@ unscaled_cancer["Class"] = unscaled_cancer["Class"].replace({ }) unscaled_cancer -# create the KNN model +# create the K-NN model knn = KNeighborsClassifier(n_neighbors=7) # create the centering / scaling preprocessor @@ -1822,7 +1821,7 @@ preprocessor = make_column_transformer( ```{index} scikit-learn; make_pipeline, scikit-learn; fit ``` -Next we place these steps in a `Pipeline` using +Next we place these steps in a `Pipeline` using the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function. The `make_pipeline` function takes a list of steps to apply in your data analysis; in this case, we just have the `preprocessor` and `knn` steps. @@ -1839,7 +1838,7 @@ from sklearn.pipeline import make_pipeline knn_pipeline = make_pipeline(preprocessor, knn) knn_pipeline.fit( - X=unscaled_cancer, + X=unscaled_cancer, y=unscaled_cancer["Class"] ) knn_pipeline @@ -1848,7 +1847,7 @@ knn_pipeline As before, the fit object lists the function that trains the model. But now the fit object also includes information about the overall workflow, including the standardization preprocessing step. In other words, when we use the `predict` function with the `knn_pipeline` object to make a prediction for a new -observation, it will first apply the same preprocessing steps to the new observation. +observation, it will first apply the same preprocessing steps to the new observation. As an example, we will predict the class label of two new observations: one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. @@ -1859,13 +1858,13 @@ prediction ``` The classifier predicts that the first observation is benign, while the second is -malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this -trained $K$-nearest neighbor model will make on a large range of new observations. +malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this +trained K-nearest neighbors model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. -For the interested reader who wants a learning challenge, we now include it below. -The basic idea is to create a grid of synthetic new observations using the `meshgrid` function from `numpy`, -predict the label of each, and visualize the predictions with a colored scatter having a very high transparency +For the interested reader who wants a learning challenge, we now include it below. 
+The basic idea is to create a grid of synthetic new observations using the `meshgrid` function from `numpy`, +predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low `opacity` value) and large point radius. See if you can figure out what each line is doing! ```{note} @@ -1950,8 +1949,8 @@ Scatter plot of smoothness versus area where background color indicates the deci ## Exercises -Practice exercises for the material covered in this chapter -can be found in the accompanying +Practice exercises for the material covered in this chapter +can be found in the accompanying [worksheets repository](https://worksheets.python.datasciencebook.ca) in the "Classification I: training and predicting" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. diff --git a/source/classification2.md b/source/classification2.md index 739b86d4..649b5aa3 100755 --- a/source/classification2.md +++ b/source/classification2.md @@ -21,25 +21,26 @@ kernelspec: from chapter_preamble import * ``` -## Overview +## Overview This chapter continues the introduction to predictive modeling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to evaluate the performance of a classifier, as well as how to improve the classifier (where possible) to maximize its accuracy. -## Chapter learning objectives +## Chapter learning objectives By the end of the chapter, readers will be able to do the following: - Describe what training, validation, and test data sets are and how they are used in classification. - Split data into training, validation, and test data sets. - Describe what a random seed is and its importance in reproducible data analysis. -- Set the random seed in Python using the `numpy.random.seed` function. +- Set the random seed in Python using the `numpy.random.seed` function. - Describe and interpret accuracy, precision, recall, and confusion matrices. -- Evaluate classification accuracy in Python using a validation data set. +- Evaluate classification accuracy, precision, and recall in Python using a test set, a single validation set, and cross-validation. - Produce a confusion matrix in Python. -- Execute cross-validation in Python to choose the number of neighbors in a $K$-nearest neighbors classifier. -- Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm. +- Choose the number of neighbors in a K-nearest neighbors classifier by maximizing estimated cross-validation accuracy. +- Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors classification. +- Describe the advantages and disadvantages of the K-nearest neighbors classification algorithm. +++ @@ -51,7 +52,7 @@ By the end of the chapter, readers will be able to do the following: Sometimes our classifier might make the wrong prediction. A classifier does not need to be right 100\% of the time to be useful, though we don't want the classifier to make too many wrong predictions. How do we measure how "good" our -classifier is? Let's revisit the +classifier is? Let's revisit the [breast cancer images data](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) {cite:p}`streetbreastcancer` and think about how our classifier will be used in practice. 
A biopsy will be performed on a *new* patient's tumor, the resulting image will be analyzed, @@ -59,9 +60,9 @@ and the classifier will be asked to decide whether the tumor is benign or malignant. The key word here is *new*: our classifier is "good" if it provides accurate predictions on data *not seen during training*, as this implies that it has actually learned about the relationship between the predictor variables and response variable, -as opposed to simply memorizing the labels of individual training data examples. +as opposed to simply memorizing the labels of individual training data examples. But then, how can we evaluate our classifier without visiting the hospital to collect more -tumor images? +tumor images? ```{index} training set, test set @@ -79,7 +80,7 @@ labels for new observations without known class labels. ``` ```{note} -If there were a golden rule of machine learning, it might be this: +If there were a golden rule of machine learning, it might be this: *you cannot use the test data to build the model!* If you do, the model gets to "see" the test data in advance, making it look more accurate than it really is. Imagine how bad it would be to overestimate your classifier's accuracy @@ -106,7 +107,7 @@ How exactly can we assess how well our predictions match the actual labels for the observations in the test set? One way we can do this is to calculate the prediction **accuracy**. This is the fraction of examples for which the classifier made the correct prediction. To calculate this, we divide the number -of correct predictions by the number of predictions made. +of correct predictions by the number of predictions made. The process for assessing if our predictions match the actual labels in the test set is illustrated in {numref}`fig:06-ML-paradigm-test`. @@ -136,7 +137,7 @@ a test set of 65 observations. :header-rows: 1 :name: confusion-matrix-table -* - +* - - Predicted Malignant - Predicted Benign * - **Actually Malignant** @@ -145,7 +146,7 @@ a test set of 65 observations. * - **Actually Benign** - 4 - 57 -``` +``` In the example in {numref}`confusion-matrix-table`, we see that there was 1 malignant observation that was correctly classified as malignant (top left corner), @@ -161,7 +162,7 @@ But we can also see that the classifier only identified 1 out of 4 total maligna tumors; in other words, it misclassified 75% of the malignant cases present in the data set! In this example, misclassifying a malignant tumor is a potentially disastrous error, since it may lead to a patient who requires treatment not receiving it. -Since we are particularly interested in identifying malignant cases, this +Since we are particularly interested in identifying malignant cases, this classifier would likely be unacceptable even with an accuracy of 89%. Focusing more on one label than the other is @@ -240,12 +241,12 @@ Beginning in this chapter, our data analyses will often involve the use of *randomness*. We use randomness any time we need to make a decision in our analysis that needs to be fair, unbiased, and not influenced by human input. For example, in this chapter, we need to split -a data set into a training set and test set to evaluate our classifier. We +a data set into a training set and test set to evaluate our classifier. We certainly do not want to choose how to split the data ourselves by hand, as we want to avoid accidentally influencing the result of the evaluation. So instead, we let Python *randomly* split the data. 
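In `scikit-learn`, this kind of random split is usually done with the `train_test_split` function; the following standalone sketch uses a made-up data frame (hypothetical values, not the chapter's data):

```{code}
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up labeled data set
cancer = pd.DataFrame({
    "Perimeter": [0.2, 1.1, -0.5, 1.4, 0.9, -1.0, 0.3, 1.2],
    "Concavity": [0.5, 1.3, -0.8, 1.0, 1.2, -0.6, 0.1, 0.8],
    "Class": ["Benign", "Malignant", "Benign", "Malignant",
              "Malignant", "Benign", "Benign", "Malignant"],
})

# randomly place 75% of the rows in the training set and 25% in the test set;
# stratify keeps the class proportions similar in both splits, and
# random_state makes this particular split reproducible
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"], random_state=1
)
print(cancer_train.shape, cancer_test.shape)
```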
In future chapters we will use randomness -in many other ways, e.g., to help us select a small subset of data from a larger data set, +in many other ways, e.g., to help us select a small subset of data from a larger data set, to pick groupings of data, and more. ```{index} reproducible, seed @@ -257,14 +258,14 @@ to pick groupings of data, and more. ```{index} seed; numpy.random.seed ``` -However, the use of randomness runs counter to one of the main +However, the use of randomness runs counter to one of the main tenets of good data analysis practice: *reproducibility*. Recall that a reproducible analysis produces the same result each time it is run; if we include randomness in the analysis, would we not get a different result each time? -The trick is that in Python—and other programming languages—randomness +The trick is that in Python—and other programming languages—randomness is not actually random! Instead, Python uses a *random number generator* that produces a sequence of numbers that -are completely determined by a +are completely determined by a *seed value*. Once you set the seed value, everything after that point may *look* random, but is actually totally reproducible. As long as you pick the same seed value, you get the same result! @@ -272,12 +273,12 @@ value, you get the same result! ```{index} sample; numpy.random.choice ``` -Let's use an example to investigate how randomness works in Python. Say we +Let's use an example to investigate how randomness works in Python. Say we have a series object containing the integers from 0 to 9. We want to randomly pick 10 numbers from that list, but we want it to be reproducible. Before randomly picking the 10 numbers, -we call the `seed` function from the `numpy` package, and pass it any integer as the argument. -Below we use the seed number `1`. At +we call the `seed` function from the `numpy` package, and pass it any integer as the argument. +Below we use the seed number `1`. At that point, Python will keep track of the randomness that occurs throughout the code. For example, we can call the `sample` method on the series of numbers, passing the argument `n=10` to indicate that we want 10 samples. @@ -294,8 +295,8 @@ random_numbers1 = nums_0_to_9.sample(n=10).to_numpy() random_numbers1 ``` You can see that `random_numbers1` is a list of 10 numbers -from 0 to 9 that, from all appearances, looks random. If -we run the `sample` method again, +from 0 to 9 that, from all appearances, looks random. If +we run the `sample` method again, we will get a fresh batch of 10 numbers that also look random. ```{code-cell} ipython3 @@ -336,18 +337,18 @@ random_numbers ``` In other words, even though the sequences of numbers that Python is generating *look* -random, they are totally determined when we set a seed value! +random, they are totally determined when we set a seed value! So what does this mean for data analysis? Well, `sample` is certainly not the only data frame method that uses randomness in Python. Many of the functions that we use in `scikit-learn`, `pandas`, and beyond use randomness—many of them without even telling you about it. Also note that when Python starts -up, it creates its own seed to use. So if you do not explicitly -call the `np.random.seed` function, your results +up, it creates its own seed to use. So if you do not explicitly +call the `np.random.seed` function, your results will likely not be reproducible. Finally, be careful to set the seed *only once* at the beginning of a data analysis. 
Each time you set the seed, you are inserting your own human input, thereby influencing the analysis. For example, if you use -the `sample` many times throughout your analysis but set the seed each time, the +the `sample` many times throughout your analysis but set the seed each time, the randomness that Python uses will not look as random as it should. In summary: if you want your analysis to be reproducible, i.e., produce *the same result* @@ -363,32 +364,32 @@ package's *default random number generator*. Using the global default random number generator is easier than other methods, but has some potential drawbacks. For example, other code that you may not notice (e.g., code buried inside some other package) could potentially *also* call `np.random.seed`, thus modifying -your analysis in an undesirable way. Furthermore, not *all* functions use +your analysis in an undesirable way. Furthermore, not *all* functions use `numpy`'s random number generator; some may use another one entirely. -In that case, setting `np.random.seed` may not actually make your whole analysis +In that case, setting `np.random.seed` may not actually make your whole analysis reproducible. In this book, we will generally only use packages that play nicely with `numpy`'s -default random number generator, so we will stick with `np.random.seed`. -You can achieve more careful control over randomness in your analysis -by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) -once at the beginning of your analysis, and passing it to +default random number generator, so we will stick with `np.random.seed`. +You can achieve more careful control over randomness in your analysis +by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) +once at the beginning of your analysis, and passing it to the `random_state` argument that is available in many `pandas` and `scikit-learn` -functions. Those functions will then use your `RandomState` to generate random numbers instead of +functions. Those functions will then use your `RandomState` to generate random numbers instead of `numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState` object with the `seed` value set to 1; we get the same lists of numbers once again. ```{code} rnd = np.random.RandomState(seed=1) random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy() random_numbers1_third -``` +``` ```{code} array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5]) ``` ```{code} random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy() random_numbers2_third -``` +``` ```{code} array([9, 5, 3, 0, 8, 4, 2, 1, 6, 7]) ``` @@ -401,15 +402,15 @@ array([9, 5, 3, 0, 8, 4, 2, 1, 6, 7]) ``` Back to evaluating classifiers now! -In Python, we can use the `scikit-learn` package not only to perform $K$-nearest neighbors -classification, but also to assess how well our classification worked. +In Python, we can use the `scikit-learn` package not only to perform K-nearest neighbors +classification, but also to assess how well our classification worked. Let's work through an example of how to use tools from `scikit-learn` to evaluate a classifier using the breast cancer data set from the previous chapter. 
We begin the analysis by loading the packages we require, reading in the breast cancer data, and then making a quick scatter plot visualization of tumor cell concavity versus smoothness colored by diagnosis in {numref}`fig:06-precode`. -You will also notice that we set the random seed using the `np.random.seed` function, +You will also notice that we set the random seed using the `np.random.seed` function, as described in {numref}`randomseeds`. ```{code-cell} ipython3 @@ -478,7 +479,7 @@ it **stratifies** the data by the class label, to ensure that roughly the same proportion of each class ends up in both the training and testing sets. For example, in our data set, roughly 63% of the observations are from the benign class (`Benign`), and 37% are from the malignant class (`Malignant`), -so specifying `stratify` as the class column ensures that roughly 63% of the training data are benign, +so specifying `stratify` as the class column ensures that roughly 63% of the training data are benign, 37% of the training data are malignant, and the same proportions exist in the testing data. @@ -518,19 +519,19 @@ glue("cancer_test_nrow", "{:d}".format(len(cancer_test))) ```{index} info ``` -We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations, +We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations, while the test set contains {glue:text}`cancer_test_nrow` observations. This corresponds to a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s ` -that we use the `info` method to preview the number of rows, the variable names, their data types, and +that we use the `info` method to preview the number of rows, the variable names, their data types, and missing entries of a data frame. ```{index} groupby, count ``` -We can use the `value_counts` method with the `normalize` argument set to `True` -to find the percentage of malignant and benign classes +We can use the `value_counts` method with the `normalize` argument set to `True` +to find the percentage of malignant and benign classes in `cancer_train`. We see about {glue:text}`cancer_train_b_prop`% of the training -data are benign and {glue:text}`cancer_train_m_prop`% +data are benign and {glue:text}`cancer_train_m_prop`% are malignant, indicating that our class proportions were roughly preserved when we split the data. ```{code-cell} ipython3 @@ -546,7 +547,7 @@ glue("cancer_train_m_prop", "{:0.0f}".format(cancer_train["Class"].value_counts( ### Preprocess the data -As we mentioned in the last chapter, $K$-nearest neighbors is sensitive to the scale of the predictors, +As we mentioned in the last chapter, K-nearest neighbors is sensitive to the scale of the predictors, so we should perform some preprocessing to standardize them. An additional consideration we need to take when doing this is that we should create the standardization preprocessor using **only the training data**. This ensures that @@ -559,7 +560,7 @@ training and test data sets. ```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler ``` -Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our +Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our analysis steps in a `Pipeline`, as in {numref}`Chapter %s `. So below we construct and prepare the preprocessor using `make_column_transformer` just as before. 
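To make the workflow above concrete, here is a minimal sketch of the stratified train/test split and the standardization preprocessor described in this section. It assumes the unsplit data frame is named `cancer`; it mirrors the description above rather than reproducing the chapter's exact code cells.

```python
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# shuffle and split the data, keeping 75% of observations for training;
# stratifying on Class keeps the benign/malignant proportions roughly
# the same in the training and test sets
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"]
)

# check that the class proportions were roughly preserved
cancer_train["Class"].value_counts(normalize=True)

# build the standardization preprocessor using only the training data columns
cancer_preprocessor = make_column_transformer(
    (StandardScaler(), ["Smoothness", "Concavity"]),
)
```

Note that the preprocessor is only *defined* here; it is fit to the training data later, when it is combined with the classifier in a `Pipeline`, which is what prevents information from the test set from leaking into the standardization.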
@@ -576,11 +577,11 @@ cancer_preprocessor = make_column_transformer( ### Train the classifier Now that we have split our original data set into training and test sets, we -can create our $K$-nearest neighbors classifier with only the training set using +can create our K-nearest neighbors classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose the number $K$ of neighbors to be 3, and use only the concavity and smoothness predictors by -selecting them from the `cancer_train` data frame. -We will first import the `KNeighborsClassifier` model and `make_pipeline` from `sklearn`. +selecting them from the `cancer_train` data frame. +We will first import the `KNeighborsClassifier` model and `make_pipeline` from `sklearn`. Then as before we will create a model object, combine the model object and preprocessor into a `Pipeline` using the `make_pipeline` function, and then finally use the `fit` method to build the classifier. @@ -589,7 +590,7 @@ use the `fit` method to build the classifier. from sklearn.neighbors import KNeighborsClassifier from sklearn.pipeline import make_pipeline -knn = KNeighborsClassifier(n_neighbors=3) +knn = KNeighborsClassifier(n_neighbors=3) X = cancer_train[["Smoothness", "Concavity"]] y = cancer_train["Class"] @@ -605,7 +606,7 @@ knn_pipeline ```{index} pandas.concat ``` -Now that we have a $K$-nearest neighbors classifier object, we can use it to +Now that we have a K-nearest neighbors classifier object, we can use it to predict the class labels for our test set and augment the original test data with a column of predictions. The `Class` variable contains the actual @@ -663,7 +664,7 @@ glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1)) +++ -The output shows that the estimated accuracy of the classifier on the test data +The output shows that the estimated accuracy of the classifier on the test data was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the `precision_score` and `recall_score` functions from `scikit-learn`. We specify the true labels from the `Class` variable as the `y_true` argument, the predicted @@ -709,7 +710,7 @@ _ctab = pd.crosstab(cancer_test["Class"], c11 = _ctab["Malignant"]["Malignant"] c00 = _ctab["Benign"]["Benign"] -c10 = _ctab["Benign"]["Malignant"] # classify benign, true malignant +c10 = _ctab["Benign"]["Malignant"] # classify benign, true malignant c01 = _ctab["Malignant"]["Benign"] # classify malignant, true benign glue("confu11", "{:d}".format(c11)) @@ -726,8 +727,8 @@ glue("confu_precision_0", "{:0.0f}".format(100*c11/(c11+c01))) glue("confu_recall_0", "{:0.0f}".format(100*c11/(c11+c10))) ``` -The confusion matrix shows {glue:text}`confu11` observations were correctly predicted -as malignant, and {glue:text}`confu00` were correctly predicted as benign. +The confusion matrix shows {glue:text}`confu11` observations were correctly predicted +as malignant, and {glue:text}`confu00` were correctly predicted as benign. It also shows that the classifier made some mistakes; in particular, it classified {glue:text}`confu10` observations as benign when they were actually malignant, and {glue:text}`confu01` observations as malignant when they were actually benign. @@ -768,15 +769,15 @@ glue("rec_eq_math_glued", rec_eq_math) ### Critically analyze performance We now know that the classifier was {glue:text}`cancer_acc_1`% accurate -on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and -a recall of {glue:text}`cancer_rec_1`%. 
-That sounds pretty good! Wait, *is* it good? +on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and +a recall of {glue:text}`cancer_rec_1`%. +That sounds pretty good! Wait, *is* it good? Or do we need something higher? ```{index} accuracy; assessment ``` -In general, a *good* value for accuracy (as well as precision and recall, if applicable) +In general, a *good* value for accuracy (as well as precision and recall, if applicable) depends on the application; you must critically analyze your accuracy in the context of the problem you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99% of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!). @@ -789,7 +790,7 @@ words, in this context, we need the classifier to have a *high recall*. On the other hand, it might be less bad for the classifier to guess "malignant" when the actual class is "benign" (a false positive), as the patient will then likely see a doctor who can provide an expert diagnosis. In other words, we are fine with sacrificing -some precision in the interest of achieving high recall. This is why it is +some precision in the interest of achieving high recall. This is why it is important not only to look at accuracy, but also the confusion matrix. @@ -801,12 +802,12 @@ classification problem: the *majority classifier*. The majority classifier *always* guesses the majority class label from the training data, regardless of the predictor variables' values. It helps to give you a sense of scale when considering accuracies. If the majority classifier obtains a 90% -accuracy on a problem, then you might hope for your $K$-nearest neighbors +accuracy on a problem, then you might hope for your K-nearest neighbors classifier to do better than that. If your classifier provides a significant improvement upon the majority classifier, this means that at least your method is extracting some useful information from your predictor variables. Be careful though: improving on the majority classifier does not *necessarily* -mean the classifier is working well enough for your application. +mean the classifier is working well enough for your application. As an example, in the breast cancer data, recall the proportions of benign and malignant observations in the training data are as follows: @@ -819,16 +820,16 @@ Since the benign class represents the majority of the training data, the majority classifier would *always* predict that a new observation is benign. The estimated accuracy of the majority classifier is usually fairly close to the majority class proportion in the training data. -In this case, we would suspect that the majority classifier will have +In this case, we would suspect that the majority classifier will have an accuracy of around {glue:text}`cancer_train_b_prop`%. -The $K$-nearest neighbors classifier we built does quite a bit better than this, -with an accuracy of {glue:text}`cancer_acc_1`%. +The K-nearest neighbors classifier we built does quite a bit better than this, +with an accuracy of {glue:text}`cancer_acc_1`%. This means that from the perspective of accuracy, -the $K$-nearest neighbors classifier improved quite a bit on the basic -majority classifier. Hooray! But we still need to be cautious; in +the K-nearest neighbors classifier improved quite a bit on the basic +majority classifier. Hooray! 
But we still need to be cautious; in this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing patients who actually need medical care. The confusion matrix above shows -that the classifier does, indeed, misdiagnose a significant number of +that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign ({glue:text}`confu10` out of {glue:text}`confu10_11` malignant tumors, or {glue:text}`confu_fal_neg`%!). Therefore, even though the accuracy improved upon the majority classifier, our critical analysis suggests that this classifier may not have appropriate performance @@ -845,23 +846,23 @@ for the application. ``` The vast majority of predictive models in statistics and machine learning have -*parameters*. A *parameter* +*parameters*. A *parameter* is a number you have to pick in advance that determines -some aspect of how the model behaves. For example, in the $K$-nearest neighbors +some aspect of how the model behaves. For example, in the K-nearest neighbors classification algorithm, $K$ is a parameter that we have to pick -that determines how many neighbors participate in the class vote. -By picking different values of $K$, we create different classifiers +that determines how many neighbors participate in the class vote. +By picking different values of $K$, we create different classifiers that make different predictions. -So then, how do we pick the *best* value of $K$, i.e., *tune* the model? +So then, how do we pick the *best* value of $K$, i.e., *tune* the model? And is it possible to make this selection in a principled way? In this book, -we will focus on maximizing the accuracy of the classifier. Ideally, +we will focus on maximizing the accuracy of the classifier. Ideally, we want somehow to maximize the accuracy of our classifier on data *it hasn't seen yet*. But we cannot use our test data set in the process of building our model. So we will play the same trick we did before when evaluating our classifier: we'll split our *training data itself* into two subsets, use one to train the model, and then use the other to evaluate it. -In this section, we will cover the details of this procedure, as well as +In this section, we will cover the details of this procedure, as well as how to use it to help you pick a good parameter value for your classifier. **And remember:** don't touch the test set during the tuning process. Tuning is a part of model training! @@ -873,12 +874,12 @@ how to use it to help you pick a good parameter value for your classifier. ```{index} validation set ``` -The first step in choosing the parameter $K$ is to be able to evaluate the +The first step in choosing the parameter $K$ is to be able to evaluate the classifier using only the training data. If this is possible, then we can compare -the classifier's performance for different values of $K$—and pick the best—using +the classifier's performance for different values of $K$—and pick the best—using only the training data. As suggested at the beginning of this section, we will accomplish this by splitting the training data, training on one subset, and evaluating -on the other. The subset of training data used for evaluation is often called the **validation set**. +on the other. The subset of training data used for evaluation is often called the **validation set**. There is, however, one key difference from the train/test split that we performed earlier. 
In particular, we were forced to make only a *single split* @@ -892,10 +893,10 @@ data *once*, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set. Perhaps using multiple different train/validation splits, we'll get a better estimate of accuracy, which will lead to a better choice of the number of neighbors $K$ for the -overall set of training data. +overall set of training data. Let's investigate this idea in Python! In particular, we will generate five different train/validation -splits of our overall training data, train five different $K$-nearest neighbors +splits of our overall training data, train five different K-nearest neighbors models, and evaluate their accuracy. We will start with just a single split. @@ -906,7 +907,7 @@ cancer_subtrain, cancer_validation = train_test_split( ) # fit the model on the sub-training data -knn = KNeighborsClassifier(n_neighbors=3) +knn = KNeighborsClassifier(n_neighbors=3) X = cancer_subtrain[["Smoothness", "Concavity"]] y = cancer_subtrain["Class"] knn_pipeline = make_pipeline(cancer_preprocessor, knn) @@ -931,7 +932,7 @@ for i in range(1, 5): ) # fit the model on the sub-training data - knn = KNeighborsClassifier(n_neighbors=3) + knn = KNeighborsClassifier(n_neighbors=3) X = cancer_subtrain[["Smoothness", "Concavity"]] y = cancer_subtrain["Class"] knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y) @@ -965,18 +966,18 @@ just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their average (here {glue:text}`avg_5_splits`%) to try to get a single assessment of our classifier's accuracy; this has the effect of reducing the influence of any one -(un)lucky validation set on the estimate. +(un)lucky validation set on the estimate. ```{index} cross-validation ``` In practice, we don't use random splits, but rather use a more structured splitting procedure so that each observation in the data set is used in a -validation set only a single time. The name for this strategy is +validation set only a single time. The name for this strategy is **cross-validation**. In **cross-validation**, we split our **overall training data** into $C$ evenly sized chunks. Then, iteratively use $1$ chunk as the -**validation set** and combine the remaining $C-1$ chunks -as the **training set**. +**validation set** and combine the remaining $C-1$ chunks +as the **training set**. This procedure is shown in {numref}`fig:06-cv-image`. Here, $C=5$ different chunks of the data set are used, resulting in 5 different choices for the **validation set**; we call this @@ -997,19 +998,19 @@ resulting in 5 different choices for the **validation set**; we call this ``` To perform 5-fold cross-validation in Python with `scikit-learn`, we use another -function: `cross_validate`. This function requires that we specify +function: `cross_validate`. This function requires that we specify a modelling `Pipeline` as the `estimator` argument, the number of folds as the `cv` argument, and the training data predictors and labels as the `X` and `y` arguments. Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame` to convert it to a `pandas` -dataframe for better visualization. +dataframe for better visualization. Note that the `cross_validate` function handles stratifying the classes in -each train and validate fold automatically. +each train and validate fold automatically. 
```{code-cell} ipython3 from sklearn.model_selection import cross_validate -knn = KNeighborsClassifier(n_neighbors=3) +knn = KNeighborsClassifier(n_neighbors=3) cancer_pipe = make_pipeline(cancer_preprocessor, knn) X = cancer_train[["Smoothness", "Concavity"]] y = cancer_train["Class"] @@ -1027,11 +1028,11 @@ cv_5_df The validation scores we are interested in are contained in the `test_score` column. We can then aggregate the *mean* and *standard error* -of the classifier's validation accuracy across the folds. -You should consider the mean (`mean`) to be the estimated accuracy, while the standard +of the classifier's validation accuracy across the folds. +You should consider the mean (`mean`) to be the estimated accuracy, while the standard error (`sem`) is a measure of how uncertain we are in that mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean is {glue:text}`cv_5_mean` and standard -error is {glue:text}`cv_5_std`, you can expect the *true* average accuracy of the +error is {glue:text}`cv_5_std`, you can expect the *true* average accuracy of the classifier to be somewhere roughly between {glue:text}`cv_5_lower`% and {glue:text}`cv_5_upper`% (although it may fall outside this range). You may ignore the other columns in the metrics data frame. @@ -1066,13 +1067,13 @@ glue("cv_5_lower", ``` We can choose any number of folds, and typically the more we use the better our -accuracy estimate will be (lower standard error). However, we are limited +accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis. So when you do cross-validation, you need to -consider the size of the data, the speed of the algorithm (e.g., $K$-nearest -neighbors), and the speed of your computer. In practice, this is a -trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here +consider the size of the data, the speed of the algorithm (e.g., K-nearest +neighbors), and the speed of your computer. In practice, this is a +trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we will try 10-fold cross-validation to see if we get a lower standard error. ```{code-cell} ipython3 @@ -1097,7 +1098,7 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt cv_10_metrics ``` -In this case, using 10-fold instead of 5-fold cross validation did +In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes you might even end up with a *higher* standard error when increasing the number of folds! We can make the reduction in standard error more dramatic by increasing the number of folds @@ -1135,19 +1136,19 @@ glue("cv_10_mean", "{:0.0f}".format(100 * cv_10_metrics.loc["mean", "test_score" ### Parameter value selection Using 5- and 10-fold cross-validation, we have estimated that the prediction -accuracy of our classifier is somewhere around {glue:text}`cv_10_mean`%. +accuracy of our classifier is somewhere around {glue:text}`cv_10_mean`%. Whether that is good or not depends entirely on the downstream application of the data analysis. In the present situation, we are trying to predict a tumor diagnosis, with expensive, damaging chemo/radiation therapy or patient death as potential consequences of -misprediction. 
Hence, we might like to -do better than {glue:text}`cv_10_mean`% for this application. +misprediction. Hence, we might like to +do better than {glue:text}`cv_10_mean`% for this application. In order to improve our classifier, we have one choice of parameter: the number of neighbors, $K$. Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and then pick the value of $K$ that gives us the -best accuracy. The `scikit-learn` package collection provides built-in +best accuracy. The `scikit-learn` package collection provides built-in functionality, named `GridSearchCV`, to automatically handle the details for us. Before we use `GridSearchCV`, we need to create a new pipeline with a `KNeighborsClassifier` that has the number of neighbors left unspecified. @@ -1159,8 +1160,8 @@ cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn) +++ -Next we specify the grid of parameter values that we want to try for -each tunable parameter. We do this in a Python dictionary: the key is +Next we specify the grid of parameter values that we want to try for +each tunable parameter. We do this in a Python dictionary: the key is the identifier of the parameter to tune, and the value is a list of parameter values to try when tuning. We can find the "identifier" of a parameter by using the `get_params` method on the pipeline. @@ -1183,7 +1184,7 @@ parameter_grid = { } ``` The `range` function in Python that we used above allows us to specify a sequence of values. -The first argument is the starting number (here, `1`), +The first argument is the starting number (here, `1`), the second argument is *one greater than* the final number (here, `100`), and the third argument is the number to values to skip between steps in the sequence (here, `5`). So in this case we generate the sequence 1, 6, 11, 16, ..., 96. @@ -1233,13 +1234,13 @@ accuracies_grid.info() There is a lot of information to look at here, but we are most interested in three quantities: the number of neighbors (`param_kneighbors_classifier__n_neighbors`), -the cross-validation accuracy estimate (`mean_test_score`), +the cross-validation accuracy estimate (`mean_test_score`), and the standard error of the accuracy estimate. Unfortunately `GridSearchCV` does not directly output the standard error for each cross-validation accuracy; but it *does* output the standard *deviation* (`std_test_score`). We can compute the standard error from the standard deviation by dividing it by the square -root of the number of folds, i.e., - +root of the number of folds, i.e., + $$\text{Standard Error} = \frac{\text{Standard Deviation}}{\sqrt{\text{Number of Folds}}}.$$ We will also rename the parameter name column to be a bit more readable, @@ -1298,14 +1299,14 @@ cancer_tune_grid.best_params_ +++ -Setting the number of +Setting the number of neighbors to $K =$ {glue:text}`best_k_unique` provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here; any selection from $K = 30$ to $80$ or so would be reasonably justified, as all of these differ in classifier accuracy by a small amount. Remember: the values you see on this plot are *estimates* of the true accuracy of our -classifier. Although the -$K =$ {glue:text}`best_k_unique` value is +classifier. 
Although the +$K =$ {glue:text}`best_k_unique` value is higher than the others on this plot, that doesn't mean the classifier is actually more accurate with this parameter value! Generally, when selecting $K$ (and other parameters for other predictive @@ -1315,8 +1316,8 @@ models), we are looking for a value where: - changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty; - the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!). -We know that $K =$ {glue:text}`best_k_unique` -provides the highest estimated accuracy. Further, {numref}`fig:06-find-k` shows that the estimated accuracy +We know that $K =$ {glue:text}`best_k_unique` +provides the highest estimated accuracy. Further, {numref}`fig:06-find-k` shows that the estimated accuracy changes by only a small amount if we increase or decrease $K$ near $K =$ {glue:text}`best_k_unique`. And finally, $K =$ {glue:text}`best_k_unique` does not create a prohibitively expensive computational cost of training. Considering these three points, we would indeed select @@ -1327,9 +1328,9 @@ $K =$ {glue:text}`best_k_unique` for the classifier. ### Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of -neighbors $K$? In fact, the cross-validation accuracy estimate actually starts to decrease! -Let's specify a much larger range of values of $K$ to try in the `param_grid` -argument of `GridSearchCV`. {numref}`fig:06-lots-of-ks` shows a plot of estimated accuracy as +neighbors $K$? In fact, the cross-validation accuracy estimate actually starts to decrease! +Let's specify a much larger range of values of $K$ to try in the `param_grid` +argument of `GridSearchCV`. {numref}`fig:06-lots-of-ks` shows a plot of estimated accuracy as we vary $K$ from 1 to almost the number of observations in the data set. ```{code-cell} ipython3 @@ -1454,11 +1455,11 @@ plot_list = [] for k in [1, 7, 20, 300]: cancer_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier(n_neighbors=k)) cancer_pipe.fit(X, y) - + knnPredGrid = cancer_pipe.predict(scgrid) prediction_table = scgrid.copy() prediction_table["Class"] = knnPredGrid - + # add a prediction layer prediction_plot = ( alt.Chart( @@ -1523,10 +1524,10 @@ set the number of neighbors $K$ to 1, 7, 20, and 300. ### Evaluating on the test set -Now that we have tuned the KNN classifier and set $K =$ {glue:text}`best_k_unique`, +Now that we have tuned the K-NN classifier and set $K =$ {glue:text}`best_k_unique`, we are done building the model and it is time to evaluate the quality of its predictions on the held out test data, as we did earlier in {numref}`eval-performance-clasfcn2`. -We first need to retrain the KNN classifier +We first need to retrain the K-NN classifier on the entire training data set using the selected number of neighbors. Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the @@ -1615,13 +1616,13 @@ maximize accuracy are not necessarily better for a given application. ## Summary Classification algorithms use one or more quantitative variables to predict the -value of another categorical variable. In particular, the $K$-nearest neighbors +value of another categorical variable. 
In particular, the K-nearest neighbors algorithm does this by first finding the $K$ points in the training data nearest to the new observation, and then returning the majority class vote from those training observations. We can tune and evaluate a classifier by splitting the data randomly into a training and test data set. The training set is used to build the classifier, and we can tune the classifier (e.g., select the number -of neighbors in $K$-nearest neighbors) by maximizing estimated accuracy via +of neighbors in K-nearest neighbors) by maximizing estimated accuracy via cross-validation. After we have tuned the model, we can use the test set to estimate its accuracy. The overall process is summarized in {numref}`fig:06-overview`. @@ -1631,7 +1632,7 @@ estimate its accuracy. The overall process is summarized in ```{figure} img/classification2/train-test-overview.jpeg :name: fig:06-overview -Overview of KNN classification. +Overview of K-NN classification. ``` +++ @@ -1639,29 +1640,29 @@ Overview of KNN classification. ```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification ``` -The overall workflow for performing $K$-nearest neighbors classification using `scikit-learn` is as follows: +The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows: -1. Use the `train_test_split` function to split the data into a training and test set. Set the `stratify` argument to the class label column of the dataframe. Put the test set aside for now. -2. Create a `Pipeline` that specifies the preprocessing steps and the classifier. -3. Define the parameter grid by passing the set of $K$ values that you would like to tune. +1. Use the `train_test_split` function to split the data into a training and test set. Set the `stratify` argument to the class label column of the dataframe. Put the test set aside for now. +2. Create a `Pipeline` that specifies the preprocessing steps and the classifier. +3. Define the parameter grid by passing the set of $K$ values that you would like to tune. 4. Use `GridSearchCV` to estimate the classifier accuracy for a range of $K$ values. Pass the pipeline and parameter grid defined in steps 2. and 3. as the `param_grid` argument and the `estimator` argument, respectively. 5. Execute the grid search by passing the training data to the `fit` method on the `GridSearchCV` instance created in step 4. 6. Pick a value of $K$ that yields a high cross-validation accuracy estimate that doesn't change much if you change $K$ to a nearby value. 7. Create a new model object for the best parameter value (i.e., $K$), and retrain the classifier by calling the `fit` method. 8. Evaluate the estimated accuracy of the classifier on the test set using the `score` method. -In these last two chapters, we focused on the $K$-nearest neighbor algorithm, -but there are many other methods we could have used to predict a categorical label. -All algorithms have their strengths and weaknesses, and we summarize these for -the $K$-NN here. +In these last two chapters, we focused on the K-nearest neighbors algorithm, +but there are many other methods we could have used to predict a categorical label. +All algorithms have their strengths and weaknesses, and we summarize these for +the K-NN here. -**Strengths:** $K$-nearest neighbors classification +**Strengths:** K-nearest neighbors classification 1. is a simple, intuitive algorithm, 2. requires few assumptions about what the data must look like, and 3. 
works for binary (two-class) and multi-class (more than 2 classes) classification problems. -**Weaknesses:** $K$-nearest neighbors classification +**Weaknesses:** K-nearest neighbors classification 1. becomes very slow as the training data gets larger, 2. may not perform well with a large number of predictors, and @@ -1672,7 +1673,7 @@ the $K$-NN here. ## Predictor variable selection ```{note} -This section is not required reading for the remainder of the textbook. It is included for those readers +This section is not required reading for the remainder of the textbook. It is included for those readers interested in learning how irrelevant variables can influence the performance of a classifier, and how to pick a subset of useful variables to include as predictors. ``` @@ -1683,7 +1684,7 @@ pick a subset of useful variables to include as predictors. Another potentially important part of tuning your classifier is to choose which variables from your data will be treated as predictor variables. Technically, you can choose anything from using a single predictor variable to using every variable in your -data; the $K$-nearest neighbors algorithm accepts any number of +data; the K-nearest neighbors algorithm accepts any number of predictors. However, it is **not** the case that using more predictors always yields better predictions! In fact, sometimes including irrelevant predictors can actually negatively affect classifier performance. @@ -1692,13 +1693,13 @@ actually negatively affect classifier performance. ### The effect of irrelevant predictors -Let's take a look at an example where $K$-nearest neighbors performs +Let's take a look at an example where K-nearest neighbors performs worse when given more predictors to work with. In this example, we modified the breast cancer data to have only the `Smoothness`, `Concavity`, and `Perimeter` variables from the original data. Then, we added irrelevant variables that we created ourselves using a random number generator. The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless -of what the value `Class` variable takes. In other words, the irrelevant variables have +of what the value `Class` variable takes. In other words, the irrelevant variables have no meaningful relationship with the `Class` variable. ```{code-cell} ipython3 @@ -1721,7 +1722,7 @@ cancer_irrelevant[ ] ``` -Next, we build a sequence of KNN classifiers that include `Smoothness`, +Next, we build a sequence of K-NN classifiers that include `Smoothness`, `Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. Then we build a model, tuned via 5-fold cross-validation, for each data set. @@ -1825,12 +1826,12 @@ glue("fig:06-performance-irrelevant-features", plt_irrelevant_accuracies) Effect of inclusion of irrelevant predictors. ::: -Although the accuracy decreases as expected, one surprising thing about +Although the accuracy decreases as expected, one surprising thing about {numref}`fig:06-performance-irrelevant-features` is that it shows that the method -still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy) +still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy) even with 40 irrelevant variables. How could that be? 
{numref}`fig:06-neighbors-irrelevant-features` provides the answer: -the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables +the tuning procedure for the K-nearest neighbors classifier combats the extra randomness from the irrelevant variables by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant variables, the number of neighbors does not increase smoothly; but the general trend is increasing. {numref}`fig:06-fixed-irrelevant-features` corroborates this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly. @@ -1893,17 +1894,17 @@ Accuracy versus number of irrelevant predictors for tuned and untuned number of ### Finding a good subset of predictors -So then, if it is not ideal to use all of our variables as predictors without consideration, how +So then, if it is not ideal to use all of our variables as predictors without consideration, how do we choose which variables we *should* use? A simple method is to rely on your scientific understanding of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer data that we have been studying, the `ID` variable is just a unique identifier for the observation. As it is not related to any measured property of the cells, the `ID` variable should therefore not be used -as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables -is less obvious, as all seem like reasonable candidates. It +as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables +is less obvious, as all seem like reasonable candidates. It is not clear which subset of them will create the best classifier. One could use visualizations and other exploratory analyses to try to help understand which variables are potentially relevant, but this process is both time-consuming and error-prone when there are many variables to consider. -Therefore we need a more systematic and programmatic way of choosing variables. +Therefore we need a more systematic and programmatic way of choosing variables. This is a very difficult problem to solve in general, and there are a number of methods that have been developed that apply in particular cases of interest. Here we will discuss two basic @@ -1918,15 +1919,15 @@ this chapter to find out where you can learn more about variable selection, incl The first idea you might think of for a systematic way to select predictors is to try all possible subsets of predictors and then pick the set that results in the "best" classifier. -This procedure is indeed a well-known variable selection method referred to -as *best subset selection* {cite:p}`bealesubset,hockingsubset`. +This procedure is indeed a well-known variable selection method referred to +as *best subset selection* {cite:p}`bealesubset,hockingsubset`. In particular, you 1. create a separate model for every possible subset of predictors, 2. tune each one using cross-validation, and -3. pick the subset of predictors that gives you the highest cross-validation accuracy. +3. pick the subset of predictors that gives you the highest cross-validation accuracy. -Best subset selection is applicable to any classification method ($K$-NN or otherwise). +Best subset selection is applicable to any classification method (K-NN or otherwise). 
However, it becomes very slow when you have even a moderate number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets grows very quickly with the number of predictors, and you have to train the model (itself @@ -1934,17 +1935,17 @@ a slow process!) for each one. For example, if we have 2 predictors—let's them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A and B together. If we have 3 predictors—A, B, and C—then we have 7 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models -we have to train for $m$ predictors is $2^m-1$; in other words, when we -get to 10 predictors we have over *one thousand* models to train, and -at 20 predictors we have over *one million* models to train! -So although it is a simple method, best subset selection is usually too computationally +we have to train for $m$ predictors is $2^m-1$; in other words, when we +get to 10 predictors we have over *one thousand* models to train, and +at 20 predictors we have over *one million* models to train! +So although it is a simple method, best subset selection is usually too computationally expensive to use in practice. ```{index} variable selection; forward ``` -Another idea is to iteratively build up a model by adding one predictor variable -at a time. This method—known as *forward selection* {cite:p}`forwardefroymson,forwarddraper`—is also widely +Another idea is to iteratively build up a model by adding one predictor variable +at a time. This method—known as *forward selection* {cite:p}`forwardefroymson,forwarddraper`—is also widely applicable and fairly straightforward. It involves the following steps: 1. Start with a model having no predictors. @@ -1965,13 +1966,13 @@ training over 1000 candidate models with 10 predictors, forward selection requir Therefore we will continue the rest of this section using forward selection. ```{note} -One word of caution before we move on. Every additional model that you train -increases the likelihood that you will get unlucky and stumble +One word of caution before we move on. Every additional model that you train +increases the likelihood that you will get unlucky and stumble on a model that has a high cross-validation accuracy estimate, but a low true accuracy on the test data and other future observations. Since forward selection involves training a lot of models, you run a fairly high risk of this happening. To keep this risk low, only use forward selection -when you have a large amount of data and a relatively small total number of +when you have a large amount of data and a relatively small total number of predictors. More advanced methods do not suffer from this problem as much; see the additional resources at the end of this chapter for where to learn more about advanced predictor selection methods. @@ -1980,7 +1981,7 @@ where to learn more about advanced predictor selection methods. +++ ### Forward selection in `scikit-learn` - + We now turn to implementing forward selection in Python. First we will extract a smaller set of predictors to work with in this illustrative example—`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`—as well as the `Class` variable as the label. 
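Before looking at the implementation, here is a quick check on the model counts discussed above. The sketch below tallies how many candidate models best subset selection ($2^m - 1$) and forward selection (at most $m + (m-1) + \cdots + 1$, if it is run until every predictor has been added) would require for a few values of $m$; the forward-selection count is our own arithmetic based on the algorithm steps listed above, not a figure quoted from the chapter.

```python
# number of candidate models trained for m predictors
for m in [2, 3, 10, 20]:
    best_subset = 2**m - 1        # every non-empty subset of the m predictors
    forward = m * (m + 1) // 2    # m choices in round 1, then m-1, ..., then 1
    print(f"m = {m:2d}: best subset = {best_subset:>9,}, forward selection = {forward}")
```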
@@ -2007,12 +2008,12 @@ cancer_subset ``` To perform forward selection, we could use the -[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) -from `scikit-learn`; but it is difficult to combine this approach with parameter tuning to find a good number of neighbors -for each set of features. Instead we will code the forward selection algorithm manually. -In particular, we need code that tries adding each available predictor to a model, finding the best, and iterating. +[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) +from `scikit-learn`; but it is difficult to combine this approach with parameter tuning to find a good number of neighbors +for each set of features. Instead we will code the forward selection algorithm manually. +In particular, we need code that tries adding each available predictor to a model, finding the best, and iterating. If you recall the end of the wrangling chapter, we mentioned -that sometimes one needs more flexible forms of iteration than what +that sometimes one needs more flexible forms of iteration than what we have used earlier, and in these cases one typically resorts to a *for loop*; see the [control flow section](https://wesmckinney.com/book/python-basics.html#control_for) in @@ -2022,7 +2023,7 @@ Here we will use two for loops: one over increasing predictor set sizes and another to check which predictor to add in each round (where you see `for j in range(len(names))` below). For each set of predictors to try, we extract the subset of predictors, pass it into a preprocessor, build a `Pipeline` that tunes -a K-NN classifier using 10-fold cross-validation, +a K-NN classifier using 10-fold cross-validation, and finally records the estimated accuracy. ```{code-cell} ipython3 @@ -2047,19 +2048,19 @@ cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier()) cancer_tune_grid = GridSearchCV( estimator=cancer_tune_pipe, param_grid=param_grid, - cv=10, + cv=10, n_jobs=-1 ) # for every possible number of predictors for i in range(1, n_total + 1): - accs = np.zeros(len(names)) + accs = np.zeros(len(names)) # for every possible predictor to add for j in range(len(names)): # Add remaining predictor j to the model X = cancer_subset[selected + [names[j]]] y = cancer_subset["Class"] - + # Find the best K for this set of predictors cancer_tune_grid.fit(X, y) accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_) @@ -2067,14 +2068,14 @@ for i in range(1, n_total + 1): # Store the tuned accuracy for this set of predictors accs[j] = accuracies_grid["mean_test_score"].max() - # get the best new set of predictors that maximize cv accuracy + # get the best new set of predictors that maximize cv accuracy best_set = selected + [names[accs.argmax()]] - + # store the results for this round of forward selection accuracy_dict["size"].append(i) accuracy_dict["selected_predictors"].append(", ".join(best_set)) accuracy_dict["accuracy"].append(accs.max()) - + # update the selected & available sets of predictors selected = best_set del names[accs.argmax()] @@ -2091,14 +2092,14 @@ Interesting! The forward selection procedure first added the three meaningful va visualizes the accuracy versus the number of predictors in the model. 
You can see that as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number -of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have -to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). -The way to find that balance is to look for the *elbow* +of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have +to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). +The way to find that balance is to look for the *elbow* in {numref}`fig:06-fwdsel-3`, i.e., the place on the plot where the accuracy stops increasing dramatically and -levels off or begins to decrease. The elbow in {numref}`fig:06-fwdsel-3` appears to occur at the model with +levels off or begins to decrease. The elbow in {numref}`fig:06-fwdsel-3` appears to occur at the model with 3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors occurs with 3 variables: `Perimeter, Concavity, Smoothness`. In other words, we have successfully removed irrelevant -predictors from the model! It is always worth remembering, however, that what cross-validation gives you +predictors from the model! It is always worth remembering, however, that what cross-validation gives you is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy. @@ -2131,13 +2132,13 @@ Estimated accuracy versus the number of predictors for the sequence of models bu ```{note} Since the choice of which variables to include as predictors is part of tuning your classifier, you *cannot use your test data* for this -process! +process! ``` ## Exercises -Practice exercises for the material covered in this chapter -can be found in the accompanying +Practice exercises for the material covered in this chapter +can be found in the accompanying [worksheets repository](https://worksheets.python.datasciencebook.ca) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. @@ -2155,15 +2156,15 @@ and guidance that the worksheets provide will function as intended. - The [`scikit-learn` website](https://scikit-learn.org/stable/) is an excellent reference for more details on, and advanced usage of, the functions and - packages in the past two chapters. Aside from that, it also offers many - useful [tutorials](https://scikit-learn.org/stable/tutorial/index.html) - to get you started. It's worth noting that the `scikit-learn` package + packages in the past two chapters. Aside from that, it also offers many + useful [tutorials](https://scikit-learn.org/stable/tutorial/index.html) + to get you started. It's worth noting that the `scikit-learn` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those - chapters. 
-- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) {cite:p}`james2013introduction` provides + chapters. +- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) {cite:p}`james2013introduction` provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear @@ -2174,7 +2175,7 @@ and guidance that the worksheets provide will function as intended. variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require. - + ## References +++ diff --git a/source/clustering.md b/source/clustering.md index dc1c6759..7dc7815a 100755 --- a/source/clustering.md +++ b/source/clustering.md @@ -20,7 +20,7 @@ kernelspec: # get rid of futurewarnings from sklearn kmeans import warnings -warnings.simplefilter(action='ignore', category=FutureWarning) +warnings.simplefilter(action='ignore', category=FutureWarning) from chapter_preamble import * ``` @@ -39,16 +39,17 @@ including techniques to choose the number of clusters. By the end of the chapter, readers will be able to do the following: -* Describe a case where clustering is appropriate, +- Describe a situation in which clustering is an appropriate technique to use, and what insight it might extract from the data. -* Explain the K-means clustering algorithm. -* Interpret the output of a K-means analysis. -* Differentiate between clustering and classification. -* Identify when it is necessary to scale variables before clustering and do this using Python -* Perform k-means clustering in Python using `scikit-learn` -* Use the elbow method to choose the number of clusters for K-means. -* Visualize the output of k-means clustering in Python using a coloured scatter plot -* Describe advantages, limitations and assumptions of the kmeans clustering algorithm. +- Explain the K-means clustering algorithm. +- Interpret the output of a K-means analysis. +- Differentiate between clustering, classification, and regression. +- Identify when it is necessary to scale variables before clustering, and do this using Python. +- Perform K-means clustering in Python using `scikit-learn`. +- Use the elbow method to choose the number of clusters for K-means. +- Visualize the output of K-means clustering in Python using a colored scatter plot. +- Describe advantages, limitations and assumptions of the K-means clustering algorithm. + ## Clustering @@ -130,7 +131,7 @@ In this chapter we will focus on a data set from [the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) {cite:p}`palmerpenguins`. This data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Site, and includes -measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`. +measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`. Our goal will be to use two variables—penguin bill and flipper length, both in millimeters—to determine whether there are distinct types of penguins in our data. @@ -834,7 +835,7 @@ kmeans To actually run the K-means clustering, we combine the preprocessor and model object in a `Pipeline`, and use the `fit` function. 
Note that the K-means -algorithm uses a random initialization of assignments, but since we set +algorithm uses a random initialization of assignments, but since we set the random seed in the beginning of this chapter, the clustering will be reproducible. ```{code-cell} ipython3 @@ -848,14 +849,14 @@ penguin_clust ```{index} K-means; inertia_, K-means; cluster_centers_, K-means; labels_, K-means; predict ``` -The fit `KMeans` object—which is the second item in the +The fit `KMeans` object—which is the second item in the pipeline, and can be accessed as `penguin_clust[1]`—has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD. -Let's start by visualizing the clusters as a colored scatter plot! In -order to do that, we first need to augment our -original `penguins` data frame with the cluster assignments. -We can access these using the `labels_` attribute of the clustering object -("labels" is a common alternative term to "assignments" in clustering), and +Let's start by visualizing the clusters as a colored scatter plot! In +order to do that, we first need to augment our +original `penguins` data frame with the cluster assignments. +We can access these using the `labels_` attribute of the clustering object +("labels" is a common alternative term to "assignments" in clustering), and add them to the data frame. ```{code-cell} ipython3 @@ -863,9 +864,9 @@ penguins["cluster"] = penguin_clust[1].labels_ penguins ``` -Now that we have the cluster assignments included in the `penguins` data frame, we can +Now that we have the cluster assignments included in the `penguins` data frame, we can visualize them as shown in {numref}`cluster_plot`. -Note that we are plotting the *un-standardized* data here; if we for some reason wanted to +Note that we are plotting the *un-standardized* data here; if we for some reason wanted to visualize the *standardized* data, we would need to use the `fit` and `transform` functions on the `StandardScaler` preprocessor directly to obtain that first. As in {numref}`Chapter %s `, @@ -912,7 +913,7 @@ penguin_clust[1].inertia_ To calculate the total WSSD for a variety of Ks, we will create a data frame that contains different values of `k` -and the WSSD of running KMeans with each values of k. +and the WSSD of running K-means with each values of k. To create this dataframe, we will use what is called a "list comprehension" in Python, where we repeat an operation multiple times @@ -934,10 +935,10 @@ we could square all the numbers from 1-4 and store them in a list: Next, we will use this approach to compute the WSSD for the K-values 1 through 9. For each value of K, -we create a new KMeans model +we create a new `KMeans` model and wrap it in a `scikit-learn` pipeline with the preprocessor we created earlier. -We store the WSSD values in a list that we will use to create a dataframe +We store the WSSD values in a list that we will use to create a dataframe of both the K-values and their corresponding WSSDs. ```{note} @@ -954,7 +955,7 @@ it is always the safest to assign it to a variable name for reuse. ks = range(1, 10) wssds = [ make_pipeline( - preprocessor, + preprocessor, KMeans(n_clusters=k) # Create a new KMeans model with `k` clusters ).fit(penguins)[1].inertia_ for k in ks @@ -1008,7 +1009,7 @@ due to an unlucky initialization of the initial center positions as we mentioned earlier in the chapter. 
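If an unlucky initialization does produce a bump, one common guard (shown here as a sketch that reuses the `ks`, `preprocessor`, and `penguins` objects from above) is to let `KMeans` restart from several random center positions and keep the best run via its `n_init` parameter.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Recompute the WSSDs, but let each KMeans model try 10 random
# initializations and keep the run with the lowest total WSSD.
wssds_restarted = [
    make_pipeline(
        preprocessor,
        KMeans(n_clusters=k, n_init=10)
    ).fit(penguins)[1].inertia_
    for k in ks
]
```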
```{note} -It is rare that the KMeans function from `scikit-learn` +It is rare that the implementation of K-means from `scikit-learn` gets stuck in a bad solution, because `scikit-learn` tries to choose the initial centers carefully to prevent this from happening. If you still find yourself in a situation where you have a bump in the elbow plot, diff --git a/source/index.md b/source/index.md index e75806a1..454fa914 100755 --- a/source/index.md +++ b/source/index.md @@ -15,7 +15,7 @@ kernelspec: ![](img/frontmatter/ds-a-first-intro-graphic.jpg) -# Data Science +# Data Science ## *A First Introduction (Python Edition)* diff --git a/source/inference.md b/source/inference.md index 6e89cc1d..dfb36c07 100755 --- a/source/inference.md +++ b/source/inference.md @@ -36,16 +36,16 @@ populations and then introduce two common techniques in statistical inference: By the end of the chapter, readers will be able to do the following: -* Describe real-world examples of questions that can be answered with statistical inference. -* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. -* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution). -* Explain the difference between a population parameter and a sample point estimate. -* Use Python to draw random samples from a finite population. -* Use Python to create a sampling distribution from a finite population. -* Describe how sample size influences the sampling distribution. -* Define bootstrapping. -* Use Python to create a bootstrap distribution to approximate a sampling distribution. -* Contrast the bootstrap and sampling distributions. +- Describe real-world examples of questions that can be answered with statistical inference. +- Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. +- Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution. +- Explain the difference between a population parameter and a sample point estimate. +- Use Python to draw random samples from a finite population. +- Use Python to create a sampling distribution from a finite population. +- Describe how sample size influences the sampling distribution. +- Define bootstrapping. +- Use Python to create a bootstrap distribution to approximate a sampling distribution. +- Contrast the bootstrap and sampling distributions. +++ @@ -317,7 +317,7 @@ with the `name` parameter: ``` Below we put everything together -and also filter the data frame to keep only the room types +and also filter the data frame to keep only the room types that we are interested in. ```{code-cell} ipython3 @@ -776,7 +776,7 @@ How large is "large enough?" Unfortunately, it depends entirely on the problem a as a rule of thumb, often a sample size of at least 20 will suffice. ``` - ```python import requests @@ -1490,14 +1491,14 @@ import json with open("data/nasa.json", "r") as f: nasa_data = json.load(f) # the last entry in the stored data is July 13, 2023, so print that -nasa_data[-1] +nasa_data[-1] ``` We can obtain more records at once by using the `start_date` and `end_date` parameters, as shown in the table of parameters in {numref}`fig:NASA-API-parameters`. 
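As a rough sketch of what such a range query looks like with the `requests` package, the code below asks for one week of records; the endpoint URL and the placeholder API key are assumptions for illustration, so substitute the URL and key you are actually using (see {numref}`fig:NASA-API-parameters` for the available parameters).

```python
import requests

# Query a one-week range of records. The endpoint URL and API key below are
# placeholders for illustration only.
nasa_week = requests.get(
    "https://api.nasa.gov/planetary/apod",
    params={
        "api_key": "YOUR_API_KEY",
        "start_date": "2023-05-01",
        "end_date": "2023-05-07",
    },
).json()

len(nasa_week)  # 7, one record per day in the requested range
```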
Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result in an object called `nasa_data`; now the response -will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), +will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), and there will be 74 items total, one for each day between the start and end dates: ```python diff --git a/source/regression1.md b/source/regression1.md index 0d66fe4c..d7b23af3 100755 --- a/source/regression1.md +++ b/source/regression1.md @@ -33,7 +33,7 @@ This is unlike the past two chapters, which focused on predicting categorical variables via classification. However, regression does have many similarities to classification: for example, just as in the case of classification, we will split our data into training, validation, and test sets, we will -use `scikit-learn` workflows, we will use a K-nearest neighbors (KNN) +use `scikit-learn` workflows, we will use a K-nearest neighbors (K-NN) approach to make predictions, and we will use cross-validation to choose K. Because of how similar these procedures are, make sure to read {numref}`Chapters %s ` and {numref}`%s ` before reading @@ -51,14 +51,15 @@ however that is beyond the scope of this book. ## Chapter learning objectives By the end of the chapter, readers will be able to do the following: -* Recognize situations where a simple regression analysis would be appropriate for making predictions. -* Explain the K-nearest neighbor (KNN) regression algorithm and describe how it differs from KNN classification. -* Interpret the output of a KNN regression. -* In a data set with two or more variables, perform K-nearest neighbor regression in Python using a `scikit-learn` workflow. -* Execute cross-validation in Python to choose the number of neighbors. -* Evaluate KNN regression prediction accuracy in Python using a test data set and the root mean squared prediction error (RMSPE). -* In the context of KNN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE). -* Describe the advantages and disadvantages of K-nearest neighbors regression. +- Recognize situations where a regression analysis would be appropriate for making predictions. +- Explain the K-nearest neighbors (K-NN) regression algorithm and describe how it differs from K-NN classification. +- Interpret the output of a K-NN regression. +- In a data set with two or more variables, perform K-nearest neighbors regression in Python. +- Evaluate K-NN regression prediction quality in Python using the root mean squared prediction error (RMSPE). +- Estimate the RMSPE in Python using cross-validation or a test set. +- Choose the number of neighbors in K-nearest neighbors regression by minimizing estimated cross-validation RMSPE. +- Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors regression. +- Describe the advantages and disadvantages of K-nearest neighbors regression. +++ @@ -220,10 +221,10 @@ Much like in the case of classification, we can use a K-nearest neighbors-based approach in regression to make predictions. 
Let's take a small sample of the data in {numref}`fig:07-edaRegr` -and walk through how K-nearest neighbors (KNN) works +and walk through how K-nearest neighbors (K-NN) works in a regression context before we dive in to creating our model and assessing how well it predicts house sale price. This subsample is taken to allow us to -illustrate the mechanics of KNN regression with a few data points; later in +illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. ```{index} pandas.DataFrame; sample @@ -371,12 +372,12 @@ Our predicted price is \${glue:text}`knn-5-pred` (shown as a red point in {numref}`fig:07-predictedViz-knn`), which is much less than \$350,000; perhaps we might want to offer less than the list price at which the house is advertised. But this is only the very beginning of the story. We still have all the same -unanswered questions here with KNN regression that we had with KNN +unanswered questions here with K-NN regression that we had with K-NN classification: which $K$ do we choose, and is our model any good at making predictions? In the next few sections, we will address these questions in the -context of KNN regression. +context of K-NN regression. -One strength of the KNN regression algorithm +One strength of the K-NN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships (i.e., if the relationship is not a straight line). @@ -384,7 +385,7 @@ This stems from the use of nearest neighbors to predict values. The algorithm really has very few assumptions about what the data must look like for it to work. -+++ ++++ ## Training, evaluating, and tuning the model @@ -427,11 +428,11 @@ sacramento_train, sacramento_test = train_test_split( ```{index} see: root mean square prediction error; RMSPE ``` -Next, we'll use cross-validation to choose $K$. In KNN classification, we used +Next, we'll use cross-validation to choose $K$. In K-NN classification, we used accuracy to see how well our predictions matched the true labels. We cannot use the same metric in the regression setting, since our predictions will almost never *exactly* match the true response variable values. Therefore in the -context of KNN regression we will use root mean square prediction error (RMSPE) instead. +context of K-NN regression we will use root mean square prediction error (RMSPE) instead. The mathematical formula for calculating RMSPE is: $$\text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2}$$ @@ -524,7 +525,7 @@ Scatter plot of price (USD) versus house size (square feet) with example predict ```{note} When using many code packages, the evaluation output we will get to assess the prediction quality of -our KNN regression models is labeled "RMSE", or "root mean squared +our K-NN regression models is labeled "RMSE", or "root mean squared error". Why is this so, and why not RMSPE? In statistics, we try to be very precise with our language to indicate whether we are calculating the prediction error on the @@ -553,10 +554,10 @@ opposed to the classification problems from the previous chapters. The use of different metrics (instead of accuracy) for tuning and evaluation. Next we specify a parameter grid containing numbers of neighbors ranging from 1 to 200. Then we create a 5-fold `GridSearchCV` object, and -pass in the pipeline and parameter grid. +pass in the pipeline and parameter grid. 
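In outline, that setup might look like the sketch below; the pipeline contents, object names, and column names are assumptions that follow the chapter's Sacramento example rather than its exact code, and the purpose of the `scoring` argument is explained immediately below.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# Pipeline: standardize the predictor, then fit a K-NN regression model.
sacr_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Numbers of neighbors from 1 to 200.
param_grid = {"kneighborsregressor__n_neighbors": range(1, 201)}

sacr_gridsearch = GridSearchCV(
    estimator=sacr_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # see the discussion that follows
)

sacr_gridsearch.fit(
    sacramento_train[["sqft"]],  # predictors as a data frame
    sacramento_train["price"],   # response as a series
)
```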
There is one additional slight complication: unlike classification models in `scikit-learn`---which by default use accuracy for tuning, as desired---regression models in `scikit-learn` -do not use the RMSPE for tuning by default. +do not use the RMSPE for tuning by default. So we need to specify that we want to use the RMSPE for tuning by setting the `scoring` argument to `"neg_root_mean_squared_error"`. @@ -570,7 +571,7 @@ of `sacr_pipeline.get_params()`, as we did in {numref}`Chapter %s `, we can obtain a data frame with a subset of columns by passing a list of column names; `["sqft"]` is a list with one -item, so we obtain a data frame with one column. If instead we used +item, so we obtain a data frame with one column. If instead we used just one bracket (`sacramento_train["sqft"]`), we would obtain a series. In `scikit-learn`, it is easier to work with the input features as a data frame rather than a series, so we opt for two brackets here. On the other hand, the response variable @@ -602,7 +603,7 @@ can be a series, so we use just one bracket there (`sacramento_train["price"]`). As in {numref}`Chapter %s `, once the model has been fit we will wrap the `cv_results_` output in a data frame, extract -only the relevant columns, compute the standard error based on 5 folds, +only the relevant columns, compute the standard error based on 5 folds, and rename the parameter column to be more readable. @@ -630,7 +631,7 @@ sacr_results In the `sacr_results` results data frame, we see that the `n_neighbors` variable contains the values of $K$, and `mean_test_score` variable contains the value of the RMSPE estimated via -cross-validation...Wait a moment! Isn't the RMSPE supposed to be nonnegative? +cross-validation...Wait a moment! Isn't the RMSPE supposed to be nonnegative? Recall that when we specified the `scoring` argument in the `GridSearchCV` object, we used the value `"neg_root_mean_squared_error"`. See the `neg_` at the start? That stands for *negative*! As it turns out, `scikit-learn` always tries to *maximize* a score @@ -644,7 +645,7 @@ sacr_results["mean_test_score"] = -sacr_results["mean_test_score"] sacr_results ``` -Alright, now the `mean_test_score` variable actually has values of the RMSPE +Alright, now the `mean_test_score` variable actually has values of the RMSPE for different numbers of neighbors. Finally, the `sem_test_score` variable contains the standard error of our cross-validation RMSPE estimate, which is a measure of how uncertain we are in the mean value. Roughly, if @@ -687,7 +688,7 @@ glue("fig:07-choose-k-knn-plot", sacr_tunek_plot, display=False) Effect of the number of neighbors on the RMSPE. ::: -To see which parameter value corresponds to the minimum RMSPE, +To see which parameter value corresponds to the minimum RMSPE, we can also access the `best_params_` attribute of the original fit `GridSearchCV` object. Note that it is still useful to visualize the results as we did above since this provides additional information on how the model performance varies. @@ -705,7 +706,7 @@ to be too small or too large, we cause the RMSPE to increase, as shown in {numref}`fig:07-howK` visualizes the effect of different settings of $K$ on the regression model. Each plot shows the predicted values for house sale price from -our KNN regression model for 6 different values for $K$: 1, 3, 25, {glue:text}`best_k_sacr`, 250, and 699 (i.e., all of the training data). 
+our K-NN regression model for 6 different values for $K$: 1, 3, 25, {glue:text}`best_k_sacr`, 250, and 699 (i.e., all of the training data). For each model, we predict prices for the range of possible home sizes we observed in the data set (here 500 to 5,000 square feet) and we plot the predicted prices as a orange line. @@ -765,7 +766,7 @@ glue( :::{glue:figure} fig:07-howK :name: fig:07-howK -Predicted values for house price (represented as a orange line) from KNN regression models for six different values for $K$. +Predicted values for house price (represented as a orange line) from K-NN regression models for six different values for $K$. ::: +++ @@ -823,17 +824,16 @@ chapter. ## Evaluating on the test set To assess how well our model might do at predicting on unseen data, we will -assess its RMSPE on the test data. To do this, we first need to retrain the -KNN regression model on the entire training data set using $K =$ {glue:text}`best_k_sacr` +assess its RMSPE on the test data. To do this, we first need to retrain the +K-NN regression model on the entire training data set using $K =$ {glue:text}`best_k_sacr` neighbors. As we saw in {numref}`Chapter %s ` we do not have to do this ourselves manually; `scikit-learn` does it for us automatically. To make predictions with the best model on the test data, we can use the `predict` method of the fit `GridSearchCV` object. -We then use the `mean_squared_error` -function (with the `y_true` and `y_pred` arguments) +We then use the `mean_squared_error` function (with the `y_true` and `y_pred` arguments) to compute the mean squared prediction error, and finally take the -square root to get the RMSPE. The reason that we do not just use the `score` +square root to get the RMSPE. The reason that we do not just use the `score` method---as in {numref}`Chapter %s `---is that the `KNeighborsRegressor` -model uses a different default scoring metric than the RMSPE. +model uses a different default scoring metric than the RMSPE. ```{code-cell} ipython3 from sklearn.metrics import mean_squared_error @@ -862,7 +862,7 @@ RMSPE estimate of our tuned model (which was \${glue:text}`cv_RMSPE`, so we can say that the model appears to generalize well to new data that it has never seen before. -However, much like in the case of KNN classification, whether this value for RMSPE is *good*—i.e., +However, much like in the case of K-NN classification, whether this value for RMSPE is *good*—i.e., whether an error of around \${glue:text}`test_RMSPE` is acceptable—depends entirely on the application. In this application, this error @@ -906,7 +906,7 @@ base_plot = alt.Chart(sacramento).mark_circle(opacity=0.4).encode( # Add the predictions as a line sacr_preds_plot = base_plot + alt.Chart( - sqft_prediction_grid, + sqft_prediction_grid, title=f"K = {best_k_sacr}" ).mark_line( color="#ff7f0e" @@ -927,14 +927,14 @@ glue("fig:07-predict-all", sacr_preds_plot) :::{glue:figure} fig:07-predict-all :name: fig:07-predict-all -Predicted values of house price (orange line) for the final KNN regression model. +Predicted values of house price (orange line) for the final K-NN regression model. ::: +++ -## Multivariable KNN regression +## Multivariable K-NN regression -As in KNN classification, we can use multiple predictors in KNN regression. +As in K-NN classification, we can use multiple predictors in K-NN regression. In this setting, we have the same concerns regarding the scale of the predictors. 
Once again, predictions are made by identifying the $K$ observations that are nearest to the new point we want to predict; any @@ -943,16 +943,16 @@ variables on a small scale. Hence, we should re-define the preprocessor in the pipeline to incorporate all predictor variables. Note that we also have the same concern regarding the selection of predictors -in KNN regression as in KNN classification: having more predictors is **not** always +in K-NN regression as in K-NN classification: having more predictors is **not** always better, and the choice of which predictors to use has a potentially large influence on the quality of predictions. Fortunately, we can use the predictor selection -algorithm from {numref}`Chapter %s ` in KNN regression as well. +algorithm from {numref}`Chapter %s ` in K-NN regression as well. As the algorithm is the same, we will not cover it again in this chapter. ```{index} K-nearest neighbors; multivariable regression, Sacramento real estate ``` -We will now demonstrate a multivariable KNN regression analysis of the +We will now demonstrate a multivariable K-NN regression analysis of the Sacramento real estate data using `scikit-learn`. This time we will use house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our response variable @@ -991,7 +991,7 @@ Scatter plot of the sale price of houses versus the number of bedrooms. the house sale price tends to increase as well, but that the relationship is quite weak. Does adding the number of bedrooms to our model improve our ability to predict price? To answer that -question, we will have to create a new KNN regression +question, we will have to create a new K-NN regression model using house size and number of bedrooms, and then we can compare it to the model we previously came up with that only used house size. Let's do that now! @@ -1054,7 +1054,7 @@ glue("cv_RMSPE_2pred", "{0:,.0f}".format(min_rmspe_sacr_multi)) ``` Here we see that the smallest estimated RMSPE from cross-validation occurs when $K =$ {glue:text}`best_k_sacr_multi`. -If we want to compare this multivariable KNN regression model to the model with only a single +If we want to compare this multivariable K-NN regression model to the model with only a single predictor *as part of the model tuning process* (e.g., if we are running forward selection as described in the chapter on evaluating and tuning classification models), then we must compare the RMSPE estimated using only the training data via cross-validation. @@ -1065,7 +1065,7 @@ The estimated cross-validation RMSPE for the multivariable model is Thus in this case, we did not improve the model by a large amount by adding this additional predictor. -Regardless, let's continue the analysis to see how we can make predictions with a multivariable KNN regression model +Regardless, let's continue the analysis to see how we can make predictions with a multivariable K-NN regression model and evaluate its performance on test data. As previously, we will use the best model to make predictions on the test data via the `predict` method of the fit `GridSearchCV` object. Finally, we will use the `mean_squared_error` function to compute the RMSPE. 
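In code, that evaluation step might look like the following sketch; the object and column names are assumptions based on the chapter's Sacramento example rather than its exact code.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Predict on the test set with the tuned multivariable model (assumed here to
# be the fit GridSearchCV object), then compute the RMSPE.
test_predictions = sacr_gridsearch.predict(sacramento_test[["sqft", "beds"]])

RMSPE_mult = np.sqrt(
    mean_squared_error(
        y_true=sacramento_test["price"],
        y_pred=test_predictions,
    )
)
RMSPE_mult
```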
@@ -1086,7 +1086,7 @@ RMSPE_mult glue("RMSPE_mult", "{0:,.0f}".format(RMSPE_mult)) ``` -This time, when we performed KNN regression on the same data set, but also +This time, when we performed K-NN regression on the same data set, but also included number of bedrooms as a predictor, we obtained a RMSPE test error of \${glue:text}`RMSPE_mult`. {numref}`fig:07-knn-mult-viz` visualizes the model's predictions overlaid on top of the data. This @@ -1143,7 +1143,7 @@ glue("fig:07-knn-mult-viz", fig) :name: fig:07-knn-mult-viz :figclass: caption-hack -KNN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes. +K-NN regression model's predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes. ``` +++ @@ -1160,9 +1160,9 @@ bedrooms, we would predict the same price for these two houses. +++ -## Strengths and limitations of KNN regression +## Strengths and limitations of K-NN regression -As with KNN classification (or any prediction algorithm for that matter), KNN +As with K-NN classification (or any prediction algorithm for that matter), K-NN regression has both strengths and weaknesses. Some are listed here: **Strengths:** K-nearest neighbors regression diff --git a/source/regression2.md b/source/regression2.md index edd9ea8d..f7245fdf 100755 --- a/source/regression2.md +++ b/source/regression2.md @@ -27,10 +27,10 @@ import plotly.graph_objects as go ## Overview Up to this point, we have solved all of our predictive problems—both classification -and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression, +and regression—using K-nearest neighbors (K-NN)-based approaches. In the context of regression, there is another commonly used method known as *linear regression*. This chapter provides an introduction to the basic concept of linear regression, shows how to use `scikit-learn` to perform linear regression in Python, -and characterizes its strengths and weaknesses compared to KNN regression. The focus is, as usual, +and characterizes its strengths and weaknesses compared to K-NN regression. The focus is, as usual, on the case where there is a single predictor and single response variable of interest; but the chapter concludes with an example using *multivariable linear regression* when there is more than one predictor. @@ -38,9 +38,10 @@ predictor. ## Chapter learning objectives By the end of the chapter, readers will be able to do the following: -* Use Python and `scikit-learn` to fit a linear regression model on training data. -* Evaluate the linear regression model on test data. -* Compare and contrast predictions obtained from K-nearest neighbor regression to those obtained using linear regression from the same data set. +- Use Python to fit simple and multivariable linear regression models on training data. +- Evaluate the linear regression model on test data. 
+- Compare and contrast predictions obtained from K-nearest neighbors regression to those obtained using linear regression from the same data set. +- Describe how linear regression is affected by outliers and multicollinearity. +++ @@ -49,19 +50,19 @@ By the end of the chapter, readers will be able to do the following: ```{index} regression; linear ``` -At the end of the previous chapter, we noted some limitations of KNN regression. -While the method is simple and easy to understand, KNN regression does not +At the end of the previous chapter, we noted some limitations of K-NN regression. +While the method is simple and easy to understand, K-NN regression does not predict well beyond the range of the predictors in the training data, and the method gets significantly slower as the training data set grows. -Fortunately, there is an alternative to KNN regression—*linear regression*—that addresses +Fortunately, there is an alternative to K-NN regression—*linear regression*—that addresses both of these limitations. Linear regression is also very commonly used in practice because it provides an interpretable mathematical equation that describes the relationship between the predictor and response variables. In this first part of the chapter, we will focus on *simple* linear regression, which involves only one predictor variable and one response variable; later on, we will consider *multivariable* linear regression, which involves multiple predictor variables. - Like KNN regression, simple linear regression involves + Like K-NN regression, simple linear regression involves predicting a numerical response variable (like race time, house price, or height); -but *how* it makes those predictions for a new observation is quite different from KNN regression. +but *how* it makes those predictions for a new observation is quite different from K-NN regression. Instead of looking at the K nearest neighbors and averaging over their values for a prediction, in simple linear regression, we create a straight line of best fit through the training data and then @@ -78,8 +79,8 @@ is another popular method for classification called *logistic regression* (it is used for classification even though the name, somewhat confusingly, has the word "regression" in it). In logistic regression—similar to linear regression—you "fit" the model to the training data and then "look up" the prediction for each new observation. -Logistic regression and KNN classification have an advantage/disadvantage comparison -similar to that of linear regression and KNN +Logistic regression and K-NN classification have an advantage/disadvantage comparison +similar to that of linear regression and K-NN regression. It is useful to have a good understanding of linear regression before learning about logistic regression. After reading this chapter, see the "Additional Resources" section at the end of the classification chapters to learn more about logistic regression. @@ -91,7 +92,7 @@ classification chapters to learn more about logistic regression. ``` Let's return to the Sacramento housing data from {numref}`Chapter %s ` to learn -how to apply linear regression and compare it to KNN regression. For now, we +how to apply linear regression and compare it to K-NN regression. For now, we will consider a smaller version of the housing data to help make our visualizations clear. 
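Before we do, it may help to see numerically what "looking up" a prediction on a line of best fit amounts to; the intercept and slope below are made-up numbers used purely for illustration.

```python
# Entirely hypothetical numbers, just to illustrate prediction from a line:
# predicted price = intercept + slope * house size.
intercept = 20000  # hypothetical intercept (USD)
slope = 90         # hypothetical slope (USD per square foot)
house_size = 2000  # square feet

predicted_price = intercept + slope * house_size
predicted_price  # 200000
```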
Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict @@ -290,7 +291,7 @@ the line that minimizes the **average squared vertical distance** between itself each of the observed data points in the training data. {numref}`fig:08-verticalDistToMin` illustrates these vertical distances as red lines. Finally, to assess the predictive accuracy of a simple linear regression model, -we use RMSPE—the same measure of predictive performance we used with KNN regression. +we use RMSPE—the same measure of predictive performance we used with K-NN regression. ```{code-cell} ipython3 :tags: [remove-cell] @@ -337,7 +338,7 @@ Scatter plot of sale price versus size with red lines denoting the vertical dist ``` We can perform simple linear regression in Python using `scikit-learn` in a -very similar manner to how we performed KNN regression. +very similar manner to how we performed K-NN regression. To do this, instead of creating a `KNeighborsRegressor` model object, we use a `LinearRegression` model object; and as usual, we first have to import it from `sklearn`. @@ -375,7 +376,7 @@ sacramento_train, sacramento_test = train_test_split( ) ``` -Now that we have our training data, we will create +Now that we have our training data, we will create and fit the linear regression model object. We will also extract the slope of the line via the `coef_[0]` property, as well as the @@ -510,16 +511,16 @@ glue("fig:08-lm-predict-all", sacr_preds_plot) Scatter plot of sale price versus size with line of best fit for the full Sacramento housing data. ::: -## Comparing simple linear and KNN regression +## Comparing simple linear and K-NN regression ```{index} regression; comparison of methods ``` -Now that we have a general understanding of both simple linear and KNN +Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let's look at the visualization of the simple linear regression model predictions for the Sacramento real estate data -(predicting price from house size) and the "best" KNN regression model +(predicting price from house size) and the "best" K-NN regression model obtained from the same problem, shown in {numref}`fig:08-compareRegression`. ```{code-cell} ipython3 @@ -558,7 +559,7 @@ sacr_rmspe_knn = np.sqrt( # plot knn in-sample predictions overlaid on scatter plot knn_plot_final = ( - alt.Chart(sacr_preds_knn, title="KNN regression") + alt.Chart(sacr_preds_knn, title="K-NN regression") .mark_circle() .encode( x=alt.X("sqft", title="House size (square feet)", scale=alt.Scale(zero=False)), @@ -629,21 +630,21 @@ glue("fig:08-compareRegression", (lm_plot_final | knn_plot_final)) :::{glue:figure} fig:08-compareRegression :name: fig:08-compareRegression -Comparison of simple linear regression and KNN regression. +Comparison of simple linear regression and K-NN regression. ::: +++ What differences do we observe in {numref}`fig:08-compareRegression`? One obvious difference is the shape of the orange lines. In simple linear regression we are -restricted to a straight line, whereas in KNN regression our line is much more +restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. 
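In `scikit-learn`, those two numbers can be read directly off the fitted model object; the sketch below repeats the fitting step described earlier, with column names that follow the chapter's Sacramento example and are assumptions here.

```python
from sklearn.linear_model import LinearRegression

# Fit the simple linear regression of price on house size, then read off the
# two numbers that define the line.
lm = LinearRegression()
lm.fit(sacramento_train[["sqft"]], sacramento_train["price"])

lm.intercept_  # the vertical intercept
lm.coef_[0]    # the slope associated with the sqft predictor
```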
The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the response variable we predict given a unit increase in the predictor -variable. KNN regression, as simple as it is to implement and understand, has no such +variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line. ```{index} underfitting; regression @@ -657,14 +658,14 @@ will underfit (have high bias), meaning that model/predicted values do not match the actual observed values very well. Such a model would probably have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data -set. On such a data set, KNN regression may fare better. Additionally, there +set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future books that may do even better at predicting with such data. How do these two models compare on the Sacramento house prices data set? In {numref}`fig:08-compareRegression`, we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear -regression model is slightly lower than the RMSPE for the KNN regression model. +regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model. @@ -672,17 +673,17 @@ linear regression model. ```{index} extrapolation ``` -Finally, note that the KNN regression model becomes "flat" +Finally, note that the K-NN regression model becomes "flat" at the left and right boundaries of the data, while the linear model predicts a constant slope. Predicting outside the range of the observed -data is known as *extrapolation*; KNN and linear models behave quite differently +data is known as *extrapolation*; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model may have actually predicted a *negative* price for a small house (if the intercept $\beta_0$ was negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, -so the "flat" extrapolation of KNN likely does not match reality. +so the "flat" extrapolation of K-NN likely does not match reality. +++ @@ -696,15 +697,15 @@ so the "flat" extrapolation of KNN likely does not match reality. ```{index} see: multivariable linear equation; plane equation ``` -As in KNN classification and KNN regression, we can move beyond the simple +As in K-NN classification and K-NN regression, we can move beyond the simple case of only one predictor to the case with multiple predictors, known as *multivariable linear regression*. To do this, we follow a very similar approach to what we did for -KNN regression: we just specify the training data by adding more predictors. +K-NN regression: we just specify the training data by adding more predictors. 
But recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. Note once again that we have the same concerns regarding multiple predictors - as in the settings of multivariable KNN regression and classification: having more predictors is **not** always + as in the settings of multivariable K-NN regression and classification: having more predictors is **not** always better. But because the same predictor selection algorithm from {numref}`Chapter %s ` extends to the setting of linear regression, it will not be covered again in this chapter. @@ -715,8 +716,8 @@ it will not be covered again in this chapter. We will demonstrate multivariable linear regression using the Sacramento real estate data with both house size (measured in square feet) as well as number of bedrooms as our predictors, and -continue to use house sale price as our response variable. -The `scikit-learn` framework makes this easy to do: we just need to set +continue to use house sale price as our response variable. +The `scikit-learn` framework makes this easy to do: we just need to set both the `sqft` and `beds` variables as predictors, and then use the `fit` method as usual. @@ -811,10 +812,10 @@ to illustrate what the regression plane looks like for learning purposes. We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the -wiggly, flexible surface we get from other methods such as KNN regression. +wiggly, flexible surface we get from other methods such as K-NN regression. As discussed, this can be advantageous in one aspect, which is that for each predictor, we can get slopes/intercept from linear regression, and thus describe the -plane mathematically. We can extract those slope values from the `coef_` property +plane mathematically. We can extract those slope values from the `coef_` property of our model object, and the intercept from the `intercept_` property, as shown below. @@ -828,10 +829,10 @@ mlm.intercept_ When we have multiple predictor variables, it is not easy to know which variable goes with which coefficient in `mlm.coef_`. In particular, -you will see that `mlm.coef_` above is just an array of values without any variable names. +you will see that `mlm.coef_` above is just an array of values without any variable names. Unfortunately you have to do this mapping yourself: the coefficients in `mlm.coef_` appear in the *same order* as the columns of the predictor data frame you used when training. -So since we used `sacramento_train[["sqft", "beds"]]` when training, +So since we used `sacramento_train[["sqft", "beds"]]` when training, we have that `mlm.coef_[0]` corresponds to `sqft`, and `mlm.coef_[1]` corresponds to `beds`. Once you sort out the correspondence, you can then use those slopes to write a mathematical equation to describe the prediction plane: @@ -863,15 +864,15 @@ glue("bedsc", bedsc) $\text{house sale price} =$ {glue:text}`icept` $+$ {glue:text}`sqftc` $\cdot (\text{house size})$ {glue:text}`bedsc` $\cdot (\text{number of bedrooms})$ -This model is more interpretable than the multivariable KNN +This model is more interpretable than the multivariable K-NN regression model; we can write a mathematical equation that explains how each predictor is affecting the predictions. 
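To keep that interpretation convenient, one small trick (a sketch, not the chapter's own code) is to label the coefficients with the training column names, making the correspondence discussed above explicit.

```python
import pandas as pd

# Label each coefficient with the column it belongs to, using the same order
# as the columns of the training data frame.
coefs = pd.Series(mlm.coef_, index=["sqft", "beds"])
coefs
```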
But as always, we should question how well multivariable linear regression is doing compared to the other tools we have, such as simple linear regression -and multivariable KNN regression. If this comparison is part of +and multivariable K-NN regression. If this comparison is part of the model tuning process—for example, if we are trying out many different sets of predictors for multivariable linear -and KNN regression—we must perform this comparison using +and K-NN regression—we must perform this comparison using cross-validation on only our training data. But if we have already decided on a small number (e.g., 2 or 3) of tuned candidate models and we want to make a final comparison, we can do so by comparing the prediction @@ -886,7 +887,7 @@ lm_mult_test_RMSPE We obtain an RMSPE for the multivariable linear regression model of \${glue:text}`sacr_mult_RMSPE`. This prediction error - is less than the prediction error for the multivariable KNN regression model, + is less than the prediction error for the multivariable K-NN regression model, indicating that we should likely choose linear regression for predictions of house sale price on this data set. Revisiting the simple linear regression model with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was @@ -1361,7 +1362,7 @@ and guidance that the worksheets provide will function as intended. of "informative" predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the - computational efficiency of linear regression. In contrast, the KNN methods we + computational efficiency of linear regression. In contrast, the K-NN methods we covered earlier are indeed more flexible but become very slow when given lots of data. diff --git a/source/setup.md b/source/setup.md index 45f81c3e..a540198d 100755 --- a/source/setup.md +++ b/source/setup.md @@ -21,8 +21,8 @@ kernelspec: In this chapter, you'll learn how to set up the software needed to follow along with this book on your own computer. Given that installation instructions can vary based on computer setup, we provide instructions for -multiple operating systems (Ubuntu Linux, MacOS, and Windows). -Although the instructions in this chapter will likely work on many systems, +multiple operating systems (Ubuntu Linux, MacOS, and Windows). +Although the instructions in this chapter will likely work on many systems, we have specifically verified that they work on a computer that: - runs Windows 10 Home, MacOS 13 Ventura, or Ubuntu 22.04, @@ -38,18 +38,18 @@ By the end of the chapter, readers will be able to do the following: - Download the worksheets that accompany this book. - Install the Docker virtualization engine. - Edit and run the worksheets using JupyterLab running inside a Docker container. -- Install Git, JupyterLab Desktop, and python packages. +- Install Git, JupyterLab Desktop, and Python packages. - Edit and run the worksheets using JupyterLab Desktop. ## Obtaining the worksheets for this book -The worksheets containing exercises for this book +The worksheets containing exercises for this book are online at [https://worksheets.python.datasciencebook.ca](https://worksheets.python.datasciencebook.ca). The worksheets can be launched directly from that page using the Binder links in the rightmost -column of the table. 
This is the easiest way to access the worksheets, but note that you will not +column of the table. This is the easiest way to access the worksheets, but note that you will not be able to save your work and return to it again later. -In order to save your progress, you will need to download the worksheets to your own computer and -work on them locally. You can download the worksheets as a compressed zip file +In order to save your progress, you will need to download the worksheets to your own computer and +work on them locally. You can download the worksheets as a compressed zip file using [the link at the top of the page](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/archive/refs/heads/main.zip). Once you unzip the downloaded file, you will have a folder containing all of the Jupyter notebook worksheets accompanying this book. See {numref}`Chapter %s ` for @@ -64,7 +64,7 @@ software packages, not to mention getting the right versions of everything—the worksheets and autograder tests may not work unless all the versions are exactly right! To keep things simple, we instead recommend that you install [Docker](https://docker.com). Docker lets you run your Jupyter notebooks inside -a pre-built *container* that comes with precisely the right versions of +a pre-built *container* that comes with precisely the right versions of all software packages needed run the worksheets that come with this book. ```{index} Docker ``` @@ -73,15 +73,15 @@ all software packages needed run the worksheets that come with this book. A *container* is a virtualized user space within your computer. Within the container, you can run software in isolation without interfering with the other software that already exists on your machine. In this book, we use -a container to run a specific version of the python programming +a container to run a specific version of the Python programming language, as well as other necessary packages. The container ensures that -the worksheets function correctly, even if you have a different version of python -installed on your computer—or even if you haven't installed python at all! +the worksheets function correctly, even if you have a different version of Python +installed on your computer—or even if you haven't installed Python at all! ``` ### Windows -**Installation** To install Docker on Windows, +**Installation** To install Docker on Windows, visit [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/), and download the `Docker Desktop Installer.exe` file. Double-click the file to open the installer and follow the instructions on the installation wizard, choosing **WSL-2** instead of **Hyper-V** when prompted. @@ -90,27 +90,27 @@ and follow the instructions on the installation wizard, choosing **WSL-2** inste Occasionally, when you first run Docker on Windows, you will encounter an error message. Some common errors you may see: - If you need to update WSL, you can enter `cmd.exe` in the Start menu to run the command line. Type `wsl --update` to update WSL. -- If the admin account on your computer is different to your user account, you must add the user to the "docker-users" group. - Run Computer Management as an administrator and navigate to `Local Users` and `Groups -> Groups -> docker-users`. Right-click to +- If the admin account on your computer is different to your user account, you must add the user to the "docker-users" group. 
+ Run Computer Management as an administrator and navigate to `Local Users` and `Groups -> Groups -> docker-users`. Right-click to add the user to the group. Log out and log back in for the changes to take effect. - If you need to enable virtualization, you will need to edit your BIOS. Restart your computer, and enter the BIOS using the hotkey (usually Delete, Esc, and/or one of the F# keys). Look for an "Advanced" menu, and under your CPU settings, set the "Virtualization" option - to "enabled". Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert - to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. + to "enabled". Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert + to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. ``` **Running JupyterLab** Run Docker Desktop. Once it is running, you need to download and run the -Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a +Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a computer with all the right packages pre-installed). You only need to do this step one time; the image will remain the next time you run Docker Desktop. -In the Docker Desktop search bar, enter `ubcdsci/py-dsci-100`, as this is +In the Docker Desktop search bar, enter `ubcdsci/py-dsci-100`, as this is the name of the image. You will see the `ubcdsci/py-dsci-100` image in the list ({numref}`docker-desktop-search`), and "latest" in the Tag drop down menu. We need to change "latest" to the right image version before proceeding. -To find the right tag, open +To find the right tag, open the [`Dockerfile` in the worksheets repository](https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/main/Dockerfile), and look for the line `FROM ubcdsci/py-dsci-100:` followed by the tag consisting of a sequence of numbers and letters. -Back in Docker Desktop, in the "Tag" drop down menu, click that tag to select the correct image version. Then click -the "Pull" button to download the image. +Back in Docker Desktop, in the "Tag" drop down menu, click that tag to select the correct image version. Then click +the "Pull" button to download the image. ```{figure} img/setup/docker-1.png --- @@ -148,10 +148,10 @@ name: docker-desktop-runconfig The Docker Desktop container run configuration menu. ``` -After clicking the "Run" button, you will see a terminal. The terminal will then print -some text as the Docker container starts. Once the text stops scrolling, find the -URL in the terminal that starts -with `http://127.0.0.1:8888` (highlighted by the red box in {numref}`docker-desktop-url`), and paste it +After clicking the "Run" button, you will see a terminal. The terminal will then print +some text as the Docker container starts. Once the text stops scrolling, find the +URL in the terminal that starts +with `http://127.0.0.1:8888` (highlighted by the red box in {numref}`docker-desktop-url`), and paste it into your browser to start JupyterLab. ```{figure} img/setup/docker-4.png @@ -162,11 +162,11 @@ name: docker-desktop-url The terminal text after running the Docker container. The red box indicates the URL that you should paste into your browser to open JupyterLab. 
``` -When you are done working, make sure to shut down and remove the container by +When you are done working, make sure to shut down and remove the container by clicking the red trash can symbol (in the top right corner of {numref}`docker-desktop-url`). You will not be able to start the container again until you do so. More information on installing and running -Docker on Windows, as well as troubleshooting tips, can +Docker on Windows, as well as troubleshooting tips, can be found in [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/). ### MacOS @@ -174,18 +174,18 @@ be found in [the online Docker documentation](https://docs.docker.com/desktop/in **Installation** To install Docker on MacOS, visit [the online Docker documentation](https://docs.docker.com/desktop/install/mac-install/), and download the `Docker.dmg` installation file that is appropriate for your -computer. To know which installer is right for your machine, you need to know +computer. To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the [Apple support page](https://support.apple.com/en-ca/HT211814) has -information to help you determine which processor you have. Once downloaded, double-click +information to help you determine which processor you have. Once downloaded, double-click the file to open the installer, then drag the Docker icon to the Applications folder. -Double-click the icon in the Applications folder to start Docker. In the installation +Double-click the icon in the Applications folder to start Docker. In the installation window, use the recommended settings. **Running JupyterLab** Run Docker Desktop. Once it is running, follow the instructions above in the Windows section on *Running JupyterLab* (the user interface is the same). More information on installing and running Docker on -MacOS, as well as troubleshooting tips, can be +MacOS, as well as troubleshooting tips, can be found in [the online Docker documentation](https://docs.docker.com/desktop/install/mac-install/). ### Ubuntu @@ -206,8 +206,8 @@ the following command, replacing `TAG` with the *tag* you found earlier. ``` docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/py-dsci-100:TAG jupyter lab ``` -The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the -URL in your terminal that starts with `http://127.0.0.1:8888` (highlighted by the +The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the +URL in your terminal that starts with `http://127.0.0.1:8888` (highlighted by the red box in {numref}`ubuntu-docker-terminal`), and paste it into your browser to start JupyterLab. More information on installing and running Docker on Ubuntu, as well as troubleshooting tips, can be found in [the online Docker documentation](https://docs.docker.com/engine/install/ubuntu/). @@ -226,23 +226,23 @@ The terminal text after running the Docker container in Ubuntu. The red box indi You can also run the worksheets accompanying this book on your computer using [JupyterLab Desktop](https://github.com/jupyterlab/jupyterlab-desktop). The advantage of JupyterLab Desktop over Docker is that it can be easier to install; -Docker can sometimes run into some fairly technical issues (especially on Windows computers) -that require expert troubleshooting. 
The downside of JupyterLab Desktop is that there is a (very) small chance that -you may not end up with the right versions of all the python packages needed for the worksheets. Docker, on the other hand, -*guarantees* that the worksheets will work exactly as intended. +Docker can sometimes run into some fairly technical issues (especially on Windows computers) +that require expert troubleshooting. The downside of JupyterLab Desktop is that there is a (very) small chance that +you may not end up with the right versions of all the Python packages needed for the worksheets. Docker, on the other hand, +*guarantees* that the worksheets will work exactly as intended. In this section, we will cover how to install JupyterLab Desktop, Git and the JupyterLab Git extension (for version control, as discussed in {numref}`Chapter %s `), and -all of the python packages needed to run +all of the Python packages needed to run the code in this book. ```{index} JupyterLab Desktop, git;installation ``` ### Windows -**Installation** First, we will install Git for version control. -Go to [the Git download page](https://git-scm.com/download/win) and -download the Windows version of Git. Once the download has finished, run the installer and accept +**Installation** First, we will install Git for version control. +Go to [the Git download page](https://git-scm.com/download/win) and +download the Windows version of Git. Once the download has finished, run the installer and accept the default configuration for all pages. Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation). Download the `JupyterLab-Setup-Windows.exe` installer file for Windows. @@ -265,8 +265,8 @@ The JupyterLab Desktop graphical user interface. Next, we need to add the JupyterLab Git extension (so that we can use version control directly from within JupyterLab Desktop), -the IPython kernel (to enable the python programming language), -and various python software packages. Click "New session..." in the JupyterLab Desktop +the IPython kernel (to enable the Python programming language), +and various Python software packages. Click "New session..." in the JupyterLab Desktop user interface, then scroll to the bottom, and click "Terminal" under the "Other" heading ({numref}`setup-jlab-gui-2`). ```{figure} img/setup/jlab-2.png @@ -283,29 +283,29 @@ In this terminal, run the following commands: pip install --upgrade jupyterlab-git conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/main/environment.yml ``` -The second command installs the specific python and package versions specified in -the `environment.yml` file found in +The second command installs the specific Python and package versions specified in +the `environment.yml` file found in [the worksheets repository](https://worksheets.python.datasciencebook.ca). We will always keep the versions in the `environment.yml` file updated so that they are compatible with the exercise worksheets that accompany the book. -Once all of the software installation is complete, it is a good idea to restart +Once all of the software installation is complete, it is a good idea to restart JupyterLab Desktop entirely before you proceed to doing your data analysis. -This will ensure all the software and settings you put in place are +This will ensure all the software and settings you put in place are correctly set up and ready for use. 
### MacOS -**Installation** First, we will install Git for version control. -Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY)) +**Installation** First, we will install Git for version control. +Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY)) and type the following command: ``` xcode-select --install ``` Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation). -Download the `JupyterLab-Setup-MacOS-x64.dmg` or `JupyterLab-Setup-MacOS-arm64.dmg` installer file. -To know which installer is right for your machine, you need to know +Download the `JupyterLab-Setup-MacOS-x64.dmg` or `JupyterLab-Setup-MacOS-arm64.dmg` installer file. +To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the [Apple support page](https://support.apple.com/en-ca/HT211814) has information to help you determine which processor you have. @@ -316,11 +316,11 @@ the icon in the Applications folder to start JupyterLab Desktop. **Configuring JupyterLab Desktop** From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on *Configuring JupyterLab Desktop* to set up the environment, install the JupyterLab Git extension, and install -the various python software packages needed for the worksheets. +the various Python software packages needed for the worksheets. ### Ubuntu -**Installation** First, we will install Git for version control. +**Installation** First, we will install Git for version control. Open the terminal and type the following commands: ``` sudo apt update @@ -340,4 +340,4 @@ jlab **Configuring JupyterLab Desktop** From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on *Configuring JupyterLab Desktop* to set up the environment, install the JupyterLab Git extension, and install -the various python software packages needed for the worksheets. +the various Python software packages needed for the worksheets. diff --git a/source/viz.md b/source/viz.md index 40de0dea..1f56039d 100755 --- a/source/viz.md +++ b/source/viz.md @@ -40,16 +40,18 @@ By the end of the chapter, readers will be able to do the following: - bar plots - histogram plots - Given a data set and a question, select from the above plot types and use Python to create a visualization that best answers the question. -- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question. +- Evaluate the effectiveness of a visualization and suggest improvements to better answer a given question. - Referring to the visualization, communicate the conclusions in non-technical terms. - Identify rules of thumb for creating effective visualizations. 
-- Define the two key aspects of altair charts: - - graphical marks - - encoding channels -- Use the altair library in Python to create and refine the above visualizations using: - - graphical marks: `mark_point`, `mark_line`, `mark_bar` +- Use the `altair` library in Python to create and refine the above visualizations using: + - graphical marks: `mark_point`, `mark_line`, `mark_circle`, `mark_bar`, `mark_rule` - encoding channels: `x`, `y`, `color`, `shape` + - labeling: `title` + - transformations: `scale` - subplots: `facet` +- Define the two key aspects of `altair` charts: + - graphical marks + - encoding channels - Describe the difference in raster and vector output formats. - Use `chart.save()` to save visualizations in `.png` and `.svg` format. @@ -611,7 +613,7 @@ can_lang ```{code-cell} ipython3 :tags: ["remove-cell"] # use only nonzero entries (to avoid issues with log scale), and wrap in a pd.DataFrame to prevent copy/view warnings later -can_lang = pd.DataFrame(can_lang[(can_lang["most_at_home"] > 0) & (can_lang["mother_tongue"] > 0)]) +can_lang = pd.DataFrame(can_lang[(can_lang["most_at_home"] > 0) & (can_lang["mother_tongue"] > 0)]) ``` ```{index} altair; mark_circle diff --git a/source/wrangling.md b/source/wrangling.md index f7f94564..2c400af3 100755 --- a/source/wrangling.md +++ b/source/wrangling.md @@ -36,28 +36,25 @@ application, providing more practice working through a whole case study. By the end of the chapter, readers will be able to do the following: - - Define the term "tidy data". - - Discuss the advantages of storing data in a tidy data format. - - Define what series and data frames are in Python, and describe how they relate to - each other. - - Describe the common types of data in Python and their uses. - - Recall and use the following functions for their - intended data wrangling tasks: - - `agg` - - `assign` (as well as regular column assignment) - - `groupby` - - `melt` - - `pivot` - - `str.split` - - Recall and use the following operators for their - intended data wrangling tasks: - - `==`, `!=`, `<`, `>`, `<=`, `>=` - - `in` - - `and` - - `or` - - `[]` - - `loc[]` - - `iloc[]` +- Define the term "tidy data". +- Discuss the advantages of storing data in a tidy data format. +- Define what series and data frames are in Python, and describe how they relate to + each other. +- Describe the common types of data in Python and their uses. +- Use the following functions for their intended data wrangling tasks: + - `melt` + - `pivot` + - `reset_index` + - `str.split` + - `agg` + - `assign` and regular column assignment + - `groupby` + - `merge` +- Use the following operators for their intended data wrangling tasks: + - `==`, `!=`, `<`, `>`, `<=`, and `>=` + - `isin` + - `&` and `|` + - `[]`, `loc[]`, and `iloc[]` ## Data frames and series @@ -838,7 +835,7 @@ one can use in the `[]` to select subsets of rows. Recall that if we provide a list of column names, `[]` returns the subset of columns with those names as a data frame. Suppose we wanted to select the columns `language`, `region`, `most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we -learned in {numref}`Chapter %s `, we can pass all of these column +learned in {numref}`Chapter %s `, we can pass all of these column names into the square brackets. ```{code-cell} ipython3 @@ -1042,8 +1039,8 @@ The `[]` operation is only used when you want to either filter rows **or** selec it cannot be used to do both operations at the same time. This is where `loc[]` comes in. 
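To make the contrast concrete before returning to the `tidy_lang` example, here is a minimal self-contained sketch using a toy data frame (the column names and values are invented purely for illustration): `[]` can filter rows *or* select columns, but only `loc[]` can do both in a single step.

```python
import pandas as pd

# A toy data frame standing in for a larger data set.
df = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Vancouver"],
    "language": ["English", "French", "English"],
    "count": [10, 5, 7],
})

df[df["region"] == "Toronto"]        # [] with a logical statement: filters rows
df[["region", "language"]]           # [] with a list of names: selects columns
df.loc[df["region"] == "Toronto", ["region", "language"]]  # loc[]: both at once
```
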
For the first example, recall `loc[]` from {numref}`Chapter %s `, which lets us create a subset of the rows and columns in the `tidy_lang` data frame. -In the first argument to `loc[]`, we specify a logical statement that -filters the rows to only those pertaining to the Toronto region, +In the first argument to `loc[]`, we specify a logical statement that +filters the rows to only those pertaining to the Toronto region, and the second argument specifies a list of columns to keep by name. ```{code-cell} ipython3 @@ -1055,11 +1052,11 @@ tidy_lang.loc[ ``` In addition to simultaneous subsetting of rows and columns, `loc[]` has two -more special capabilities beyond those of `[]`. First, `loc[]` has the ability to specify *ranges* of rows and columns. -For example, note that the list of columns `language`, `region`, `most_at_home`, `most_at_work` +more special capabilities beyond those of `[]`. First, `loc[]` has the ability to specify *ranges* of rows and columns. +For example, note that the list of columns `language`, `region`, `most_at_home`, `most_at_work` corresponds to the *range* of columns from `language` to `most_at_work`. Rather than explicitly listing all of the column names as we did above, -we can ask for the range of columns `"language":"most_at_work"`; the `:`-syntax +we can ask for the range of columns `"language":"most_at_work"`; the `:`-syntax denotes a range, and is supported by the `loc[]` function, but not by `[]`. ```{code-cell} ipython3 @@ -1490,7 +1487,7 @@ region_lang_nums.info() ``` You can now see that the columns from `mother_tongue` to `lang_known` are type `int32`, and that we have obtained a data frame with the same number of columns and rows -as the input data frame. +as the input data frame. The second situation occurs when you want to apply a function across columns within each individual row, i.e., *row-wise*. This operation, illustrated in {numref}`fig:rowwise`, @@ -1520,7 +1517,7 @@ We see that we obtain a series containing the maximum value between `mother_tong is often the case that we want to include a column result from a row-wise operation as a new column in the data frame, so that we can make plots or continue our analysis. To make this happen, -we will use column assignment or the `assign` function to create a new column. +we will use column assignment or the `assign` function to create a new column. This is discussed in the next section. ```{note} @@ -1554,7 +1551,7 @@ You can see above that the `region_lang` data frame now has an additional column The `maximum` column contains the maximum value between `mother_tongue`, `most_at_home`, `most_at_work` and `lang_known` for each language -and region, just as we specified! +and region, just as we specified! To instead create an entirely new data frame, we can use the `assign` method and specify one argument for each column we want to create. In this case we want to create one new column named `maximum`, so the argument @@ -1670,7 +1667,7 @@ See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stab english_lang ``` Wait a moment...what is that warning message? It seems to suggest that something went wrong, but -if we inspect the `english_lang` data frame above, it looks like the city populations were added +if we inspect the `english_lang` data frame above, it looks like the city populations were added just fine! As it turns out, this is caused by the earlier filtering we did from `region_lang` to produce the original `english_lang`. 
The details are a little bit technical, but `pandas` sometimes does not like it when you subset a data frame using `[]` or `loc[]` followed by @@ -1733,14 +1730,14 @@ english_lang = region_lang[ :tags: ["output_scroll"] english_lang ``` -We then added the populations of these cities as a column +We then added the populations of these cities as a column (Toronto: 5928040, Montréal: 4098927, Vancouver: 2463431, Calgary: 1392609, and Edmonton: 1321426). We had to be careful to add those populations in the right order; this is an error-prone process. An alternative approach, that we demonstrate here is to (1) create a new data frame with the city names and populations, and (2) use `merge` to combine the two data frames, recognizing that the "regions" are the same. -We create a new data frame by calling `pd.DataFrame` with a dictionary +We create a new data frame by calling `pd.DataFrame` with a dictionary as its argument. The dictionary associates each column name in the data frame to be created with a list of entries. Here we list city names in a column called `"region"` and their populations in a column called `"population"`.
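Putting those two steps together, here is a compact, self-contained sketch of the idea. The population values are the ones quoted above; the `english_lang` data frame is replaced by a toy version so that the sketch runs on its own.

```python
import pandas as pd

# Step (1): a data frame pairing each region with its population
# (the population values quoted in the text above).
city_populations = pd.DataFrame({
    "region": ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    "population": [5928040, 4098927, 2463431, 1392609, 1321426],
})

# A toy stand-in for the filtered `english_lang` data frame, with the
# regions deliberately listed in a different order.
english_lang = pd.DataFrame({
    "region": ["Montréal", "Toronto", "Calgary", "Edmonton", "Vancouver"],
    "language": ["English"] * 5,
})

# Step (2): merge on the shared "region" column; rows are matched by value,
# so the order of the regions in the two data frames does not matter.
english_lang = english_lang.merge(city_populations, on="region")
print(english_lang)
```

Because `merge` matches rows by the values in `"region"`, there is no need to add the populations in the right order by hand, which was the error-prone part of the earlier approach.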