diff --git a/source/acknowledgements.md b/source/acknowledgements.md index 233e18fc..751e2a85 100755 --- a/source/acknowledgements.md +++ b/source/acknowledgements.md @@ -58,7 +58,7 @@ We would like to give special thanks to Navya Dahiya and Gloria Ye for completing the first round of translation of the R material to Python, and to Philip Austin for his leadership and guidance throughout the translation process. We also gratefully acknowledge the UBC Open Educational Resources Fund -and the UBC Department of Statistics for supporting the translation of -the original R textbook and exercises to the python programming language. +and the UBC Department of Statistics for supporting the translation of +the original R textbook and exercises to the Python programming language. diff --git a/source/authors.md b/source/authors.md index e365683c..a622a9f6 100755 --- a/source/authors.md +++ b/source/authors.md @@ -20,6 +20,16 @@ Campbell, and Melissa Lee for the R programming language. The content of the R textbook was adapted to Python by Trevor Campbell, Joel Ostblom, and Lindsey Heagy. +**[Tiffany Timbers](https://www.tiffanytimbers.com/)** is an Associate Professor of Teaching in the Department of +Statistics and Co-Director for the Master of Data Science program (Vancouver +Option) at the University of British Columbia. In these roles she teaches and +develops curriculum around the responsible application of Data Science to solve +real-world problems. One of her favorite courses she teaches is a graduate +course on collaborative software development, which focuses on teaching how to +create R and Python packages using modern tools and workflows. + ++++ + **[Trevor Campbell](https://trevorcampbell.me/)** is an Associate Professor in the Department of Statistics at the University of British Columbia. His research focuses on automated, scalable Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and @@ -32,15 +42,6 @@ program at the University of Toronto. +++ -**[Tiffany Timbers](https://www.tiffanytimbers.com/)** is an Associate Professor of Teaching in the Department of -Statistics and Co-Director for the Master of Data Science program (Vancouver -Option) at the University of British Columbia. In these roles she teaches and -develops curriculum around the responsible application of Data Science to solve -real-world problems. One of her favorite courses she teaches is a graduate -course on collaborative software development, which focuses on teaching how to -create R and Python packages using modern tools and workflows. -+++ - **[Melissa Lee](https://www.stat.ubc.ca/users/melissa-lee)** is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. She teaches and develops curriculum for undergraduate statistics and data science courses. Her work @@ -50,19 +51,8 @@ initiatives. +++ -**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric -Sciences and director of the Geophysical Inversion Facility at the University of British Columbia. -Her research combines computational methods in numerical simulations, inversions, and machine -learning to answer questions about the subsurface of the Earth. Primary applications include -mineral exploration, carbon sequestration, groundwater and environmental studies. 
She -completed her BSc at the University of Alberta, her PhD at the University of British Columbia, -and held a Postdoctoral research position at the University of California Berkeley prior to -starting her current position at UBC. - -+++ - **[Joel Ostblom](https://joelostblom.com/)** is an Assistant Professor of Teaching in the Department of -Statistics at the University of British Columbia. +Statistics at the University of British Columbia. During his PhD, Joel developed a passion for data science and reproducibility through the development of quantitative image analysis pipelines for studying stem cell and developmental biology. He has since co-created or lead the @@ -71,3 +61,15 @@ is now an assistant professor of teaching in the statistics department at the University of British Columbia. Joel cares deeply about spreading data literacy and excitement over programmatic data analysis, which is reflected in his contributions to open source projects and data science learning resources. + ++++ + +**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric +Sciences and director of the Geophysical Inversion Facility at the University of British Columbia. +Her research combines computational methods in numerical simulations, inversions, and machine +learning to answer questions about the subsurface of the Earth. Primary applications include +mineral exploration, carbon sequestration, groundwater and environmental studies. She +completed her BSc at the University of Alberta, her PhD at the University of British Columbia, +and held a Postdoctoral research position at the University of California Berkeley prior to +starting her current position at UBC. + diff --git a/source/classification1.md b/source/classification1.md index 38b14e42..a393f295 100755 --- a/source/classification1.md +++ b/source/classification1.md @@ -25,12 +25,12 @@ import plotly.graph_objects as go (classification1)= # Classification I: training & predicting -## Overview +## Overview In previous chapters, we focused solely on descriptive and exploratory -data analysis questions. +data analysis questions. This chapter and the next together serve as our first foray into answering *predictive* questions about data. In particular, we will -focus on *classification*, i.e., using one or more +focus on *classification*, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make @@ -38,7 +38,7 @@ predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy. -## Chapter learning objectives +## Chapter learning objectives By the end of the chapter, readers will be able to do the following: @@ -46,11 +46,10 @@ By the end of the chapter, readers will be able to do the following: - Describe what a training data set is and how it is used in classification. - Interpret the output of a classifier. - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables. -- Explain the $K$-nearest neighbor classification algorithm. -- Perform $K$-nearest neighbor classification in Python using `scikit-learn`. 
-- Use `StandardScaler` and `make_column_transformer` to preprocess data to be centered and scaled. -- Use `sample` to preprocess data to be balanced. -- Combine preprocessing and model training using `make_pipeline`. +- Explain the K-nearest neighbors classification algorithm. +- Perform K-nearest neighbors classification in Python using `scikit-learn`. +- Use methods from `scikit-learn` to center, scale, balance, and impute data as a preprocessing step. +- Combine preprocessing and model training into a `Pipeline` using `make_pipeline`. +++ @@ -66,7 +65,7 @@ In many situations, we want to make predictions based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor's past experience with patients; an email provider might want to tag a given -email as "spam" or "not spam" based on the email's text and past email text data; +email as "spam" or "not spam" based on the email's text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of **classification**, i.e., predicting a @@ -76,7 +75,7 @@ other variables (sometimes called *features*). ```{index} training set ``` -Generally, a classifier assigns an observation without a known class (e.g., a new patient) +Generally, a classifier assigns an observation without a known class (e.g., a new patient) to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations for which we do know the class (e.g., previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for @@ -89,14 +88,14 @@ the classifier to make predictions on new data for which we do not know the clas There are many possible methods that we could use to predict a categorical class/label for an observation. In this book, we will -focus on the widely used **$K$-nearest neighbors** algorithm {cite:p}`knnfix,knncover`. +focus on the widely used **K-nearest neighbors** algorithm {cite:p}`knnfix,knncover`. In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. It is also worth mentioning that there are many -variations on the basic classification problem. For example, +variations on the basic classification problem. For example, we focus on the setting of **binary classification** where only two -classes are involved (e.g., a diagnosis of either healthy or diseased), but you may +classes are involved (e.g., a diagnosis of either healthy or diseased), but you may also run into multiclass classification problems with more than two categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold). @@ -105,16 +104,16 @@ categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common col ```{index} breast cancer, question; classification ``` -In this chapter and the next, we will study a data set of +In this chapter and the next, we will study a data set of [digitized breast cancer image features](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian {cite:p}`streetbreastcancer`. 
Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). -Diagnosis for each image was conducted by physicians. +Diagnosis for each image was conducted by physicians. As with all data analyses, we first need to formulate a precise question that -we want to answer. Here, the question is *predictive*: can +we want to answer. Here, the question is *predictive*: can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor? Answering this @@ -162,24 +161,24 @@ Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured -for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean +for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is: -1. ID: identification number +1. ID: identification number 2. Class: the diagnosis (M = malignant or B = benign) 3. Radius: the mean of distances from center to points on the perimeter 4. Texture: the standard deviation of gray-scale values -5. Perimeter: the length of the surrounding contour +5. Perimeter: the length of the surrounding contour 6. Area: the area inside the contour 7. Smoothness: the local variation in radius lengths 8. Compactness: the ratio of squared perimeter and area -9. Concavity: severity of concave portions of the contour +9. Concavity: severity of concave portions of the contour 10. Concave Points: the number of concave portions of the contour -11. Symmetry: how similar the nucleus is when mirrored +11. Symmetry: how similar the nucleus is when mirrored 12. Fractal Dimension: a measurement of how "rough" the perimeter is +++ @@ -187,7 +186,7 @@ total set of variables per image in this data set is: ```{index} info ``` -Below we use the `info` method to preview the data frame. This method can +Below we use the `info` method to preview the data frame. This method can make it easier to inspect the data when we have a lot of columns: it prints only the column names down the page (instead of across), as well as their data types and the number of non-missing entries. @@ -211,7 +210,7 @@ cancer["Class"].unique() We will improve the readability of our analysis by renaming `"M"` to `"Malignant"` and `"B"` to `"Benign"` using the `replace` method. The `replace` method takes one argument: a dictionary that maps -previous values to desired new values. +previous values to desired new values. We will verify the result using the `unique` method. ```{index} replace @@ -240,7 +239,7 @@ glue("malignant_pct", "{:0.0f}".format(100*cancer["Class"].value_counts(normaliz ``` Before we start doing any modeling, let's explore our data set. 
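As a quick standalone recap of the relabeling step described above, here is a sketch on a small made-up series (not the `cancer` data frame itself):

```{code}
import pandas as pd

# hypothetical stand-in for the Class column
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="Class")

# replace takes a dictionary mapping previous values to desired new values
diagnosis = diagnosis.replace({"M": "Malignant", "B": "Benign"})

# verify the relabeling with unique
print(diagnosis.unique())
```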
Below we use -the `groupby` and `count` methods to find the number and percentage +the `groupby` and `count` methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with `groupby`, `count` counts the number of observations for each value of the `Class` variable. Then we calculate the percentage in each group by dividing by the total @@ -248,9 +247,9 @@ number of observations and multiplying by 100. The total number of observations equals the number of rows in the data frame, which we can access via the `shape` attribute of the data frame (`shape[0]` is the number of rows and `shape[1]` is the number of columns). -We have +We have {glue:text}`benign_count` ({glue:text}`benign_pct`\%) benign and -{glue:text}`malignant_count` ({glue:text}`malignant_pct`\%) malignant +{glue:text}`malignant_count` ({glue:text}`malignant_pct`\%) malignant tumor observations. ```{code-cell} ipython3 @@ -260,7 +259,7 @@ tumor observations. ```{index} value_counts ``` -The `pandas` package also has a more convenient specialized `value_counts` method for +The `pandas` package also has a more convenient specialized `value_counts` method for counting the number of occurrences of each value in a column. If we pass no arguments to the method, it outputs a series containing the number of occurences of each value. If we instead pass the argument `normalize=True`, it instead prints the fraction @@ -308,17 +307,17 @@ obtain a new observation not in the current data set that has all the variables measured *except* the label (i.e., an image without the physician's diagnosis for the tumor class). We could compute the standardized perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify -that observation as benign or malignant? Based on the scatter plot, how might +that observation as benign or malignant? Based on the scatter plot, how might you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as -malignant. Based on our visualization, it seems like +malignant. Based on our visualization, it seems like it may be possible to make accurate predictions of the `Class` variable (i.e., a diagnosis) for tumor images with unknown diagnoses. +++ -## Classification with $K$-nearest neighbors +## Classification with K-nearest neighbors ```{code-cell} ipython3 :tags: [remove-cell] @@ -342,21 +341,21 @@ my_distances = euclidean_distances(perim_concav_with_new_point_df[attrs])[ ``` In order to actually make predictions for new observations in practice, we -will need a classification algorithm. -In this book, we will use the $K$-nearest neighbors classification algorithm. +will need a classification algorithm. +In this book, we will use the K-nearest neighbors classification algorithm. To predict the label of a new observation (here, classify it as either benign -or malignant), the $K$-nearest neighbors classifier generally finds the $K$ +or malignant), the K-nearest neighbors classifier generally finds the $K$ "nearest" or "most similar" observations in our training set, and then uses -their diagnoses to make a prediction for the new observation's diagnosis. $K$ +their diagnoses to make a prediction for the new observation's diagnosis. $K$ is a number that we must choose in advance; for now, we will assume that someone has chosen -$K$ for us. 
We will cover how to choose $K$ ourselves in the next chapter. +$K$ for us. We will cover how to choose $K$ ourselves in the next chapter. -To illustrate the concept of $K$-nearest neighbors classification, we +To illustrate the concept of K-nearest neighbors classification, we will walk through an example. Suppose we have a -new observation, with standardized perimeter -of {glue:text}`new_point_1_0` and standardized concavity -of {glue:text}`new_point_1_1`, whose -diagnosis "Class" is unknown. This new observation is +new observation, with standardized perimeter +of {glue:text}`new_point_1_0` and standardized concavity +of {glue:text}`new_point_1_1`, whose +diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in {numref}`fig:05-knn-2`. ```{code-cell} ipython3 @@ -397,7 +396,7 @@ glue("1-neighbor_con", "{:.1f}".format(near_neighbor_df.iloc[0, :]["Concavity"]) {numref}`fig:05-knn-3` shows that the nearest point to this new observation is **malignant** and located at the coordinates ({glue:text}`1-neighbor_per`, {glue:text}`1-neighbor_con`). The idea here is that if a point is close to another -in the scatter plot, then the perimeter and concavity values are similar, +in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis. ```{code-cell} ipython3 @@ -481,7 +480,7 @@ Suppose we have another new observation with standardized perimeter scatter plot in {numref}`fig:05-knn-4`, how would you classify this red, diamond observation? The nearest neighbor to this new point is a **benign** observation at ({glue:text}`2-neighbor_per`, {glue:text}`2-neighbor_con`). -Does this seem like the right prediction to make for this observation? Probably +Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points. +++ @@ -561,7 +560,7 @@ Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$. Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$; $b_x$ and $b_y$ have similar definitions for observation $b$. Then the straight-line distance between observation $a$ and -$b$ on the x-y plane can be computed using the following formula: +$b$ on the x-y plane can be computed using the following formula: $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ @@ -569,13 +568,13 @@ $$\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}$$ To find the $K$ nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the $K$ observations corresponding to the -$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new -observation with perimeter {glue:text}`3-new_point_0` and +$K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new +observation with perimeter {glue:text}`3-new_point_0` and concavity {glue:text}`3-new_point_1`, shown as a red diamond in {numref}`fig:05-multiknn-1`. Let's calculate the distances between our new point and each of the observations in the training set to find -the $K=5$ neighbors that are nearest to our new point. +the $K=5$ neighbors that are nearest to our new point. 
You will see in the code below, we compute the straight-line -distance using the formula above: we square the differences between the two observations' perimeter +distance using the formula above: we square the differences between the two observations' perimeter and concavity coordinates, add the squared differences, and then take the square root. In order to find the $K=5$ nearest neighbors, we will use the `nsmallest` function from `pandas`. @@ -633,16 +632,16 @@ cancer["dist_from_new"] = ( + (cancer["Concavity"] - new_obs_Concavity) ** 2 )**(1/2) cancer.nsmallest(5, "dist_from_new")[[ - "Perimeter", - "Concavity", - "Class", + "Perimeter", + "Concavity", + "Class", "dist_from_new" ]] ``` ```{code-cell} ipython3 :tags: [remove-cell] -# code needed to render the latex table with distance calculations +# code needed to render the latex table with distance calculations from IPython.display import Latex five_neighbors = ( cancer @@ -685,7 +684,7 @@ training data. +++ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are -malignant; since this is the majority, we classify our new observation as malignant. +malignant; since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in {numref}`fig:05-multiknn-3`. ```{code-cell} ipython3 @@ -714,21 +713,21 @@ Scatter plot of concavity versus perimeter with 5 nearest neighbors circled. +++ -### More than two explanatory variables +### More than two explanatory variables -Although the above description is directed toward two predictor variables, -exactly the same $K$-nearest neighbors algorithm applies when you +Although the above description is directed toward two predictor variables, +exactly the same K-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have $m$ predictor -variables for two observations $a$ and $b$, i.e., +variables for two observations $a$ and $b$, i.e., $a = (a_{1}, a_{2}, \dots, a_{m})$ and $b = (b_{1}, b_{2}, \dots, b_{m})$. ```{index} distance; more than two variables ``` -The distance formula becomes +The distance formula becomes $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.$$ @@ -758,17 +757,17 @@ cancer["dist_from_new"] = ( + (cancer["Symmetry"] - new_obs_Symmetry) ** 2 )**(1/2) cancer.nsmallest(5, "dist_from_new")[[ - "Perimeter", - "Concavity", - "Symmetry", - "Class", + "Perimeter", + "Concavity", + "Symmetry", + "Class", "dist_from_new" ]] ``` -Based on $K=5$ nearest neighbors with these three predictors we would classify -the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. -{numref}`fig:05-more` shows what the data look like when we visualize them +Based on $K=5$ nearest neighbors with these three predictors we would classify +the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class. +{numref}`fig:05-more` shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors. ```{code-cell} ipython3 @@ -873,9 +872,9 @@ nearest neighbors look like, for learning purposes. 
+++ -### Summary of $K$-nearest neighbors algorithm +### Summary of K-nearest neighbors algorithm -In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following: +In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following: 1. Compute the distance between the new observation and each observation in the training set. 2. Find the $K$ rows corresponding to the $K$ smallest distances. @@ -883,21 +882,21 @@ In order to classify a new observation using a $K$-nearest neighbor classifier, +++ -## $K$-nearest neighbors with `scikit-learn` +## K-nearest neighbors with `scikit-learn` ```{index} scikit-learn ``` -Coding the $K$-nearest neighbors algorithm in Python ourselves can get complicated, +Coding the K-nearest neighbors algorithm in Python ourselves can get complicated, especially if we want to handle multiple classes, more than two variables, or predict the class for multiple new observations. Thankfully, in Python, -the $K$-nearest neighbors algorithm is -implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with -many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions -in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the -less we have to code ourselves, the fewer mistakes we will likely make. -Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package -that we prefer using `pandas` data frames over regular arrays via the `set_config` function. +the K-nearest neighbors algorithm is +implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with +many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions +in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the +less we have to code ourselves, the fewer mistakes we will likely make. +Before getting started with K-nearest neighbors, we need to tell the `sklearn` package +that we prefer using `pandas` data frames over regular arrays via the `set_config` function. ```{note} You will notice a new way of importing functions in the code below: `from ... import ...`. This lets us import *just* `set_config` from `sklearn`, and then call `set_config` without any package prefix. @@ -914,14 +913,14 @@ from sklearn import set_config set_config(transform_output="pandas") ``` -We can now get started with $K$-nearest neighbors. The first step is to +We can now get started with K-nearest neighbors. The first step is to import the `KNeighborsClassifier` from the `sklearn.neighbors` module. ```{code-cell} ipython3 from sklearn.neighbors import KNeighborsClassifier ``` -Let's walk through how to use `KNeighborsClassifier` to perform $K$-nearest neighbors classification. +Let's walk through how to use `KNeighborsClassifier` to perform K-nearest neighbors classification. We will use the `cancer` data set from above, with perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. 
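Before continuing with the `scikit-learn` walkthrough, here is a compact standalone sketch of the three-step summary above, written directly with `pandas` on a made-up miniature training set (hypothetical values, not the chapter's data):

```{code}
import pandas as pd

# made-up miniature training set with standardized predictors
train = pd.DataFrame({
    "Perimeter": [0.2, 1.1, -0.5, 1.4, 0.9],
    "Concavity": [0.5, 1.3, -0.8, 1.0, 1.2],
    "Class": ["Benign", "Malignant", "Benign", "Malignant", "Malignant"],
})
new_obs = {"Perimeter": 1.0, "Concavity": 1.0}
K = 3

# 1. compute the distance from the new observation to each training observation
dist = (
    (train["Perimeter"] - new_obs["Perimeter"]) ** 2
    + (train["Concavity"] - new_obs["Concavity"]) ** 2
) ** 0.5

# 2. find the K rows corresponding to the K smallest distances
neighbors = train.loc[dist.nsmallest(K).index]

# 3. classify based on the majority vote among the neighbors' classes
print(neighbors["Class"].value_counts().idxmax())
```

The `KNeighborsClassifier` used below does the same kind of computation for us, with far less room for mistakes.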
Then we will use the classifier to predict the diagnosis label for a new observation with @@ -936,15 +935,15 @@ cancer_train ```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier ``` -Next, we create a *model object* for $K$-nearest neighbors classification +Next, we create a *model object* for K-nearest neighbors classification by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors; we will discuss how to choose $K$ in the next chapter. ```{note} You can specify the `weights` argument in order to control how neighbors vote when classifying a new observation. The default is `"uniform"`, where -each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, -which weigh each neighbor's vote differently, can be found on +each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices, +which weigh each neighbor's vote differently, can be found on [the `scikit-learn` website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighborsclassifier#sklearn.neighbors.KNeighborsClassifier). ``` @@ -975,7 +974,7 @@ knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]); After using the `fit` function, we can make a prediction on a new observation by calling `predict` on the classifier object, passing the new observation -itself. As above, when we ran the $K$-nearest neighbors classification +itself. As above, when we ran the K-nearest neighbors classification algorithm manually, the `knn` model object classifies the new observation as "Malignant". Note that the `predict` function outputs an `array` with the model's prediction; you can actually make multiple predictions at the same @@ -988,8 +987,8 @@ knn.predict(new_obs) Is this predicted malignant label the actual class for this observation? Well, we don't know because we do not have this -observation's diagnosis— that is what we were trying to predict! The -classifier's prediction is not necessarily correct, but in the next chapter, we will +observation's diagnosis— that is what we were trying to predict! The +classifier's prediction is not necessarily correct, but in the next chapter, we will learn ways to quantify how accurate we think our predictions are. +++ @@ -1001,9 +1000,9 @@ learn ways to quantify how accurate we think our predictions are. ```{index} scaling ``` -When using $K$-nearest neighbor classification, the *scale* of each variable +When using K-nearest neighbors classification, the *scale* of each variable (i.e., its size and range of values) matters. Since the classifier predicts -classes by identifying observations nearest to it, any variables with +classes by identifying observations nearest to it, any variables with a large scale will have a much larger effect than variables with a small scale. But just because a variable has a large scale *doesn't mean* that it is more important for making accurate predictions. For example, suppose you have a @@ -1027,20 +1026,20 @@ degrees Celsius, the two variables would differ by a constant shift of 273 hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. 
Although this doesn't affect the -$K$-nearest neighbor classification algorithm, this large shift can change the +K-nearest neighbors classification algorithm, this large shift can change the outcome of using many other predictive models. ```{index} standardization; K-nearest neighbors ``` To scale and center our data, we need to find -our variables' *mean* (the average, which quantifies the "central" value of a -set of numbers) and *standard deviation* (a number quantifying how spread out values are). -For each observed value of the variable, we subtract the mean (i.e., center the variable) -and divide by the standard deviation (i.e., scale the variable). When we do this, the data -is said to be *standardized*, and all variables in a data set will have a mean of 0 -and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest -neighbor algorithm, we will read in the original, unstandardized Wisconsin breast +our variables' *mean* (the average, which quantifies the "central" value of a +set of numbers) and *standard deviation* (a number quantifying how spread out values are). +For each observed value of the variable, we subtract the mean (i.e., center the variable) +and divide by the standard deviation (i.e., scale the variable). When we do this, the data +is said to be *standardized*, and all variables in a data set will have a mean of 0 +and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest +neighbors algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. We will apply the same initial wrangling steps as we did earlier, and to keep things simple we will just use the `Area`, `Smoothness`, and `Class` @@ -1072,11 +1071,11 @@ The `scikit-learn` framework provides a collection of *preprocessors* used to ma data in the [`preprocessing` module](https://scikit-learn.org/stable/modules/preprocessing.html). Here we will use the `StandardScaler` transformer to standardize the predictor variables in the `unscaled_cancer` data. In order to tell the `StandardScaler` which variables to standardize, -we wrap it in a +we wrap it in a [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) object -using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. +using the [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html#sklearn.compose.make_column_transformer) function. `ColumnTransformer` objects also enable the use of multiple preprocessors at -once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. +once, which is especially handy when you want to apply different preprocessing to each of the predictor variables. The primary argument of the `make_column_transformer` function is a sequence of pairs of (1) a preprocessor, and (2) the columns to which you want to apply that preprocessor. In the present case, we just have the one `StandardScaler` preprocessor to apply to the `Area` and `Smoothness` columns. @@ -1101,14 +1100,14 @@ preprocessor ``` You can see that the preprocessor includes a single standardization step -that is applied to the `Area` and `Smoothness` columns. 
-Note that here we specified which columns to apply the preprocessing step to +that is applied to the `Area` and `Smoothness` columns. +Note that here we specified which columns to apply the preprocessing step to by individual names; this approach can become quite difficult, e.g., when we have many predictor variables. Rather than writing out the column names individually, -we can instead use the +we can instead use the [`make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) function. For example, if we wanted to standardize all *numerical* predictors, -we would use `make_column_selector` and specify the `dtype_include` argument to be `"number"`. +we would use `make_column_selector` and specify the `dtype_include` argument to be `"number"`. This creates a preprocessor equivalent to the one we created previously. ```{code-cell} ipython3 @@ -1126,10 +1125,10 @@ preprocessor We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame. This happens in two steps. We first use the `fit` function to compute the values necessary to apply the standardization (the mean and standard deviation of each variable), passing the `unscaled_cancer` data as an argument. -Then we use the `transform` function to actually apply the standardization. +Then we use the `transform` function to actually apply the standardization. It may seem a bit unnecessary to use two steps---`fit` *and* `transform`---to standardize the data. -However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. -This enables us to compute the quantities needed to standardize using one data set, and then +However, we do this in two steps so that we can specify a different data set in the `transform` step if we want. +This enables us to compute the quantities needed to standardize using one data set, and then apply that standardization to another data set. ```{code-cell} ipython3 @@ -1145,7 +1144,7 @@ glue("scaled-cancer-column-1", '"'+scaled_cancer.columns[1]+'"') It looks like our `Smoothness` and `Area` variables have been standardized. Woohoo! But there are two important things to notice about the new `scaled_cancer` data frame. First, it only keeps the columns from the input to `transform` (here, `unscaled_cancer`) that had a preprocessing step applied -to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` +to them. The default behavior of the `ColumnTransformer` that we build using `make_column_transformer` is to *drop* the remaining columns. This default behavior works well with the rest of `sklearn` (as we will see below in {numref}`08:puttingittogetherworkflow`), but for visualizing the result of preprocessing it can be useful to keep the other columns in our original data frame, such as the `Class` variable here. @@ -1174,7 +1173,7 @@ scaled_cancer_all You may wonder why we are doing so much work just to center and scale our variables. Can't we just manually scale and center the `Area` and -`Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well, +`Smoothness` variables ourselves before building our K-nearest neighbors model? Well, technically *yes*; but doing so is error-prone. 
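To see why the manual route is risky, consider this standalone sketch (made-up numbers, not the chapter's data), which standardizes a small training set by hand and then scales a new observation two different ways:

```{code}
import pandas as pd

# made-up unscaled training data
train = pd.DataFrame({
    "Area": [1001.0, 520.0, 748.3, 1130.0],
    "Smoothness": [0.118, 0.085, 0.102, 0.097],
})

# manual standardization of the training data (what we would feed to the classifier)
train_means = train.mean()
train_sds = train.std()
train_scaled = (train - train_means) / train_sds

# a new observation we want a prediction for
new_obs = pd.DataFrame({"Area": [500.0], "Smoothness": [0.075]})

# correct: reuse the *training* means and standard deviations
print((new_obs - train_means) / train_sds)

# easy mistake: "standardizing" the new observation on its own, which yields
# NaN here (one row has no spread) and, in general, values on a completely
# different scale than the training data
print((new_obs - new_obs.mean()) / new_obs.std())
```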
In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a *different* centering / scaling than what @@ -1184,7 +1183,7 @@ the preprocessor is required only when you want to inspect the result of the preprocessing steps yourself. You will see further on in {numref}`08:puttingittogetherworkflow` that `scikit-learn` provides tools to -automatically streamline the preprocesser and the model so that you can call `fit` +automatically streamline the preprocesser and the model so that you can call `fit` and `transform` on the `Pipeline` as necessary without additional coding effort. {numref}`fig:05-scaling-plt` shows the two scatter plots side-by-side—one for `unscaled_cancer` and one for @@ -1195,10 +1194,10 @@ well within the cloud of benign observations, and the neighbors are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). {numref}`fig:05-scaling-plt-zoomed` shows a close-up of that region on the unstandardized plot. Here the computation of nearest -neighbors is dominated by the much larger-scale area variable. The plot for standardized data +neighbors is dominated by the much larger-scale area variable. The plot for standardized data on the right in {numref}`fig:05-scaling-plt` shows a much more intuitively reasonable selection of nearest neighbors. Thus, standardizing the data can change things -in an important way when we are using predictive algorithms. +in an important way when we are using predictive algorithms. Standardizing your data should be a part of the preprocessing you do before predictive modeling and you should always think carefully about your problem domain and whether you need to standardize your data. @@ -1399,9 +1398,9 @@ Close-up of three nearest neighbors for unstandardized data. ```{index} balance, imbalance ``` -Another potential issue in a data set for a classifier is *class imbalance*, +Another potential issue in a data set for a classifier is *class imbalance*, i.e., when one label is much more common than another. Since classifiers like -the $K$-nearest neighbor algorithm use the labels of nearby points to predict +the K-nearest neighbors algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the "pattern" of data suggests otherwise). Class imbalance is actually quite a @@ -1410,19 +1409,19 @@ detection, there are many cases in which the "important" class to identify (presence of disease, malicious email) is much rarer than the "unimportant" class (no disease, normal email). -To better illustrate the problem, let's revisit the scaled breast cancer data, +To better illustrate the problem, let's revisit the scaled breast cancer data, `cancer`; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer was rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the `.head()` method, which takes the number of rows to select from the top (`n`). 
-We will then use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) +We will then use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function from `pandas` to glue the two resulting filtered data frames back together. The `concat` function *concatenates* data frames along an axis. By default, it concatenates the data frames vertically along `axis=0` yielding a single *taller* data frame, which is what we want to do here. If we instead wanted to concatenate horizontally to produce a *wider* data frame, we would specify `axis=1`. -The new imbalanced data is shown in {numref}`fig:05-unbalanced`, +The new imbalanced data is shown in {numref}`fig:05-unbalanced`, and we print the counts of the classes using the `value_counts` function. ```{code-cell} ipython3 @@ -1452,8 +1451,8 @@ rare_cancer["Class"].value_counts() +++ -Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification. -With only 3 observations of malignant tumors, the classifier +Suppose we now decided to use $K = 7$ in K-nearest neighbors classification. +With only 3 observations of malignant tumors, the classifier will *always predict that the tumor is benign, no matter what its concavity and perimeter are!* This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be @@ -1525,9 +1524,9 @@ Imbalanced data with 7 nearest neighbors to a new observation highlighted. +++ -{numref}`fig:05-upsample-2` shows what happens if we set the background color of -each area of the plot to the predictions the $K$-nearest neighbor -classifier would make. We can see that the decision is +{numref}`fig:05-upsample-2` shows what happens if we set the background color of +each area of the plot to the predictions the K-nearest neighbors +classifier would make. We can see that the decision is always "benign," corresponding to the blue color. ```{code-cell} ipython3 @@ -1609,9 +1608,9 @@ Imbalanced data with background color indicating the decision of the classifier Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. -For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. +For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. In other words, we will replicate rare observations multiple times in our data set to give them more -voting power in the $K$-nearest neighbor algorithm. In order to do this, we will +voting power in the K-nearest neighbors algorithm. In order to do this, we will first separate the classes out into their own data frames by filtering. Then, we will use the `sample` method on the rare class data frame to increase the number of `Malignant` observations to be the same as the number @@ -1624,7 +1623,7 @@ in data analysis in {numref}`Chapter %s `. ```{code-cell} ipython3 :tags: [remove-cell] # hidden seed call to make the below resample reproducible -# we haven't taught students about seeds / prngs yet, so +# we haven't taught students about seeds / prngs yet, so # for now just hide this. 
np.random.seed(1) ``` @@ -1639,11 +1638,11 @@ upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer)) upsampled_cancer["Class"].value_counts() ``` -Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data. -{numref}`fig:05-upsample-plot` shows what happens now when we set the background color -of each area of our scatter plot to the decision the $K$-nearest neighbor +Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data. +{numref}`fig:05-upsample-plot` shows what happens now when we set the background color +of each area of our scatter plot to the decision the K-nearest neighbors classifier would make. We can see that the decision is more reasonable; when the points are close -to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are +to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are closer to the benign tumor observations. ```{code-cell} ipython3 @@ -1739,13 +1738,13 @@ missing_cancer["Class"] = missing_cancer["Class"].replace({ missing_cancer ``` -Recall that K-nearest neighbor classification makes predictions by computing +Recall that K-nearest neighbors classification makes predictions by computing the straight-line distance to nearby training observations, and hence requires access to the values of *all* variables for *all* observations in the training -data. So how can we perform K-nearest neighbor classification in the presence +data. So how can we perform K-nearest neighbors classification in the presence of missing data? Well, since there are not too many observations with missing entries, one option is to simply remove those observations prior to building -the K-nearest neighbor classifier. We can accomplish this by using the +the K-nearest neighbors classifier. We can accomplish this by using the `dropna` method prior to working with the data. ```{code-cell} ipython3 @@ -1759,7 +1758,7 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries are filled in using the mean of the present entries in each variable. To perform mean imputation, we -use a `SimpleImputer` transformer with the default arguments, and wrap it in a +use a `SimpleImputer` transformer with the default arguments, and wrap it in a `ColumnTransformer` to indicate which columns need imputation. ```{code-cell} ipython3 @@ -1782,7 +1781,7 @@ imputed_cancer = preprocessor.transform(missing_cancer) imputed_cancer ``` -Many other options for missing data imputation can be found in +Many other options for missing data imputation can be found in [the `scikit-learn` documentation](https://scikit-learn.org/stable/modules/impute.html). However you decide to handle missing data in your data analysis, it is always crucial to think critically about the setting, how the data were collected, and the @@ -1796,7 +1795,7 @@ question you are answering. 
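The `SimpleImputer` above fills in column means by default. As one example of those other options, the sketch below (on a small made-up data frame) swaps in median imputation via the `strategy` argument:

```{code}
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# made-up data frame with a few missing entries
missing_df = pd.DataFrame({
    "Perimeter": [0.2, np.nan, 1.1, 0.5],
    "Concavity": [0.4, 0.9, np.nan, 0.3],
})

# strategy="median" fills each missing entry with that column's median
preprocessor = make_column_transformer(
    (SimpleImputer(strategy="median"), ["Perimeter", "Concavity"]),
)
preprocessor.fit(missing_df)
print(preprocessor.transform(missing_df))
```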
```{index} scikit-learn; pipeline ``` -The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), +The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline), a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole workflow, let's start from scratch with the `wdbc_unscaled.csv` data. First we will load the data, create a model, and specify a preprocessor for the data. @@ -1810,7 +1809,7 @@ unscaled_cancer["Class"] = unscaled_cancer["Class"].replace({ }) unscaled_cancer -# create the KNN model +# create the K-NN model knn = KNeighborsClassifier(n_neighbors=7) # create the centering / scaling preprocessor @@ -1822,7 +1821,7 @@ preprocessor = make_column_transformer( ```{index} scikit-learn; make_pipeline, scikit-learn; fit ``` -Next we place these steps in a `Pipeline` using +Next we place these steps in a `Pipeline` using the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) function. The `make_pipeline` function takes a list of steps to apply in your data analysis; in this case, we just have the `preprocessor` and `knn` steps. @@ -1839,7 +1838,7 @@ from sklearn.pipeline import make_pipeline knn_pipeline = make_pipeline(preprocessor, knn) knn_pipeline.fit( - X=unscaled_cancer, + X=unscaled_cancer, y=unscaled_cancer["Class"] ) knn_pipeline @@ -1848,7 +1847,7 @@ knn_pipeline As before, the fit object lists the function that trains the model. But now the fit object also includes information about the overall workflow, including the standardization preprocessing step. In other words, when we use the `predict` function with the `knn_pipeline` object to make a prediction for a new -observation, it will first apply the same preprocessing steps to the new observation. +observation, it will first apply the same preprocessing steps to the new observation. As an example, we will predict the class label of two new observations: one with `Area = 500` and `Smoothness = 0.075`, and one with `Area = 1500` and `Smoothness = 0.1`. @@ -1859,13 +1858,13 @@ prediction ``` The classifier predicts that the first observation is benign, while the second is -malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this -trained $K$-nearest neighbor model will make on a large range of new observations. +malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this +trained K-nearest neighbors model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. -For the interested reader who wants a learning challenge, we now include it below. -The basic idea is to create a grid of synthetic new observations using the `meshgrid` function from `numpy`, -predict the label of each, and visualize the predictions with a colored scatter having a very high transparency +For the interested reader who wants a learning challenge, we now include it below. 
+The basic idea is to create a grid of synthetic new observations using the `meshgrid` function from `numpy`, +predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low `opacity` value) and large point radius. See if you can figure out what each line is doing! ```{note} @@ -1950,8 +1949,8 @@ Scatter plot of smoothness versus area where background color indicates the deci ## Exercises -Practice exercises for the material covered in this chapter -can be found in the accompanying +Practice exercises for the material covered in this chapter +can be found in the accompanying [worksheets repository](https://worksheets.python.datasciencebook.ca) in the "Classification I: training and predicting" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. diff --git a/source/classification2.md b/source/classification2.md index 739b86d4..649b5aa3 100755 --- a/source/classification2.md +++ b/source/classification2.md @@ -21,25 +21,26 @@ kernelspec: from chapter_preamble import * ``` -## Overview +## Overview This chapter continues the introduction to predictive modeling through classification. While the previous chapter covered training and data preprocessing, this chapter focuses on how to evaluate the performance of a classifier, as well as how to improve the classifier (where possible) to maximize its accuracy. -## Chapter learning objectives +## Chapter learning objectives By the end of the chapter, readers will be able to do the following: - Describe what training, validation, and test data sets are and how they are used in classification. - Split data into training, validation, and test data sets. - Describe what a random seed is and its importance in reproducible data analysis. -- Set the random seed in Python using the `numpy.random.seed` function. +- Set the random seed in Python using the `numpy.random.seed` function. - Describe and interpret accuracy, precision, recall, and confusion matrices. -- Evaluate classification accuracy in Python using a validation data set. +- Evaluate classification accuracy, precision, and recall in Python using a test set, a single validation set, and cross-validation. - Produce a confusion matrix in Python. -- Execute cross-validation in Python to choose the number of neighbors in a $K$-nearest neighbors classifier. -- Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm. +- Choose the number of neighbors in a K-nearest neighbors classifier by maximizing estimated cross-validation accuracy. +- Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors classification. +- Describe the advantages and disadvantages of the K-nearest neighbors classification algorithm. +++ @@ -51,7 +52,7 @@ By the end of the chapter, readers will be able to do the following: Sometimes our classifier might make the wrong prediction. A classifier does not need to be right 100\% of the time to be useful, though we don't want the classifier to make too many wrong predictions. How do we measure how "good" our -classifier is? Let's revisit the +classifier is? Let's revisit the [breast cancer images data](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) {cite:p}`streetbreastcancer` and think about how our classifier will be used in practice. 
A biopsy will be performed on a *new* patient's tumor, the resulting image will be analyzed, @@ -59,9 +60,9 @@ and the classifier will be asked to decide whether the tumor is benign or malignant. The key word here is *new*: our classifier is "good" if it provides accurate predictions on data *not seen during training*, as this implies that it has actually learned about the relationship between the predictor variables and response variable, -as opposed to simply memorizing the labels of individual training data examples. +as opposed to simply memorizing the labels of individual training data examples. But then, how can we evaluate our classifier without visiting the hospital to collect more -tumor images? +tumor images? ```{index} training set, test set @@ -79,7 +80,7 @@ labels for new observations without known class labels. ``` ```{note} -If there were a golden rule of machine learning, it might be this: +If there were a golden rule of machine learning, it might be this: *you cannot use the test data to build the model!* If you do, the model gets to "see" the test data in advance, making it look more accurate than it really is. Imagine how bad it would be to overestimate your classifier's accuracy @@ -106,7 +107,7 @@ How exactly can we assess how well our predictions match the actual labels for the observations in the test set? One way we can do this is to calculate the prediction **accuracy**. This is the fraction of examples for which the classifier made the correct prediction. To calculate this, we divide the number -of correct predictions by the number of predictions made. +of correct predictions by the number of predictions made. The process for assessing if our predictions match the actual labels in the test set is illustrated in {numref}`fig:06-ML-paradigm-test`. @@ -136,7 +137,7 @@ a test set of 65 observations. :header-rows: 1 :name: confusion-matrix-table -* - +* - - Predicted Malignant - Predicted Benign * - **Actually Malignant** @@ -145,7 +146,7 @@ a test set of 65 observations. * - **Actually Benign** - 4 - 57 -``` +``` In the example in {numref}`confusion-matrix-table`, we see that there was 1 malignant observation that was correctly classified as malignant (top left corner), @@ -161,7 +162,7 @@ But we can also see that the classifier only identified 1 out of 4 total maligna tumors; in other words, it misclassified 75% of the malignant cases present in the data set! In this example, misclassifying a malignant tumor is a potentially disastrous error, since it may lead to a patient who requires treatment not receiving it. -Since we are particularly interested in identifying malignant cases, this +Since we are particularly interested in identifying malignant cases, this classifier would likely be unacceptable even with an accuracy of 89%. Focusing more on one label than the other is @@ -240,12 +241,12 @@ Beginning in this chapter, our data analyses will often involve the use of *randomness*. We use randomness any time we need to make a decision in our analysis that needs to be fair, unbiased, and not influenced by human input. For example, in this chapter, we need to split -a data set into a training set and test set to evaluate our classifier. We +a data set into a training set and test set to evaluate our classifier. We certainly do not want to choose how to split the data ourselves by hand, as we want to avoid accidentally influencing the result of the evaluation. So instead, we let Python *randomly* split the data. 
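In `scikit-learn`, this kind of random split is usually done with the `train_test_split` function; the following standalone sketch uses a made-up data frame (hypothetical values, not the chapter's data):

```{code}
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up labeled data set
cancer = pd.DataFrame({
    "Perimeter": [0.2, 1.1, -0.5, 1.4, 0.9, -1.0, 0.3, 1.2],
    "Concavity": [0.5, 1.3, -0.8, 1.0, 1.2, -0.6, 0.1, 0.8],
    "Class": ["Benign", "Malignant", "Benign", "Malignant",
              "Malignant", "Benign", "Benign", "Malignant"],
})

# randomly place 75% of the rows in the training set and 25% in the test set;
# stratify keeps the class proportions similar in both splits, and
# random_state makes this particular split reproducible
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"], random_state=1
)
print(cancer_train.shape, cancer_test.shape)
```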
In future chapters we will use randomness -in many other ways, e.g., to help us select a small subset of data from a larger data set, +in many other ways, e.g., to help us select a small subset of data from a larger data set, to pick groupings of data, and more. ```{index} reproducible, seed @@ -257,14 +258,14 @@ to pick groupings of data, and more. ```{index} seed; numpy.random.seed ``` -However, the use of randomness runs counter to one of the main +However, the use of randomness runs counter to one of the main tenets of good data analysis practice: *reproducibility*. Recall that a reproducible analysis produces the same result each time it is run; if we include randomness in the analysis, would we not get a different result each time? -The trick is that in Python—and other programming languages—randomness +The trick is that in Python—and other programming languages—randomness is not actually random! Instead, Python uses a *random number generator* that produces a sequence of numbers that -are completely determined by a +are completely determined by a *seed value*. Once you set the seed value, everything after that point may *look* random, but is actually totally reproducible. As long as you pick the same seed value, you get the same result! @@ -272,12 +273,12 @@ value, you get the same result! ```{index} sample; numpy.random.choice ``` -Let's use an example to investigate how randomness works in Python. Say we +Let's use an example to investigate how randomness works in Python. Say we have a series object containing the integers from 0 to 9. We want to randomly pick 10 numbers from that list, but we want it to be reproducible. Before randomly picking the 10 numbers, -we call the `seed` function from the `numpy` package, and pass it any integer as the argument. -Below we use the seed number `1`. At +we call the `seed` function from the `numpy` package, and pass it any integer as the argument. +Below we use the seed number `1`. At that point, Python will keep track of the randomness that occurs throughout the code. For example, we can call the `sample` method on the series of numbers, passing the argument `n=10` to indicate that we want 10 samples. @@ -294,8 +295,8 @@ random_numbers1 = nums_0_to_9.sample(n=10).to_numpy() random_numbers1 ``` You can see that `random_numbers1` is a list of 10 numbers -from 0 to 9 that, from all appearances, looks random. If -we run the `sample` method again, +from 0 to 9 that, from all appearances, looks random. If +we run the `sample` method again, we will get a fresh batch of 10 numbers that also look random. ```{code-cell} ipython3 @@ -336,18 +337,18 @@ random_numbers ``` In other words, even though the sequences of numbers that Python is generating *look* -random, they are totally determined when we set a seed value! +random, they are totally determined when we set a seed value! So what does this mean for data analysis? Well, `sample` is certainly not the only data frame method that uses randomness in Python. Many of the functions that we use in `scikit-learn`, `pandas`, and beyond use randomness—many of them without even telling you about it. Also note that when Python starts -up, it creates its own seed to use. So if you do not explicitly -call the `np.random.seed` function, your results +up, it creates its own seed to use. So if you do not explicitly +call the `np.random.seed` function, your results will likely not be reproducible. Finally, be careful to set the seed *only once* at the beginning of a data analysis. 
Each time you set the seed, you are inserting your own human input, thereby influencing the analysis. For example, if you use -the `sample` many times throughout your analysis but set the seed each time, the +the `sample` many times throughout your analysis but set the seed each time, the randomness that Python uses will not look as random as it should. In summary: if you want your analysis to be reproducible, i.e., produce *the same result* @@ -363,32 +364,32 @@ package's *default random number generator*. Using the global default random number generator is easier than other methods, but has some potential drawbacks. For example, other code that you may not notice (e.g., code buried inside some other package) could potentially *also* call `np.random.seed`, thus modifying -your analysis in an undesirable way. Furthermore, not *all* functions use +your analysis in an undesirable way. Furthermore, not *all* functions use `numpy`'s random number generator; some may use another one entirely. -In that case, setting `np.random.seed` may not actually make your whole analysis +In that case, setting `np.random.seed` may not actually make your whole analysis reproducible. In this book, we will generally only use packages that play nicely with `numpy`'s -default random number generator, so we will stick with `np.random.seed`. -You can achieve more careful control over randomness in your analysis -by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) -once at the beginning of your analysis, and passing it to +default random number generator, so we will stick with `np.random.seed`. +You can achieve more careful control over randomness in your analysis +by creating a `numpy` [`RandomState` object](https://numpy.org/doc/1.16/reference/generated/numpy.random.RandomState.html) +once at the beginning of your analysis, and passing it to the `random_state` argument that is available in many `pandas` and `scikit-learn` -functions. Those functions will then use your `RandomState` to generate random numbers instead of +functions. Those functions will then use your `RandomState` to generate random numbers instead of `numpy`'s default generator. For example, we can reproduce our earlier example by using a `RandomState` object with the `seed` value set to 1; we get the same lists of numbers once again. ```{code} rnd = np.random.RandomState(seed=1) random_numbers1_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy() random_numbers1_third -``` +``` ```{code} array([2, 9, 6, 4, 0, 3, 1, 7, 8, 5]) ``` ```{code} random_numbers2_third = nums_0_to_9.sample(n=10, random_state=rnd).to_numpy() random_numbers2_third -``` +``` ```{code} array([9, 5, 3, 0, 8, 4, 2, 1, 6, 7]) ``` @@ -401,15 +402,15 @@ array([9, 5, 3, 0, 8, 4, 2, 1, 6, 7]) ``` Back to evaluating classifiers now! -In Python, we can use the `scikit-learn` package not only to perform $K$-nearest neighbors -classification, but also to assess how well our classification worked. +In Python, we can use the `scikit-learn` package not only to perform K-nearest neighbors +classification, but also to assess how well our classification worked. Let's work through an example of how to use tools from `scikit-learn` to evaluate a classifier using the breast cancer data set from the previous chapter. 
We begin the analysis by loading the packages we require, reading in the breast cancer data, and then making a quick scatter plot visualization of tumor cell concavity versus smoothness colored by diagnosis in {numref}`fig:06-precode`. -You will also notice that we set the random seed using the `np.random.seed` function, +You will also notice that we set the random seed using the `np.random.seed` function, as described in {numref}`randomseeds`. ```{code-cell} ipython3 @@ -478,7 +479,7 @@ it **stratifies** the data by the class label, to ensure that roughly the same proportion of each class ends up in both the training and testing sets. For example, in our data set, roughly 63% of the observations are from the benign class (`Benign`), and 37% are from the malignant class (`Malignant`), -so specifying `stratify` as the class column ensures that roughly 63% of the training data are benign, +so specifying `stratify` as the class column ensures that roughly 63% of the training data are benign, 37% of the training data are malignant, and the same proportions exist in the testing data. @@ -518,19 +519,19 @@ glue("cancer_test_nrow", "{:d}".format(len(cancer_test))) ```{index} info ``` -We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations, +We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations, while the test set contains {glue:text}`cancer_test_nrow` observations. This corresponds to a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s ` -that we use the `info` method to preview the number of rows, the variable names, their data types, and +that we use the `info` method to preview the number of rows, the variable names, their data types, and missing entries of a data frame. ```{index} groupby, count ``` -We can use the `value_counts` method with the `normalize` argument set to `True` -to find the percentage of malignant and benign classes +We can use the `value_counts` method with the `normalize` argument set to `True` +to find the percentage of malignant and benign classes in `cancer_train`. We see about {glue:text}`cancer_train_b_prop`% of the training -data are benign and {glue:text}`cancer_train_m_prop`% +data are benign and {glue:text}`cancer_train_m_prop`% are malignant, indicating that our class proportions were roughly preserved when we split the data. ```{code-cell} ipython3 @@ -546,7 +547,7 @@ glue("cancer_train_m_prop", "{:0.0f}".format(cancer_train["Class"].value_counts( ### Preprocess the data -As we mentioned in the last chapter, $K$-nearest neighbors is sensitive to the scale of the predictors, +As we mentioned in the last chapter, K-nearest neighbors is sensitive to the scale of the predictors, so we should perform some preprocessing to standardize them. An additional consideration we need to take when doing this is that we should create the standardization preprocessor using **only the training data**. This ensures that @@ -559,7 +560,7 @@ training and test data sets. ```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler ``` -Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our +Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our analysis steps in a `Pipeline`, as in {numref}`Chapter %s `. So below we construct and prepare the preprocessor using `make_column_transformer` just as before. 
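To make the workflow above concrete, here is a minimal sketch of the stratified train/test split and the standardization preprocessor described in this section. It assumes the unsplit data frame is named `cancer`; it mirrors the description above rather than reproducing the chapter's exact code cells.

```python
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# shuffle and split the data, keeping 75% of observations for training;
# stratifying on Class keeps the benign/malignant proportions roughly
# the same in the training and test sets
cancer_train, cancer_test = train_test_split(
    cancer, train_size=0.75, stratify=cancer["Class"]
)

# check that the class proportions were roughly preserved
cancer_train["Class"].value_counts(normalize=True)

# build the standardization preprocessor using only the training data columns
cancer_preprocessor = make_column_transformer(
    (StandardScaler(), ["Smoothness", "Concavity"]),
)
```

Note that the preprocessor is only *defined* here; it is fit to the training data later, when it is combined with the classifier in a `Pipeline`, which is what prevents information from the test set from leaking into the standardization.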
@@ -576,11 +577,11 @@ cancer_preprocessor = make_column_transformer( ### Train the classifier Now that we have split our original data set into training and test sets, we -can create our $K$-nearest neighbors classifier with only the training set using +can create our K-nearest neighbors classifier with only the training set using the technique we learned in the previous chapter. For now, we will just choose the number $K$ of neighbors to be 3, and use only the concavity and smoothness predictors by -selecting them from the `cancer_train` data frame. -We will first import the `KNeighborsClassifier` model and `make_pipeline` from `sklearn`. +selecting them from the `cancer_train` data frame. +We will first import the `KNeighborsClassifier` model and `make_pipeline` from `sklearn`. Then as before we will create a model object, combine the model object and preprocessor into a `Pipeline` using the `make_pipeline` function, and then finally use the `fit` method to build the classifier. @@ -589,7 +590,7 @@ use the `fit` method to build the classifier. from sklearn.neighbors import KNeighborsClassifier from sklearn.pipeline import make_pipeline -knn = KNeighborsClassifier(n_neighbors=3) +knn = KNeighborsClassifier(n_neighbors=3) X = cancer_train[["Smoothness", "Concavity"]] y = cancer_train["Class"] @@ -605,7 +606,7 @@ knn_pipeline ```{index} pandas.concat ``` -Now that we have a $K$-nearest neighbors classifier object, we can use it to +Now that we have a K-nearest neighbors classifier object, we can use it to predict the class labels for our test set and augment the original test data with a column of predictions. The `Class` variable contains the actual @@ -663,7 +664,7 @@ glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1)) +++ -The output shows that the estimated accuracy of the classifier on the test data +The output shows that the estimated accuracy of the classifier on the test data was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the `precision_score` and `recall_score` functions from `scikit-learn`. We specify the true labels from the `Class` variable as the `y_true` argument, the predicted @@ -709,7 +710,7 @@ _ctab = pd.crosstab(cancer_test["Class"], c11 = _ctab["Malignant"]["Malignant"] c00 = _ctab["Benign"]["Benign"] -c10 = _ctab["Benign"]["Malignant"] # classify benign, true malignant +c10 = _ctab["Benign"]["Malignant"] # classify benign, true malignant c01 = _ctab["Malignant"]["Benign"] # classify malignant, true benign glue("confu11", "{:d}".format(c11)) @@ -726,8 +727,8 @@ glue("confu_precision_0", "{:0.0f}".format(100*c11/(c11+c01))) glue("confu_recall_0", "{:0.0f}".format(100*c11/(c11+c10))) ``` -The confusion matrix shows {glue:text}`confu11` observations were correctly predicted -as malignant, and {glue:text}`confu00` were correctly predicted as benign. +The confusion matrix shows {glue:text}`confu11` observations were correctly predicted +as malignant, and {glue:text}`confu00` were correctly predicted as benign. It also shows that the classifier made some mistakes; in particular, it classified {glue:text}`confu10` observations as benign when they were actually malignant, and {glue:text}`confu01` observations as malignant when they were actually benign. @@ -768,15 +769,15 @@ glue("rec_eq_math_glued", rec_eq_math) ### Critically analyze performance We now know that the classifier was {glue:text}`cancer_acc_1`% accurate -on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and -a recall of {glue:text}`cancer_rec_1`%. 
-That sounds pretty good! Wait, *is* it good? +on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and +a recall of {glue:text}`cancer_rec_1`%. +That sounds pretty good! Wait, *is* it good? Or do we need something higher? ```{index} accuracy; assessment ``` -In general, a *good* value for accuracy (as well as precision and recall, if applicable) +In general, a *good* value for accuracy (as well as precision and recall, if applicable) depends on the application; you must critically analyze your accuracy in the context of the problem you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99% of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!). @@ -789,7 +790,7 @@ words, in this context, we need the classifier to have a *high recall*. On the other hand, it might be less bad for the classifier to guess "malignant" when the actual class is "benign" (a false positive), as the patient will then likely see a doctor who can provide an expert diagnosis. In other words, we are fine with sacrificing -some precision in the interest of achieving high recall. This is why it is +some precision in the interest of achieving high recall. This is why it is important not only to look at accuracy, but also the confusion matrix. @@ -801,12 +802,12 @@ classification problem: the *majority classifier*. The majority classifier *always* guesses the majority class label from the training data, regardless of the predictor variables' values. It helps to give you a sense of scale when considering accuracies. If the majority classifier obtains a 90% -accuracy on a problem, then you might hope for your $K$-nearest neighbors +accuracy on a problem, then you might hope for your K-nearest neighbors classifier to do better than that. If your classifier provides a significant improvement upon the majority classifier, this means that at least your method is extracting some useful information from your predictor variables. Be careful though: improving on the majority classifier does not *necessarily* -mean the classifier is working well enough for your application. +mean the classifier is working well enough for your application. As an example, in the breast cancer data, recall the proportions of benign and malignant observations in the training data are as follows: @@ -819,16 +820,16 @@ Since the benign class represents the majority of the training data, the majority classifier would *always* predict that a new observation is benign. The estimated accuracy of the majority classifier is usually fairly close to the majority class proportion in the training data. -In this case, we would suspect that the majority classifier will have +In this case, we would suspect that the majority classifier will have an accuracy of around {glue:text}`cancer_train_b_prop`%. -The $K$-nearest neighbors classifier we built does quite a bit better than this, -with an accuracy of {glue:text}`cancer_acc_1`%. +The K-nearest neighbors classifier we built does quite a bit better than this, +with an accuracy of {glue:text}`cancer_acc_1`%. This means that from the perspective of accuracy, -the $K$-nearest neighbors classifier improved quite a bit on the basic -majority classifier. Hooray! But we still need to be cautious; in +the K-nearest neighbors classifier improved quite a bit on the basic +majority classifier. Hooray! 
But we still need to be cautious; in this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing patients who actually need medical care. The confusion matrix above shows -that the classifier does, indeed, misdiagnose a significant number of +that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign ({glue:text}`confu10` out of {glue:text}`confu10_11` malignant tumors, or {glue:text}`confu_fal_neg`%!). Therefore, even though the accuracy improved upon the majority classifier, our critical analysis suggests that this classifier may not have appropriate performance @@ -845,23 +846,23 @@ for the application. ``` The vast majority of predictive models in statistics and machine learning have -*parameters*. A *parameter* +*parameters*. A *parameter* is a number you have to pick in advance that determines -some aspect of how the model behaves. For example, in the $K$-nearest neighbors +some aspect of how the model behaves. For example, in the K-nearest neighbors classification algorithm, $K$ is a parameter that we have to pick -that determines how many neighbors participate in the class vote. -By picking different values of $K$, we create different classifiers +that determines how many neighbors participate in the class vote. +By picking different values of $K$, we create different classifiers that make different predictions. -So then, how do we pick the *best* value of $K$, i.e., *tune* the model? +So then, how do we pick the *best* value of $K$, i.e., *tune* the model? And is it possible to make this selection in a principled way? In this book, -we will focus on maximizing the accuracy of the classifier. Ideally, +we will focus on maximizing the accuracy of the classifier. Ideally, we want somehow to maximize the accuracy of our classifier on data *it hasn't seen yet*. But we cannot use our test data set in the process of building our model. So we will play the same trick we did before when evaluating our classifier: we'll split our *training data itself* into two subsets, use one to train the model, and then use the other to evaluate it. -In this section, we will cover the details of this procedure, as well as +In this section, we will cover the details of this procedure, as well as how to use it to help you pick a good parameter value for your classifier. **And remember:** don't touch the test set during the tuning process. Tuning is a part of model training! @@ -873,12 +874,12 @@ how to use it to help you pick a good parameter value for your classifier. ```{index} validation set ``` -The first step in choosing the parameter $K$ is to be able to evaluate the +The first step in choosing the parameter $K$ is to be able to evaluate the classifier using only the training data. If this is possible, then we can compare -the classifier's performance for different values of $K$—and pick the best—using +the classifier's performance for different values of $K$—and pick the best—using only the training data. As suggested at the beginning of this section, we will accomplish this by splitting the training data, training on one subset, and evaluating -on the other. The subset of training data used for evaluation is often called the **validation set**. +on the other. The subset of training data used for evaluation is often called the **validation set**. There is, however, one key difference from the train/test split that we performed earlier. 
In particular, we were forced to make only a *single split* @@ -892,10 +893,10 @@ data *once*, our best parameter choice will depend strongly on whatever data was lucky enough to end up in the validation set. Perhaps using multiple different train/validation splits, we'll get a better estimate of accuracy, which will lead to a better choice of the number of neighbors $K$ for the -overall set of training data. +overall set of training data. Let's investigate this idea in Python! In particular, we will generate five different train/validation -splits of our overall training data, train five different $K$-nearest neighbors +splits of our overall training data, train five different K-nearest neighbors models, and evaluate their accuracy. We will start with just a single split. @@ -906,7 +907,7 @@ cancer_subtrain, cancer_validation = train_test_split( ) # fit the model on the sub-training data -knn = KNeighborsClassifier(n_neighbors=3) +knn = KNeighborsClassifier(n_neighbors=3) X = cancer_subtrain[["Smoothness", "Concavity"]] y = cancer_subtrain["Class"] knn_pipeline = make_pipeline(cancer_preprocessor, knn) @@ -931,7 +932,7 @@ for i in range(1, 5): ) # fit the model on the sub-training data - knn = KNeighborsClassifier(n_neighbors=3) + knn = KNeighborsClassifier(n_neighbors=3) X = cancer_subtrain[["Smoothness", "Concavity"]] y = cancer_subtrain["Class"] knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y) @@ -965,18 +966,18 @@ just five estimates of the true, underlying accuracy of our classifier built using our overall training data. We can combine the estimates by taking their average (here {glue:text}`avg_5_splits`%) to try to get a single assessment of our classifier's accuracy; this has the effect of reducing the influence of any one -(un)lucky validation set on the estimate. +(un)lucky validation set on the estimate. ```{index} cross-validation ``` In practice, we don't use random splits, but rather use a more structured splitting procedure so that each observation in the data set is used in a -validation set only a single time. The name for this strategy is +validation set only a single time. The name for this strategy is **cross-validation**. In **cross-validation**, we split our **overall training data** into $C$ evenly sized chunks. Then, iteratively use $1$ chunk as the -**validation set** and combine the remaining $C-1$ chunks -as the **training set**. +**validation set** and combine the remaining $C-1$ chunks +as the **training set**. This procedure is shown in {numref}`fig:06-cv-image`. Here, $C=5$ different chunks of the data set are used, resulting in 5 different choices for the **validation set**; we call this @@ -997,19 +998,19 @@ resulting in 5 different choices for the **validation set**; we call this ``` To perform 5-fold cross-validation in Python with `scikit-learn`, we use another -function: `cross_validate`. This function requires that we specify +function: `cross_validate`. This function requires that we specify a modelling `Pipeline` as the `estimator` argument, the number of folds as the `cv` argument, and the training data predictors and labels as the `X` and `y` arguments. Since the `cross_validate` function outputs a dictionary, we use `pd.DataFrame` to convert it to a `pandas` -dataframe for better visualization. +dataframe for better visualization. Note that the `cross_validate` function handles stratifying the classes in -each train and validate fold automatically. +each train and validate fold automatically. 
```{code-cell} ipython3 from sklearn.model_selection import cross_validate -knn = KNeighborsClassifier(n_neighbors=3) +knn = KNeighborsClassifier(n_neighbors=3) cancer_pipe = make_pipeline(cancer_preprocessor, knn) X = cancer_train[["Smoothness", "Concavity"]] y = cancer_train["Class"] @@ -1027,11 +1028,11 @@ cv_5_df The validation scores we are interested in are contained in the `test_score` column. We can then aggregate the *mean* and *standard error* -of the classifier's validation accuracy across the folds. -You should consider the mean (`mean`) to be the estimated accuracy, while the standard +of the classifier's validation accuracy across the folds. +You should consider the mean (`mean`) to be the estimated accuracy, while the standard error (`sem`) is a measure of how uncertain we are in that mean value. A detailed treatment of this is beyond the scope of this chapter; but roughly, if your estimated mean is {glue:text}`cv_5_mean` and standard -error is {glue:text}`cv_5_std`, you can expect the *true* average accuracy of the +error is {glue:text}`cv_5_std`, you can expect the *true* average accuracy of the classifier to be somewhere roughly between {glue:text}`cv_5_lower`% and {glue:text}`cv_5_upper`% (although it may fall outside this range). You may ignore the other columns in the metrics data frame. @@ -1066,13 +1067,13 @@ glue("cv_5_lower", ``` We can choose any number of folds, and typically the more we use the better our -accuracy estimate will be (lower standard error). However, we are limited +accuracy estimate will be (lower standard error). However, we are limited by computational power: the more folds we choose, the more computation it takes, and hence the more time it takes to run the analysis. So when you do cross-validation, you need to -consider the size of the data, the speed of the algorithm (e.g., $K$-nearest -neighbors), and the speed of your computer. In practice, this is a -trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here +consider the size of the data, the speed of the algorithm (e.g., K-nearest +neighbors), and the speed of your computer. In practice, this is a +trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we will try 10-fold cross-validation to see if we get a lower standard error. ```{code-cell} ipython3 @@ -1097,7 +1098,7 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt cv_10_metrics ``` -In this case, using 10-fold instead of 5-fold cross validation did +In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes you might even end up with a *higher* standard error when increasing the number of folds! We can make the reduction in standard error more dramatic by increasing the number of folds @@ -1135,19 +1136,19 @@ glue("cv_10_mean", "{:0.0f}".format(100 * cv_10_metrics.loc["mean", "test_score" ### Parameter value selection Using 5- and 10-fold cross-validation, we have estimated that the prediction -accuracy of our classifier is somewhere around {glue:text}`cv_10_mean`%. +accuracy of our classifier is somewhere around {glue:text}`cv_10_mean`%. Whether that is good or not depends entirely on the downstream application of the data analysis. In the present situation, we are trying to predict a tumor diagnosis, with expensive, damaging chemo/radiation therapy or patient death as potential consequences of -misprediction. 
Hence, we might like to -do better than {glue:text}`cv_10_mean`% for this application. +misprediction. Hence, we might like to +do better than {glue:text}`cv_10_mean`% for this application. In order to improve our classifier, we have one choice of parameter: the number of neighbors, $K$. Since cross-validation helps us evaluate the accuracy of our classifier, we can use cross-validation to calculate an accuracy for each value of $K$ in a reasonable range, and then pick the value of $K$ that gives us the -best accuracy. The `scikit-learn` package collection provides built-in +best accuracy. The `scikit-learn` package collection provides built-in functionality, named `GridSearchCV`, to automatically handle the details for us. Before we use `GridSearchCV`, we need to create a new pipeline with a `KNeighborsClassifier` that has the number of neighbors left unspecified. @@ -1159,8 +1160,8 @@ cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn) +++ -Next we specify the grid of parameter values that we want to try for -each tunable parameter. We do this in a Python dictionary: the key is +Next we specify the grid of parameter values that we want to try for +each tunable parameter. We do this in a Python dictionary: the key is the identifier of the parameter to tune, and the value is a list of parameter values to try when tuning. We can find the "identifier" of a parameter by using the `get_params` method on the pipeline. @@ -1183,7 +1184,7 @@ parameter_grid = { } ``` The `range` function in Python that we used above allows us to specify a sequence of values. -The first argument is the starting number (here, `1`), +The first argument is the starting number (here, `1`), the second argument is *one greater than* the final number (here, `100`), and the third argument is the number to values to skip between steps in the sequence (here, `5`). So in this case we generate the sequence 1, 6, 11, 16, ..., 96. @@ -1233,13 +1234,13 @@ accuracies_grid.info() There is a lot of information to look at here, but we are most interested in three quantities: the number of neighbors (`param_kneighbors_classifier__n_neighbors`), -the cross-validation accuracy estimate (`mean_test_score`), +the cross-validation accuracy estimate (`mean_test_score`), and the standard error of the accuracy estimate. Unfortunately `GridSearchCV` does not directly output the standard error for each cross-validation accuracy; but it *does* output the standard *deviation* (`std_test_score`). We can compute the standard error from the standard deviation by dividing it by the square -root of the number of folds, i.e., - +root of the number of folds, i.e., + $$\text{Standard Error} = \frac{\text{Standard Deviation}}{\sqrt{\text{Number of Folds}}}.$$ We will also rename the parameter name column to be a bit more readable, @@ -1298,14 +1299,14 @@ cancer_tune_grid.best_params_ +++ -Setting the number of +Setting the number of neighbors to $K =$ {glue:text}`best_k_unique` provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here; any selection from $K = 30$ to $80$ or so would be reasonably justified, as all of these differ in classifier accuracy by a small amount. Remember: the values you see on this plot are *estimates* of the true accuracy of our -classifier. Although the -$K =$ {glue:text}`best_k_unique` value is +classifier. 
Although the +$K =$ {glue:text}`best_k_unique` value is higher than the others on this plot, that doesn't mean the classifier is actually more accurate with this parameter value! Generally, when selecting $K$ (and other parameters for other predictive @@ -1315,8 +1316,8 @@ models), we are looking for a value where: - changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty; - the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!). -We know that $K =$ {glue:text}`best_k_unique` -provides the highest estimated accuracy. Further, {numref}`fig:06-find-k` shows that the estimated accuracy +We know that $K =$ {glue:text}`best_k_unique` +provides the highest estimated accuracy. Further, {numref}`fig:06-find-k` shows that the estimated accuracy changes by only a small amount if we increase or decrease $K$ near $K =$ {glue:text}`best_k_unique`. And finally, $K =$ {glue:text}`best_k_unique` does not create a prohibitively expensive computational cost of training. Considering these three points, we would indeed select @@ -1327,9 +1328,9 @@ $K =$ {glue:text}`best_k_unique` for the classifier. ### Under/Overfitting To build a bit more intuition, what happens if we keep increasing the number of -neighbors $K$? In fact, the cross-validation accuracy estimate actually starts to decrease! -Let's specify a much larger range of values of $K$ to try in the `param_grid` -argument of `GridSearchCV`. {numref}`fig:06-lots-of-ks` shows a plot of estimated accuracy as +neighbors $K$? In fact, the cross-validation accuracy estimate actually starts to decrease! +Let's specify a much larger range of values of $K$ to try in the `param_grid` +argument of `GridSearchCV`. {numref}`fig:06-lots-of-ks` shows a plot of estimated accuracy as we vary $K$ from 1 to almost the number of observations in the data set. ```{code-cell} ipython3 @@ -1454,11 +1455,11 @@ plot_list = [] for k in [1, 7, 20, 300]: cancer_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier(n_neighbors=k)) cancer_pipe.fit(X, y) - + knnPredGrid = cancer_pipe.predict(scgrid) prediction_table = scgrid.copy() prediction_table["Class"] = knnPredGrid - + # add a prediction layer prediction_plot = ( alt.Chart( @@ -1523,10 +1524,10 @@ set the number of neighbors $K$ to 1, 7, 20, and 300. ### Evaluating on the test set -Now that we have tuned the KNN classifier and set $K =$ {glue:text}`best_k_unique`, +Now that we have tuned the K-NN classifier and set $K =$ {glue:text}`best_k_unique`, we are done building the model and it is time to evaluate the quality of its predictions on the held out test data, as we did earlier in {numref}`eval-performance-clasfcn2`. -We first need to retrain the KNN classifier +We first need to retrain the K-NN classifier on the entire training data set using the selected number of neighbors. Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the @@ -1615,13 +1616,13 @@ maximize accuracy are not necessarily better for a given application. ## Summary Classification algorithms use one or more quantitative variables to predict the -value of another categorical variable. In particular, the $K$-nearest neighbors +value of another categorical variable. 
In particular, the K-nearest neighbors algorithm does this by first finding the $K$ points in the training data nearest to the new observation, and then returning the majority class vote from those training observations. We can tune and evaluate a classifier by splitting the data randomly into a training and test data set. The training set is used to build the classifier, and we can tune the classifier (e.g., select the number -of neighbors in $K$-nearest neighbors) by maximizing estimated accuracy via +of neighbors in K-nearest neighbors) by maximizing estimated accuracy via cross-validation. After we have tuned the model, we can use the test set to estimate its accuracy. The overall process is summarized in {numref}`fig:06-overview`. @@ -1631,7 +1632,7 @@ estimate its accuracy. The overall process is summarized in ```{figure} img/classification2/train-test-overview.jpeg :name: fig:06-overview -Overview of KNN classification. +Overview of K-NN classification. ``` +++ @@ -1639,29 +1640,29 @@ Overview of KNN classification. ```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification ``` -The overall workflow for performing $K$-nearest neighbors classification using `scikit-learn` is as follows: +The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows: -1. Use the `train_test_split` function to split the data into a training and test set. Set the `stratify` argument to the class label column of the dataframe. Put the test set aside for now. -2. Create a `Pipeline` that specifies the preprocessing steps and the classifier. -3. Define the parameter grid by passing the set of $K$ values that you would like to tune. +1. Use the `train_test_split` function to split the data into a training and test set. Set the `stratify` argument to the class label column of the dataframe. Put the test set aside for now. +2. Create a `Pipeline` that specifies the preprocessing steps and the classifier. +3. Define the parameter grid by passing the set of $K$ values that you would like to tune. 4. Use `GridSearchCV` to estimate the classifier accuracy for a range of $K$ values. Pass the pipeline and parameter grid defined in steps 2. and 3. as the `param_grid` argument and the `estimator` argument, respectively. 5. Execute the grid search by passing the training data to the `fit` method on the `GridSearchCV` instance created in step 4. 6. Pick a value of $K$ that yields a high cross-validation accuracy estimate that doesn't change much if you change $K$ to a nearby value. 7. Create a new model object for the best parameter value (i.e., $K$), and retrain the classifier by calling the `fit` method. 8. Evaluate the estimated accuracy of the classifier on the test set using the `score` method. -In these last two chapters, we focused on the $K$-nearest neighbor algorithm, -but there are many other methods we could have used to predict a categorical label. -All algorithms have their strengths and weaknesses, and we summarize these for -the $K$-NN here. +In these last two chapters, we focused on the K-nearest neighbors algorithm, +but there are many other methods we could have used to predict a categorical label. +All algorithms have their strengths and weaknesses, and we summarize these for +the K-NN here. -**Strengths:** $K$-nearest neighbors classification +**Strengths:** K-nearest neighbors classification 1. is a simple, intuitive algorithm, 2. requires few assumptions about what the data must look like, and 3. 
works for binary (two-class) and multi-class (more than 2 classes) classification problems. -**Weaknesses:** $K$-nearest neighbors classification +**Weaknesses:** K-nearest neighbors classification 1. becomes very slow as the training data gets larger, 2. may not perform well with a large number of predictors, and @@ -1672,7 +1673,7 @@ the $K$-NN here. ## Predictor variable selection ```{note} -This section is not required reading for the remainder of the textbook. It is included for those readers +This section is not required reading for the remainder of the textbook. It is included for those readers interested in learning how irrelevant variables can influence the performance of a classifier, and how to pick a subset of useful variables to include as predictors. ``` @@ -1683,7 +1684,7 @@ pick a subset of useful variables to include as predictors. Another potentially important part of tuning your classifier is to choose which variables from your data will be treated as predictor variables. Technically, you can choose anything from using a single predictor variable to using every variable in your -data; the $K$-nearest neighbors algorithm accepts any number of +data; the K-nearest neighbors algorithm accepts any number of predictors. However, it is **not** the case that using more predictors always yields better predictions! In fact, sometimes including irrelevant predictors can actually negatively affect classifier performance. @@ -1692,13 +1693,13 @@ actually negatively affect classifier performance. ### The effect of irrelevant predictors -Let's take a look at an example where $K$-nearest neighbors performs +Let's take a look at an example where K-nearest neighbors performs worse when given more predictors to work with. In this example, we modified the breast cancer data to have only the `Smoothness`, `Concavity`, and `Perimeter` variables from the original data. Then, we added irrelevant variables that we created ourselves using a random number generator. The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless -of what the value `Class` variable takes. In other words, the irrelevant variables have +of what the value `Class` variable takes. In other words, the irrelevant variables have no meaningful relationship with the `Class` variable. ```{code-cell} ipython3 @@ -1721,7 +1722,7 @@ cancer_irrelevant[ ] ``` -Next, we build a sequence of KNN classifiers that include `Smoothness`, +Next, we build a sequence of K-NN classifiers that include `Smoothness`, `Concavity`, and `Perimeter` as predictor variables, but also increasingly many irrelevant variables. In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. Then we build a model, tuned via 5-fold cross-validation, for each data set. @@ -1825,12 +1826,12 @@ glue("fig:06-performance-irrelevant-features", plt_irrelevant_accuracies) Effect of inclusion of irrelevant predictors. ::: -Although the accuracy decreases as expected, one surprising thing about +Although the accuracy decreases as expected, one surprising thing about {numref}`fig:06-performance-irrelevant-features` is that it shows that the method -still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy) +still outperforms the baseline majority classifier (with about {glue:text}`cancer_train_b_prop`% accuracy) even with 40 irrelevant variables. How could that be? 
{numref}`fig:06-neighbors-irrelevant-features` provides the answer: -the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables +the tuning procedure for the K-nearest neighbors classifier combats the extra randomness from the irrelevant variables by increasing the number of neighbors. Of course, because of all the extra noise in the data from the irrelevant variables, the number of neighbors does not increase smoothly; but the general trend is increasing. {numref}`fig:06-fixed-irrelevant-features` corroborates this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly. @@ -1893,17 +1894,17 @@ Accuracy versus number of irrelevant predictors for tuned and untuned number of ### Finding a good subset of predictors -So then, if it is not ideal to use all of our variables as predictors without consideration, how +So then, if it is not ideal to use all of our variables as predictors without consideration, how do we choose which variables we *should* use? A simple method is to rely on your scientific understanding of the data to tell you which variables are not likely to be useful predictors. For example, in the cancer data that we have been studying, the `ID` variable is just a unique identifier for the observation. As it is not related to any measured property of the cells, the `ID` variable should therefore not be used -as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables -is less obvious, as all seem like reasonable candidates. It +as a predictor. That is, of course, a very clear-cut case. But the decision for the remaining variables +is less obvious, as all seem like reasonable candidates. It is not clear which subset of them will create the best classifier. One could use visualizations and other exploratory analyses to try to help understand which variables are potentially relevant, but this process is both time-consuming and error-prone when there are many variables to consider. -Therefore we need a more systematic and programmatic way of choosing variables. +Therefore we need a more systematic and programmatic way of choosing variables. This is a very difficult problem to solve in general, and there are a number of methods that have been developed that apply in particular cases of interest. Here we will discuss two basic @@ -1918,15 +1919,15 @@ this chapter to find out where you can learn more about variable selection, incl The first idea you might think of for a systematic way to select predictors is to try all possible subsets of predictors and then pick the set that results in the "best" classifier. -This procedure is indeed a well-known variable selection method referred to -as *best subset selection* {cite:p}`bealesubset,hockingsubset`. +This procedure is indeed a well-known variable selection method referred to +as *best subset selection* {cite:p}`bealesubset,hockingsubset`. In particular, you 1. create a separate model for every possible subset of predictors, 2. tune each one using cross-validation, and -3. pick the subset of predictors that gives you the highest cross-validation accuracy. +3. pick the subset of predictors that gives you the highest cross-validation accuracy. -Best subset selection is applicable to any classification method ($K$-NN or otherwise). +Best subset selection is applicable to any classification method (K-NN or otherwise). 
However, it becomes very slow when you have even a moderate number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets grows very quickly with the number of predictors, and you have to train the model (itself @@ -1934,17 +1935,17 @@ a slow process!) for each one. For example, if we have 2 predictors—let's them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A and B together. If we have 3 predictors—A, B, and C—then we have 7 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models -we have to train for $m$ predictors is $2^m-1$; in other words, when we -get to 10 predictors we have over *one thousand* models to train, and -at 20 predictors we have over *one million* models to train! -So although it is a simple method, best subset selection is usually too computationally +we have to train for $m$ predictors is $2^m-1$; in other words, when we +get to 10 predictors we have over *one thousand* models to train, and +at 20 predictors we have over *one million* models to train! +So although it is a simple method, best subset selection is usually too computationally expensive to use in practice. ```{index} variable selection; forward ``` -Another idea is to iteratively build up a model by adding one predictor variable -at a time. This method—known as *forward selection* {cite:p}`forwardefroymson,forwarddraper`—is also widely +Another idea is to iteratively build up a model by adding one predictor variable +at a time. This method—known as *forward selection* {cite:p}`forwardefroymson,forwarddraper`—is also widely applicable and fairly straightforward. It involves the following steps: 1. Start with a model having no predictors. @@ -1965,13 +1966,13 @@ training over 1000 candidate models with 10 predictors, forward selection requir Therefore we will continue the rest of this section using forward selection. ```{note} -One word of caution before we move on. Every additional model that you train -increases the likelihood that you will get unlucky and stumble +One word of caution before we move on. Every additional model that you train +increases the likelihood that you will get unlucky and stumble on a model that has a high cross-validation accuracy estimate, but a low true accuracy on the test data and other future observations. Since forward selection involves training a lot of models, you run a fairly high risk of this happening. To keep this risk low, only use forward selection -when you have a large amount of data and a relatively small total number of +when you have a large amount of data and a relatively small total number of predictors. More advanced methods do not suffer from this problem as much; see the additional resources at the end of this chapter for where to learn more about advanced predictor selection methods. @@ -1980,7 +1981,7 @@ where to learn more about advanced predictor selection methods. +++ ### Forward selection in `scikit-learn` - + We now turn to implementing forward selection in Python. First we will extract a smaller set of predictors to work with in this illustrative example—`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`—as well as the `Class` variable as the label. 
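Before looking at the implementation, here is a quick check on the model counts discussed above. The sketch below tallies how many candidate models best subset selection ($2^m - 1$) and forward selection (at most $m + (m-1) + \cdots + 1$, if it is run until every predictor has been added) would require for a few values of $m$; the forward-selection count is our own arithmetic based on the algorithm steps listed above, not a figure quoted from the chapter.

```python
# number of candidate models trained for m predictors
for m in [2, 3, 10, 20]:
    best_subset = 2**m - 1        # every non-empty subset of the m predictors
    forward = m * (m + 1) // 2    # m choices in round 1, then m-1, ..., then 1
    print(f"m = {m:2d}: best subset = {best_subset:>9,}, forward selection = {forward}")
```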
@@ -2007,12 +2008,12 @@ cancer_subset ``` To perform forward selection, we could use the -[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) -from `scikit-learn`; but it is difficult to combine this approach with parameter tuning to find a good number of neighbors -for each set of features. Instead we will code the forward selection algorithm manually. -In particular, we need code that tries adding each available predictor to a model, finding the best, and iterating. +[`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) +from `scikit-learn`; but it is difficult to combine this approach with parameter tuning to find a good number of neighbors +for each set of features. Instead we will code the forward selection algorithm manually. +In particular, we need code that tries adding each available predictor to a model, finding the best, and iterating. If you recall the end of the wrangling chapter, we mentioned -that sometimes one needs more flexible forms of iteration than what +that sometimes one needs more flexible forms of iteration than what we have used earlier, and in these cases one typically resorts to a *for loop*; see the [control flow section](https://wesmckinney.com/book/python-basics.html#control_for) in @@ -2022,7 +2023,7 @@ Here we will use two for loops: one over increasing predictor set sizes and another to check which predictor to add in each round (where you see `for j in range(len(names))` below). For each set of predictors to try, we extract the subset of predictors, pass it into a preprocessor, build a `Pipeline` that tunes -a K-NN classifier using 10-fold cross-validation, +a K-NN classifier using 10-fold cross-validation, and finally records the estimated accuracy. ```{code-cell} ipython3 @@ -2047,19 +2048,19 @@ cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier()) cancer_tune_grid = GridSearchCV( estimator=cancer_tune_pipe, param_grid=param_grid, - cv=10, + cv=10, n_jobs=-1 ) # for every possible number of predictors for i in range(1, n_total + 1): - accs = np.zeros(len(names)) + accs = np.zeros(len(names)) # for every possible predictor to add for j in range(len(names)): # Add remaining predictor j to the model X = cancer_subset[selected + [names[j]]] y = cancer_subset["Class"] - + # Find the best K for this set of predictors cancer_tune_grid.fit(X, y) accuracies_grid = pd.DataFrame(cancer_tune_grid.cv_results_) @@ -2067,14 +2068,14 @@ for i in range(1, n_total + 1): # Store the tuned accuracy for this set of predictors accs[j] = accuracies_grid["mean_test_score"].max() - # get the best new set of predictors that maximize cv accuracy + # get the best new set of predictors that maximize cv accuracy best_set = selected + [names[accs.argmax()]] - + # store the results for this round of forward selection accuracy_dict["size"].append(i) accuracy_dict["selected_predictors"].append(", ".join(best_set)) accuracy_dict["accuracy"].append(accs.max()) - + # update the selected & available sets of predictors selected = best_set del names[accs.argmax()] @@ -2091,14 +2092,14 @@ Interesting! The forward selection procedure first added the three meaningful va visualizes the accuracy versus the number of predictors in the model. 
You can see that as meaningful predictors are added, the estimated accuracy increases substantially; and as you add irrelevant variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number -of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have -to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). -The way to find that balance is to look for the *elbow* +of neighbors to account for the extra noise. In order to pick the right model from the sequence, you have +to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). +The way to find that balance is to look for the *elbow* in {numref}`fig:06-fwdsel-3`, i.e., the place on the plot where the accuracy stops increasing dramatically and -levels off or begins to decrease. The elbow in {numref}`fig:06-fwdsel-3` appears to occur at the model with +levels off or begins to decrease. The elbow in {numref}`fig:06-fwdsel-3` appears to occur at the model with 3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors occurs with 3 variables: `Perimeter, Concavity, Smoothness`. In other words, we have successfully removed irrelevant -predictors from the model! It is always worth remembering, however, that what cross-validation gives you +predictors from the model! It is always worth remembering, however, that what cross-validation gives you is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy. @@ -2131,13 +2132,13 @@ Estimated accuracy versus the number of predictors for the sequence of models bu ```{note} Since the choice of which variables to include as predictors is part of tuning your classifier, you *cannot use your test data* for this -process! +process! ``` ## Exercises -Practice exercises for the material covered in this chapter -can be found in the accompanying +Practice exercises for the material covered in this chapter +can be found in the accompanying [worksheets repository](https://worksheets.python.datasciencebook.ca) in the "Classification II: evaluation and tuning" row. You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button. @@ -2155,15 +2156,15 @@ and guidance that the worksheets provide will function as intended. - The [`scikit-learn` website](https://scikit-learn.org/stable/) is an excellent reference for more details on, and advanced usage of, the functions and - packages in the past two chapters. Aside from that, it also offers many - useful [tutorials](https://scikit-learn.org/stable/tutorial/index.html) - to get you started. It's worth noting that the `scikit-learn` package + packages in the past two chapters. Aside from that, it also offers many + useful [tutorials](https://scikit-learn.org/stable/tutorial/index.html) + to get you started. It's worth noting that the `scikit-learn` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those - chapters. 
-- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) {cite:p}`james2013introduction` provides + chapters. +- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) {cite:p}`james2013introduction` provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear @@ -2174,7 +2175,7 @@ and guidance that the worksheets provide will function as intended. variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require. - + ## References +++ diff --git a/source/clustering.md b/source/clustering.md index dc1c6759..7dc7815a 100755 --- a/source/clustering.md +++ b/source/clustering.md @@ -20,7 +20,7 @@ kernelspec: # get rid of futurewarnings from sklearn kmeans import warnings -warnings.simplefilter(action='ignore', category=FutureWarning) +warnings.simplefilter(action='ignore', category=FutureWarning) from chapter_preamble import * ``` @@ -39,16 +39,17 @@ including techniques to choose the number of clusters. By the end of the chapter, readers will be able to do the following: -* Describe a case where clustering is appropriate, +- Describe a situation in which clustering is an appropriate technique to use, and what insight it might extract from the data. -* Explain the K-means clustering algorithm. -* Interpret the output of a K-means analysis. -* Differentiate between clustering and classification. -* Identify when it is necessary to scale variables before clustering and do this using Python -* Perform k-means clustering in Python using `scikit-learn` -* Use the elbow method to choose the number of clusters for K-means. -* Visualize the output of k-means clustering in Python using a coloured scatter plot -* Describe advantages, limitations and assumptions of the kmeans clustering algorithm. +- Explain the K-means clustering algorithm. +- Interpret the output of a K-means analysis. +- Differentiate between clustering, classification, and regression. +- Identify when it is necessary to scale variables before clustering, and do this using Python. +- Perform K-means clustering in Python using `scikit-learn`. +- Use the elbow method to choose the number of clusters for K-means. +- Visualize the output of K-means clustering in Python using a colored scatter plot. +- Describe advantages, limitations and assumptions of the K-means clustering algorithm. + ## Clustering @@ -130,7 +131,7 @@ In this chapter we will focus on a data set from [the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) {cite:p}`palmerpenguins`. This data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Site, and includes -measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`. +measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`. Our goal will be to use two variables—penguin bill and flipper length, both in millimeters—to determine whether there are distinct types of penguins in our data. @@ -834,7 +835,7 @@ kmeans To actually run the K-means clustering, we combine the preprocessor and model object in a `Pipeline`, and use the `fit` function. 
Note that the K-means -algorithm uses a random initialization of assignments, but since we set +algorithm uses a random initialization of assignments, but since we set the random seed in the beginning of this chapter, the clustering will be reproducible. ```{code-cell} ipython3 @@ -848,14 +849,14 @@ penguin_clust ```{index} K-means; inertia_, K-means; cluster_centers_, K-means; labels_, K-means; predict ``` -The fit `KMeans` object—which is the second item in the +The fit `KMeans` object—which is the second item in the pipeline, and can be accessed as `penguin_clust[1]`—has a lot of information that can be used to visualize the clusters, pick K, and evaluate the total WSSD. -Let's start by visualizing the clusters as a colored scatter plot! In -order to do that, we first need to augment our -original `penguins` data frame with the cluster assignments. -We can access these using the `labels_` attribute of the clustering object -("labels" is a common alternative term to "assignments" in clustering), and +Let's start by visualizing the clusters as a colored scatter plot! In +order to do that, we first need to augment our +original `penguins` data frame with the cluster assignments. +We can access these using the `labels_` attribute of the clustering object +("labels" is a common alternative term to "assignments" in clustering), and add them to the data frame. ```{code-cell} ipython3 @@ -863,9 +864,9 @@ penguins["cluster"] = penguin_clust[1].labels_ penguins ``` -Now that we have the cluster assignments included in the `penguins` data frame, we can +Now that we have the cluster assignments included in the `penguins` data frame, we can visualize them as shown in {numref}`cluster_plot`. -Note that we are plotting the *un-standardized* data here; if we for some reason wanted to +Note that we are plotting the *un-standardized* data here; if we for some reason wanted to visualize the *standardized* data, we would need to use the `fit` and `transform` functions on the `StandardScaler` preprocessor directly to obtain that first. As in {numref}`Chapter %s `, @@ -912,7 +913,7 @@ penguin_clust[1].inertia_ To calculate the total WSSD for a variety of Ks, we will create a data frame that contains different values of `k` -and the WSSD of running KMeans with each values of k. +and the WSSD of running K-means with each values of k. To create this dataframe, we will use what is called a "list comprehension" in Python, where we repeat an operation multiple times @@ -934,10 +935,10 @@ we could square all the numbers from 1-4 and store them in a list: Next, we will use this approach to compute the WSSD for the K-values 1 through 9. For each value of K, -we create a new KMeans model +we create a new `KMeans` model and wrap it in a `scikit-learn` pipeline with the preprocessor we created earlier. -We store the WSSD values in a list that we will use to create a dataframe +We store the WSSD values in a list that we will use to create a dataframe of both the K-values and their corresponding WSSDs. ```{note} @@ -954,7 +955,7 @@ it is always the safest to assign it to a variable name for reuse. ks = range(1, 10) wssds = [ make_pipeline( - preprocessor, + preprocessor, KMeans(n_clusters=k) # Create a new KMeans model with `k` clusters ).fit(penguins)[1].inertia_ for k in ks @@ -1008,7 +1009,7 @@ due to an unlucky initialization of the initial center positions as we mentioned earlier in the chapter. 
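If an unlucky initialization does produce a bump, one common guard (shown here as a sketch that reuses the `ks`, `preprocessor`, and `penguins` objects from above) is to let `KMeans` restart from several random center positions and keep the best run via its `n_init` parameter.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Recompute the WSSDs, but let each KMeans model try 10 random
# initializations and keep the run with the lowest total WSSD.
wssds_restarted = [
    make_pipeline(
        preprocessor,
        KMeans(n_clusters=k, n_init=10)
    ).fit(penguins)[1].inertia_
    for k in ks
]
```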
```{note} -It is rare that the KMeans function from `scikit-learn` +It is rare that the implementation of K-means from `scikit-learn` gets stuck in a bad solution, because `scikit-learn` tries to choose the initial centers carefully to prevent this from happening. If you still find yourself in a situation where you have a bump in the elbow plot, diff --git a/source/index.md b/source/index.md index e75806a1..454fa914 100755 --- a/source/index.md +++ b/source/index.md @@ -15,7 +15,7 @@ kernelspec: ![](img/frontmatter/ds-a-first-intro-graphic.jpg) -# Data Science +# Data Science ## *A First Introduction (Python Edition)* diff --git a/source/inference.md b/source/inference.md index 6e89cc1d..dfb36c07 100755 --- a/source/inference.md +++ b/source/inference.md @@ -36,16 +36,16 @@ populations and then introduce two common techniques in statistical inference: By the end of the chapter, readers will be able to do the following: -* Describe real-world examples of questions that can be answered with statistical inference. -* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. -* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution). -* Explain the difference between a population parameter and a sample point estimate. -* Use Python to draw random samples from a finite population. -* Use Python to create a sampling distribution from a finite population. -* Describe how sample size influences the sampling distribution. -* Define bootstrapping. -* Use Python to create a bootstrap distribution to approximate a sampling distribution. -* Contrast the bootstrap and sampling distributions. +- Describe real-world examples of questions that can be answered with statistical inference. +- Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample. +- Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution. +- Explain the difference between a population parameter and a sample point estimate. +- Use Python to draw random samples from a finite population. +- Use Python to create a sampling distribution from a finite population. +- Describe how sample size influences the sampling distribution. +- Define bootstrapping. +- Use Python to create a bootstrap distribution to approximate a sampling distribution. +- Contrast the bootstrap and sampling distributions. +++ @@ -317,7 +317,7 @@ with the `name` parameter: ``` Below we put everything together -and also filter the data frame to keep only the room types +and also filter the data frame to keep only the room types that we are interested in. ```{code-cell} ipython3 @@ -776,7 +776,7 @@ How large is "large enough?" Unfortunately, it depends entirely on the problem a as a rule of thumb, often a sample size of at least 20 will suffice. ``` - ```python import requests @@ -1490,14 +1491,14 @@ import json with open("data/nasa.json", "r") as f: nasa_data = json.load(f) # the last entry in the stored data is July 13, 2023, so print that -nasa_data[-1] +nasa_data[-1] ``` We can obtain more records at once by using the `start_date` and `end_date` parameters, as shown in the table of parameters in {numref}`fig:NASA-API-parameters`. 
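As a rough sketch of what such a range query looks like with the `requests` package, the code below asks for one week of records; the endpoint URL and the placeholder API key are assumptions for illustration, so substitute the URL and key you are actually using (see {numref}`fig:NASA-API-parameters` for the available parameters).

```python
import requests

# Query a one-week range of records. The endpoint URL and API key below are
# placeholders for illustration only.
nasa_week = requests.get(
    "https://api.nasa.gov/planetary/apod",
    params={
        "api_key": "YOUR_API_KEY",
        "start_date": "2023-05-01",
        "end_date": "2023-05-07",
    },
).json()

len(nasa_week)  # 7, one record per day in the requested range
```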
Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result in an object called `nasa_data`; now the response -will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), +will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), and there will be 74 items total, one for each day between the start and end dates: ```python diff --git a/source/regression1.md b/source/regression1.md index 0d66fe4c..d7b23af3 100755 --- a/source/regression1.md +++ b/source/regression1.md @@ -33,7 +33,7 @@ This is unlike the past two chapters, which focused on predicting categorical variables via classification. However, regression does have many similarities to classification: for example, just as in the case of classification, we will split our data into training, validation, and test sets, we will -use `scikit-learn` workflows, we will use a K-nearest neighbors (KNN) +use `scikit-learn` workflows, we will use a K-nearest neighbors (K-NN) approach to make predictions, and we will use cross-validation to choose K. Because of how similar these procedures are, make sure to read {numref}`Chapters %s ` and {numref}`%s ` before reading @@ -51,14 +51,15 @@ however that is beyond the scope of this book. ## Chapter learning objectives By the end of the chapter, readers will be able to do the following: -* Recognize situations where a simple regression analysis would be appropriate for making predictions. -* Explain the K-nearest neighbor (KNN) regression algorithm and describe how it differs from KNN classification. -* Interpret the output of a KNN regression. -* In a data set with two or more variables, perform K-nearest neighbor regression in Python using a `scikit-learn` workflow. -* Execute cross-validation in Python to choose the number of neighbors. -* Evaluate KNN regression prediction accuracy in Python using a test data set and the root mean squared prediction error (RMSPE). -* In the context of KNN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE). -* Describe the advantages and disadvantages of K-nearest neighbors regression. +- Recognize situations where a regression analysis would be appropriate for making predictions. +- Explain the K-nearest neighbors (K-NN) regression algorithm and describe how it differs from K-NN classification. +- Interpret the output of a K-NN regression. +- In a data set with two or more variables, perform K-nearest neighbors regression in Python. +- Evaluate K-NN regression prediction quality in Python using the root mean squared prediction error (RMSPE). +- Estimate the RMSPE in Python using cross-validation or a test set. +- Choose the number of neighbors in K-nearest neighbors regression by minimizing estimated cross-validation RMSPE. +- Describe underfitting and overfitting, and relate it to the number of neighbors in K-nearest neighbors regression. +- Describe the advantages and disadvantages of K-nearest neighbors regression. +++ @@ -220,10 +221,10 @@ Much like in the case of classification, we can use a K-nearest neighbors-based approach in regression to make predictions. 
Let's take a small sample of the data in {numref}`fig:07-edaRegr` -and walk through how K-nearest neighbors (KNN) works +and walk through how K-nearest neighbors (K-NN) works in a regression context before we dive in to creating our model and assessing how well it predicts house sale price. This subsample is taken to allow us to -illustrate the mechanics of KNN regression with a few data points; later in +illustrate the mechanics of K-NN regression with a few data points; later in this chapter we will use all the data. ```{index} pandas.DataFrame; sample @@ -371,12 +372,12 @@ Our predicted price is \${glue:text}`knn-5-pred` (shown as a red point in {numref}`fig:07-predictedViz-knn`), which is much less than \$350,000; perhaps we might want to offer less than the list price at which the house is advertised. But this is only the very beginning of the story. We still have all the same -unanswered questions here with KNN regression that we had with KNN +unanswered questions here with K-NN regression that we had with K-NN classification: which $K$ do we choose, and is our model any good at making predictions? In the next few sections, we will address these questions in the -context of KNN regression. +context of K-NN regression. -One strength of the KNN regression algorithm +One strength of the K-NN regression algorithm that we would like to draw attention to at this point is its ability to work well with non-linear relationships (i.e., if the relationship is not a straight line). @@ -384,7 +385,7 @@ This stems from the use of nearest neighbors to predict values. The algorithm really has very few assumptions about what the data must look like for it to work. -+++ ++++ ## Training, evaluating, and tuning the model @@ -427,11 +428,11 @@ sacramento_train, sacramento_test = train_test_split( ```{index} see: root mean square prediction error; RMSPE ``` -Next, we'll use cross-validation to choose $K$. In KNN classification, we used +Next, we'll use cross-validation to choose $K$. In K-NN classification, we used accuracy to see how well our predictions matched the true labels. We cannot use the same metric in the regression setting, since our predictions will almost never *exactly* match the true response variable values. Therefore in the -context of KNN regression we will use root mean square prediction error (RMSPE) instead. +context of K-NN regression we will use root mean square prediction error (RMSPE) instead. The mathematical formula for calculating RMSPE is: $$\text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2}$$ @@ -524,7 +525,7 @@ Scatter plot of price (USD) versus house size (square feet) with example predict ```{note} When using many code packages, the evaluation output we will get to assess the prediction quality of -our KNN regression models is labeled "RMSE", or "root mean squared +our K-NN regression models is labeled "RMSE", or "root mean squared error". Why is this so, and why not RMSPE? In statistics, we try to be very precise with our language to indicate whether we are calculating the prediction error on the @@ -553,10 +554,10 @@ opposed to the classification problems from the previous chapters. The use of different metrics (instead of accuracy) for tuning and evaluation. Next we specify a parameter grid containing numbers of neighbors ranging from 1 to 200. Then we create a 5-fold `GridSearchCV` object, and -pass in the pipeline and parameter grid. +pass in the pipeline and parameter grid. 
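In outline, that setup might look like the sketch below; the pipeline contents, object names, and column names are assumptions that follow the chapter's Sacramento example rather than its exact code, and the purpose of the `scoring` argument is explained immediately below.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# Pipeline: standardize the predictor, then fit a K-NN regression model.
sacr_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Numbers of neighbors from 1 to 200.
param_grid = {"kneighborsregressor__n_neighbors": range(1, 201)}

sacr_gridsearch = GridSearchCV(
    estimator=sacr_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # see the discussion that follows
)

sacr_gridsearch.fit(
    sacramento_train[["sqft"]],  # predictors as a data frame
    sacramento_train["price"],   # response as a series
)
```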
There is one additional slight complication: unlike classification models in `scikit-learn`---which by default use accuracy for tuning, as desired---regression models in `scikit-learn` -do not use the RMSPE for tuning by default. +do not use the RMSPE for tuning by default. So we need to specify that we want to use the RMSPE for tuning by setting the `scoring` argument to `"neg_root_mean_squared_error"`. @@ -570,7 +571,7 @@ of `sacr_pipeline.get_params()`, as we did in {numref}`Chapter %s `, we can obtain a data frame with a subset of columns by passing a list of column names; `["sqft"]` is a list with one -item, so we obtain a data frame with one column. If instead we used +item, so we obtain a data frame with one column. If instead we used just one bracket (`sacramento_train["sqft"]`), we would obtain a series. In `scikit-learn`, it is easier to work with the input features as a data frame rather than a series, so we opt for two brackets here. On the other hand, the response variable @@ -602,7 +603,7 @@ can be a series, so we use just one bracket there (`sacramento_train["price"]`). As in {numref}`Chapter %s `, once the model has been fit we will wrap the `cv_results_` output in a data frame, extract -only the relevant columns, compute the standard error based on 5 folds, +only the relevant columns, compute the standard error based on 5 folds, and rename the parameter column to be more readable. @@ -630,7 +631,7 @@ sacr_results In the `sacr_results` results data frame, we see that the `n_neighbors` variable contains the values of $K$, and `mean_test_score` variable contains the value of the RMSPE estimated via -cross-validation...Wait a moment! Isn't the RMSPE supposed to be nonnegative? +cross-validation...Wait a moment! Isn't the RMSPE supposed to be nonnegative? Recall that when we specified the `scoring` argument in the `GridSearchCV` object, we used the value `"neg_root_mean_squared_error"`. See the `neg_` at the start? That stands for *negative*! As it turns out, `scikit-learn` always tries to *maximize* a score @@ -644,7 +645,7 @@ sacr_results["mean_test_score"] = -sacr_results["mean_test_score"] sacr_results ``` -Alright, now the `mean_test_score` variable actually has values of the RMSPE +Alright, now the `mean_test_score` variable actually has values of the RMSPE for different numbers of neighbors. Finally, the `sem_test_score` variable contains the standard error of our cross-validation RMSPE estimate, which is a measure of how uncertain we are in the mean value. Roughly, if @@ -687,7 +688,7 @@ glue("fig:07-choose-k-knn-plot", sacr_tunek_plot, display=False) Effect of the number of neighbors on the RMSPE. ::: -To see which parameter value corresponds to the minimum RMSPE, +To see which parameter value corresponds to the minimum RMSPE, we can also access the `best_params_` attribute of the original fit `GridSearchCV` object. Note that it is still useful to visualize the results as we did above since this provides additional information on how the model performance varies. @@ -705,7 +706,7 @@ to be too small or too large, we cause the RMSPE to increase, as shown in {numref}`fig:07-howK` visualizes the effect of different settings of $K$ on the regression model. Each plot shows the predicted values for house sale price from -our KNN regression model for 6 different values for $K$: 1, 3, 25, {glue:text}`best_k_sacr`, 250, and 699 (i.e., all of the training data). 
+our K-NN regression model for 6 different values for $K$: 1, 3, 25, {glue:text}`best_k_sacr`, 250, and 699 (i.e., all of the training data). For each model, we predict prices for the range of possible home sizes we observed in the data set (here 500 to 5,000 square feet) and we plot the predicted prices as a orange line. @@ -765,7 +766,7 @@ glue( :::{glue:figure} fig:07-howK :name: fig:07-howK -Predicted values for house price (represented as a orange line) from KNN regression models for six different values for $K$. +Predicted values for house price (represented as a orange line) from K-NN regression models for six different values for $K$. ::: +++ @@ -823,17 +824,16 @@ chapter. ## Evaluating on the test set To assess how well our model might do at predicting on unseen data, we will -assess its RMSPE on the test data. To do this, we first need to retrain the -KNN regression model on the entire training data set using $K =$ {glue:text}`best_k_sacr` +assess its RMSPE on the test data. To do this, we first need to retrain the +K-NN regression model on the entire training data set using $K =$ {glue:text}`best_k_sacr` neighbors. As we saw in {numref}`Chapter %s ` we do not have to do this ourselves manually; `scikit-learn` does it for us automatically. To make predictions with the best model on the test data, we can use the `predict` method of the fit `GridSearchCV` object. -We then use the `mean_squared_error` -function (with the `y_true` and `y_pred` arguments) +We then use the `mean_squared_error` function (with the `y_true` and `y_pred` arguments) to compute the mean squared prediction error, and finally take the -square root to get the RMSPE. The reason that we do not just use the `score` +square root to get the RMSPE. The reason that we do not just use the `score` method---as in {numref}`Chapter %s `---is that the `KNeighborsRegressor` -model uses a different default scoring metric than the RMSPE. +model uses a different default scoring metric than the RMSPE. ```{code-cell} ipython3 from sklearn.metrics import mean_squared_error @@ -862,7 +862,7 @@ RMSPE estimate of our tuned model (which was \${glue:text}`cv_RMSPE`, so we can say that the model appears to generalize well to new data that it has never seen before. -However, much like in the case of KNN classification, whether this value for RMSPE is *good*—i.e., +However, much like in the case of K-NN classification, whether this value for RMSPE is *good*—i.e., whether an error of around \${glue:text}`test_RMSPE` is acceptable—depends entirely on the application. In this application, this error @@ -906,7 +906,7 @@ base_plot = alt.Chart(sacramento).mark_circle(opacity=0.4).encode( # Add the predictions as a line sacr_preds_plot = base_plot + alt.Chart( - sqft_prediction_grid, + sqft_prediction_grid, title=f"K = {best_k_sacr}" ).mark_line( color="#ff7f0e" @@ -927,14 +927,14 @@ glue("fig:07-predict-all", sacr_preds_plot) :::{glue:figure} fig:07-predict-all :name: fig:07-predict-all -Predicted values of house price (orange line) for the final KNN regression model. +Predicted values of house price (orange line) for the final K-NN regression model. ::: +++ -## Multivariable KNN regression +## Multivariable K-NN regression -As in KNN classification, we can use multiple predictors in KNN regression. +As in K-NN classification, we can use multiple predictors in K-NN regression. In this setting, we have the same concerns regarding the scale of the predictors. 
Once again, predictions are made by identifying the $K$ observations that are nearest to the new point we want to predict; any @@ -943,16 +943,16 @@ variables on a small scale. Hence, we should re-define the preprocessor in the pipeline to incorporate all predictor variables. Note that we also have the same concern regarding the selection of predictors -in KNN regression as in KNN classification: having more predictors is **not** always +in K-NN regression as in K-NN classification: having more predictors is **not** always better, and the choice of which predictors to use has a potentially large influence on the quality of predictions. Fortunately, we can use the predictor selection -algorithm from {numref}`Chapter %s ` in KNN regression as well. +algorithm from {numref}`Chapter %s ` in K-NN regression as well. As the algorithm is the same, we will not cover it again in this chapter. ```{index} K-nearest neighbors; multivariable regression, Sacramento real estate ``` -We will now demonstrate a multivariable KNN regression analysis of the +We will now demonstrate a multivariable K-NN regression analysis of the Sacramento real estate data using `scikit-learn`. This time we will use house size (measured in square feet) as well as number of bedrooms as our predictors, and continue to use house sale price as our response variable @@ -991,7 +991,7 @@ Scatter plot of the sale price of houses versus the number of bedrooms. the house sale price tends to increase as well, but that the relationship is quite weak. Does adding the number of bedrooms to our model improve our ability to predict price? To answer that -question, we will have to create a new KNN regression +question, we will have to create a new K-NN regression model using house size and number of bedrooms, and then we can compare it to the model we previously came up with that only used house size. Let's do that now! @@ -1054,7 +1054,7 @@ glue("cv_RMSPE_2pred", "{0:,.0f}".format(min_rmspe_sacr_multi)) ``` Here we see that the smallest estimated RMSPE from cross-validation occurs when $K =$ {glue:text}`best_k_sacr_multi`. -If we want to compare this multivariable KNN regression model to the model with only a single +If we want to compare this multivariable K-NN regression model to the model with only a single predictor *as part of the model tuning process* (e.g., if we are running forward selection as described in the chapter on evaluating and tuning classification models), then we must compare the RMSPE estimated using only the training data via cross-validation. @@ -1065,7 +1065,7 @@ The estimated cross-validation RMSPE for the multivariable model is Thus in this case, we did not improve the model by a large amount by adding this additional predictor. -Regardless, let's continue the analysis to see how we can make predictions with a multivariable KNN regression model +Regardless, let's continue the analysis to see how we can make predictions with a multivariable K-NN regression model and evaluate its performance on test data. As previously, we will use the best model to make predictions on the test data via the `predict` method of the fit `GridSearchCV` object. Finally, we will use the `mean_squared_error` function to compute the RMSPE. 
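In code, that evaluation step might look like the following sketch; the object and column names are assumptions based on the chapter's Sacramento example rather than its exact code.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Predict on the test set with the tuned multivariable model (assumed here to
# be the fit GridSearchCV object), then compute the RMSPE.
test_predictions = sacr_gridsearch.predict(sacramento_test[["sqft", "beds"]])

RMSPE_mult = np.sqrt(
    mean_squared_error(
        y_true=sacramento_test["price"],
        y_pred=test_predictions,
    )
)
RMSPE_mult
```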
@@ -1086,7 +1086,7 @@ RMSPE_mult glue("RMSPE_mult", "{0:,.0f}".format(RMSPE_mult)) ``` -This time, when we performed KNN regression on the same data set, but also +This time, when we performed K-NN regression on the same data set, but also included number of bedrooms as a predictor, we obtained a RMSPE test error of \${glue:text}`RMSPE_mult`. {numref}`fig:07-knn-mult-viz` visualizes the model's predictions overlaid on top of the data. This @@ -1143,7 +1143,7 @@ glue("fig:07-knn-mult-viz", fig) :name: fig:07-knn-mult-viz :figclass: caption-hack -KNN regression model’s predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes. +K-NN regression model's predictions represented as a surface in 3D space overlaid on top of the data using three predictors (price, house size, and the number of bedrooms). Note that in general we recommend against using 3D visualizations; here we use a 3D visualization only to illustrate what the surface of predictions looks like for learning purposes. ``` +++ @@ -1160,9 +1160,9 @@ bedrooms, we would predict the same price for these two houses. +++ -## Strengths and limitations of KNN regression +## Strengths and limitations of K-NN regression -As with KNN classification (or any prediction algorithm for that matter), KNN +As with K-NN classification (or any prediction algorithm for that matter), K-NN regression has both strengths and weaknesses. Some are listed here: **Strengths:** K-nearest neighbors regression diff --git a/source/regression2.md b/source/regression2.md index edd9ea8d..f7245fdf 100755 --- a/source/regression2.md +++ b/source/regression2.md @@ -27,10 +27,10 @@ import plotly.graph_objects as go ## Overview Up to this point, we have solved all of our predictive problems—both classification -and regression—using K-nearest neighbors (KNN)-based approaches. In the context of regression, +and regression—using K-nearest neighbors (K-NN)-based approaches. In the context of regression, there is another commonly used method known as *linear regression*. This chapter provides an introduction to the basic concept of linear regression, shows how to use `scikit-learn` to perform linear regression in Python, -and characterizes its strengths and weaknesses compared to KNN regression. The focus is, as usual, +and characterizes its strengths and weaknesses compared to K-NN regression. The focus is, as usual, on the case where there is a single predictor and single response variable of interest; but the chapter concludes with an example using *multivariable linear regression* when there is more than one predictor. @@ -38,9 +38,10 @@ predictor. ## Chapter learning objectives By the end of the chapter, readers will be able to do the following: -* Use Python and `scikit-learn` to fit a linear regression model on training data. -* Evaluate the linear regression model on test data. -* Compare and contrast predictions obtained from K-nearest neighbor regression to those obtained using linear regression from the same data set. +- Use Python to fit simple and multivariable linear regression models on training data. +- Evaluate the linear regression model on test data. 
+- Compare and contrast predictions obtained from K-nearest neighbors regression to those obtained using linear regression from the same data set. +- Describe how linear regression is affected by outliers and multicollinearity. +++ @@ -49,19 +50,19 @@ By the end of the chapter, readers will be able to do the following: ```{index} regression; linear ``` -At the end of the previous chapter, we noted some limitations of KNN regression. -While the method is simple and easy to understand, KNN regression does not +At the end of the previous chapter, we noted some limitations of K-NN regression. +While the method is simple and easy to understand, K-NN regression does not predict well beyond the range of the predictors in the training data, and the method gets significantly slower as the training data set grows. -Fortunately, there is an alternative to KNN regression—*linear regression*—that addresses +Fortunately, there is an alternative to K-NN regression—*linear regression*—that addresses both of these limitations. Linear regression is also very commonly used in practice because it provides an interpretable mathematical equation that describes the relationship between the predictor and response variables. In this first part of the chapter, we will focus on *simple* linear regression, which involves only one predictor variable and one response variable; later on, we will consider *multivariable* linear regression, which involves multiple predictor variables. - Like KNN regression, simple linear regression involves + Like K-NN regression, simple linear regression involves predicting a numerical response variable (like race time, house price, or height); -but *how* it makes those predictions for a new observation is quite different from KNN regression. +but *how* it makes those predictions for a new observation is quite different from K-NN regression. Instead of looking at the K nearest neighbors and averaging over their values for a prediction, in simple linear regression, we create a straight line of best fit through the training data and then @@ -78,8 +79,8 @@ is another popular method for classification called *logistic regression* (it is used for classification even though the name, somewhat confusingly, has the word "regression" in it). In logistic regression—similar to linear regression—you "fit" the model to the training data and then "look up" the prediction for each new observation. -Logistic regression and KNN classification have an advantage/disadvantage comparison -similar to that of linear regression and KNN +Logistic regression and K-NN classification have an advantage/disadvantage comparison +similar to that of linear regression and K-NN regression. It is useful to have a good understanding of linear regression before learning about logistic regression. After reading this chapter, see the "Additional Resources" section at the end of the classification chapters to learn more about logistic regression. @@ -91,7 +92,7 @@ classification chapters to learn more about logistic regression. ``` Let's return to the Sacramento housing data from {numref}`Chapter %s ` to learn -how to apply linear regression and compare it to KNN regression. For now, we +how to apply linear regression and compare it to K-NN regression. For now, we will consider a smaller version of the housing data to help make our visualizations clear. 
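Before we do, it may help to see numerically what "looking up" a prediction on a line of best fit amounts to; the intercept and slope below are made-up numbers used purely for illustration.

```python
# Entirely hypothetical numbers, just to illustrate prediction from a line:
# predicted price = intercept + slope * house size.
intercept = 20000  # hypothetical intercept (USD)
slope = 90         # hypothetical slope (USD per square foot)
house_size = 2000  # square feet

predicted_price = intercept + slope * house_size
predicted_price  # 200000
```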
Recall our predictive question: can we use the size of a house in the Sacramento, CA area to predict @@ -290,7 +291,7 @@ the line that minimizes the **average squared vertical distance** between itself each of the observed data points in the training data. {numref}`fig:08-verticalDistToMin` illustrates these vertical distances as red lines. Finally, to assess the predictive accuracy of a simple linear regression model, -we use RMSPE—the same measure of predictive performance we used with KNN regression. +we use RMSPE—the same measure of predictive performance we used with K-NN regression. ```{code-cell} ipython3 :tags: [remove-cell] @@ -337,7 +338,7 @@ Scatter plot of sale price versus size with red lines denoting the vertical dist ``` We can perform simple linear regression in Python using `scikit-learn` in a -very similar manner to how we performed KNN regression. +very similar manner to how we performed K-NN regression. To do this, instead of creating a `KNeighborsRegressor` model object, we use a `LinearRegression` model object; and as usual, we first have to import it from `sklearn`. @@ -375,7 +376,7 @@ sacramento_train, sacramento_test = train_test_split( ) ``` -Now that we have our training data, we will create +Now that we have our training data, we will create and fit the linear regression model object. We will also extract the slope of the line via the `coef_[0]` property, as well as the @@ -510,16 +511,16 @@ glue("fig:08-lm-predict-all", sacr_preds_plot) Scatter plot of sale price versus size with line of best fit for the full Sacramento housing data. ::: -## Comparing simple linear and KNN regression +## Comparing simple linear and K-NN regression ```{index} regression; comparison of methods ``` -Now that we have a general understanding of both simple linear and KNN +Now that we have a general understanding of both simple linear and K-NN regression, we can start to compare and contrast these methods as well as the predictions made by them. To start, let's look at the visualization of the simple linear regression model predictions for the Sacramento real estate data -(predicting price from house size) and the "best" KNN regression model +(predicting price from house size) and the "best" K-NN regression model obtained from the same problem, shown in {numref}`fig:08-compareRegression`. ```{code-cell} ipython3 @@ -558,7 +559,7 @@ sacr_rmspe_knn = np.sqrt( # plot knn in-sample predictions overlaid on scatter plot knn_plot_final = ( - alt.Chart(sacr_preds_knn, title="KNN regression") + alt.Chart(sacr_preds_knn, title="K-NN regression") .mark_circle() .encode( x=alt.X("sqft", title="House size (square feet)", scale=alt.Scale(zero=False)), @@ -629,21 +630,21 @@ glue("fig:08-compareRegression", (lm_plot_final | knn_plot_final)) :::{glue:figure} fig:08-compareRegression :name: fig:08-compareRegression -Comparison of simple linear regression and KNN regression. +Comparison of simple linear regression and K-NN regression. ::: +++ What differences do we observe in {numref}`fig:08-compareRegression`? One obvious difference is the shape of the orange lines. In simple linear regression we are -restricted to a straight line, whereas in KNN regression our line is much more +restricted to a straight line, whereas in K-NN regression our line is much more flexible and can be quite wiggly. But there is a major interpretability advantage in limiting the model to a straight line. A straight line can be defined by two numbers, the vertical intercept and the slope. 
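In `scikit-learn`, those two numbers can be read directly off the fitted model object; the sketch below repeats the fitting step described earlier, with column names that follow the chapter's Sacramento example and are assumptions here.

```python
from sklearn.linear_model import LinearRegression

# Fit the simple linear regression of price on house size, then read off the
# two numbers that define the line.
lm = LinearRegression()
lm.fit(sacramento_train[["sqft"]], sacramento_train["price"])

lm.intercept_  # the vertical intercept
lm.coef_[0]    # the slope associated with the sqft predictor
```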
The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the response variable we predict given a unit increase in the predictor -variable. KNN regression, as simple as it is to implement and understand, has no such +variable. K-NN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line. ```{index} underfitting; regression @@ -657,14 +658,14 @@ will underfit (have high bias), meaning that model/predicted values do not match the actual observed values very well. Such a model would probably have a quite high RMSE when assessing model goodness of fit on the training data and a quite high RMSPE when assessing model prediction quality on a test data -set. On such a data set, KNN regression may fare better. Additionally, there +set. On such a data set, K-NN regression may fare better. Additionally, there are other types of regression you can learn about in future books that may do even better at predicting with such data. How do these two models compare on the Sacramento house prices data set? In {numref}`fig:08-compareRegression`, we also printed the RMSPE as calculated from predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear -regression model is slightly lower than the RMSPE for the KNN regression model. +regression model is slightly lower than the RMSPE for the K-NN regression model. Considering that the simple linear regression model is also more interpretable, if we were comparing these in practice we would likely choose to use the simple linear regression model. @@ -672,17 +673,17 @@ linear regression model. ```{index} extrapolation ``` -Finally, note that the KNN regression model becomes "flat" +Finally, note that the K-NN regression model becomes "flat" at the left and right boundaries of the data, while the linear model predicts a constant slope. Predicting outside the range of the observed -data is known as *extrapolation*; KNN and linear models behave quite differently +data is known as *extrapolation*; K-NN and linear models behave quite differently when extrapolating. Depending on the application, the flat or constant slope trend may make more sense. For example, if our housing data were slightly different, the linear model may have actually predicted a *negative* price for a small house (if the intercept $\beta_0$ was negative), which obviously does not match reality. On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, -so the "flat" extrapolation of KNN likely does not match reality. +so the "flat" extrapolation of K-NN likely does not match reality. +++ @@ -696,15 +697,15 @@ so the "flat" extrapolation of KNN likely does not match reality. ```{index} see: multivariable linear equation; plane equation ``` -As in KNN classification and KNN regression, we can move beyond the simple +As in K-NN classification and K-NN regression, we can move beyond the simple case of only one predictor to the case with multiple predictors, known as *multivariable linear regression*. To do this, we follow a very similar approach to what we did for -KNN regression: we just specify the training data by adding more predictors. +K-NN regression: we just specify the training data by adding more predictors. 
But recall that we do not need to use cross-validation to choose any parameters, nor do we need to standardize (i.e., center and scale) the data for linear regression. Note once again that we have the same concerns regarding multiple predictors - as in the settings of multivariable KNN regression and classification: having more predictors is **not** always + as in the settings of multivariable K-NN regression and classification: having more predictors is **not** always better. But because the same predictor selection algorithm from {numref}`Chapter %s ` extends to the setting of linear regression, it will not be covered again in this chapter. @@ -715,8 +716,8 @@ it will not be covered again in this chapter. We will demonstrate multivariable linear regression using the Sacramento real estate data with both house size (measured in square feet) as well as number of bedrooms as our predictors, and -continue to use house sale price as our response variable. -The `scikit-learn` framework makes this easy to do: we just need to set +continue to use house sale price as our response variable. +The `scikit-learn` framework makes this easy to do: we just need to set both the `sqft` and `beds` variables as predictors, and then use the `fit` method as usual. @@ -811,10 +812,10 @@ to illustrate what the regression plane looks like for learning purposes. We see that the predictions from linear regression with two predictors form a flat plane. This is the hallmark of linear regression, and differs from the -wiggly, flexible surface we get from other methods such as KNN regression. +wiggly, flexible surface we get from other methods such as K-NN regression. As discussed, this can be advantageous in one aspect, which is that for each predictor, we can get slopes/intercept from linear regression, and thus describe the -plane mathematically. We can extract those slope values from the `coef_` property +plane mathematically. We can extract those slope values from the `coef_` property of our model object, and the intercept from the `intercept_` property, as shown below. @@ -828,10 +829,10 @@ mlm.intercept_ When we have multiple predictor variables, it is not easy to know which variable goes with which coefficient in `mlm.coef_`. In particular, -you will see that `mlm.coef_` above is just an array of values without any variable names. +you will see that `mlm.coef_` above is just an array of values without any variable names. Unfortunately you have to do this mapping yourself: the coefficients in `mlm.coef_` appear in the *same order* as the columns of the predictor data frame you used when training. -So since we used `sacramento_train[["sqft", "beds"]]` when training, +So since we used `sacramento_train[["sqft", "beds"]]` when training, we have that `mlm.coef_[0]` corresponds to `sqft`, and `mlm.coef_[1]` corresponds to `beds`. Once you sort out the correspondence, you can then use those slopes to write a mathematical equation to describe the prediction plane: @@ -863,15 +864,15 @@ glue("bedsc", bedsc) $\text{house sale price} =$ {glue:text}`icept` $+$ {glue:text}`sqftc` $\cdot (\text{house size})$ {glue:text}`bedsc` $\cdot (\text{number of bedrooms})$ -This model is more interpretable than the multivariable KNN +This model is more interpretable than the multivariable K-NN regression model; we can write a mathematical equation that explains how each predictor is affecting the predictions. 
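To keep that interpretation convenient, one small trick (a sketch, not the chapter's own code) is to label the coefficients with the training column names, making the correspondence discussed above explicit.

```python
import pandas as pd

# Label each coefficient with the column it belongs to, using the same order
# as the columns of the training data frame.
coefs = pd.Series(mlm.coef_, index=["sqft", "beds"])
coefs
```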
But as always, we should question how well multivariable linear regression is doing compared to the other tools we have, such as simple linear regression -and multivariable KNN regression. If this comparison is part of +and multivariable K-NN regression. If this comparison is part of the model tuning process—for example, if we are trying out many different sets of predictors for multivariable linear -and KNN regression—we must perform this comparison using +and K-NN regression—we must perform this comparison using cross-validation on only our training data. But if we have already decided on a small number (e.g., 2 or 3) of tuned candidate models and we want to make a final comparison, we can do so by comparing the prediction @@ -886,7 +887,7 @@ lm_mult_test_RMSPE We obtain an RMSPE for the multivariable linear regression model of \${glue:text}`sacr_mult_RMSPE`. This prediction error - is less than the prediction error for the multivariable KNN regression model, + is less than the prediction error for the multivariable K-NN regression model, indicating that we should likely choose linear regression for predictions of house sale price on this data set. Revisiting the simple linear regression model with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was @@ -1361,7 +1362,7 @@ and guidance that the worksheets provide will function as intended. of "informative" predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the - computational efficiency of linear regression. In contrast, the KNN methods we + computational efficiency of linear regression. In contrast, the K-NN methods we covered earlier are indeed more flexible but become very slow when given lots of data. diff --git a/source/setup.md b/source/setup.md index 45f81c3e..a540198d 100755 --- a/source/setup.md +++ b/source/setup.md @@ -21,8 +21,8 @@ kernelspec: In this chapter, you'll learn how to set up the software needed to follow along with this book on your own computer. Given that installation instructions can vary based on computer setup, we provide instructions for -multiple operating systems (Ubuntu Linux, MacOS, and Windows). -Although the instructions in this chapter will likely work on many systems, +multiple operating systems (Ubuntu Linux, MacOS, and Windows). +Although the instructions in this chapter will likely work on many systems, we have specifically verified that they work on a computer that: - runs Windows 10 Home, MacOS 13 Ventura, or Ubuntu 22.04, @@ -38,18 +38,18 @@ By the end of the chapter, readers will be able to do the following: - Download the worksheets that accompany this book. - Install the Docker virtualization engine. - Edit and run the worksheets using JupyterLab running inside a Docker container. -- Install Git, JupyterLab Desktop, and python packages. +- Install Git, JupyterLab Desktop, and Python packages. - Edit and run the worksheets using JupyterLab Desktop. ## Obtaining the worksheets for this book -The worksheets containing exercises for this book +The worksheets containing exercises for this book are online at [https://worksheets.python.datasciencebook.ca](https://worksheets.python.datasciencebook.ca). The worksheets can be launched directly from that page using the Binder links in the rightmost -column of the table. 
This is the easiest way to access the worksheets, but note that you will not +column of the table. This is the easiest way to access the worksheets, but note that you will not be able to save your work and return to it again later. -In order to save your progress, you will need to download the worksheets to your own computer and -work on them locally. You can download the worksheets as a compressed zip file +In order to save your progress, you will need to download the worksheets to your own computer and +work on them locally. You can download the worksheets as a compressed zip file using [the link at the top of the page](https://github.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/archive/refs/heads/main.zip). Once you unzip the downloaded file, you will have a folder containing all of the Jupyter notebook worksheets accompanying this book. See {numref}`Chapter %s ` for @@ -64,7 +64,7 @@ software packages, not to mention getting the right versions of everything—the worksheets and autograder tests may not work unless all the versions are exactly right! To keep things simple, we instead recommend that you install [Docker](https://docker.com). Docker lets you run your Jupyter notebooks inside -a pre-built *container* that comes with precisely the right versions of +a pre-built *container* that comes with precisely the right versions of all software packages needed run the worksheets that come with this book. ```{index} Docker ``` @@ -73,15 +73,15 @@ all software packages needed run the worksheets that come with this book. A *container* is a virtualized user space within your computer. Within the container, you can run software in isolation without interfering with the other software that already exists on your machine. In this book, we use -a container to run a specific version of the python programming +a container to run a specific version of the Python programming language, as well as other necessary packages. The container ensures that -the worksheets function correctly, even if you have a different version of python -installed on your computer—or even if you haven't installed python at all! +the worksheets function correctly, even if you have a different version of Python +installed on your computer—or even if you haven't installed Python at all! ``` ### Windows -**Installation** To install Docker on Windows, +**Installation** To install Docker on Windows, visit [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/), and download the `Docker Desktop Installer.exe` file. Double-click the file to open the installer and follow the instructions on the installation wizard, choosing **WSL-2** instead of **Hyper-V** when prompted. @@ -90,27 +90,27 @@ and follow the instructions on the installation wizard, choosing **WSL-2** inste Occasionally, when you first run Docker on Windows, you will encounter an error message. Some common errors you may see: - If you need to update WSL, you can enter `cmd.exe` in the Start menu to run the command line. Type `wsl --update` to update WSL. -- If the admin account on your computer is different to your user account, you must add the user to the "docker-users" group. - Run Computer Management as an administrator and navigate to `Local Users` and `Groups -> Groups -> docker-users`. Right-click to +- If the admin account on your computer is different to your user account, you must add the user to the "docker-users" group. 
+ Run Computer Management as an administrator and navigate to `Local Users` and `Groups -> Groups -> docker-users`. Right-click to add the user to the group. Log out and log back in for the changes to take effect. - If you need to enable virtualization, you will need to edit your BIOS. Restart your computer, and enter the BIOS using the hotkey (usually Delete, Esc, and/or one of the F# keys). Look for an "Advanced" menu, and under your CPU settings, set the "Virtualization" option - to "enabled". Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert - to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. + to "enabled". Then save the changes and reboot your machine. If you are not familiar with BIOS editing, you may want to find an expert + to help you with this, as editing the BIOS can be dangerous. Detailed instructions for doing this are beyond the scope of this book. ``` **Running JupyterLab** Run Docker Desktop. Once it is running, you need to download and run the -Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a +Docker *image* that we have made available for the worksheets (an *image* is like a "snapshot" of a computer with all the right packages pre-installed). You only need to do this step one time; the image will remain the next time you run Docker Desktop. -In the Docker Desktop search bar, enter `ubcdsci/py-dsci-100`, as this is +In the Docker Desktop search bar, enter `ubcdsci/py-dsci-100`, as this is the name of the image. You will see the `ubcdsci/py-dsci-100` image in the list ({numref}`docker-desktop-search`), and "latest" in the Tag drop down menu. We need to change "latest" to the right image version before proceeding. -To find the right tag, open +To find the right tag, open the [`Dockerfile` in the worksheets repository](https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/main/Dockerfile), and look for the line `FROM ubcdsci/py-dsci-100:` followed by the tag consisting of a sequence of numbers and letters. -Back in Docker Desktop, in the "Tag" drop down menu, click that tag to select the correct image version. Then click -the "Pull" button to download the image. +Back in Docker Desktop, in the "Tag" drop down menu, click that tag to select the correct image version. Then click +the "Pull" button to download the image. ```{figure} img/setup/docker-1.png --- @@ -148,10 +148,10 @@ name: docker-desktop-runconfig The Docker Desktop container run configuration menu. ``` -After clicking the "Run" button, you will see a terminal. The terminal will then print -some text as the Docker container starts. Once the text stops scrolling, find the -URL in the terminal that starts -with `http://127.0.0.1:8888` (highlighted by the red box in {numref}`docker-desktop-url`), and paste it +After clicking the "Run" button, you will see a terminal. The terminal will then print +some text as the Docker container starts. Once the text stops scrolling, find the +URL in the terminal that starts +with `http://127.0.0.1:8888` (highlighted by the red box in {numref}`docker-desktop-url`), and paste it into your browser to start JupyterLab. ```{figure} img/setup/docker-4.png @@ -162,11 +162,11 @@ name: docker-desktop-url The terminal text after running the Docker container. The red box indicates the URL that you should paste into your browser to open JupyterLab. 
``` -When you are done working, make sure to shut down and remove the container by +When you are done working, make sure to shut down and remove the container by clicking the red trash can symbol (in the top right corner of {numref}`docker-desktop-url`). You will not be able to start the container again until you do so. More information on installing and running -Docker on Windows, as well as troubleshooting tips, can +Docker on Windows, as well as troubleshooting tips, can be found in [the online Docker documentation](https://docs.docker.com/desktop/install/windows-install/). ### MacOS @@ -174,18 +174,18 @@ be found in [the online Docker documentation](https://docs.docker.com/desktop/in **Installation** To install Docker on MacOS, visit [the online Docker documentation](https://docs.docker.com/desktop/install/mac-install/), and download the `Docker.dmg` installation file that is appropriate for your -computer. To know which installer is right for your machine, you need to know +computer. To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the [Apple support page](https://support.apple.com/en-ca/HT211814) has -information to help you determine which processor you have. Once downloaded, double-click +information to help you determine which processor you have. Once downloaded, double-click the file to open the installer, then drag the Docker icon to the Applications folder. -Double-click the icon in the Applications folder to start Docker. In the installation +Double-click the icon in the Applications folder to start Docker. In the installation window, use the recommended settings. **Running JupyterLab** Run Docker Desktop. Once it is running, follow the instructions above in the Windows section on *Running JupyterLab* (the user interface is the same). More information on installing and running Docker on -MacOS, as well as troubleshooting tips, can be +MacOS, as well as troubleshooting tips, can be found in [the online Docker documentation](https://docs.docker.com/desktop/install/mac-install/). ### Ubuntu @@ -206,8 +206,8 @@ the following command, replacing `TAG` with the *tag* you found earlier. ``` docker run --rm -v $(pwd):/home/jovyan/work -p 8888:8888 ubcdsci/py-dsci-100:TAG jupyter lab ``` -The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the -URL in your terminal that starts with `http://127.0.0.1:8888` (highlighted by the +The terminal will then print some text as the Docker container starts. Once the text stops scrolling, find the +URL in your terminal that starts with `http://127.0.0.1:8888` (highlighted by the red box in {numref}`ubuntu-docker-terminal`), and paste it into your browser to start JupyterLab. More information on installing and running Docker on Ubuntu, as well as troubleshooting tips, can be found in [the online Docker documentation](https://docs.docker.com/engine/install/ubuntu/). @@ -226,23 +226,23 @@ The terminal text after running the Docker container in Ubuntu. The red box indi You can also run the worksheets accompanying this book on your computer using [JupyterLab Desktop](https://github.com/jupyterlab/jupyterlab-desktop). The advantage of JupyterLab Desktop over Docker is that it can be easier to install; -Docker can sometimes run into some fairly technical issues (especially on Windows computers) -that require expert troubleshooting. 
The downside of JupyterLab Desktop is that there is a (very) small chance that -you may not end up with the right versions of all the python packages needed for the worksheets. Docker, on the other hand, -*guarantees* that the worksheets will work exactly as intended. +Docker can sometimes run into some fairly technical issues (especially on Windows computers) +that require expert troubleshooting. The downside of JupyterLab Desktop is that there is a (very) small chance that +you may not end up with the right versions of all the Python packages needed for the worksheets. Docker, on the other hand, +*guarantees* that the worksheets will work exactly as intended. In this section, we will cover how to install JupyterLab Desktop, Git and the JupyterLab Git extension (for version control, as discussed in {numref}`Chapter %s `), and -all of the python packages needed to run +all of the Python packages needed to run the code in this book. ```{index} JupyterLab Desktop, git;installation ``` ### Windows -**Installation** First, we will install Git for version control. -Go to [the Git download page](https://git-scm.com/download/win) and -download the Windows version of Git. Once the download has finished, run the installer and accept +**Installation** First, we will install Git for version control. +Go to [the Git download page](https://git-scm.com/download/win) and +download the Windows version of Git. Once the download has finished, run the installer and accept the default configuration for all pages. Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation). Download the `JupyterLab-Setup-Windows.exe` installer file for Windows. @@ -265,8 +265,8 @@ The JupyterLab Desktop graphical user interface. Next, we need to add the JupyterLab Git extension (so that we can use version control directly from within JupyterLab Desktop), -the IPython kernel (to enable the python programming language), -and various python software packages. Click "New session..." in the JupyterLab Desktop +the IPython kernel (to enable the Python programming language), +and various Python software packages. Click "New session..." in the JupyterLab Desktop user interface, then scroll to the bottom, and click "Terminal" under the "Other" heading ({numref}`setup-jlab-gui-2`). ```{figure} img/setup/jlab-2.png @@ -283,29 +283,29 @@ In this terminal, run the following commands: pip install --upgrade jupyterlab-git conda env update --file https://raw.githubusercontent.com/UBC-DSCI/data-science-a-first-intro-python-worksheets/main/environment.yml ``` -The second command installs the specific python and package versions specified in -the `environment.yml` file found in +The second command installs the specific Python and package versions specified in +the `environment.yml` file found in [the worksheets repository](https://worksheets.python.datasciencebook.ca). We will always keep the versions in the `environment.yml` file updated so that they are compatible with the exercise worksheets that accompany the book. -Once all of the software installation is complete, it is a good idea to restart +Once all of the software installation is complete, it is a good idea to restart JupyterLab Desktop entirely before you proceed to doing your data analysis. -This will ensure all the software and settings you put in place are +This will ensure all the software and settings you put in place are correctly set up and ready for use. 
### MacOS -**Installation** First, we will install Git for version control. -Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY)) +**Installation** First, we will install Git for version control. +Open the terminal ([how-to video](https://youtu.be/5AJbWEWwnbY)) and type the following command: ``` xcode-select --install ``` Next, visit the ["Installation" section of the JupyterLab Desktop homepage](https://github.com/jupyterlab/jupyterlab-desktop#installation). -Download the `JupyterLab-Setup-MacOS-x64.dmg` or `JupyterLab-Setup-MacOS-arm64.dmg` installer file. -To know which installer is right for your machine, you need to know +Download the `JupyterLab-Setup-MacOS-x64.dmg` or `JupyterLab-Setup-MacOS-arm64.dmg` installer file. +To know which installer is right for your machine, you need to know whether your computer has an Intel processor (older machines) or an Apple processor (newer machines); the [Apple support page](https://support.apple.com/en-ca/HT211814) has information to help you determine which processor you have. @@ -316,11 +316,11 @@ the icon in the Applications folder to start JupyterLab Desktop. **Configuring JupyterLab Desktop** From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on *Configuring JupyterLab Desktop* to set up the environment, install the JupyterLab Git extension, and install -the various python software packages needed for the worksheets. +the various Python software packages needed for the worksheets. ### Ubuntu -**Installation** First, we will install Git for version control. +**Installation** First, we will install Git for version control. Open the terminal and type the following commands: ``` sudo apt update @@ -340,4 +340,4 @@ jlab **Configuring JupyterLab Desktop** From this point onward, with JupyterLab Desktop running, follow the instructions in the Windows section on *Configuring JupyterLab Desktop* to set up the environment, install the JupyterLab Git extension, and install -the various python software packages needed for the worksheets. +the various Python software packages needed for the worksheets. diff --git a/source/viz.md b/source/viz.md index 40de0dea..1f56039d 100755 --- a/source/viz.md +++ b/source/viz.md @@ -40,16 +40,18 @@ By the end of the chapter, readers will be able to do the following: - bar plots - histogram plots - Given a data set and a question, select from the above plot types and use Python to create a visualization that best answers the question. -- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question. +- Evaluate the effectiveness of a visualization and suggest improvements to better answer a given question. - Referring to the visualization, communicate the conclusions in non-technical terms. - Identify rules of thumb for creating effective visualizations. 
-- Define the two key aspects of altair charts: - - graphical marks - - encoding channels -- Use the altair library in Python to create and refine the above visualizations using: - - graphical marks: `mark_point`, `mark_line`, `mark_bar` +- Use the `altair` library in Python to create and refine the above visualizations using: + - graphical marks: `mark_point`, `mark_line`, `mark_circle`, `mark_bar`, `mark_rule` - encoding channels: `x`, `y`, `color`, `shape` + - labeling: `title` + - transformations: `scale` - subplots: `facet` +- Define the two key aspects of `altair` charts: + - graphical marks + - encoding channels - Describe the difference in raster and vector output formats. - Use `chart.save()` to save visualizations in `.png` and `.svg` format. @@ -611,7 +613,7 @@ can_lang ```{code-cell} ipython3 :tags: ["remove-cell"] # use only nonzero entries (to avoid issues with log scale), and wrap in a pd.DataFrame to prevent copy/view warnings later -can_lang = pd.DataFrame(can_lang[(can_lang["most_at_home"] > 0) & (can_lang["mother_tongue"] > 0)]) +can_lang = pd.DataFrame(can_lang[(can_lang["most_at_home"] > 0) & (can_lang["mother_tongue"] > 0)]) ``` ```{index} altair; mark_circle diff --git a/source/wrangling.md b/source/wrangling.md index f7f94564..2c400af3 100755 --- a/source/wrangling.md +++ b/source/wrangling.md @@ -36,28 +36,25 @@ application, providing more practice working through a whole case study. By the end of the chapter, readers will be able to do the following: - - Define the term "tidy data". - - Discuss the advantages of storing data in a tidy data format. - - Define what series and data frames are in Python, and describe how they relate to - each other. - - Describe the common types of data in Python and their uses. - - Recall and use the following functions for their - intended data wrangling tasks: - - `agg` - - `assign` (as well as regular column assignment) - - `groupby` - - `melt` - - `pivot` - - `str.split` - - Recall and use the following operators for their - intended data wrangling tasks: - - `==`, `!=`, `<`, `>`, `<=`, `>=` - - `in` - - `and` - - `or` - - `[]` - - `loc[]` - - `iloc[]` +- Define the term "tidy data". +- Discuss the advantages of storing data in a tidy data format. +- Define what series and data frames are in Python, and describe how they relate to + each other. +- Describe the common types of data in Python and their uses. +- Use the following functions for their intended data wrangling tasks: + - `melt` + - `pivot` + - `reset_index` + - `str.split` + - `agg` + - `assign` and regular column assignment + - `groupby` + - `merge` +- Use the following operators for their intended data wrangling tasks: + - `==`, `!=`, `<`, `>`, `<=`, and `>=` + - `isin` + - `&` and `|` + - `[]`, `loc[]`, and `iloc[]` ## Data frames and series @@ -838,7 +835,7 @@ one can use in the `[]` to select subsets of rows. Recall that if we provide a list of column names, `[]` returns the subset of columns with those names as a data frame. Suppose we wanted to select the columns `language`, `region`, `most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we -learned in {numref}`Chapter %s `, we can pass all of these column +learned in {numref}`Chapter %s `, we can pass all of these column names into the square brackets. ```{code-cell} ipython3 @@ -1042,8 +1039,8 @@ The `[]` operation is only used when you want to either filter rows **or** selec it cannot be used to do both operations at the same time. This is where `loc[]` comes in. 
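To make the contrast concrete before returning to the `tidy_lang` example, here is a minimal self-contained sketch using a toy data frame (the column names and values are invented purely for illustration): `[]` can filter rows *or* select columns, but only `loc[]` can do both in a single step.

```python
import pandas as pd

# A toy data frame standing in for a larger data set.
df = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Vancouver"],
    "language": ["English", "French", "English"],
    "count": [10, 5, 7],
})

df[df["region"] == "Toronto"]        # [] with a logical statement: filters rows
df[["region", "language"]]           # [] with a list of names: selects columns
df.loc[df["region"] == "Toronto", ["region", "language"]]  # loc[]: both at once
```
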
For the first example, recall `loc[]` from {numref}`Chapter %s `, which lets us create a subset of the rows and columns in the `tidy_lang` data frame. -In the first argument to `loc[]`, we specify a logical statement that -filters the rows to only those pertaining to the Toronto region, +In the first argument to `loc[]`, we specify a logical statement that +filters the rows to only those pertaining to the Toronto region, and the second argument specifies a list of columns to keep by name. ```{code-cell} ipython3 @@ -1055,11 +1052,11 @@ tidy_lang.loc[ ``` In addition to simultaneous subsetting of rows and columns, `loc[]` has two -more special capabilities beyond those of `[]`. First, `loc[]` has the ability to specify *ranges* of rows and columns. -For example, note that the list of columns `language`, `region`, `most_at_home`, `most_at_work` +more special capabilities beyond those of `[]`. First, `loc[]` has the ability to specify *ranges* of rows and columns. +For example, note that the list of columns `language`, `region`, `most_at_home`, `most_at_work` corresponds to the *range* of columns from `language` to `most_at_work`. Rather than explicitly listing all of the column names as we did above, -we can ask for the range of columns `"language":"most_at_work"`; the `:`-syntax +we can ask for the range of columns `"language":"most_at_work"`; the `:`-syntax denotes a range, and is supported by the `loc[]` function, but not by `[]`. ```{code-cell} ipython3 @@ -1490,7 +1487,7 @@ region_lang_nums.info() ``` You can now see that the columns from `mother_tongue` to `lang_known` are type `int32`, and that we have obtained a data frame with the same number of columns and rows -as the input data frame. +as the input data frame. The second situation occurs when you want to apply a function across columns within each individual row, i.e., *row-wise*. This operation, illustrated in {numref}`fig:rowwise`, @@ -1520,7 +1517,7 @@ We see that we obtain a series containing the maximum value between `mother_tong is often the case that we want to include a column result from a row-wise operation as a new column in the data frame, so that we can make plots or continue our analysis. To make this happen, -we will use column assignment or the `assign` function to create a new column. +we will use column assignment or the `assign` function to create a new column. This is discussed in the next section. ```{note} @@ -1554,7 +1551,7 @@ You can see above that the `region_lang` data frame now has an additional column The `maximum` column contains the maximum value between `mother_tongue`, `most_at_home`, `most_at_work` and `lang_known` for each language -and region, just as we specified! +and region, just as we specified! To instead create an entirely new data frame, we can use the `assign` method and specify one argument for each column we want to create. In this case we want to create one new column named `maximum`, so the argument @@ -1670,7 +1667,7 @@ See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stab english_lang ``` Wait a moment...what is that warning message? It seems to suggest that something went wrong, but -if we inspect the `english_lang` data frame above, it looks like the city populations were added +if we inspect the `english_lang` data frame above, it looks like the city populations were added just fine! As it turns out, this is caused by the earlier filtering we did from `region_lang` to produce the original `english_lang`. 
The details are a little bit technical, but `pandas` sometimes does not like it when you subset a data frame using `[]` or `loc[]` followed by @@ -1733,14 +1730,14 @@ english_lang = region_lang[ :tags: ["output_scroll"] english_lang ``` -We then added the populations of these cities as a column +We then added the populations of these cities as a column (Toronto: 5928040, Montréal: 4098927, Vancouver: 2463431, Calgary: 1392609, and Edmonton: 1321426). We had to be careful to add those populations in the right order; this is an error-prone process. An alternative approach, that we demonstrate here is to (1) create a new data frame with the city names and populations, and (2) use `merge` to combine the two data frames, recognizing that the "regions" are the same. -We create a new data frame by calling `pd.DataFrame` with a dictionary +We create a new data frame by calling `pd.DataFrame` with a dictionary as its argument. The dictionary associates each column name in the data frame to be created with a list of entries. Here we list city names in a column called `"region"` and their populations in a column called `"population"`.
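Putting those two steps together, here is a compact, self-contained sketch of the idea. The population values are the ones quoted above; the `english_lang` data frame is replaced by a toy version so that the sketch runs on its own.

```python
import pandas as pd

# Step (1): a data frame pairing each region with its population
# (the population values quoted in the text above).
city_populations = pd.DataFrame({
    "region": ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    "population": [5928040, 4098927, 2463431, 1392609, 1321426],
})

# A toy stand-in for the filtered `english_lang` data frame, with the
# regions deliberately listed in a different order.
english_lang = pd.DataFrame({
    "region": ["Montréal", "Toronto", "Calgary", "Edmonton", "Vancouver"],
    "language": ["English"] * 5,
})

# Step (2): merge on the shared "region" column; rows are matched by value,
# so the order of the regions in the two data frames does not matter.
english_lang = english_lang.merge(city_populations, on="region")
print(english_lang)
```

Because `merge` matches rows by the values in `"region"`, there is no need to add the populations in the right order by hand, which was the error-prone part of the earlier approach.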