diff --git a/README.md b/README.md
index 878a383..578c3df 100644
--- a/README.md
+++ b/README.md
@@ -59,6 +59,8 @@ $ pip install -r requirements.txt
```

### Using Conda

+⚠️ If you're using an Apple Mac with an M1 chip, please follow these [instructions](#note-for-conda-on-apple-m1-chip)
+
You can create a `pydata-global-2022-ml-repro` conda environment by executing:

```
@@ -77,7 +79,7 @@ You might also only update your current environment using:
$ conda env update --prefix ./env --file environment.yml --prune
```

-#### Note for Conda nn Apple M1 Chip
+#### Note for Conda on Apple M1 Chip

If you're using a Mac with the latest M1 chip, it is highly recommended to install the packages in your conda environment specifically tailored for your hardware architecture (i.e. `arm64`).

@@ -124,7 +126,7 @@ So how do we actually go about obtaining these goals?
## Data

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/0%20-%20Basic%20Data%20Prep%20and%20Model.ipynb)

This tutorial uses the [Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/).

@@ -137,7 +139,7 @@ Data were collected and made available by [Dr. 
Kristen Gorman](https://www.uaf.e

## Model Evaluation

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/1%20-%20Model%20Evaluation.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/1%20-%20Model%20Evaluation.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/1%20-%20Model%20Evaluation.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/1%20-%20Model%20Evaluation.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/1%20-%20Model%20Evaluation.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/1%20-%20Model%20Evaluation.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/1%20-%20Model%20Evaluation.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/1%20-%20Model%20Evaluation.ipynb)

Applying machine learning in an applied science context is often method work. We build a prototype model and want to show that this method can be applied to our specific problem. This means that we have to guarantee that the insights we glean from this application generalize to new data from the same problem set.
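To make the workflow concrete, here is a minimal sketch of the evaluation loop this section builds up to: split once, select models with cross-validation on the training portion only, and touch the test set exactly once at the end. The synthetic data and the `SVC` classifier here are stand-ins, not the tutorial's actual pipeline.

```python
# Minimal sketch: hold out a test set, do model selection via CV on the
# training portion only, then report a single final test score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y
)

model = SVC()
scores = cross_val_score(model, X_train, y_train, cv=5)  # training data only
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

model.fit(X_train, y_train)
print(f"Held-out test accuracy: {model.score(X_test, y_test):.2f}")
```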
@@ -153,7 +155,7 @@ So we’ll go into some methods to properly evaluate machine learning models eve
## Benchmarking

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/2%20-%20Benchmarking.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/2%20-%20Benchmarking.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/2%20-%20Benchmarking.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/2%20-%20Benchmarking.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/2%20-%20Benchmarking.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/2%20-%20Benchmarking.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/2%20-%20Benchmarking.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/2%20-%20Benchmarking.ipynb)

Another common reason for rejections of machine learning papers in applied science is the lack of proper benchmarks. This section will be fairly short, as it differs from discipline to discipline.

@@ -165,7 +167,7 @@ However, any time we apply a superfancy deep neural network, we need to supply a
## Model Sharing

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/3%20-%20Model%20Sharing.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/3%20-%20Model%20Sharing.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/3%20-%20Model%20Sharing.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/3%20-%20Model%20Sharing.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/3%20-%20Model%20Sharing.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/3%20-%20Model%20Sharing.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/3%20-%20Model%20Sharing.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/3%20-%20Model%20Sharing.ipynb)

Some journals will require the sharing of code or models, but even if they don’t, we might benefit from it.
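As a taste of what that notebook covers, here is a minimal sketch of persisting a fitted scikit-learn model with `joblib` (the persistence approach from the scikit-learn docs); the `SVC` on synthetic data is a stand-in for whatever estimator you trained.

```python
# Minimal sketch: persist a fitted model to disk and reload it elsewhere.
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(random_state=42)
model = SVC().fit(X, y)

joblib.dump(model, "model.joblib")      # share this file (plus the env!)
restored = joblib.load("model.joblib")  # later, or on another machine
assert (restored.predict(X) == model.predict(X)).all()
```

Sharing the exact library versions alongside the file matters, since pickled models are tied to the environment that created them.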
@@ -184,7 +186,7 @@ In this section, we explore how we can export models and make our training codes
## Testing

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/4%20-%20Testing.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/4%20-%20Testing.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/4%20-%20Testing.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/4%20-%20Testing.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/4%20-%20Testing.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/4%20-%20Testing.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/4%20-%20Testing.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/4%20-%20Testing.ipynb)

Machine learning is very hard to test. Due to the nature of our models, we often have soft failures in the model that are difficult to test against.

@@ -200,7 +202,7 @@ Writing software tests in science, is already incredibly hard, so in this sectio
## Interpretability

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/5%20-%20Interpretability.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/5%20-%20Interpretability.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/5%20-%20Interpretability.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/5%20-%20Interpretability.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/5%20-%20Interpretability.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/5%20-%20Interpretability.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/5%20-%20Interpretability.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/5%20-%20Interpretability.ipynb)

One way to probe the models we build is to test them against the established knowledge of domain experts. 
In this final section, we’ll explore how to build intuitions about our machine learning model and avoid pitfalls like spurious correlations. These methods for model interpretability increase our trust in models, but they can also serve as an additional level of reproducibility in our research and a valuable research artefact that can be discussed in a publication.

@@ -214,7 +216,7 @@ This section will introduce tools like `shap`, discuss feature importance, and m
## Ablation Studies

-[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/6%20-%20Ablation%20Study.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/6%20-%20Ablation%20Study.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/6%20-%20Ablation%20Study.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/master/notebooks/6%20-%20Ablation%20Study.ipynb)
+[![](https://img.shields.io/badge/view-notebook-orange)](notebooks/6%20-%20Ablation%20Study.ipynb) [![](https://img.shields.io/badge/open-colab-yellow)](https://colab.research.google.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/6%20-%20Ablation%20Study.ipynb) [![Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/6%20-%20Ablation%20Study.ipynb) [![Open%20In%20SageMaker%20Studio%20Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jesperdramsch/ml-for-science-reproducibility-tutorial/blob/main/notebooks/6%20-%20Ablation%20Study.ipynb)

Finally, the gold standard in building complex machine learning models is proving that each constituent part of the model contributes something to the proposed solution. 
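The shape of such a study can be sketched in a few lines: remove one ingredient at a time, re-evaluate under the same protocol, and compare against the full model. This hypothetical example ablates input features of a stand-in classifier on synthetic data; in practice the ablated parts may equally be preprocessing steps or model components.

```python
# Minimal ablation sketch: drop one feature at a time and compare CV scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
baseline = cross_val_score(SVC(), X, y, cv=5).mean()
print(f"all features: {baseline:.3f}")

for i in range(X.shape[1]):
    ablated = np.delete(X, i, axis=1)  # remove feature i
    score = cross_val_score(SVC(), ablated, y, cv=5).mean()
    print(f"without feature {i}: {score:.3f} (delta {score - baseline:+.3f})")
```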
diff --git a/.jupytext.toml b/jupytext.toml similarity index 100% rename from .jupytext.toml rename to jupytext.toml diff --git a/notebooks/0 - Basic Data Prep and Model.ipynb b/notebooks/0 - Basic Data Prep and Model.ipynb index f184f41..31d00fa 100644 --- a/notebooks/0 - Basic Data Prep and Model.ipynb +++ b/notebooks/0 - Basic Data Prep and Model.ipynb @@ -20,10 +20,10 @@ "id": "54158e1d", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:36.930901Z", - "iopub.status.busy": "2022-12-01T10:51:36.930738Z", - "iopub.status.idle": "2022-12-01T10:51:36.936141Z", - "shell.execute_reply": "2022-12-01T10:51:36.935600Z" + "iopub.execute_input": "2022-12-02T12:08:36.427016Z", + "iopub.status.busy": "2022-12-02T12:08:36.426566Z", + "iopub.status.idle": "2022-12-02T12:08:36.435106Z", + "shell.execute_reply": "2022-12-02T12:08:36.434696Z" } }, "outputs": [], @@ -40,10 +40,10 @@ "id": "36b24fd4", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:36.937997Z", - "iopub.status.busy": "2022-12-01T10:51:36.937897Z", - "iopub.status.idle": "2022-12-01T10:51:40.292581Z", - "shell.execute_reply": "2022-12-01T10:51:40.292295Z" + "iopub.execute_input": "2022-12-02T12:08:36.437449Z", + "iopub.status.busy": "2022-12-02T12:08:36.437298Z", + "iopub.status.idle": "2022-12-02T12:08:36.752040Z", + "shell.execute_reply": "2022-12-02T12:08:36.751780Z" } }, "outputs": [], @@ -57,10 +57,10 @@ "id": "01e133b7", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:40.294318Z", - "iopub.status.busy": "2022-12-01T10:51:40.294218Z", - "iopub.status.idle": "2022-12-01T10:51:40.307229Z", - "shell.execute_reply": "2022-12-01T10:51:40.306947Z" + "iopub.execute_input": "2022-12-02T12:08:36.753558Z", + "iopub.status.busy": "2022-12-02T12:08:36.753477Z", + "iopub.status.idle": "2022-12-02T12:08:36.767488Z", + "shell.execute_reply": "2022-12-02T12:08:36.767137Z" }, "scrolled": true }, @@ -271,10 +271,10 @@ "id": "93eedeb8", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:40.308822Z", - "iopub.status.busy": "2022-12-01T10:51:40.308746Z", - "iopub.status.idle": "2022-12-01T10:51:40.315078Z", - "shell.execute_reply": "2022-12-01T10:51:40.314824Z" + "iopub.execute_input": "2022-12-02T12:08:36.768880Z", + "iopub.status.busy": "2022-12-02T12:08:36.768799Z", + "iopub.status.idle": "2022-12-02T12:08:36.774601Z", + "shell.execute_reply": "2022-12-02T12:08:36.774395Z" } }, "outputs": [ @@ -460,10 +460,10 @@ "id": "8378dc03", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:40.316459Z", - "iopub.status.busy": "2022-12-01T10:51:40.316386Z", - "iopub.status.idle": "2022-12-01T10:51:47.226742Z", - "shell.execute_reply": "2022-12-01T10:51:47.226405Z" + "iopub.execute_input": "2022-12-02T12:08:36.775953Z", + "iopub.status.busy": "2022-12-02T12:08:36.775892Z", + "iopub.status.idle": "2022-12-02T12:08:38.226669Z", + "shell.execute_reply": "2022-12-02T12:08:38.226416Z" } }, "outputs": [ @@ -502,10 +502,10 @@ "id": "791232d7", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:47.228421Z", - "iopub.status.busy": "2022-12-01T10:51:47.228337Z", - "iopub.status.idle": "2022-12-01T10:51:47.234278Z", - "shell.execute_reply": "2022-12-01T10:51:47.234048Z" + "iopub.execute_input": "2022-12-02T12:08:38.228486Z", + "iopub.status.busy": "2022-12-02T12:08:38.228405Z", + "iopub.status.idle": "2022-12-02T12:08:38.233604Z", + "shell.execute_reply": "2022-12-02T12:08:38.233388Z" } }, "outputs": [ @@ -677,10 +677,10 @@ "id": "44aaf953", 
"metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:47.235803Z", - "iopub.status.busy": "2022-12-01T10:51:47.235732Z", - "iopub.status.idle": "2022-12-01T10:51:47.239270Z", - "shell.execute_reply": "2022-12-01T10:51:47.239016Z" + "iopub.execute_input": "2022-12-02T12:08:38.234920Z", + "iopub.status.busy": "2022-12-02T12:08:38.234865Z", + "iopub.status.idle": "2022-12-02T12:08:38.238379Z", + "shell.execute_reply": "2022-12-02T12:08:38.238172Z" } }, "outputs": [], @@ -712,10 +712,10 @@ "id": "210ae85e", "metadata": { "execution": { - "iopub.execute_input": "2022-12-01T10:51:47.240766Z", - "iopub.status.busy": "2022-12-01T10:51:47.240683Z", - "iopub.status.idle": "2022-12-01T10:51:48.765086Z", - "shell.execute_reply": "2022-12-01T10:51:48.760434Z" + "iopub.execute_input": "2022-12-02T12:08:38.239692Z", + "iopub.status.busy": "2022-12-02T12:08:38.239625Z", + "iopub.status.idle": "2022-12-02T12:08:38.325291Z", + "shell.execute_reply": "2022-12-02T12:08:38.325047Z" } }, "outputs": [ @@ -748,39 +748,39 @@ " \n", "
\n", "\n", + " | index | \n", + "
---|---|
Species | \n", + "\n", + " |
Adelie Penguin (Pygoscelis adeliae) | \n", + "146 | \n", + "
Chinstrap penguin (Pygoscelis antarctica) | \n", + "68 | \n", + "
Gentoo penguin (Pygoscelis papua) | \n", + "120 | \n", + "
\n", + " | index | \n", + "
---|---|
Species | \n", + "\n", + " |
Adelie Penguin (Pygoscelis adeliae) | \n", + "102 | \n", + "
Chinstrap penguin (Pygoscelis antarctica) | \n", + "47 | \n", + "
\n", + " | index | \n", + "
---|---|
Species | \n", + "\n", + " |
Adelie Penguin (Pygoscelis adeliae) | \n", + "44 | \n", + "
Chinstrap penguin (Pygoscelis antarctica) | \n", + "21 | \n", + "
-0.9914163090128756
+0.9871244635193133
-1.0
+0.9900990099009901
from matplotlib import pyplot as plt
+
penguins.groupby("Species").Sex.count().plot(kind="bar")
+plt.show()
<AxesSubplot:xlabel='Species'>-
y_train.reset_index().groupby(["Species"]).count()
@@ -14994,7 +14997,7 @@ Stratification
- Out[5]:
+ Out[6]:
@@ -15055,22 +15058,22 @@ Stratification
-We can address this by applying stratification.
-That is simply sampling randomly within a class (or strata) rather than randomly sampling from the entire dataframe.
+We can address this by applying stratification.
+That is simply achieved by randomly sampling **within a class** (or strata) rather than randomly sampling from the entire dataframe.
X_train, X_test, y_train, y_test = train_test_split(penguins[features], penguins[target[0]], train_size=.7, random_state=42, stratify=penguins[target[0]])
-y_train.reset_index().groupby("Species").count().plot(kind="bar")
+X, y = penguins[features], penguins[target[0]]
+X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, random_state=42, stratify=y)
To qualitatively assess the effect of stratification, let's plot the class distribution in both the training and test sets:
+fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
-
-
-
-
-
- Out[6]:
-
-
-
+y_train.reset_index().groupby("Species").count().plot(kind="bar", ax=ax1, ylim=(0, len(y)), title="Training")
+y_test.reset_index().groupby("Species").count().plot(kind="bar", ax=ax2, ylim=(0, len(y)), title="Test")
+plt.show()
+
-
-<AxesSubplot:xlabel='Species'>
+
+
+
from sklearn.model_selection import cross_val_score
+
scores = cross_val_score(model, X_train, y_train, cv=5)
scores
print(f"{scores.mean():0.2f} accuracy with a standard deviation of {scores.std():0.2f}")
@@ -15286,34 +15308,77 @@ Cross-Validation
-Time-series Validation¶
But validation can get tricky if time gets involved.
-Imagine we measured the growth of baby penguin Hank over time and wanted to us machine learning to project the development of Hank. Then our data suddenly isn't i.i.d. anymore, since it is dependent in the time dimension.
-Were we to split our data randomly for our training and test set, we would test on data points that lie in between training points, where even a simple linear interpolation can do a fairly decent job.
-Therefor, we need to split our measurements along the time axis
-
-Scikit-learn Time Series CV [Source].
+Model Evaluation¶
+
+
Brilliant! So let's recap for a moment what we have done so far, in preparation for our (final) model evaluation.
+We have:
+- created a sklearn.pipeline.Pipeline with preprocessor + model
+- split the data into training and test partitions, (X_train, y_train) and (X_test, y_test), respectively
+- run cross-validation (cross_val_score) on X_train (!!)
+And we have a candidate as representative for these data: X_test
.
Please note that X_test
has never been used so far (as it should have!). The take away message here is: generate test partition, and forget about it until the last step!
Thanks to CV
, We have an indication of how the SVC
classifier behaves on multiple "version" of the training set. We calculated an average score of 0.99
accuracy, therefore we decided this model is to be trusted for predictions on unseen data.
Now all we need to do, is to prove this assertion.
+To do so we need to:
+ +- train a new model on the entire **training set**
+- evaluate it's performance on **test set** (using the metric of choice - presumably the same metric we chose in CV!)
import numpy as np
-from sklearn.model_selection import TimeSeriesSplit
+# training
+model = Pipeline(steps=[
+ ('preprocessor', preprocessor),
+ ('classifier', SVC()),
+])
+classifier = model.fit(X_train, y_train)
+
-X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
-y = np.array([1, 2, 3, 4, 5, 6])
-tscv = TimeSeriesSplit(n_splits=3)
-print(tscv)
+
# Model evaluation
+from sklearn.metrics import accuracy_score
+
+y_pred = classifier.predict(X_test)
+print("TEST ACC: ", accuracy_score(y_true=y_test, y_pred=y_pred))
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None) -[0 1 2] [3] -[0 1 2 3] [4] -[0 1 2 3 4] [5] +TEST ACC: 1.0
Spatial data, like maps and satellite data has a similar problem.
-Here the data is correlated in the spatial dimension. However, we can mitigate the effect by supplying a group. In this simple example I used continents, but it's possible to group by bins on a lat-lon grid as well.
-Here especially, a cross-validation scheme is very important, as it is used to validate against every area on your map at least once.
+Now we can finally say that we have concluded our model evaluation - with a fantastic score of 0.96
Accuracy on the test set.
Ok, now for the mere sake of considering a more realistic data scenario, let's pretend our reference dataset is composed by only samples from two (out of the three) classes we have. In particular, we will crafting our dataset by choosing the most and the least represented classes, respectively.
+The very idea is to explore whether the choice of appropriate metrics could make the difference in our machine learning models evaluation.
+ +Let's recall class distributions in our dataset:
y.reset_index().groupby(["Species"]).count()
+
Species | index
---|---
Adelie Penguin (Pygoscelis adeliae) | 146
Chinstrap penguin (Pygoscelis antarctica) | 68
Gentoo penguin (Pygoscelis papua) | 120
So let's select samples from the first two classes, Adelie Penguin and Chinstrap penguin:
samples = penguins[((penguins["Species"].str.startswith("Adelie")) | (penguins["Species"].str.startswith("Chinstrap")))]
+
samples.shape[0] == 146 + 68 # quick verification
+
True+
To make things even harder for our machine learning model, let's also see if we could get rid of clearly separating features in this toy dataset
+ +import seaborn as sns
+
+pairplot_figure = sns.pairplot(samples, hue="Species")
+
OK so if we get to choose, we could definitely say that in this dataset, the Flipper Length
in combination with the Culmen Depth
leads to the hardest classification task for our machine learning model.
Therefore, here is the plan:
+Culmen Lenght
feature)The very difference this time is that we will use multiple metrics to evaluate our model to prove our point on carefully selecting evaluation metrics.
+ +num_features = ["Culmen Length (mm)", "Culmen Depth (mm)", "Flipper Length (mm)"]
+selected_num_features = num_features[1:]
+cat_features = ["Sex"]
+features = selected_num_features + cat_features
+
num_transformer = StandardScaler()
+cat_transformer = OneHotEncoder(handle_unknown='ignore')
+
+preprocessor = ColumnTransformer(transformers=[
+ ('num', num_transformer, selected_num_features), # note here, we will only preprocess selected numerical features
+ ('cat', cat_transformer, cat_features)
+])
+
+model = Pipeline(steps=[
+ ('preprocessor', preprocessor),
+ ('classifier', SVC()),
+])
+
X, y = samples[features], samples[target[0]]
+X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, random_state=42, stratify=y) # we also stratify on classes
+
y_train.reset_index().groupby("Species").count()
+
Species | index
---|---
Adelie Penguin (Pygoscelis adeliae) | 102
Chinstrap penguin (Pygoscelis antarctica) | 47
y_test.reset_index().groupby("Species").count()
+
Species | index
---|---
Adelie Penguin (Pygoscelis adeliae) | 44
Chinstrap penguin (Pygoscelis antarctica) | 21
In our evaluation pipeline, we will keep record of both the accuracy (ACC) and the Matthews correlation coefficient (MCC).
from sklearn.model_selection import cross_validate
+from sklearn.metrics import make_scorer
+from sklearn.metrics import matthews_corrcoef as mcc
+from sklearn.metrics import accuracy_score as acc
+
+mcc_scorer = make_scorer(mcc)
+acc_scorer = make_scorer(acc)
+scores = cross_validate(model, X_train, y_train, cv=5,
+ scoring={"MCC": mcc_scorer, "ACC": acc_scorer})
+scores
+
{'fit_time': array([0.00225472, 0.00185323, 0.00180411, 0.00178003, 0.00177217]), + 'score_time': array([0.00157523, 0.00124979, 0.00123906, 0.00125003, 0.0012238 ]), + 'test_MCC': array([0.37796447, 0.27863911, 0.40824829, 0.02424643, 0.08625819]), + 'test_ACC': array([0.73333333, 0.7 , 0.76666667, 0.66666667, 0.62068966])}+
import numpy as np
+
+print("Avg ACC in CV: ", np.average(scores["test_ACC"]))
+print("Avg MCC in CV: ", np.average(scores["test_MCC"]))
+
Avg ACC in CV: 0.697471264367816 +Avg MCC in CV: 0.2350712993854009 ++
model = model.fit(X_train, y_train)
+
+print("ACC: ", acc_scorer(model, X_test, y_test))
+print("MCC: ", mcc_scorer(model, X_test, y_test))
+
ACC: 0.7230769230769231 +MCC: 0.29439815585406465 ++
To see exactly what happened, let's have a look at the confusion matrix:
+ +from sklearn.metrics import ConfusionMatrixDisplay
+fig, ax = plt.subplots(figsize=(15, 10))
+ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, ax=ax)
+plt.show()
+
As expected, the model did a pretty bad job of classifying Chinstrap penguins, and the MCC was able to catch that, whilst the ACC could not, as it only counts the fraction of correctly classified samples and is therefore dominated by the majority class!
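A tiny constructed example (not from the notebook) makes this failure mode explicit: on imbalanced data, a degenerate classifier that always predicts the majority class still gets high accuracy, while MCC flags it as no better than chance.

```python
# Toy illustration: accuracy rewards always predicting the majority class.
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = [0] * 90 + [1] * 10  # imbalanced ground truth: 90% class 0
y_pred = [0] * 100            # classifier that only ever predicts class 0

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks deceptively good
print(matthews_corrcoef(y_true, y_pred))  # 0.0 -- no better than chance
```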
But validation can get tricky if time gets involved.
+Imagine we measured the growth of baby penguin Hank over time and wanted to use machine learning to project Hank's development. Then our data suddenly isn't i.i.d. anymore, since it is dependent in the time dimension.
+Were we to split our data randomly into training and test sets, we would test on data points that lie in between training points, where even a simple linear interpolation can do a fairly decent job.
+Therefore, we need to split our measurements along the time axis.
+
+Scikit-learn Time Series CV [Source].
import numpy as np
+from sklearn.model_selection import TimeSeriesSplit
+
+X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
+y = np.array([1, 2, 3, 4, 5, 6])
+tscv = TimeSeriesSplit(n_splits=3)
+print(tscv)
+
+for train, test in tscv.split(X):
+ print("%s %s" % (train, test))
+
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None) +[0 1 2] [3] +[0 1 2 3] [4] +[0 1 2 3 4] [5] ++
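A small follow-up sketch, with made-up growth measurements standing in for Hank's: TimeSeriesSplit plugs directly into cross_val_score, so every fold is evaluated strictly on later time points.

```python
# Minimal sketch: time-aware CV, where each fold tests only on the future.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

t = np.arange(20, dtype=float).reshape(-1, 1)  # time is the only feature
growth = 0.5 * t.ravel() + np.random.default_rng(42).normal(0, 0.1, size=20)

scores = cross_val_score(LinearRegression(), t, growth,
                         cv=TimeSeriesSplit(n_splits=4), scoring="r2")
print(scores)
```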
Spatial data, like maps and satellite data, has a similar problem.
+Here the data is correlated in the spatial dimension. However, we can mitigate the effect by supplying a group. In this simple example I used continents, but it's possible to group by bins on a lat-lon grid as well.
+Here especially, a cross-validation scheme is very important, as it validates against every area on your map at least once.
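A minimal sketch of what supplying a group looks like in scikit-learn; the continent labels are hypothetical stand-ins for whatever spatial grouping fits your data.

```python
# Minimal sketch: group-aware CV keeps all samples from a group together,
# so each continent ends up in the test fold exactly once.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["Africa", "Africa", "Asia", "Asia", "Europe", "Europe"])

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    print(train_idx, test_idx, groups[test_idx])
```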
+ +tensor([[0.3527, 0.3850]], grad_fn=<SigmoidBackward0>) +tensor([[0.5887, 0.6065]], grad_fn=<SigmoidBackward0>)