
Commit

update
gboeing committed Apr 3, 2024
1 parent 9685ffc commit 67d4f32
Showing 1 changed file with 19 additions and 8 deletions.
27 changes: 19 additions & 8 deletions modules/12-unsupervised-learning/lecture.ipynb
@@ -81,7 +81,7 @@
"source": [
"## 1. Linear discriminant analysis\n",
"\n",
"Dimensionality reduction lets us reduce the number of features (variables) in our data set with minimal loss of information. This data compression is called **feature extraction**. Feature extraction is similar to feature selection in that they both reduce the total number of variables in your analysis. In feature selection, we use domain theory or an algorithm to select important variables for our model. Feature extraction instead projects your features onto a lower-dimension space, creating new features rather than just selecting a subset of existing ones.\n",
"Dimensionality reduction lets us reduce the number of features (variables) in our data set with minimal loss of information. This data compression is called **feature extraction**. Feature extraction is similar to feature selection in that they both reduce the total number of variables in your analysis. In feature selection, we use domain theory or an algorithm to select important variables for our model. Feature extraction instead projects your features onto a lower-dimension space, creating wholly new features rather than just selecting a subset of existing ones.\n",
"\n",
"LDA is *supervised* dimensionality reduction, providing a link between supervised learning and dimensionality reduction. It uses a categorical response and continuous features to identify features that account for the most variance between classes (ie, maximum separability). It can be used as a classifier, similar to what we saw last week, or it can be used for dimensionality reduction by projecting the features in the most discriminative directions.\n",
"\n",
@@ -157,7 +157,7 @@
"metadata": {},
"outputs": [],
"source": [
"# reduce data from n dimensions to 2\n",
"# reduce data from original n dimensions to 2\n",
"lda = LinearDiscriminantAnalysis(n_components=2)\n",
"X_reduced = lda.fit_transform(X, y)\n",
"X_reduced.shape"
@@ -170,6 +170,7 @@
"metadata": {},
"outputs": [],
"source": [
"# scatter plot the 2 new dimensions\n",
"fig, ax = plt.subplots(figsize=(6, 6))\n",
"for county_name in data[\"county_name\"].unique():\n",
" mask = y == county_name\n",
@@ -234,9 +235,17 @@
"\n",
"PCA is used 1) to fix multicollinearity problems and 2) for dimensionality reduction. In the former, it converts a set of original, correlated features into a new set of orthogonal features, which is useful in regression and cluster analysis. In the latter, it summarizes a set of original, correlated features with a smaller number of features that still explain most of the variance in your data (data compression).\n",
"\n",
"PCA identifies the combinations of features (directions in feature space) that account for the most variance in the dataset. These orthogonal axes of maximum variance are called principal components. A **principal component** is an eigenvector (direction of maximum variance) of the features' covariance matrix, and the corresponding eigenvalue is its magnitude (factor by which it is \"stretched\"). An eigenvector is the cosine of the angle between a feature and a component. Its corresponding eigenvalue represents the share of variance it accounts for. PCA takes your (standardized) features' covariance matrix, decomposes it into its eigenvectors/eigenvalues, sorts them by eigenvalue magnitude, constructs a projection matrix $W_k$ from the corresponding top $k$ eigenvectors, then transforms the features using the projection matrix to get the new $k$-dimensional feature subspace. Always standardize your data before PCA because it is sensitive to features' scale.\n",
"PCA identifies the combinations of features (directions in feature space) that account for the most variance in the dataset. These orthogonal axes of maximum variance are called principal components. A **principal component** is an eigenvector (direction of maximum variance) of the features' covariance matrix, and the corresponding eigenvalue is its magnitude (factor by which it is \"stretched\"). An eigenvector is the cosine of the angle between a feature and a component. Its corresponding eigenvalue represents the share of variance it accounts for. Always standardize your data before PCA because it is sensitive to features' scale.\n",
"\n",
"We will reduce our feature set to fewer dimensions."
"How does PCA work? It...\n",
"\n",
"- calculates your (standardized) features' covariance matrix\n",
"- decomposes it into its eigenvectors/eigenvalues\n",
"- sorts them by eigenvalue magnitude\n",
"- constructs a projection matrix $W_k$ from the corresponding top $k$ eigenvectors\n",
"- transforms the features using the projection matrix to get the new $k$-dimensional feature subspace\n",
"\n",
"Let's practice reducing our feature set to fewer dimensions with PCA."
]
},
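For reference, the steps listed above can be sketched directly with NumPy before turning to scikit-learn. This is a minimal illustration rather than the lecture's own code, and it assumes a standardized feature matrix named `X_std`:

```python
import numpy as np

# minimal PCA-by-hand sketch (assumes X_std is an (n_samples, n_features)
# standardized feature matrix; the name X_std is illustrative)
cov = np.cov(X_std, rowvar=False)        # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues/eigenvectors of the symmetric matrix
order = np.argsort(eigvals)[::-1]        # sort by eigenvalue magnitude, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W_k = eigvecs[:, :k]                     # projection matrix from the top-k eigenvectors
X_new = X_std @ W_k                      # features projected onto the k-dimensional subspace
explained_share = eigvals[:k] / eigvals.sum()  # share of total variance each component explains
```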
{
@@ -400,7 +409,9 @@
"id": "improving-thanksgiving",
"metadata": {},
"source": [
"We often refer to these projected data as \"principal component scores\" or a \"score matrix\", $T_k$, where $T_k = XW_k$ and $X$ is your original feature matrix and $W_k$ is the projection matrix, that is, a matrix containing the first $k$ principal components (ie, the $k$ eigenvectors with the largest corresponding eigenvalues). In our case, $k=2$. We can calculate this manually:"
"We often refer to these projected data as \"principal component scores\" or a \"score matrix\", $T_k$, where $T_k = XW_k$ and $X$ is your original feature matrix and $W_k$ is the projection matrix, that is, a matrix containing the first $k$ principal components (ie, the $k$ eigenvectors with the largest corresponding eigenvalues). In our case, $k=2$.\n",
"\n",
"We can calculate this manually:"
]
},
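A hedged sketch of that manual calculation, assuming (as in the cells above) that `X` is the standardized feature matrix and that the fitted estimator is named `pca`; scikit-learn stores the eigenvectors as the rows of `pca.components_`:

```python
# score matrix T_k = X W_k computed by hand from the fitted PCA
# (assumes X is the standardized, mean-centered feature matrix used to fit pca)
W_k = pca.components_.T   # (n_features, k) projection matrix of the top-k eigenvectors
T_k = X @ W_k             # (n_samples, k) principal component scores
# this should match pca.transform(X) up to floating-point error
```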
{
Expand Down Expand Up @@ -565,7 +576,7 @@
"outputs": [],
"source": [
"# cluster the data\n",
"km = KMeans(n_clusters=5).fit(X_reduced)"
"km = KMeans(n_clusters=5, n_init=\"auto\").fit(X_reduced)"
]
},
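After fitting, the clusterer's standard scikit-learn attributes can be inspected. A small usage sketch, assuming only the `km` name defined above:

```python
# inspect the fitted k-means model
print(km.labels_[:10])        # cluster assignment of the first 10 observations
print(km.cluster_centers_)    # centroid coordinates in the reduced feature space
print(km.inertia_)            # within-cluster sum of squared distances ("distortion")
```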
{
Expand Down Expand Up @@ -646,14 +657,14 @@
"metadata": {},
"outputs": [],
"source": [
"# create an elbow plot\n",
"# create an elbow plot: distortion vs cluster count\n",
"fig, ax = plt.subplots()\n",
"ax.set_xlabel(\"Number of clusters\")\n",
"ax.set_ylabel(\"Distortion\")\n",
"kvals = range(1, 15)\n",
"distortions = []\n",
"for k in kvals:\n",
" km = KMeans(n_clusters=k).fit(X_reduced)\n",
" km = KMeans(n_clusters=k, n_init=\"auto\").fit(X_reduced)\n",
" distortions.append(km.inertia_)\n",
"ax.plot(kvals, distortions, marker=\"o\")\n",
"_ = ax.grid(True)"
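Once a value of k is chosen from the elbow in the plot above, a natural follow-up is to refit at that k and attach the labels to the observations. A hedged sketch, assuming k=5 as used earlier and that the rows of `data` align with `X_reduced`:

```python
# refit at the chosen k and attach cluster labels to the original rows
km = KMeans(n_clusters=5, n_init="auto").fit(X_reduced)
data["cluster"] = km.labels_
print(data["cluster"].value_counts())  # how many observations fall in each cluster
```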
