-
Notifications
You must be signed in to change notification settings - Fork 63
/
Copy pathClustering_Knowhow.qmd
264 lines (163 loc) · 9.37 KB
/
Clustering_Knowhow.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
---
title: "Clustering_knowhow"
author: "Niladri Dasgupta"
date: "2024-08-12"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## **What is clustering?**
Clustering is a method of segregating unlabeled data or data points into different groups/clusters such that similar data points fall in the same cluster than those which differ from the others. The similarity measures are calculated using distance based metrics like Euclidean distance, Cosine similarity, Manhattan distance, etc.
For Example, In the graph given below, we can clearly see that the data points can be grouped into 3 clusters
![](images/Clustering/clustering_ex.PNG)
<br>
## **Type of Clustering Algorithm**
Some of the popular clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
### 1.Centroid-based Clustering (Partitioning methods)
Partitioning methods group data points on the basis of their closeness. The similarity measure chosen for these algorithms are Euclidean distance, Manhattan Distance or Minkowski Distance.
The primary drawback for these algorithms is we need to pre define the number of clusters before allocating the data points to a group.
One of the popular centroid based clustering technique is K means Clustering.
<br>
#### **K Means Clustering**
K means is an iterative clustering algorithm that works in these 5 steps:
1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.
![](images/Clustering/kmeans_1.png)
2. Randomly assign each data point to a cluster: Let’s assign three points in cluster 1, shown using orange color, and two points in cluster 2, shown using grey color.
![](images/Clustering/kmeans_2.png)
3. Compute cluster centroids: Centroids correspond to the arithmetic mean of data points assigned to the cluster. The centroid of data points in the orange cluster is shown using the orange cross, and those in the grey cluster using a grey cross.
![](images/Clustering/kmeans_3.png)
4. Assigns each observation to their closest centroid, based on the Euclidean distance between the object and the centroid
![](images/Clustering/kmeans_4.png)
5. Re-computing the centroids for both clusters.
![](images/Clustering/kmeans_5.png)
We will repeat the 4th and 5th steps until no further switching of data points between two clusters for two successive repeats.
<br>
#### K-Means Clustering in R
**Step 1: Load packages**
First, we’ll load below packages that contain several useful functions regarding k-means clustering in R.
```{r, message=FALSE}
library(cluster) #Contain cluster function
library(dplyr) #Data manipulation
library(ggplot2) #Plotting function
library(readr) #Read and write excel/csv files
library(factoextra) #Extract and Visualize the Results of Multivariate Data Analyses
```
**Step 2: Load Data**
We have used the “Mall_Customer” dataset in R for this case study.
```{r, message=FALSE}
#Loading the data
df <- read_csv("data/Mall_Customers.csv")
#Structure of the data
str(df)
```
dataset consists of 200 customers data with their age, annual income and Spending score.
```{r}
#Rename the columns
df <- df %>%
rename("Annual_Income"= `Annual Income (k$)`, "Spending_score"= `Spending Score (1-100)`)
#remove rows with missing values
df <- na.omit(df)
#scale each variable to have a mean of 0 and sd of 1
df1 <- df %>%
mutate(across(where(is.numeric), scale))
#view first six rows of dataset
head(df1)
```
<br>
We have separated the CustomerID and Genre from the dataset. The reason for removing these variables from the cluster dataset as Kmeans can handle only numerical variables.
To create cluster with categorical or ordinal variable we can use k-Medoid clustering.
<br>
```{r}
df1 <- df1[,4:5]
```
**Step 3: Find the Optimal Number of Clusters**
To perform k-means clustering in R we can use the built-in kmeans() function, which uses the following syntax:
kmeans(data, centers, iter.max, nstart)
where:
- data: Name of the dataset.
- centers: The number of clusters, denoted k.
- iter.max (optional): The maximum number of iterations allowed. Default value is 10.
- nstart (optional): The number of initial configurations. Default value is 1.
- Centers is the k of K Means. centers = 5 would results in 5 clusters being created. We need to **predefine the k** before the cluster process starts.
- iter.max is the number of times the algorithm will repeat the cluster assignment and update the centers / centroids. Iteration stops after this many iterations even if the convergence criterion is not satisfied
- nstart is the number of times the initial starting points are re-sampled.
It means at the initialization of Clusters you need to specify how many clusters you want and the algorithm will randomly find same number of centroids to initialize. nstart gives you an edge to initialize the centroids through re sampling.
For example if total number of cluster is 3 and nstart=25 then it extracts 3 sets of data, 25 times, and for each of these times, the algorithm is run (up to iter.max # of iterations) and the cost function (total sum of the squares) is evaluated and finally 3 centroids with lowest cost function are chosen to start the clustering process.
To find the best number of clusters/centroids there are two popular methods as shown below.
[**A. Elbow Method:**]{.underline}
It has two parts as explained below-
- WSS: The Within Sum of Squares (WSS) is the sum of distance between the centroids and every other data points within a cluster. Small WSS indicates that every data point is close to its nearest centroids.
- Elbow rule/method: Here we plot out the WSS score against the number of K. Because with the number of K increasing, the WSS will always decrease; however, the magnitude of decrease between each k will be diminishing, and the plot will be a curve which looks like an arm that curled up. In this way, we can find out which point falls on the elbow.
```{r}
set.seed(1)
wss<- NULL
#Feeding different centroid/cluster and record WSS
for (i in 1:10){
fit = kmeans(df1,centers = i,nstart=25)
wss = c(wss, fit$tot.withinss)
}
#Visualize the plot
plot(1:10, wss, type = "o", xlab='Number of clusters(k)')
```
Based on the above plot at k=5 we can see an “elbow” where the sum of squares begins to “bend” or level off so the ideal number of clusters should be 5.
The above process to compute the “Elbow method” has been wrapped up in a single function (fviz_nbclust):
```{r}
fviz_nbclust(df1, kmeans, method = "wss",nstart=25)
```
[**B. Silhouette Method:**]{.underline}
The silhouette coefficient or silhouette score is a measure of how similar a data point is within-cluster (intra-cluster) compared to other clusters (inter-cluster).
The Silhouette Coefficient is calculated using the mean *intra-cluster distance (a)* and the *mean nearest-cluster distance (b)* for each sample. The Silhouette Coefficient for a sample is *(b - a) / max(a, b)*
Here we will plot the silhouette width/coefficient for different number of clusters and will choose the point where the silhouette width is highest.
**Points to Remember While Calculating Silhouette Coefficient:**
The value of the silhouette coefficient is between [-1, 1].
A score of 1 denotes the best, meaning that the data points are very compact within the cluster to which it belongs and far away from the other clusters.
The worst value is -1. Values near 0 denote overlapping clusters.
In this demonstration, we are going to see how silhouette method is used.
```{r}
silhouette_score <- function(k){
km <- kmeans(df1, centers = k,nstart = 25)
ss <- silhouette(km$cluster, dist(df1))
mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)
plot(k, type='b', avg_sil, xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)
```
From the above method we can see the silhouette width is highest at cluster 5 so the optimal number of cluster should be 5.
Similar to the elbow method, this process to compute the “average silhoutte method” has been wrapped up in a single function (fviz_nbclust):
```{r}
fviz_nbclust(df1, kmeans, method='silhouette',nstart=25)
```
The optimal number of clusters is 5.
**Step 4: Perform K-Means Clustering with Optimal K**
Lastly, we can perform k-means clustering on the dataset using the optimal value for k of 5:
```{r}
#make this example reproducible
set.seed(1)
#perform k-means clustering with k = 5 clusters
fit <- kmeans(df1, 5, nstart=25)
#view results
fit
```
We can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fivz_cluster() function:
```{r}
#plot results of final k-means model
fviz_cluster(fit, data = df1)
```
**Step 5: Exporting the data by adding generated clusters**
```{r}
#Adding the clusters in the main data
df_cluster <- df %>%
mutate(cluster=fit$cluster)
#Creating Summary of created clusters based on existing variables
df_summary <- df_cluster %>%
group_by(cluster) %>%
summarise(records=n(),avg_age=mean(Age),avg_annual_income=mean(Annual_Income),avg_spending_score=mean(Spending_score))
print(df_summary)
```
We can create a group of potential customers to target based on their age, average annual income and average spending score.