---
title: "Course Project"
author: "Guruprasad Sridharan"
date: "Sunday, July 27, 2014"
output:
  html_document:
    toc: yes
---
The Random Forest algorithm, as implemented in the randomForest R package, was used to construct the
classifier. It was the obvious choice since I had very little time for the assignment and the
data contained a lot of non-linearity. Random Forest needs little parameter tuning to get decent
results and is simple to train and use, and tree-based models generally perform well on non-linear
classification problems. The plan was to start with bagging approaches (e.g. Decision Tree, Random
Forest) and then try boosting approaches (e.g. AdaBoost, Stochastic Gradient Boosting). Due to the
shortage of time I was only able to test Random Forest, but the results were very good: I got 100%
classification accuracy on the validation data.
Major steps in model training:
1. Remove timestamp columns from training and validation datasets.
2. Remove columns with a large number of missing values from training and validation datasets.
3. Remove columns with nearly zero variance from training and validation datasets.
4. Partition training data into train and test data (for cross-validation).
5. Train a random forest classifier on the train data and examine variable importance.
6. Measure performance on test data.
7. Write out predictions on validation data.
Let me walk through the code.
Loading the required libraries:
```{r, message = FALSE, warning = FALSE}
library(caret)
library(ggplot2)
library(randomForest)
```
```{r, echo = FALSE}
setwd("C:\\Users\\gurupra\\Desktop\\R_Workspace")
source("functions.R")
```
The training data had a lot of columns with empty or NA values, so it was important to specify which
values should be recognized as NA. Reading the training and validation data from the respective CSV files:
```{r}
data = read.table("pml-training.csv", sep=",", header = TRUE, na.strings=c("","NA","#DIV/0!"))
validation_data = read.table("pml-testing.csv", sep=",", header = TRUE, na.strings=c("","NA"
,"#DIV/0!"))
```
Once the data is read, the timestamp-related columns are removed. These columns don't provide any
useful information about the output class and were hard to deal with.
```{r}
#removing timestamp columns
data <- data[,c(-1,-3,-4,-5)]
validation_data <- validation_data[,c(-1,-3,-4,-5)]
```
Some of the columns in the training data had over 95% NA values, while the rest had no missing values. Moreover, these columns were redundant (derivatives of other columns) and didn't provide much useful information, so they were also removed from the data.
```{r}
#removing cols with only NA values
ret <- removeColsWithNAs(data)
data <- ret$a
validation_data <- validation_data[,ret$b]
```
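The helper `removeColsWithNAs()` is defined in `functions.R`, which is not shown here. A minimal sketch of what it might look like, based on how it is used above; the 95% threshold and the returned list names `a` (cleaned data) and `b` (indices of retained columns) are assumptions:
```{r, eval = FALSE}
# Hypothetical sketch of the helper sourced from functions.R:
# drop columns whose fraction of NA values exceeds a threshold and
# return the cleaned data together with the indices of the kept columns.
removeColsWithNAs <- function(df, threshold = 0.95) {
  na_fraction <- colMeans(is.na(df))
  keep <- which(na_fraction < threshold)
  list(a = df[, keep], b = keep)
}
```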
The previous step reduced the number of predictors from 156 to 56.
The next step is to remove predictors with near-zero variance.
```{r}
#removing predictors with zero variance
nzv <- nearZeroVar(data)
data <- data[,-nzv]
validation_data <- validation_data[,-nzv]
```
Partitioning training data into train and test data.
```{r}
set.seed(1729)
#partition train data into training and testing data
inTrain = createDataPartition(y = data$classe, p = 0.7, list = FALSE)
training = data[inTrain, ]
testing = data[-inTrain, ]
```
Plot of yaw_belt vs num_window in training set:
```{r, echo = FALSE}
qplot(training$yaw_belt, training$num_window, color = training$classe, geom = "jitter")
```
The above plot shows the non-linearity of the classification problem.
Plot of mtry vs out-of-bag error (for 501 trees):
```{r}
# Training a randomForest model:
# find an optimal value for the mtry parameter, using 501 trees per try
tuneRF(training[,-55], training[,55], ntreeTry = 501, plot = TRUE)
```
Dimensionality reduction using PCA was avoided since it was too time-consuming on such a large dataset.
Fitting the model to the training data and predicting on the test data:
```{r}
set.seed(223)
model <- randomForest(classe ~ ., data = training, importance = TRUE, ntree = 501)
pred <- predict(model, newdata = testing)
```
Plot of variables vs overall importance:
```{r, echo = FALSE, fig.width = 10, fig.height = 10}
# variable importance measures from the fitted model
vi <- varImp(model)
# aggregate importance across columns and sort in decreasing order
vi$sum <- rowSums(vi)
vi <- vi[with(vi, order(-sum)), ]
# dot plot of aggregate importance for the 54 predictors, labelled on the y-axis
par(mai = c(1, 5, 1, 1))
plot(vi$sum, 1:54, yaxt = 'n', main = "Variable Importance", xlab = "Importance", ylab = "", pch = 16, col = "red")
axis(2, at = 1:54, labels = rownames(vi), las = 2)
abline(h = 1:54, v = 0, col = "gray60")
```
Confusion matrix for predictions on the test data:
```{r, echo = FALSE}
table(pred, testing$classe)
```
Overall accuracy and class-wise accuracy:
```{r, echo = FALSE}
acc <- accuracy(testing$classe, pred)
cat("Overall Accuracy = ", acc)
# per-class figures: diagonal of the confusion matrix divided by the row (predicted-class) totals
classes <- levels(testing$classe)
tb <- table(pred, testing$classe)
row_sum <- rowSums(tb)
for (i in 1:length(classes)) {
  cat("Accuracy for class ", classes[i], " = ", tb[i,i]/row_sum[i], "\n")
}
```
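The `accuracy()` helper also comes from `functions.R`. A minimal sketch, assuming it simply computes the fraction of predictions that match the true labels:
```{r, eval = FALSE}
# Hypothetical sketch of the accuracy() helper sourced from functions.R:
# fraction of predictions equal to the true labels
accuracy <- function(actual, predicted) {
  mean(actual == predicted)
}
```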
Summary information on the trained model:
```{r, echo = FALSE}
model
```
Note: the estimated out-of-sample error rate is 0.22%.
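The model summary above also prints an out-of-bag (OOB) error estimate, which serves as an internal estimate of the out-of-sample error. If needed, it can be read directly from the fitted object; a short illustration (not run here):
```{r, eval = FALSE}
# OOB error rate after the last tree, expressed as a percentage
tail(model$err.rate[, "OOB"], 1) * 100
```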
Plot of number of trees vs classification error rate (OOB and per-class):
```{r}
plot(model, log = "x")
legend("topright", colnames(model$err.rate), col=1:6, fill=1:6)
```
Predictions on the validation data:
```{r}
pred <- predict(model, newdata = validation_data)
pred
```
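Step 7 above calls for writing out the predictions, and that code is not shown here. A minimal sketch of how it could be done, assuming one text file per prediction; the `write_predictions` helper and the `problem_id_N.txt` naming are assumptions, not part of the original analysis:
```{r, eval = FALSE}
# Hypothetical helper: write each prediction to its own text file
# (file naming convention is an assumption)
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(preds[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(pred)
```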