---
title: "Course Project"
author: "Guruprasad Sridharan"
date: "Sunday, July 27, 2014"
output:
  html_document:
    toc: yes
---
The Random Forest algorithm, as implemented in the randomForest R package, was used to construct the
classifier. It was the obvious choice since I had very little time for the assignment and the
data contained a lot of non-linearity. Random Forest needs little parameter tuning to get decent
results and is simple to train and use, and tree-based models generally perform well on non-linear
classification problems. The plan was to start with bagging approaches (e.g. Decision Tree, Random
Forest) and then try boosting approaches (e.g. AdaBoost, Stochastic Gradient Boosting). Due to the
shortage of time I was only able to test Random Forest, but the results were very good: I got 100%
classification accuracy on the validation data.
Major steps in model training:
1. Remove timestamp columns from training and validation datasets.
2. Remove columns with a large number of missing values from training and validation datasets.
3. Remove columns with nearly zero variance from training and validation datasets.
4. Partition training data into train and test data (for cross-validation).
5. Train a random forest classifier on the train data and examine variable importance.
6. Measure performance on test data.
7. Write out predictions on validation data.
Let me walk through the code.
Loading the required libraries:
```{r, message = FALSE, warning = FALSE}
library(caret)
library(ggplot2)
library(randomForest)
```
```{r, echo = FALSE}
setwd("C:\\Users\\gurupra\\Desktop\\R_Workspace")
source("functions.R")
```
The training data had a lot of columns with empty or NA values, so it was important to specify which
values should be recognized as NA. Reading the training and validation data from the respective CSV files:
```{r}
data = read.table("pml-training.csv", sep=",", header = TRUE, na.strings=c("","NA","#DIV/0!"))
validation_data = read.table("pml-testing.csv", sep=",", header = TRUE, na.strings=c("","NA"
,"#DIV/0!"))
```
Once the data is read, the timestamp-related columns are removed. These columns don't provide any
useful information about the output class and were hard to deal with.
```{r}
#removing timestamp columns
data <- data[,c(-1,-3,-4,-5)]
validation_data <- validation_data[,c(-1,-3,-4,-5)]
```
Some of the columns in the training data had over 95% NA values, while the rest had no missing values. Moreover, these columns were redundant (derivatives of other columns) and didn't provide much useful information, so they were also removed from the data.
```{r}
#removing cols with only NA values
ret <- removeColsWithNAs(data)
data <- ret$a
validation_data <- validation_data[,ret$b]
```
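The helper `removeColsWithNAs()` is defined in `functions.R`, which is not shown here. A minimal sketch of what it might look like, based on how it is used above; the 95% threshold and the returned list names `a` (cleaned data) and `b` (indices of retained columns) are assumptions:
```{r, eval = FALSE}
# Hypothetical sketch of the helper sourced from functions.R:
# drop columns whose fraction of NA values exceeds a threshold and
# return the cleaned data together with the indices of the kept columns.
removeColsWithNAs <- function(df, threshold = 0.95) {
  na_fraction <- colMeans(is.na(df))
  keep <- which(na_fraction < threshold)
  list(a = df[, keep], b = keep)
}
```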
The previous step reduced the number of predictors from 156 to 56.
The next step is to remove predictors with near-zero variance.
```{r}
#removing predictors with zero variance
nzv <- nearZeroVar(data)
data <- data[,-nzv]
validation_data <- validation_data[,-nzv]
```
Partitioning training data into train and test data.
```{r}
set.seed(1729)
#partition train data into training and testing data
inTrain = createDataPartition(y = data$classe, p = 0.7, list = FALSE)
training = data[inTrain, ]
testing = data[-inTrain, ]
```
Plot of yaw_belt vs num_window in training set:
```{r, echo = FALSE}
qplot(training$yaw_belt, training$num_window, color = training$classe, geom = "jitter")
```
The above plot shows the non-linearity of the classification problem.
Plot of mtry vs out-of-bag error (for 501 trees):
```{r}
# Training a randomForest model:
# find an optimal value for the mtry parameter, using 501 trees per try
tuneRF(training[,-55], training[,55], ntreeTry = 501, plot = TRUE)
```
Dimensionality reduction using PCA was avoided since it was too time-consuming on such a large dataset.
Fitting the model to the training data and predicting on the test data:
```{r}
set.seed(223)
model <- randomForest(classe ~ ., data = training, importance = TRUE, ntree = 501)
pred <- predict(model, newdata = testing)
```
Plot of variables vs overall importance:
```{r, echo = FALSE, fig.width = 10, fig.height = 10}
# variable importance measures from the fitted model
vi <- varImp(model)
# aggregate importance across columns and sort in decreasing order
vi$sum <- rowSums(vi)
vi <- vi[with(vi, order(-sum)), ]
# dot plot of aggregate importance for the 54 predictors, labelled on the y-axis
par(mai = c(1, 5, 1, 1))
plot(vi$sum, 1:54, yaxt = 'n', main = "Variable Importance", xlab = "Importance", ylab = "", pch = 16, col = "red")
axis(2, at = 1:54, labels = rownames(vi), las = 2)
abline(h = 1:54, v = 0, col = "gray60")
```
Confusion matrix for predictions on the test data:
```{r, echo = FALSE}
table(pred, testing$classe)
```
Overall accuracy and class-wise accuracy:
```{r, echo = FALSE}
acc <- accuracy(testing$classe, pred)
cat("Overall Accuracy = ", acc)
# per-class figures: diagonal of the confusion matrix divided by the row (predicted-class) totals
classes <- levels(testing$classe)
tb <- table(pred, testing$classe)
row_sum <- rowSums(tb)
for (i in 1:length(classes)) {
  cat("Accuracy for class ", classes[i], " = ", tb[i,i]/row_sum[i], "\n")
}
```
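The `accuracy()` helper also comes from `functions.R`. A minimal sketch, assuming it simply computes the fraction of predictions that match the true labels:
```{r, eval = FALSE}
# Hypothetical sketch of the accuracy() helper sourced from functions.R:
# fraction of predictions equal to the true labels
accuracy <- function(actual, predicted) {
  mean(actual == predicted)
}
```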
Summary information on the trained model:
```{r, echo = FALSE}
model
```
Note: the estimated out-of-sample error rate is 0.22%.
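The model summary above also prints an out-of-bag (OOB) error estimate, which serves as an internal estimate of the out-of-sample error. If needed, it can be read directly from the fitted object; a short illustration (not run here):
```{r, eval = FALSE}
# OOB error rate after the last tree, expressed as a percentage
tail(model$err.rate[, "OOB"], 1) * 100
```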
Plot of number of trees vs classification error rate (OOB and per-class):
```{r}
plot(model, log = "x")
legend("topright", colnames(model$err.rate), col=1:6, fill=1:6)
```
Predictions on the validation data:
```{r}
pred <- predict(model, newdata = validation_data)
pred
```
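Step 7 above calls for writing out the predictions, and that code is not shown here. A minimal sketch of how it could be done, assuming one text file per prediction; the `write_predictions` helper and the `problem_id_N.txt` naming are assumptions, not part of the original analysis:
```{r, eval = FALSE}
# Hypothetical helper: write each prediction to its own text file
# (file naming convention is an assumption)
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(preds[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(pred)
```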