Here we build a classifier using a random forest that allows one to determine the type of barbell lift movement from data collected by accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The data comes from the Human Activity Recognition project at http://groupware.les.inf.puc-rio.br/har, where the original publication describing the dataset is also referenced.
We first go through building a classifier that uses all of the features in the data set. However, training this model takes quite a long time. As such, we then identify the features that are most important for classification accuracy and use those features to construct a reduced model that takes much less time to train while retaining the same level of accuracy.
Here we build our random forest classifier.
First, we perform a very basic exploration of the data set.
library(ggplot2)
suppressMessages(library(gridExtra))
# Read in the training and testing data, treating empty strings as missing values
training<-read.csv('pml-training.csv',na.strings=c("NA",""))
testing<-read.csv('pml-testing.csv',na.strings=c("NA",""))
dim(training)
## [1] 19622 160
g1<-ggplot(training,aes(x=classe,y=roll_belt))+geom_point()+ylab('Roll Belt')
g1<-g1+xlab('Activity Type')
g2<-ggplot(training,aes(x=classe,y=yaw_belt))+geom_point()+ylab('Yaw Belt')
g2<-g2+xlab('Activity Type')
grid.arrange(g1,g2,ncol=2)
From the above, we see that the training data has 160 columns and 19,622 observations. A little more prodding shows that the activity labels are in the classe column and are stored as a factor with levels A, B, C, D, and E. The plots above show two of the features as a function of the activity label. We see that the features have different ranges of values, and also that the data appears to cluster in different ways for different features as a function of the activity label. We also find that the first 6 columns of the data set, which contain features such as the user name and time stamps, are not relevant for our analysis, so we choose to exclude them.
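To confirm these observations, one could inspect the data directly; for example (output not shown here):
# Inspect the identifier/time-stamp columns and the activity labels
str(training[,1:6])
levels(training$classe)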
Next, we remove the extraneous variables mentioned above. We also find that quite a few of the features consist mostly of missing values, so we exclude them from our model. Lastly, as many of the features have different ranges of values, we normalize the training data by subtracting the mean and dividing by the standard deviation of each feature. We apply the exact same transformation (using the training-set means and standard deviations) to the testing data.
# Remove extraneous features
training<-training[,7:160]
testing<-testing[,7:160]
nrow<-dim(training)[1]
ncol<-dim(training)[2]
# Convert everything to numeric (except classe labels)
training[,-ncol]<-sapply(training[,-ncol],as.numeric)
testing[,-ncol]<-sapply(testing[,-ncol],as.numeric)
# Keep only the columns that contain no missing values
notNA<-apply(!is.na(training),2,sum)>nrow-1
training<-training[,notNA]
testing<-testing[,notNA]
# Normalize each feature using the training-set mean and standard deviation
ncol<-dim(training)[2]
for(kk in 1:(ncol-1)){
    mu<-mean(training[,kk])
    s<-sd(training[,kk])
    training[,kk]<-(training[,kk]-mu)/s
    testing[,kk]<-(testing[,kk]-mu)/s
}
# The last remaining column is classe; ncol now gives its index
ncol<-dim(training)[2]
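As an aside, the same centering and scaling could also be done with caret's preProcess function (caret is loaded below); a minimal equivalent sketch, not run in this analysis:
# Not run here: equivalent normalization via caret's preProcess
preObj<-preProcess(training[,-ncol],method=c("center","scale"))
training[,-ncol]<-predict(preObj,training[,-ncol])
testing[,-ncol]<-predict(preObj,testing[,-ncol])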
Here, we fit the random forest model on a subset of the training data in order to make the training go a little faster. We choose to use 100 trees for our random forest model; we will explore the effect of the number of trees in the next section. Lastly, we employ 5-fold cross-validation in order to estimate our out-of-sample error.
suppressMessages(library(caret))
suppressMessages(library(randomForest))
# Set random seed for reproducibility
set.seed(125)
# Only use a subset of the training data (to make fitting go faster)
InTrain<-createDataPartition(y=training$classe,p=0.3,list=FALSE)
training1<-training[InTrain,]
rm(training)
# Fit the random forest model with 100 trees
fit1<-train(classe~.,data=training1,method="rf",
            ntree=100,trControl=trainControl(method="cv",number=5),
            prox=TRUE,allowParallel=TRUE)
print(fit1$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = param$mtry, proximity = TRUE, allowParallel = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 1.1%
## Confusion matrix:
## A B C D E class.error
## A 1672 1 0 0 1 0.001194743
## B 13 1116 11 0 0 0.021052632
## C 0 12 1011 4 0 0.015579357
## D 0 0 16 948 1 0.017616580
## E 0 1 1 4 1077 0.005540166
As one can see, the out-of-bag estimate of the error rate is only about 1%, so we expect a similarly low out-of-sample error!
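The 5-fold cross-validation estimates can also be read directly from the fitted caret object; for example (output not shown here):
# Resampled accuracy for each value of mtry that was tried
print(fit1$results)
# Per-fold accuracy for the final model
print(fit1$resample)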
Below, we look at the importance of the different features of the model. It looks like we may be able to get away with using fewer features as well as fewer trees and still retain a low error rate.
# Variable importance for the full model
imp<-varImp(fit1,scale=FALSE)
plot(imp)
# Out-of-bag error as a function of the number of trees
plot(seq(1,100),fit1$finalModel$err.rate[,1],xlab='Number of Trees',ylab='Out of Bag Error')
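For example, the out-of-bag error after only 60 trees can be read off the same err.rate matrix (output not shown here):
# OOB error estimate using only the first 60 trees
fit1$finalModel$err.rate[60,"OOB"]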
Here, we explore the error in a model where we use the top 12 most important features and 60 trees.
# Find the 12 most important features
nms<-rownames(imp$importance)
ind<-order(imp$importance$Overall,decreasing=TRUE)
imp12<-nms[ind[1:12]]
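# For reference, the names of the 12 selected features (output not shown here)
print(imp12)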
ninds<-match(imp12,names(training1))
# Create a new data frame with just those 12 features and the classe column
ninds<-c(ninds,ncol)
training1<-training1[,ninds]
# Set random seed for reproducibility
set.seed(125)
# Train a random forest model with 12 features and 60 trees
fit2<-train(classe~.,data=training1,method="rf",
            ntree=60,trControl=trainControl(method="cv",number=5),
            prox=TRUE,allowParallel=TRUE)
print(fit2$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 60, mtry = param$mtry, proximity = TRUE, allowParallel = TRUE)
## Type of random forest: classification
## Number of trees: 60
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.97%
## Confusion matrix:
## A B C D E class.error
## A 1668 1 2 2 1 0.003584229
## B 11 1120 7 1 1 0.017543860
## C 0 6 1018 3 0 0.008763389
## D 0 0 8 956 1 0.009326425
## E 2 6 2 3 1070 0.012003693
Thus, it appears that using fewer features and fewer trees slightly improves our accuracy and also reduces the amount of time required for training.
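caret also records how long each call to train took, so the reduction in training time can be checked directly from the two fitted objects (output not shown here):
# Total elapsed time recorded by caret for each model fit
fit1$times$everything
fit2$times$everything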
Lastly, we can compare the predictions of our two models on the test data to see whether anything changes.
# Predict the test cases with each model and cross-tabulate the results
pred1<-predict(fit1,testing)
pred2<-predict(fit2,testing[,ninds])
table(pred1,pred2)
## pred2
## pred1 A B C D E
## A 7 0 0 0 0
## B 0 8 0 0 0
## C 0 0 1 0 0
## D 0 0 0 1 0
## E 0 0 0 0 3
As one can see, both models predict the same classe labels for the test data.
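Equivalently, the agreement between the two sets of predictions can be checked directly (output not shown here):
# Verify that the two models agree on every test case
all(pred1==pred2)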