Synopsis
The modern “quantified self” movement relies on wearable devices such as Jawbone Up, Nike FuelBand, and Fitbit. People who use them regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available on the Groupware@LES website (see the section on the Weight Lifting Exercise Dataset).
Our goal is to build a model that predicts the type of execution (the outcome is the variable “classe”). The value “A” means correct execution; the other four values, “B”, “C”, “D”, and “E”, correspond to common types of mistakes. The predictors are the accelerometer readings.
First we load the training and test data, and split the training data into train (70%) and validation (30%) datasets. After brief preprocessing, we build a first model to estimate the variables’ importance, which lets us reduce the number of predictors and make modelling less resource-consuming.
After that, we build two competing models based on the Random Forest and Gradient Boosting algorithms. In this case, Random Forest gave slightly higher accuracy, so we used it on the test set and scored 20/20 on submission.
Data loading, cleaning and preprocessing
We start by loading the training and testing data.
library(caret);
set.seed(123);
# Download the training data into a temporary file and read it in;
# the first column holds row numbers, so we use it as row names
temp <- tempfile();
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",temp);
data <- read.csv(temp, row.names=1);
unlink(temp);
# Download and read the testing data the same way
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",temp);
test <- read.csv(temp, row.names=1);
unlink(temp);
rm(temp);
We have a large training dataset, “data” (19622 observations), and a small test dataset, “test” (20 observations). Given this, we split “data” and use 70% of its observations for training models and 30% for their validation.
inTrain <- createDataPartition(y=data$classe, p=0.7, list=FALSE);
train <- data[inTrain,];
validation <- data[-inTrain,];
Let’s take a closer look at our data.
str(train, list.len=200);
## 'data.frame': 13737 obs. of 159 variables:
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 820366 120339 304277 368296 440390 484323 484434 500302 528316 560359 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 12 12 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.42 1.48 1.45 1.42 1.42 1.43 1.45 1.45 1.43 1.42 ...
## $ pitch_belt : num 8.07 8.05 8.06 8.09 8.13 8.16 8.17 8.18 8.18 8.2 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 0.02 ...
## $ gyros_belt_y : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 0 -0.02 -0.02 0 ...
## $ accel_belt_x : int -20 -22 -21 -22 -22 -20 -21 -21 -22 -22 ...
## $ accel_belt_y : int 5 3 4 3 4 2 4 2 2 4 ...
## $ accel_belt_z : int 23 21 21 21 21 24 22 23 23 21 ...
## $ magnet_belt_x : int -2 -6 0 -4 -2 1 -3 -5 -2 -3 ...
## $ magnet_belt_y : int 600 604 603 599 603 602 609 596 602 606 ...
## $ magnet_belt_z : int -305 -310 -312 -311 -313 -312 -308 -317 -319 -309 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.1 22 21.9 21.8 21.7 21.6 21.5 21.5 21.4 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0.02 0.02 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 ...
## $ gyros_arm_y : num -0.02 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 ...
## $ gyros_arm_z : num -0.02 0.02 0 0 0 -0.02 -0.02 0 0 -0.02 ...
## $ accel_arm_x : int -289 -289 -289 -289 -289 -288 -288 -290 -288 -287 ...
## $ accel_arm_y : int 110 111 111 111 111 109 110 110 111 111 ...
## $ accel_arm_z : int -126 -123 -122 -125 -124 -122 -124 -123 -123 -124 ...
## $ magnet_arm_x : int -368 -372 -369 -373 -372 -369 -376 -366 -363 -372 ...
## $ magnet_arm_y : int 344 344 342 336 338 341 334 339 343 338 ...
## $ magnet_arm_z : int 513 512 513 509 510 518 516 509 520 509 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 12.9 13.4 13.4 13.1 12.8 ...
## $ pitch_dumbbell : num -70.3 -70.4 -70.8 -70.2 -70.3 ...
## $ yaw_dumbbell : num -85.1 -84.9 -84.5 -85.1 -85.1 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_dumbbell: num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_dumbbell : Factor w/ 3 levels "","#DIV/0!","0.00": 1 1 1 1 1 1 1 1 1 1 ...
## $ total_accel_dumbbell : int 37 37 37 37 37 37 37 37 37 37 ...
## $ var_accel_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 -0.02 0 0 0 0 0 0 0 -0.02 ...
## $ accel_dumbbell_x : int -232 -232 -234 -232 -234 -232 -235 -233 -233 -234 ...
## $ accel_dumbbell_y : int 46 48 48 47 46 47 48 47 47 48 ...
## $ accel_dumbbell_z : int -270 -269 -269 -270 -272 -269 -270 -269 -270 -269 ...
## $ magnet_dumbbell_x : int -561 -552 -558 -551 -555 -549 -558 -564 -554 -552 ...
## $ magnet_dumbbell_y : int 298 303 294 295 300 292 291 299 291 302 ...
## $ magnet_dumbbell_z : num -63 -60 -66 -70 -74 -65 -69 -64 -65 -69 ...
## $ roll_forearm : num 28.3 28.1 27.9 27.9 27.8 27.7 27.7 27.6 27.5 27.2 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.8 -63.8 -63.9 ...
## $ yaw_forearm : num -152 -152 -152 -152 -152 -152 -152 -152 -152 -151 ...
## $ kurtosis_roll_forearm : Factor w/ 322 levels "","-0.0227","-0.0359",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_forearm : Factor w/ 323 levels "","-0.0073","-0.0442",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_forearm : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_forearm : Factor w/ 323 levels "","-0.0004","-0.0013",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_forearm : Factor w/ 319 levels "","-0.0113","-0.0131",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_forearm : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_forearm : Factor w/ 45 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_forearm : Factor w/ 45 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_forearm : Factor w/ 3 levels "","#DIV/0!","0.00": 1 1 1 1 1 1 1 1 1 1 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ var_accel_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_forearm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_forearm_x : num 0.03 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.02 0 ...
## $ gyros_forearm_y : num -0.02 -0.02 -0.02 0 -0.02 0 0 -0.02 0.02 0 ...
## $ gyros_forearm_z : num 0 0 -0.03 -0.02 0 -0.02 -0.02 -0.02 -0.03 -0.03 ...
## $ accel_forearm_x : int 196 189 193 195 193 193 190 193 191 193 ...
## $ accel_forearm_y : int 204 206 203 205 205 204 205 205 203 205 ...
## $ accel_forearm_z : int -213 -214 -215 -215 -213 -214 -215 -214 -215 -215 ...
## $ magnet_forearm_x : int -18 -16 -9 -18 -9 -16 -22 -17 -11 -15 ...
## $ magnet_forearm_y : num 658 658 660 659 660 653 656 657 657 655 ...
## $ magnet_forearm_z : num 469 469 478 470 474 476 473 465 478 472 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
Obviously, many variables consist largely of blank or “NA” values. We want to get rid of such features to streamline the training process, so we eliminate every variable in which more than 95% of the cells are empty.
# A cell counts as empty if it is NA, NULL, or an empty string
is.empty <- function(x) {is.na(x) || is.null(x) || x==""};
# Count the share of empty cells in each column, and get rid of almost empty columns
col.void <- sapply(1:ncol(train), function(i) {sum(sapply(train[,i], is.empty))/length(train[,i])});
train <- train[,col.void<0.95];
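As a side note, the per-column share of empty cells can also be computed in a vectorized way, which is much faster on a dataset of this size. A minimal sketch of an equivalent computation, assuming (as above) that an empty cell is either NA or an empty string:
# TRUE wherever a cell is NA or "", then take column-wise means
col.void <- colMeans(is.na(train) | train == "");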
Now let’s exclude the columns which don’t contain accelerometer information (the subject’s name, timestamps, window markers, etc.).
# Drop user_name, both raw timestamps, cvtd_timestamp, new_window and num_window
train <- train[,-c(1:6)];
Even now we have a lot of predictors: 52. This may make the training process “expensive” and invite overfitting. Let’s find out the variables’ importance.
First modelling
We build the first model using a random sample of 3000 observations.
n <- 3000;
train.first <- train[sample(1:dim(train)[1],n),];
# a second random subsample for a rough estimate of the first model's performance
train.first.est <- train[sample(1:dim(train)[1],n),];
We use the Random Forest algorithm for our first model, as it is a well-known, accurate, and effective modelling tool. This is not our final model, though, but just a step to estimate the variables’ importance.
First, we need to set the training parameters for this and the subsequent models. To make cross-validation part of the training process, we set the resampling method to “cv” (the default is “boot”); thus we use the cross-validation built into the train() function. We also set the number of folds to 5 (the default is 10) to reduce the execution time of train().
trc <- trainControl(method="cv", number=5);
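Since the train() calls below are resource-consuming, it is worth noting that caret can parallelize resampling through any registered foreach backend. A minimal sketch using the doParallel package (hypothetical core count; not run in this report):
library(doParallel);
cl <- makePSOCKcluster(4);   # adjust to the number of available cores
registerDoParallel(cl);
# ... run the train() calls below, then release the workers ...
stopCluster(cl);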
Now we start the first modelling. Note that we restrict the number of trees to 100.
start <- Sys.time();
modelFit1 <- train(classe ~ ., data=train.first, method="rf", trControl=trc, ntree=100);
Sys.time()-start;
## Time difference of 37.25813 secs
Let’s look at the first model’s performance on our random subsample from the train dataset:
confusionMatrix(train.first.est$classe, predict(modelFit1, train.first.est));
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 868 3 0 2 0
## B 24 537 9 2 3
## C 0 13 519 5 1
## D 1 1 8 441 0
## E 1 2 1 10 549
##
## Overall Statistics
##
## Accuracy : 0.9713
## 95% CI : (0.9647, 0.977)
## No Information Rate : 0.298
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9636
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9709 0.9658 0.9665 0.9587 0.9928
## Specificity 0.9976 0.9845 0.9923 0.9961 0.9943
## Pos Pred Value 0.9943 0.9339 0.9647 0.9778 0.9751
## Neg Pred Value 0.9878 0.9922 0.9927 0.9925 0.9984
## Prevalence 0.2980 0.1853 0.1790 0.1533 0.1843
## Detection Rate 0.2893 0.1790 0.1730 0.1470 0.1830
## Detection Prevalence 0.2910 0.1917 0.1793 0.1503 0.1877
## Balanced Accuracy 0.9843 0.9751 0.9794 0.9774 0.9935
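For reference, the headline accuracy can be extracted from the confusionMatrix object programmatically rather than read off the printout; a small sketch (“cm” is just a local name for the object printed above):
cm <- confusionMatrix(train.first.est$classe, predict(modelFit1, train.first.est));
cm$overall["Accuracy"];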
The accuracy metrics are already not bad, but we estimated them only on a subsample of the training data, so they may be optimistic. Now let’s get back to the variables’ importance.
# Extract variable importance from the first model and sort it in decreasing order
impvars <- varImp(modelFit1);
imp <- data.frame(var=rownames(impvars$importance), importance=impvars$importance$Overall);
imp <- imp[order(imp$importance, decreasing = TRUE),];
imp;
## var importance
## 1 roll_belt 100.0000000
## 41 pitch_forearm 76.9102136
## 40 roll_forearm 45.1487376
## 3 yaw_belt 44.9610374
## 39 magnet_dumbbell_z 43.5153761
## 38 magnet_dumbbell_y 40.0576144
## 2 pitch_belt 34.1344251
## 27 roll_dumbbell 26.6290642
## 35 accel_dumbbell_y 19.9706637
## 12 magnet_belt_y 18.9809479
## 47 accel_forearm_x 18.8106593
## 52 magnet_forearm_z 16.1208100
## 13 magnet_belt_z 15.9641009
## 10 accel_belt_z 15.5034991
## 37 magnet_dumbbell_x 14.6794097
## 36 accel_dumbbell_z 12.1495164
## 30 total_accel_dumbbell 10.7129512
## 29 yaw_dumbbell 10.2065239
## 49 accel_forearm_z 10.0561835
## 32 gyros_dumbbell_y 9.6070191
## 24 magnet_arm_x 9.4316614
## 7 gyros_belt_z 9.0785522
## 51 magnet_forearm_y 8.9445846
## 16 yaw_arm 8.8008925
## 11 magnet_belt_x 8.3987527
## 34 accel_dumbbell_x 7.8879090
## 14 roll_arm 7.3139012
## 25 magnet_arm_y 6.8810186
## 50 magnet_forearm_x 5.9805572
## 26 magnet_arm_z 5.7664812
## 21 accel_arm_x 5.1397727
## 42 yaw_forearm 4.6488830
## 18 gyros_arm_x 4.5877776
## 28 pitch_dumbbell 3.8000713
## 48 accel_forearm_y 3.7373988
## 43 total_accel_forearm 3.5832432
## 19 gyros_arm_y 3.3851034
## 22 accel_arm_y 3.3839381
## 4 total_accel_belt 3.3611452
## 31 gyros_dumbbell_x 2.6793627
## 15 pitch_arm 2.3398495
## 8 accel_belt_x 1.9076277
## 45 gyros_forearm_y 1.6912214
## 23 accel_arm_z 1.4875487
## 33 gyros_dumbbell_z 1.2108121
## 17 total_accel_arm 1.0341835
## 9 accel_belt_y 0.9899474
## 46 gyros_forearm_z 0.9750620
## 6 gyros_belt_y 0.1559453
## 44 gyros_forearm_x 0.1456163
## 5 gyros_belt_x 0.1166406
## 20 gyros_arm_z 0.0000000
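caret can also visualize this ranking directly through the plot method for varImp objects; a quick sketch:
# dot plot of the 20 most important variables
plot(impvars, top = 20);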
Thus we have only two features with importance higher than 50%. Let’s plot them against each other, using color to distinguish the outcome classes.
# ggplot2 is attached along with caret; aes_string lets us pass column names as strings
g <- ggplot(train, aes_string(x=as.character(imp[1,1]), y=as.character(imp[2,1])));
g <- g + geom_point(size=3, alpha = 0.7, aes(color=classe));
print(g);
As we see, there are no clear patterns here, though the figure gives us something to think about. Obviously, we can’t draw decision boundaries based on this plot alone, so we must take other predictors into account. We take the variables with importance > 2.
vip <- as.character(imp[imp$importance > 2, 1]);
Thus we get a vector of 41 important variables.
train <- train[,c(vip,"classe")];
Main modelling
Now the dataset “train” is preprocessed, and we are ready to start the main modelling. Due to the constraints of this report, we try only two algorithms commonly used for classification problems: Random Forest and Gradient Boosting.
Random forest
start <- Sys.time();
modelFit.rf <- train(classe ~ ., data=train, method="rf", trControl=trc, ntree=100);
Sys.time()-start;
## Time difference of 2.884148 mins
Let’s look at our new model’s parameters, particularly the Accuracy calculated during the built-in cross-validation (see the bottom of the next block). From it we can evaluate the expected out-of-sample error, which is 1 - Accuracy.
modelFit.rf;
## Random Forest
##
## 13737 samples
## 41 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 10989, 10989, 10991, 10989, 10990
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9901001 0.9874752 0.002485364 0.003146727
## 21 0.9907554 0.9883059 0.002474300 0.003130277
## 41 0.9862422 0.9825975 0.003256557 0.004119349
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 21.
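The same figure can be extracted programmatically from the standard fields of the caret train object; a sketch:
# CV accuracy of the selected tuning point (mtry = 21)
best <- modelFit.rf$results[modelFit.rf$results$mtry == modelFit.rf$bestTune$mtry,];
1 - best$Accuracy;   # expected out-of-sample error, about 0.009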
Now we can estimate the quality of our model by predicting on the validation dataset:
confusionMatrix(validation$classe, predict(modelFit.rf, validation));
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 3 0 0 0
## B 5 1124 9 0 1
## C 0 8 1016 2 0
## D 0 0 13 950 1
## E 0 0 2 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9898, 0.9944)
## No Information Rate : 0.2848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9903 0.9769 0.9969 0.9981
## Specificity 0.9993 0.9968 0.9979 0.9972 0.9994
## Pos Pred Value 0.9982 0.9868 0.9903 0.9855 0.9972
## Neg Pred Value 0.9988 0.9977 0.9951 0.9994 0.9996
## Prevalence 0.2848 0.1929 0.1767 0.1619 0.1837
## Detection Rate 0.2839 0.1910 0.1726 0.1614 0.1833
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9982 0.9936 0.9874 0.9970 0.9988
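From this confusion matrix, the out-of-sample error estimated on the truly held-out validation set is simply the misclassification rate; a sketch:
pred.rf <- predict(modelFit.rf, validation);
1 - mean(pred.rf == validation$classe);   # about 0.0076, i.e. 1 - Accuracy above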
Gradient boosting
start <- Sys.time();
# Fix the gbm tuning parameters at a single point to keep the run time down
# (note: newer versions of caret also require .n.minobsinnode in this grid)
gbmGrid <- expand.grid(.interaction.depth = 20, .n.trees = 50, .shrinkage = 0.1);
modelFit.gbm <- train(classe ~ ., data=train, method="gbm", trControl=trc, tuneGrid = gbmGrid);
Sys.time()-start;
## Time difference of 5.767263 mins
Let’s look at the model’s parameters, again focusing on the Accuracy.
modelFit.gbm;
## Stochastic Gradient Boosting
##
## 13737 samples
## 41 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 10990, 10989, 10991, 10989, 10989
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9851495 0.9812164 0.003360269 0.004252811
##
## Tuning parameter 'n.trees' was held constant at a value of 50
## Tuning parameter
## 'interaction.depth' was held constant at a value of 20
## Tuning parameter 'shrinkage' was
## held constant at a value of 0.1
##
…and here is the second model’s prediction on the validation dataset:
confusionMatrix(validation$classe, predict(modelFit.gbm, validation));
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1667 6 1 0 0
## B 7 1121 11 0 0
## C 0 22 999 5 0
## D 0 0 21 941 2
## E 0 0 1 1 1080
##
## Overall Statistics
##
## Accuracy : 0.9869
## 95% CI : (0.9837, 0.9897)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9834
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9958 0.9756 0.9671 0.9937 0.9982
## Specificity 0.9983 0.9962 0.9944 0.9953 0.9996
## Pos Pred Value 0.9958 0.9842 0.9737 0.9761 0.9982
## Neg Pred Value 0.9983 0.9941 0.9930 0.9988 0.9996
## Prevalence 0.2845 0.1952 0.1755 0.1609 0.1839
## Detection Rate 0.2833 0.1905 0.1698 0.1599 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9971 0.9859 0.9808 0.9945 0.9989
Conclusions
As we see, Random Forest has slightly higher Accuracy, and therefore a lower expected out-of-sample error in this case. We will use modelFit.rf to predict the outcome on the test dataset.
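As an informal cross-check before the final prediction, the two models’ cross-validation resampling distributions can be compared side by side with caret’s resamples() (a sketch; the folds were generated independently for each model, so this comparison is only indicative):
rs <- resamples(list(rf = modelFit.rf, gbm = modelFit.gbm));
summary(rs);   # Accuracy and Kappa distributions for both models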
answers<-as.character(predict(modelFit.rf,test));
answers;
## [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A" "B" "B" "B"
Now we write the answers to files for submission:
# 'wd' is not defined earlier in this report; we assume here that the target
# directory exists under the current working directory
wd <- getwd();
pml_write_files <- function(x) {
    n <- length(x);
    for (i in 1:n) {
        filename <- paste0(wd, "/Practical_Machine_Learning/CP_answers/problem_id_", i, ".txt");
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE);
    }
}
pml_write_files(answers);
The result of the submission on the test data is 20/20, so we can conclude that the model modelFit.rf works well.