Synopsis

The modern “quantified self” movement is built around wearable devices such as the Jawbone Up, Nike FuelBand, and Fitbit. People regularly quantify how much of a particular activity they do, but rarely quantify how well they do it.

In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available on the Groupware@LES website (see the section on the Weight Lifting Exercise Dataset).

Our goal is to build a model that predicts the type of execution (the variable “classe” is the outcome). The value “A” means correct execution; the other four, “B”, “C”, “D”, and “E”, are common types of mistakes. The predictors are the readings from the accelerometers.

First we load the training and test data, and split the training data into train (70%) and validation (30%) datasets. After brief preprocessing, we build a first model to estimate the variables’ importance, which lets us reduce the number of predictors and make modelling less resource-consuming.

After that, we build two competing models based on the Random Forest and Gradient Boosting algorithms. In this case, Random Forest gave slightly higher accuracy, so we applied it to the test set and scored 20/20 on submission.

Data loading, cleaning and preprocessing

We start by loading the training and testing data.

library(caret);
set.seed(123);
temp <- tempfile();

# download and read the training data
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",temp);
data <- read.csv(temp, row.names=1);
unlink(temp);

# download and read the testing data
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",temp);
test <- read.csv(temp, row.names=1);
unlink(temp);
rm(temp);
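
Note that the blank cells and spreadsheet artifacts such as “#DIV/0!” (visible in the str() output below) could also be converted to NA right at read time. A minimal variant we did not use here; temp2 and data.alt are illustrative names:

# Hypothetical alternative: map blanks and "#DIV/0!" to NA while reading
temp2 <- tempfile();
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", temp2);
data.alt <- read.csv(temp2, row.names=1, na.strings=c("NA", "", "#DIV/0!"));
unlink(temp2);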

We have a large training dataset, “data” (19,622 observations), and a small test dataset, “test” (20 observations). Given this, we split “data” and use 70% of its observations for training models and 30% for validation.

inTrain <- createDataPartition(y=data$classe, p=0.7, list=FALSE);
train <- data[inTrain,];
validation <- data[-inTrain,];
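
createDataPartition() samples within each class, so the split should preserve the outcome proportions. A quick sanity check (a sketch, not part of the original pipeline):

# Class proportions should be nearly identical in both partitions
round(prop.table(table(train$classe)), 3);
round(prop.table(table(validation$classe)), 3);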

Let’s take a closer look at our data.

str(train, list.len=200);
## 'data.frame':    13737 obs. of  159 variables:
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  820366 120339 304277 368296 440390 484323 484434 500302 528316 560359 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 12 12 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.42 1.48 1.45 1.42 1.42 1.43 1.45 1.45 1.43 1.42 ...
##  $ pitch_belt              : num  8.07 8.05 8.06 8.09 8.13 8.16 8.17 8.18 8.18 8.2 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt     : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt      : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt.1    : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02 0.02 ...
##  $ gyros_belt_y            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 0 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -20 -22 -21 -22 -22 -20 -21 -21 -22 -22 ...
##  $ accel_belt_y            : int  5 3 4 3 4 2 4 2 2 4 ...
##  $ accel_belt_z            : int  23 21 21 21 21 24 22 23 23 21 ...
##  $ magnet_belt_x           : int  -2 -6 0 -4 -2 1 -3 -5 -2 -3 ...
##  $ magnet_belt_y           : int  600 604 603 599 603 602 609 596 602 606 ...
##  $ magnet_belt_z           : int  -305 -310 -312 -311 -313 -312 -308 -317 -319 -309 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.1 22 21.9 21.8 21.7 21.6 21.5 21.5 21.4 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0.02 0.02 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  -0.02 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 ...
##  $ gyros_arm_z             : num  -0.02 0.02 0 0 0 -0.02 -0.02 0 0 -0.02 ...
##  $ accel_arm_x             : int  -289 -289 -289 -289 -289 -288 -288 -290 -288 -287 ...
##  $ accel_arm_y             : int  110 111 111 111 111 109 110 110 111 111 ...
##  $ accel_arm_z             : int  -126 -123 -122 -125 -124 -122 -124 -123 -123 -124 ...
##  $ magnet_arm_x            : int  -368 -372 -369 -373 -372 -369 -376 -366 -363 -372 ...
##  $ magnet_arm_y            : int  344 344 342 336 338 341 334 339 343 338 ...
##  $ magnet_arm_z            : int  513 512 513 509 510 518 516 509 520 509 ...
##  $ kurtosis_roll_arm       : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_arm      : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_arm        : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_arm       : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_arm      : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_arm        : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  12.9 13.4 13.4 13.1 12.8 ...
##  $ pitch_dumbbell          : num  -70.3 -70.4 -70.8 -70.2 -70.3 ...
##  $ yaw_dumbbell            : num  -85.1 -84.9 -84.5 -85.1 -85.1 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_dumbbell  : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_dumbbell: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_dumbbell  : Factor w/ 3 levels "","#DIV/0!","0.00": 1 1 1 1 1 1 1 1 1 1 ...
##  $ total_accel_dumbbell    : int  37 37 37 37 37 37 37 37 37 37 ...
##  $ var_accel_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_dumbbell    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_dumbbell   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_dumbbell     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_dumbbell        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_dumbbell_x        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y        : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z        : num  0 -0.02 0 0 0 0 0 0 0 -0.02 ...
##  $ accel_dumbbell_x        : int  -232 -232 -234 -232 -234 -232 -235 -233 -233 -234 ...
##  $ accel_dumbbell_y        : int  46 48 48 47 46 47 48 47 47 48 ...
##  $ accel_dumbbell_z        : int  -270 -269 -269 -270 -272 -269 -270 -269 -270 -269 ...
##  $ magnet_dumbbell_x       : int  -561 -552 -558 -551 -555 -549 -558 -564 -554 -552 ...
##  $ magnet_dumbbell_y       : int  298 303 294 295 300 292 291 299 291 302 ...
##  $ magnet_dumbbell_z       : num  -63 -60 -66 -70 -74 -65 -69 -64 -65 -69 ...
##  $ roll_forearm            : num  28.3 28.1 27.9 27.9 27.8 27.7 27.7 27.6 27.5 27.2 ...
##  $ pitch_forearm           : num  -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.8 -63.8 -63.9 ...
##  $ yaw_forearm             : num  -152 -152 -152 -152 -152 -152 -152 -152 -152 -151 ...
##  $ kurtosis_roll_forearm   : Factor w/ 322 levels "","-0.0227","-0.0359",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_forearm  : Factor w/ 323 levels "","-0.0073","-0.0442",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_forearm    : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_forearm   : Factor w/ 323 levels "","-0.0004","-0.0013",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_forearm  : Factor w/ 319 levels "","-0.0113","-0.0131",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_forearm    : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_forearm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_forearm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_forearm         : Factor w/ 45 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_forearm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_forearm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_forearm         : Factor w/ 45 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_forearm  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_forearm : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_forearm   : Factor w/ 3 levels "","#DIV/0!","0.00": 1 1 1 1 1 1 1 1 1 1 ...
##  $ total_accel_forearm     : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ var_accel_forearm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_forearm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_forearm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_forearm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_forearm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_forearm    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_forearm       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_forearm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_forearm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_forearm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_forearm_x         : num  0.03 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.02 0 ...
##  $ gyros_forearm_y         : num  -0.02 -0.02 -0.02 0 -0.02 0 0 -0.02 0.02 0 ...
##  $ gyros_forearm_z         : num  0 0 -0.03 -0.02 0 -0.02 -0.02 -0.02 -0.03 -0.03 ...
##  $ accel_forearm_x         : int  196 189 193 195 193 193 190 193 191 193 ...
##  $ accel_forearm_y         : int  204 206 203 205 205 204 205 205 203 205 ...
##  $ accel_forearm_z         : int  -213 -214 -215 -215 -213 -214 -215 -214 -215 -215 ...
##  $ magnet_forearm_x        : int  -18 -16 -9 -18 -9 -16 -22 -17 -11 -15 ...
##  $ magnet_forearm_y        : num  658 658 660 659 660 653 656 657 657 655 ...
##  $ magnet_forearm_z        : num  469 469 478 470 474 476 473 465 478 472 ...
##  $ classe                  : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Clearly, many variables consist mostly of blank or NA values. We want to get rid of such features to streamline the training process, so we eliminate every variable that is more than 95% empty.

# TRUE if a single cell is NA, NULL, or an empty string
is.empty <- function(x) {is.na(x) || is.null(x) || x==""};

# Count share of empty cells in each column, and get rid of almost empty columns

col.void <- sapply(1:ncol(train), function(i) {sum(sapply(train[,i], is.empty))/length(train[,i])});
train <- train[,col.void<0.95];
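
The same computation can be written in vectorized form, which avoids the nested sapply() and is faster on a dataset of this size; an equivalent sketch (empty.share is our own illustrative helper):

# Vectorized equivalent: share of NA or empty-string cells in each column
empty.share <- function(df) colMeans(is.na(df) | df == "");
# e.g. empty.share(data) reproduces col.void for the full dataset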

Now let’s exclude the columns that don’t contain accelerometer information (subjects’ names, timestamps, window indicators, etc.):

train <- train[,-c(1:6)];
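
Selecting by name rather than by position would be more robust to changes in column order; a hypothetical equivalent, using the service-column names seen in str() above:

# Name-based removal of the non-accelerometer columns (sketch)
service <- c("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window", "num_window");
train <- train[, !(names(train) %in% service)];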

Even now we have a lot of predictors: 52. This can make the training process expensive and may lead to overfitting. Let’s find out the variables’ importance.

First modelling

We build the first model using a random sample of 3,000 observations.

n <- 3000;

# random subsample for the first, exploratory model
train.first <- train[sample(1:dim(train)[1],n),];

# another random subsample, for a rough estimate of the first model's performance
# (it may overlap with train.first, so the estimate will be optimistic)
train.first.est <- train[sample(1:dim(train)[1],n),];

We use the Random Forest algorithm for our first model because it is a well-known, accurate, and effective modelling tool. This is not our final model, though, just a step to assess the variables’ importance.

First, we need to set the training parameters for this and the subsequent models. To make cross-validation a part of the training process, we set the resampling method to “cv” (the default is “boot”). This way we use the cross-validation built into the train() function.

We also set the number of folds to 5 (the default is 10) to reduce the execution time of train().

trc <- trainControl(method="cv", number=5);

Now we start the first modelling run. Note that we restrict the number of trees to 100.

start <- Sys.time();

modelFit1 <- train(classe ~ ., data=train.first, method="rf", trControl=trc, ntree=100);

Sys.time()-start;
## Time difference of 37.25813 secs

Let’s look at the first model’s performance on our random subsample of the train dataset:

confusionMatrix(train.first.est$classe, predict(modelFit1, train.first.est));
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 868   3   0   2   0
##          B  24 537   9   2   3
##          C   0  13 519   5   1
##          D   1   1   8 441   0
##          E   1   2   1  10 549
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9713         
##                  95% CI : (0.9647, 0.977)
##     No Information Rate : 0.298          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9636         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9709   0.9658   0.9665   0.9587   0.9928
## Specificity            0.9976   0.9845   0.9923   0.9961   0.9943
## Pos Pred Value         0.9943   0.9339   0.9647   0.9778   0.9751
## Neg Pred Value         0.9878   0.9922   0.9927   0.9925   0.9984
## Prevalence             0.2980   0.1853   0.1790   0.1533   0.1843
## Detection Rate         0.2893   0.1790   0.1730   0.1470   0.1830
## Detection Prevalence   0.2910   0.1917   0.1793   0.1503   0.1877
## Balanced Accuracy      0.9843   0.9751   0.9794   0.9774   0.9935

Not bad already as far as accuracy metrics go, but we estimated them only on a subsample of the training data. So let’s get back to the variables’ importance.

impvars<-varImp(modelFit1);

imp <- data.frame(var=rownames(impvars$importance), importance=impvars$importance$Overall);

imp <- imp[order(imp$importance, decreasing = TRUE),];

imp;
##                     var  importance
## 1             roll_belt 100.0000000
## 41        pitch_forearm  76.9102136
## 40         roll_forearm  45.1487376
## 3              yaw_belt  44.9610374
## 39    magnet_dumbbell_z  43.5153761
## 38    magnet_dumbbell_y  40.0576144
## 2            pitch_belt  34.1344251
## 27        roll_dumbbell  26.6290642
## 35     accel_dumbbell_y  19.9706637
## 12        magnet_belt_y  18.9809479
## 47      accel_forearm_x  18.8106593
## 52     magnet_forearm_z  16.1208100
## 13        magnet_belt_z  15.9641009
## 10         accel_belt_z  15.5034991
## 37    magnet_dumbbell_x  14.6794097
## 36     accel_dumbbell_z  12.1495164
## 30 total_accel_dumbbell  10.7129512
## 29         yaw_dumbbell  10.2065239
## 49      accel_forearm_z  10.0561835
## 32     gyros_dumbbell_y   9.6070191
## 24         magnet_arm_x   9.4316614
## 7          gyros_belt_z   9.0785522
## 51     magnet_forearm_y   8.9445846
## 16              yaw_arm   8.8008925
## 11        magnet_belt_x   8.3987527
## 34     accel_dumbbell_x   7.8879090
## 14             roll_arm   7.3139012
## 25         magnet_arm_y   6.8810186
## 50     magnet_forearm_x   5.9805572
## 26         magnet_arm_z   5.7664812
## 21          accel_arm_x   5.1397727
## 42          yaw_forearm   4.6488830
## 18          gyros_arm_x   4.5877776
## 28       pitch_dumbbell   3.8000713
## 48      accel_forearm_y   3.7373988
## 43  total_accel_forearm   3.5832432
## 19          gyros_arm_y   3.3851034
## 22          accel_arm_y   3.3839381
## 4      total_accel_belt   3.3611452
## 31     gyros_dumbbell_x   2.6793627
## 15            pitch_arm   2.3398495
## 8          accel_belt_x   1.9076277
## 45      gyros_forearm_y   1.6912214
## 23          accel_arm_z   1.4875487
## 33     gyros_dumbbell_z   1.2108121
## 17      total_accel_arm   1.0341835
## 9          accel_belt_y   0.9899474
## 46      gyros_forearm_z   0.9750620
## 6          gyros_belt_y   0.1559453
## 44      gyros_forearm_x   0.1456163
## 5          gyros_belt_x   0.1166406
## 20          gyros_arm_z   0.0000000
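
caret can also plot these importances directly, which makes the cut-off easier to eyeball; a one-line sketch:

# Dotplot of the 20 most important variables
plot(varImp(modelFit1), top = 20);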

Thus we have only two features with importance higher than 50%. Let’s plot them against each other and use color to distinguish the outcome classes.

g <- ggplot(train, aes_string(x=as.character(imp[1,1]), y=as.character(imp[2,1])));
g <- g + geom_point(size=3, alpha = 0.7, aes(color=classe));
print(g);

As we can see, there are no clear patterns here, though the figure gives us something to think about. We obviously cannot draw decision boundaries from this plot alone, so we must take other predictors into account. We keep the variables with importance > 2.

vip <- as.character(imp[imp$importance > 2, 1]);

This gives us a vector of 41 important variables.

train <- train[,c(vip,"classe")];
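
Note that we subset only the train dataset: predict() locates predictors by name, so the validation and test sets can keep their extra columns. A quick sanity check (a sketch):

# All selected predictors must also be present in the validation and test sets
stopifnot(all(vip %in% names(validation)), all(vip %in% names(test)));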

Main modelling

Now the “train” dataset is preprocessed and we are ready to start modelling. Given the constraints of this report, we will try only two algorithms commonly used for classification problems: random forest and gradient boosting.

Random forest

start <- Sys.time();

modelFit.rf <- train(classe ~ ., data=train, method="rf", trControl=trc, ntree=100);

Sys.time()-start;
## Time difference of 2.884148 mins

Let’s look at our new model’s parameters, particularly the Accuracy calculated during the built-in cross-validation (see the bottom of the next block). From it we can estimate the expected out-of-sample error, which is 1 - Accuracy.

modelFit.rf;
## Random Forest 
## 
## 13737 samples
##    41 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 10989, 10989, 10991, 10989, 10990 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9901001  0.9874752  0.002485364  0.003146727
##   21    0.9907554  0.9883059  0.002474300  0.003130277
##   41    0.9862422  0.9825975  0.003256557  0.004119349
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 21.
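
The expected out-of-sample error implied by the cross-validation can be read straight off the fit object; a minimal sketch:

# 1 minus the best cross-validated Accuracy, i.e. the expected out-of-sample error
1 - max(modelFit.rf$results$Accuracy);
# = 1 - 0.9907554, roughly 0.009, given the results above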

Now we can estimate the quality of our model by predicting on the validation dataset:

confusionMatrix(validation$classe, predict(modelFit.rf, validation));
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    3    0    0    0
##          B    5 1124    9    0    1
##          C    0    8 1016    2    0
##          D    0    0   13  950    1
##          E    0    0    2    1 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9898, 0.9944)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9903   0.9769   0.9969   0.9981
## Specificity            0.9993   0.9968   0.9979   0.9972   0.9994
## Pos Pred Value         0.9982   0.9868   0.9903   0.9855   0.9972
## Neg Pred Value         0.9988   0.9977   0.9951   0.9994   0.9996
## Prevalence             0.2848   0.1929   0.1767   0.1619   0.1837
## Detection Rate         0.2839   0.1910   0.1726   0.1614   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9982   0.9936   0.9874   0.9970   0.9988

Gradient boosting

start <- Sys.time();

gbmGrid <- expand.grid(.interaction.depth = 20, .n.trees = 50, .shrinkage = 0.1)
modelFit.gbm <- train(classe ~ ., data=train, method="gbm", trControl=trc, tuneGrid = gbmGrid);
Sys.time()-start;
## Time difference of 5.767263 mins
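
We fixed a single parameter combination here to keep the run time down. If we wanted train() to tune gbm itself, a wider grid could be supplied instead; a hypothetical sketch (note that newer caret versions also require n.minobsinnode in the grid):

# Hypothetical wider grid letting train() compare several gbm configurations
gbmGrid.tune <- expand.grid(.interaction.depth = c(5, 10, 20),
                            .n.trees = c(50, 100),
                            .shrinkage = 0.1);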

Let’s look at the model’s parameters, again focusing on Accuracy.

modelFit.gbm;
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    41 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## 
## Summary of sample sizes: 10990, 10989, 10991, 10989, 10989 
## 
## Resampling results
## 
##   Accuracy   Kappa      Accuracy SD  Kappa SD   
##   0.9851495  0.9812164  0.003360269  0.004252811
## 
## Tuning parameter 'n.trees' was held constant at a value of 50
## Tuning parameter
##  'interaction.depth' was held constant at a value of 20
## Tuning parameter 'shrinkage' was
##  held constant at a value of 0.1
## 

…and the second model’s predictions on the validation dataset:

confusionMatrix(validation$classe, predict(modelFit.gbm, validation));
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1667    6    1    0    0
##          B    7 1121   11    0    0
##          C    0   22  999    5    0
##          D    0    0   21  941    2
##          E    0    0    1    1 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9869          
##                  95% CI : (0.9837, 0.9897)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9834          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9958   0.9756   0.9671   0.9937   0.9982
## Specificity            0.9983   0.9962   0.9944   0.9953   0.9996
## Pos Pred Value         0.9958   0.9842   0.9737   0.9761   0.9982
## Neg Pred Value         0.9983   0.9941   0.9930   0.9988   0.9996
## Prevalence             0.2845   0.1952   0.1755   0.1609   0.1839
## Detection Rate         0.2833   0.1905   0.1698   0.1599   0.1835
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9971   0.9859   0.9808   0.9945   0.9989
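
Since both models were trained with the same 5-fold trainControl, their resampling results can also be compared side by side with caret’s resamples(); a sketch (strictly, caret recommends identical fold indices for this comparison):

# Side-by-side cross-validated accuracies of the two models
res <- resamples(list(rf = modelFit.rf, gbm = modelFit.gbm));
summary(res)$statistics$Accuracy;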

Conclusions

As we can see, Random Forest gives slightly higher Accuracy (0.9924 vs. 0.9869 on the validation set), and therefore a lower expected out-of-sample error, in this case. We will use modelFit.rf to predict the outcome on the test dataset.

answers<-as.character(predict(modelFit.rf,test));
answers;
##  [1] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A" "E" "E" "A" "B" "B" "B"

Now we write the answers to files for submission:

wd <- getwd();  # "wd" is used below but was never defined; we default to the working directory

pml_write_files = function(x) {
        n = length(x)
        for (i in 1:n) {
                # the target directory must already exist
                filename = paste0(wd, "/Practical_Machine_Learning/CP_answers/problem_id_", i, ".txt")
                write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
        }
}

pml_write_files(answers);

The task submission on the test data scored 20/20. We can conclude that the model modelFit.rf works well.