Introduction

Exercising is important, but to prevent injuries and get the maximum benefit, it is equally important to perform the exercises correctly. In the past, the only way to ensure correct performance of moves was training and monitoring by an expert such as a physical therapist or personal trainer. A group of researchers (Velloso et al., 2013) became interested in using the fitness-monitoring devices on the market to detect not only the quantity of exercise done by the wearer but also its quality.

Velloso’s team created a dataset of tracking data in which volunteers performed biceps curls both correctly and incorrectly (in four typical ways). In this paper, we apply practical machine learning methods to predict whether an exercise was performed correctly or not.

Exploratory Data Analysis and Processing

library(caret)          # createDataPartition, nearZeroVar, train, confusionMatrix
library(randomForest)   # randomForest
training <- read.csv("pml-training.csv", stringsAsFactors = FALSE, na.strings = "")
test <- read.csv("pml-testing.csv", stringsAsFactors = FALSE, na.strings = "")

Two datasets were provided: a training set with 19622 observations of 160 variables, and a testing set with 20 observations of 160 variables. These datasets can be found at:

training: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

testing: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

table(training$classe)

   A    B    C    D    E 
5580 3797 3422 3216 3607 

The classe variable denotes which category the exercise falls in with Class A representing correct performance and the four incorrect methods being: “throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E)” (Velloso et al., 2013).

There is a good-sized sample of each of the five classes.

Many variables measure the acceleration, pitch, yaw, roll, and similar quantities from each motion sensor. On inspection, there are also quite a few variables with very few observations. Studying the dataset, we found that whenever the new_window variable was yes, extra columns were populated with what appear to be summary statistics for the observations taken while that window was open.

table_new <- table(training$new_window)
table_new

   no   yes 
19216   406 

These variables have values in only 406 of the 19622 observations, so they are unlikely to add much to any model. The following code finds the variables that are populated only on new windows and eliminates them from the dataset; redundant time formats and the window bookkeeping columns were also dropped.

training2 <- training[training$new_window == "no", ]
extracol <- 1                        # column 1 is just the row index
for (i in 1:ncol(training2)) {
      col <- training2[, i]
      # flag columns that are entirely NA or entirely the literal text "NA"
      if (all(is.na(col)) || all(!is.na(col) & col == "NA")) {
            extracol <- c(extracol, i)
      }
}
extracol <- c(extracol, 3, 4, 6, 7)  # raw timestamps and window columns
training <- training[, -extracol]
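The same column screen can be written without an explicit loop. A minimal vectorized sketch, using a toy data frame with hypothetical column names (not the real sensor variables):

```r
# Vectorized version of the all-NA / all-"NA" column check, demonstrated
# on a toy data frame standing in for the sensor data.
toy <- data.frame(good = 1:4,
                  all_na = rep(NA, 4),
                  na_text = rep("NA", 4),
                  stringsAsFactors = FALSE)
is_empty <- sapply(toy, function(col)
      all(is.na(col)) || all(!is.na(col) & col == "NA"))
names(toy)[is_empty]  # the columns that would be dropped
```

Applied to the real data, `which(is_empty)` would give the column indices to remove.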

After removing these extraneous variables, the data was checked for any near-zero-variance (NZV) variables, but none were found.

nearZeroVar(training)
integer(0)
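For readers without caret handy, the idea behind the NZV check can be sketched in base R: flag columns whose most common value dominates and which take few distinct values. A toy illustration (the thresholds mirror caret's documented defaults of freqCut = 19 and uniqueCut = 10; column names are hypothetical):

```r
# Base-R sketch of a near-zero-variance check, on toy data.
set.seed(42)
toy <- data.frame(varied = rnorm(100),   # well-behaved predictor
                  constant = rep(0, 100)) # degenerate predictor
nzv <- sapply(toy, function(col) {
      tab <- sort(table(col), decreasing = TRUE)
      freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
      pct_unique <- 100 * length(tab) / length(col)
      freq_ratio > 19 && pct_unique < 10
})
names(toy)[nzv]  # columns a nearZeroVar-style filter would flag
```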

Since the data was read in with no factors, the user name and class were converted to factor variables.

training$user_name <- factor(training$user_name)
training$classe <- factor(training$classe)

Last, we split the data into a training (70%) and validation (30%) set to be able to check the accuracy of our models and estimate the out-of-sample error rate. User name and time were not used as predictors in these models, so they were removed from the datasets.

set.seed(77463) 
inTrain <- createDataPartition(y=training$classe,
                               p=0.7, list=FALSE)
train <- training[inTrain,-c(1,2) ]

validate <- training[-inTrain,-c(1,2) ]

We were left with 52 variables to use in predicting class.
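caret's createDataPartition draws the 70/30 split stratified by class, so each classe level keeps roughly the same proportion in both sets. A base-R sketch of the same idea, on a toy factor rather than the real data:

```r
# Stratified 70/30 split without caret: sample 70% of the indices
# within each class level separately (toy class sizes 50/30/20).
set.seed(77463)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
in_train <- unlist(lapply(split(seq_along(classe), classe),
                          function(idx) sample(idx, round(0.7 * length(idx)))))
length(in_train)            # 70% of the 100 toy observations
table(classe[in_train])     # class proportions are preserved
```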

Modeling

Linear Model

At first we tried to cut down the number of variables to prevent overfitting. To find variables strongly associated with class while avoiding predictors that were highly correlated with each other, we used a forward-selection linear regression model to select the 20 most useful variables. Please see the R Markdown source for this document at https://github.com/jgrantier/practicalmachinelearning for the full forward-selection code.

summary(mod20)

Call:
lm(formula = classe_int ~ pitch_forearm + magnet_belt_y + total_accel_forearm + 
    magnet_arm_x + total_accel_dumbbell + accel_belt_y + total_accel_belt + 
    pitch_belt + magnet_dumbbell_z + accel_forearm_z + magnet_dumbbell_x + 
    roll_belt + accel_dumbbell_x + yaw_dumbbell + accel_arm_z + 
    magnet_arm_y + accel_arm_x + magnet_forearm_z + gyros_arm_x, 
    data = train_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3676 -0.7092 -0.0328  0.6686  4.5887 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           5.364e+00  2.008e-01   26.71   <2e-16 ***
pitch_forearm         1.417e-02  3.935e-04   36.00   <2e-16 ***
magnet_belt_y        -1.005e-02  3.235e-04  -31.08   <2e-16 ***
total_accel_forearm   2.632e-02  9.885e-04   26.62   <2e-16 ***
magnet_arm_x         -9.404e-04  7.462e-05  -12.60   <2e-16 ***
total_accel_dumbbell  2.985e-02  1.929e-03   15.47   <2e-16 ***
accel_belt_y         -3.073e-02  1.385e-03  -22.19   <2e-16 ***
total_accel_belt      8.856e-02  6.996e-03   12.66   <2e-16 ***
pitch_belt            4.913e-02  9.821e-04   50.03   <2e-16 ***
magnet_dumbbell_z     9.687e-03  1.719e-04   56.36   <2e-16 ***
accel_forearm_z      -6.630e-03  1.345e-04  -49.30   <2e-16 ***
magnet_dumbbell_x    -2.941e-03  9.736e-05  -30.21   <2e-16 ***
roll_belt             1.623e-02  1.105e-03   14.68   <2e-16 ***
accel_dumbbell_x      6.148e-03  3.053e-04   20.14   <2e-16 ***
yaw_dumbbell         -3.975e-03  1.960e-04  -20.29   <2e-16 ***
accel_arm_z           3.997e-03  1.405e-04   28.45   <2e-16 ***
magnet_arm_y         -2.637e-03  1.383e-04  -19.07   <2e-16 ***
accel_arm_x           2.616e-03  2.144e-04   12.20   <2e-16 ***
magnet_forearm_z      5.016e-04  3.640e-05   13.78   <2e-16 ***
gyros_arm_x           4.677e-02  4.729e-03    9.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.047 on 13717 degrees of freedom
Multiple R-squared:  0.4971,    Adjusted R-squared:  0.4964 
F-statistic: 713.6 on 19 and 13717 DF,  p-value: < 2.2e-16
train3 <- train[,which(names(train) %in% c(names(mod20[[1]]),"classe"))]

This model explains only 49.71 percent of the variance in the training set, probably because the relationship between class and the most important predictors is not linear. We nevertheless kept this smaller dataset to test with other models: it removes variables that are correlated with each other and, if sufficient for prediction, should run faster than the full set.

Classification Tree

Since the poor performance of the linear model suggests the relationship is not linear, we next fit a single classification tree (rpart).

modFit_rt <- train(classe ~ .,method="rpart",data=train)
predictions_rt <- predict(modFit_rt,newdata=validate)
conMat_rt <- confusionMatrix(predictions_rt,validate$classe)
conMat_rt
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1494  478  501  410  151
         B   33  399   45  185  145
         C  120  262  480  369  295
         D    0    0    0    0    0
         E   27    0    0    0  491

Overall Statistics
                                          
               Accuracy : 0.4867          
                 95% CI : (0.4738, 0.4995)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3293          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8925   0.3503  0.46784   0.0000  0.45379
Specificity            0.6343   0.9140  0.78473   1.0000  0.99438
Pos Pred Value         0.4924   0.4944  0.31455      NaN  0.94788
Neg Pred Value         0.9369   0.8543  0.87474   0.8362  0.88988
Prevalence             0.2845   0.1935  0.17434   0.1638  0.18386
Detection Rate         0.2539   0.0678  0.08156   0.0000  0.08343
Detection Prevalence   0.5155   0.1371  0.25930   0.0000  0.08802
Balanced Accuracy      0.7634   0.6322  0.62628   0.5000  0.72408
plot(predictions_rt, validate$classe, main = "Predictions for Classification Tree Model", xlab = "Predictions", ylab = "Actual from Validation Set")

This model also performed quite poorly, with an accuracy of only 51.56 percent on the training set and 48.67 percent on the validation set, despite using the entire set of 52 variables.

Boosting and Random Forests

Since the simple models were not adequate for modeling the data or making predictions, we turned to more complex techniques: boosting with trees (caret's gbm) and a random forest (randomForest).

modFit_rf <- randomForest(classe ~., data=train)
predictions_rf.val <- predict(modFit_rf,newdata=validate)
conMat_rf <- confusionMatrix(predictions_rf.val,validate$classe)
conMat_rf
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1674    6    0    0    0
         B    0 1132    4    0    0
         C    0    1 1021    7    0
         D    0    0    1  957    0
         E    0    0    0    0 1082

Overall Statistics
                                         
               Accuracy : 0.9968         
                 95% CI : (0.995, 0.9981)
    No Information Rate : 0.2845         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9959         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   0.9939   0.9951   0.9927   1.0000
Specificity            0.9986   0.9992   0.9984   0.9998   1.0000
Pos Pred Value         0.9964   0.9965   0.9922   0.9990   1.0000
Neg Pred Value         1.0000   0.9985   0.9990   0.9986   1.0000
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2845   0.1924   0.1735   0.1626   0.1839
Detection Prevalence   0.2855   0.1930   0.1749   0.1628   0.1839
Balanced Accuracy      0.9993   0.9965   0.9967   0.9963   1.0000
modFit_rf_small <- randomForest(classe ~., data=train3)
predictions_rf_small.val <- predict(modFit_rf_small,newdata=validate)
conMat_rf_small <- confusionMatrix(predictions_rf_small.val,validate$classe)
conMat_rf_small
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1666    9    0    0    2
         B    2 1126   11    0    1
         C    4    3 1012   14    3
         D    1    0    3  950    1
         E    1    1    0    0 1075

Overall Statistics
                                          
               Accuracy : 0.9905          
                 95% CI : (0.9877, 0.9928)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.988           
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9952   0.9886   0.9864   0.9855   0.9935
Specificity            0.9974   0.9971   0.9951   0.9990   0.9996
Pos Pred Value         0.9934   0.9877   0.9768   0.9948   0.9981
Neg Pred Value         0.9981   0.9973   0.9971   0.9972   0.9985
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2831   0.1913   0.1720   0.1614   0.1827
Detection Prevalence   0.2850   0.1937   0.1760   0.1623   0.1830
Balanced Accuracy      0.9963   0.9928   0.9907   0.9922   0.9966
modFit_gbm_small <- train(classe ~ .,method="gbm",data=train3, verbose = FALSE)
predictions_gbm_small.val <- predict(modFit_gbm_small,newdata=validate)
conMat_gbm_small <- confusionMatrix(predictions_gbm_small.val,validate$classe)
conMat_gbm_small
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1616   58    6    8   11
         B   24 1027   53    7   27
         C   21   32  946   57   12
         D   11   14   18  891   17
         E    2    8    3    1 1015

Overall Statistics
                                        
               Accuracy : 0.9337        
                 95% CI : (0.9271, 0.94)
    No Information Rate : 0.2845        
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.9161        
 Mcnemar's Test P-Value : 2.227e-14     

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9654   0.9017   0.9220   0.9243   0.9381
Specificity            0.9803   0.9766   0.9749   0.9878   0.9971
Pos Pred Value         0.9511   0.9025   0.8858   0.9369   0.9864
Neg Pred Value         0.9861   0.9764   0.9834   0.9852   0.9862
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2746   0.1745   0.1607   0.1514   0.1725
Detection Prevalence   0.2887   0.1934   0.1815   0.1616   0.1749
Balanced Accuracy      0.9728   0.9391   0.9485   0.9560   0.9676

The random forest model did much better on the validation set, with 99.68 percent accuracy for the full set of variables and 99.05 percent for the set of 20 variables. These numbers are quite similar, suggesting the larger model is not badly overfit, and that the smaller set should be sufficient. Due to computing limitations, the boosted model was run only on the smaller dataset; it did not do quite as well, at 93.37 percent accuracy for the set of 20 variables.

Conclusions

test <- test[,-c(1,2,extracol)]
predictions_rf.test <- predict(modFit_rf,newdata=test)

The best model for this data turned out to be the random forest with all 52 variables. The expected out-of-sample error, estimated from the validation sample, is 0.32 percent (one minus the 99.68 percent validation accuracy). This model predicted the small testing set from the study with 100% accuracy (according to the Coursera Practical Machine Learning exam).
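The out-of-sample error estimate is simply one minus the validation accuracy, which can be read straight off a confusion matrix. A sketch of the arithmetic on a toy 2x2 matrix (toy counts, not the real conMat_rf):

```r
# Accuracy = correct predictions (the diagonal) over total observations;
# estimated out-of-sample error = 1 - accuracy. Toy counts for illustration.
cm <- matrix(c(50, 1, 2, 47), nrow = 2,
             dimnames = list(Prediction = c("A", "B"),
                             Reference  = c("A", "B")))
accuracy <- sum(diag(cm)) / sum(cm)
oos_error <- 1 - accuracy
round(100 * oos_error, 2)  # error as a percentage
```

Applying the same computation to the validation confusion matrix above (accuracy 0.9968) gives the 0.32 percent estimate.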

plot(modFit_rf, main = "Final Random Forest Model", sub="Number of Trees vs Error")

plot(predictions_rf.val,validate$classe, main = "Predictions for Final Random Forest Model", xlab = "Predictions", ylab = "Actual from Validation Set")

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf

Coursera Practical Machine Learning Class with Jeff Leek, Roger Peng and Brian Caffo, Johns Hopkins University https://www.coursera.org/learn/practical-machine-learning