Exercising is important, but to prevent injury and gain the maximum benefit, exercises must be performed correctly. In the past, the only way to ensure correct form was training and monitoring by an expert such as a physical therapist or personal trainer. A group of scientists (Velloso et al., 2013) became interested in using the fitness monitoring systems on the market to detect not only the quantity of exercise done by the wearer but also its quality.
Velloso’s team created a dataset of tracking data in which volunteers performed biceps curls both correctly and incorrectly (in four typical ways). In this report, we use practical machine learning methods to try to predict whether an exercise was done correctly or not.
training <- read.csv("pml-training.csv", stringsAsFactors = FALSE, na.strings = "")
test <- read.csv("pml-testing.csv", stringsAsFactors = FALSE, na.strings = "")
Two datasets were provided: one for training, with 19,622 observations of 160 variables, and one for testing, with 20 observations of 160 variables. These datasets can be found at:
training: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
testing: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
table(training$classe)
A B C D E
5580 3797 3422 3216 3607
The classe variable denotes which category the exercise falls in with Class A representing correct performance and the four incorrect methods being: “throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E)” (Velloso et al., 2013).
There is a good-sized sample of each of the classes.
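A quick check of the relative frequencies (a one-line sketch using base R's prop.table) makes the balance concrete: even the largest class, A, accounts for only about 28% of observations, so no class dominates.

```r
# Share of each class among the 19622 training observations
round(prop.table(table(training$classe)), 3)
```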
There are many variables measuring the acceleration, pitch, yaw, roll, and similar quantities from each motion sensor. On inspection, there are also quite a few variables with very few observations. Studying the dataset, we found that whenever the new_window variable was yes, extra columns contained what appear to be summary statistics for the observations while that window was open.
table_new <- table(training$new_window)
table_new
no yes
19216 406
These variables are populated in only 406 of the 19,622 observations, so they are unlikely to add significantly to any model. The following code finds variables that only have values on new-window rows and eliminates them from the dataset. Extra time formats and window information were also removed from the data.
training2 <- training[training$new_window == "no", ]
extracol <- 1   # column 1 is the row index X
for (i in 1:ncol(training2)) {
  if (sum(is.na(training2[, i])) == nrow(training2)) {
    extracol <- c(extracol, i)          # column is entirely NA
  } else if (sum(training2[, i] == "NA") == nrow(training2)) {
    extracol <- c(extracol, i)          # column is entirely the string "NA"
  }
}
extracol <- c(extracol, 3, 4, 6, 7)     # timestamp and window columns
training <- training[, -extracol]
After removing these extraneous variables, the data was checked for near-zero-variance (NZV) variables, but none were found.
library(caret)   # provides nearZeroVar() and createDataPartition()
nearZeroVar(training)
integer(0)
Since the data was read in with stringsAsFactors = FALSE, the user name and class were converted to factor variables:
training$user_name <- factor(training$user_name)
training$classe <- factor(training$classe)
Last, we split the data into a training (70%) and validation (30%) set to be able to check the accuracy of our models and estimate the out-of-sample error rate. User name and time were not used as predictors in these models, so they were removed from the datasets.
set.seed(77463)
inTrain <- createDataPartition(y=training$classe,
p=0.7, list=FALSE)
train <- training[inTrain,-c(1,2) ]
validate <- training[-inTrain,-c(1,2) ]
We were left with 52 variables to use in predicting class.
We first tried to cut down the number of variables to prevent overfitting. To find variables highly correlated with class while avoiding variables highly correlated with each other, we used a forward-selection linear regression model to select the 20 most useful variables. Please see the R Markdown source for this document at https://github.com/jgrantier/practicalmachinelearning for the full forward-selection code.
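The repository code is not reproduced here, but a minimal sketch of one way such a greedy forward-selection loop could look is shown below. The names train_df and classe_int match the lm call in the summary that follows; the loop itself is our assumption about the procedure, not the authors' exact code.

```r
# Greedy forward selection: at each step, add the candidate variable
# that most improves R-squared of a linear model on the class label.
train_df <- train
train_df$classe_int <- as.integer(train_df$classe)  # lm needs a numeric response
candidates <- setdiff(names(train_df), c("classe", "classe_int"))
selected <- character(0)

for (step in 1:20) {
  r2 <- sapply(candidates, function(v) {
    f <- reformulate(c(selected, v), response = "classe_int")
    summary(lm(f, data = train_df))$r.squared
  })
  best <- names(which.max(r2))
  selected <- c(selected, best)
  candidates <- setdiff(candidates, best)
}
selected   # the variables that end up in mod20
```

Treating a categorical label as an integer is crude, but here it is only a cheap screening device for picking a variable subset, not the final model.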
summary(mod20)
Call:
lm(formula = classe_int ~ pitch_forearm + magnet_belt_y + total_accel_forearm +
magnet_arm_x + total_accel_dumbbell + accel_belt_y + total_accel_belt +
pitch_belt + magnet_dumbbell_z + accel_forearm_z + magnet_dumbbell_x +
roll_belt + accel_dumbbell_x + yaw_dumbbell + accel_arm_z +
magnet_arm_y + accel_arm_x + magnet_forearm_z + gyros_arm_x,
data = train_df)
Residuals:
Min 1Q Median 3Q Max
-3.3676 -0.7092 -0.0328 0.6686 4.5887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.364e+00 2.008e-01 26.71 <2e-16 ***
pitch_forearm 1.417e-02 3.935e-04 36.00 <2e-16 ***
magnet_belt_y -1.005e-02 3.235e-04 -31.08 <2e-16 ***
total_accel_forearm 2.632e-02 9.885e-04 26.62 <2e-16 ***
magnet_arm_x -9.404e-04 7.462e-05 -12.60 <2e-16 ***
total_accel_dumbbell 2.985e-02 1.929e-03 15.47 <2e-16 ***
accel_belt_y -3.073e-02 1.385e-03 -22.19 <2e-16 ***
total_accel_belt 8.856e-02 6.996e-03 12.66 <2e-16 ***
pitch_belt 4.913e-02 9.821e-04 50.03 <2e-16 ***
magnet_dumbbell_z 9.687e-03 1.719e-04 56.36 <2e-16 ***
accel_forearm_z -6.630e-03 1.345e-04 -49.30 <2e-16 ***
magnet_dumbbell_x -2.941e-03 9.736e-05 -30.21 <2e-16 ***
roll_belt 1.623e-02 1.105e-03 14.68 <2e-16 ***
accel_dumbbell_x 6.148e-03 3.053e-04 20.14 <2e-16 ***
yaw_dumbbell -3.975e-03 1.960e-04 -20.29 <2e-16 ***
accel_arm_z 3.997e-03 1.405e-04 28.45 <2e-16 ***
magnet_arm_y -2.637e-03 1.383e-04 -19.07 <2e-16 ***
accel_arm_x 2.616e-03 2.144e-04 12.20 <2e-16 ***
magnet_forearm_z 5.016e-04 3.640e-05 13.78 <2e-16 ***
gyros_arm_x 4.677e-02 4.729e-03 9.89 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.047 on 13717 degrees of freedom
Multiple R-squared: 0.4971, Adjusted R-squared: 0.4964
F-statistic: 713.6 on 19 and 13717 DF, p-value: < 2.2e-16
train3 <- train[,which(names(train) %in% c(names(mod20[[1]]),"classe"))]
This model explains only 49.71 percent of the variance in the training set, probably because the relationship between class and the most important predictors is not linear. Still, we kept this smaller variable set to test with other models: it cuts down on variables that correlate with each other and, if sufficient, may run faster than the full set.
Since the poor performance of the linear model suggests the relationship is not linear, we next tried a single classification tree (rpart).
modFit_rt <- train(classe ~ .,method="rpart",data=train)
predictions_rt <- predict(modFit_rt,newdata=validate)
conMat_rt <- confusionMatrix(predictions_rt,validate$classe)
conMat_rt
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1494 478 501 410 151
B 33 399 45 185 145
C 120 262 480 369 295
D 0 0 0 0 0
E 27 0 0 0 491
Overall Statistics
Accuracy : 0.4867
95% CI : (0.4738, 0.4995)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3293
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.8925 0.3503 0.46784 0.0000 0.45379
Specificity 0.6343 0.9140 0.78473 1.0000 0.99438
Pos Pred Value 0.4924 0.4944 0.31455 NaN 0.94788
Neg Pred Value 0.9369 0.8543 0.87474 0.8362 0.88988
Prevalence 0.2845 0.1935 0.17434 0.1638 0.18386
Detection Rate 0.2539 0.0678 0.08156 0.0000 0.08343
Detection Prevalence 0.5155 0.1371 0.25930 0.0000 0.08802
Balanced Accuracy 0.7634 0.6322 0.62628 0.5000 0.72408
plot(predictions_rt,validate$classe, main="Predictions for Random Tree Model", xlab = "Predictions", ylab = "Actual from Validation Set")
This model also did quite poorly: with the full set of 52 variables it reached only 51.56 percent accuracy on the training set and 48.67 percent on the validation set.
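One way to see why is to print the fitted tree itself (using the modFit_rt object from above). Consistent with the confusion matrix, where class D has a detection prevalence of zero, no terminal node of the tree ever predicts D.

```r
# Print the splits and leaf predictions of the fitted rpart tree;
# note that none of the leaves predicts class D
print(modFit_rt$finalModel)
```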
Since the simple models were not adequate for modeling the data or making predictions, we turned to more complex techniques: boosted trees (caret's gbm) and random forest (randomForest).
library(randomForest)
modFit_rf <- randomForest(classe ~ ., data = train)
predictions_rf.val <- predict(modFit_rf,newdata=validate)
conMat_rf <- confusionMatrix(predictions_rf.val,validate$classe)
conMat_rf
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1674 6 0 0 0
B 0 1132 4 0 0
C 0 1 1021 7 0
D 0 0 1 957 0
E 0 0 0 0 1082
Overall Statistics
Accuracy : 0.9968
95% CI : (0.995, 0.9981)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9959
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 1.0000 0.9939 0.9951 0.9927 1.0000
Specificity 0.9986 0.9992 0.9984 0.9998 1.0000
Pos Pred Value 0.9964 0.9965 0.9922 0.9990 1.0000
Neg Pred Value 1.0000 0.9985 0.9990 0.9986 1.0000
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2845 0.1924 0.1735 0.1626 0.1839
Detection Prevalence 0.2855 0.1930 0.1749 0.1628 0.1839
Balanced Accuracy 0.9993 0.9965 0.9967 0.9963 1.0000
modFit_rf_small <- randomForest(classe ~., data=train3)
predictions_rf_small.val <- predict(modFit_rf_small,newdata=validate)
conMat_rf_small <- confusionMatrix(predictions_rf_small.val,validate$classe)
conMat_rf_small
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1666 9 0 0 2
B 2 1126 11 0 1
C 4 3 1012 14 3
D 1 0 3 950 1
E 1 1 0 0 1075
Overall Statistics
Accuracy : 0.9905
95% CI : (0.9877, 0.9928)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.988
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9952 0.9886 0.9864 0.9855 0.9935
Specificity 0.9974 0.9971 0.9951 0.9990 0.9996
Pos Pred Value 0.9934 0.9877 0.9768 0.9948 0.9981
Neg Pred Value 0.9981 0.9973 0.9971 0.9972 0.9985
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2831 0.1913 0.1720 0.1614 0.1827
Detection Prevalence 0.2850 0.1937 0.1760 0.1623 0.1830
Balanced Accuracy 0.9963 0.9928 0.9907 0.9922 0.9966
modFit_gbm_small <- train(classe ~ .,method="gbm",data=train3, verbose = FALSE)
predictions_gbm_small.val <- predict(modFit_gbm_small,newdata=validate)
conMat_gbm_small <- confusionMatrix(predictions_gbm_small.val,validate$classe)
conMat_gbm_small
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 1616 58 6 8 11
B 24 1027 53 7 27
C 21 32 946 57 12
D 11 14 18 891 17
E 2 8 3 1 1015
Overall Statistics
Accuracy : 0.9337
95% CI : (0.9271, 0.94)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9161
Mcnemar's Test P-Value : 2.227e-14
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9654 0.9017 0.9220 0.9243 0.9381
Specificity 0.9803 0.9766 0.9749 0.9878 0.9971
Pos Pred Value 0.9511 0.9025 0.8858 0.9369 0.9864
Neg Pred Value 0.9861 0.9764 0.9834 0.9852 0.9862
Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
Detection Rate 0.2746 0.1745 0.1607 0.1514 0.1725
Detection Prevalence 0.2887 0.1934 0.1815 0.1616 0.1749
Balanced Accuracy 0.9728 0.9391 0.9485 0.9560 0.9676
The random forest model did much better predicting the validation set, with 99.68 percent accuracy for the full set of variables and 99.05 percent for the set of 20 variables. These numbers are quite similar, suggesting the larger model is not badly overfitting and that the smaller set should be sufficient. Due to computing limitations, the boosted model was run only on the smaller dataset; it did not do quite as well, at 93.37 percent accuracy for the set of 20 variables.
test <- test[,-c(1,2,extracol)]
predictions_rf.test <- predict(modFit_rf,newdata=test)
The best model for this data turned out to be the random forest model with all 52 variables. The expected out-of-sample error was estimated from the validation sample to be 0.32 percent. This model predicted the small testing set from the study with 100% accuracy (according to the Coursera Practical Machine Learning exam).
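The error estimate above is simply one minus the validation accuracy, which can be pulled directly from the confusionMatrix object computed earlier:

```r
# Estimated out-of-sample error from the held-out validation set
oos_error <- 1 - as.numeric(conMat_rf$overall["Accuracy"])
round(100 * oos_error, 2)   # about 0.32 percent
```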
plot(modFit_rf, main = "Final Random Forest Model", sub="Number of Trees vs Error")
plot(predictions_rf.val,validate$classe, main = "Predictions for Final Random Forest Model", xlab = "Predictions", ylab = "Actual from Validation Set")
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf (see also http://groupware.les.inf.puc-rio.br/har)
Coursera Practical Machine Learning Class with Jeff Leek, Roger Peng and Brian Caffo, Johns Hopkins University https://www.coursera.org/learn/practical-machine-learning