The objective of this project is to predict the manner in which 6 participants performed a set of exercises, using machine learning classification of accelerometer data from sensors on the belt, forearm, arm, and dumbbell. The outcome variable in the training data is "classe", which records how each repetition was performed: class A is the correct execution, while classes B through E correspond to common mistakes.
We will predict 20 different test cases with our model. The data for this project come from: http://groupware.les.inf.puc-rio.br/har.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(2017)
a <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv("pml-testing.csv")
dim(a)
## [1] 19622   160
dim(test)
## [1]  20 160
a <- a[, colMeans(is.na(a)) < 0.05]
dim(a)
## [1] 19622    60
nzv <- nearZeroVar(a, saveMetrics = FALSE)
a <- a[, -nzv]
dim(a)
## [1] 19622    59
colnames(a[1:5])
## [1] "X"                    "user_name"            "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"
a <- a[, -c(1:5)]
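The column filter above keeps only columns whose fraction of NA values is below 5%. A toy data frame (illustrative only, not the pml data) shows how it works:

```r
# Toy illustration of the NA-fraction filter:
# colMeans(is.na(d)) gives the share of NAs per column,
# and we keep columns under the 5% threshold.
d <- data.frame(
  x = 1:10,             # complete column -> kept
  y = c(rep(NA, 9), 1), # 90% NA -> dropped
  z = 10:1              # complete column -> kept
)
d <- d[, colMeans(is.na(d)) < 0.05]
colnames(d)
## [1] "x" "z"
```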
dim(a)
## [1] 19622    54
partition <- createDataPartition(a$classe, p = 0.7, list = FALSE)
trainData <- a[partition, ]
validData <- a[-partition, ]
We will use a random forest model, a widely used classifier that handles large numbers of correlated predictors and typically achieves high accuracy on this kind of problem.
fit <- train(classe ~ ., data = trainData, ntree = 100, method = 'rf',
                 trControl = trainControl(method = "cv", number = 10))
fit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 100, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.3%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B    6 2647    4    1    0 0.0041384500
## C    0    9 2386    1    0 0.0041736227
## D    0    2    8 2241    1 0.0048845471
## E    0    1    0    7 2517 0.0031683168
plot(fit$finalModel)
With an out-of-bag (OOB) error estimate of only 0.3% on the training data, we can move confidently to the validation set.
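As a sanity check, the OOB error rate printed above can be recovered by hand from the training confusion matrix (counts copied from the output):

```r
# Recompute the OOB error rate from the confusion-matrix counts above:
# error = 1 - (correct predictions on the diagonal) / (all predictions).
cm <- matrix(c(3905,    1,    0,    0,    0,
                  6, 2647,    4,    1,    0,
                  0,    9, 2386,    1,    0,
                  0,    2,    8, 2241,    1,
                  0,    1,    0,    7, 2517),
             nrow = 5, byrow = TRUE)
oob <- 1 - sum(diag(cm)) / sum(cm)
round(100 * oob, 2)   # error rate in percent
## [1] 0.3
```

Note that sum(cm) is 13737, exactly 70% of the 19622 training rows, matching the partition above.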
predictFit <- predict(fit, validData)
confusionMatrix(predictFit, validData$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    1    0    0    0
##          B    0 1136    4    0    0
##          C    0    2 1021    4    0
##          D    0    0    1  960    4
##          E    0    0    0    0 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9973          
##                  95% CI : (0.9956, 0.9984)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9966          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   0.9951   0.9959   0.9963
## Specificity            0.9998   0.9992   0.9988   0.9990   1.0000
## Pos Pred Value         0.9994   0.9965   0.9942   0.9948   1.0000
## Neg Pred Value         1.0000   0.9994   0.9990   0.9992   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1930   0.1735   0.1631   0.1832
## Detection Prevalence   0.2846   0.1937   0.1745   0.1640   0.1832
## Balanced Accuracy      0.9999   0.9983   0.9969   0.9974   0.9982
The estimated out-of-sample error is 1 - 0.9973 = 0.0027, or about 0.27%, in line with the OOB estimate. We are confident in the final model, so we apply it to the 20 test cases.
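The reported accuracy follows directly from the validation confusion matrix above: correct predictions are the diagonal entries, and there are 16 misclassifications in total.

```r
# Validation accuracy from the confusion-matrix counts above.
correct <- 1674 + 1136 + 1021 + 960 + 1078   # diagonal entries
total   <- correct + (1 + 4 + 2 + 4 + 1 + 4) # add the 16 errors
round(correct / total, 4)
## [1] 0.9973
```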
predictTest <- predict(fit, test)
print(predictTest)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E