The goal of this project is to predict the manner in which 6 participants performed a weight-lifting exercise, using machine-learning classification of accelerometer data from sensors on the belt, forearm, arm, and dumbbell. The outcome variable in the training data is “classe”, which labels each execution as correct (category A) or as one of four common mistakes (categories B through E).
Our final model will be used to predict 20 test cases. The data for this project come from http://groupware.les.inf.puc-rio.br/har.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(2017)
a <- read.csv("pml-training.csv",na.strings=c("NA","#DIV/0!", "") )
test <- read.csv("pml-testing.csv")
dim(a)
## [1] 19622 160
dim(test)
## [1] 20 160
a <- a[, colMeans(is.na(a)) < 0.05]  # keep only columns that are less than 5% NA
dim(a)
## [1] 19622 60
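To make the NA filter above concrete, here is a minimal sketch on a toy data frame (the names `d` and `d_clean` are illustrative only):

```r
# colMeans(is.na(.)) gives the fraction of NAs in each column;
# columns that are mostly NA fail the < 0.05 threshold and are dropped
d <- data.frame(x = 1:10, y = c(rep(NA, 9), 1), z = 10:1)
d_clean <- d[, colMeans(is.na(d)) < 0.05]
names(d_clean)  # "x" "z"
```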
nzv <- nearZeroVar(a, saveMetrics = FALSE)  # indices of near-zero-variance predictors
a <- a[, -nzv]
dim(a)
## [1] 19622 59
colnames(a[1:5])
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"
a <- a[, -c(1:5)]  # drop row index, user name, and timestamp columns (not predictive)
dim(a)
## [1] 19622 54
partition <- createDataPartition(a$classe, p = 0.7, list = FALSE)  # stratified 70/30 split
trainData <- a[partition, ]
validData <- a[-partition, ]
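createDataPartition performs a stratified split, preserving the class proportions in both sets; a quick self-contained sketch using the built-in iris data:

```r
library(caret)
set.seed(1)
# 70/30 split stratified on the outcome: each of the 3 species has 50 rows,
# so the training index contains 0.7 * 50 = 35 rows per class
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_toy <- iris[idx, ]
valid_toy <- iris[-idx, ]
table(train_toy$Species)  # 35 of each species: classes stay balanced
```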
We fit a random forest, a widely used ensemble method that typically achieves high accuracy on this kind of multiclass sensor data, and use 10-fold cross-validation within the training set to tune the model and estimate its out-of-sample performance.
fit <- train(classe ~ ., data = trainData, ntree = 100, method = 'rf',
trControl = trainControl(method = "cv", number = 10))
fit$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.3%
## Confusion matrix:
## A B C D E class.error
## A 3905 1 0 0 0 0.0002560164
## B 6 2647 4 1 0 0.0041384500
## C 0 9 2386 1 0 0.0041736227
## D 0 2 8 2241 1 0.0048845471
## E 0 1 0 7 2517 0.0031683168
plot(fit$finalModel)  # error rate versus number of trees
With an out-of-bag (OOB) error estimate of 0.3%, we can move confidently to the held-out validation set.
predictFit <- predict(fit, validData)
confusionMatrix(predictFit, validData$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 1 0 0 0
## B 0 1136 4 0 0
## C 0 2 1021 4 0
## D 0 0 1 960 4
## E 0 0 0 0 1078
##
## Overall Statistics
##
## Accuracy : 0.9973
## 95% CI : (0.9956, 0.9984)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9966
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9951 0.9959 0.9963
## Specificity 0.9998 0.9992 0.9988 0.9990 1.0000
## Pos Pred Value 0.9994 0.9965 0.9942 0.9948 1.0000
## Neg Pred Value 1.0000 0.9994 0.9990 0.9992 0.9992
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1930 0.1735 0.1631 0.1832
## Detection Prevalence 0.2846 0.1937 0.1745 0.1640 0.1832
## Balanced Accuracy 0.9999 0.9983 0.9969 0.9974 0.9982
The estimated out-of-sample error is 1 - 0.9973 = 0.0027 (about 0.27%), consistent with the OOB estimate, so we are confident in the final model.
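Rather than reading accuracy off the printout, the same error can be computed directly from the confusionMatrix object (a sketch reusing `predictFit` and `validData` from above; `cm` and `out_of_sample_error` are illustrative names):

```r
cm <- confusionMatrix(predictFit, validData$classe)
# overall accuracy is stored in the named "overall" vector
out_of_sample_error <- 1 - as.numeric(cm$overall["Accuracy"])
round(out_of_sample_error, 4)  # 0.0027, matching 1 - 0.9973 above
```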
predictTest <- predict(fit, test)
print(predictTest)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E