The goal of this project is to predict the manner in which 6 participants performed a weight-lifting exercise, using machine-learning classification of accelerometer data from sensors on the belt, forearm, arm, and dumbbell. The outcome variable in the training data is “classe”, which labels each execution as correct (category A) or as one of four common mistakes (categories B through E).
Our final model will be used to predict 20 test cases. The data for this project come from http://groupware.les.inf.puc-rio.br/har.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(2017)
a <- read.csv("pml-training.csv",na.strings=c("NA","#DIV/0!", "") )
test <- read.csv("pml-testing.csv")
dim(a)
## [1] 19622 160
dim(test)
## [1] 20 160
a <- a[, colMeans(is.na(a)) < 0.05]  # keep only columns that are less than 5% NA
dim(a)
## [1] 19622 60
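To make the NA filter above concrete, here is a minimal sketch on a toy data frame (the names `d` and `d_clean` are illustrative only):

```r
# colMeans(is.na(.)) gives the fraction of NAs in each column;
# columns that are mostly NA fail the < 0.05 threshold and are dropped
d <- data.frame(x = 1:10, y = c(rep(NA, 9), 1), z = 10:1)
d_clean <- d[, colMeans(is.na(d)) < 0.05]
names(d_clean)  # "x" "z"
```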
nzv <- nearZeroVar(a, saveMetrics = FALSE)  # indices of near-zero-variance predictors
a <- a[, -nzv]
dim(a)
## [1] 19622 59
colnames(a[1:5])
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"
a <- a[, -c(1:5)]  # drop row index, user name, and timestamp columns (not predictive)
dim(a)
## [1] 19622 54
partition <- createDataPartition(a$classe, p = 0.7, list = FALSE)  # stratified 70/30 split
trainData <- a[partition, ]
validData <- a[-partition, ]
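createDataPartition performs a stratified split, preserving the class proportions in both sets; a quick self-contained sketch using the built-in iris data:

```r
library(caret)
set.seed(1)
# 70/30 split stratified on the outcome: each of the 3 species has 50 rows,
# so the training index contains 0.7 * 50 = 35 rows per class
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_toy <- iris[idx, ]
valid_toy <- iris[-idx, ]
table(train_toy$Species)  # 35 of each species: classes stay balanced
```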
We fit a random forest, a widely used ensemble method that typically achieves high accuracy on this kind of multiclass sensor data, and use 10-fold cross-validation within the training set to tune the model and estimate its out-of-sample performance.
fit <- train(classe ~ ., data = trainData, ntree = 100, method = 'rf',
trControl = trainControl(method = "cv", number = 10))
fit$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.3%
## Confusion matrix:
## A B C D E class.error
## A 3905 1 0 0 0 0.0002560164
## B 6 2647 4 1 0 0.0041384500
## C 0 9 2386 1 0 0.0041736227
## D 0 2 8 2241 1 0.0048845471
## E 0 1 0 7 2517 0.0031683168
plot(fit$finalModel)  # error rate versus number of trees
With an out-of-bag (OOB) error estimate of 0.3%, we can move confidently to the held-out validation set.
predictFit <- predict(fit, validData)
confusionMatrix(predictFit, validData$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 1 0 0 0
## B 0 1136 4 0 0
## C 0 2 1021 4 0
## D 0 0 1 960 4
## E 0 0 0 0 1078
##
## Overall Statistics
##
## Accuracy : 0.9973
## 95% CI : (0.9956, 0.9984)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9966
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9951 0.9959 0.9963
## Specificity 0.9998 0.9992 0.9988 0.9990 1.0000
## Pos Pred Value 0.9994 0.9965 0.9942 0.9948 1.0000
## Neg Pred Value 1.0000 0.9994 0.9990 0.9992 0.9992
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1930 0.1735 0.1631 0.1832
## Detection Prevalence 0.2846 0.1937 0.1745 0.1640 0.1832
## Balanced Accuracy 0.9999 0.9983 0.9969 0.9974 0.9982
The estimated out-of-sample error is 1 - 0.9973 = 0.0027 (about 0.27%), consistent with the OOB estimate, so we are confident in the final model.
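Rather than reading accuracy off the printout, the same error can be computed directly from the confusionMatrix object (a sketch reusing `predictFit` and `validData` from above; `cm` and `out_of_sample_error` are illustrative names):

```r
cm <- confusionMatrix(predictFit, validData$classe)
# overall accuracy is stored in the named "overall" vector
out_of_sample_error <- 1 - as.numeric(cm$overall["Accuracy"])
round(out_of_sample_error, 4)  # 0.0027, matching 1 - 0.9973 above
```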
predictTest <- predict(fit, test)
print(predictTest)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E