Machine Learning with caret
Objectives
- Train models, predict to new data, and assess model performance using different machine learning methods and the caret package
- Define training controls to optimize models and tune hyperparameters
- Explore means to improve model performance using training data balancing, feature selection, and pre-processing
- Make categorical and continuous predictions
- Plot decision trees
Overview
Expanding upon the last section, we will continue exploring machine learning in R. Specifically, we will use the caret (Classification and Regression Training) package. Many packages provide access to machine learning methods, and caret offers a standardized means to use a variety of algorithms from different packages. This link provides a list of all models that can be used through caret. In this module, we will specifically focus on k-nearest neighbor (k-NN), decision trees (DT), random forests (RF), and support vector machines (SVM); however, after learning to apply these methods you will be able to apply many more methods using similar syntax. We will explore caret using a variety of examples. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.
A cheat sheet for caret can be found here.
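Each method name used with train() has an associated set of tunable hyperparameters. If you are unsure what a method's hyperparameters are, caret's modelLookup() function will list them. A quick sketch, using the method strings that appear later in this module:

```r
library(caret)

# List the hyperparameters caret tunes for a given method string
modelLookup("knn")       # k: number of neighbors
modelLookup("rpart")     # cp: complexity parameter
modelLookup("rf")        # mtry: variables sampled at each split
modelLookup("svmRadial") # sigma and C: kernel width and cost
```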
Before beginning, you will need to load in the required packages.
library(caret)
library(rpart.plot)
library(randomForest)
library(plyr)
library(dplyr)
library(raster)
library(sf)
library(rgdal)
library(tmap)
library(tmaptools)
library(Metrics)
library(forcats)
Example 1: Wetland Classification
In this first example, we will predict wetland categories using different algorithms and compare the results. The training variables were derived from Landsat imagery and include brightness, greenness, wetness, and NDVI from September and April imagery. Also, terrain variables were included to offer additional predictors. Four classes are differentiated: not wetlands (Not), palustrine emergent wetlands (PEM), palustrine forested wetlands (PFO), and rivers/lakes/ponds (RLP). These data have not been published or used in a paper.
First, I read in the data. Next, I subset 200 examples of each class for training (train) using functions from dplyr. Optimally, more samples would be used to train the models; however, I am trying to minimize training and tuning time since this is just a demonstration. I then use the setdiff() function to extract all examples that were not included in the training set to a validation set (val).
wetdata <- read.csv("caret/wetland_data2.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
set.seed(42)
train <- wetdata %>% dplyr::group_by(class) %>% dplyr::sample_n(200, replace=FALSE)
val <- setdiff(wetdata, train)
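If you want to see what this stratified sampling is doing under the hood, the same split can be sketched in base R. This toy example uses a hypothetical data frame (the class and ndvi columns are made up) rather than the wetland data:

```r
set.seed(42)

# Toy stand-in for wetdata: 500 rows per class
toy <- data.frame(class = rep(c("Not", "PEM", "PFO", "RLP"), each = 500),
                  ndvi  = rnorm(2000))

# Draw 200 row indices per class, then split into training and validation sets
idx   <- unlist(lapply(split(seq_len(nrow(toy)), toy$class), sample, size = 200))
train <- toy[idx, ]
val   <- toy[-idx, ]

table(train$class)  # 200 of each class
```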
Now that I have created separate training and validation data sets, I can tune hyperparameters for the different models. Using the trainControl() function, I define the training and tuning parameters. Here, I am using cross validation with 5 folds. The available methods include:
- “boot”: bootstrap
- “cv”: k-fold cross validation
- “LOOCV”: leave-one-out cross validation
- “repeatedcv”: repeated k-fold cross validation
I tend to use k-fold cross validation, bootstrapping, or repeated k-fold cross validation. For k-fold cross validation, the number argument specifies the number of folds; for bootstrapping, it specifies the number of bootstrap samples. A repeats argument is also required for repeated k-fold cross validation. In this example, I am using 5-fold cross validation without repeats. I have also set the verboseIter argument to FALSE so that the results of each fold are not printed to the console. If you would like to monitor the progression of the hyperparameter tuning process, set this to TRUE. Optimally, I would use more folds and a larger training set; however, I am trying to speed up the process so that tuning the algorithms doesn’t take very long. I generally prefer to use 10 folds. I am also setting a random seed to obtain consistent results and make the experiment more reproducible.
set.seed(42)
trainctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
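For comparison, the other resampling methods listed above would be set up as follows. These objects are just illustrations and are not used in the models below:

```r
library(caret)

boot_ctrl <- trainControl(method = "boot", number = 25)  # 25 bootstrap samples
loo_ctrl  <- trainControl(method = "LOOCV")              # leave-one-out
rcv_ctrl  <- trainControl(method = "repeatedcv",         # 5-fold cross validation,
                          number = 5, repeats = 3)       # repeated 3 times
```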
In the next code block I am optimizing and training the four different models. Notice that the syntax is very similar. I only need to change the method to a different algorithm. I can also provide arguments specific to the algorithm; for example, I am providing an ntree argument for random forest. I am also centering and scaling the data for each model and setting the tuneLength to 10. So, ten values for each hyperparameter will be assessed using 5-fold cross validation. To fine tune a model, you should use a larger tune length; however, that will increase the time required. You can also provide your own list of values and implement tuneGrid as opposed to tuneLength. I am optimizing using the Kappa statistic, so the model with the best Kappa value will be returned as the final model. It is also possible to use overall accuracy as opposed to Kappa. Before running each model, I have set a random seed for reproducibility.
Note that it will take some time to tune and train these models if you choose to execute the code. Also, feel free to try different models. For example, the ranger package provides a faster implementation of random forest.
#Run models using caret
set.seed(42)
knn.model <- train(class~., data=train, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

set.seed(42)
dt.model <- train(class~., data=train, method = "rpart",
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
rf.model <- train(class~., data=train, method = "rf",
                  tuneLength = 10,
                  ntree=100,
                  importance=TRUE,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
svm.model <- train(class~., data=train, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")
Once models have been trained and tuned, they can be used to predict to new data. In the next code block, I am predicting to the validation data. Note that the same predictor variables must be provided and they must have the same names. It is okay to include variables that are not used. It is also fine if the variables are in a different order.
Once a prediction has been made, I use the confusionMatrix() function to obtain assessment metrics. Based on the reported metrics, RF and SVM outperform the k-NN and DT algorithms for this specific task.
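This name-matching behavior comes from R’s predict() generics, not anything specific to caret. The lm() sketch below (with made-up data) shows that new data can contain extra, unused columns and can list the predictors in a different order:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$y <- 2*d$x1 - d$x2 + rnorm(20, sd = 0.1)
m <- lm(y ~ x1 + x2, data = d)

# Columns reordered and an unused extra column included: still works,
# because predictors are matched by name
newd <- data.frame(extra = c(99, 99), x2 = c(0, 1), x1 = c(1, 0))
predict(m, newd)
```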
knn.predict <- predict(knn.model, val)
dt.predict  <- predict(dt.model, val)
rf.predict  <- predict(rf.model, val)
svm.predict <- predict(svm.model, val)
confusionMatrix(knn.predict, val$class)
Confusion Matrix and Statistics

          Reference
Prediction  Not  PEM  PFO  RLP
       Not 1505   18   24   47
       PEM  147 1240  282  119
       PFO  123  490 1417  324
       RLP   25   52   77 1310

Overall Statistics

               Accuracy : 0.76
                 95% CI : (0.75, 0.7698)
    No Information Rate : 0.25
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.68

 Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: Not Class: PEM Class: PFO Class: RLP
Sensitivity 0.8361 0.6889 0.7872 0.7278
Specificity 0.9835 0.8985 0.8265 0.9715
Pos Pred Value 0.9442 0.6935 0.6020 0.8948
Neg Pred Value 0.9474 0.8965 0.9210 0.9146
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.2090 0.1722 0.1968 0.1819
Detection Prevalence 0.2214 0.2483 0.3269 0.2033
Balanced Accuracy 0.9098 0.7937 0.8069 0.8496
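The Kappa value reported above can be reproduced by hand from the confusion matrix using kappa = (po - pe) / (1 - pe), where po is the observed accuracy and pe is the agreement expected by chance given the row and column totals:

```r
# k-NN confusion matrix from above (rows = prediction, columns = reference)
cm <- matrix(c(1505,   18,   24,   47,
                147, 1240,  282,  119,
                123,  490, 1417,  324,
                 25,   52,   77, 1310),
             nrow = 4, byrow = TRUE)

n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed accuracy: 0.76
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # chance agreement: 0.25
kappa <- (po - pe) / (1 - pe)               # 0.68
```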
confusionMatrix(dt.predict, val$class)
Confusion Matrix and Statistics

          Reference
Prediction  Not  PEM  PFO  RLP
       Not 1531   79  122  170
       PEM  107 1332  472  118
       PFO  139  336 1150  358
       RLP   23   53   56 1154

Overall Statistics

               Accuracy : 0.7176
                 95% CI : (0.7071, 0.728)
    No Information Rate : 0.25
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6235

 Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: Not Class: PEM Class: PFO Class: RLP
Sensitivity 0.8506 0.7400 0.6389 0.6411
Specificity 0.9313 0.8709 0.8457 0.9756
Pos Pred Value 0.8049 0.6565 0.5799 0.8974
Neg Pred Value 0.9492 0.9095 0.8754 0.8908
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.2126 0.1850 0.1597 0.1603
Detection Prevalence 0.2642 0.2818 0.2754 0.1786
Balanced Accuracy 0.8909 0.8055 0.7423 0.8083
confusionMatrix(rf.predict, val$class)
Confusion Matrix and Statistics

          Reference
Prediction  Not  PEM  PFO  RLP
       Not 1589   30   57   74
       PEM  104 1321  324  137
       PFO   58  353 1277  188
       RLP   49   96  142 1401

Overall Statistics

               Accuracy : 0.7761
                 95% CI : (0.7663, 0.7857)
    No Information Rate : 0.25
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7015

 Mcnemar's Test P-Value : 3.06e-11
Statistics by Class:
Class: Not Class: PEM Class: PFO Class: RLP
Sensitivity 0.8828 0.7339 0.7094 0.7783
Specificity 0.9702 0.8954 0.8891 0.9469
Pos Pred Value 0.9080 0.7004 0.6807 0.8300
Neg Pred Value 0.9613 0.9099 0.9018 0.9276
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.2207 0.1835 0.1774 0.1946
Detection Prevalence 0.2431 0.2619 0.2606 0.2344
Balanced Accuracy 0.9265 0.8146 0.7993 0.8626
confusionMatrix(svm.predict, val$class)
Confusion Matrix and Statistics

          Reference
Prediction  Not  PEM  PFO  RLP
       Not 1553   48   35   70
       PEM  102 1277  305  107
       PFO   61  389 1332  182
       RLP   84   86  128 1441

Overall Statistics

               Accuracy : 0.7782
                 95% CI : (0.7684, 0.7877)
    No Information Rate : 0.25
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7043

 Mcnemar's Test P-Value : 5.62e-09
Statistics by Class:
Class: Not Class: PEM Class: PFO Class: RLP
Sensitivity 0.8628 0.7094 0.7400 0.8006
Specificity 0.9717 0.9048 0.8830 0.9448
Pos Pred Value 0.9103 0.7130 0.6782 0.8286
Neg Pred Value 0.9550 0.9033 0.9106 0.9343
Prevalence 0.2500 0.2500 0.2500 0.2500
Detection Rate 0.2157 0.1774 0.1850 0.2001
Detection Prevalence 0.2369 0.2487 0.2728 0.2415
Balanced Accuracy 0.9172 0.8071 0.8115 0.8727
As discussed and demonstrated in the prior module, random forest provides an assessment of variable importance. To obtain these measures after a model has been generated with caret, you will need to extract the final model to a new object then call the importance() function on it. By calling the model, we can see the OOB error rate and confusion matrix for the OOB data. Based on the OOB mean decrease in accuracy measure, topographic slope was the most important variable in the prediction. Generally, both spectral and topographic variables were important in the model. The OOB error rate was 22.5%, suggesting that roughly a quarter of the OOB data are misclassified on average. So, the performance isn’t great. However, this is a complex classification problem.
#Variable Importance RF/OOB Error RF
rf.model.final <- rf.model$finalModel
importance(rf.model.final)
                 Not        PEM       PFO       RLP MeanDecreaseAccuracy MeanDecreaseGini
a_ndvi2   5.74239672  5.9778701 12.601644 4.2124869            10.000872         45.81355
abright   6.91729515 11.2023308  6.982464 6.0847881            12.608724         53.87912
agreen    7.63587685  3.9760913  9.102786 7.6173366            11.053074         60.70916
awet      6.98487614  6.0177666  5.054557 5.6133191            10.544203         27.55160
s_ndvi    7.39448139  8.5656727  7.876298 0.9542089            14.124229         35.81947
sbright   4.64701968 14.2204087  8.877035 5.0004584            17.358782         57.90330
sgreen    5.94762932  3.5785531  6.669556 2.7279847             7.936331         27.05002
swet      3.47939860  7.5277984  3.945337 2.2430319            10.207702         30.14842
slp_d    16.62502681  6.4341151 12.345918 5.4415807            21.971194         79.31722
diss_a   14.50122296  7.3490750  5.654878 2.7092294            15.785692         44.55795
rough_a  10.10283171  7.5752717  8.527104 1.7674813            13.711659         40.15968
sp_a      7.47914088  9.3065409  6.710554 5.0555033            12.303625         38.32568
ctmi      3.69423817  1.6336448  3.511083 3.6621277             6.390287         16.41397
curv_arc -0.91201199 -0.2576315  1.762561 4.1165854             3.547979         13.38326
curv_pro  0.05324813  0.5897456  3.305928 3.8294161             4.235983         15.16093
crv_pln   1.32975100 -1.7227436  2.445788 3.3088655             3.100601         12.93003
rf.model.final

Call:
 randomForest(x = x, y = y, ntree = 100, mtry = min(param$mtry, ncol(x)), importance = TRUE)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 5

        OOB estimate of  error rate: 22.5%
Confusion matrix:
    Not PEM PFO RLP class.error
Not 174   9  15   2       0.130
PEM   6 153  32   9       0.235
PFO   6  44 137  13       0.315
RLP  10  13  21 156       0.220
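As a quick check, the printed OOB error rate can be recomputed from the OOB confusion matrix: it is simply the proportion of off-diagonal (misclassified) out-of-bag samples.

```r
# OOB confusion matrix from the printed model (rows = reference class)
oob_cm <- matrix(c(174,   9,  15,   2,
                     6, 153,  32,   9,
                     6,  44, 137,  13,
                    10,  13,  21, 156),
                 nrow = 4, byrow = TRUE)

oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)  # 0.225, i.e., 22.5%
```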
The structure of the decision tree can be plotted using the plot() function. The rpart.plot package includes the prp() function which provides a prettier decision tree visualization. This also gives us a sense of what variables are most important in the model.
#Make better tree plot
dt.model.final <- dt.model$finalModel
plot(dt.model.final)
text(dt.model.final)
prp(dt.model.final)
Example 2: Indian Pines
In this second example, I will demonstrate predicting crop types from hyperspectral imagery. The hyperspectral data are from the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), which offers 220 spectral bands in the visible, NIR, and SWIR spectrum. The following classes are differentiated:
- Alfalfa (54 pixels)
- Corn (2502 pixels)
- Grass/Pasture (523 pixels)
- Trees (1294 pixels)
- Hay (489 pixels)
- Oats (20 pixels)
- Soybeans (4050 pixels)
- Wheat (212 pixels)
These data are publicly available and cover the Indian Pines test site in Indiana. They can be obtained here. I have provided a raster representing the different categories (92av3gr8class.img), an image containing all the spectral bands (92av3c.img), and a mask to differentiate mapped and unmapped pixels (mask_ip.img).
Note that this example takes some time to execute, so you may choose to simply read through it as opposed to execute all the code.
classes <- raster("caret/92av3gr8class.img")
image <- stack("caret/92av3c.img")
mask <- raster("caret/mask_ip.img")
head(classes@data@attributes, n=8)
[[1]]
  ID COUNT Red Green Blue   Class_Names Opacity
1  0     0 255   255  255                     0
2  1    54 255   255  138       Alfalfa     255
3  2  2502   2    28  243          Corn     255
4  3   523 255    89    0 Grass/pasture     255
5  4  1294   5   255  133         Trees     255
6  5   489 255     2  250           Hay     255
7  6    20  89     0  255          Oats     255
8  7  4050   2   171  255      Soybeans     255
9  8   212  12   255    7         Wheat     255
I did not provide the training and validation data as tables in this example, so I will need to create them in R. To produce these data, I will use the process outlined below.
- Convert the classes grid to points
- Change the column names
- Remove all zero values since these represent pixels without a mapped class at that location
- Convert the “Class” field to a factor
- Change the “Class” field values from numeric codes to class names to improve interpretability
p <- st_as_sf(rasterToPoints(classes, fun=NULL, spatial=TRUE))
names(p) <- c("Class", "geometry")
p2 <- filter(p, Class > 0)
p2$Class <- as.factor(p2$Class)
p2$Class <- revalue(p2$Class, c("1"="Alfalfa", "2"="Corn", "3"="Grass/Pasture", "4"="Trees", "5"="Hay", "6"="Oats", "7"="Soybeans", "8"="Wheat"))
- Next, I need to extract all the image bands at each mapped point or pixel location. This can be accomplished using the extract() function from the raster package. I then merge the resulting tables and remove the geometry field that is no longer required. Note that this can take some time since there are 220 bands to extract at each point location.
p3 <- as.data.frame(extract(image, p2))
data <- bind_cols(p2, p3)
st_geometry(data) <- NULL
- Now that I have extracted the predictor variables at each mapped point, I will split the data into training (train) and testing (test) sets using dplyr. I divide the data such that 50% of the samples from each class will be used for training and the remaining half will be used for testing. I now have separate and non-overlapping test and training sets.
set.seed(42)
train <- data %>% group_by(Class) %>% sample_frac(0.5, replace=FALSE)
test <- setdiff(data, train)
I can now create models. First, I define the training and tuning controls to use 5-fold cross validation. I then tune and train each of the four models. I have set the tuneLength to 5, so only five values for each hyperparameter are tested. I am doing this to speed up the process for demonstration purposes. However, if I were doing this for research purposes, I would test more values or use tuneGrid instead.
Again, if you choose to execute this code, it will take some time.
set.seed(42)
trainctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)

set.seed(42)
knn.model <- train(Class~., data=train, method = "knn",
                   tuneLength = 5,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

set.seed(42)
dt.model <- train(Class~., data=train, method = "rpart",
                  tuneLength = 5,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
rf.model <- train(Class~., data=train, method = "rf",
                  tuneLength = 5,
                  ntree = 100,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
svm.model <- train(Class~., data=train, method = "svmRadial",
                   tuneLength = 5,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")
Once the models are obtained, I predict to the withheld test data and then create confusion matrices and accuracy assessment metrics. Take some time to review the confusion matrices to compare the models and assess which classes were most confused. Note that I provided an imbalanced training data set since there were different proportions of each class on the landscape.
knn.predict <- predict(knn.model, test)
dt.predict  <- predict(dt.model, test)
rf.predict  <- predict(rf.model, test)
svm.predict <- predict(svm.model, test)
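Before reviewing the matrices, note what the No Information Rate means here: it is the accuracy you would get by always predicting the largest class. Because the 50% stratified split preserves class proportions, it can be computed from the full pixel counts listed earlier:

```r
# Pixel counts per class from the Indian Pines reference data
class_n <- c(Alfalfa = 54, Corn = 2502, "Grass/Pasture" = 523, Trees = 1294,
             Hay = 489, Oats = 20, Soybeans = 4050, Wheat = 212)

# Soybeans is the largest class, so always predicting Soybeans
# yields the No Information Rate
nir <- max(class_n) / sum(class_n)
round(nir, 4)  # 0.4429
```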
confusionMatrix(knn.predict, test$Class)
Confusion Matrix and Statistics

               Reference
Prediction      Alfalfa Corn Grass/Pasture Trees Hay Oats Soybeans Wheat
  Alfalfa             2    0             0     0   1    0        0     0
  Corn                0  664             3     0   0    1      230     0
  Grass/Pasture       0    2           203     9   4    1       14     0
  Trees               0    0            33   636   0    0        2     0
  Hay                22    0            17     0 239    0        1     0
  Oats                0    0             0     0   0    5        1     1
  Soybeans            3  584             5     0   1    2     1777     1
  Wheat               0    1             0     2   0    1        0   104

Overall Statistics

               Accuracy : 0.794
                 95% CI : (0.7819, 0.8056)
    No Information Rate : 0.4429
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7009

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Alfalfa Class: Corn Class: Grass/Pasture
Sensitivity 0.0740741 0.5308 0.77778
Specificity 0.9997800 0.9295 0.99304
Pos Pred Value 0.6666667 0.7394 0.87124
Neg Pred Value 0.9945283 0.8402 0.98663
Prevalence 0.0059055 0.2736 0.05709
Detection Rate 0.0004374 0.1452 0.04440
Detection Prevalence 0.0006562 0.1964 0.05096
Balanced Accuracy 0.5369270 0.7302 0.88541
Class: Trees Class: Hay Class: Oats Class: Soybeans
Sensitivity 0.9830 0.97551 0.500000 0.8775
Specificity 0.9911 0.99076 0.999562 0.7660
Pos Pred Value 0.9478 0.85663 0.714286 0.7488
Neg Pred Value 0.9972 0.99860 0.998905 0.8872
Prevalence 0.1415 0.05359 0.002187 0.4429
Detection Rate 0.1391 0.05227 0.001094 0.3887
Detection Prevalence 0.1468 0.06102 0.001531 0.5190
Balanced Accuracy 0.9870 0.98313 0.749781 0.8218
Class: Wheat
Sensitivity 0.98113
Specificity 0.99910
Pos Pred Value 0.96296
Neg Pred Value 0.99955
Prevalence 0.02318
Detection Rate 0.02275
Detection Prevalence 0.02362
Balanced Accuracy 0.99012
confusionMatrix(dt.predict, test$Class)
Confusion Matrix and Statistics

               Reference
Prediction      Alfalfa Corn Grass/Pasture Trees Hay Oats Soybeans Wheat
  Alfalfa             0    0             0     0   0    0        0     0
  Corn                0  406             3     0   8    0      124     0
  Grass/Pasture       0    3           206    62   9    5       20     0
  Trees               0    0            28   577   0    0        0     0
  Hay                24    1            17     0 228    0        1     0
  Oats                0    0             0     0   0    0        0     0
  Soybeans            3  820             4     0   0    0     1855     1
  Wheat               0   21             3     8   0    5       25   105

Overall Statistics

               Accuracy : 0.7386
                 95% CI : (0.7256, 0.7513)
    No Information Rate : 0.4429
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6163

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Alfalfa Class: Corn Class: Grass/Pasture
Sensitivity 0.000000 0.3245 0.78927
Specificity 1.000000 0.9593 0.97704
Pos Pred Value NaN 0.7505 0.67541
Neg Pred Value 0.994094 0.7904 0.98711
Prevalence 0.005906 0.2736 0.05709
Detection Rate 0.000000 0.0888 0.04506
Detection Prevalence 0.000000 0.1183 0.06671
Balanced Accuracy 0.500000 0.6419 0.88315
Class: Trees Class: Hay Class: Oats Class: Soybeans
Sensitivity 0.8918 0.93061 0.000000 0.9160
Specificity 0.9929 0.99006 1.000000 0.6749
Pos Pred Value 0.9537 0.84133 NaN 0.6914
Neg Pred Value 0.9824 0.99605 0.997813 0.9100
Prevalence 0.1415 0.05359 0.002187 0.4429
Detection Rate 0.1262 0.04987 0.000000 0.4057
Detection Prevalence 0.1323 0.05927 0.000000 0.5868
Balanced Accuracy 0.9423 0.96034 0.500000 0.7955
Class: Wheat
Sensitivity 0.99057
Specificity 0.98612
Pos Pred Value 0.62874
Neg Pred Value 0.99977
Prevalence 0.02318
Detection Rate 0.02297
Detection Prevalence 0.03653
Balanced Accuracy 0.98834
confusionMatrix(rf.predict, test$Class)
Confusion Matrix and Statistics

               Reference
Prediction      Alfalfa Corn Grass/Pasture Trees Hay Oats Soybeans Wheat
  Alfalfa            15    0             0     0   2    0        0     0
  Corn                0  942             4     0   0    0      108     0
  Grass/Pasture       0    2           229    13   4    0        9     0
  Trees               0    0            10   633   0    0        3     0
  Hay                 9    0            15     0 239    0        1     0
  Oats                0    0             0     0   0    9        2     1
  Soybeans            3  307             3     0   0    0     1902     2
  Wheat               0    0             0     1   0    1        0   103

Overall Statistics

               Accuracy : 0.8906
                 95% CI : (0.8812, 0.8995)
    No Information Rate : 0.4429
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8427

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Alfalfa Class: Corn Class: Grass/Pasture
Sensitivity 0.555556 0.7530 0.87739
Specificity 0.999560 0.9663 0.99350
Pos Pred Value 0.882353 0.8937 0.89105
Neg Pred Value 0.997366 0.9122 0.99258
Prevalence 0.005906 0.2736 0.05709
Detection Rate 0.003281 0.2060 0.05009
Detection Prevalence 0.003718 0.2305 0.05621
Balanced Accuracy 0.777558 0.8596 0.93545
Class: Trees Class: Hay Class: Oats Class: Soybeans
Sensitivity 0.9784 0.97551 0.900000 0.9393
Specificity 0.9967 0.99422 0.999342 0.8763
Pos Pred Value 0.9799 0.90530 0.750000 0.8579
Neg Pred Value 0.9964 0.99861 0.999781 0.9478
Prevalence 0.1415 0.05359 0.002187 0.4429
Detection Rate 0.1385 0.05227 0.001969 0.4160
Detection Prevalence 0.1413 0.05774 0.002625 0.4849
Balanced Accuracy 0.9875 0.98487 0.949671 0.9078
Class: Wheat
Sensitivity 0.97170
Specificity 0.99955
Pos Pred Value 0.98095
Neg Pred Value 0.99933
Prevalence 0.02318
Detection Rate 0.02253
Detection Prevalence 0.02297
Balanced Accuracy 0.98563
confusionMatrix(svm.predict, test$Class)
Confusion Matrix and Statistics

               Reference
Prediction      Alfalfa Corn Grass/Pasture Trees Hay Oats Soybeans Wheat
  Alfalfa            13    0             0     0   3    0        0     0
  Corn                0  967             1     0   0    0       72     0
  Grass/Pasture       0    1           249     6   4    0        9     0
  Trees               0    0             2   640   0    0        3     0
  Hay                11    0             4     0 238    0        0     0
  Oats                0    0             0     0   0    9        0     1
  Soybeans            3  283             5     0   0    0     1941     1
  Wheat               0    0             0     1   0    1        0   104

Overall Statistics

               Accuracy : 0.9101
                 95% CI : (0.9014, 0.9182)
    No Information Rate : 0.4429
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8706

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Alfalfa Class: Corn Class: Grass/Pasture
Sensitivity 0.481481 0.7730 0.95402
Specificity 0.999340 0.9780 0.99536
Pos Pred Value 0.812500 0.9298 0.92565
Neg Pred Value 0.996927 0.9196 0.99721
Prevalence 0.005906 0.2736 0.05709
Detection Rate 0.002843 0.2115 0.05446
Detection Prevalence 0.003500 0.2275 0.05884
Balanced Accuracy 0.740411 0.8755 0.97469
Class: Trees Class: Hay Class: Oats Class: Soybeans
Sensitivity 0.9892 0.97143 0.900000 0.9585
Specificity 0.9987 0.99653 0.999781 0.8854
Pos Pred Value 0.9922 0.94071 0.900000 0.8692
Neg Pred Value 0.9982 0.99838 0.999781 0.9641
Prevalence 0.1415 0.05359 0.002187 0.4429
Detection Rate 0.1400 0.05206 0.001969 0.4245
Detection Prevalence 0.1411 0.05534 0.002187 0.4884
Balanced Accuracy 0.9940 0.98398 0.949890 0.9219
Class: Wheat
Sensitivity 0.98113
Specificity 0.99955
Pos Pred Value 0.98113
Neg Pred Value 0.99955
Prevalence 0.02318
Detection Rate 0.02275
Detection Prevalence 0.02318
Balanced Accuracy 0.99034
Similar to the example from the previous module using the randomForest package, here I am predicting to the image to obtain a prediction at each cell location. I am using the support vector machine model since it provided the highest overall accuracy and Kappa statistic. I am using a progress window to monitor the progression. I am also setting overwrite equal to TRUE so that a previous output can be overwritten. If you do not want to overwrite a previous output, set this to FALSE. Again, this will take some time to execute if you choose to run it.
Once the raster-based prediction is generated, I read the result back in then multiply it by the mask to remove predictions over unmapped pixels. I then use tmap to visualize the mask and results. The masked example could then be written to disk using writeRaster().
To summarize, in this example I read in raster data, generated training and validation data from a categorical raster and a hyperspectral image, created and assessed four different models, then predicted back to the AVIRIS image using the best model. The results were then visualized using tmap.
predict(image, svm.model, progress="window", overwrite=TRUE, filename="caret/class_out.img")

raster_result <- raster("caret/class_out.img")
result_masked <- raster_result*mask
require(tmap)
tm_shape(mask)+
tm_raster(style= "cat", labels = c("Not Mapped", "Mapped"),
palette = c("gray", "red"),
title="Mask")+
tm_layout(title = "Classification Result", title.size = 1.5, title.snap.to.legend=TRUE)+
tm_layout(legend.outside=TRUE)
require(tmap)
tm_shape(raster_result)+
tm_raster(style= "cat", labels = c("Alfalfa","Corn","Grass","Hay","Oats","Soybeans","Trees","Wheat"),
palette = c("cyan4", "cornsilk", "darkgoldenrod3", "bisque4", "chocolate", "burlywood", "darkolivegreen", "darksalmon"),
title="Vegetation Types")+
tm_layout(title = "Classification Result", title.size = 1.5, title.snap.to.legend=TRUE)+
tm_layout(legend.outside=TRUE)
require(tmap)
tm_shape(result_masked)+
tm_raster(style= "cat", labels = c("Not Mapped", "Alfalfa","Corn","Grass","Hay","Oats","Soybeans","Trees","Wheat"),
palette = c("gray", "cyan4", "cornsilk", "darkgoldenrod3", "bisque4", "chocolate", "burlywood", "darkolivegreen", "darksalmon"),
title="Vegetation Types")+
tm_layout(title = "Classification Result", title.size = 1.5, title.snap.to.legend=TRUE)+
tm_layout(legend.outside=TRUE)
Example 3: Urban Land Cover Mapping using Machine Learning and GEOBIA
In this example I will predict urban land cover types using predictor variables derived for image objects created using geographic object-based image analysis (GEOBIA). These data were obtained from the University of California, Irvine (UCI) Machine Learning Repository. The data were originally used in the following papers:
Johnson, B., and Xie, Z., 2013. Classifying a high resolution image of an urban area using super-object information. ISPRS Journal of Photogrammetry and Remote Sensing, 83, 40-49.
Johnson, B., 2013. High resolution urban land cover classification using a competitive multi-scale object-based approach. Remote Sensing Letters, 4 (2), 131-140.
The goal here is to differentiate urban land cover classes using multi-scale spectral, size, shape, and textural information calculated for each image object. Similar to the last example, the classes are imbalanced in the training and validation data sets.
In the first code block, I am reading in the data and counting the number of samples in each class in the training set.
train <- read.csv("caret/training.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
test <- read.csv("caret/testing.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)

class_n <- train %>% dplyr::group_by(class) %>% dplyr::count()
print(class_n)
# A tibble: 9 x 2
# Groups: class [9]
  class          n
  <fct>      <int>
1 "asphalt " 14
2 "building " 25
3 "car " 15
4 "concrete " 23
5 "grass " 29
6 "pool " 15
7 "shadow " 16
8 "soil " 14
9 "tree " 17
Similar to the above examples, I then tune and train the four different models. Here I am using 10-fold cross validation and optimizing relative to Kappa. Once the models are trained, I then use them to predict to the validation data. Lastly, I produce confusion matrices to assess and compare the results.
Take some time to review the results and assessment. Note that this is a different problem than those presented above; however, the syntax is very similar. This is one of the benefits of caret: it provides a standardized way to experiment with different algorithms and machine learning problems within R.
set.seed(42)
trainctrl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)

knn.model <- train(class~., data=train, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

set.seed(42)
dt.model <- train(class~., data=train, method = "rpart",
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
rf.model <- train(class~., data=train, method = "rf",
                  tuneLength = 10,
                  ntree = 100,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

svm.model <- train(class~., data=train, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

knn.predict <- predict(knn.model, test)
dt.predict  <- predict(dt.model, test)
rf.predict  <- predict(rf.model, test)
svm.predict <- predict(svm.model, test)
confusionMatrix(knn.predict, test$class)
Confusion Matrix and Statistics

          Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       34        0   0        0     0    0      9    0    0
  building       0       68   1        4     1    1      0    2    0
  car            0        0  14        1     0    0      0    0    0
  concrete       1       19   0       71     5    0      0    3    0
  grass          1        1   1        2    58    1      0   10    9
  pool           0        1   1        0     0   12      2    0    0
  shadow         6        2   0        0     0    0     31    0    2
  soil           0        5   3       13     1    0      0    5    0
  tree           3        1   1        2    18    0      3    0   78

Overall Statistics

               Accuracy : 0.7318
                 95% CI : (0.6909, 0.7699)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6854

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.75556 0.7010 0.66667
Specificity 0.98052 0.9780 0.99794
Pos Pred Value 0.79070 0.8831 0.93333
Neg Pred Value 0.97629 0.9326 0.98577
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06706 0.1341 0.02761
Detection Prevalence 0.08481 0.1519 0.02959
Balanced Accuracy 0.86804 0.8395 0.83230
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.7634 0.6988 0.85714 0.68889
Specificity 0.9324 0.9410 0.99189 0.97835
Pos Pred Value 0.7172 0.6988 0.75000 0.75610
Neg Pred Value 0.9461 0.9410 0.99593 0.96996
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1400 0.1144 0.02367 0.06114
Detection Prevalence 0.1953 0.1637 0.03156 0.08087
Balanced Accuracy 0.8479 0.8199 0.92451 0.83362
Class: soil Class: tree
Sensitivity 0.250000 0.8764
Specificity 0.954825 0.9330
Pos Pred Value 0.185185 0.7358
Neg Pred Value 0.968750 0.9726
Prevalence 0.039448 0.1755
Detection Rate 0.009862 0.1538
Detection Prevalence 0.053254 0.2091
Balanced Accuracy 0.602413 0.9047
confusionMatrix(dt.predict, test$class)
Confusion Matrix and Statistics

          Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       33        0   0        0     0    0      8    0    0
  building       1       56   0       13     0    0      0    4    0
  car            2        7  17        7     2    0      0    3    0
  concrete       2       25   4       70     1    0      0    0    0
  grass          0        0   0        0    73    1      0    0   35
  pool           0        0   0        0     0   13      0    0    0
  shadow         7        1   0        0     0    0     21    0    1
  soil           0        8   0        3     3    0      0   13    0
  tree           0        0   0        0     4    0     16    0   53

Overall Statistics

               Accuracy : 0.6884
                 95% CI : (0.646, 0.7285)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6361

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.73333 0.5773 0.80952
Specificity 0.98268 0.9561 0.95679
Pos Pred Value 0.80488 0.7568 0.44737
Neg Pred Value 0.97425 0.9053 0.99147
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06509 0.1105 0.03353
Detection Prevalence 0.08087 0.1460 0.07495
Balanced Accuracy 0.85801 0.7667 0.88316
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.7527 0.8795 0.92857 0.46667
Specificity 0.9227 0.9151 1.00000 0.98052
Pos Pred Value 0.6863 0.6697 1.00000 0.70000
Neg Pred Value 0.9432 0.9749 0.99798 0.94969
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1381 0.1440 0.02564 0.04142
Detection Prevalence 0.2012 0.2150 0.02564 0.05917
Balanced Accuracy 0.8377 0.8973 0.96429 0.72359
Class: soil Class: tree
Sensitivity 0.65000 0.5955
Specificity 0.97125 0.9522
Pos Pred Value 0.48148 0.7260
Neg Pred Value 0.98542 0.9171
Prevalence 0.03945 0.1755
Detection Rate 0.02564 0.1045
Detection Prevalence 0.05325 0.1440
Balanced Accuracy 0.81063 0.7738
confusionMatrix(rf.predict, test$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       38        0   0        0     0    0      4    0    0
  building       0       68   0        3     1    0      0    3    0
  car            2        3  20        5     1    0      1    3    0
  concrete       1       21   1       84     0    0      0    1    0
  grass          0        0   0        0    70    1      0    0   23
  pool           0        0   0        0     0   13      0    0    0
  shadow         3        2   0        0     0    0     39    0    3
  soil           1        3   0        1     4    0      0   13    0
  tree           0        0   0        0     7    0      1    0   63

Overall Statistics

               Accuracy : 0.8047
                 95% CI : (0.7675, 0.8384)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7721

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.84444 0.7010 0.95238
Specificity 0.99134 0.9829 0.96914
Pos Pred Value 0.90476 0.9067 0.57143
Neg Pred Value 0.98495 0.9329 0.99788
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.07495 0.1341 0.03945
Detection Prevalence 0.08284 0.1479 0.06903
Balanced Accuracy 0.91789 0.8420 0.96076
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.9032 0.8434 0.92857 0.86667
Specificity 0.9420 0.9434 1.00000 0.98268
Pos Pred Value 0.7778 0.7447 1.00000 0.82979
Neg Pred Value 0.9774 0.9685 0.99798 0.98696
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1657 0.1381 0.02564 0.07692
Detection Prevalence 0.2130 0.1854 0.02564 0.09270
Balanced Accuracy 0.9226 0.8934 0.96429 0.92468
Class: soil Class: tree
Sensitivity 0.65000 0.7079
Specificity 0.98152 0.9809
Pos Pred Value 0.59091 0.8873
Neg Pred Value 0.98557 0.9404
Prevalence 0.03945 0.1755
Detection Rate 0.02564 0.1243
Detection Prevalence 0.04339 0.1400
Balanced Accuracy 0.81576 0.8444
confusionMatrix(svm.predict, test$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       32        1   0        0     0    0      2    0    0
  building       0       72   0        7     1    2      0    2    0
  car            0        3  20        5     0    0      0    1    1
  concrete       1       16   0       73     1    0      0    3    0
  grass          0        1   0        0    63    1      0    6   16
  pool           0        1   0        0     0   11      2    0    0
  shadow        12        1   0        0     0    0     41    0    3
  soil           0        2   1        7     5    0      0    8    0
  tree           0        0   0        1    13    0      0    0   69

Overall Statistics

               Accuracy : 0.7673
                 95% CI : (0.728, 0.8034)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7282

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.71111 0.7423 0.95238
Specificity 0.99351 0.9707 0.97942
Pos Pred Value 0.91429 0.8571 0.66667
Neg Pred Value 0.97246 0.9409 0.99790
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06312 0.1420 0.03945
Detection Prevalence 0.06903 0.1657 0.05917
Balanced Accuracy 0.85231 0.8565 0.96590
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.7849 0.7590 0.78571 0.91111
Specificity 0.9493 0.9434 0.99391 0.96537
Pos Pred Value 0.7766 0.7241 0.78571 0.71930
Neg Pred Value 0.9516 0.9524 0.99391 0.99111
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1440 0.1243 0.02170 0.08087
Detection Prevalence 0.1854 0.1716 0.02761 0.11243
Balanced Accuracy 0.8671 0.8512 0.88981 0.93824
Class: soil Class: tree
Sensitivity 0.40000 0.7753
Specificity 0.96920 0.9665
Pos Pred Value 0.34783 0.8313
Neg Pred Value 0.97521 0.9528
Prevalence 0.03945 0.1755
Detection Rate 0.01578 0.1361
Detection Prevalence 0.04536 0.1637
Balanced Accuracy 0.68460 0.8709
As noted in the machine learning background lectures, algorithms can be negatively impacted by imbalance in the training data. Fortunately, caret has built-in techniques for dealing with this issue including the following:
Down-sampling ("down"): randomly down-sample the more prevalent classes so that they have the same or a similar number of samples as the least frequent class
Up-sampling ("up"): randomly duplicate samples from the less frequent classes so that they approach the size of the most frequent class
SMOTE ("smote"): down-sample the majority class and synthesize new minority-class instances by interpolating between existing ones (Synthetic Minority Oversampling Technique)
In this example, I am using the up-sampling method. Notice that the code is the same as in the example above, except that I have added sampling="up" to the training controls, so this is an easy experiment to perform. Compare these results to those obtained without up-sampling. Did this provide any improvement? Are minority classes now being mapped more accurately? Note that the impact of data balancing will vary based on the specific classification problem, so you may or may not observe improvement.
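Outside of trainControl(), caret also exposes these balancing operations directly through upSample() and downSample(). Below is a minimal sketch on a made-up two-class toy dataset (the x, y, and class-label values here are illustrative, not part of this example's data):

```r
library(caret)

# Toy imbalanced data for illustration only
set.seed(42)
x <- data.frame(v1 = rnorm(60))
y <- factor(c(rep("A", 50), rep("B", 10)))

# upSample() duplicates minority-class rows until the classes are equal;
# downSample() randomly drops majority-class rows instead
up   <- upSample(x = x, y = y, yname = "class")
down <- downSample(x = x, y = y, yname = "class")

table(up$class)    # A and B both have 50 rows
table(down$class)  # A and B both have 10 rows
```

Passing sampling="up" to trainControl() applies the same idea inside each resampling iteration, which avoids leaking duplicated rows between cross-validation folds.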
set.seed(420)
trainctrl <- trainControl(method = "cv", number = 10, sampling="up", verboseIter = FALSE)

knn.model <- train(class~., data=train, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

set.seed(42)
dt.model <- train(class~., data=train, method = "rpart",
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
rf.model <- train(class~., data=train, method = "rf",
                  tuneLength = 10,
                  ntree = 100,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

svm.model <- train(class~., data=train, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

knn.predict <- predict(knn.model, test)
dt.predict <- predict(dt.model, test)
rf.predict <- predict(rf.model, test)
svm.predict <- predict(svm.model, test)
confusionMatrix(knn.predict, test$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       35        0   0        0     0    0     10    0    0
  building       0       67   1        3     1    1      0    3    0
  car            0        0  14        1     0    0      0    0    0
  concrete       0       12   0       47     3    0      0    3    0
  grass          0        1   0        1    46    0      0    4    1
  pool           0        1   2        0     0   12      3    0    0
  shadow         8        2   0        0     0    0     30    0    4
  soil           1       13   4       40     6    0      0   10    0
  tree           1        1   0        1    27    1      2    0   84

Overall Statistics

               Accuracy : 0.6805
                 95% CI : (0.6379, 0.7209)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6313

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.77778 0.6907 0.66667
Specificity 0.97835 0.9780 0.99794
Pos Pred Value 0.77778 0.8816 0.93333
Neg Pred Value 0.97835 0.9304 0.98577
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06903 0.1321 0.02761
Detection Prevalence 0.08876 0.1499 0.02959
Balanced Accuracy 0.87807 0.8344 0.83230
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.5054 0.55422 0.85714 0.66667
Specificity 0.9565 0.98349 0.98783 0.96970
Pos Pred Value 0.7231 0.86792 0.66667 0.68182
Neg Pred Value 0.8959 0.91850 0.99591 0.96760
Prevalence 0.1834 0.16371 0.02761 0.08876
Detection Rate 0.0927 0.09073 0.02367 0.05917
Detection Prevalence 0.1282 0.10454 0.03550 0.08679
Balanced Accuracy 0.7309 0.76885 0.92249 0.81818
Class: soil Class: tree
Sensitivity 0.50000 0.9438
Specificity 0.86858 0.9211
Pos Pred Value 0.13514 0.7179
Neg Pred Value 0.97691 0.9872
Prevalence 0.03945 0.1755
Detection Rate 0.01972 0.1657
Detection Prevalence 0.14596 0.2308
Balanced Accuracy 0.68429 0.9324
confusionMatrix(dt.predict, test$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       29        0   0        0     0    0      7    0    0
  building       1       56   0       13     0    0      1    4    0
  car            4        7  17        7     2    0      1    2    1
  concrete       4       25   4       70     1    0      0    0    0
  grass          0        0   0        0    51    0      0    0    7
  pool           0        0   0        0     0   13      0    0    0
  shadow         7        1   0        0     2    0     34    0    8
  soil           0        8   0        3     5    0      0   13    0
  tree           0        0   0        0    22    1      2    1   73

Overall Statistics

               Accuracy : 0.7022
                 95% CI : (0.6603, 0.7417)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6534

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.64444 0.5773 0.80952
Specificity 0.98485 0.9537 0.95062
Pos Pred Value 0.80556 0.7467 0.41463
Neg Pred Value 0.96603 0.9051 0.99142
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.05720 0.1105 0.03353
Detection Prevalence 0.07101 0.1479 0.08087
Balanced Accuracy 0.81465 0.7655 0.88007
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.7527 0.6145 0.92857 0.75556
Specificity 0.9179 0.9835 1.00000 0.96104
Pos Pred Value 0.6731 0.8793 1.00000 0.65385
Neg Pred Value 0.9429 0.9287 0.99798 0.97582
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1381 0.1006 0.02564 0.06706
Detection Prevalence 0.2051 0.1144 0.02564 0.10256
Balanced Accuracy 0.8353 0.7990 0.96429 0.85830
Class: soil Class: tree
Sensitivity 0.65000 0.8202
Specificity 0.96715 0.9378
Pos Pred Value 0.44828 0.7374
Neg Pred Value 0.98536 0.9608
Prevalence 0.03945 0.1755
Detection Rate 0.02564 0.1440
Detection Prevalence 0.05720 0.1953
Balanced Accuracy 0.80857 0.8790
confusionMatrix(rf.predict, test$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       33        1   0        0     0    0      2    0    0
  building       0       72   0        4     1    1      0    3    0
  car            1        5  20        3     1    0      0    1    1
  concrete       0       15   1       82     0    0      0    2    0
  grass          2        1   0        2    68    0      0    6   16
  pool           0        0   0        0     0   12      2    0    0
  shadow         7        1   0        0     0    0     40    0    4
  soil           0        2   0        1     1    0      0    8    0
  tree           2        0   0        1    12    1      1    0   68

Overall Statistics

               Accuracy : 0.7949
                 95% CI : (0.7571, 0.8292)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7596

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.73333 0.7423 0.95238
Specificity 0.99351 0.9780 0.97531
Pos Pred Value 0.91667 0.8889 0.62500
Neg Pred Value 0.97452 0.9413 0.99789
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06509 0.1420 0.03945
Detection Prevalence 0.07101 0.1598 0.06312
Balanced Accuracy 0.86342 0.8602 0.96384
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.8817 0.8193 0.85714 0.88889
Specificity 0.9565 0.9363 0.99594 0.97403
Pos Pred Value 0.8200 0.7158 0.85714 0.76923
Neg Pred Value 0.9730 0.9636 0.99594 0.98901
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1617 0.1341 0.02367 0.07890
Detection Prevalence 0.1972 0.1874 0.02761 0.10256
Balanced Accuracy 0.9191 0.8778 0.92654 0.93146
Class: soil Class: tree
Sensitivity 0.40000 0.7640
Specificity 0.99179 0.9593
Pos Pred Value 0.66667 0.8000
Neg Pred Value 0.97576 0.9502
Prevalence 0.03945 0.1755
Detection Rate 0.01578 0.1341
Detection Prevalence 0.02367 0.1677
Balanced Accuracy 0.69589 0.8617
confusionMatrix(svm.predict, test$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       31        0   0        0     0    0      4    0    0
  building       0       69   0        6     1    1      0    2    0
  car            1        1  20        4     1    0      0    1    3
  concrete       1       15   0       62     2    0      0    3    1
  grass          0        1   0        0    56    0      0    6    9
  pool           0        1   0        0     0   12      1    0    0
  shadow        12        2   0        0     0    0     40    0    3
  soil           0        8   1       20     8    0      0    8    0
  tree           0        0   0        1    15    1      0    0   73

Overall Statistics

               Accuracy : 0.7318
                 95% CI : (0.6909, 0.7699)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.689

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.68889 0.7113 0.95238
Specificity 0.99134 0.9756 0.97737
Pos Pred Value 0.88571 0.8734 0.64516
Neg Pred Value 0.97034 0.9346 0.99790
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06114 0.1361 0.03945
Detection Prevalence 0.06903 0.1558 0.06114
Balanced Accuracy 0.84012 0.8435 0.96487
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.6667 0.6747 0.85714 0.88889
Specificity 0.9469 0.9623 0.99594 0.96320
Pos Pred Value 0.7381 0.7778 0.85714 0.70175
Neg Pred Value 0.9267 0.9379 0.99594 0.98889
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1223 0.1105 0.02367 0.07890
Detection Prevalence 0.1657 0.1420 0.02761 0.11243
Balanced Accuracy 0.8068 0.8185 0.92654 0.92605
Class: soil Class: tree
Sensitivity 0.40000 0.8202
Specificity 0.92402 0.9593
Pos Pred Value 0.17778 0.8111
Neg Pred Value 0.97403 0.9616
Prevalence 0.03945 0.1755
Detection Rate 0.01578 0.1440
Detection Prevalence 0.08876 0.1775
Balanced Accuracy 0.66201 0.8898
In this last classification example, I am including feature selection using rfeControl() and a random forest-based feature selection method. I am testing multiple subset sizes (from 1 to 147 variables in steps of 5). Once feature selection is complete, I subset out the selected variables and create predictions using only this subset.
Again, whether feature selection improves model performance depends on the specific problem and varies on a case-by-case basis. Compare the obtained results. How did these models perform in comparison to the original models and balanced models? Which variables were found to be important?
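The object returned by rfe() can be queried directly to see how performance changed with subset size. A self-contained sketch on a small simulated dataset (twoClassSim(), the subset sizes, and the 3-fold setup here are illustrative choices, not part of this example's data):

```r
library(caret)
library(randomForest)  # rfFuncs uses random forests under the hood

set.seed(42)
# Small simulated two-class problem so the sketch runs on its own
dat <- twoClassSim(100)

ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 3)
fs <- rfe(dat[, names(dat) != "Class"], dat$Class,
          sizes = c(2, 5, 10), rfeControl = ctrl)

fs$results        # resampled Accuracy/Kappa for each subset size tested
predictors(fs)    # variable names in the winning subset
```

The same calls (fs_result$results, predictors(fs_result), plot(fs_result)) apply to the rfe result fit on the real training data below.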
set.seed(42)
trainctrl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)

set.seed(42)
fsctrl <- rfeControl(functions=rfFuncs, method="cv", number=10)
to_test <- seq(1, 147, by=5)

set.seed(42)
fs_result <- rfe(train[,2:ncol(train)], train[,1], sizes=c(to_test), metric = "Kappa", rfeControl=fsctrl)
selected <- predictors(fs_result)

#Prepare training and test data
test2 <- test[,selected]
test3 <- cbind(test$class, test2)
colnames(test3)[1] <- "class"
testx <- as.data.frame(test3)
train2 <- train[,selected]
train3 <- cbind(train$class, train2)
colnames(train3)[1] <- "class"
trainx <- as.data.frame(train3)

knn.model <- train(class~., data=trainx, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

set.seed(42)
dt.model <- train(class~., data=trainx, method = "rpart",
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

set.seed(42)
rf.model <- train(class~., data=trainx, method = "rf",
                  tuneLength = 10,
                  ntree = 100,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="Kappa")

svm.model <- train(class~., data=trainx, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="Kappa")

knn.predict <- predict(knn.model, testx)
dt.predict <- predict(dt.model, testx)
rf.predict <- predict(rf.model, testx)
svm.predict <- predict(svm.model, testx)
confusionMatrix(knn.predict, testx$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       34        1   0        0     0    0      6    0    0
  building       0       66   1        5     0    2      0    2    0
  car            0        0  15        2     1    0      0    0    0
  concrete       0       22   0       70     3    0      0    1    0
  grass          1        0   0        1    62    1      0    6   10
  pool           0        1   1        0     0   11      2    0    0
  shadow         9        1   0        0     0    0     35    0    6
  soil           0        5   3       14     2    0      0   11    0
  tree           1        1   1        1    15    0      2    0   73

Overall Statistics

               Accuracy : 0.7436
                 95% CI : (0.7032, 0.7811)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7007

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.75556 0.6804 0.71429
Specificity 0.98485 0.9756 0.99383
Pos Pred Value 0.82927 0.8684 0.83333
Neg Pred Value 0.97639 0.9281 0.98773
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06706 0.1302 0.02959
Detection Prevalence 0.08087 0.1499 0.03550
Balanced Accuracy 0.87020 0.8280 0.85406
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.7527 0.7470 0.78571 0.77778
Specificity 0.9372 0.9552 0.99189 0.96537
Pos Pred Value 0.7292 0.7654 0.73333 0.68627
Neg Pred Value 0.9440 0.9507 0.99390 0.97807
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1381 0.1223 0.02170 0.06903
Detection Prevalence 0.1893 0.1598 0.02959 0.10059
Balanced Accuracy 0.8449 0.8511 0.88880 0.87157
Class: soil Class: tree
Sensitivity 0.55000 0.8202
Specificity 0.95072 0.9498
Pos Pred Value 0.31429 0.7766
Neg Pred Value 0.98093 0.9613
Prevalence 0.03945 0.1755
Detection Rate 0.02170 0.1440
Detection Prevalence 0.06903 0.1854
Balanced Accuracy 0.75036 0.8850
confusionMatrix(dt.predict, testx$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       33        0   0        0     0    0      8    0    0
  building       1       56   0       13     0    0      0    4    0
  car            2        7  17        7     2    0      0    3    0
  concrete       2       25   4       70     1    0      0    0    0
  grass          0        0   0        0    73    1      0    0   35
  pool           0        0   0        0     0   13      0    0    0
  shadow         7        1   0        0     0    0     21    0    1
  soil           0        8   0        3     3    0      0   13    0
  tree           0        0   0        0     4    0     16    0   53

Overall Statistics

               Accuracy : 0.6884
                 95% CI : (0.646, 0.7285)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6361

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.73333 0.5773 0.80952
Specificity 0.98268 0.9561 0.95679
Pos Pred Value 0.80488 0.7568 0.44737
Neg Pred Value 0.97425 0.9053 0.99147
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06509 0.1105 0.03353
Detection Prevalence 0.08087 0.1460 0.07495
Balanced Accuracy 0.85801 0.7667 0.88316
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.7527 0.8795 0.92857 0.46667
Specificity 0.9227 0.9151 1.00000 0.98052
Pos Pred Value 0.6863 0.6697 1.00000 0.70000
Neg Pred Value 0.9432 0.9749 0.99798 0.94969
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1381 0.1440 0.02564 0.04142
Detection Prevalence 0.2012 0.2150 0.02564 0.05917
Balanced Accuracy 0.8377 0.8973 0.96429 0.72359
Class: soil Class: tree
Sensitivity 0.65000 0.5955
Specificity 0.97125 0.9522
Pos Pred Value 0.48148 0.7260
Neg Pred Value 0.98542 0.9171
Prevalence 0.03945 0.1755
Detection Rate 0.02564 0.1045
Detection Prevalence 0.05325 0.1440
Balanced Accuracy 0.81063 0.7738
confusionMatrix(rf.predict, testx$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       36        1   0        0     0    0      1    0    0
  building       1       67   0        7     0    2      0    1    0
  car            1        3  20        3     1    0      0    1    0
  concrete       0       23   1       80     1    0      0    1    0
  grass          1        0   0        0    75    1      0    6   17
  pool           0        0   0        0     0   11      1    0    0
  shadow         5        1   0        0     0    0     42    0    7
  soil           0        2   0        2     1    0      0   11    0
  tree           1        0   0        1     5    0      1    0   65

Overall Statistics

               Accuracy : 0.8028
                 95% CI : (0.7654, 0.8365)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7691

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.80000 0.6907 0.95238
Specificity 0.99567 0.9732 0.98148
Pos Pred Value 0.94737 0.8590 0.68966
Neg Pred Value 0.98081 0.9301 0.99791
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.07101 0.1321 0.03945
Detection Prevalence 0.07495 0.1538 0.05720
Balanced Accuracy 0.89784 0.8319 0.96693
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.8602 0.9036 0.78571 0.93333
Specificity 0.9372 0.9410 0.99797 0.97186
Pos Pred Value 0.7547 0.7500 0.91667 0.76364
Neg Pred Value 0.9676 0.9803 0.99394 0.99336
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1578 0.1479 0.02170 0.08284
Detection Prevalence 0.2091 0.1972 0.02367 0.10848
Balanced Accuracy 0.8987 0.9223 0.89184 0.95260
Class: soil Class: tree
Sensitivity 0.55000 0.7303
Specificity 0.98973 0.9809
Pos Pred Value 0.68750 0.8904
Neg Pred Value 0.98167 0.9447
Prevalence 0.03945 0.1755
Detection Rate 0.02170 0.1282
Detection Prevalence 0.03156 0.1440
Balanced Accuracy 0.76987 0.8556
confusionMatrix(svm.predict, testx$class)
Confusion Matrix and Statistics
           Reference
Prediction asphalt building car concrete grass pool shadow soil tree
  asphalt       32        0   0        0     0    0      1    0    0
  building       0       67   0        5     0    1      0    2    0
  car            1        3  20        5     2    0      1    1    1
  concrete       0       24   0       79     2    0      0    3    0
  grass          1        0   0        0    65    1      0    6   11
  pool           0        0   0        0     0   12      1    0    0
  shadow        11        2   0        0     0    0     42    0    6
  soil           0        1   1        3     2    0      0    8    0
  tree           0        0   0        1    12    0      0    0   71

Overall Statistics

               Accuracy : 0.7811
                 95% CI : (0.7425, 0.8163)
    No Information Rate : 0.1913
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.744

 Mcnemar's Test P-Value : NA
Statistics by Class:
Class: asphalt Class: building Class: car
Sensitivity 0.71111 0.6907 0.95238
Specificity 0.99784 0.9805 0.97119
Pos Pred Value 0.96970 0.8933 0.58824
Neg Pred Value 0.97257 0.9306 0.99789
Prevalence 0.08876 0.1913 0.04142
Detection Rate 0.06312 0.1321 0.03945
Detection Prevalence 0.06509 0.1479 0.06706
Balanced Accuracy 0.85447 0.8356 0.96179
Class: concrete Class: grass Class: pool Class: shadow
Sensitivity 0.8495 0.7831 0.85714 0.93333
Specificity 0.9300 0.9552 0.99797 0.95887
Pos Pred Value 0.7315 0.7738 0.92308 0.68852
Neg Pred Value 0.9649 0.9574 0.99595 0.99327
Prevalence 0.1834 0.1637 0.02761 0.08876
Detection Rate 0.1558 0.1282 0.02367 0.08284
Detection Prevalence 0.2130 0.1657 0.02564 0.12032
Balanced Accuracy 0.8897 0.8692 0.92756 0.94610
Class: soil Class: tree
Sensitivity 0.40000 0.7978
Specificity 0.98563 0.9689
Pos Pred Value 0.53333 0.8452
Neg Pred Value 0.97561 0.9574
Prevalence 0.03945 0.1755
Detection Rate 0.01578 0.1400
Detection Prevalence 0.02959 0.1657
Balanced Accuracy 0.69281 0.8833
Example 4: A Regression Example
It is also possible to use caret to produce continuous predictions, similar to linear regression and geographically weighted regression. In this last example, I will repeat a portion of the analysis from the regression module and compare the results to those obtained with machine learning. As you might remember, the goal is to predict the percentage of people over 25 that have at least a bachelor's degree by county using multiple other variables. These data violated several assumptions of linear regression, so machine learning might be more appropriate.
First, I read in the Census data as a table. Then I split the data into training and testing sets using a 50/50 split. I make a model using multiple regression then predict to the withheld data and obtain an RMSE estimate.
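As an aside, the 50/50 split could also be made with caret's createDataPartition(), which for a continuous outcome stratifies the sampling on quantile groups of the response. A sketch, using a made-up stand-in for the census table (the real data are read from caret/census_data.csv below):

```r
library(caret)

set.seed(42)
# Stand-in for the census table used in this example
census <- data.frame(per_25_bach = runif(100, 5, 60))

# createDataPartition() returns row indices; list = FALSE gives a matrix
idx <- createDataPartition(census$per_25_bach, p = 0.5, list = FALSE)
train_reg <- census[idx, , drop = FALSE]
test_reg  <- census[-idx, , drop = FALSE]

nrow(train_reg) + nrow(test_reg) == nrow(census)  # every row lands in exactly one set
```

Indexing with idx and -idx guarantees the two sets are disjoint, which the sample_frac()/setdiff() approach only ensures when no two rows are identical.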
Next, I create models and predictions using the four machine learning algorithms. Note that I have changed the tuning metric to RMSE, as Kappa is not appropriate for a continuous prediction. I then predict to the withheld data and obtain RMSE values.
In the last two code blocks, I generate a graph to compare the RMSE values. Based on RMSE, all the machine learning methods outperformed multiple regression. Random forests and support vector machines provide the best performance.
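For reference, the rmse() function from Metrics is just the square root of the mean squared difference between observed and predicted values, which can be sketched in base R:

```r
# Root mean square error in base R; obs and pred are numeric vectors
rmse_manual <- function(obs, pred) {
  sqrt(mean((obs - pred)^2))
}

rmse_manual(c(2, 4, 6), c(1, 4, 8))  # sqrt((1 + 0 + 4) / 3), about 1.29
```

Because RMSE is in the units of the response, the values compared below can be read directly as typical error in percentage points of the bachelor's-degree variable.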
census <- read.csv("caret/census_data.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)

set.seed(42)
train_reg <- census %>% sample_frac(0.5)
test_reg <- setdiff(census, train_reg)

mr_model <- lm(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_native_born + per_eng_only + per_broadband, data = train_reg)

mr_predict <- predict(mr_model, test_reg)
mr_rmse <- rmse(test_reg$per_25_bach, mr_predict)
set.seed(42)
trainctrl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)

knn.model <- train(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_native_born + per_eng_only + per_broadband, data = train_reg, method = "knn",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="RMSE")

set.seed(42)
dt.model <- train(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_native_born + per_eng_only + per_broadband, data = train_reg, method = "rpart",
                  tuneLength = 10,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="RMSE")

set.seed(42)
rf.model <- train(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_native_born + per_eng_only + per_broadband, data = train_reg, method = "rf",
                  tuneLength = 10,
                  ntree = 100,
                  preProcess = c("center", "scale"),
                  trControl = trainctrl,
                  metric="RMSE")

note: only 6 unique complexity parameters in default grid. Truncating the grid to 6.

svm.model <- train(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_native_born + per_eng_only + per_broadband, data = train_reg, method = "svmRadial",
                   tuneLength = 10,
                   preProcess = c("center", "scale"),
                   trControl = trainctrl,
                   metric="RMSE")
knn.predict <- predict(knn.model, test_reg)
dt.predict <- predict(dt.model, test_reg)
rf.predict <- predict(rf.model, test_reg)
svm.predict <- predict(svm.model, test_reg)

knn_rmse <- rmse(test_reg$per_25_bach, knn.predict)
dt_rmse <- rmse(test_reg$per_25_bach, dt.predict)
rf_rmse <- rmse(test_reg$per_25_bach, rf.predict)
svm_rmse <- rmse(test_reg$per_25_bach, svm.predict)

rmse_results <- c(mr_rmse, knn_rmse, dt_rmse, rf_rmse, svm_rmse)
rmse_labels <- c("Multiple Regression", "k-NN", "Decision Trees", "Random Forest", "Support Vector Machines")
rmse_data <- data.frame(model=rmse_labels, rmse=rmse_results)
ggplot(rmse_data, aes(x=reorder(model, rmse), y=rmse))+
  geom_bar(stat="identity")+
  ggtitle("Model Comparison")+
  labs(x="Model", y="RMSE")+
  coord_flip()
Concluding Remarks
That’s it! Using these examples, you should be able to apply machine learning to make predictions on spatial data. I would recommend trying out these methods on your own data and experimenting with different algorithms.