# Machine Learning with Random Forest

## Objectives

- Create predictive models using the random forest (RF) algorithm and the randomForest package
- Predict to validation data to assess model performance
- Predict to raster data to make spatial predictions
- Assess variable importance
- Assess models using receiver operating characteristic curves (ROC) and the area under the curve (AUC) measure

## Overview

Before we experiment with using the **caret** package, which provides access to a variety of different **machine learning** algorithms, in this module we will explore the **randomForest** package to implement the **random forest** (**RF**) algorithm specifically. I will step through the process of creating a spatial prediction to map the likelihood or probability of palustrine forested/palustrine scrub/shrub (PFO/PSS) wetland occurrence using a variety of terrain variables. The provided *training.csv* table contains 1,000 examples of PFO/PSS wetlands and 1,000 not PFO/PSS examples. The *validation.csv* file contains a separate, non-overlapping 2,000 examples equally split between PFO/PSS wetlands and not PFO/PSS wetlands. I have also provided a raster stack (*predictors.img*) of all the predictor variables so that a spatial prediction can be produced. Since PFO/PSS wetlands should only occur in woody or forested extents, I have also provided a binary raster mask (*for_mask.img*) to subset the final result to only these areas. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.

Here is a brief description of all the provided variables.

*class*: PFO/PSS (wet) or not (not)*cost*: Distance from water bodies weighted by slope*crv_arc*: Topographic curvature*crv_plane*: Plan curvature*crv_pro*: Profile curvature*ctmi*: Compound topographic moisture index (CTMI)*diss_5*: Terrain dissection (11 x 11 window)*diss_10*: Terrain dissection (21 x 21 window)*diss_20*: Terrain dissection (41 x 41 window)*diss_25*: Terrain dissection (51 x 51 window)*diss_a*: Terrain dissection average*rough_5*: Terrain roughness (11 x 11 window)*rough_10*: Terrain roughness (21 x 21 window)*rough_20*: Terrain roughness (41 x 41 window)*rough_25*: Terrain Roughness (51 x 51 window)*rough_a*: Terrain roughness average*slp_d*: Topographic slope in degrees*sp_5*: Slope position (11 x 11 window)*sp_10*: Slope position (21 x 21 window)*sp_20*: Slope position (41 x 41 window)*sp_25*: slope position (51 x 51 window)*sp_a*: Slope position average

The data provided in this example are a subset of the data used in this publication:

Maxwell, A.E., T.A. Warner, and M.P. Strager, 2016. Predicting palustrine wetland probability using random forest machine learning and digital elevation data-derived terrain variables, *Photogrammetric Engineering & Remote Sensing*, 82(6): 437-447. https://doi.org/10.1016/S0099-1112(16)82038-8.

As always, to get started I am reading in all of the needed packages and data. The tables are read in using *read.csv()* and the raster data are loaded using the *raster()* function from the **raster** package. I created the training data by extracting raster data values at point locations using GIS software. I then saved the tables to **CSV** files. This process could also be completed in R using methods we have already discussed.

I then rename the raster bands to match the column names from the table. Note that the band order doesn’t matter; however, the correct name must be associated with the correct variable or band.

```
require(randomForest)
require(pROC)
require(raster)
require(rgdal)
require(tmap)
require(ggplot2)
require(caret)
```

```
<- read.csv("random_forests/training.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)
train <- read.csv("random_forests/validation.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)
test <- brick("random_forests/predictors.tif")
preds <- raster("random_forests/for_mask.tif") mask
```

`names(preds) <- c("cost", "crv_arc", "crv_plane", "crv_pro", "ctmi", "diss_5", "diss_10", "diss_20", "diss_25", "diss_a", "rough_5", "rough_10", "rough_20", "rough_25", "rough_a", "slp_d", "sp_5", "sp_10", "sp_20", "sp_25", "sp_a") `

## Create the Model

The code block below shows how to train a model with the *randomForest()* function from the **randomForest** package. The first column in the table (“class”) is the **dependent variable** while columns 2 through 22 provide the predictor variables. I am using 501 trees, calculating measures of variable importance, and assessing the model using the **out of bag** (**OOB**) data with a **confusion matrix** and an error rate. I am also setting a **random seed** to make sure that the results are reproducible.

```
set.seed(42)
<- randomForest(y= train[,1], train[,2:22], ntree=501, importance=T, confusion=T, err.rate=T) rf_model
```

Once the model is produced, I call it to print a summary of the results. The model generally performed well. The OOB error rate was only 3.2%, or 96.8% of the OOB data were correctly classified. Of the PFO/PSS samples, only 26 were incorrectly classified while 38 of the not wetland class were incorrectly classified, so **omission** and **commission error** are nearly balanced. You can also see that the algorithm used 4 random variables for splitting at each node (the *mtry* parameter). Since I did not specify this, the square root of the number of predictor variables was used.

```
rf_model
:
CallrandomForest(x = train[, 2:22], y = train[, 1], ntree = 501, importance = T, confusion = T, err.rate = T)
: classification
Type of random forest: 501
Number of trees: 4
No. of variables tried at each split
: 3.2%
OOB estimate of error rate:
Confusion matrix
not wet class.error962 38 0.038
not 26 974 0.026 wet
```

## Variable Importance

Using the *importance()* function from the randomForest package, we can obtained the importance estimates for each variable relative to each class and also overall using the **OOB mean decrease in accuracy** and/or **mean decrease in Gini** measures. In the second code block, I have plotted the importance measures. Higher values indicate a greater importance in the model.

Note that random forest-based importance measures have been shown to be biased if predictor variables are highly correlated, variables are measured on different scales, a mix of continuous and categorical variables are used, and/or categorical variables have varying numbers of levels. In this example, variables are very correlated since we calculated the same measures at different window sizes, so this is an issue. The **party** and **perimp** packages provide measures of **conditional variable importance**, which can be used to assess importance while taking into account variable correlation. Further, this package allows for approximations of both **partial** (i.e., conditional) and **marginal** importance. Partial importance relates to importance of a variable within the feature space and takes into account correlation between variables while marginal importance assess the relationship between the dependent variable and variable of interest without taking into account the other predictor variables in the model. With no correlation between predictor variables, partial and marginal importance are equivalent. Another means to assess variable importance using the RF algorithm is made available in the **Boruta** package. You may also want to check out the **rfUtilities** package.

Documentation for the party package can be found here. Here is a link to a paper on this topic by Strobl et al. Documentation for Boruta can be found here while a research article introducing the method can be found here.

```
importance(rf_model)
not wet MeanDecreaseAccuracy MeanDecreaseGini9.5252422 26.627942 26.913251 183.188275
cost 2.7114363 8.134470 8.741014 6.663290
ctmi 0.8500137 10.510031 10.452762 4.759517
crv_arc 0.6693872 14.716239 14.389290 6.121970
crv_plane 1.6327988 8.709588 9.024416 4.449532
crv_pro 14.0786011 15.869143 18.687559 32.797680
diss_10 15.5952264 17.916204 21.615596 64.318851
diss_20 18.5173809 19.342369 25.875810 79.512844
diss_25 7.7825581 13.160540 12.544342 11.840202
diss_5 11.2746311 15.828505 17.357294 31.550745
diss_a 12.2591646 17.233047 20.542069 86.359892
rough_10 14.3058916 13.788886 19.369884 51.784669
rough_20 16.5848323 12.871523 20.772291 42.153341
rough_25 10.8049583 16.728945 19.350248 136.217674
rough_5 12.4748354 15.021272 19.195641 68.487471
rough_a 9.9272365 11.959671 15.787975 125.113741
slp_d 8.5270362 12.050580 14.824608 10.423877
sp_10 9.5984981 12.731844 14.887266 13.122625
sp_20 6.4062282 14.352896 15.471995 18.392499
sp_25 8.9935641 12.050825 14.461741 8.129591
sp_5 8.7554578 13.501101 15.947827 14.138012 sp_a
```

`varImpPlot(rf_model)`

Calling *plot()* on the random forest model object will show how **OOB error** changes with the number of trees in the forest. This plot suggests that the model stabilizes after roughly 100 trees. I generally use 500 or more trees, as it has been shown that using more trees generally doesn’t reduce classification accuracy or cause overfitting. However, more trees will make the model more computationally intensive. If you aren’t sure how many trees to build, you can use this plot to assess the impact.

`plot(rf_model)`

**Partial dependency plots** are used to assess the dependency or response of the model to a single predictor variable. In the following examples, I have created plots to assess the impact of topographic slope, cost distance from streams weighted by topographic slope, and topographic dissection calculated with a 25-by-25 cell circular radius window size. The function requires that a model, data set, predictor variable of interest, and class of interest are provided. To generate these plots, the variable of interest is maintained in the model while the mean value is used for all other variables. Generally, lower slopes, proximity to streams, and more incision or dissection seem to be correlated with wetland occurrence. Although these graphs can be informative, they can also be noisy. So, be careful about reading too much into noisy patterns.

`partialPlot(x=rf_model, pred.data=train, x.var=slp_d, which.class="wet")`

`partialPlot(x=rf_model, pred.data=train, x.var=cost, which.class="wet")`

`partialPlot(x=rf_model, pred.data=train, x.var=diss_25, which.class="wet")`

## Prediction and Assessment

Now that we have a model, it can be used to predict to new data. In the next code block, I am predicting to the test or validation data. Instead of obtaining the resulting binary classification, I am using *type=“prob”* to obtain the resulting class likelihood or probability. Using *type=“class”* will return the classification result. I am setting the index equal to 2, since I want the probability for the wetland class as opposed to the not class. Generally, the values are indexed alphabetically. If *norm.votes* is TRUE then the class votes will be expressed as a fraction as opposed to a count. With *predict.all* set to FALSE, all predictions will not be written out, only the final result. When *proximity* is equal to FALSE, proximity measures will not be computed. When the *nodes* argument is set to FALSE a matrix of terminal node indicators will not be written to the result. So, I am simply obtaining the final probability value for each class with no additional information. Calling *head()* on the result will show the probability for the first five samples scaled from 0 to 1.

```
<- predict(rf_model, test, index=2, type="prob", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
pred_test head(pred_test)
: not wet
Output: 1 0.001996008 0.9980040
Output: 2 0.035928144 0.9640719
Output: 3 0.001996008 0.9980040
Output: 4 0.003992016 0.9960080
Output: 5 0.001996008 0.9980040
Output: 6 0.000000000 1.0000000 Output
```

To assess the model, I am now producing an **ROC curve** and associated **AUC** value using the *roc()* function from the **pROC** package. This requires that the correct class and resulting prediction be provided. An **AUC** value of 0.991 is obtained, suggesting strong model performance.

```
<- data.frame(pred_test)
pred_test <- roc(test$class, pred_test$wet)
pred_test_roc auc(pred_test_roc)
: 0.9912
Area under the curveplot(pred_test_roc)
```

ROC curves can also be plotted using **ggplot2** and the *ggroc()* function. This is my preferred method, as it allows for the customization options of **ggplot2** to be applied. You can also plot ROC curves for multiple models to the same graph for comparison, as will be demonstrated below. I have also plotted the reference line using *geom_abline()* with an intercept of 1 and a slope of 1.

```
ggroc(pred_test_roc, lwd=1.2, col="blue")+
geom_abline(intercept = 1, slope = 1, color = "red", linetype = "dashed", lwd=1.2)
```

Predicting to a raster grid is very similar to predicting to a table. I am providing the raster data and the trained model. I am also indicating that I want the probability of PFO/PSS wetland occurrence to be written out to each cell using *type=“prob”* and *index=2*. I am providing a file name for the raster output. If a full file path is not provided, then the grid will be written to the current working directory. I would recommend saving to TIFF or IMG format. Setting *progress* equal to “window” will cause a progress window to be launched. This can be useful since making predictions at every cell in a raster grid stack can take some time. Remember that the correct variable name must be assigned to each band. Also, it is okay to include bands that won’t be used. They are simply ignored.

`predict(preds, rf_model, type="prob", index=2, na.rm=TRUE, progress="window", overwrite=TRUE, filename="random_forests/rf_wetland.img")`

Once the model is created, I then read it back in using the *raster()* function, since it is a single-band raster. I then multiply the result by the mask so that all areas that are not woody are no longer predicted. Lastly, I create a map of the output using **tmap**. Generally, areas in low topographic positions near streams are predicted to be more likely to be wetlands than other topographic positions, which makes sense.

```
<- raster("random_forests/rf_wetland.img")
raster_result <- mask*raster_result masked_result
```

```
tm_shape(masked_result)+
tm_raster(palette="Blues")+
tm_layout(legend.outside = TRUE)+
tm_layout(title = "Wetland Probability", title.size = 1.5)
```

If you would like to assess the model using the binary classification as opposed to the class probabilities, you will need to change the *type* to “class” when predicting to new data. In this example, I am predicting out the class for each validation or test sample. I then use the *confusionMatrix()* function from **caret** to generate a confusion matrix. This function requires that the predicted and correct class be provided. I am also specifying that “wet” represents the positive case. The *mode* argument allows you to control what measures are provided. The options include “sens_spec”, “prec_recall”, or “everything.” When using “everything”, a wide variety of measures are provided including **specificity**, **sensitivity**, **precision**, **recall**, and **F1 score**. The results suggest strong performance, with a 96.4% accuracy and a **Kappa** statistic of 0.927. Precision and recall are both above 0.95.

Although I will not demonstrate it here, **diffeR** is another great package for assessing models. The documentation for this package can be found here.

```
<- predict(rf_model, test, type="class", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
pred_test_class head(pred_test_class)
1 2 3 4 5 6
wet wet wet wet wet wet : not wet Levels
```

```
confusionMatrix(pred_test_class, test$class, positive="wet", mode="everything")
Confusion Matrix and Statistics
Reference
Prediction not wet946 22
not 54 978
wet
: 0.962
Accuracy 95% CI : (0.9527, 0.9699)
: 0.5
No Information Rate -Value [Acc > NIR] : < 2.2e-16
P
: 0.924
Kappa
's Test P-Value : 0.0003766
Mcnemar
Sensitivity : 0.9780
Specificity : 0.9460
Pos Pred Value : 0.9477
Neg Pred Value : 0.9773
Precision : 0.9477
Recall : 0.9780
F1 : 0.9626
Prevalence : 0.5000
Detection Rate : 0.4890
Detection Prevalence : 0.5160
Balanced Accuracy : 0.9620
'Positive' Class : wet
```

ROC curves and the AUC measure can be used to compare models. In the next two code blocks I create models using just topographic slope followed by just the compound topographic moisture index (CTMI). The slope-only model provides an AUC of 0.934 while the CTMI-only model results in an AUC of 0.776, suggesting that CTMI is less predictive of wetland occurrence than topographic slope.

```
set.seed(42)
<- randomForest(class~slp_d, data=train, ntree=501, importance=T, confusion=T, err.rate=T)
rf_model_slp <- predict(rf_model_slp, test, index=2, type="prob", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
pred_test_slp <- data.frame(pred_test_slp)
pred_test_slp <- roc(test$class, pred_test_slp$wet)
pred_test_slp_roc auc(pred_test_slp_roc)
: 0.9337 Area under the curve
```

```
set.seed(42)
<- randomForest(class~ctmi, data=train, ntree=501, importance=T, confusion=T, err.rate=T)
rf_model_ctmi <- predict(rf_model_ctmi, test, index=2, type="prob", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
pred_test_ctmi <- data.frame(pred_test_ctmi)
pred_test_ctmi <- roc(test$class, pred_test_ctmi$wet)
pred_test_ctmi_roc auc(pred_test_ctmi_roc)
: 0.7755 Area under the curve
```

It is also possible to plot multiple ROC curves in the same space using the *ggroc()* function from ggplot2 by providing a list of ROC results. This provides an informative comparison of the three models produced in this exercise. Also, this allows for the customization provided by ggplot2.

```
ggroc(list(All=pred_test_roc, Slp=pred_test_slp_roc, ctmi=pred_test_ctmi_roc), lwd=1.2)+
geom_abline(intercept = 1, slope = 1, color = "red", linetype = "dashed", lwd=1.2)+
scale_color_manual(labels = c("All Variables", "Just CTMI", "Just Slope"), values= c("#88D04B", "#D65076", "#EFC050"))+
ggtitle("Model Comparison")+
labs(x="Specificity", y="Sensitivity")+
theme(axis.text.y = element_text(size=12))+
theme(axis.text.x = element_text(size=12))+
theme(plot.title = element_text(face="bold", size=18))+
theme(axis.title = element_text(size=14))+
theme(strip.text = element_text(size = 14))+
theme(legend.title=element_blank())
```

The pROC package also provides a statistical test to compare ROC curves (the **DeLong Test**). In the example, I am comparing the original model, with all predictor variables, to the model using only the topographic slope variable. This test suggests statistical difference, or that the model with all terrain features is providing a better prediction than the one using just slope.

```
roc.test(pred_test_roc, pred_test_slp_roc)
's test for two correlated ROC curves
DeLong
data: pred_test_roc and pred_test_slp_roc
Z = 10.968, p-value < 2.2e-16
alternative hypothesis: true difference in AUC is not equal to 0
95 percent confidence interval:
0.04719219 0.06772881
sample estimates:
AUC of roc1 AUC of roc2
0.9911565 0.9336960
```

## Concluding Remarks

I hope that this module provided an informative introduction to machine learning in R using the randomForest package. In the next module, we will explore the caret package for implementing a variety of machine learning algorithms in R to make a wide variety of predictions.