Random Forest Machine Learning

Objectives

  1. Create predictive models using the random forest (RF) algorithm and the randomForest package
  2. Predict to validation data to assess model performance
  3. Predict to raster data to make spatial predictions
  4. Assess variable importance using the out of bag (OOB) data and partial dependency plots
  5. Assess models using receiver operating characteristic (ROC) curves and the area under the curve (AUC) measure

Overview

Before we experiment with the caret package, which provides access to a wide variety of machine learning algorithms, this module explores the randomForest package to implement the random forest (RF) algorithm specifically. I will step through the process of creating a spatial prediction to map the likelihood or probability of palustrine forested/palustrine scrub/shrub (PFO/PSS) wetland occurrence using a variety of terrain variables. The provided training.csv table contains 1,000 examples of PFO/PSS wetlands and 1,000 not-PFO/PSS examples. The validation.csv file contains a separate, non-overlapping set of 2,000 examples, equally split between PFO/PSS wetlands and not-PFO/PSS wetlands. I have also provided a raster stack (predictors.img) of all the predictor variables so that a spatial prediction can be produced. Since PFO/PSS wetlands should only occur in woody extents, I have also provided a binary raster mask (for_mask.img) to subset the final result to only woody areas.

Here is a brief description of all the provided variables.

  1. class: PFO/PSS (wet) or not (not)
  2. cost: Distance from water bodies
  3. crv_arc: Topographic curvature
  4. crv_plane: Plan curvature
  5. crv_pro: Profile curvature
  6. ctmi: Compound topographic moisture index (CTMI)
  7. diss_5: Terrain dissection (11 x 11 window)
  8. diss_10: Terrain dissection (21 x 21 window)
  9. diss_20: Terrain dissection (41 x 41 window)
  10. diss_25: Terrain dissection (51 x 51 window)
  11. diss_a: Terrain dissection average
  12. rough_5: Terrain roughness (11 x 11 window)
  13. rough_10: Terrain roughness (21 x 21 window)
  14. rough_20: Terrain roughness (41 x 41 window)
  15. rough_25: Terrain roughness (51 x 51 window)
  16. rough_a: Terrain roughness average
  17. slp_d: Topographic slope in degrees
  18. sp_5: Slope position (11 x 11 window)
  19. sp_10: Slope position (21 x 21 window)
  20. sp_20: Slope position (41 x 41 window)
  21. sp_25: Slope position (51 x 51 window)
  22. sp_a: Slope position average

The data provided in this example are a subset of the data used in this publication:

Maxwell, A.E., T.A. Warner, and M.P. Strager, 2016. Predicting palustrine wetland probability using random forest machine learning and digital elevation data-derived terrain variables, Photogrammetric Engineering & Remote Sensing, 82(6): 437-447. https://doi.org/10.1016/S0099-1112(16)82038-8.

As always, to get started I am reading in all of the needed packages and data. The tables are read in using read.csv() and the raster data are loaded using the raster() function from the raster package. Note that I created the training data by extracting raster data values at point locations using GIS software. I then saved the tables to CSV files. This process could also be completed in R using methods we have already discussed.
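Below is a minimal sketch of this setup. The file names come from the text above; the object names (train, validation, predictors, for_mask) are my own, and I use stack() for the multi-band predictor image since raster() reads only a single band.

```r
library(randomForest)
library(raster)
library(pROC)
library(ggplot2)
library(caret)
library(tmap)

train <- read.csv("training.csv")         # 1,000 wet + 1,000 not samples
validation <- read.csv("validation.csv")  # separate 2,000-sample validation set
predictors <- stack("predictors.img")     # multi-band stack of terrain predictors
for_mask <- raster("for_mask.img")        # binary woody-area mask
```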

I then rename the raster bands to match the column names from the table. Note that the band order doesn’t matter; however, the correct name must be associated with the correct variable or band.
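A sketch of the renaming, assuming the bands happen to be stored in the same order as the table columns (if not, the name vector must be reordered to match):

```r
# Assign the table's predictor column names (columns 2-22) to the bands.
names(predictors) <- c("cost", "crv_arc", "crv_plane", "crv_pro", "ctmi",
                       "diss_5", "diss_10", "diss_20", "diss_25", "diss_a",
                       "rough_5", "rough_10", "rough_20", "rough_25", "rough_a",
                       "slp_d", "sp_5", "sp_10", "sp_20", "sp_25", "sp_a")
```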

Create the Model

The code block below shows how to train a model with the randomForest() function from the randomForest package. The first column in the table (“class”) is the dependent variable, while columns 2 through 22 provide the predictor variables. I am using 501 trees, calculating measures of variable importance, and assessing the model using the OOB data with a confusion matrix and an error rate. I am also setting a random seed to make sure that my results are reproducible.
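A sketch of that call; the seed value is arbitrary and the object names are my own:

```r
set.seed(42)
train$class <- as.factor(train$class)       # classification requires a factor response
rf_model <- randomForest(x = train[, 2:22], # predictor variables
                         y = train$class,   # dependent variable (wet/not)
                         ntree = 501,       # number of trees
                         importance = TRUE) # compute variable importance
rf_model  # print the OOB error rate and confusion matrix
```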

Once the model is produced, I call it to print a summary of the results. The model generally performed well. The OOB error rate was only 3.2%; in other words, 96.8% of the OOB data in each tree, on average, were correctly classified. Only 26 of the PFO/PSS samples were incorrectly classified, while 38 of the not class were incorrectly classified, so omission and commission error are nearly balanced. You can also see that the algorithm used 4 random variables for splitting at each node (the mtry parameter). Since I did not specify this, the square root of the number of predictor variables was used.

Variable Importance

Using the importance() function from the randomForest package, we can obtain the importance estimates for each variable, relative to each class and also overall, using the OOB mean decrease in accuracy and the mean decrease in Gini measures. In the second code block, I have plotted the importance measures. Higher values indicate greater importance in the model.
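For example, something like the following; varImpPlot() is the randomForest package's built-in plot of both measures (the original may have plotted them differently):

```r
importance(rf_model)  # per-class and overall mean decrease in accuracy, plus mean decrease in Gini
varImpPlot(rf_model)  # dot plots of both importance measures
```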

Note that random forest-based importance measures have been shown to be biased if the number of training samples in each class is not balanced and/or the variables are highly correlated. In this example, the variables are highly correlated, since we calculated the same measures at different window sizes, so this is a concern. The party package provides a measure of conditional variable importance, which can be used to assess importance while accounting for variable correlation. Unfortunately, this process is computationally intensive and can be slow; I have generally found that I have to reduce the number of training samples to obtain results in a reasonable amount of time.
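A hedged sketch of conditional importance with party, using a reduced sample to keep the run time manageable; the sample size and control settings are illustrative only:

```r
library(party)
set.seed(42)
small <- train[sample(nrow(train), 500), ]  # reduce samples for speed
cf <- cforest(class ~ ., data = small,
              controls = cforest_unbiased(ntree = 101))
varimp(cf, conditional = TRUE)              # conditional permutation importance
```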

See the documentation for the party package and the paper by Strobl et al. on conditional variable importance for further detail.

Calling plot() on the random forest model will show how OOB error changes with the number of trees in the forest. This plot suggests that the model stabilizes after roughly 100 trees. I generally use 500 or more trees, as it has been shown that using more trees generally doesn’t reduce classification accuracy or cause overfitting. However, more trees will make the model more computationally intensive. If you aren’t sure how many trees to build, you can use this plot to assess the impact.
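For example:

```r
plot(rf_model)  # OOB error (overall and per class) versus number of trees
```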

Partial dependency plots are used to assess the dependency, or response, of the model to a single predictor variable. In the following examples, I have created plots to assess the impact of topographic slope, cost distance from streams weighted by topographic slope, and topographic dissection calculated with a circular window with a 25-cell radius. The function requires that a model, data set, predictor variable, and class be provided. Generally, lower slopes, proximity to streams, and more incision or dissection seem to be correlated with wetland occurrence. Although these graphs can be informative, they can also be noisy, so be careful about reading too much into noisy patterns.
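A sketch using partialPlot() from randomForest with the three variables discussed above; “wet” is the class of interest:

```r
partialPlot(rf_model, train, slp_d, "wet")    # topographic slope
partialPlot(rf_model, train, cost, "wet")     # cost distance from streams
partialPlot(rf_model, train, diss_25, "wet")  # dissection, 25-cell radius window
```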

Prediction and Assessment

Now that we have a model, it can be used to predict to new data. In the next code block, I am predicting to the test or validation data. Instead of obtaining the resulting binary classification, I am using type=“prob” to obtain the class likelihood or probability; using type=“class” would return the classification result instead. I am selecting the second column (index 2), since I want the probability for the wetland class as opposed to the not class; the classes are indexed alphabetically. If norm.votes is TRUE, the class votes are expressed as a fraction rather than a count. With predict.all set to FALSE, the predictions of the individual trees are not returned, only the aggregated result. With proximity set to FALSE, proximity measures are not computed, and with nodes set to FALSE, a matrix of terminal node indicators is not included in the result. So, I am simply obtaining the final probability for each class with no additional information. Calling head() on the result shows the probabilities for the first few samples, scaled from 0 to 1.
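A sketch of this prediction, assuming the object names from earlier:

```r
pred_prob <- predict(rf_model, newdata = validation, type = "prob",
                     norm.votes = TRUE, predict.all = FALSE,
                     proximity = FALSE, nodes = FALSE)
wet_prob <- pred_prob[, 2]  # column 2 = "wet" (classes indexed alphabetically)
head(wet_prob)              # probabilities for the first few samples
```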

To assess the model, I am now producing an ROC curve and associated AUC value using the roc() function from the pROC package. This requires that the correct classes and the resulting predictions be provided. An AUC value of 0.991 is obtained, suggesting strong model performance.
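Something like:

```r
roc_full <- roc(validation$class, wet_prob)  # reference classes, predicted probabilities
auc(roc_full)                                # 0.991 as reported above
```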

ROC curves can also be plotted using the ggroc() function from pROC, which returns a ggplot object. This is my preferred method, as it allows the customization options of ggplot2 to be applied. I have also plotted the reference line using geom_abline() with an intercept of 1 and a slope of 1.
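A sketch of that plot:

```r
ggroc(roc_full) +
  geom_abline(intercept = 1, slope = 1, linetype = "dashed") +  # chance reference line
  theme_bw()
```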

Predicting to a raster grid is very similar to predicting to a table. I am providing the raster data and the trained model. I am also indicating that I want the probability of PFO/PSS wetland occurrence to be written out to each cell using type=“prob” and index=2. I am providing a file name for the raster output. If a full file path is not provided, then the grid will be written to the current working directory. I would recommend saving to TIFF or IMG format. Setting progress equal to “window” will cause a progress window to be launched. This can be useful since making predictions at every cell in a raster grid stack can take some time. Remember that the correct variable name must be assigned to each band. Also, it is okay to include bands that won’t be used. They are simply ignored.
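A sketch of the raster prediction; the output file name is my own choice:

```r
# type and index are passed through to predict.randomForest;
# index = 2 keeps the probability of the "wet" class.
wet_grid <- predict(predictors, rf_model, type = "prob", index = 2,
                    filename = "rf_prob.img", progress = "window",
                    overwrite = TRUE)
```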

Once the prediction raster is written to disk, I read it back in using the raster() function, since it is a single-band raster. I then multiply the result by the mask so that all areas that are not woody are no longer predicted. Lastly, I create a map of the output using tmap. Generally, areas in low topographic positions near streams are predicted to be more likely to be wetlands than other topographic positions, which makes sense.
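A sketch of the masking and mapping, using tmap v3 syntax; the palette choice is mine:

```r
wet_grid <- raster("rf_prob.img")  # read the single-band prediction back in
wet_masked <- wet_grid * for_mask  # set non-woody cells to zero
tm_shape(wet_masked) +
  tm_raster(palette = "Blues", title = "P(PFO/PSS wetland)")
```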

If you would like to assess the model using the binary classification as opposed to the probabilities, you will need to change the type to “class” when predicting to new data. In this example, I am predicting the class for each validation or test sample. I then use the confusionMatrix() function from caret to generate a confusion matrix. This function requires that the predicted and correct classes be provided. I am also specifying that “wet” represents the positive case. The mode argument controls which measures are reported; the options are “sens_spec”, “prec_recall”, and “everything”. When using “everything”, a wide variety of measures are provided, including specificity, sensitivity, precision, recall, and the F1 score. The results suggest strong performance, with 96.4% accuracy and a Kappa statistic of 0.927. Precision and recall are both above 0.95.
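A sketch of this assessment:

```r
pred_class <- predict(rf_model, newdata = validation, type = "class")
confusionMatrix(data = pred_class,
                reference = as.factor(validation$class),
                positive = "wet", mode = "everything")
```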

Although I will not demonstrate it here, diffeR is another great package for assessing models; see its documentation for details.

ROC curves and the AUC measure can be used to compare models. In the next two code blocks I create models using just topographic slope followed by just the compound topographic moisture index (CTMI). The slope-only model provides an AUC of 0.934 while the CTMI-only model results in an AUC of 0.776, suggesting that CTMI is less predictive of wetland occurrence than topographic slope.
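A sketch of the two single-variable models and their ROC curves:

```r
set.seed(42)
rf_slope <- randomForest(class ~ slp_d, data = train, ntree = 501)
rf_ctmi  <- randomForest(class ~ ctmi,  data = train, ntree = 501)

roc_slope <- roc(validation$class,
                 predict(rf_slope, validation, type = "prob")[, 2])
roc_ctmi  <- roc(validation$class,
                 predict(rf_ctmi, validation, type = "prob")[, 2])
auc(roc_slope)  # ~0.934 per the text
auc(roc_ctmi)   # ~0.776 per the text
```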

It is also possible to plot multiple ROC curves in the same space by providing a list of ROC results to the ggroc() function from pROC. This provides an informative comparison of the three models produced in this exercise and again allows the customization options of ggplot2 to be applied.
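For example:

```r
ggroc(list(All = roc_full, Slope = roc_slope, CTMI = roc_ctmi)) +
  geom_abline(intercept = 1, slope = 1, linetype = "dashed") +
  labs(colour = "Model") +
  theme_bw()
```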

The pROC package also provides a statistical test to compare ROC curves (the DeLong test). In the example, I am comparing the original model, with all predictor variables, to the model using only the topographic slope variable. The test indicates a statistically significant difference, suggesting that the model with all terrain features provides a better prediction than the one using slope alone.
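A sketch of the comparison:

```r
roc.test(roc_full, roc_slope, method = "delong")  # DeLong test for two ROC curves
```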

I hope that this module provided an informative introduction to machine learning in R using the randomForest package. In the next module, we will explore the caret package, which provides a consistent interface to a wide variety of machine learning algorithms in R.
