Machine Learning with tidymodels
Objectives
- Understand the components of tidymodels
- Understand the machine learning workflow using tidymodels
Overview
A variety of packages provide access to machine learning algorithms and methods in R. Here, we will explore tidymodels. Similar to the tidyverse, tidymodels is not a single package but a set of packages that work together, follow a common philosophy, and share a common syntax. In effect, the component packages of tidymodels bring machine learning into the tidyverse, so if you like the tidyverse, you will probably like tidymodels.
To install all of the component packages, you can simply install tidymodels, which acts as a meta-package to install and load the other components (a short install-and-load sketch is provided below). Here is a brief explanation of each component package.
- parsnip: used to define models using a common syntax; makes it easier to experiment with different algorithms
- workflows: provides tools to create workflows, or the desired machine learning pipeline, including pre-processing, training, and post-processing
- yardstick: a one-stop-shop to implement a variety of assessment metrics for regression and classification problems
- tune: for hyperparameter tuning
- dials: for managing parameters and tuning grids
- rsample: tools for data splitting, partitioning, and resampling
- recipes: tools for pre-processing and feature space engineering
This website provides examples and tutorials along with links to the documentation for all the core packages discussed above.
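For example, here is a minimal install-and-load sketch (install.packages() only needs to be run once). Loading tidymodels attaches the core component packages listed above along with several tidyverse packages.
#Install the meta-package once; this also installs the component packages
install.packages("tidymodels")
#Load tidymodels; this attaches parsnip, workflows, yardstick, tune, dials, rsample, recipes, and others
library(tidymodels)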
To experiment with tidymodels, we will first use the same data explored in the random forest module, followed by a new example relating to forest community type classification. As in the random forest module, we will step through the process of creating a prediction to map the likelihood or probability of palustrine forested/palustrine scrub/shrub (PFO/PSS) wetland occurrence using a variety of terrain variables. The provided training.csv table contains 1,000 examples of PFO/PSS wetlands and 1,000 examples that are not PFO/PSS wetlands. The validation.csv file contains a separate, non-overlapping set of 2,000 examples split equally between PFO/PSS wetlands and not PFO/PSS wetlands. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.
Here is a brief description of all the provided variables.
- class: PFO/PSS (wet) or not (not)
- cost: Distance from water bodies weighted by slope
- crv_arc: Topographic curvature
- crv_plane: Plan curvature
- crv_pro: Profile curvature
- ctmi: Compound topographic moisture index (CTMI)
- diss_5: Terrain dissection (11 x 11 window)
- diss_10: Terrain dissection (21 x 21 window)
- diss_20: Terrain dissection (41 x 41 window)
- diss_25: Terrain dissection (51 x 51 window)
- diss_a: Terrain dissection average
- rough_5: Terrain roughness (11 x 11 window)
- rough_10: Terrain roughness (21 x 21 window)
- rough_20: Terrain roughness (41 x 41 window)
- rough_25: Terrain roughness (51 x 51 window)
- rough_a: Terrain roughness average
- slp_d: Topographic slope in degrees
- sp_5: Slope position (11 x 11 window)
- sp_10: Slope position (21 x 21 window)
- sp_20: Slope position (41 x 41 window)
- sp_25: Slope position (51 x 51 window)
- sp_a: Slope position average
The data provided in this example are a subset of the data used in this publication:
Maxwell, A.E., T.A. Warner, and M.P. Strager, 2016. Predicting palustrine wetland probability using random forest machine learning and digital elevation data-derived terrain variables, Photogrammetric Engineering & Remote Sensing, 82(6): 437-447. https://doi.org/10.1016/S0099-1112(16)82038-8.
Example 1: Palustrine Wetland Occurrence Probabilistic Model (Binary Classification)
I begin by reading in the required packages. Here, tidymodels is used to load the component packages discussed above along with some other packages (e.g., ggplot2, purrr, and dplyr).
library(tidymodels)
Next, I set the working directory and load in the data tables. I do not want to use the last column as a predictor variable in the model, so I drop it using the select() function from dplyr.
train <- read.csv("D:/mydata/tidymodels/training.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
test <- read.csv("D:/mydata/tidymodels/validation.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
train <- select(train, -23)
test <- select(test, -23)
The example below demonstrates a workflow for tuning and training the model. Here is the process:
- I define a model specification (tune_spec) using parsnip that indicates what model I will use (random forest), what hyperparameter values to use or tune (use 500 trees and tune the mtry parameter), what implementation of the model to use (random forest as implemented in the ranger package), and the mode ("classification").
- I then define a workflow (rf_wf) using the workflow() function from the workflows package and specifying the model to use (tune_spec) and the formula (the class column predicted by all other variables in the table).
- In order to tune the hyperparameters, I define data folds (folds) and configure ten-fold cross validation using the rsample package. I also specify that folds should be stratified by the “class” field in order to maintain data balance between the two classes within each fold.
- I then set up the hyperparameter tuning process using tune_grid() from the tune package (tune_res) by specifying the workflow, folds, and the number of values to try (20).
Running all of these steps tests a set of candidate mtry values for the defined model using a grid search and ten-fold cross validation.
tune_spec <- rand_forest(
  mtry = tune(),
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(class ~ .)

set.seed(123)
folds <- vfold_cv(train, v = 10, strata = class)

set.seed(123)
tune_res <- tune_grid(
  rf_wf,
  resamples = folds,
  grid = 20
)
Once the tuning process has executed, I write the results to a tibble using collect_metrics() from the tune package. Two metrics were calculated: overall accuracy and area under the receiver operating characteristic curve (AUC). The mean and standard error are provided for each hyperparameter value tested, calculated from the withheld fold of each of the ten model runs. I then filter out the AUC results and plot them using ggplot2.
metSet <- tune_res %>% collect_metrics()
metSet
# A tibble: 34 x 7
    mtry .metric  .estimator  mean     n std_err .config
   <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>
 1     3 accuracy binary     0.966    10 0.00293 Preprocessor1_Model01
 2     3 roc_auc  binary     0.992    10 0.00162 Preprocessor1_Model01
 3    16 accuracy binary     0.961    10 0.00221 Preprocessor1_Model02
 4    16 roc_auc  binary     0.992    10 0.00143 Preprocessor1_Model02
 5    20 accuracy binary     0.96     10 0.00197 Preprocessor1_Model03
 6    20 roc_auc  binary     0.992    10 0.00148 Preprocessor1_Model03
 7     4 accuracy binary     0.966    10 0.00245 Preprocessor1_Model04
 8     4 roc_auc  binary     0.991    10 0.00178 Preprocessor1_Model04
 9    12 accuracy binary     0.964    10 0.00211 Preprocessor1_Model05
10    12 roc_auc  binary     0.992    10 0.00161 Preprocessor1_Model05
# ... with 24 more rows
tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  ggplot(aes(mtry, mean)) +
  geom_point() +
  labs(x = "mtry", y = "AUC")
To obtain the best performing mtry value from the ten-fold cross validation and grid search, I use the select_best() function from tune and specify that I want the model with the highest AUC.
To define the model with this selected mtry hyperparameter, I use the finalize_model() function from tune and pass it the result from select_best().
best_auc <- select_best(tune_res, "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)
Lastly, I finalize the workflow (finalize_workflow() from tune) using the finalized model and fit the model to the entire training data set.
final_wf <- rf_wf %>% finalize_workflow(final_rf)
final_fit <- final_wf %>% fit(train)
Once a trained model is obtained, it can be assessed using the withheld testing or validation set or used to predict to new data. Here, I predict the testing data set and create a new object (wet_pred) that contains the binary classification or "hard" prediction, the probabilities for each class (wetland and not wetland), and the correct classification from the original testing data.
I then use the yardstick package to calculate the AUC for the ROC curve using the correct classification and the predicted probability of wetland occurrence. Note that the second factor level ("wet") is treated as the positive case; factor levels are in alphabetical order unless explicitly reordered. I also calculate overall accuracy using the actual class and the predicted class.
wet_pred <- predict(final_fit, new_data = test) %>%
  bind_cols(predict(final_fit, test, type = "prob")) %>%
  bind_cols(test %>% select(class))

wet_pred %>% roc_auc(truth = class, .pred_wet, event_level = "second")
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary          1.00

wet_pred %>% accuracy(truth = class, .pred_class)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.996
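Although not shown in the original workflow, the same wet_pred table could also be used to plot the full ROC curve rather than only computing the AUC. Here is a minimal sketch using roc_curve() from yardstick with autoplot(); as above, the second factor level ("wet") is treated as the positive case.
#Sketch: plot the ROC curve from the prediction table (assumes wet_pred from above)
wet_pred %>%
  roc_curve(truth = class, .pred_wet, event_level = "second") %>%
  autoplot()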
Example 2: Cover Type Classification (Multiclass Classification)
In this second example, we will explore a multiclass classification problem using the Covertype Data Set, obtained from the UCI Machine Learning Repository. This data set provides a total of 581,012 instances. The goal is to differentiate seven forest community types using several environmental variables, including elevation, topographic aspect, topographic slope, horizontal distance to streams, vertical distance to streams, horizontal distance to roadways, hillshade values at 9AM, noon, and 3PM, horizontal distance to fire points, and a wilderness area designation, which is a binary, nominal variable. The seven community types are:
- 1 = Spruce/Fir
- 2 = Lodgepole Pine
- 3 = Ponderosa Pine
- 4 = Cottonwood/Willow
- 5 = Aspen
- 6 = Douglas Fir
- 7 = Krummholz
As usual, I begin by reading in the required packages and the data table. Note that a series of soil types are also included as predictor variables in the original data, but we will not use those here. Next, I use the forcats package to recode the numeric codes to the actual community names.
I then use dplyr to count the number of records or examples from each community type, which reveals significant class imbalance. To speed up the tuning and training process, I then select 1,500 samples from each class using a stratified random sample. Using all available samples might improve the results, but it would take much longer to execute; subsetting is used here only to keep the example manageable.
#Get libraries
library(tidymodels)
library(readr)
library(forcats)
#read data
coverT <- read_csv("D:/mydata/tidymodels/covertype.csv")
coverT$cover <- as.factor(coverT$cover)
coverT$wilderness <- as.factor(coverT$wilderness)
coverT$cover <- fct_recode(coverT$cover, SpruceFir="1", LodgepolePine="2", PonderosaPine="3", CottonwoodWillow="4", Aspen="5", DouglasFir="6", Krummholz="7")

coverT %>% group_by(cover) %>% count()
# A tibble: 7 x 2
# Groups:   cover [7]
  cover                 n
  <fct>             <int>
1 SpruceFir        211840
2 LodgepolePine    283301
3 PonderosaPine     35754
4 CottonwoodWillow   2747
5 Aspen              9493
6 DouglasFir        17367
7 Krummholz         20510

#Subset data
coverT2 <- coverT %>% group_by(cover) %>% sample_n(1500)
coverT2 %>% group_by(cover) %>% count()
# A tibble: 7 x 2
# Groups:   cover [7]
  cover                n
  <fct>            <int>
1 SpruceFir         1500
2 LodgepolePine     1500
3 PonderosaPine     1500
4 CottonwoodWillow  1500
5 Aspen             1500
6 DouglasFir        1500
7 Krummholz         1500
Next, I use the parsnip package to define a random forest implementation using the "ranger" engine in "classification" mode. Note the use of tune() to indicate that I plan to tune the mtry parameter. Since the data have not already been split into separate training and testing sets, I use the initial_split() function from rsample to define training and testing partitions, followed by the training() and testing() functions to create new tables for each split. Also, I am using a stratified random sample to maintain a balanced data set in both partitions.
#Define Model
rf_model <- rand_forest(mtry=tune(), trees=500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

#Set split
set.seed(42)
cover_split <- initial_split(coverT2, prop=.75, strata=cover)
cover_train <- cover_split %>% training()
cover_test <- cover_split %>% testing()
I would like to normalize all continuous predictor variables and create a dummy variable from the single nominal predictor variable ("wilderness"). I define these transformations within a recipe using functions available in the recipes package. This also requires defining the formula and the input data. Here, I reference only the training set, as the test set should not be introduced to the model at this point; doing so could result in a biased assessment of model performance later. The all_numeric(), all_nominal(), and all_outcomes() functions are used to select the columns on which to apply the desired transformations.
The model and pre-processing recipe are then combined into a workflow.
#Define recipe
cover_recipe <- recipe(cover~., data=cover_train) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())

#Define workflow
cover_wf <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(cover_recipe)
I then use yardstick and the metric_set() function to define the desired assessment metrics, in this case only overall accuracy. To prepare for hyperparameter tuning using five-fold cross validation, I define folds using the vfold_cv() function from rsample. Similar to the training and testing split above, the folds are stratified by community type to maintain class balance within each fold. Lastly, I define the mtry values to test using the dials package. It would be better to test more values and perhaps tune additional hyperparameters; however, I am trying to keep the execution time of this example short (a sketch of such an expansion is provided after the code block below).
#Define metrics
my_mets <- metric_set(accuracy)

#Define folds
set.seed(42)
cover_folds <- vfold_cv(cover_train, v=5, strata=cover)

#Define tuning grid
rf_grid <- grid_regular(mtry(range = c(1, 12)),
                        levels = 6)
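As a hedged sketch of the expanded tuning mentioned above (an illustration, not the settings used in this example), additional yardstick metrics could be tracked and the min_n hyperparameter could be tuned alongside mtry. The model specification, tuning grid, and metric set might then look roughly like this.
#Sketch: tune min_n in addition to mtry and track several metrics (illustrative settings only)
rf_model2 <- rand_forest(mtry=tune(), min_n=tune(), trees=500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

#Regular grid over both hyperparameters (4 levels each = 16 combinations)
rf_grid2 <- grid_regular(mtry(range = c(1, 12)),
                         min_n(range = c(2, 20)),
                         levels = 4)

#Track overall accuracy, kappa, and AUC during tuning
my_mets2 <- metric_set(accuracy, kap, roc_auc)
The rest of the workflow (tune_grid(), select_best(), finalize_workflow()) would stay the same, although the larger grid would take longer to evaluate.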
Now that the model, pre-processing steps, workflow, metrics, data partitions, and mtry values to test have been defined, I tune the model using tune_grid() from the tune package. Note that this may take several minutes. Specifically, I make sure to use the defined workflow so that the pre-processing steps defined in the recipe are applied. Once completed, I collect the resulting metrics for each mtry value and each fold using collect_metrics() from tune. The summarize parameter is set to FALSE because I want the results for every fold, as opposed to aggregated results. I then calculate the minimum, median, and maximum overall accuracy across folds for each mtry value using dplyr and plot the results using ggplot2.
The best mtry value is selected using the select_best() function from tune. The workflow is then finalized, and last_fit() from tune trains the model on the full training set and evaluates it on the withheld test set. The collect_predictions() function from tune is used to obtain the class prediction for each sample in the withheld test set.
rf_tuning <- cover_wf %>% tune_grid(resamples=cover_folds, grid=rf_grid, metrics=my_mets)

tune_result <- rf_tuning %>% collect_metrics(summarize=FALSE) %>%
  filter(.metric == 'accuracy') %>%
  group_by(mtry) %>%
  summarize(min_acc = min(.estimate),
            median_acc = median(.estimate),
            max_acc = max(.estimate))

ggplot(tune_result, aes(y=median_acc, x=mtry))+
  geom_point()+
  geom_errorbar(aes(ymin=min_acc, ymax=max_acc))+
  theme_bw()+
  labs(x="mtry Parameter", y="Accuracy")

best_rf_model <- rf_tuning %>% select_best(metric="accuracy")

final_cover_wf <- cover_wf %>% finalize_workflow(best_rf_model)

final_cover_fit <- final_cover_wf %>%
  last_fit(split=cover_split, metrics=my_mets) %>%
  collect_predictions()
Lastly, I use the conf_mat() function from the yardstick package to obtain a multiclass confusion matrix from the reference and predicted classes for each sample in the withheld testing set. Passing the matrix to summary() will provide a set of assessment metrics calculated from the error matrix.
final_cover_fit %>% conf_mat(truth=cover, estimate=.pred_class)
                  Truth
Prediction         SpruceFir LodgepolePine PonderosaPine CottonwoodWillow Aspen DouglasFir Krummholz
  SpruceFir              284            81             0                0     1          0         6
  LodgepolePine           56           243             0                0     7          2         2
  PonderosaPine            0             9           314                8     5         42         0
  CottonwoodWillow         0             0            17              360     0         10         0
  Aspen                    7            21             4                0   358          4         1
  DouglasFir               0            16            40                7     4        317         0
  Krummholz               28             5             0                0     0          0       366
final_cover_fit %>% conf_mat(truth=cover, estimate=.pred_class) %>% summary()
# A tibble: 13 x 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.854
 2 kap                  multiclass     0.830
 3 sens                 macro          0.854
 4 spec                 macro          0.976
 5 ppv                  macro          0.851
 6 npv                  macro          0.976
 7 mcc                  multiclass     0.830
 8 j_index              macro          0.830
 9 bal_accuracy         macro          0.915
10 detection_prevalence macro          0.143
11 precision            macro          0.851
12 recall               macro          0.854
13 f_meas               macro          0.851
Concluding Remarks
Similar to the tidyverse, tidymodels is a very powerful framework for creating machine learning workflows and experimental environments using a common philosophy and syntax.
Although this introduction was brief and there are many more components that could be discussed, this can serve as a starting point for continued learning and experimentation. Check out the tidymodels website for additional examples and tutorials. For comparing or experimenting with multiple algorithms, I recommend checking out the workflowsets package.
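As a brief, hedged sketch of what workflowsets offers (an illustration that reuses objects from Example 2 and is not covered above), workflow_set() crosses a set of preprocessing recipes with a set of parsnip model specifications, and workflow_map() then applies a tuning or fitting function to every combination.
#Sketch: compare multiple models with workflowsets (assumes cover_recipe, rf_model, cover_folds, and my_mets from Example 2)
library(workflowsets)

#An alternative model specification to compare against the random forest (illustrative only)
dt_model <- decision_tree(cost_complexity=tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

#Cross the recipe with both model specifications
cover_set <- workflow_set(preproc = list(base = cover_recipe),
                          models = list(rf = rf_model, tree = dt_model))

#Tune every workflow in the set with the same folds and metrics, then rank the results
cover_set_res <- cover_set %>%
  workflow_map("tune_grid", resamples = cover_folds, grid = 5, metrics = my_mets, seed = 42)
rank_results(cover_set_res, rank_metric = "accuracy")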