5  Accuracy Assessment

5.1 Topics Covered

  1. Calculating accuracy assessment metrics using the yardstick package
  2. Regression metrics (mean square error, root mean square error, mean absolute error, and R-squared)
  3. Multiclass classification metrics (confusion matrix, overall accuracy, producer’s accuracy/recall, user’s accuracy/precision, F1-score, and macro-averaging)
  4. Single-threshold binary assessment metrics (confusion matrix, precision, recall, F1-score, negative predictive value, and specificity)
  5. Multi-threshold binary assessment metrics (receiver operating characteristic curve, precision-recall curve, and area under the curve)
  6. Map image classification efficacies via the micer package

5.2 Introduction

In order to validate models, it is important to assess their performance on a new dataset. A model that performs well on the training set may be overfit and fail to generalize to new data. In this chapter, we explore assessment metrics for both classification and regression with a focus on the yardstick package, which is part of the tidymodels metapackage. We also explore the micer package, which provides a large set of aggregated and class-level assessment metrics alongside map image classification efficacy (MICE).

5.3 Regression

To experiment with regression metrics, we make use of a hypothetical dataset that includes a known measurement, “truth”, and predictions from three models: “m1”, “m2”, and “m3”. There are a total of 156 samples included.

fldPth <- "gslrData/chpt5/data/"

regIn <- read_csv(str_glue("{fldPth}regression_data.csv"))

There are a variety of metrics used to assess regression models. Here, we focus on the most commonly used measures: mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE). MSE is simply the mean of the squared residuals. A residual is the difference between the reference value and the predicted value. Since the residuals are squared, the units of MSE are the square of the units of the variable being predicted. For example, if you are estimating mass in kilograms, the units of MSE would be kg². RMSE is simply the square root of MSE; as a result, the associated units are the same as the units of the variable being predicted. Lastly, MAE is the mean of the absolute value of the residuals. As with RMSE, the units are the same as those of the dependent variable.

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \]

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]

The code block below shows how to calculate RMSE and MAE using yardstick. Both functions require providing the reference value as the truth argument and the estimated value as the estimate argument. Note also that there are two forms of these functions. For example, RMSE can be calculated using rmse() and rmse_vec(). The “_vec()” version returns the measure as a vector or scalar, while the other version returns a tibble in which the “.metric” column names the metric, the “.estimator” column notes the calculation method, and the “.estimate” column provides the metric value.

rmse_vec(truth=regIn$truth, 
         estimate=regIn$m1) |> 
  round(3)
[1] 0.37
mae_vec(truth=regIn$truth,
        estimate=regIn$m1) |> 
  round(3)
[1] 0.296
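
MSE is not calculated in the block above; however, since MSE is the square of RMSE, it can be obtained by squaring the RMSE or by computing the mean of the squared residuals directly. The brief sketch below demonstrates both approaches for the first model.

# MSE computed directly from the residuals.
mean((regIn$truth - regIn$m1)^2) |> round(3)

# Equivalently, square the RMSE returned by yardstick.
rmse_vec(truth=regIn$truth, estimate=regIn$m1)^2 |> round(3)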

It is also common to calculate the R-squared measure, which quantifies the proportion of the total variance in the dependent variable explained by the model. The total variance is defined as the sum of the squared differences between each observed value and the mean value, while the unexplained variance is the sum of the squared residuals, or the differences between the actual and predicted values. Subtracting the residual sum of squares from the total sum of squares provides an estimate of the explained variance. Dividing this result by the total variance yields a value between zero and one, where zero indicates that no variance is explained and one indicates that all of the variance is explained.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}, \quad \text{where} \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i. \]

The code block below demonstrates how to calculate R-squared using yardstick. For the first model results, the metric suggests that ~77% of the variance in the dependent variable is being explained by the model.

rsq_vec(truth=regIn$truth, 
        estimate=regIn$m1) |> 
  round(3)
[1] 0.771
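
Note that yardstick also provides rsq_trad(), which implements the sum-of-squares definition given above; rsq() is based on the squared correlation between the reference and predicted values, so the two can differ slightly for poorly calibrated models. The sketch below applies the formula directly and compares it against rsq_trad_vec().

# Traditional R-squared computed directly from the sums of squares.
ssRes <- sum((regIn$truth - regIn$m1)^2)
ssTot <- sum((regIn$truth - mean(regIn$truth))^2)
round(1 - ssRes/ssTot, 3)

# The rsq_trad_vec() function implements this same definition.
rsq_trad_vec(truth=regIn$truth, estimate=regIn$m1) |> round(3)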

To compare the three models, we now calculate RMSE and R-squared for each then bind the results into a single tibble. Generally, the results suggest that the first model is providing the strongest performance.

rmse1 <- rmse(regIn, truth=truth, estimate=m1)$.estimate |> round(3)
rmse2 <- rmse(regIn, truth=truth, estimate=m2)$.estimate |> round(3)
rmse3 <- rmse(regIn, truth=truth, estimate=m3)$.estimate |> round(3)

rsq1 <- rsq(regIn, truth=truth, estimate=m1)$.estimate |> round(3)
rsq2 <- rsq(regIn, truth=truth, estimate=m2)$.estimate |> round(3)
rsq3 <- rsq(regIn, truth=truth, estimate=m3)$.estimate |> round(3)

regDF <- tibble(Model = c("M1", "M2", "M3"),
                RMSE = c(rmse1, rmse2, rmse3),
                RSquared = c(rsq1, rsq2, rsq3))

regDF |> gt()
Model RMSE RSquared
M1 0.370 0.771
M2 0.380 0.758
M3 0.383 0.755

The yardstick package supports the calculation of a range of regression metrics including mean signed deviation (MSD), mean percentage error (MPE), concordance correlation coefficient (CCC), ratio of performance to interquartile range (RPIQ), and ratio of performance to deviation (RPD).
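
Each of these follows the same truth/estimate pattern used above, as the brief sketch below illustrates for the first model.

# Additional regression metrics for the first model.
msd_vec(truth=regIn$truth, estimate=regIn$m1)   # mean signed deviation
mpe_vec(truth=regIn$truth, estimate=regIn$m1)   # mean percentage error
ccc_vec(truth=regIn$truth, estimate=regIn$m1)   # concordance correlation coefficient
rpiq_vec(truth=regIn$truth, estimate=regIn$m1)  # ratio of performance to interquartile range
rpd_vec(truth=regIn$truth, estimate=regIn$m1)   # ratio of performance to deviation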

5.4 Multiclass Classification

To experiment with accuracy assessment metrics, we will use two example classification results for the EuroSat dataset. We begin by reading in the data. For both results, the predictions are stored in the “pred” column while the reference labels are stored in the “ref” column. Numeric codes are used as opposed to class names, so the columns are read in as a numeric data type; as a result, it is necessary to convert them to factors.

result1 <- read_csv(str_glue("{fldPth}assessmentR1.csv"))
result2 <- read_csv(str_glue("{fldPth}assessmentR2.csv"))

result1$ref <- as.factor(result1$ref)
result1$pred <- as.factor(result1$pred)

result2$ref <- as.factor(result2$ref)
result2$pred <- as.factor(result2$pred)

To make our results easier to interpret, we next use the fct_recode() function from the forcats package to recode the numeric codes to more meaningful text labels. Although this is not strictly necessary, we generally find it easier to use class names that are more meaningful and interpretable.

result1$ref <- fct_recode(result1$ref,
                          "AnnCrp" = "0",
                          "Frst" = "1",
                          "HrbVg" = "2",
                          "Highwy" = "3",
                          "Indst" = "4",
                          "Pstr" = "5",
                          "PrmCrp" = "6",
                          "Resid" = "7",
                          "Rvr" = "8",
                          "SL" = "9")

result1$pred <- fct_recode(result1$pred,
                           "AnnCrp" = "0",
                           "Frst" = "1",
                           "HrbVg" = "2",
                           "Highwy" = "3",
                           "Indst" = "4",
                           "Pstr" = "5",
                           "PrmCrp" = "6",
                           "Resid" = "7",
                           "Rvr" = "8",
                           "SL" = "9")

result2$ref <- fct_recode(result2$ref,
                          "AnnCrp" = "0",
                          "Frst" = "1",
                          "HrbVg" = "2",
                          "Highwy" = "3",
                          "Indst" = "4",
                          "Pstr" = "5",
                          "PrmCrp" = "6",
                          "Resid" = "7",
                          "Rvr" = "8",
                          "SL" = "9")

result2$pred <- fct_recode(result2$pred,
                           "AnnCrp" = "0",
                           "Frst" = "1",
                           "HrbVg" = "2",
                           "Highwy" = "3",
                           "Indst" = "4",
                           "Pstr" = "5",
                           "PrmCrp" = "6",
                           "Resid" = "7",
                           "Rvr" = "8",
                           "SL" = "9")

5.4.1 Confusion Matrix

We begin our analysis by creating confusion matrices for each prediction using the conf_mat() function from yardstick. The confusion matrices are formatted such that the columns represent the reference labels and the rows represent the predictions, which is the standard in remote sensing.

Values populating the main diagonal represent correct classifications while those off the diagonal represent incorrect predictions. Off-diagonal counts in a specific column represent omission errors relative to the class associated with that column. For example, in the first classification 6 Annual Crop samples were incorrectly predicted to the Forest class, which represents omission error relative to the Annual Crop class. Counts in off-diagonal cells within a row represent commission errors relative to the class associated with the row. In other words, they represent samples that were incorrectly included in the class associated with the row. For example, 4 Permanent Crop samples were incorrectly included in the Annual Crop class in the first classification.

One of the key benefits of the confusion matrix is the ability to quantify not just the overall error rate but the sources of error, or which classes are most commonly confused.

In remote sensing, we generally configure a confusion matrix such that the columns represent the reference labels and the rows represent the predictions. This may not be the case for all contingency tables, which can lead to calculating user’s and producer’s accuracies incorrectly. So, it is important to make sure the table is formatted as expected.

conf_mat(result1, truth=ref, estimate=pred)
          Truth
Prediction AnnCrp Frst HrbVg Highwy Indst Pstr PrmCrp Resid Rvr  SL
    AnnCrp    254    0     3      1     0    0      4     0   0   0
    Frst        6  300    12      0     0    4      0     0   0   0
    HrbVg       4    0   236      0     1    2     10     1   0   0
    Highwy      7    0     3    242    10    1      1     7   1   0
    Indst       0    0     1      2   227    0      0    11   0   0
    Pstr        6    0    22      0     0  192      6     0   2   0
    PrmCrp     22    0    16      2     1    1    226     5   0   0
    Resid       0    0     3      1    10    0      1   276   0   0
    Rvr         0    0     3      2     1    0      2     0 247   3
    SL          1    0     1      0     0    0      0     0   0 356
conf_mat(result2, truth=ref, estimate=pred)
          Truth
Prediction AnnCrp Frst HrbVg Highwy Indst Pstr PrmCrp Resid Rvr  SL
    AnnCrp    274    0     2      2     0    1      4     0   0   0
    Frst        1  289     3      0     0    1      0     0   0   0
    HrbVg       5    1   282      2     1    7     16     1   1   0
    Highwy      2    0     0    231     5    0      6     3   2   0
    Indst       0    0     1      3   236    0      1     1   1   0
    Pstr        3   10     2      1     0  188      2     0   1   0
    PrmCrp     14    0     4      1     0    3    219     1   0   0
    Resid       1    0     2      7     8    0      1   294   0   0
    Rvr         0    0     4      3     0    0      1     0 243   1
    SL          0    0     0      0     0    0      0     0   2 358

It is important to distinguish between a sample confusion matrix and an estimated population confusion matrix. The population confusion matrix is the confusion matrix for the entire map, and generally is the statistical information of interest. However, it is usually impractical to identify the correct value of every pixel in the population, and thus most confusion matrices are based on a sample, from which the population confusion matrix is estimated. The sample confusion matrix itself is only a direct estimate of the population confusion matrix if the sampling from the map was based on simple random sampling. Note that if stratified sampling is used, with sampling probabilities that vary by class (e.g., to produce a “balanced” dataset), then this information needs to be incorporated in the analysis for estimating the population confusion matrix. One way of understanding the difference between the sample and estimated population confusion matrix is to consider that only in the estimated population confusion matrix is the sum of the values in the table for a particular reference class (i.e. a column total) an estimate of the proportion of that class within the landscape being mapped.
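
As a minimal sketch of this idea, assume the assessment samples were selected by stratifying on the mapped (predicted) class, with simple random sampling within each stratum, and that the proportion of the map occupied by each class is known. The wMap vector below is a hypothetical set of mapped-area proportions used only for illustration; rescaling each row of the sample confusion matrix by these weights yields an estimated population confusion matrix expressed as proportions.

# Hypothetical mapped-area proportions for each class (the rows/strata); in practice
# these come from the map being assessed, not from the sample.
wMap <- c(AnnCrp=0.20, Frst=0.25, HrbVg=0.10, Highwy=0.05, Indst=0.05,
          Pstr=0.08, PrmCrp=0.07, Resid=0.10, Rvr=0.04, SL=0.06)

cmSample <- table(Prediction=result1$pred, Truth=result1$ref)

# Each cell becomes the mapped proportion of the row's class multiplied by that
# row's within-stratum sample proportion.
cmPop <- sweep(cmSample/rowSums(cmSample), 1, wMap[rownames(cmSample)], "*")

# Column sums now estimate the landscape proportion of each reference class.
colSums(cmPop)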

5.4.2 Multiclass Assessment Metrics

A variety of assessment metrics can be generated from the confusion matrix. For each class, it is possible to calculate measures associated with omission and commission errors. Producer’s accuracy or recall for a class is calculated by dividing the correct count by the column total while user’s accuracy or precision is calculated by dividing the correct count by the row total. Note that it is common to report these metrics as proportions instead of percentages.

\[ \text{Recall}_i = \frac{C_{ii}}{\sum_{j=1}^{k} C_{ji}} \]

\[ \text{Precision}_i = \frac{C_{ii}}{\sum_{j=1}^{k} C_{ij}} \]

Here, \(C_{ij}\) is the count of samples predicted as class \(i\) that belong to reference class \(j\) (i.e., row \(i\), column \(j\) of the confusion matrix) and \(k\) is the number of classes.

To summarize:

  • Producer’s Accuracy = Recall = 1 - Omission Error
  • User’s Accuracy = Precision = 1 - Commission Error

The overall accuracy is simply the count correct divided by the sum of the entire table. It is sometimes reported as a proportion and sometimes as a percentage.

\[ \text{Accuracy} = \frac{\sum_{i=1}^{k} C_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} C_{ij}} \]
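
As a quick sketch, overall accuracy can be computed directly from a confusion matrix built with base R's table(), mirroring the equation above.

# Overall accuracy: the sum of the diagonal (correct counts) divided by the total count.
cm1 <- table(Prediction=result1$pred, Truth=result1$ref)
sum(diag(cm1))/sum(cm1)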

It is possible to aggregate class-level producer’s accuracy/recall or class-level user’s accuracy/precision to a single metric using micro- or macro-averaging. Micro-averaging is actually equivalent to overall accuracy, so it is not necessary to calculate and report additional metrics calculated using this method. Macro-averaging is conducted by calculating the metrics separately for each class then taking the average such that each class has equal weight in the aggregated metric regardless of the relative abundances of the classes in the dataset or landscape of interest. If desired, it is also possible to incorporate class weights such that each class is not equally weighted.

\[ \text{Macro-Averaged Class Aggregated Precision} = \frac{1}{k} \sum_{i=1}^{k} \text{Precision}_i \]

\[ \text{Macro-Averaged Class Aggregated Recall} = \frac{1}{k} \sum_{i=1}^{k} \text{Recall}_i \]

Lastly, the F1-score is the harmonic mean of the precision and recall for a class. For macro-averaging, the metric is calculated separately for each class and then averaged. Note that this metric is monotonically related to the intersection-over-union (IoU) metric, so it is not necessary to calculate and report both.

\[ \text{Macro-Averaged Class Aggregated F1-Score} = \frac{1}{k} \sum_{i=1}^{k} \left( 2 \times \frac{\text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \right) \]

The code below demonstrates how to calculate these metrics using the yardstick package. The truth argument references the column associated with the reference label, in this case the “ref” column, while the estimate argument references the prediction, in this case the “pred” column. These functions return tibbles as opposed to single values. As noted above for the regression metrics, the “.metric” column names the metric, the “.estimator” column notes the calculation method (for example, micro- vs. macro-averaging), and the “.estimate” column provides the metric value. In the code, we are extracting out only the value.

Lastly, we merge the results into a single tibble and print the results, which suggest that the second model, “Result 2”, is generally outperforming the other model for this task.

Since micro-averaged, class aggregated precision, recall, and F1-score are equivalent to each other and also to overall accuracy, there is no need to report these when overall accuracy is reported.

acc1 <- accuracy(result1, truth=ref, estimate=pred)$.estimate |> round(3)
recall1 <- recall(result1, truth=ref, estimate=pred, estimator="macro")$.estimate |> round(3)
precision1 <- precision(result1, truth=ref, estimate=pred, estimator="macro")$.estimate |> round(3)
fscore1 <- f_meas(result1, truth=ref, estimate=pred, estimator="macro")$.estimate |> round(3)

acc2 <- accuracy(result2, truth=ref, estimate=pred)$.estimate |> round(3)
recall2 <- recall(result2, truth=ref, estimate=pred, estimator="macro")$.estimate |> round(3)
precision2 <- precision(result2, truth=ref, estimate=pred, estimator="macro")$.estimate |> round(3)
fscore2 <- f_meas(result2, truth=ref, estimate=pred, estimator="macro")$.estimate |> round(3)

metricsDF <- data.frame(Model = c("Result 1", "Result 2"),
                        Accuracy=c(acc1, acc2), 
                        F1Score = c(fscore1,fscore2), 
                        Precision = c(precision1, precision2),
                        Recall = c(recall1, recall2))
metricsDF |> gt()
Model Accuracy F1Score Precision Recall
Result 1 0.926 0.923 0.923 0.927
Result 2 0.947 0.945 0.945 0.945
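
As noted above, class weights can also be incorporated when aggregating. The brief sketch below uses yardstick's "macro_weighted" estimator, which weights each class-level value by the frequency of that class in the reference data.

# Frequency-weighted aggregation of class-level recall and precision.
recall(result1, truth=ref, estimate=pred, estimator="macro_weighted")
precision(result1, truth=ref, estimate=pred, estimator="macro_weighted")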

There are also versions of these functions that accept a vector of reference labels and a vector of predictions and return just the metric value as opposed to a tibble. These functions have the same names but are suffixed with “_vec”. They are demonstrated below.

accuracy_vec(truth=result1$ref, estimate=result1$pred) |> round(3)
[1] 0.926
recall_vec(truth=result1$ref, estimate=result1$pred, estimator="macro") |> round(3)
[1] 0.927
precision_vec(truth=result1$ref, estimate=result1$pred, estimator="macro") |> round(3)
[1] 0.923
f_meas_vec(truth=result1$ref, estimate=result1$pred, estimator="macro") |> round(3)
[1] 0.923
accuracy_vec(truth=result1$ref, estimate=result2$pred) |> round(3)
[1] 0.947
recall_vec(truth=result1$ref, estimate=result2$pred, estimator="macro") |> round(3)
[1] 0.945
precision_vec(truth=result1$ref, estimate=result2$pred, estimator="macro") |> round(3)
[1] 0.945
f_meas_vec(truth=result1$ref, estimate=result2$pred, estimator="macro") |> round(3)
[1] 0.945
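
As a check on the macro-averaging logic, the sketch below recomputes macro-averaged recall by hand: per-class recall is the diagonal count divided by the reference (column) total, averaged with equal class weights. The result should match the macro-averaged recall reported above.

# Per-class recall from the confusion matrix, then an equally weighted average.
cm1 <- table(Prediction=result1$pred, Truth=result1$ref)
mean(diag(cm1)/colSums(cm1)) |> round(3)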

Several other classification metrics are provided by yardstick, including the Matthews correlation coefficient (MCC) and the Kappa statistic. However, we generally discourage the use of Kappa.

5.5 Binary Classification

Binary classification, in which there is a positive, presence, or true case and a negative, absence, or false case, commonly uses different terminology, so we now discuss binary classification problems separately.

5.5.1 Single-Threshold

Single-threshold metrics are those calculated from the “hard” classification based on a specific decision threshold, as opposed to the predicted probability of class membership, sometimes referred to as a “soft” label or prediction. For example, a sample with a predicted probability greater than 0.5 of occurring in the positive case may be coded to “positive” while samples that do not meet this criterion are coded to “negative”. Generally, the user can specify the desired threshold. At a specific threshold, each sample can be labeled as one of the following:

  • True Positive (TP): referenced to the positive class and correctly labeled as positive
  • True Negative (TN): referenced to the negative class and correctly labeled as negative
  • False Positive (FP): referenced to the negative class and incorrectly labeled as positive (commission error relative to the positive class and omission error relative to the negative class)
  • False Negative (FN): referenced to the positive class and incorrectly labeled as negative (omission error relative to the positive class and commission error relative to the negative class)

From the count of TP, TN, FP, and FN samples, several metrics can be calculated. Recall, also referred to as sensitivity, considers the omission error relative to the positive case. It is equivalent to 1 - positive case omission error or the positive case producer’s accuracy. Precision quantifies commission error relative to the positive case and is equivalent to 1 - positive case commission error or the positive case user’s accuracy. Negative predictive value (NPV) is equivalent to 1 - negative case commission error or the user’s accuracy for the negative case while specificity is equivalent to 1 - negative case omission error or producer’s accuracy for the negative case.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Negative Predictive Value (NPV)} = \frac{TN}{TN + FN} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \]

As noted above, the F1-score is the harmonic mean of precision and recall. It is generally only calculated for the positive case as opposed to averaging across the positive and negative case for a binary classification problem. Note that overall accuracy for a binary classification is the same as that for a multiclass classification: the total correct, TP + TN, divided by the total number of samples.

\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN} \]

To experiment with a binary classification result, we will use a dataset generated by the authors and associated with the problem of predicting whether or not a location is a landslide based on topographic characteristics. In the code block below, we read in the data and recode the labels to “Class 1” and “Class 2”. In this case, “Class 2” represents slope failures while “Class 1” represents locations that are not slope failures, or the background class. When calculating binary metrics, it is important to set the estimator argument to "binary" and specify which class represents the positive case. Since “Class 2” represents the positive case for this problem, we set the event_level argument to "second". We also generate a confusion matrix using the conf_mat() function.

For the landslide case, “Class 2”, there are more FPs in comparison to FNs. In other words, precision (~0.85) is lower than recall (~0.91).

biIn <- read_csv(str_glue("{fldPth}binary_data.csv"))
names(biIn) <- c("truth", "predicted", "Class1", "Class2")
biIn$truth <- fct_recode(biIn$truth,
                         "Class1" = "not",
                         "Class2" = "slopeD")
biIn$predicted <- fct_recode(biIn$predicted,
                             "Class1" = "not",
                             "Class2" = "slopeD")

yardstick::recall_vec(truth=biIn$truth, estimator="binary", event_level="second", estimate=biIn$predicted) |> round(3)
[1] 0.906
yardstick::precision_vec(truth=biIn$truth, estimator="binary", event_level="second", estimate=biIn$predicted) |> round(3)
[1] 0.853
yardstick::npv_vec(truth=biIn$truth, estimator="binary", event_level="second", estimate=biIn$predicted) |> round(3)
[1] 0.9
yardstick::spec_vec(truth=biIn$truth, estimator="binary", event_level="second", estimate=biIn$predicted) |> round(3)
[1] 0.844
yardstick::f_meas_vec(truth=biIn$truth, estimator="binary", event_level="second", estimate=biIn$predicted) |> round(3)
[1] 0.879
conf_mat(data=biIn, truth=truth, estimate=predicted)
          Truth
Prediction Class1 Class2
    Class1    422     47
    Class2     78    453
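
As a similar check for the binary case, the sketch below extracts the TP, FP, FN, and TN counts for the positive case ("Class2") from the confusion matrix and recomputes the single-threshold metrics by hand.

# Pull the four cell counts out of the confusion matrix.
cmB <- table(Prediction=biIn$predicted, Truth=biIn$truth)
TP <- cmB["Class2", "Class2"]
FP <- cmB["Class2", "Class1"]
FN <- cmB["Class1", "Class2"]
TN <- cmB["Class1", "Class1"]

round(TP/(TP + FN), 3)        # recall/sensitivity
round(TP/(TP + FP), 3)        # precision
round(TN/(TN + FN), 3)        # negative predictive value
round(TN/(TN + FP), 3)        # specificity
round((TP + TN)/sum(cmB), 3)  # overall accuracy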

5.5.2 Multi-Threshold

It is also possible to generate metrics using multiple thresholds calculated from the “soft” predictions as opposed to a single threshold. Two common methods are the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve. These curves can be aggregated to a single metric by integrating the area under the curve. These measures are called the area under the ROC curve (AUCROC) and the area under the PR curve (AUCPR).

The ROC curve considers sensitivity (i.e., recall) and specificity at multiple thresholds. In other words, it considers 1 - omission error relative to the positive and negative cases, respectively. Omission errors are not impacted by class imbalance, so the curve and the associated AUCROC metric will not change as class proportions change. Since both sensitivity and specificity are scaled from zero to one, the maximum possible area under the curve is one.

The PR curve considers precision and recall. In other words, it considers 1 - commission error and 1 - omission error relative to the positive case at multiple thresholds. Precision is impacted by the relative class proportions. As a result, PR curves and the associated AUCPR metric will vary as class proportions change. PR curves are generally considered more meaningful than ROC curves when there is significant class imbalance or when the positive or background case makes up a much smaller or much larger percentage of the landscape. As with ROC curves, since both precision and recall are scaled from zero to one, the maximum possible area under the curve is one.

Both ROC and PR curves and associated metrics can be calculated for multi-class problems. However, that is not demonstrated here.

The code below generates an ROC curve using the binary classification results. Note that this requires providing the “hard” correct label as the truth argument and the predicted probability for the positive case. It is also important to specify which label represents the positive case using the event_level argument.

roc_curve(biIn, Class2, truth=truth, event_level="second") |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()

The AUCROC is calculated using the roc_auc() function. For these data, we obtain an AUCROC of 0.945. Note that this measure is generally interpreted like a US university grading scale:

  • .9 to 1 = A (excellent)
  • .8 to .9 = B (good)
  • .7 to .8 = C (acceptable)
  • <.7 = D/F (unacceptable)

However, an acceptable AUCROC will depend on the specific use case. Also, an AUCROC of 0.5 is generally interpreted as indicating a model that is performing no better than a random guess.

roc_auc(biIn, Class2, truth=truth, event_level="second") |> gt()
.metric .estimator .estimate
roc_auc binary 0.945172

The next two code blocks demonstrate the generation of a PR curve and the associated AUCPR metric. The syntax is very similar to that used for the ROC curve and the AUCROC metric. AUCPR values are often interpreted similarly to AUCROC values, like a college grading scale, with the caveat that changes in the relative class proportions will change the results.

pr_curve(biIn, Class2, truth=truth, event_level="second") |>
  ggplot(aes(x = recall, y = precision)) +
  geom_path() +
  coord_equal() +
  theme_bw()

pr_auc(biIn, Class2, truth=truth, event_level="second") |> gt()
.metric .estimator .estimate
pr_auc binary 0.9333488

5.6 Map Image Classification Efficacies (MICE)

MICE adjusts the accuracy rate relative to a random classification baseline. Only the proportions from the reference labels are considered, as opposed to the proportions from the reference and predictions, as is the case for the Kappa statistic. Due to documented issues with the Kappa statistic, its use in remote sensing and thematic map accuracy assessment is being discouraged. This metric was introduced in the following papers:

Shao, G., Tang, L. and Zhang, H., 2021. Introducing image classification efficacies. IEEE Access, 9, pp.134809-134816.

Tang, L., Shao, J., Pang, S., Wang, Y., Maxwell, A., Hu, X., Gao, Z., Lan, T. and Shao, G., 2024. Bolstering Performance Evaluation of Image Segmentation Models with Efficacy Metrics in the Absence of a Gold Standard. IEEE Transactions on Geoscience and Remote Sensing.

It can be calculated using the micer package. The metrics calculated depend on whether the problem is framed as a multiclass or binary classification. Multiclass must be used when more than two classes are differentiated. If two classes are differentiated, binary should be used if there is a clear positive, presence, or foreground class and a clear negative, absence, or background class. If this is not the case, multiclass mode is more meaningful.

The mice() function is used to calculate a set of metrics by providing vectors or a data frame/tibble containing columns of reference and predicted class labels. Alternatively, the miceCM() function can be used to calculate the metrics from a confusion matrix table where the columns represent the correct labels and the rows represent the predicted labels.

Results are returned as a list object. For a multiclass classification, the following objects are returned:

  • Mappings = class names
  • confusionMatrix = confusion matrix where columns represent the reference data and rows represent the classification result
  • referenceCounts = count of samples in each reference class
  • predictionCounts = count of predictions in each class
  • overallAccuracy = overall accuracy
  • MICE = map image classification efficacy
  • usersAccuracies = class-level user’s accuracies (1 - commission error)
  • CTBICEs = classification-total-based image classification efficacies (adjusted user’s accuracies)
  • producersAccuracies = class-level producer’s accuracies (1 - omission error)
  • RTBICEs = reference-total-based image classification efficacies (adjusted producer’s accuracies)
  • F1Scores = class-level harmonic means of user’s and producer’s accuracies
  • F1Efficacies = F1-score efficacies
  • macroPA = class-aggregated, macro-averaged producer’s accuracy
  • macroRTBICE = class-aggregated, macro-averaged reference-total-based image classification efficacy
  • macroUA = class-aggregated, macro-averaged user’s accuracy
  • macroCTBICE = class-aggregated, macro-averaged classification-total-based image classification efficacy
  • macroF1 = class-aggregated, macro-averaged F1-score
  • macroF1Efficacy = class-aggregated, macro-averaged F1 efficacy

The results are demonstrated below for our multiclass example.

mice(result1$ref, 
     result2$pred,
     multiclass=TRUE)
$Mappings
 [1] "AnnCrp" "Frst"   "HrbVg"  "Highwy" "Indst"  "Pstr"   "PrmCrp" "Resid" 
 [9] "Rvr"    "SL"    

$confusionMatrix
         Reference
Predicted AnnCrp Frst HrbVg Highwy Indst Pstr PrmCrp Resid Rvr  SL
   AnnCrp    274    0     2      2     0    1      4     0   0   0
   Frst        1  289     3      0     0    1      0     0   0   0
   HrbVg       5    1   282      2     1    7     16     1   1   0
   Highwy      2    0     0    231     5    0      6     3   2   0
   Indst       0    0     1      3   236    0      1     1   1   0
   Pstr        3   10     2      1     0  188      2     0   1   0
   PrmCrp     14    0     4      1     0    3    219     1   0   0
   Resid       1    0     2      7     8    0      1   294   0   0
   Rvr         0    0     4      3     0    0      1     0 243   1
   SL          0    0     0      0     0    0      0     0   2 358

$referenceCounts
AnnCrp   Frst  HrbVg Highwy  Indst   Pstr PrmCrp  Resid    Rvr     SL 
   300    300    300    250    250    200    250    300    250    359 

$predictionCounts
AnnCrp   Frst  HrbVg Highwy  Indst   Pstr PrmCrp  Resid    Rvr     SL 
   283    294    316    249    243    207    242    313    252    360 

$overallAccuracy
[1] 0.9474447

$MICE
[1] 0.9414529

$usersAccuracies
   AnnCrp      Frst     HrbVg    Highwy     Indst      Pstr    PrmCrp     Resid 
0.9681978 0.9829932 0.8924050 0.9277108 0.9711934 0.9082125 0.9049586 0.9392971 
      Rvr        SL 
0.9642857 0.9944444 

$CTBICEs
   AnnCrp      Frst     HrbVg    Highwy     Indst      Pstr    PrmCrp     Resid 
0.9643176 0.9809181 0.8792770 0.9205069 0.9683227 0.9010378 0.8954874 0.9318905 
      Rvr        SL 
0.9607266 0.9936133 

$producersAccuracies
   AnnCrp      Frst     HrbVg    Highwy     Indst      Pstr    PrmCrp     Resid 
0.9133333 0.9633333 0.9400000 0.9240000 0.9440000 0.9400000 0.8760000 0.9800000 
      Rvr        SL 
0.9720000 0.9972145 

$RTBICEs
   AnnCrp      Frst     HrbVg    Highwy     Indst      Pstr    PrmCrp     Resid 
0.9027588 0.9588595 0.9326792 0.9164263 0.9384194 0.9353099 0.8636429 0.9775597 
      Rvr        SL 
0.9692097 0.9967977 

$f1Scores
   AnnCrp      Frst     HrbVg    Highwy     Indst      Pstr    PrmCrp     Resid 
0.9399657 0.9730639 0.9155844 0.9258517 0.9574036 0.9238329 0.8902439 0.9592169 
      Rvr        SL 
0.9681275 0.9958275 

$f1Efficacies
   AnnCrp      Frst     HrbVg    Highwy     Indst      Pstr    PrmCrp     Resid 
0.9325234 0.9697634 0.9051911 0.9184621 0.9531365 0.9178540 0.8792770 0.9541790 
      Rvr        SL 
0.9649495 0.9952030 

$macroPA
[1] 0.9449881

$macroRTBUCE
[1] 0.9391663

$macroUA
[1] 0.9453699

$macroCTBICE
[1] 0.9396098

$macroF1
[1] 0.9451789

$macroF1Efficacy
[1] 0.939388

For a binary classification, the following objects are returned within a list object:

  • Mappings = class names
  • confusionMatrix = confusion matrix where columns represent the reference data and rows represent the classification result
  • referenceCounts = count of samples in each reference class
  • predictionCounts = count of predictions in each class
  • positiveCase = name or mapping for the positive case
  • overallAccuracy = overall accuracy
  • MICE = map image classification efficacy
  • Precision = precision (1 - commission error relative to positive case)
  • precisionEfficacy = precision efficacy
  • NPV = negative predictive value (1 - commission error relative to negative case)
  • npvEfficacy = negative predictive value efficacy
  • Recall = recall (1 - omission error relative to positive case)
  • recallEfficacy = recall efficacy
  • specificity = specificity (1 - omission error relative to negative case)
  • specificityEfficacy = specificity efficacy
  • f1Score = harmonic mean of precision and recall
  • f1Efficacy = F1-score efficacy

The results are demonstrated below for our binary example. Note that the positiveIndex argument controls which class is treated as the positive case; here, the first class, “Class 1”, is used.

mice(biIn$truth,
     biIn$predicted,
     multiclass=FALSE,
     positiveIndex=1)
$Mappings
[1] "Class1" "Class2"

$confusionMatrix
         Reference
Predicted Class1 Class2
   Class1    422     47
   Class2     78    453

$referenceCounts
Class1 Class2 
   500    500 

$predictionCounts
Class1 Class2 
   469    531 

$positiveCase
[1] "Class1"

$overallAccuracy
[1] 0.875

$mice
[1] 0.74999

$Precision
[1] 0.8997868

$precisionEfficacy
[1] 0.7995695

$NPV
[1] 0.8531073

$npvEfficacy
[1] 0.7062088

$Recall
[1] 0.844

$recallEfficacy
[1] 0.6879937

$Specificity
[1] 0.906

$specificityEfficacy
[1] 0.8119962

$f1Score
[1] 0.871001

$f1ScoreEfficacy
[1] 0.7395972

It is also possible to estimate confidence intervals for the aggregated metrics using the miceCI() function, which uses a bootstrap method. In the example below, 1,000 replicates are generated, 70% of the data are sampled with replacement in each replicate, and a 95% confidence interval is estimated. For each aggregated metric, the mean, median, and lower and upper confidence interval values are estimated. The example below generates the results for our example multiclass classification problem.

ciResultsMC <- miceCI(rep=1000,
                      frac=.7,
                      result1$ref, 
                      result1$pred,
                      lowPercentile=0.025,
                      highPercentile=0.975,
                      multiclass=TRUE)

ciResultsMC |> gt()
metric mean median low.ci high.ci
overallAccuracy 0.9260989 0.9259451 0.9145520 0.9373382
MICE 0.9176343 0.9174724 0.9047982 0.9301307
macroPA 0.9269741 0.9269344 0.9157374 0.9385132
macroRTBUCE 0.9187337 0.9186983 0.9064102 0.9313971
macroUA 0.9228861 0.9227111 0.9112341 0.9344061
macroCTBICE 0.9150543 0.9148550 0.9023634 0.9278416
macroF1 0.9249245 0.9247341 0.9136142 0.9359322
macroF1Efficacy 0.9168892 0.9167383 0.9042231 0.9292069

Lastly, it is possible to statistically compare two classifications using a paired t-test with bootstrapping to estimate the sampling distribution. This is implemented with the miceCompare() function. It is based on a method outlined in Chapter 11 of Tidy Modeling with R.

For the multiclass classification, the difference between the model results is statistically significant at a 0.05 significance level, or a 95% confidence level (p-value < 2.2e-16).

miceCompare(ref=result1$ref, 
            result1=result1$pred, 
            result2=result2$pred, 
            reps=1000, 
            frac=.7)

    Paired t-test

data:  resultsDF$mice1 and resultsDF$mice2
t = -109.34, df = 999, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -0.02382740 -0.02298724
sample estimates:
mean difference 
    -0.02340732 

5.7 Concluding Remarks

This chapter concludes the first section of the text. We will now move on to explore different supervised learning algorithms. Throughout Sections 2 and 3 of the text, we will implement the accuracy assessment methods and metrics introduced in this chapter.

5.8 Questions

  1. For the yardstick package, explain the difference between the regular and “_vec” version of a metric (for example, recall() vs. recall_vec()).
  2. Explain the terms true positive, true negative, false positive, and false negative using an example.
  3. Explain the difference between omission and commission error using an example.
  4. If the relative proportion of classes in a multiclass classification change, will class-level recall change? Explain your reasoning.
  5. If the relative proportion of classes in a multiclass classification change, will class-level precision change? Explain your reasoning.
  6. Explain the difference between micro- and macro-averaging methods when calculating a single, aggregated metric from class-level metrics.
  7. Explain the difference between a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve.
  8. Explain the concordance correlation coefficient (CCC) and how it is different from the Pearson correlation coefficient.

5.9 Exercises

Multiclass Classification

You have been provided with accuracy assessment results for a land cover mapping project for the entire state of West Virginia as a point layer (chpt5.gpkg/multiClassWV) in the exercise folder for the chapter. The “GrndTruth” column represents the reference labels while the “pred” column represents the predictions. Complete the following tasks:

  1. Read in the point features
  2. Use dplyr to re-code the class indices for both the “GrndTruth” and “pred” columns as follows:
    • 1 = “Forest”
    • 2 = “LowVeg”
    • 3 = “Barren”
    • 4 = “Water”
    • 5 = “Impervious”
    • 6 = “Mix Dev”
  3. Make sure all character columns are treated as factors
  4. Generate a confusion matrix using yardstick
  5. Calculate overall accuracy using yardstick
  6. Calculate class-aggregated, macro-average recall, precision, and F1-score using yardstick
  7. Write a paragraph describing the assessment results. Which classes were most confused? What classes were generally well differentiated?

Binary Classification

You have been provided with accuracy assessment results for the extraction of surface mine disturbance from historic topographic maps where two classes are differentiated: “Background” and “Mine”. “Mine” is considered the positive case while “Background” is the negative, background, or absence case. Three files have been provided in the exercise folder for the chapter:

  1. chpt5.gpkg/valPnts: point location at which to perform validation
  2. topo_pred2.tif: raster of predictions (0 = “Background”; 255 = “Mine”)
  3. topo_ref.tif: raster of reference labels (0 = “Background”; 255 = “Mine”)

Perform the following tasks:

  1. Extract the prediction and reference indices to the point locations
  2. In the resulting table, re-code the numeric codes to class names
  3. Make sure all character columns are treated as factors
  4. Calculate a confusion matrix using yardstick
  5. Calculate overall accuracy, precision, recall, and F1-score using yardstick (make sure the correct class is treated as the positive case)
  6. Write a paragraph describing the sources of error in the results. Make sure to distinguish between commission and omission errors.