Assessment Metrics
Model Assessment
Terminology and Metrics
Accuracy assessment is an important component of the modeling process. Specifically, it is important to assess your model against withheld data. Assessing the model relative to the training samples can be misleading due to the issue of overfitting. As a result, using a withheld, randomized, unbiased test set to assess the final model is important for quantifying how well the model generalizes to new data, which is generally the point of creating a model. Before we begin, here are a few notes on key terminology:
- Training Set: samples used to update model parameters. These samples are used to calculate the loss and guide the model parameter updates as the mini-batches are processed.
- Validation Set: data used to assess the model at the end of each training epoch during the training process. It is common to select the model that provides the best results relative to the validation data, as measured using an assessment metric. The model that provides the best performance for the training data may not be the best model due to overfitting.
- Test Set: data used to assess the final selected model.
In this example, we are primarily interested in the test set. Once a final model is generated, it can be used to make predictions for a test set. The test set labels can then be compared to the predictions to generate a confusion matrix and associated assessment metrics.
An example confusion matrix is shown below. geodl uses the confusion matrix configuration standard within the field of remote sensing, where the columns represent the reference labels and the rows represent the predictions. In the example confusion matrix, 50 samples were examples of Class A and were correctly predicted as Class A. 8 samples were examples of Class A but were incorrectly predicted as Class B. 10 samples were examples of Class B but were incorrectly labeled as Class A. Relative to Class A, the 8 samples that were mislabeled as Class B represent omission errors: they were incorrectly omitted from Class A. In contrast, the 10 reference samples that were from Class B but incorrectly predicted as Class A are examples of commission error relative to Class A: they were incorrectly included in Class A.
In short, the confusion matrix not only quantifies the overall amount of error but also differentiates the types of errors. This allows analysts and users to understand which classes were most commonly confused or which classes were most difficult to map or differentiate.
\[ \begin{array}{c|ccc} & \text{Reference A} & \text{Reference B} & \text{Reference C} \\ \hline \text{Prediction A} & 50 & 10 & 5 \\ \text{Prediction B} & 8 & 45 & 7 \\ \text{Prediction C} & 2 & 5 & 60 \\ \end{array} \]
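A confusion matrix in this configuration can be tabulated directly from vectors of predicted and reference labels. As a minimal sketch in base R (the labels below are made up for illustration; table() is a standard base function):

```r
# Hypothetical reference and predicted labels for seven samples
ref  <- c("A", "A", "B", "B", "B", "C", "C")
pred <- c("A", "B", "B", "B", "A", "C", "C")

# Rows = predictions, columns = reference labels,
# matching the remote sensing convention described above
cm <- table(Predicted = pred, Reference = ref)
print(cm)
```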
Overall accuracy represents the percentage or proportion of the total samples that were correctly predicted.
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
Outside of an aggregated overall accuracy, it can be useful to report class-level metrics. In remote sensing 1 - omission error for a class is generally termed producer’s accuracy while 1 - commission error is termed user’s accuracy. When the confusion matrix is configured such that the reference labels define the columns and the predictions define the rows, producer’s accuracy is calculated as the number correct for the class divided by the column total while user’s accuracy is calculated as the number correct for the class divided by the associated row total.
\[ \text{Producer's Accuracy (PA)}_i = \frac{\text{Number of Correctly Classified Samples of Class } i}{\text{Total Number of Reference Samples of Class } i} \]
\[ \text{User's Accuracy (UA)}_i = \frac{\text{Number of Correctly Classified Samples of Class } i}{\text{Total Number of Samples Classified as Class } i} \]
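Applied to the example confusion matrix above (rows = predictions, columns = reference labels), these formulas can be sketched in base R using only base functions:

```r
# Example confusion matrix: rows = predictions, columns = reference labels
cm <- matrix(c(50, 10, 5,
               8, 45, 7,
               2, 5, 60),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference = c("A", "B", "C")))

# Overall accuracy: correct samples (the diagonal) over all samples
oa <- sum(diag(cm)) / sum(cm)

# Producer's accuracy: correct per class over the column (reference) totals
pa <- diag(cm) / colSums(cm)

# User's accuracy: correct per class over the row (prediction) totals
ua <- diag(cm) / rowSums(cm)
```

For this matrix, overall accuracy is 155/192 ≈ 0.807, and, for example, the producer's accuracy for Class A is 50/60 ≈ 0.833 while its user's accuracy is 50/65 ≈ 0.769.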
For a binary classification problem where one class is the positive case or case of interest and the other class is the background or negative case, it is common to use different terminology. The confusion matrix below represents a binary confusion matrix. Here is an explanation of the associated terminology:
- TP: True Positive (positive case sample that was correctly predicted as positive)
- TN: True Negative (negative case sample that was correctly predicted as negative)
- FP: False Positive (negative case sample that was incorrectly labeled as positive)
- FN: False Negative (positive case sample that was incorrectly labeled as negative)
\[ \begin{array}{c|cc} & \text{Reference Positive} & \text{Reference Negative} \\ \hline \text{Prediction Positive} & TP & FP \\ \text{Prediction Negative} & FN & TN \\ \end{array} \]
From the binary confusion matrix, we can calculate overall accuracy as stated above. Overall accuracy can also be defined relative to TP, TN, FN, and FP counts as follows:
\[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]
At the class-level, recall for each class can be calculated using the TP and FN counts. Recall is equivalent to class-level producer’s accuracy and quantifies 1 - omission error relative to the positive case.
\[ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \]
Class-level precision quantifies 1 - commission error and is equivalent to user’s accuracy for the positive case. It is calculated using the TP and FP counts.
\[ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \]
Precision and recall can be combined to a single class-level metric as the F1-score, which is the harmonic mean of precision and recall. It can be stated relative to precision and recall or relative to TP, FP, and FN counts.
\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
\[ F1 = 2 \cdot \frac{TP}{2TP + FP + FN} \]
For the negative or background class, specificity represents 1 - omission error while negative predictive value (NPV) represents 1 - commission error.
\[ \text{Specificity} = \frac{TN}{TN + FP} \] \[ \text{NPV} = \frac{TN}{TN + FN} \]
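As a sketch in base R, all of these binary metrics follow directly from the four counts. The counts used here are taken from the binary mining example reported later in this article:

```r
# Binary confusion matrix counts (positive case = "Mine")
TP <- 158; TN <- 4820; FP <- 2; FN <- 20

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
recall      <- TP / (TP + FN)          # 1 - omission error, positive class
precision   <- TP / (TP + FP)          # 1 - commission error, positive class
specificity <- TN / (TN + FP)          # 1 - omission error, negative class
npv         <- TN / (TN + FN)          # 1 - commission error, negative class
f1          <- 2 * precision * recall / (precision + recall)

round(c(OA = accuracy, Recall = recall, Precision = precision,
        Specificity = specificity, NPV = npv, F1 = f1), 4)
```

These values match the $Mets row reported for the binary example (OA = 0.9956, Recall = 0.8876, Precision = 0.9875, Specificity = 0.9996, NPV = 0.9959, F1 = 0.9349).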
Lastly, it might be of interest to aggregate class-level metrics. There are three general ways to do this:
- macro-averaging: calculate the metric separately for each class, then take the average such that each class is equally weighted in the aggregated metric.
- micro-averaging: aggregate TP, TN, FP, and FN counts and calculate a single metric such that more abundant classes have a larger weight in the final calculation.
- weighted macro-averaging: calculate a macro-average with user-specified class weights such that the classes are not equally weighted.
For a multiclass problem, micro-averaged user’s accuracy (precision) and producer’s accuracy (recall) are equivalent to each other and also equivalent to overall accuracy and the micro-averaged F1-score. So, there is no need to calculate micro-averaged metrics if overall accuracy is reported. Instead, it makes more sense to report overall accuracy, macro-averaged class metrics, and non-aggregated class metrics. This is the method used within geodl.
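The equivalence described above can be checked directly. This is a minimal sketch in base R using the example confusion matrix from earlier in this article (rows = predictions, columns = reference labels):

```r
# Example multiclass confusion matrix: rows = predictions, columns = reference
cm <- matrix(c(50, 10, 5,
               8, 45, 7,
               2, 5, 60),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference = c("A", "B", "C")))

# Macro-averaged user's accuracy: per-class precision, then an unweighted mean
macro_ua <- mean(diag(cm) / rowSums(cm))

# Micro-averaged user's accuracy: pool TP and FP counts across classes first.
# Every off-diagonal sample is a false positive for whichever class it was
# predicted to, so pooled TP + FP equals the total sample count.
micro_tp <- sum(diag(cm))
micro_fp <- sum(cm) - micro_tp
micro_ua <- micro_tp / (micro_tp + micro_fp)

# Overall accuracy equals the micro-averaged user's accuracy
oa <- sum(diag(cm)) / sum(cm)
all.equal(micro_ua, oa)
```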
Example 1: Multiclass Classification
In this first example, geodl’s assessPnts() function is used to calculate assessment metrics for a multiclass classification from a table or at point locations. The “ref” column represents the reference labels while the “pred” column represents the predictions. The mappings parameter allows for providing more meaningful class names and is especially useful when classes are represented using numeric codes.
For a multiclass assessment, the following are returned:
- $Classes: class names
- $referenceCounts: count of samples per class in the reference data
- $predictionCounts: count of samples per class in the predictions
- $confusionMatrix: confusion matrix
- $aggMetrics: aggregated assessment metrics (OA = overall accuracy, macroF1 = macro-averaged class aggregated F1-score, macroPA = macro-averaged class aggregated producer's accuracy or recall, and macroUA = macro-averaged class aggregated user's accuracy or precision)
- $userAccuracies: class-level user's accuracies or precisions
- $producerAccuracies: class-level producer's accuracies or recalls
- $F1Scores: class-level F1-scores
mcIn <- readr::read_csv("data/geodl/tables/multiClassExample.csv")

myMetrics <- assessPnts(reference=mcIn$ref,
                        predicted=mcIn$pred,
                        multiclass=TRUE,
                        mappings=c("Barren",
                                   "Forest",
                                   "Impervous",
                                   "Low Vegetation",
                                   "Mixed Developed",
                                   "Water"))

print(myMetrics)
$Classes
[1] "Barren" "Forest" "Impervous" "Low Vegetation"
[5] "Mixed Developed" "Water"
$referenceCounts
Barren Forest Impervous Low Vegetation Mixed Developed
163 20807 426 3182 520
Water
200
$predictionCounts
Barren Forest Impervous Low Vegetation Mixed Developed
194 21440 281 2733 484
Water
166
$confusionMatrix
Reference
Predicted Barren Forest Impervous Low Vegetation Mixed Developed Water
Barren 75 7 59 46 1 6
Forest 13 20585 62 617 142 21
Impervous 10 8 196 33 22 12
Low Vegetation 63 138 34 2413 84 1
Mixed Developed 1 64 75 72 270 2
Water 1 5 0 1 1 158
$aggMetrics
OA macroF1 macroPA macroUA
1 0.9367 0.6991 0.6629 0.7395
$userAccuracies
Barren Forest Impervous Low Vegetation Mixed Developed
0.3866 0.9601 0.6975 0.8829 0.5579
Water
0.9518
$producerAccuracies
Barren Forest Impervous Low Vegetation Mixed Developed
0.4601 0.9893 0.4601 0.7583 0.5192
Water
0.7900
$f1Scores
Barren Forest Impervous Low Vegetation Mixed Developed
0.4202 0.9745 0.5545 0.8159 0.5378
Water
0.8634
Example 2: Binary Classification
A binary classification can also be assessed using the assessPnts() function with a table or point locations. For a binary classification, the multiclass parameter should be set to FALSE. For a binary case, the $Classes, $referenceCounts, $predictionCounts, and $confusionMatrix objects are also returned; however, the $aggMetrics object is replaced with $Mets, which stores the following metrics: overall accuracy, recall, precision, specificity, negative predictive value (NPV), and F1-score. For binary cases, the second class is assumed to be the positive case.
bIn <- readr::read_csv("data/geodl/tables/binaryExample.csv")

myMetrics <- assessPnts(reference=bIn$ref,
                        predicted=bIn$pred,
                        multiclass=FALSE,
                        mappings=c("Not Mine", "Mine"))

print(myMetrics)
$Classes
[1] "Not Mine" "Mine"
$referenceCounts
Negative Positive
4822 178
$predictionCounts
Negative Positive
4840 160
$ConfusionMatrix
Reference
Predicted Negative Positive
Negative 4820 20
Positive 2 158
$Mets
OA Recall Precision Specificity NPV F1Score
Mine 0.9956 0.8876 0.9875 0.9996 0.9959 0.9349
Example 3: Extract Raster Data at Points
Before using the assessPnts() function, you may need to extract predictions into a table. This example demonstrates how to extract reference and prediction numeric codes from raster grids at point locations. Note that it is important to make sure all data layers use the same projection or coordinate reference system. The extract() function from the terra package can be used to extract raster cell values at point locations.
Once data are extracted, the assessPnts() tool can be used with the resulting table. It may be useful to recode the class numeric codes to more meaningful names beforehand.
pntsIn <- terra::vect("data/geodl/topoResult/topoPnts.shp")
refG <- terra::rast("data/geodl/topoResult/topoRef.tif")
predG <- terra::rast("data/geodl/topoResult/topoPred.tif")

pntsIn2 <- terra::project(pntsIn, terra::crs(refG))

refIsect <- terra::extract(refG, pntsIn2)
predIsect <- terra::extract(predG, pntsIn2)

resultsIn <- data.frame(ref=as.factor(refIsect$topoRef),
                        pred=as.factor(predIsect$topoPred))

resultsIn$ref <- forcats::fct_recode(resultsIn$ref,
                                     "Not Mine" = "0",
                                     "Mine" = "1")

resultsIn$pred <- forcats::fct_recode(resultsIn$pred,
                                      "Not Mine" = "0",
                                      "Mine" = "1")

myMetrics <- assessPnts(reference=resultsIn$ref,
                        predicted=resultsIn$pred,
                        multiclass=FALSE,
                        mappings=c("Not Mine", "Mine"))

print(myMetrics)
$Classes
[1] "Not Mine" "Mine"
$referenceCounts
Negative Positive
4822 178
$predictionCounts
Negative Positive
4840 160
$ConfusionMatrix
Reference
Predicted Negative Positive
Negative 4820 20
Positive 2 158
$Mets
OA Recall Precision Specificity NPV F1Score
Mine 0.9956 0.8876 0.9875 0.9996 0.9959 0.9349
Example 4: Use Raster Grids as Opposed to Point Locations
The assessRaster() function allows for calculating assessment metrics from reference and prediction categorical raster grids as opposed to point locations or tables. Note that the grids being compared should have the same spatial extent, coordinate reference system, and number of rows and columns of cells.
refG <- terra::rast("data/geodl/topoResult/topoRef.tif")
predG <- terra::rast("data/geodl/topoResult/topoPred.tif")

refG2 <- terra::crop(terra::project(refG, predG), predG)

myMetrics <- assessRaster(reference = refG2,
                          predicted = predG,
                          multiclass = FALSE,
                          mappings = c("Not Mine", "Mine"))

print(myMetrics)
$Classes
[1] "Not Mine" "Mine"
$referenceCounts
Negative Positive
36022015 1301194
$predictionCounts
Negative Positive
36146932 1176277
$ConfusionMatrix
Reference
Predicted Negative Positive
Negative 35994704 152228
Positive 27311 1148966
$Mets
OA Recall Precision Specificity NPV F1Score
Mine 0.9952 0.883 0.9768 0.9992 0.9958 0.9275