Losses and Assessment Metrics

Introduction

Selecting an appropriate loss metric for a specific task is an important consideration when developing a deep learning workflow. The loss metric guides the weight/parameter updates and is the only measure available to the optimization algorithm as it attempts to adjust the weights/parameters and improve the model performance. I like to think of the learning process as stumbling around in the dark, with the loss metric serving as the only means to determine whether the model is moving in the right direction.

All loss metrics must be (1) differentiable and (2) minimized as the model improves or learning progresses. Since the goal is to minimize the loss metric, for classification problems you need to use a measure of error as opposed to a measure of accuracy.
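To make these two requirements concrete, here is a minimal sketch of a single gradient descent step. The one-parameter model, the data values, and the learning rate of 0.05 are all made up for illustration and are not part of this module's data.

```python
import torch

# Hypothetical one-parameter model: prediction = w * x
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])          # generated with a "true" weight of 2
w = torch.tensor(0.5, requires_grad=True)  # deliberately poor starting weight

def mse(weight):
    # Mean of the squared residuals
    return torch.mean((weight * x - y) ** 2)

loss_before = mse(w)
loss_before.backward()      # gradient of the loss with respect to w

with torch.no_grad():
    w -= 0.05 * w.grad      # one update step using an assumed learning rate

loss_after = mse(w)
print(loss_before.item(), loss_after.item())
```

Because the loss is differentiable, backward() can report which direction to move the weight, and because the loss shrinks as the fit improves, the step lowers its value.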

In the first section of this module, we will focus on loss metrics for regression followed by classification. I will then explain how to define custom loss metrics. In the last section, we will explore the torchmetrics package, which offers a wide range of assessment metrics for a variety of scenarios.

This site offers examples of a wide range of custom loss functions: https://www.kaggle.com/code/bigironsphere/loss-function-library-keras-pytorch/notebook.

I begin by importing the required packages including torch, torch.nn, and torch.nn.functional. I also import numpy and pandas since I will be working with arrays and data tables. I will also make use of a few functions from scikit-learn.

import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
import pandas as pd

from sklearn import preprocessing 
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import seaborn as sbn

Next, I read in some regression and classification results so that I don’t have to train models and generate predictions just to experiment with loss functions and assessment metrics. The r1 object will be used to explore regression problems. The “truth” column represents the correct answer while the m1, m2, and m3 columns represent different prediction results. The c1 object represents a multiclass classification and includes predicted probabilities for 9 classes along with the correct class codes. Lastly, the b1 object represents results for a binary classification problem.

r1 = pd.read_csv("data/files/regression_data.csv")
r1.head()

      truth        m1        m2        m3
0  1.425000  1.276554  1.166438  1.173226
1  1.450000  1.716117  1.354679  1.371919
2  1.200000  1.003642  0.897605  0.903885
3  1.852941  1.401898  1.512243  1.497734
4  2.157895  2.326138  2.435581  2.455052

c1 = pd.read_csv("data/files/multiclass.csv")
c1.head()

   Unnamed: 0  asphalt   building   car   ...  shadow   soil   tree   reference
0           1      0.04       0.20  0.41  ...     0.01   0.03   0.02  concrete 
1           2      0.34       0.15  0.01  ...     0.49   0.00   0.00    shadow 
2           3      0.39       0.01  0.01  ...     0.57   0.00   0.01    shadow 
3           4      0.04       0.00  0.00  ...     0.33   0.00   0.61      tree 
4           5      0.40       0.09  0.12  ...     0.14   0.04   0.03   asphalt 

[5 rows x 11 columns]

b1 = pd.read_csv("data/files/binary_data.csv")
b1.head()

  truth predicted  prob_not  prob_fail
0   not       not  0.812483   0.187517
1   not       not  0.984213   0.015787
2   not       not  0.991930   0.008070
3   not       not  0.998299   0.001701
4   not       not  0.948095   0.051905

Regression Loss Metrics

We will begin with a regression example. A common loss metric for regression problems is the mean square error (MSE) loss, which is simply the mean of the squared residuals of all of the samples. As this measure is reduced, it suggests that the differences between the predicted values and actual values are decreasing, or the model is doing a better job.

The MSE loss is provided by PyTorch, so does not need to be built from scratch. Here, I am using the class-based version of the metric, which requires instantiating an instance of the class. I also need to prepare the reference and predicted values. This requires extracting the columns from the DataFrame, converting them to numpy arrays, then converting the numpy arrays to tensors. Since these are simple calculations and not computationally intensive, I will not use a GPU in this module. Once the data are prepared and the loss is instantiated, I obtain the loss measure using the loss metric and the reference and predicted values. The result is a tensor with a single value.

For the m1 results, the MSE loss is 0.1365. In the final cell block in this section, I calculate this metric from scratch and without using the defined loss metric. This is accomplished by (1) differencing the reference and predicted values, (2) taking the square of each difference, and (3) calculating the average. The same result is obtained.

target = torch.tensor(r1['truth'].to_numpy())
pred = torch.tensor(r1['m1'].to_numpy())

loss = nn.MSELoss()
loss(pred, target)

tensor(0.1365, dtype=torch.float64)

mse = torch.mean(torch.square(pred-target))
mse

tensor(0.1365, dtype=torch.float64)
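The class-based form used above also has a functional counterpart in torch.nn.functional. The sketch below uses a few made-up values in place of the r1 columns to show that the two forms agree.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up values standing in for the reference and predicted columns
target = torch.tensor([1.0, 2.0, 3.0, 4.0], dtype=torch.float64)
pred = torch.tensor([1.5, 2.0, 2.0, 4.5], dtype=torch.float64)

mse_class = nn.MSELoss()(pred, target)   # class-based: instantiate, then call
mse_func = F.mse_loss(pred, target)      # functional: call directly

print(mse_class, mse_func)
```

The class-based form is convenient when a loss object is passed around a training loop, while the functional form is handy for one-off calculations.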

Classification Loss Metrics

When performing a binary classification, or differentiating two classes, it is possible to treat the problem the same as a multiclass classification. In such a case the last layer should output logits for both classes. Another option is to output only a single logit, which will be associated with the positive case. Once logits are converted to probabilities, they must sum to 1. So, for a two-class problem, the background or negative class probability can be obtained by simply subtracting the positive class probability from 1. In this section, we will explore the binary cross entropy (BCE) loss, which is appropriate when only the positive case logit is returned.

First, I prepare the data using the LabelEncoder() function from scikit-learn. Since the correct class is presented as a text field, the data must be converted to numeric codes where 0 represents the negative class and 1 represents the positive class. The LabelEncoder() function converts text into numeric codes alphabetically. The data I am using represent predictions of whether or not a location is a landslide. “not” indicates the negative class while “slopeD” represents the positive class. So, based on alphabetical order, “not” is coded to 0 and “slopeD” is coded to 1, which is correct in this case. Again, the positive case must be coded to 1 and the negative case must be coded to 0.

label_encoder = preprocessing.LabelEncoder()
b1['class_code'] = label_encoder.fit_transform(b1['truth'])
b1.head()

  truth predicted  prob_not  prob_fail  class_code
0   not       not  0.812483   0.187517           0
1   not       not  0.984213   0.015787           0
2   not       not  0.991930   0.008070           0
3   not       not  0.998299   0.001701           0
4   not       not  0.948095   0.051905           0
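Since the coding depends on alphabetical order, it is worth confirming which label received which code. The sketch below uses a short made-up list of labels with the two class names described above; the classes_ attribute lists the labels in code order.

```python
from sklearn import preprocessing

# Made-up labels using the two class names described above
labels = ["not", "slopeD", "not", "slopeD"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(labels)

# Position in classes_ is the numeric code: index 0 -> code 0, index 1 -> code 1
print(list(encoder.classes_))
print(list(codes))
```

If alphabetical order would place the positive class first, the codes could instead be assigned explicitly, for example with a dictionary that maps each label to 0 or 1.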

I next convert the reference class numeric codes and positive class probabilities to numpy arrays followed by torch tensors. Note that BCELoss() expects the targets to be floating-point values with the same data type as the predictions, which is double in this case.

I then instantiate the BCELoss() class and feed it the predictions and references to obtain the BCE loss. Note that the BCELoss() implementation in PyTorch expects the data to be converted to probabilities before using the loss metric. If the neural network outputs raw logits, you will need to apply a sigmoid function to obtain probabilities prior to using the loss metric. The resulting loss is 0.3026.

target = torch.tensor(b1['class_code'].to_numpy(), dtype=torch.double)
pred = torch.tensor(b1['prob_fail'].to_numpy())

loss = nn.BCELoss()
loss(pred, target)

tensor(0.3026, dtype=torch.float64)

The cell below shows how to implement the BCE loss from scratch. This requires multiplying the correct value (0 or 1) by the log of the positive class probability and then subtracting the class code from 1 and multiplying by the log of 1 minus the positive class probability. The results are averaged then multiplied by -1.

bce =-1*(torch.mean((target * torch.log(pred)) + ((1 - target)*torch.log(1-pred))))
bce

tensor(0.3026, dtype=torch.float64)
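When a network returns raw logits, PyTorch also provides BCEWithLogitsLoss(), which applies the sigmoid internally. The sketch below uses made-up logits and targets to show that it matches applying a sigmoid followed by BCELoss().

```python
import torch
import torch.nn as nn

# Made-up raw network outputs (logits) and reference codes
logits = torch.tensor([-2.0, 0.5, 3.0, -0.5], dtype=torch.float64)
target = torch.tensor([0.0, 1.0, 1.0, 0.0], dtype=torch.float64)

# Option 1: convert to probabilities first, then use BCELoss()
bce_from_probs = nn.BCELoss()(torch.sigmoid(logits), target)

# Option 2: pass the logits directly to BCEWithLogitsLoss()
bce_from_logits = nn.BCEWithLogitsLoss()(logits, target)

print(bce_from_probs, bce_from_logits)
```

The logits-based version is generally the more numerically stable choice when the model does not end in a sigmoid layer.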

BCELoss() is not appropriate for a multiclass problem or for a binary classification where both the positive and negative class probabilities are predicted. Instead, cross entropy (CE) loss, implemented with CrossEntropyLoss(), should be used. This loss expects the predictions to be provided as logits as opposed to probabilities because it internally applies a log-softmax activation prior to calculating the loss. Since I already have class probabilities in the example data, I will use the negative log likelihood loss, implemented with NLLLoss(), which does not apply a softmax activation to its input. CrossEntropyLoss() is equivalent to a log-softmax activation followed by NLLLoss(), so NLLLoss() on its own expects log-probabilities.

Similar to the binary classification problem, I first convert the class labels into numeric codes based on alphabetical order using the scikit-learn LabelEncoder() function. NLLLoss() expects a prediction for every class; as a result, I extract the columns in the data table that provide the predicted class probabilities. Since these columns are in alphabetical order, the column index matches the class code assigned by LabelEncoder(). This is necessary because the loss uses the class code as an index to look up the prediction for the correct class.

c1.head()

   Unnamed: 0  asphalt   building   car   ...  shadow   soil   tree   reference
0           1      0.04       0.20  0.41  ...     0.01   0.03   0.02  concrete 
1           2      0.34       0.15  0.01  ...     0.49   0.00   0.00    shadow 
2           3      0.39       0.01  0.01  ...     0.57   0.00   0.01    shadow 
3           4      0.04       0.00  0.00  ...     0.33   0.00   0.61      tree 
4           5      0.40       0.09  0.12  ...     0.14   0.04   0.03   asphalt 

[5 rows x 11 columns]

pred = torch.tensor(c1[c1.columns[1:10]].to_numpy())
label_encoder = preprocessing.LabelEncoder()
c1['class_code'] = label_encoder.fit_transform(c1['reference'])

target = torch.tensor(c1['class_code'].to_numpy(), dtype=torch.int64)

print(pred.shape)

torch.Size([507, 9])

print(target.shape)

torch.Size([507])

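Printing the shapes, you can see that the predictions are stored as a 2D tensor in which each row holds the predicted class probabilities for one sample and each column corresponds to a specific class. The probabilities in each row should sum to 1. The shape of the tensor holding the predictions is (507, 9). So, there are 507 samples and 9 predicted probabilities (one for each class).

In contrast, the reference tensor is 1D and consists of the correct class code for each sample. Keep in mind that NLLLoss() expects log-probabilities. Because raw probabilities are passed in the cell below, the value returned is the negative of the mean probability assigned to the correct class, which is why it is negative, rather than a true negative log likelihood. To obtain the CE loss, the log of the probabilities should be taken first, for example with torch.log().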

loss = nn.NLLLoss()
loss(pred, target)

tensor(-0.6677, dtype=torch.float64)
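The sketch below illustrates how NLLLoss() relates to CrossEntropyLoss() when log-probabilities are supplied. The logits are made up, and the small floor (eps) used to avoid taking the log of 0 is my own assumption rather than part of the PyTorch API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 3 samples and 3 classes, plus the correct class codes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3],
                       [-0.5, 0.0, 2.5]], dtype=torch.float64)
target = torch.tensor([0, 1, 2])

probs = F.softmax(logits, dim=1)   # each row sums to 1

eps = 1e-12                        # assumed floor so that log(0) is never evaluated
nll = nn.NLLLoss()(torch.log(probs.clamp(min=eps)), target)

ce = nn.CrossEntropyLoss()(logits, target)   # log-softmax + NLL in one step

# From scratch: mean of -log(probability assigned to the correct class)
manual = -torch.log(probs[torch.arange(3), target]).mean()

print(nll, ce, manual)
```

All three values agree, and the loss is positive, as expected for a negative log likelihood. A floor like eps matters for tables of rounded probabilities, where a class may be listed with a probability of exactly 0.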

Custom Loss Metrics

It is possible to define a custom loss metric. Although functional forms can be defined, I will generate loss metrics by subclassing nn.Module, similar to how I define neural network architectures. The following link provides examples of a variety of custom loss function implementations using both Keras/Tensorflow and Pytorch: https://www.kaggle.com/code/bigironsphere/loss-function-library-keras-pytorch/notebook.

The BCE loss can be negatively impacted by class imbalance. So, it is common to use the Dice loss when there are issues of class imbalance. The Dice loss is calculated as follows:

\[ 1 - \left( \frac{(2 \sum \hat{p}_{\text{TP}}) + \epsilon}{(2 \sum \hat{p}_{\text{TP}}) + \sum \hat{p}_{\text{FN}} + \sum \hat{p}_{\text{FP}} + \epsilon} \right) \]

True positives (TPs) represent samples that are examples of the positive class and were correctly predicted to the positive class whereas true negatives (TNs) are samples from the negative class that are correctly predicted to the negative class. False positives (FPs) are negative samples that are incorrectly predicted to the positive class while false negatives (FNs) are from the positive class but incorrectly predicted to the negative class. Note that class probabilities are used here as opposed to “hard” counts. Also, the Dice loss can be modified for a multiclass problem. The example below is only appropriate for a binary classification where the logit for the positive case is predicted.

In order to implement the Dice loss, I begin by subclassing nn.Module. The loss function accepts input predictions and reference data. I have also included a smoothing parameter to avoid divide by zero issues and to improve numeric stability. A few notes about this implementation:

To make the loss differentiable, class probabilities are used as opposed to counts.
Since the Dice coefficient is a measure of accuracy as opposed to error, Dice loss is 1 minus Dice.
If the user provides raw logits, the fromProbs parameter should be set to False, and the implementation will convert the raw logits to probabilities using a sigmoid function. If the raw logits have already been converted to probabilities, this conversion is skipped.
The smoothing term is added only to the denominator in this implementation, which differs slightly from the formula above, where it appears in both the numerator and denominator.

# https://neptune.ai/blog/pytorch-loss-functions
class DiceLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, targets, smooth=1, fromProbs=True):
        # Convert raw logits to positive class probabilities if needed
        if not fromProbs:
            inputs = torch.sigmoid(inputs)

        # "Soft" counts built from probabilities rather than hard labels
        TP = (inputs * targets).sum()
        FP = ((1 - targets) * inputs).sum()
        FN = (targets * (1 - inputs)).sum()

        # smooth is added to the denominator only
        DICE = (2 * TP) / ((TP + FN) + (TP + FP) + smooth)

        return 1 - DICE

Once the Dice loss is defined, it can be used in the same way as BCELoss(). The result is 0.1790 for the binary classification of landslides.

target = torch.tensor(b1['class_code'].to_numpy(), dtype=torch.double)
pred = torch.tensor(b1['prob_fail'].to_numpy())

loss = DiceLoss()
loss(pred, target, fromProbs=True)

tensor(0.1790, dtype=torch.float64)
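To illustrate why probabilities are used in place of hard counts, the sketch below computes the Dice coefficient both ways for four made-up samples. The helper function, the 0.5 threshold, and the omission of the smoothing term are simplifications of my own.

```python
import torch

# Made-up positive class probabilities and reference codes
probs = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float64)
target = torch.tensor([1.0, 0.0, 1.0, 0.0], dtype=torch.float64)

def dice(pred, ref):
    # Dice coefficient without smoothing
    tp = (pred * ref).sum()
    fp = (pred * (1 - ref)).sum()
    fn = ((1 - pred) * ref).sum()
    return 2 * tp / (2 * tp + fp + fn)

p = probs.clone().requires_grad_(True)

soft_dice = dice(p, target)                        # uses the probabilities directly
hard_dice = dice((p > 0.5).to(p.dtype), target)    # thresholds to 0/1 first

print(soft_dice.item(), soft_dice.requires_grad)   # gradient can flow back to p
print(hard_dice.item(), hard_dice.requires_grad)   # thresholding breaks the graph
```

Here the soft TP, FP, and FN sums are 1.1, 0.9, and 0.9, giving a Dice coefficient of 0.55, while the hard counts give 0.5. Only the soft version remains connected to the predictions for backpropagation.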

The Tversky loss is similar to the Dice loss except that it accepts alpha and beta parameters that control the relative weight or influence of FN and FP errors, respectively.

\[ 1 - \left( \frac{\sum \hat{p}_{\text{TP}} + \epsilon}{\sum \hat{p}_{\text{TP}} + \alpha \sum \hat{p}_{\text{FN}} + \beta \sum \hat{p}_{\text{FP}} + \epsilon} \right) \]

The example below demonstrates an implementation of the Tversky loss for a binary classification. In comparison to the Dice loss, this requires adding the alpha and beta parameters to the calculation.

I then apply the loss to the binary classification result using an alpha of 0.7 and a beta of 0.3. This results in a higher weight or penalty applied to incorrectly labeling a positive sample to the negative class (FNs) in comparison to incorrectly labeling a negative sample to the positive class (FPs).

class TverskyLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, targets, smooth=1, alpha=.5, beta=.5, fromProbs=False):
        # Convert raw logits to positive class probabilities if needed
        if not fromProbs:
            inputs = torch.sigmoid(inputs)

        # "Soft" counts built from probabilities rather than hard labels
        TP = (inputs * targets).sum()
        FP = ((1 - targets) * inputs).sum()
        FN = (targets * (1 - inputs)).sum()

        # alpha weights the FN term and beta weights the FP term
        Tversky = (TP + smooth) / (TP + alpha*FN + beta*FP + smooth)

        return 1 - Tversky

target = torch.tensor(b1['class_code'].to_numpy(), dtype=torch.double)
pred = torch.tensor(b1['prob_fail'].to_numpy())

loss = TverskyLoss()
loss(pred, target, alpha=.7, beta=.3, fromProbs=True)

tensor(0.1719, dtype=torch.float64)
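As a quick check on how alpha and beta behave, the sketch below computes the Tversky index (1 minus the loss) for four made-up samples with the smoothing term left out. With alpha and beta both set to 0.5, the index reduces to the Dice coefficient.

```python
import torch

# Made-up positive class probabilities and reference codes
probs = torch.tensor([0.9, 0.6, 0.2, 0.1], dtype=torch.float64)
target = torch.tensor([1.0, 0.0, 1.0, 0.0], dtype=torch.float64)

tp = (probs * target).sum()          # 1.1
fp = (probs * (1 - target)).sum()    # 0.7
fn = ((1 - probs) * target).sum()    # 0.9

def tversky_index(alpha, beta):
    # alpha weights FN, beta weights FP; smoothing omitted for clarity
    return tp / (tp + alpha * fn + beta * fp)

dice = 2 * tp / (2 * tp + fn + fp)
t_equal = tversky_index(0.5, 0.5)
t_fn_heavy = tversky_index(0.7, 0.3)
t_fp_heavy = tversky_index(0.3, 0.7)

print(dice.item(), t_equal.item(), t_fn_heavy.item(), t_fp_heavy.item())
```

Because the soft FN sum is larger than the soft FP sum in this example, weighting FNs more heavily lowers the index and therefore raises the loss.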

The focal Tversky loss is the same as the Tversky loss except that the result is raised to a power specified by the gamma term.

\[ \left[ 1 - \left( \frac{\sum \hat{p}_{\text{TP}} + \epsilon}{\sum \hat{p}_{\text{TP}} + \alpha \sum \hat{p}_{\text{FN}} + \beta \sum \hat{p}_{\text{FP}} + \epsilon} \right) \right]^{\frac{1}{\gamma}} \]

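This controls how much the model is penalized for missing difficult samples. Again, the implementation is very similar to the Dice and Tversky examples above except that the new gamma parameter is defined then applied to the Tversky loss. Larger values of gamma will apply more weight on difficult training samples. Note that the implementation below differs from the formulas above in two ways: it raises the Tversky loss to the power gamma rather than 1/gamma, and it applies alpha to the FP term and beta to the FN term, which is the reverse of the TverskyLoss class. With an alpha of 0.7 and a beta of 0.3, as used in the call below, FPs therefore receive the larger weight.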

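class FocalTverskyLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, targets, smooth=1, alpha=.5, beta=.5, gamma=1, fromProbs=False):
        # Convert raw logits to positive class probabilities if needed
        if not fromProbs:
            inputs = torch.sigmoid(inputs)

        # "Soft" counts built from probabilities rather than hard labels
        TP = (inputs * targets).sum()
        FP = ((1 - targets) * inputs).sum()
        FN = (targets * (1 - inputs)).sum()

        # Here alpha weights FP and beta weights FN (the reverse of TverskyLoss)
        Tversky = (TP + smooth) / (TP + alpha*FP + beta*FN + smooth)
        # The Tversky loss is raised to the power gamma
        FocalTversky = (1 - Tversky)**gamma

        return FocalTversky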

target = torch.tensor(b1['class_code'].to_numpy(), dtype=torch.double)
pred = torch.tensor(b1['prob_fail'].to_numpy())

loss = FocalTverskyLoss()
loss(pred, target, alpha=.7, beta=.3, gamma=0.75, fromProbs=True)

tensor(0.2805, dtype=torch.float64)
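To get a feel for the gamma term, the sketch below raises three made-up Tversky loss values to different powers. It follows the convention of the class above, which raises the loss to gamma directly; the loss values themselves are hypothetical.

```python
import torch

# Hypothetical Tversky loss values for an easy, a moderate, and a hard case
tversky_loss = torch.tensor([0.05, 0.20, 0.60], dtype=torch.float64)

# Focal version of each loss for several gamma values
focal = {gamma: tversky_loss ** gamma for gamma in (0.75, 1.0, 2.0)}

for gamma, values in focal.items():
    ratio = (values[2] / values[0]).item()   # hard case relative to easy case
    print(f"gamma={gamma}: {values.tolist()}, hard/easy ratio = {ratio:.2f}")
```

With a gamma of 1 the Tversky loss is returned unchanged. As gamma increases, the loss for the hard case grows relative to the easy case.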

Lastly, it is possible to combine losses to create a combo loss. In the example below, I am creating a combo loss from Dice and BCE. The user can also specify the relative weights of the losses in the final metric.

# https://neptune.ai/blog/pytorch-loss-functions
class DiceBCELoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs, targets, smooth=1, bceWght=1, fromProbs=True):
        # Convert raw logits to positive class probabilities if needed
        if not fromProbs:
            inputs = torch.sigmoid(inputs)

        # BCE component, calculated from probabilities
        bce = F.binary_cross_entropy(inputs, targets)

        # Dice component, calculated from "soft" counts
        TP = (inputs * targets).sum()
        FP = ((1 - targets) * inputs).sum()
        FN = (targets * (1 - inputs)).sum()

        DICE = (2 * TP) / ((TP + FN) + (TP + FP) + smooth)

        # Dice loss plus the weighted BCE loss
        combo = (1 - DICE) + bceWght*bce

        return combo

target = torch.tensor(b1['class_code'].to_numpy(), dtype=torch.double)
pred = torch.tensor(b1['prob_fail'].to_numpy())

loss = DiceBCELoss()
loss(pred, target, bceWght=.75, fromProbs=True)

tensor(0.4059, dtype=torch.float64)

Assessment Metrics with torchmetrics

Similar to loss metrics, you can define custom assessment metrics functionally or by subclassing nn.Module. However, I will not demonstrate that here. This is partially because there is some preferred functionality that is difficult to code on your own, such as adapting classification metrics to work with both binary and multiclass classification problems or accumulating metric results with each mini-batch update. Given this complexity, I will provide a brief introduction to the torchmetrics package. This package provides access to a wide variety of metrics for both regression and classification. It also provides metrics for specific tasks, such as text and audio classification.

It should also be noted that assessment should be conducted using a withheld validation set as opposed to the training set. This is because, due to overfitting, assessment metrics calculated from the training data can be overly optimistic. It is generally preferred to calculate assessment metrics on the withheld validation data at the end of each training epoch. A third subset, the testing data, is often used to evaluate the final model.

Since torchmetrics is not a standard package, it may need to be installed, as demonstrated in the cell below. Following installation, I am importing the package with an alias of tm.

#!pip install torchmetrics

import torchmetrics as tm

In later modules, you will see applications of metrics implemented with torchmetrics within training loops with aggregation performed across mini-batches. In such cases, I will use the class-based implementation of the metrics. For demonstration purposes, here I will use the functional implementations, which do not require instantiation.

The first example demonstrates obtaining overall accuracy, the class-aggregated F1-score, and the full confusion matrix for a multiclass problem. This requires setting the task parameter to ‘multiclass’ and defining the number of classes being differentiated. Overall accuracy and F1-score generate a single metric while the confusion matrix is a square matrix. It should be noted that most classification assessment metrics are derived from the confusion matrix, other than those calculated from class probabilities using multiple thresholds (i.e., area under the receiver operating characteristic curve (AUC ROC) and area under the precision-recall curve (AUC PR)). So, if you obtain a confusion matrix, you can derive other metrics from it.

Again, you will see more examples of classification metric implementations using torchmetrics in later modules.

pred = torch.tensor(c1[c1.columns[1:10]].to_numpy())
label_encoder = preprocessing.LabelEncoder()
c1['class_code'] = label_encoder.fit_transform(c1['reference'])

target = torch.tensor(c1['class_code'].to_numpy())

acc = tm.functional.accuracy(pred, target, task='multiclass', num_classes=9)
f1 = tm.functional.f1_score(pred, target, task='multiclass', num_classes=9)
cm = tm.functional.confusion_matrix(pred, target, task='multiclass', num_classes=9)
print(f'Accuracy = {acc:.3f}, Average F1-Score: {f1:.3f}')

Accuracy = 0.803, Average F1-Score: 0.803

print(cm)

tensor([[38,  0,  2,  1,  0,  0,  3,  1,  0],
        [ 0, 68,  3, 21,  0,  0,  2,  3,  0],
        [ 0,  0, 20,  1,  0,  0,  0,  0,  0],
        [ 0,  4,  5, 83,  0,  0,  0,  1,  0],
        [ 0,  1,  1,  0, 70,  0,  0,  4,  7],
        [ 0,  0,  0,  0,  1, 13,  0,  0,  0],
        [ 4,  0,  1,  0,  0,  0, 39,  0,  1],
        [ 0,  3,  3,  1,  0,  0,  0, 13,  0],
        [ 0,  0,  0,  0, 23,  0,  3,  0, 63]])
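To illustrate deriving metrics from the confusion matrix, the sketch below recomputes overall accuracy and the class-level recall, precision, and F1-score from the matrix printed above. It assumes that rows represent the reference classes and columns represent the predicted classes.

```python
import torch

# Confusion matrix copied from the output above (rows = reference, columns = predicted)
cm = torch.tensor([[38,  0,  2,  1,  0,  0,  3,  1,  0],
                   [ 0, 68,  3, 21,  0,  0,  2,  3,  0],
                   [ 0,  0, 20,  1,  0,  0,  0,  0,  0],
                   [ 0,  4,  5, 83,  0,  0,  0,  1,  0],
                   [ 0,  1,  1,  0, 70,  0,  0,  4,  7],
                   [ 0,  0,  0,  0,  1, 13,  0,  0,  0],
                   [ 4,  0,  1,  0,  0,  0, 39,  0,  1],
                   [ 0,  3,  3,  1,  0,  0,  0, 13,  0],
                   [ 0,  0,  0,  0, 23,  0,  3,  0, 63]], dtype=torch.float64)

tp = cm.diag()               # correctly classified samples per class
fn = cm.sum(dim=1) - tp      # reference samples of the class assigned elsewhere
fp = cm.sum(dim=0) - tp      # samples wrongly assigned to the class

accuracy = tp.sum() / cm.sum()
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy = {accuracy:.3f}")
print(recall)
print(precision)
print(f1)
```

The accuracy is 407 correct samples out of 507, which matches the overall accuracy reported by torchmetrics.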

For metrics that are calculated at the class-level, it is possible to generate an average to obtain a single metric. For example, the class-level F1-scores can be aggregated using either macro or micro averaging. In the case of macro averaging, each F1-score is calculated separately for each class and then averaged. In the case of micro averaging, the number of TPs, FNs, and FPs are calculated for each class and then summed. The summed TP, FN, and FP counts are then used to calculate the class-aggregated metric. In torchmetrics, the average parameter controls if and how averaging is applied. The default is micro averaging. If this argument is set to “none” then no averaging is applied and the class-level metrics are returned. The code block below demonstrates these different behaviors.

Macro and micro averaging will generally yield similar results if the classes are not heavily imbalanced. However, the results will diverge with increasing class imbalance. Since macro averaging simply calculates the metric separately for each class and then averages them, each class is equally weighted in the calculation. When micro averaging is used, the more abundant classes will have greater weights in the aggregated metric. In other words, micro averaging is more sensitive to class imbalance. Note that micro-averaged F1-score, recall, and precision are all equivalent. They are also equivalent to overall accuracy. As a result, I generally recommend calculating overall accuracy and the macro-averaged versions of the F1-score, precision, and recall.

f1Micro = tm.functional.f1_score(pred, target, task='multiclass', average = "micro",  num_classes=9)
f1Macro = tm.functional.f1_score(pred, target, task='multiclass', average = "macro",  num_classes=9)
f1NoAgg = tm.functional.f1_score(pred, target, task='multiclass', average = "none",  num_classes=9)

print("Class-Aggregated F1-Score with Micro Averaging = " + str(f1Micro))

Class-Aggregated F1-Score with Micro Averaging = tensor(0.8028)

print("Class-Aggregated F1-Score with Macro Averaging = " + str(f1Macro))

Class-Aggregated F1-Score with Macro Averaging = tensor(0.8014)

print("F1-Score with No Averaging = " + str(f1NoAgg))

F1-Score with No Averaging = tensor([0.8736, 0.7861, 0.7143, 0.8300, 0.7910, 0.9630, 0.8478, 0.6190, 0.7875])

The code blocks below demonstrate the different averaging techniques for the recall and precision metrics.

reMicro = tm.functional.recall(pred, target, task='multiclass', average = "micro",  num_classes=9)
reMacro = tm.functional.recall(pred, target, task='multiclass', average = "macro",  num_classes=9)
reNoAgg = tm.functional.recall(pred, target, task='multiclass', average = "none",  num_classes=9)

print("Class-Aggregated Recall with Micro Averaging = " + str(reMicro))

Class-Aggregated Recall with Micro Averaging = tensor(0.8028)

print("Class-Aggregated Recall with Macro Averaging = " + str(reMacro))

Class-Aggregated Recall with Macro Averaging = tensor(0.8208)

print("Recall with No Averaging = " + str(reNoAgg))

Recall with No Averaging = tensor([0.8444, 0.7010, 0.9524, 0.8925, 0.8434, 0.9286, 0.8667, 0.6500, 0.7079])

preMicro = tm.functional.precision(pred, target, task='multiclass', average = "micro",  num_classes=9)
preMacro = tm.functional.precision(pred, target, task='multiclass', average = "macro",  num_classes=9)
preNoAgg = tm.functional.precision(pred, target, task='multiclass', average = "none",  num_classes=9)

print("Class-Aggregated Precision with Micro Averaging = " + str(preMicro))

Class-Aggregated Precision with Micro Averaging = tensor(0.8028)

print("Class-Aggregated Precision with Macro Averaging = " + str(preMacro))

Class-Aggregated Precision with Macro Averaging = tensor(0.7999)

print("Precision with No Averaging = " + str(preNoAgg))

Precision with No Averaging = tensor([0.9048, 0.8947, 0.5714, 0.7757, 0.7447, 1.0000, 0.8298, 0.5909, 0.8873])
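The divergence between micro and macro averaging under class imbalance can be sketched with two small, made-up confusion matrices. The helper function below is my own, and it assumes that rows represent the reference classes.

```python
import torch

def micro_macro_recall(cm):
    """Return (micro, macro) recall from a confusion matrix with reference classes as rows."""
    cm = cm.to(torch.float64)
    tp = cm.diag()
    fn = cm.sum(dim=1) - tp
    micro = tp.sum() / (tp.sum() + fn.sum())   # pool the counts, then calculate
    macro = (tp / (tp + fn)).mean()            # calculate per class, then average
    return micro.item(), macro.item()

balanced = torch.tensor([[45, 5], [8, 42]])     # 50 reference samples per class
imbalanced = torch.tensor([[95, 5], [8, 2]])    # 100 vs. 10 reference samples

print(micro_macro_recall(balanced))
print(micro_macro_recall(imbalanced))
```

For the balanced case both averages are 0.87. For the imbalanced case the micro average (97/110, which is also the overall accuracy) stays high because the abundant class dominates, while the macro average drops to 0.575 because the poorly predicted rare class counts equally.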

The example below demonstrates calculating root mean square error (RMSE) and R-squared for a regression problem. RMSE is calculated by using the mean_squared_error() function and setting the squared parameter to False. To obtain MSE, you would set the squared parameter equal to True.

target = torch.tensor(r1['truth'].to_numpy())
pred = torch.tensor(r1['m1'].to_numpy())

rmse = tm.functional.mean_squared_error(pred, target, squared=False)
r2 = tm.functional.r2_score(pred, target)
print(f'RMSE = {rmse:.3f}, R-Squared: {r2:.3f}')

RMSE = 0.370, R-Squared: 0.771
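For reference, both metrics can also be sketched from scratch with plain torch operations. The values below are made up rather than taken from r1, and R-squared is written as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import torch

# Made-up reference and predicted values
target = torch.tensor([1.0, 2.0, 3.0, 4.0], dtype=torch.float64)
pred = torch.tensor([1.5, 2.0, 2.0, 4.5], dtype=torch.float64)

# RMSE: square root of the mean squared residual
rmse = torch.sqrt(torch.mean((pred - target) ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = torch.sum((target - pred) ** 2)
ss_tot = torch.sum((target - target.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'RMSE = {rmse:.3f}, R-Squared: {r2:.3f}')
```

Since RMSE is in the same units as the target variable, it is often easier to interpret than MSE.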

Concluding Remarks

The goal of this module was to provide an introduction to using and calculating loss and assessment metrics. I used PyTorch to calculate loss metrics and demonstrated how to create custom metrics by subclassing nn.Module. For assessment metrics, I demonstrated the torchmetrics package. You will see example implementations of loss and assessment metrics within the training loop throughout the remainder of this course. We will also apply the loss and assessment metrics to the training set and a withheld validation set, which is considered best practice. We will generally assess the final model using a third testing set.

In the next section, we will discuss preparing data for input into a modeling process using the Dataset and DataLoader classes.