CNN Components
Convolutional Neural Network Components
Introduction
The goal of this module is to explore convolutional operations and the building blocks of convolutional neural networks (CNNs) before we dive into building CNNs using PyTorch. Fortunately, CNNs, especially simple architectures, rely on a small set of building blocks. As a result, understanding these building blocks goes a long way towards understanding how CNNs work and are built.
In the code block below, I am importing torch, torch.nn, and torch.nn.functional, which is assigned the alias F. The torch.nn.functional subpackage provides functional implementations (i.e., they are coded and executed like functions) that complement many of the class-based layers and building blocks available in PyTorch. For the demonstrations in this module, the functional implementations are simpler to use since I don’t have to subclass nn.Module and instantiate an object.
import torch
import torch.nn
import torch.nn.functional as F
2D Convolution
We will begin by building a random tensor on which to perform operations. This tensor is 4D with a shape of (1,1,9,9) and is meant to represent a tensor with the following dimensions: (mini-batch, channels, height, width). As you will see in later modules, this is a common tensor shape when feeding images into a CNN in mini-batches. I have also rescaled the random values from the default range of 0 to 1 to a range of -1 to 1.
Next, I create another tensor that is meant to represent a kernel or moving window. This moving window has a shape of (1,1,3,3). So, it represents a single kernel that will pass over the input tensor to create a single feature map. The weights associated with this tensor would be learned during the training process to create a transformation of the input image or prior feature map that represents some useful information. However, here I am not actually building a CNN architecture or training it. Instead, I am just demonstrating convolutional operations using some random data of the correct shape. As a result, I have just filled the tensor with 1s.
inT = (torch.rand((1,1,9,9))*2)-1
inT
tensor([[[[-0.0466, -0.3008, 0.3269, 0.0739, 0.6109, -0.6042, 0.4905,
-0.4813, 0.5852],
[ 0.0080, -0.2993, 0.5226, -0.6695, -0.1031, -0.9346, 0.3875,
-0.3100, -0.6509],
[-0.5376, -0.2214, 0.1116, 0.0544, 0.5073, 0.5440, 0.2281,
0.6407, 0.7883],
[ 0.2645, 0.5250, -0.9872, -0.8321, -0.4084, 0.6151, -0.0289,
-0.5859, -0.3935],
[-0.0946, 0.9851, -0.4272, 0.0783, -0.7337, 0.7443, 0.5472,
0.9170, -0.7735],
[-0.4122, -0.1005, 0.8044, 0.4973, 0.2798, -0.7644, 0.6766,
-0.7833, -0.6130],
[ 0.3232, -0.8316, -0.8228, -0.7214, -0.0533, 0.2811, -0.8952,
-0.9909, -0.7224],
[ 0.8259, 0.7808, 0.0729, 0.6655, 0.7967, -0.1503, 0.7765,
-0.2083, -0.0885],
[-0.4544, -0.5565, -0.2842, 0.1704, 0.2546, 0.4024, 0.9126,
-0.3584, -0.4120]]]])
inW = torch.ones((1,1,3,3))
inW
tensor([[[[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]]]])
In order to apply the kernel to the image, I use the functional version of conv2d(), which performs two-dimensional convolution. There are also functions for 1D and 3D convolution (conv1d() and conv3d()). The conv2d() function accepts an input tensor, an input kernel, and stride and padding arguments. The stride represents how much the kernel moves as it steps over the image. Since the stride is set to 1, each pixel will be placed at the center of the 3x3 kernel as it passes over the image. The padding argument is set to “same”, which indicates that padding will be added such that the height and width of the resulting feature map will be the same as the input image. Since a 3x3 kernel with a stride of 1 is being used here, “same” would yield the same result as a padding of 1.
In the first example, the result is equivalent to adding all of the values in the 3x3 window around each cell. This is because each weight is set to 1, so each value is multiplied by 1 and then all of the values are added. In the next example, I have defined a new kernel in which only the center weight of the 3x3 kernel is 1 while all other weights are 0. This results in simply returning the center value, which replicates the original values or “image”.
outT = F.conv2d(inT, inW, stride=1, padding="same")
outT
tensor([[[[-0.6387, 0.2108, -0.3463, 0.7617, -1.6267, -0.1531, -1.4523,
0.0209, -0.8570],
[-1.3977, -0.4366, -0.4016, 1.4351, -0.5209, 1.1263, -0.0395,
1.6780, 0.5720],
[-0.2608, -0.6138, -1.7960, -1.8044, -1.2269, 0.8070, 0.5560,
0.0755, -0.5112],
[ 0.9211, -0.3818, -0.7135, -2.6370, 0.5692, 2.0150, 3.6217,
1.3396, 0.5932],
[ 1.1674, 0.5573, 0.5431, -1.7288, -0.5238, 0.9276, 1.3377,
-1.0372, -2.2321],
[-0.1306, -0.5762, -0.5382, -1.0985, -0.3919, 0.0823, -0.2677,
-2.6375, -2.9661],
[ 0.5855, 0.6401, 0.3447, 1.5193, 0.8312, 0.9475, -2.0582,
-2.8486, -3.4064],
[ 0.0873, -0.9468, -1.5268, 0.0785, 1.6459, 2.3252, -0.2304,
-1.9866, -2.7805],
[ 0.5958, 0.3845, 0.8489, 1.6759, 2.1394, 2.9926, 1.3746,
0.6219, -1.0672]]]])
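Before moving on, here is a quick sanity check of the first example; this snippet is mine and simply reuses the inT and outT tensors from above. Since every kernel weight is 1, the output value at row 1, column 1 should equal the sum of the upper-left 3x3 window of the input.
print(inT[0, 0, 0:3, 0:3].sum()) # sum of the upper-left 3x3 window
print(outT[0, 0, 1, 1]) # same value (-0.4366) in the feature map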
inW = torch.tensor([[[[0., 0., 0.],[0., 1., 0.],[0., 0., 0.]]]])
outT = F.conv2d(inT, inW, stride=1, padding="same")
outT
tensor([[[[-0.0466, -0.3008, 0.3269, 0.0739, 0.6109, -0.6042, 0.4905,
-0.4813, 0.5852],
[ 0.0080, -0.2993, 0.5226, -0.6695, -0.1031, -0.9346, 0.3875,
-0.3100, -0.6509],
[-0.5376, -0.2214, 0.1116, 0.0544, 0.5073, 0.5440, 0.2281,
0.6407, 0.7883],
[ 0.2645, 0.5250, -0.9872, -0.8321, -0.4084, 0.6151, -0.0289,
-0.5859, -0.3935],
[-0.0946, 0.9851, -0.4272, 0.0783, -0.7337, 0.7443, 0.5472,
0.9170, -0.7735],
[-0.4122, -0.1005, 0.8044, 0.4973, 0.2798, -0.7644, 0.6766,
-0.7833, -0.6130],
[ 0.3232, -0.8316, -0.8228, -0.7214, -0.0533, 0.2811, -0.8952,
-0.9909, -0.7224],
[ 0.8259, 0.7808, 0.0729, 0.6655, 0.7967, -0.1503, 0.7765,
-0.2083, -0.0885],
[-0.4544, -0.5565, -0.2842, 0.1704, 0.2546, 0.4024, 0.9126,
-0.3584, -0.4120]]]])
print(inT.shape)
torch.Size([1, 1, 9, 9])
print(outT.shape)
torch.Size([1, 1, 9, 9])
As the code blocks above demonstrate, when padding is set to “same”, the height and width of the resulting feature map will be the same as the input image or tensor. If the padding is set to 0, no padding will be added, and only cells that have a full set of neighbors within a 3x3 window will be processed. This results in dropping the outermost rows and columns to yield a tensor that has a height and width of 7x7 as opposed to 9x9.
inW = torch.tensor([[[[0., 0., 0.],[0., 1., 0.],[0., 0., 0.]]]])
outT = F.conv2d(inT, inW, stride=1, padding=0)
outT
tensor([[[[-0.2993, 0.5226, -0.6695, -0.1031, -0.9346, 0.3875, -0.3100],
[-0.2214, 0.1116, 0.0544, 0.5073, 0.5440, 0.2281, 0.6407],
[ 0.5250, -0.9872, -0.8321, -0.4084, 0.6151, -0.0289, -0.5859],
[ 0.9851, -0.4272, 0.0783, -0.7337, 0.7443, 0.5472, 0.9170],
[-0.1005, 0.8044, 0.4973, 0.2798, -0.7644, 0.6766, -0.7833],
[-0.8316, -0.8228, -0.7214, -0.0533, 0.2811, -0.8952, -0.9909],
[ 0.7808, 0.0729, 0.6655, 0.7967, -0.1503, 0.7765, -0.2083]]]])
print(inT.shape)
torch.Size([1, 1, 9, 9])
print(outT.shape)
torch.Size([1, 1, 7, 7])
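The effect of the kernel size, stride, and padding on the output size can also be worked out directly. Below is a minimal sketch; the conv_out_size() helper is mine, not part of PyTorch. The spatial output size of a convolution is floor((in + 2*padding - kernel) / stride) + 1, which gives 7 for the example above.
def conv_out_size(in_size, kernel, stride=1, padding=0):
    # floor((in + 2*padding - kernel) / stride) + 1
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_out_size(9, 3, stride=1, padding=0)) # 7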
It is also possible to apply multiple kernels to an input. Below, I am creating three 3x3 kernels, all filled with weights of 1. When these kernels are applied to the input tensor, three feature maps are generated.
inW = torch.ones((3,1,3,3))
outT = F.conv2d(inT, inW, stride=1, padding="same")
outT
tensor([[[[-0.6387, 0.2108, -0.3463, 0.7617, -1.6267, -0.1531, -1.4523,
0.0209, -0.8570],
[-1.3977, -0.4366, -0.4016, 1.4351, -0.5209, 1.1263, -0.0395,
1.6780, 0.5720],
[-0.2608, -0.6138, -1.7960, -1.8044, -1.2269, 0.8070, 0.5560,
0.0755, -0.5112],
[ 0.9211, -0.3818, -0.7135, -2.6370, 0.5692, 2.0150, 3.6217,
1.3396, 0.5932],
[ 1.1674, 0.5573, 0.5431, -1.7288, -0.5238, 0.9276, 1.3377,
-1.0372, -2.2321],
[-0.1306, -0.5762, -0.5382, -1.0985, -0.3919, 0.0823, -0.2677,
-2.6375, -2.9661],
[ 0.5855, 0.6401, 0.3447, 1.5193, 0.8312, 0.9475, -2.0582,
-2.8486, -3.4064],
[ 0.0873, -0.9468, -1.5268, 0.0785, 1.6459, 2.3252, -0.2304,
-1.9866, -2.7805],
[ 0.5958, 0.3845, 0.8489, 1.6759, 2.1394, 2.9926, 1.3746,
0.6219, -1.0672]],
[[-0.6387, 0.2108, -0.3463, 0.7617, -1.6267, -0.1531, -1.4523,
0.0209, -0.8570],
[-1.3977, -0.4366, -0.4016, 1.4351, -0.5209, 1.1263, -0.0395,
1.6780, 0.5720],
[-0.2608, -0.6138, -1.7960, -1.8044, -1.2269, 0.8070, 0.5560,
0.0755, -0.5112],
[ 0.9211, -0.3818, -0.7135, -2.6370, 0.5692, 2.0150, 3.6217,
1.3396, 0.5932],
[ 1.1674, 0.5573, 0.5431, -1.7288, -0.5238, 0.9276, 1.3377,
-1.0372, -2.2321],
[-0.1306, -0.5762, -0.5382, -1.0985, -0.3919, 0.0823, -0.2677,
-2.6375, -2.9661],
[ 0.5855, 0.6401, 0.3447, 1.5193, 0.8312, 0.9475, -2.0582,
-2.8486, -3.4064],
[ 0.0873, -0.9468, -1.5268, 0.0785, 1.6459, 2.3252, -0.2304,
-1.9866, -2.7805],
[ 0.5958, 0.3845, 0.8489, 1.6759, 2.1394, 2.9926, 1.3746,
0.6219, -1.0672]],
[[-0.6387, 0.2108, -0.3463, 0.7617, -1.6267, -0.1531, -1.4523,
0.0209, -0.8570],
[-1.3977, -0.4366, -0.4016, 1.4351, -0.5209, 1.1263, -0.0395,
1.6780, 0.5720],
[-0.2608, -0.6138, -1.7960, -1.8044, -1.2269, 0.8070, 0.5560,
0.0755, -0.5112],
[ 0.9211, -0.3818, -0.7135, -2.6370, 0.5692, 2.0150, 3.6217,
1.3396, 0.5932],
[ 1.1674, 0.5573, 0.5431, -1.7288, -0.5238, 0.9276, 1.3377,
-1.0372, -2.2321],
[-0.1306, -0.5762, -0.5382, -1.0985, -0.3919, 0.0823, -0.2677,
-2.6375, -2.9661],
[ 0.5855, 0.6401, 0.3447, 1.5193, 0.8312, 0.9475, -2.0582,
-2.8486, -3.4064],
[ 0.0873, -0.9468, -1.5268, 0.0785, 1.6459, 2.3252, -0.2304,
-1.9866, -2.7805],
[ 0.5958, 0.3845, 0.8489, 1.6759, 2.1394, 2.9926, 1.3746,
0.6219, -1.0672]]]])
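Checking the shape of the result confirms that one feature map is generated per kernel: the channel dimension of the output is now 3 (this simply reuses outT from above).
print(outT.shape) # torch.Size([1, 3, 9, 9])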
If a stride greater than 1 is used, the size of the resulting array in the spatial dimensions will decrease in comparison to the input since the kernel will not be centered over each cell but will skip cells. This is one means to decrease the size of the spatial dimensions of an array. However, this is not commonly used. Instead, pooling operations are applied, which will be discussed below. When building CNNs, we will generally use a stride of 1. However, there are some applications where other strides are used. For example, when we discuss the ResNet architecture, you will see that it uses a stride of 2 as a means to decrease the size of the arrays at certain points in the network architecture.
inW = torch.ones((1,1,3,3))
outT = F.conv2d(inT, inW, stride=2)
outT
tensor([[[[-0.4366, 1.4351, 1.1263, 1.6780],
[-0.3818, -2.6370, 2.0150, 1.3396],
[-0.5762, -1.0985, 0.0823, -2.6375],
[-0.9468, 0.0785, 2.3252, -1.9866]]]])
print(inT.shape)
torch.Size([1, 1, 9, 9])
print(outT.shape)
torch.Size([1, 1, 4, 4])
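Another way to think about the stride: a stride of 2 yields the same values as running the convolution with a stride of 1 and then keeping every other row and column. The short check below is mine and reuses inT, inW, and outT from above.
dense = F.conv2d(inT, inW, stride=1, padding=0) # full 7x7 result with a stride of 1
print(torch.allclose(outT, dense[:, :, ::2, ::2])) # True: a stride of 2 keeps every other position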
Pooling
In CNN architectures, pooling operations are commonly used to decrease the size of the array in the spatial dimensions as opposed to accomplishing this by increasing the stride in the convolution operations. The most commonly used pooling operation is max pooling, which is demonstrated below. Here, an input tensor with a shape of (1,1,10,10) is transformed to a shape of (1,1,5,5) using max pooling. This is a very simple operation: within a 2x2 window, the largest or maximum value is returned. You can see that the max_pool2d() function accepts an input tensor, a window size, and a stride. By using a window size of 2x2, I essentially combine 4 cells into a single cell by returning the maximum of the 4 values. When using a stride of 2, there is no overlap between the 2x2 windows.
inT = (torch.rand((1,1,10,10))*2)-1
inT
tensor([[[[ 0.8244, -0.3165, -0.4161, 0.7906, -0.2729, -0.5686, -0.0613,
0.2458, 0.5515, 0.4250],
[ 0.4568, -0.2253, -0.1902, -0.5387, -0.4931, 0.2119, 0.4240,
-0.1909, 0.1276, 0.3345],
[ 0.9204, -0.7825, -0.3465, -0.2403, -0.5415, 0.0396, -0.7452,
-0.2270, -0.9286, -0.9468],
[-0.1660, 0.1304, 0.2584, 0.0501, -0.2343, 0.7338, 0.4320,
0.2482, -0.5208, -0.2825],
[-0.0673, -0.1077, 0.2624, 0.4024, 0.0498, -0.0299, -0.6856,
-0.1890, -0.1905, -0.8239],
[ 0.4519, 0.9195, -0.0889, -0.8265, 0.4021, 0.8477, 0.8053,
0.6788, 0.2094, 0.2588],
[ 0.5731, 0.3003, 0.2275, -0.5732, 0.8001, 0.5481, -0.1115,
-0.3856, 0.7018, 0.4083],
[-0.8134, 0.5852, 0.4007, -0.0130, -0.6072, -0.8360, 0.0397,
0.0711, 0.4120, 0.6668],
[-0.3245, -0.9693, 0.5755, 0.7887, -0.7603, 0.7697, -0.8927,
0.1853, -0.2303, 0.2337],
[ 0.5675, -0.0874, 0.6941, -0.0191, -0.2465, -0.3402, -0.8829,
0.6852, 0.3002, 0.8044]]]])
outT = F.max_pool2d(inT, (2,2), stride=2)
outT
tensor([[[[ 0.8244, 0.7906, 0.2119, 0.4240, 0.5515],
[ 0.9204, 0.2584, 0.7338, 0.4320, -0.2825],
[ 0.9195, 0.4024, 0.8477, 0.8053, 0.2588],
[ 0.5852, 0.4007, 0.8001, 0.0711, 0.7018],
[ 0.5675, 0.7887, 0.7697, 0.6852, 0.8044]]]])
print(inT.shape)
torch.Size([1, 1, 10, 10])
print(outT.shape)
torch.Size([1, 1, 5, 5])
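As a quick check (this snippet is mine and reuses inT and outT from above), the first pooled value should equal the maximum of the upper-left 2x2 window of the input.
print(inT[0, 0, 0:2, 0:2].max()) # maximum of the upper-left 2x2 window
print(outT[0, 0, 0, 0]) # same value (0.8244) after max pooling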
Although there are other pooling operations, max pooling tends to be used most often. This is because the maximum value in the window is generally associated with the most defining feature or the largest activation. However, this is just a simple interpretation and may not hold true in all cases.
Below, I demonstrate another pooling operation: average pooling. This is the same as max pooling except that the average is returned as opposed to the maximum. You will see this applied when we discuss ResNet.
Similar to convolutional operations, there are 1D and 3D versions of pooling operations.
outT = F.avg_pool2d(inT, (2,2), stride=2)
outT
tensor([[[[ 1.8485e-01, -8.8598e-02, -2.8070e-01, 1.0438e-01, 3.5964e-01],
[ 2.5562e-02, -6.9585e-02, -5.9980e-04, -7.3004e-02, -6.6968e-01],
[ 2.9911e-01, -6.2636e-02, 3.1741e-01, 1.5237e-01, -1.3654e-01],
[ 1.6129e-01, 1.0494e-02, -2.3743e-02, -9.6602e-02, 5.4724e-01],
[-2.0343e-01, 5.0981e-01, -1.4433e-01, -2.2626e-01, 2.7698e-01]]]])
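The same kind of check works for average pooling (again, this snippet is mine and reuses inT and the new outT): the first pooled value should equal the mean of the upper-left 2x2 window.
print(inT[0, 0, 0:2, 0:2].mean()) # mean of the upper-left 2x2 window
print(outT[0, 0, 0, 0]) # same value after average pooling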
Activation Functions
One of the issues with only applying convolution operations or fully connected layers is that this only allows for linear transformations of values. In order to introduce non-linearity into the CNN, we need to incorporate activation functions. Below, I have implemented a few activation functions. Currently, one of the most commonly used activation functions is the rectified linear unit (ReLU). This is an operation that simply converts any negative values to 0 and does not alter any positive values. Leaky ReLU is an alternative version of ReLU that maintains negative values but reduces their magnitude using a slope term. This is sometimes useful to combat the “dying ReLU” problem.
Some activation functions have an inplace parameter, which allows you to modify the values in place as opposed to saving the result to a new location in memory.
inT = torch.rand(1,2,5,5)*2-1
inT
tensor([[[[-0.5014, -0.9047, -0.6306, 0.8416, -0.8460],
[ 0.9726, 0.7070, 0.2985, -0.9696, -0.5565],
[-0.5973, -0.5704, 0.3628, -0.1858, -0.2573],
[ 0.4994, 0.4288, 0.0320, -0.8564, 0.9174],
[ 0.0768, 0.1749, -0.4774, 0.5014, -0.7107]],
[[-0.1248, 0.9581, 0.8195, 0.4614, 0.0015],
[-0.4375, -0.0924, -0.7037, -0.2770, -0.3566],
[-0.7936, 0.6015, 0.8983, 0.2130, 0.5785],
[-0.0043, 0.5153, -0.0678, 0.4792, 0.2003],
[-0.4519, 0.7003, -0.1019, -0.5348, 0.7836]]]])
outT = F.relu(inT)
outT
tensor([[[[0.0000, 0.0000, 0.0000, 0.8416, 0.0000],
[0.9726, 0.7070, 0.2985, 0.0000, 0.0000],
[0.0000, 0.0000, 0.3628, 0.0000, 0.0000],
[0.4994, 0.4288, 0.0320, 0.0000, 0.9174],
[0.0768, 0.1749, 0.0000, 0.5014, 0.0000]],
[[0.0000, 0.9581, 0.8195, 0.4614, 0.0015],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.6015, 0.8983, 0.2130, 0.5785],
[0.0000, 0.5153, 0.0000, 0.4792, 0.2003],
[0.0000, 0.7003, 0.0000, 0.0000, 0.7836]]]])
outT = F.leaky_relu(inT, negative_slope=.1)
outT
tensor([[[[-5.0141e-02, -9.0470e-02, -6.3059e-02, 8.4156e-01, -8.4604e-02],
[ 9.7260e-01, 7.0704e-01, 2.9846e-01, -9.6956e-02, -5.5648e-02],
[-5.9729e-02, -5.7038e-02, 3.6279e-01, -1.8580e-02, -2.5725e-02],
[ 4.9945e-01, 4.2885e-01, 3.1984e-02, -8.5640e-02, 9.1744e-01],
[ 7.6792e-02, 1.7486e-01, -4.7744e-02, 5.0136e-01, -7.1074e-02]],
[[-1.2482e-02, 9.5810e-01, 8.1955e-01, 4.6137e-01, 1.4788e-03],
[-4.3753e-02, -9.2384e-03, -7.0374e-02, -2.7697e-02, -3.5662e-02],
[-7.9356e-02, 6.0155e-01, 8.9828e-01, 2.1302e-01, 5.7854e-01],
[-4.2657e-04, 5.1531e-01, -6.7812e-03, 4.7924e-01, 2.0030e-01],
[-4.5194e-02, 7.0033e-01, -1.0189e-02, -5.3475e-02, 7.8360e-01]]]])
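To make the two operations concrete: ReLU returns max(0, x), and leaky ReLU keeps positive values unchanged while multiplying negative values by the negative_slope. The short check below is mine and reuses inT and the leaky ReLU result outT.
manual = torch.where(inT > 0, inT, inT * 0.1) # keep positives, scale negatives by 0.1
print(torch.allclose(manual, outT)) # True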
The sigmoid activation function can be used to convert values to a range between 0 and 1. This is a common activation function after the last linear layer in a fully connected neural network or a CNN that is performing a binary classification. For binary classification, the last fully connected layer will have an output size of 1. The raw logits can be passed through a sigmoid activation function in order to scale them from 0 to 1. In the example below, I generate a 1D array of logits representing results for four separate data points. Each data point is then rescaled separately, or element-wise, using the sigmoid function to be between 0 and 1.
inT = torch.tensor([4.1, -1.1, 3.3, 2.2])
print(torch.sigmoid(inT))
tensor([0.9837, 0.2497, 0.9644, 0.9002])
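The sigmoid function is defined as 1 / (1 + exp(-x)), so the result above can be reproduced directly (the line below is mine and reuses the inT logits from above).
print(1 / (1 + torch.exp(-inT))) # tensor([0.9837, 0.2497, 0.9644, 0.9002])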
For multiclass classification, a softmax activation is commonly used instead of a sigmoid activation. When performing a multiclass classification, the last layer will output one node for each class. As a result, raw logits will be returned for each class, and the prediction will correspond to the class with the highest logit. In order to convert these raw logits to probabilities that sum to 1, the softmax activation is applied. This requires defining a dimension over which the probabilities should sum to 1. In the example, the 2nd dimension (i.e., index 1) represents all the predicted logits for a single observation.
inT = torch.tensor([[4.1, -1.1, 3.3, 2.2], [3.6, 2.3, -0.5, -1.2], [1.7, 3.3, .7, .2]])
print(torch.softmax(inT, dim=1))
tensor([[0.6233, 0.0034, 0.2801, 0.0932],
[0.7708, 0.2101, 0.0128, 0.0063],
[0.1528, 0.7569, 0.0562, 0.0341]])
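Softmax exponentiates each logit and divides by the sum of the exponentials along the chosen dimension, which is why each row of the result sums to 1. The manual version below is mine and reuses inT from above.
manual = torch.exp(inT) / torch.exp(inT).sum(dim=1, keepdim=True)
print(manual) # matches torch.softmax(inT, dim=1)
print(manual.sum(dim=1)) # tensor([1., 1., 1.])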
It should be noted that some loss functions may expect raw logits while others may expect the data to be passed through a sigmoid or softmax activation beforehand. Here are a few tips.
nn.CrossEntropyLoss() for multiclass problems expects logits
nn.NLLLoss() for multiclass problems expects probabilities
nn.BCELoss() for binary classification expects probabilities
nn.BCEWithLogitsLoss() for binary classification expects logits (this combines the sigmoid and BCE loss operations)
Given the above tips, when using PyTorch for a multiclass classification, we generally do not incorporate a softmax activation at the end of the network if we are using cross entropy loss since the PyTorch implementation of this loss metric expects raw logits. If you are using an alternative loss metric, such as Dice or Tversky, you will need to investigate the implementation and make sure you are providing it with the expected data: either raw logits or probabilities (i.e., logits have been passed through a softmax activation).
For binary classification, if you use nn.BCELoss(), you must pass the logits through a sigmoid activation beforehand. If you use nn.BCEWithLogitsLoss(), you should provide the raw logits since this loss incorporates the sigmoid activation. If you are using an alternative loss metric, such as Dice or Tversky, you will need to explore the implementation and make sure you are providing it with the expected data: either raw logits or probabilities (i.e., logits that have been passed through a sigmoid activation).
Finally, it is possible to treat any binary classification problem as a multiclass classification problem in which the last layer outputs two logits, one for each class, as opposed to one for the positive case. If you treat a binary classification problem the same as a multiclass classification problem, then you would want to make use of nn.CrossEntropyLoss(). If you use a loss that expects probabilities, you would want to use softmax as opposed to a sigmoid activation.
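As a minimal sketch of these pairings (the logits and labels below are made up purely for illustration), note that nn.BCEWithLogitsLoss() applied to a raw logit gives the same result as applying a sigmoid followed by nn.BCELoss().
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5]]) # raw logits for one observation and three classes
target = torch.tensor([0]) # correct class index
print(nn.CrossEntropyLoss()(logits, target)) # expects raw logits

bin_logit = torch.tensor([0.8]) # raw logit for one binary observation
bin_target = torch.tensor([1.0])
print(nn.BCEWithLogitsLoss()(bin_logit, bin_target)) # expects raw logits
print(nn.BCELoss()(torch.sigmoid(bin_logit), bin_target)) # expects probabilities; same value as above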
Deconvolution and Upsampling
For regular CNNs in which the goal is to label an entire image, as opposed to labeling or predicting each individual pixel in the image, deconvolution is not required. However, when performing pixel-level classification (i.e., semantic segmentation) or regression, the decoder component must have a means to scale up the feature maps in the spatial dimensions in order to regenerate the original spatial dimensions of the scene.
This upscaling or upsampling can be performed using a couple of different methods. The upsample() function operates similarly to resampling methods that we use in the geospatial sciences: new cell values are estimated mathematically from nearby original values using some interpolation method, such as nearest neighbor, bilinear, or bicubic. Note that upsample() has been deprecated in favor of interpolate(), which is why a deprecation warning appears in the output below. This form of upsampling is not really deconvolution since no kernels and associated weights are being applied.
As examples below, I have performed upsampling on a tensor with spatial dimensions of 4x4 using the nearest neighbor and bilinear interpolation methods. When using nearest neighbor, the nearest cell value is used, and you can see that values have been replicated in the result. In contrast, bilinear interpolation performs a distance-weighted averaging of the 4 nearest original cells around each output cell.
inT = torch.rand(1,2,4,4)*2-1
inT
tensor([[[[ 0.5602, 0.2437, 0.2268, 0.3509],
[-0.5953, -0.5326, 0.9710, -0.0372],
[-0.6942, 0.7757, 0.1789, 0.7181],
[-0.7883, -0.9999, -0.8242, -0.4884]],
[[ 0.0559, -0.8419, -0.5129, 0.5415],
[-0.8274, 0.6626, 0.5665, 0.2222],
[ 0.8373, 0.7271, -0.5251, 0.7163],
[ 0.0129, 0.1158, -0.4064, 0.5346]]]])
outT = F.upsample(inT, size=(8,8), mode="nearest")
C:\Users\vidcg\ANACON~1\envs\torchENV\lib\site-packages\torch\nn\functional.py:3734: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
outT
tensor([[[[ 0.5602, 0.5602, 0.2437, 0.2437, 0.2268, 0.2268, 0.3509,
0.3509],
[ 0.5602, 0.5602, 0.2437, 0.2437, 0.2268, 0.2268, 0.3509,
0.3509],
[-0.5953, -0.5953, -0.5326, -0.5326, 0.9710, 0.9710, -0.0372,
-0.0372],
[-0.5953, -0.5953, -0.5326, -0.5326, 0.9710, 0.9710, -0.0372,
-0.0372],
[-0.6942, -0.6942, 0.7757, 0.7757, 0.1789, 0.1789, 0.7181,
0.7181],
[-0.6942, -0.6942, 0.7757, 0.7757, 0.1789, 0.1789, 0.7181,
0.7181],
[-0.7883, -0.7883, -0.9999, -0.9999, -0.8242, -0.8242, -0.4884,
-0.4884],
[-0.7883, -0.7883, -0.9999, -0.9999, -0.8242, -0.8242, -0.4884,
-0.4884]],
[[ 0.0559, 0.0559, -0.8419, -0.8419, -0.5129, -0.5129, 0.5415,
0.5415],
[ 0.0559, 0.0559, -0.8419, -0.8419, -0.5129, -0.5129, 0.5415,
0.5415],
[-0.8274, -0.8274, 0.6626, 0.6626, 0.5665, 0.5665, 0.2222,
0.2222],
[-0.8274, -0.8274, 0.6626, 0.6626, 0.5665, 0.5665, 0.2222,
0.2222],
[ 0.8373, 0.8373, 0.7271, 0.7271, -0.5251, -0.5251, 0.7163,
0.7163],
[ 0.8373, 0.8373, 0.7271, 0.7271, -0.5251, -0.5251, 0.7163,
0.7163],
[ 0.0129, 0.0129, 0.1158, 0.1158, -0.4064, -0.4064, 0.5346,
0.5346],
[ 0.0129, 0.0129, 0.1158, 0.1158, -0.4064, -0.4064, 0.5346,
0.5346]]]])
outT = F.upsample(inT, size=(8,8), mode="bilinear")
outT
tensor([[[[ 0.5602, 0.4811, 0.3228, 0.2395, 0.2310, 0.2578, 0.3199,
0.3509],
[ 0.2713, 0.2159, 0.1051, 0.1404, 0.3221, 0.3731, 0.2936,
0.2539],
[-0.3064, -0.3144, -0.3305, -0.0576, 0.5041, 0.6037, 0.2411,
0.0598],
[-0.6200, -0.5164, -0.3091, 0.0391, 0.5284, 0.6177, 0.3070,
0.1516],
[-0.6695, -0.3900, 0.1691, 0.4307, 0.3949, 0.4150, 0.4912,
0.5293],
[-0.7177, -0.4553, 0.0694, 0.2309, 0.0291, 0.0502, 0.2944,
0.4165],
[-0.7648, -0.7126, -0.6082, -0.5603, -0.5690, -0.4767, -0.2834,
-0.1868],
[-0.7883, -0.8412, -0.9470, -0.9559, -0.8681, -0.7402, -0.5724,
-0.4884]],
[[ 0.0559, -0.1686, -0.6175, -0.7597, -0.5952, -0.2493, 0.2779,
0.5415],
[-0.1649, -0.2402, -0.3906, -0.4101, -0.2987, -0.0669, 0.2855,
0.4616],
[-0.6066, -0.3833, 0.0632, 0.2890, 0.2941, 0.2980, 0.3007,
0.3020],
[-0.4112, -0.1387, 0.4062, 0.5824, 0.3899, 0.3066, 0.3327,
0.3457],
[ 0.4211, 0.4936, 0.6385, 0.4702, -0.0114, -0.0409, 0.3815,
0.5928],
[ 0.6312, 0.6170, 0.5885, 0.3069, -0.2280, -0.2038, 0.3793,
0.6709],
[ 0.2190, 0.2314, 0.2562, 0.0924, -0.2599, -0.1821, 0.3260,
0.5800],
[ 0.0129, 0.0386, 0.0900, -0.0148, -0.2759, -0.1712, 0.2993,
0.5346]]]])
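Since upsample() is deprecated, the same operation can be performed with interpolate(); a minimal example reusing inT from above is shown below.
outT2 = F.interpolate(inT, size=(8,8), mode="nearest")
print(outT2.shape) # torch.Size([1, 2, 8, 8])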
As noted above, the upsampling method is not a convolutional, or “deconvolutional”, operation since no kernels are applied. Since there are no kernels and associated weights, there are no trainable parameters associated with upsampling.
The conv_transpose2d() function allows for upsampling a tensor in the spatial dimensions using kernels that have trainable weights. If transpose convolution is applied with a kernel size of 2x2 and a stride of 2, the size of the tensor in the spatial dimensions is doubled. Practically, this operation can restore the spatial dimensions that a tensor had before max pooling with a 2x2 kernel and a stride of 2 was applied. To be clear, 2D transpose convolution does not undo prior convolutional operations. Instead, it learns new weights as it scales up the data.
As demonstrated using the code below, 2D transpose convolution with a kernel size of 2x2 and a stride of 2 converts a tensor of shape (1,1,9,9) to a shape of (1,1,18,18). Since weights are applied, it requires that both an input tensor and kernel be provided.
inT = torch.rand(1,1,9,9)
inT
tensor([[[[0.2038, 0.3857, 0.2131, 0.2192, 0.3941, 0.7336, 0.9421, 0.9957,
0.9204],
[0.6419, 0.0673, 0.5942, 0.8745, 0.7517, 0.4133, 0.7754, 0.5427,
0.1905],
[0.1336, 0.4876, 0.1040, 0.8921, 0.3560, 0.8355, 0.1151, 0.6093,
0.7128],
[0.7074, 0.6405, 0.5839, 0.3738, 0.3379, 0.3583, 0.8562, 0.9661,
0.6771],
[0.0058, 0.0299, 0.6825, 0.5826, 0.1684, 0.1118, 0.5335, 0.5647,
0.9345],
[0.8992, 0.1684, 0.4020, 0.8129, 0.9448, 0.2247, 0.7999, 0.4484,
0.1818],
[0.3697, 0.0610, 0.8861, 0.7284, 0.4945, 0.5020, 0.4667, 0.5411,
0.7651],
[0.0045, 0.5754, 0.6614, 0.6216, 0.8295, 0.8961, 0.4114, 0.1198,
0.8199],
[0.6839, 0.9587, 0.1226, 0.8781, 0.3175, 0.7190, 0.1138, 0.5664,
0.8930]]]])
inW = torch.rand(1,1,2,2)
inW
tensor([[[[0.1218, 0.6238],
[0.2795, 0.0839]]]])
outT = F.conv_transpose2d(inT, inW, stride=2, padding=0)
outT.shape
torch.Size([1, 1, 18, 18])
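The output size of a transpose convolution can also be worked out directly. Below is a minimal sketch; the conv_transpose_out_size() helper is mine, not part of PyTorch, and it ignores the output_padding and dilation arguments. The spatial output size is (in - 1) * stride - 2 * padding + kernel, which gives 18 here.
def conv_transpose_out_size(in_size, kernel, stride=1, padding=0):
    # (in - 1) * stride - 2 * padding + kernel
    return (in_size - 1) * stride - 2 * padding + kernel

print(conv_transpose_out_size(9, 2, stride=2, padding=0)) # 18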
In the example below, I have created a new 2x2 kernel in which the bottom-right value is 1 and all other values are 0. When applying this kernel, you can see that it results in expanding the array by adding rows and columns of zeros. When the weights are not set to zero, these added rows and columns can hold non-zero values. The kernel weights are trainable, resulting in the model learning how to upsample the image as opposed to simply applying an interpolation algorithm.
In the next example, I have performed 2D transpose convolution using a 2x2 kernel in which all weights are set to 1. This result is equivalent to performing nearest neighbor interpolation.
In later modules, you will see examples of deconvolution in action in the context of semantic segmentation. I will primarily rely on 2D transpose convolution as opposed to upsampling.
inW = torch.tensor([[[[0., 0.], [0., 1.]]]])
outT = F.conv_transpose2d(inT, inW, stride=2, padding=0)
outT.shape
torch.Size([1, 1, 18, 18])
outT
tensor([[[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.2038, 0.0000, 0.3857, 0.0000, 0.2131, 0.0000, 0.2192,
0.0000, 0.3941, 0.0000, 0.7336, 0.0000, 0.9421, 0.0000, 0.9957,
0.0000, 0.9204],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.6419, 0.0000, 0.0673, 0.0000, 0.5942, 0.0000, 0.8745,
0.0000, 0.7517, 0.0000, 0.4133, 0.0000, 0.7754, 0.0000, 0.5427,
0.0000, 0.1905],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.1336, 0.0000, 0.4876, 0.0000, 0.1040, 0.0000, 0.8921,
0.0000, 0.3560, 0.0000, 0.8355, 0.0000, 0.1151, 0.0000, 0.6093,
0.0000, 0.7128],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.7074, 0.0000, 0.6405, 0.0000, 0.5839, 0.0000, 0.3738,
0.0000, 0.3379, 0.0000, 0.3583, 0.0000, 0.8562, 0.0000, 0.9661,
0.0000, 0.6771],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.0058, 0.0000, 0.0299, 0.0000, 0.6825, 0.0000, 0.5826,
0.0000, 0.1684, 0.0000, 0.1118, 0.0000, 0.5335, 0.0000, 0.5647,
0.0000, 0.9345],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.8992, 0.0000, 0.1684, 0.0000, 0.4020, 0.0000, 0.8129,
0.0000, 0.9448, 0.0000, 0.2247, 0.0000, 0.7999, 0.0000, 0.4484,
0.0000, 0.1818],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.3697, 0.0000, 0.0610, 0.0000, 0.8861, 0.0000, 0.7284,
0.0000, 0.4945, 0.0000, 0.5020, 0.0000, 0.4667, 0.0000, 0.5411,
0.0000, 0.7651],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.0045, 0.0000, 0.5754, 0.0000, 0.6614, 0.0000, 0.6216,
0.0000, 0.8295, 0.0000, 0.8961, 0.0000, 0.4114, 0.0000, 0.1198,
0.0000, 0.8199],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000],
[0.0000, 0.6839, 0.0000, 0.9587, 0.0000, 0.1226, 0.0000, 0.8781,
0.0000, 0.3175, 0.0000, 0.7190, 0.0000, 0.1138, 0.0000, 0.5664,
0.0000, 0.8930]]]])
inW = torch.tensor([[[[1., 1.], [1., 1.]]]])
outT = F.conv_transpose2d(inT, inW, stride=2, padding=0)
outT.shape
torch.Size([1, 1, 18, 18])
outT
tensor([[[[0.2038, 0.2038, 0.3857, 0.3857, 0.2131, 0.2131, 0.2192, 0.2192,
0.3941, 0.3941, 0.7336, 0.7336, 0.9421, 0.9421, 0.9957, 0.9957,
0.9204, 0.9204],
[0.2038, 0.2038, 0.3857, 0.3857, 0.2131, 0.2131, 0.2192, 0.2192,
0.3941, 0.3941, 0.7336, 0.7336, 0.9421, 0.9421, 0.9957, 0.9957,
0.9204, 0.9204],
[0.6419, 0.6419, 0.0673, 0.0673, 0.5942, 0.5942, 0.8745, 0.8745,
0.7517, 0.7517, 0.4133, 0.4133, 0.7754, 0.7754, 0.5427, 0.5427,
0.1905, 0.1905],
[0.6419, 0.6419, 0.0673, 0.0673, 0.5942, 0.5942, 0.8745, 0.8745,
0.7517, 0.7517, 0.4133, 0.4133, 0.7754, 0.7754, 0.5427, 0.5427,
0.1905, 0.1905],
[0.1336, 0.1336, 0.4876, 0.4876, 0.1040, 0.1040, 0.8921, 0.8921,
0.3560, 0.3560, 0.8355, 0.8355, 0.1151, 0.1151, 0.6093, 0.6093,
0.7128, 0.7128],
[0.1336, 0.1336, 0.4876, 0.4876, 0.1040, 0.1040, 0.8921, 0.8921,
0.3560, 0.3560, 0.8355, 0.8355, 0.1151, 0.1151, 0.6093, 0.6093,
0.7128, 0.7128],
[0.7074, 0.7074, 0.6405, 0.6405, 0.5839, 0.5839, 0.3738, 0.3738,
0.3379, 0.3379, 0.3583, 0.3583, 0.8562, 0.8562, 0.9661, 0.9661,
0.6771, 0.6771],
[0.7074, 0.7074, 0.6405, 0.6405, 0.5839, 0.5839, 0.3738, 0.3738,
0.3379, 0.3379, 0.3583, 0.3583, 0.8562, 0.8562, 0.9661, 0.9661,
0.6771, 0.6771],
[0.0058, 0.0058, 0.0299, 0.0299, 0.6825, 0.6825, 0.5826, 0.5826,
0.1684, 0.1684, 0.1118, 0.1118, 0.5335, 0.5335, 0.5647, 0.5647,
0.9345, 0.9345],
[0.0058, 0.0058, 0.0299, 0.0299, 0.6825, 0.6825, 0.5826, 0.5826,
0.1684, 0.1684, 0.1118, 0.1118, 0.5335, 0.5335, 0.5647, 0.5647,
0.9345, 0.9345],
[0.8992, 0.8992, 0.1684, 0.1684, 0.4020, 0.4020, 0.8129, 0.8129,
0.9448, 0.9448, 0.2247, 0.2247, 0.7999, 0.7999, 0.4484, 0.4484,
0.1818, 0.1818],
[0.8992, 0.8992, 0.1684, 0.1684, 0.4020, 0.4020, 0.8129, 0.8129,
0.9448, 0.9448, 0.2247, 0.2247, 0.7999, 0.7999, 0.4484, 0.4484,
0.1818, 0.1818],
[0.3697, 0.3697, 0.0610, 0.0610, 0.8861, 0.8861, 0.7284, 0.7284,
0.4945, 0.4945, 0.5020, 0.5020, 0.4667, 0.4667, 0.5411, 0.5411,
0.7651, 0.7651],
[0.3697, 0.3697, 0.0610, 0.0610, 0.8861, 0.8861, 0.7284, 0.7284,
0.4945, 0.4945, 0.5020, 0.5020, 0.4667, 0.4667, 0.5411, 0.5411,
0.7651, 0.7651],
[0.0045, 0.0045, 0.5754, 0.5754, 0.6614, 0.6614, 0.6216, 0.6216,
0.8295, 0.8295, 0.8961, 0.8961, 0.4114, 0.4114, 0.1198, 0.1198,
0.8199, 0.8199],
[0.0045, 0.0045, 0.5754, 0.5754, 0.6614, 0.6614, 0.6216, 0.6216,
0.8295, 0.8295, 0.8961, 0.8961, 0.4114, 0.4114, 0.1198, 0.1198,
0.8199, 0.8199],
[0.6839, 0.6839, 0.9587, 0.9587, 0.1226, 0.1226, 0.8781, 0.8781,
0.3175, 0.3175, 0.7190, 0.7190, 0.1138, 0.1138, 0.5664, 0.5664,
0.8930, 0.8930],
[0.6839, 0.6839, 0.9587, 0.9587, 0.1226, 0.1226, 0.8781, 0.8781,
0.3175, 0.3175, 0.7190, 0.7190, 0.1138, 0.1138, 0.5664, 0.5664,
0.8930, 0.8930]]]])
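As a final check of the claim above (this snippet is mine and reuses inT and the last outT), transpose convolution with an all-ones 2x2 kernel and a stride of 2 should match nearest neighbor upsampling by a factor of 2.
nn_up = F.interpolate(inT, scale_factor=2, mode="nearest") # nearest neighbor upsampling
print(torch.allclose(outT, nn_up)) # True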
Concluding Remarks
Now that we have investigated the common layers used to build CNNs, we will move on to actually creating them. In the next section relating to CNNs for scene labeling, we will primarily make use of nn.Conv2d() and nn.MaxPool2d() along with layers we have investigated while implementing fully connected neural networks: nn.Linear(), nn.ReLU(), and nn.BatchNorm2d(). In later modules you will see applications of nn.ConvTranspose2d() for upsampling.