In an earlier post, I explained convolution and deconvolution in deep neural networks. The purpose of this post is to demonstrate these operations using PyTorch. Before doing that, we will review the different operations associated with a convolution.
Convolution is an operation on two functions of real-valued arguments. One of these functions, considered a signal, is an n-dimensional array of numbers, for example a 3-dimensional array of numbers representing a color image. The second function is a kernel or filter whose size is typically much smaller than the input array size. The array representing the kernel function is called the kernel mask. The purpose of the convolution operation is to transform the input into a new array with the aim of highlighting some property of the input array. Thus, convolution can be viewed as feature extraction, and the transformed array is often called a feature map, where feature implies a particular characteristic of the input extracted by the kernel.
The convolution operation is performed by moving the kernel mask over the signal array and calculating the kernel response at each location. To understand the convolution operation, let’s consider a 3-dimensional input array representing the red, green, and blue channels of a colored image patch and a 3×3 convolution filter as shown below. To perform convolution at a particular position of the input array, we place the center of the convolution mask at the desired position and perform element by element multiplication between the signal array elements and the convolution mask elements followed by summation for each input channel as shown in the figure below. The responses from three channels are then added to produce the output of the convolution operation. The response over the entire input array is obtained by moving the mask center one step at a time and repeating the calculations.
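The multiply-and-sum step described above can be sketched in NumPy. This is only an illustration under stated assumptions: the 8×8 patch size, the random values, and the helper name `conv_at` are hypothetical, and the mask is applied without flipping, as convolution layers do.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: a 3-channel 8x8 patch and a mask with one 3x3 slice per channel
patch = rng.integers(0, 256, size=(3, 8, 8)).astype(float)
mask = rng.standard_normal((3, 3, 3))

def conv_at(signal, kernel, row, col):
    """Response with the kernel centered at (row, col); no padding, no flipping."""
    half_h, half_w = kernel.shape[1] // 2, kernel.shape[2] // 2
    window = signal[:, row - half_h:row + half_h + 1, col - half_w:col + half_w + 1]
    per_channel = (window * kernel).sum(axis=(1, 2))  # one response per channel
    return per_channel.sum()                          # add the channel responses

# Move the mask center over every position where the mask fits entirely
feature_map = np.array([[conv_at(patch, mask, r, c) for c in range(1, 7)]
                        for r in range(1, 7)])
print(feature_map.shape)  # (6, 6)
```

Each entry of `feature_map` is a single number that already combines the red, green, and blue responses.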
Looking at the above figure, we see that we cannot place the center of the kernel mask anywhere in the top or bottom row or in the leftmost or rightmost column; doing so would place part of the mask outside the input array. However, if we pad the input array with an additional row at the top and bottom, and with an additional column on the left and right, with all element values being zero, then we can place the center of the convolution mask at every position of the input array, including the top and bottom rows and the leftmost and rightmost columns. Adding extra rows/columns is what is meant by padding in convolution. Without padding, the result of convolution for the above example would be a 6×6 feature map. With padding, the result would be 8×8, the same as the input array size. Although the mask used in the example here is a square mask, the mask height (H) need not equal the mask width (W). It is easy to see that we must add (H-1)/2 rows on top and bottom of the input, and (W-1)/2 columns on each side of the input, to keep the feature map size identical to the input array size. [These numbers for padding assume H and W to be odd integers, which is common.]
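The padding arithmetic can be checked with a few lines of plain Python. The function names `same_padding` and `output_size` are hypothetical helpers introduced for this sketch, and a stride of 1 is assumed.

```python
def same_padding(mask_h, mask_w):
    """Zero rows/columns to add on each side (odd mask sizes assumed)."""
    return (mask_h - 1) // 2, (mask_w - 1) // 2

def output_size(n, k, pad=0):
    """Feature-map length along one axis when the mask moves one cell at a time."""
    return n + 2 * pad - k + 1

pad_rows, pad_cols = same_padding(3, 3)
print(pad_rows, pad_cols)              # 1 1
print(output_size(8, 3))               # 6 (no padding)
print(output_size(8, 3, pad=pad_rows)) # 8 (padded, same as the input)
```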
Stride and Dilation
We generally move the kernel mask over the input array one pixel at a time. However, we can skip a pixel or two when moving the mask. The parameter stride determines how the mask is moved during convolution. A stride of 1 means moving to the next pixel with no skipping of pixels/cells, and a stride of 2 means moving by two pixels. A stride value other than the default value of 1 means the convolution response will be calculated at fewer positions. This means the size of the resulting feature map will be smaller than the input even with padding. Thus, setting a suitable value for stride allows us to down sample the convolution result. The figure below shows the positions where a 3×3 mask would be placed with the default stride value of 1 (blue cells) and with a stride value of 2 (cells marked with X), when there is no padding. Clearly, a stride of 2 will down sample the input to produce a smaller feature map.
Another convolution layer parameter is dilation. This parameter is used to enlarge the mask so that convolution is applied over a larger area. This is different from using a larger kernel mask to start with, because the number of mask weights stays the same. The figure below illustrates how a 3×3 mask would be enlarged for a dilation of 2. The original 3×3 mask is considered to have a dilation of 1, which means the mask elements are adjacent to each other. The mask on the right is the dilated version of the mask on the left. As you can see, dilating the convolution mask skips a certain number of input array elements while computing convolution. The main use of dilation is in semantic segmentation, where the larger area covered by each mask helps produce better-quality output.
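To make the effect of stride concrete, the short sketch below lists where the top-left corner of the mask lands along one axis. The helper `mask_positions` is a hypothetical name, and the 8-cell axis is assumed from the earlier 8×8 example.

```python
def mask_positions(n, k, stride=1):
    """Top-left indices along one axis where a k-wide mask fits in n cells (no padding)."""
    return list(range(0, n - k + 1, stride))

print(mask_positions(8, 3, stride=1))  # [0, 1, 2, 3, 4, 5]
print(mask_positions(8, 3, stride=2))  # [0, 2, 4]
```

The number of positions is the feature-map length along that axis: 6 with a stride of 1 and 3 with a stride of 2.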
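One way to picture a dilated mask is to write it out as a larger array with zeros between the original elements. The sketch below does this in NumPy; `dilate_mask` is a hypothetical helper, and real libraries typically skip the input cells directly rather than materializing the zeros.

```python
import numpy as np

def dilate_mask(mask, dilation):
    """Spread mask elements apart by inserting (dilation - 1) zeros between them."""
    k_h, k_w = mask.shape
    size_h = dilation * (k_h - 1) + 1   # effective mask height
    size_w = dilation * (k_w - 1) + 1   # effective mask width
    dilated = np.zeros((size_h, size_w), dtype=mask.dtype)
    dilated[::dilation, ::dilation] = mask
    return dilated

mask = np.arange(1, 10).reshape(3, 3)
print(dilate_mask(mask, 2).shape)  # (5, 5)
print(dilate_mask(mask, 2))
```

A 3×3 mask with a dilation of 2 therefore covers a 5×5 area while still having only nine weights.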
Pooling and Rectification
A typical convolutional neural network (CNN) is used for classification. In such a network, you will find a large number of convolution layers. Since convolution is a linear operation, we need to insert some nonlinearity between two consecutive convolution layers. Thus, the output of a convolution layer is rectified by running it through a ReLU (Rectified Linear Unit), which replaces negative values with zero. The rectified output of each convolutional layer is followed by a pooling layer whose task is to down sample the convolution result. This is done by replacing a block of convolution layer cells with a single cell. For example, the convolution layer output can be divided into adjacent 2×2 blocks, each of which is replaced by its average. This is called average pooling. When a 2×2 block is replaced by the maximum value of the block, the resulting pooling is known as max pooling. Irrespective of the type of pooling used, the basic advantage of pooling is the resulting down sampling, which in turn speeds up the computation and reduces the variability in the data moving forward.
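Rectification and 2×2 pooling are simple enough to write out by hand. The following NumPy sketch uses a made-up 4×4 array as the convolution output; `relu` and `pool2x2` are hypothetical helper names, not PyTorch functions.

```python
import numpy as np

def relu(x):
    """Replace negative values with zero."""
    return np.maximum(x, 0)

def pool2x2(x, mode="max"):
    """Non-overlapping 2x2 pooling of a 2-D array with even height and width."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)   # blocks[i, a, j, b] == x[2i + a, 2j + b]
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

# Made-up convolution output
conv_out = np.array([[ 1., -2.,  3.,  0.],
                     [-1.,  4., -3.,  2.],
                     [ 0., -5.,  6., -1.],
                     [ 2.,  1., -2.,  8.]])
rectified = relu(conv_out)
print(pool2x2(rectified, "max"))  # [[4. 3.] [2. 8.]]
print(pool2x2(rectified, "avg"))  # [[1.25 1.25] [0.75 3.5 ]]
```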
Convolution Layer Visualization
With the above introduction to the different operations involved in a single convolution layer, let's put together a demo to show the effect of different parameters on the convolution operation. For the demo, let's first load an image to use.
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

pil_image = Image.open('data/chair.jpg')
plt.imshow(pil_image)
The image size is 240×180. Next, we import the necessary libraries. Since PyTorch operates on tensors, the image read earlier will be converted to a tensor. We are going to use four convolution filters. These will not be learned but set by defining a NumPy array. The code for this part, including the visualization of the filters, is shown below.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torchvision import datasets, models, transforms

# Transform PIL image to a tensor
transform = transforms.ToTensor()
img = transform(pil_image)

# Define filters
filter_array = np.array([[-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1],
                         [-1, -0.5, 0, 0.5, 1]])
filter_1 = filter_array
filter_2 = -filter_1
filter_3 = filter_1.T
filter_4 = -filter_3
filters = np.array([filter_1, filter_2, filter_3, filter_4])

# Visualize filters
fig = plt.figure(figsize=(12, 6))
fig.subplots_adjust(left=0, right=0.5, bottom=0.8, top=1, hspace=0.05, wspace=0.05)
for i in range(4):
    ax = fig.add_subplot(1, 4, i+1, xticks=[], yticks=[])
    ax.imshow(filters[i], cmap='hot')
    ax.set_title('Filter %s' % str(i+1))
Now, we will set up a two-layer convolution network to perform convolution. The code for this is given below.
class DemoNet(nn.Module):
    def __init__(self, wt1, wt2):
        super(DemoNet, self).__init__()
        # First conv layer: 3 input channels (RGB), 4 output channels
        self.conv1 = nn.Conv2d(3, 4, kernel_size=5, stride=1, dilation=1, bias=False)
        # Initialize the layer weights with the 4 defined filters. The filters must be
        # assigned to .weight to take effect; each filter is copied across all
        # input channels so the shape matches (out_channels, in_channels, 5, 5).
        self.conv1.weight = torch.nn.Parameter(wt1.expand(-1, 3, -1, -1).clone())
        # Define a pooling layer
        self.pool1 = nn.MaxPool2d(2, 2)
        # Define another conv layer: 4 input channels, 4 output channels
        self.conv2 = nn.Conv2d(4, 4, kernel_size=5, bias=False)
        self.conv2.weight = torch.nn.Parameter(wt2.expand(-1, 4, -1, -1).clone())
        self.pool2 = nn.MaxPool2d(2, 2)

    def forward(self, x):
        # First layer: output before and after activation, then pooling
        conv1_x = self.conv1(x)
        activated1_x = F.relu(conv1_x)
        pooled1_x = self.pool1(activated1_x)
        # Second layer: same sequence applied to the pooled output
        conv2_x = self.conv2(pooled1_x)
        activated2_x = F.relu(conv2_x)
        pooled2_x = self.pool2(activated2_x)
        # Return the outputs of all stages
        return conv1_x, activated1_x, pooled1_x, conv2_x, activated2_x, pooled2_x
Next, we define a function that will be used to visualize the output of the convolution layer filters.
def visualize_layer(layer, n_filters=4):
    fig = plt.figure(figsize=(12, 12))
    for i in range(n_filters):
        ax = fig.add_subplot(1, n_filters, i+1)
        # Detach from the graph before converting the i-th feature map to NumPy
        ax.imshow(np.squeeze(layer[0, i].detach().numpy()))
        ax.set_title('Filter %s' % str(i+1))
Now, we are ready to instantiate our network, feed the input image, and compute the output at different layers. Just to make the second layer filters different, we will add a random perturbation as shown below.
# Filters as float tensors of shape (4, 1, 5, 5)
wt1 = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor)
# Second-layer filters: the same filters plus a random perturbation
wt2 = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor) + 0.5*torch.randn(5, 5)
model = DemoNet(wt1, wt2)

# Compute output (unsqueeze adds the batch dimension)
conv1_x, activated1_layer, pooled1_layer, conv2_x, activated2_layer, pooled2_layer = model(img.unsqueeze(0))
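It can help to work out the expected spatial size of each stage before looking at the images. The sketch below does this in plain Python, assuming the 240×180 image becomes a tensor with 180 rows and 240 columns, 5×5 masks with no padding, and 2×2 max pooling; `conv_size`, `pool_size`, and `trace_demo_net` are hypothetical helpers, and the size formula follows the one given in the PyTorch documentation for `Conv2d`.

```python
def conv_size(n, k, stride=1, dilation=1, pad=0):
    """Output length along one axis of a convolution layer."""
    return (n + 2 * pad - dilation * (k - 1) - 1) // stride + 1

def pool_size(n, k=2, stride=2):
    """Output length along one axis of a pooling layer (no padding)."""
    return (n - k) // stride + 1

def trace_demo_net(height, width, stride=1, dilation=1):
    """Spatial size after each conv and pool stage of a two-layer network with 5x5 masks."""
    shapes = {}
    h, w = height, width
    for layer in ("conv1", "pool1", "conv2", "pool2"):
        if layer.startswith("conv"):
            h, w = conv_size(h, 5, stride, dilation), conv_size(w, 5, stride, dilation)
        else:
            h, w = pool_size(h), pool_size(w)
        shapes[layer] = (h, w)
    return shapes

print(trace_demo_net(180, 240))
# {'conv1': (176, 236), 'pool1': (88, 118), 'conv2': (84, 114), 'pool2': (42, 57)}
```

Calling `trace_demo_net(180, 240, stride=2)` or `trace_demo_net(180, 240, dilation=3)` gives the corresponding sizes when both convolution layers use that stride or dilation.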
Let's now visualize the output of the first convolution layer. The first row below shows the outputs of the four filters before rectification. The second row of four images is the output of the first convolution layer after rectification.
Looking at the images in the first row, we notice that filters 1 and 2 produce complementary responses, as do filters 3 and 4. Further, some of the image features are highlighted in the rectified output. Next, we visualize the second convolution layer in a similar manner.
Although all images are displayed at the same size, the tick marks on the axes indicate that the images at the output of the second layer filters are about half the input image size because of pooling. The mostly dark image in the second row indicates that filter 2 is producing mostly negative values, which are rectified to 0. To see how changing the stride value from 1 to 2 changes the output, we set the stride to 2 and run the network again. The first row of images shows the rectified output from the first layer, and the second row shows the rectified output from the second layer, with the updated stride value of 2 for both layers. With a stride of 2, the output at the second layer is heavily down sampled.
Now, let's see the effect of dilation. With a dilation value of 3, the results at the first and second layers after rectification appear as shown below. In this case, image features appear more prominently than in the output without dilation.
1×1 Convolution
A 1×1 convolution is often confusing because its utility is not obvious. Applied to a single-channel image, a 1×1 convolution only scales the pixel values by the 1×1 convolution weight, so it is unclear what benefit such a convolution might offer. To understand the benefit, let's consider m input channels over which a 1×1 convolution is to be applied. In this case, the 1×1 convolution operation can be expressed with the following equation, where input(k) stands for the k-th input channel and w_k for the weight applied to that channel:
$$\text{output} = \sum_{k=1}^{m} w_k \cdot \text{input}(k)$$
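As this equation indicates, 1×1 convolution aggregates the input channel values along the depth axis; thus it can be viewed as a convolution along depth, and it is commonly called pointwise convolution (not to be confused with depthwise convolution, which filters each channel separately). This is also illustrated in the figure below.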
The main use of 1×1 convolution is to reduce computation by reducing the number of channels before filtering. Suppose that at some intermediate stage in your convolution network you have 64 filtered images or feature maps of size 28×28 pixels, and you want to apply 16 different convolution masks of size 3×3 to this 28×28×64 input. Assuming padding keeps the output at 28×28, this will require 28*28*16*3*3*64 (7,225,344) multiply-accumulate operations. Instead of directly applying the 16 3×3 masks to the 64 incoming channels, we first reduce the input to 28×28×4 with four 1×1 convolution filters. This will require 28*28*4*1*1*64 (200,704) operations. Next, applying the 16 3×3 filters to the reduced input will require 28*28*4*3*3*16 (451,584) operations. Adding these two sets of operations gives 652,288, so the 1×1 reduction requires about 90% fewer operations.
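The operation counts above are easy to reproduce. In this sketch, `conv_ops` is a hypothetical helper that counts one multiply-accumulate per mask weight per output cell and assumes the 28×28 spatial size is preserved.

```python
def conv_ops(height, width, in_ch, out_ch, k):
    """Multiply-accumulate count for a k x k convolution that keeps height x width."""
    return height * width * out_ch * k * k * in_ch

direct = conv_ops(28, 28, in_ch=64, out_ch=16, k=3)       # 3x3 masks on all 64 channels
squeeze_ops = conv_ops(28, 28, in_ch=64, out_ch=4, k=1)   # 1x1 reduction to 4 channels
after_ops = conv_ops(28, 28, in_ch=4, out_ch=16, k=3)     # 3x3 masks on the 4 channels
saving = 1 - (squeeze_ops + after_ops) / direct

print(direct)                   # 7225344
print(squeeze_ops + after_ops)  # 652288
print(round(100 * saving, 1))   # 91.0
```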
Let's now perform 1×1 convolution on the output of our demo network. To do this, we add another convolution layer to our network and make the necessary changes to the network definition. The result of 1×1 convolution is then the feature map shown below. This output is the sum of the four rectified images after pooling at the second convolution layer of our demo network.
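The post does not show the modified network, so the following is only a sketch of the idea using the functional API. It assumes all four 1×1 weights are 1 (so the output is a plain sum of the channels) and uses a random tensor as a stand-in for `pooled2_layer`; the 42×57 size is likewise an assumption.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for pooled2_layer: (batch, channels, height, width)
pooled = torch.rand(1, 4, 42, 57)
# One output channel, four input channels, 1x1 mask with all weights set to 1 (assumed)
ones_1x1 = torch.ones(1, 4, 1, 1)

merged = F.conv2d(pooled, ones_1x1)
print(merged.shape)                                         # torch.Size([1, 1, 42, 57])
print(torch.allclose(merged[0, 0], pooled[0].sum(dim=0)))   # True
```

With other weight values the same call produces a weighted sum of the four channels instead of a plain sum.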
Deconvolution
The use of the term deconvolution in deep learning is different from its meaning in signal and image processing. While convolution without padding results in a smaller output, deconvolution increases the output size. With stride values greater than 1, deconvolution is used as a way of up sampling the data stream. This appears to be its main use in deep learning. Both the convolution and deconvolution operations in deep learning are implemented as matrix multiplication operations, and the deconvolution is really a transposed convolution. I would direct interested readers to my earlier post on this topic, where I explain how convolution and deconvolution operations are carried out as matrix multiplications. Here I will just show the result of the deconvolution operation performed on the output of our two-layer demo network. The first image is with the default stride value of 1 and the second image is with a stride value of 2. In the second image, the size of the original image has been approximately recovered. I use the following code to perform the deconvolution operation:
inputs_to_decon = activated2_layer
# Transposed convolution with the second-layer filters; stride=2 up samples the input
decon = F.conv_transpose2d(inputs_to_decon, wt2, padding=1, stride=2)
plt.imshow(np.squeeze(decon.detach().numpy()), cmap='hot')
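The output size of the transposed convolution can be estimated with the size formula given in the PyTorch documentation for `ConvTranspose2d`. The sketch below assumes the rectified second-layer output is 84×114, which is what a 180×240 input would give with the default stride and dilation; `conv_transpose_size` is a hypothetical helper name.

```python
def conv_transpose_size(n, k, stride=1, pad=0, dilation=1, output_pad=0):
    """Output length along one axis of a transposed convolution."""
    return (n - 1) * stride - 2 * pad + dilation * (k - 1) + output_pad + 1

for stride in (1, 2):
    h = conv_transpose_size(84, 5, stride=stride, pad=1)
    w = conv_transpose_size(114, 5, stride=stride, pad=1)
    print(stride, (h, w))
# 1 (86, 116)
# 2 (169, 229)
```

Under these assumptions, a stride of 2 roughly doubles each dimension and brings the output close to the 180×240 input size.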
As we can see from the above discussion, there are various parameter choices available in the convolution layer that can be used to control up or down sampling of the data as it moves through the numerous layers of a deep convolutional neural network. Before closing this post, I want to point out that the actual operation in the convolution layer is not really convolution but cross-correlation. However, the term convolution has come to be accepted and used because the convolution masks are not pre-specified, as they were in this example, but are learned. Since the difference between convolution and correlation is whether or not the kernel mask is flipped before it is applied, one can argue that the masks used are flipped versions of the actual learned masks.
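The relationship between the two operations can be verified on a small example. In this NumPy sketch (the helper names are hypothetical), true convolution is implemented as cross-correlation with a mask flipped in both directions, and an impulse image is used so that the output directly reveals which mask orientation was applied.

```python
import numpy as np

def cross_correlate2d(x, k):
    """Slide k over x without flipping (what a convolution layer computes)."""
    out_h, out_w = x.shape[0] - k.shape[0] + 1, x.shape[1] - k.shape[1] + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (x[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
    return out

def convolve2d(x, k):
    """True convolution: flip the mask along both axes, then cross-correlate."""
    return cross_correlate2d(x, np.flip(k))

impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0                      # single bright pixel at the center
k = np.arange(1.0, 10.0).reshape(3, 3)   # asymmetric mask

print(np.array_equal(convolve2d(impulse, k), k))                  # True
print(np.array_equal(cross_correlate2d(impulse, k), np.flip(k)))  # True
```

Convolving the impulse reproduces the mask itself, while cross-correlating it produces the flipped mask; for a symmetric mask the two results coincide.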