Understanding Compression of Convolutional Neural Nets: Part 1

There are neural network applications where the resources, for example compute power, memory space, battery power etc., are limited. In such instances, we may want to simplify a trained network to better deal with available resources. While simplifying, we obviously would like to maintain the network accuracy as much as possible. Such a network simplification is termed compression of deep neural networks. In this three-part blog posts, I am going to share with you some simple examples of neural network compression to help you better understand compressing deep neural nets. In Part 1, I will begin with a technique, known as singular value decomposition (SVD), which is the basis of compression methods. Part 2 will show compression of feedforward neural networks and Part 3 will demonstrate compression of convolution layers in convolutional neural networks (CNNs).

Singular Value Decomposition (SVD)

Let’s begin our discussion on SVD by recalling that the rank of a matrix is the number of independent columns or rows. For example, the matrix

\bold A = \begin{bmatrix}1 & 2\\3 & 4\end {bmatrix}

has rank 2 as its both columns are independents, i.e. one column/row is not a scaled version of the other column. On the other hand, the following matrix is a matrix of rank 1 as the second column/row is same as the first column multiplied by 2.

\bold A = \begin{bmatrix}1 & 2\\2 & 4\end {bmatrix}

Given a matrix, sometimes you may want to obtain its approximation. Use of such approximations in common in numerous tasks such as dimensionality reduction, information retrieval and text mining, and in image processing. The singular value decomposition (SVD) is one such method for obtaining low rank approximations of a given matrix. Another popular matrix decomposition method is non-negative matrix factorization (NMF). You can read about it in my earlier blog post.
In SVD, a real-valued matrix A of size m x n is factorized as

\bold A = \bold {UDV}^t

where \bold U is an orthogonal matrix of size m x m of left singular vectors and \bold V is an orthogonal matrix of size n x n of right singular vectors. The matrix \bold D is a diagonal matrix of size m x n of singular values. A low rank approximation to matrix A of rank r is obtained by using only a subset of singular values and the corresponding left and right singular vectors as given by the following expression. In other words, the approximation is obtained by the weighted sum of rank one matrices.

\hat{ \bold A} = \sum\limits_{j=1}\limits^{k} d_{jj}\bold U_j\bold V^t,\text{   }k\leq r

Let me give you an example by decomposing a matrix and obtaining its different approximations.

from scipy import linalg
import numpy as np
# Create a 4x3 matrix
A = np.array([[6,6,2],[0,1,3],[4,0,1],[0,6,2]])
# Obtain SVD decomposition
U,d,Vt = linalg.svd(A) 
print(' U matrix\n',U,'\n d array\n',d,'\n V-Transpose matrix\n',Vt)

This code creates a 4×3 matrix A and decomposes it to obtain U and V matrices along with an array d of singular values as shown below. Note that the array of d singular values represents the diagonal elements of matrix D of the SVD equation given earlier. In this case, there are three non-zero singular values; this means that the rank of matrix A is three.

 U matrix
 [[-0.82 -0.28  0.21 -0.46]
 [-0.16  0.2  -0.93 -0.26]
 [-0.24 -0.62 -0.28  0.69]
 [-0.5   0.7   0.1   0.5 ]] 
 d array
 [10.51  5.06  2.63] 
 V-Transpose matrix
 [[-0.56 -0.77 -0.32]
 [-0.83  0.54  0.16]
 [ 0.05  0.36 -0.93]]

Given U, V, and d, we can get back our original matrix A without any error by performing the matrix multiplication \bold {UDV}^t.

D = np.zeros((4,3))# (4,3) is the size of matrix A
for i in range(len(d)):
    D[i,i]= d[i]
Aa = U@(D@Vt)# Aa is the reconstructed matrix
print(' Reconstructed Matrix An',Aa)
Reconstructed Matrix A
[[6. 6. 2.]
[0. 1. 3.]
[4. 0. 1.]
[0. 6. 2.]]

The reconstructed matrix is identical to the original matrix. Suppose we reconstruct matrix A using only the first two singular values. We do that by replacing  ‘len(d)’ above with  ‘len(d)-1’ and get the following approximated matrix A.

Reconstructed Matrix A
 [[ 5.97  5.8   2.51]
 [ 0.12  1.87  0.72]
 [ 4.04  0.26  0.31]
 [-0.01  5.91  2.25]]

Since this approximation is obtained by summing two rank one matrices, you can easily verify the rank of this matrix as 2. You can also calculate the approximation error by comparing the original and the reconstructed matrix element by element , squaring the difference, and adding them, as shown below. This measure of comparing two matrices is known as Frobenius norm.

from numpy.linalg import matrix_rank
err = np.sum(np.square(F-Aa))
print(' Approximation Error = ',err)
Approximation Error = 6.904768216213785

If you look at the approximation error, you will find that it equals square of 2.63, the singular value not used in reconstructing the approximation. You can also approximate matrix A by using only one singular value. We do this by replacing ‘len(d)’ with ‘len(d)-2’ and get the following approximation.

Reconstructed Matrix A
 [[4.79 6.57 2.75]
 [0.96 1.32 0.55]
 [1.43 1.95 0.82]
 [2.92 4.01 1.67]]

You can verify that the error in this case would be the sum of squares of the second and third singular values not used in creating approximation.

Truncated SVD

The singular value decomposition described above creates an approximate matrix whose size remains the same. In some cases, the matrix A to be decomposed is treated as a data matrix with each row representing an observation and each column representing an observed feature. In such a scenario, the objective is not only to approximate but also to achieve a reduction in the size of the approximated matrix, i.e. the resulting matrix has same number of rows (observations) but a fewer number of columns. Thus, the matrix approximation is aimed at achieving dimensionality reduction similar to PCA.  The use of SVD in this context is known as truncated SVD. To obtain a reduced approximation matrix, we use only top k < r singular values and the first k columns of the U matrix and calculate the reduced representation as

\bold A_k= \bold U_{k}\bold D_{k},\text{   }k\leq r

where \bold {U}_k is the sub-matrix formed by the first k columns of U and \bold {D}_k is the sub-matrix of D formed by first k rows and columns.

An example of calculating the truncated SVD is shown below.

k = 2
Ak = U[:,0:k]@D[0:k,0:k]
print('Reduced matrix Ak \n',Ak)
Reduced matrix Ak 
 [[-8.58 -1.43]
 [-1.73  1.02]
 [-2.55 -3.15]
 [-5.23  3.54]]

The Scikit-learn library also provides an implementation of truncated SVD as shown below.

from sklearn.decomposition import TruncatedSVD
dim_red = TruncatedSVD(k, algorithm = 'arpack')
Ak = dim_red.fit_transform(A)


Before concluding this post, let me show you two examples of SVD applications. The first example is of image compression wherein  I have performed singular value decomposition on an image and reconstructed it using 32, 16, and 8 top singular values. The results of this reconstruction along with the original image are shown below. As you can see, the first two approximations provide a good compression without satisfying much of the image quality.

The second example is of truncated SVD application. In this case, the truncated SVD has been used to reduce the Iris data matrix to a two-dimensional space, thus doing dimensionality reduction.

from sklearn.datasets import load_iris
iris = load_iris()
iris_data = iris.data
iris_target = iris.target
iris_transformed = dim_red.fit_transform(iris_data)
plt.scatter(iris_transformed[:,0],iris_transformed[:,1], c = iris_target,label=cla)
plt.title('Plot of Iris Data in 2-Dimensional Transformed Space')

The dimensionality reduction shows that Iris data can be well separated in the transformed two-dimensional space.

With the above introduction to SVD, we are ready to move on to compressing neural networks and convolutional neural networks in the coming blogs.

3 thoughts on “Understanding Compression of Convolutional Neural Nets: Part 1

  1. Hey Krishan,

    I was wondering about the truncated SVD part: why did the V^T matrix disappear? We don’t get an accurate reduced representation of matrix A now, right? As I understand it, we want the columns of the original matrix A with the highest singular values (as they hold the most information). This means we have to get that part of the A matrix back, which isn’t the case right now (the values in Ak are nothing like the values in the original A). Please do let me know if I’m wrong or if you have any clarifications!

    Best regards,


    1. You are correct in that we don’t get an accurate reduced representation of matrix A in truncated SVD. The reason is that the truncated SVD is transforming the original data, matrix A, into a new feature space of lesser dimensionality. Being a different space, the numbers in this space look very different from the original space. If we need to go back to the original values, we would need to use the V matrix also. You may see some similarities with PCA here. There also the transformed data has no resemblance with the original numbers but we can get back to the original numbers or their approximated values by applying the inverse transformation.
      Hope this helps.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s