Understanding Tensors and Tensor Decompositions: Part 2

In Part 1 of this blog, I had explained what tensors are and some related terminology. Just to briefly recap, a tensor is a multidimensional array of numbers. Tensors provide a useful representation for data that involves multiple modalities of sensing or the data has multiple aspects in terms of time or space or both. In this and the next post, we are going to look at tensor decomposition and suggest some applications of tensor decomposition. We are all familiar with decomposition of matrices, for example PCA/SVD, and non-negative matrix factorization etc. Such decompositions provide insights to different data components as well as ways to compress data. Tensor decomposition yields similar benefits with multidimensional data.

The CP Decomposition

There are several ways of decomposing tensors. In this post, we will look at the CANDECOMP/PARAFAC decomposition, commonly referred as CP decomposition. Although this decomposition dates back to 1927 when it was introduced as canonical polyadic decomposition (CPD), it became popular in seventies under the names canonical decomposition (CANDECOMP) and parallel factors (PARAFAC) with applications in psychometrics and phonetics respectively.  The CP decomposition factorizes a tensor into a linear combination of rank one tensors. Thus, a tensor of order 3 can be decomposed as a sum of R rank one tensors as

\boldsymbol{\mathcal{X}}\approx \sum\limits_{r=1}\limits^{R}\bf{a}^{r}\circ \bf{b}^{r}\circ \bf{c}^{r},

where (\circ) represents the outer product. This is illustrated in the figure below.

Often, the vectors in rank one tensors are normalized to unit length. In such cases, the CP decomposition is expressed as

\boldsymbol{\mathcal{X}}\approx \sum\limits_{r=1}\limits^{R}\lambda_r\bf{a}^{r}\circ \bf{b}^{r}\circ \bf{c}^{r},

where {\lambda}_r is a scalar accounting for normalization. With such a decomposition, a tensor element x_{ijk} can be approximated as

x_{ijk}\approx \hat{x}_{ijk} = \sum\limits_{r=1}\limits^{R}\lambda_r a^{r}_i b^{r}_j c^{r}_k

The decomposition is performed by a minimization algorithm, known as alternating least squares (ALS). The basic gist of this algorithm  is to keep certain components of the solution fixed while finding the remaining components and then iterating the procedure by switching the components to be kept fixed.

An Example of CP Decomposition

Let me use an example to understand CP decomposition. Lets begin by generating 20 binary matrices of size 7×5 representing each of noisy digits 1 to 3. These matrices form a multidimensional array of size 60x7x5 that is treated as a tensor. I used the following code in R for generating the array.

> dig1  one  dig2  two  dig3  three  noise = 0.05
> X  for(i in 1:20) {
+ X[i, ,] <- (xor(one,matrix(rbinom(7*5,1,noise),7,5)))*1
+ X[i+20, ,] <- (xor(two,matrix(rbinom(7*5,1,noise),7,5)))*1
+ X[i+40, ,] <- (xor(three,matrix(rbinom(7*5,1,noise),7,5)))*1
+ }

Three examples of generated noisy digits are shown below.

I used the rTensor package for tensor decomposition. If you prefer Python, then use can use the TensorLy package. To start with, I just selected to decompose the tensor X with a single rank one tensor using the code shown below:

> library(rTensor)
> nc = 1# number of rank 1 components
> xdecomp  xdecomp$conv
[1] TRUE
> str(xdecomp$U)
List of 3
 $ : num [1:60, 1] 0.0146 0.0171 0.0119 0.0152 0.0157 ...
 $ : num [1:7, 1] -0.2276 -0.0986 -0.0681 -0.1522 -0.0913 ...
 $ : num [1:5, 1] -0.108 -0.233 -0.307 -0.224 -0.128
> attributes(xdecomp)
[1] "lambdas" "U" "conv" "est" "norm_percent" "fnorm_resid" "all_resids"
> attributes(xdecomp$U)
> xdecomp$lambdas
[1] 771.7821
> dim(xdecomp$U)
NULL> summary(xdecomp$U)
 Length Class Mode
[1,] 60 -none- numeric
[2,] 7 -none- numeric
[3,] 5 -none- numeric
> xdecomp$norm_percent
[1] 30.57773
> xdecomp$fnorm_resid
[1] 19.7213

The decomposition results in three vectors. The mode one vector has 60 components with each component corresponding to one of the noisy digits. A plot of these 60 values is shown below. It shows that the values in this tensor mode are barely different from each other.The other two mode vectors of size 1×7 and 1×5 basically define a two dimensional matrix that is common to all 60 digit patterns. This means that the CP decomposition is telling us to view all 60 digit patterns as scaled versions of a single two dimensional matrix. The value of xdecomp$norm_percent indicates the level of approximation achieved by the decomposition. In this case, it is 30.57773% showing a poor approximation; no surprise there because we are using just one rank one tensor for approximation. To see how the approximated digits look like, I reconstructed two examples from each category of digits 1 to 3. These are shown below. Clearly, the reconstructed digits look all alike because of poor approximation achieved by a single rank one tensor.

Next, I decided to increase the number of rank one tensors in the approximation to two. The approximation in this case accounts for 45.915% of the variability which is better than the case with one component only. In this case, the decomposition suggests to view each digit image as a linear sum of two 7×5 pattens where the coefficients in the linear sum are given by the first mode matrix of size 2×60. Plotting these components of the first mode matrix with color coding for three digit categories, I generated the plot given below.

The above plot shows that the values in the first mode of the decomposed tensor can differentiate between digit 1 pattens (shown in black circles) against digit two and three patterns (shown in red and green circles). This is further supported by reconstruction examples, shown below, where patterns for digit 1 are able to reasonbly approximate it while reconstructed patterns for digits 2 and 3 are indistinguiable.

Next, I decided to jump to five components in the decomposition. This resulted in the mode one matrix of size 5×60. The mode two matrix in this case is 5×7 and the mode three matrix is 5×5. Again the mode two and three matrices together represent five patterns of size 7×5 matrices. When we combine these five patterns with weights/scales coming from columns of mode one matrix, we are able to obtain approximated digit images. The scatter plots below show that it is possible to cluster/classify digits 1 to 3 using the columns of mode one matrix in the decomposed representation (see row 4 and column 3). This shows that CP decomposition can be used for clustering/classification tasks.

The reconstructed images with five components, given below, clearly show patterns for digits 1 to 3. We can increase the number of components for better approximation.

As this example shows, we can view the CP decomposition producing a set of patterns in modes two and three that together act analogous to basis vectors of PCA. The first mode matrix can be viewed as producing weights that can be linearly combined to produce approximations to images/data in the original tensor. By treating the elements of mode one matrix in the resulting decomposition as features, it is possible to use the CP decomposition for classification/clustering tasks.

The CP decomposition has been applied to many signal processing and data mining applications. One application that is worth mentioning here is by the company Stitch Fix. In their application, a tensor of order 3 is created by the words customers use to describe clothing made by Stitch Fix and the words used by the company to describe their clothing styles. To form the tensor, word embedding vectors are used. The company uses the CP decomposition to learn the relationships between the words used by the customers and the styles offered by the company. This helps in returning more appropriate results to customer queries for clothing. The blog describing this application is worth checking out.

In Part 3, I am going to describe another popular tensor decomposition method, known as Tucker decomposition. Till then, happy reading and surfing.