In my previous post, I had written about principal component analysis (PCA) for dimensionality reduction. In PCA, the class label of each and every example, even if available, is ignored and only the feature values from each example are considered. Thus, PCA is considered an *unsupervised* approach. The emphasis in PCA is to preserve data variability as much as possible while reducing the dimensionality. Linear discriminant analysis (LDA) on the other hand makes use of class labels as well and its focus is on finding a lower dimensional space that emphasizes class separability. In other words, LDA tries to find such a lower dimensional representation of the data where training examples from different classes are mapped far apart. Since LDA uses class labels, it is considered a *supervised* learning approach.

##### Basic Idea Behind LDA

The basic idea behind LDA is best understood by looking at the figure below where data from two classes, shown in red and black, is being projected onto a line. The figure shows only two choices for the line onto which the projections are made. Since our goal is to prefer a projection where the projected data from two classes do not overlap as far as possible, clearly the projection on the right is preferable.

Given that an infinite number of projections are possible, we need to define a measure that can help us select the best projection. Intuitively, we would like the projected points from two classes to be as far apart as possible. We can capture this notion by measuring the difference between means of class 1 and class 2 projected points. Although means can be far apart, it is still possible for projected points from two classes to show a fair amount of overlap if the projected points are spread out. To counter this, we also want the projected points from each class to lie close to each other. This means the mapped points for each class should have as small a spread as possible. Now, combining the two requirements for a good projection we can write a criterion function *J* as

*J(Projection) = (Difference between projected means of class 1 and 2)**2/(Sum of spread of class 1 and class 2 projected points)*

Thus, we look for a line onto which the projection yields a high enough value for the difference in projected means and a small enough value for the sum of variances so as to yield a maximum value for the criterion function. This criterion function was suggested by R.A. Fisher and hence is known as *Fisher’s criterion* and the resulting solution is called *Fisher’s linear discriminant function*.

##### LDA Math and Steps

Lets look at few equations to figure out the steps for applying LDA. We begin with *N **d*-dimensional feature vectors, , from two classes, . Let a vector of unit length **w **define the line onto which the N data vectors are mapped to produce projections via the relationship .

The means and spreads, more commonly known as *scatter*, of the projected data points are then:

The sum is called the total *within-class scatter* of the projected data points. The Fisher criterion, measuring the goodness of projection onto the line defined by **w** is thus expressed as

Instead of searching for optimal **w** to maximize the criterion, it turns out that a close form solution exists. It is given by the following relationship

, where

are the means of class 1 and class 2 examples in the original space. The matrix is known as the *within-class scatter matrix*; it is formed by summing the *scatter matrices* of class 1 and class2. That is

, and

Thus, the steps for applying LDA are:

- Compute the mean vectors of two classes
- Compute the scatter matrices of two classes
- Add scatter matrices to form the within-class scatter matrix and calculate its inverse
- Obtain the solution vector
**w**by performing matrix multiplication - Obtain the projections as

##### A Simple LDA Example

Lets consider a business that uses three skill tests to assign its employees to one of two groups: A team and B team. We have test scores of ten employees, five from each group. Thus, we have ten three-dimensional vectors from two classes. We will obtain a one-dimensional representation of these ten three-dimensional examples by following the steps outlined above. Lets first describe the input vectors and calculate the mean vectors of team A and team B.

import numpy as np from numpy import linalg as LA import matplotlib.pyplot as plt Ateam = np.array([[8,9,6],[6,7,5],[9,6,3],[7,8,2],[9,4,4]])# Skill scores for employees in team A Ateam_mean = np.mean(Ateam,0)# compute mean vector print(Ateam_mean)

[7.8 6.8 4. ]

Bteam = np.array([[5,4,7],[3,7,2],[4,5,5],[2,6,4],[4,3,4]])# Skill scores for employees in team B Bteam_mean = np.mean(Bteam,0) print(Bteam_mean)

[3.6 5. 4.4]

Next, we calculate scatter matrices of individual classes and add them to get the within-class scatter matrix.

S1 = np.zeros((3,3)) for j in range(len(Ateam)): S1 += np.outer((Ateam[j]-Ateam_mean),(Ateam[j]-Ateam_mean)) print(S1)

[[ 6.8 -5.2 -1. ] [-5.2 14.8 3. ] [-1. 3. 10. ]]

S2 = np.zeros((3,3)) for j in range(len(Bteam)): S2 += np.outer((Bteam[j]-Bteam_mean),(Bteam[j]-Bteam_mean)) print(S2)

[[ 5.2 -5. 5.8] [-5. 10. -7. ] [ 5.8 -7. 13.2]]

SW = S1+S2 ##Within-Class scatter matrix print(SW)

[[ 12. -10.2 4.8] [-10.2 24.8 -4. ] [ 4.8 -4. 23.2]]

Note that we could have calculated class scatter matrices as shown below. This is because the sample covariance matrix is obtained by dividing every term of the scatter matrix by a factor of number of examples minus one.

S1 = (len(Ateam)-1)*np.cov(Ateam.T) S2 = (len(Bteam)-1)*np.cov(Bteam.T)

Once we have the within-class scatter matrix and mean vectors, we calculate the solution vector to project our three-dimensional data to a line by following the steps shown below.

from numpy.linalg import inv SW_inv = inv(SW) W = np.matmul(SW_inv,(Ateam_mean-Bteam_mean))## Weight vector for projection print(SW_inv) print(W)

[[ 0.13580391 0.05279105 -0.01899546] [ 0.05279105 0.06199744 -0.00023307] [-0.01899546 -0.00023307 0.04699336]]

[ 0.67299849 0.33341102 -0.09899779]

We can now map our ten examples to one dimension. We can also calculate the projected means. Assuming that two classes, team A and B, are equally probable, we can calculate the midway point between the two projected means. This point, designated as AB_cutoff point, can be used to perform classification of new data.

Ateam_projections = np.matmul(W,Ateam.T) print(Ateam_projections) Bteam_projections = np.matmul(W,Bteam.T) print(Bteam_projections)

[7.79070038 5.87687915 7.76045915 7.18028202 6.99463932] [4.00565202 4.15487705 3.86406013 2.95047197 3.29623587]

Ateam_mean_proj = np.matmul(W.T,Ateam_mean) Bteam_mean_proj = np.matmul(W.T,Bteam_mean) AB_cutoff = 0.5*(Ateam_mean_proj+Bteam_mean_proj) print(Ateam_mean_proj,Bteam_mean_proj,AB_cutoff)

7.1205920055937515 3.6542594103251362 5.387425707959444

A graph of the resulting projections is shown below; it clearly shows that team A and team B data gets projected separately.

Now suppose the business in our example hires a new employee whose score on three skills is 5, 5, and 6. Should this employee be assigned to team A or B? We can answer this by finding the new employee’s projection. This turns out to be 4.438 which is less than the AB_cutoff value, i.e. closer to team B’s projection mean. Hence the new employee should be assigned to team B.

##### How to Apply LDA for More than Two Classes?

Not all classification problems are two class problems. So the question is how do we perform dimensionality reduction with LDA when the number of classes is, say, *K*. In such situations, the LDA approach is known as the *multiple discriminant analysis* and it uses *K-1* projections to map the data from the original *d*-dimensional space to a (*K-1)-*dimensional space under the condition that *d > K*.

Lets represent the *K-1* projections through the mapping , where ** W** is a (

*k-1) x d*projection matrix with each row representing a discriminant function. The projected vectors are of

*K-1*dimensions. Following the requirement that projected points from different classes should be far apart with small spread, the solution for projection matrix

*is given by the eigenvectors corresponding to the top*

**W***k-1*eigenvalues of the following matrix

, where

is the within-class matrix as defined earlier, and

is known as the *between-class scatter matrix *with ** m **being the mean of all the

*d*-dimensional examples from

*K*classes.

Lets go back to our A and B teams example and assume that we have another set of three-dimensional examples based on three skill scores. These examples represent members of team C. To perform dimensionality reduction on our three teams data, we will now generate two projections using LDA. To do this, we first compute team C mean and scatter.

Cteam = np.array([[3,5,8],[3,4,8],[4,5,9],[4,5,8],[5,4,7]])# Skill scores for employees in team C Cteam_mean = np.mean(Cteam,0) print(Cteam_mean)

[3.8 4.6 8.]

S3 = np.zeros((3,3)) for j in range(len(Cteam)): S3 += np.outer((Cteam[j]-Cteam_mean),(Cteam[j]-Cteam_mean)) print(S3)

[[ 2.8 -0.4 -1. ] [-0.4 1.2 1. ] [-1. 1. 2. ]]

To calculate the between-class matrix, we calculate the mean of all team members from teams A, B and C, and then perform the necessary vector multiplications as shown below. The within-class scatter matrix is calculated by adding team A, B and C scatter matrices.

total_mean = np.mean([Ateam_mean,Bteam_mean,Cteam_mean],0) SB = np.zeros((3,3)) SB += np.outer((Ateam_mean - total_mean),(Ateam_mean - total_mean)) SB += np.outer((Bteam_mean - total_mean),(Bteam_mean - total_mean)) SB += np.outer((Cteam_mean - total_mean),(Cteam_mean - total_mean)) print(total_mean) print(SB)

[5.06666667 5.46666667 5.46666667] [[11.22666667 5.42666667 -5.65333333] [ 5.42666667 2.74666667 -3.65333333] [-5.65333333 -3.65333333 9.70666667]]

SW = S1+S2+S3 ##Within-Class scatter matrix print(SW)

[[ 14.8 -10.6 3.8] [-10.6 26. -3. ] [ 3.8 -3. 25.2]]

Having calculated the needed matrices, we can now compute the eigenvalues and eigenvectors of the matrix.

SW_invSB = np.matmul(SW_inv,SB) eig_vals, eig_vecs = np.linalg.eig(SW_invSB) print(eig_vals) print(eig_vecs)

[ 1.96266109e+00 -4.68125067e-17 2.17635603e-01] [[-0.85426543 -0.3810106 0.43488256] [-0.45224032 0.9163546 0.26973255] [ 0.25633818 0.12298443 0.85913998]]

First and third eigenvalues are the two largest eigenvalues. Thus we select the corresponding eigenvectors to form the projection matrix and project the data to two dimensions. The steps for these and a scatter plot of the projected points is shown below.

W = np.array([eig_vecs[:,0],eig_vecs[:,2]]) y_Ateam = Ateam.dot(W.T)# Projections of team A members y_Bteam = Bteam.dot(W.T)# Projections of team B members y_Cteam = Cteam.dot(W.T)# Projections of team C members plt.scatter(y_Ateam[:,0],y_Ateam[:,1],c="r",label='A Team',marker= 'x') plt.scatter(y_Bteam[:,0],y_Bteam[:,1], c="g", label='B Team') plt.scatter(y_Cteam[:,0],y_Cteam[:,1], c="b", label='C Team',marker= 'x') plt.xlabel('Proj1') plt.ylabel('Proj2') plt.legend() plt.show()

It is seen that the projections of different classes are well separated except a point from team B overlapping with a point from team C. Lets use the new employee’s scores again and map the new employee to the 2-dimensional space defined above. In this case, the mapped coordinates are [-4.9944 8.6779]. Since this point is closest to mean of projections for team C, the new employee gets assigned to team C.

##### A Comparison of Dimensionality Reduction via LDA and PCA

We will use a well known data set and apply LDA and PCA to see what kind of results are obtained. We do this by using the *wine* data set available in the scikit-learn machine learning library. This data set has 178 examples from three classes. Each example consists of 13 real-valued features. In this case, we will use the LDA and PCA class from the scikit-learn library.

from sklearn.datasets import load_wine from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.decomposition import PCA wine = load_wine() Wine_Data = wine.data Wine_Class = wine.target lda = LinearDiscriminantAnalysis(n_components=2) lda.fit(Wine_Data, Wine_Class) Proj_lda = lda.transform(Wine_Data) pca = PCA(n_components=2) pca.fit(Wine_Data) Proj_pca = pca.transform(Wine_Data)

The resulting projections in two-dimensions by the both methods are shown below. It is clear from these projections that the dimensionality reduction by the LDA approach maintains class separation well while the reduction by the PCA has data from all three classes mixed up.

To summarize, we should be using LDA for dimensionality reduction if the objective is to maintain class separability. If preserving the data variability, i.e. keeping the data approximation error while reducing dimensionality, is of importance, then PCA should be used. While PCA is meant only for dimensionality reduction, LDA can be used for dimensionality reduction as well as for classification.