This page will show you how to do a Principal Component and Linear Discriminant Analysis with GNU Octave, a high-level language and environment for numerical computing.
Start GNU Octave from command line with:
$ octave
and create some data in X with corresponding classes in c:
X = [2 3;3 4;4 5;5 6;5 7;2 1;3 2;4 2;4 3;6 4;7 6]; c = [ 1; 1; 1; 1; 1; 2; 2; 2; 2; 2; 2];
Find the indices of each class {1,2}:
c1 = X(find(c==1),:) c2 = X(find(c==2),:)
And then plot the data:
figure; p1 = plot(c1(:,1), c1(:,2), "ro", "markersize",10, "linewidth", 3); hold on; p2 = plot(c2(:,1), c2(:,2), "go", "markersize",10, "linewidth", 3) xlim([0 8]) ylim([0 8])
We can see that both classes can clearly be separated in one dimension:
So how does a Principal Component Analysis perform on this dataset?
The Principal Component Analysis (PCA) was independently proposed by Karl Pearson (1901) and Harold Hotelling (1933) to turn a set of possibly correlated variables into a smaller set of uncorrelated variables. The idea is that a high-dimensional dataset is often described by correlated variables and therefore only a few meaningful dimensions account for most of the information. The PCA method finds those directions in the original dataset that accounts for the greatest variance in data, also called the principal components.
A good start is to read A tutorial on Principal Components Analysis or Wikipedia on PCA.
The math in the recommended documents can be reduced to these four lines of code:
mu = mean(X) Xm = bsxfun(@minus, X, mu) C = cov(Xm) [V,D] = eig(C)
That's it! Note that the Eigenvectors in GNU Octave are sorted ascending, so the last column is the first principal component. Strange to work with? So better sort the eigenvectors in a descending order:
% sort eigenvectors desc [D, i] = sort(diag(D), 'descend'); V = V(:,i);
Because all variables in this example are measured on the same scale we don't need to apply any standardization. Plotting both eigenvectors:
scale = 5 pc1 = line([mu(1) - scale * V(1,1) mu(1) + scale * V(1,1)], [mu(2) - scale * V(2,1) mu(2) + scale * V(2,1)]); pc2 = line([mu(1) - scale * V(1,2) mu(1) + scale * V(1,2)], [mu(2) - scale * V(2,2) mu(2) + scale * V(2,2)]); set(pc1, 'color', [1 0 0], "linestyle", "--") set(pc2, 'color', [0 1 0], "linestyle", "--")
yields a new coordinate system with the mean as origin and the orthogonal principal components as axes:
According to the PCA we can safely discard the second component, because the first principal component is responsible for 85% of the total variance.
octave> cumsum(D) / sum(D) ans = 0.85471 1.00000
But what will happen if we project the data on the first principal component (red)? Do you already see the problem?
To understand why the PCA fails in some situations, project the data onto the first principal component:
% project on pc1 z = Xm*V(:,1) % and reconstruct it p = z*V(:,1)' p = bsxfun(@plus, p, mu)
and visualize it:
% delete old plots delete(p1);delete(p2); y1 = p(find(c==1),:) y2 = p(find(c==2),:) p1 = plot(y1(:,1),y1(:,2),"ro", "markersize", 10, "linewidth", 3); p2 = plot(y2(:,1), y2(:,2),"go", "markersize", 10, "linewidth", 3);
… et voilà. The data isn't linearly separable anymore in this lower-dimensional representation. What does this mean? The classes are smeared together and classification becomes tough:
… while a projection on the second principal component yields a much better representation for classification:
See that even a simple Nearest Neighbor classifier would perform awesome here? So we learn that the directions of maximum variance may be useless for classification tasks. The second principal component had a smaller variance, but provided a much better discrimination between the two classes.
What we aim for is a projection, that maintains the maximum discriminative power of a given dataset, so a method should make use of class labels (if they are known a priori). The Linear Discriminant Analysis, invented by R. A. Fisher, 1936, does so by maximizing the between-class scatter, while minimizing the within-class scatter at the same time.
I took the equations from Ricardo Gutierrez-Osuna's lecture notes on Linear Discriminant Analysis and Wikipedia on LDA.
We'll use the same data as for the PCA example. Again create the data in X with corresponding classes in c:
X = [2 3;3 4;4 5;5 6;5 7;2 1;3 2;4 2;4 3;6 4;7 6]; c = [ 1; 1; 1; 1; 1; 2; 2; 2; 2; 2; 2];
Find the indices of each class {1,2}:
c1 = X(find(c==1),:) c2 = X(find(c==2),:)
And then plot the data:
figure; p1 = plot(c1(:,1), c1(:,2), "ro", "markersize",10, "linewidth", 3); hold on; p2 = plot(c2(:,1), c2(:,2), "go", "markersize",10, "linewidth", 3) xlim([0 8]) ylim([0 8])
So how does a Linear Discriminant Analysis perform on this dataset?
Finding the between and within-classes scatter matrices is easy for our two classes example:
classes = max(c) mu_total = mean(X) mu = [ mean(c1); mean(c2) ] Sw = (X - mu(c,:))'*(X - mu(c,:)) Sb = (ones(classes,1) * mu_total - mu)' * (ones(classes,1) * mu_total - mu)
The General Eigenvalue problem is now solved by simply doing:
[V, D] = eig(Sw\Sb)
And that's it. Now let's see what separations the LDA method has found! Don't forget to first sort the Eigenvectors in a descending order:
% sort eigenvectors desc [D, i] = sort(diag(D), 'descend'); V = V(:,i);
and then plot them:
scale = 5 pc1 = line([mu_total(1) - scale * V(1,1) mu_total(1) + scale * V(1,1)], [mu_total(2) - scale * V(2,1) mu_total(2) + scale * V(2,1)]); set(pc1, 'color', [1 0 0], "linestyle", "--")
Let's look at the component found by the LDA:
Yeah! Perfect separation on the first component, that was what we were looking for!
First shift the data to the new center:
Xm = bsxfun(@minus, X, mu_total)
then calculate the projection and reconstruction:
z = Xm*V(:,1) % and reconstruct it p = z*V(:,1)' p = bsxfun(@plus, p, mu_total)
and now plotting it:
% delete old plots delete(p1);delete(p2); y1 = p(find(c==1),:) y2 = p(find(c==2),:) p1 = plot(y1(:,1),y1(:,2),"ro", "markersize", 10, "linewidth", 3); p2 = plot(y2(:,1), y2(:,2),"go", "markersize", 10, "linewidth", 3);
yields this beautiful plot:
The first experiment was somewhat constructed. Let's get our hands now on some real data from the UCI Machine Learning Repository! I decided to test both methods on the Wine Dataset1), an admittedly easy dataset. It consists of 178 samples with 13 constituents drawn from three types of wine.
We will turn the methods from the first experiment into functions to make them a bit more usable:
function Z = zscore(X)
Z = bsxfun(@rdivide, bsxfun(@minus, X, mean(X)), std(X));
endfunction
function [D, W_pca] = pca(X)
mu = mean(X);
Xm = bsxfun(@minus, X ,mu);
C = cov(Xm);
[W_pca,D] = eig(C);
[D, i] = sort(diag(D), 'descend');
W_pca = W_pca(:,i);
endfunction
function [D, W_lda] = lda(X,y)
dimension = columns(X);
labels = unique(y);
C = length(labels);
Sw = zeros(dimension,dimension);
Sb = zeros(dimension,dimension);
mu = mean(X);
for i = 1:C
Xi = X(find(y == labels(i)),:);
n = rows(Xi);
mu_i = mean(Xi);
XMi = bsxfun(@minus, Xi, mu_i);
Sw = Sw + (XMi' * XMi );
MiM = mu_i - mu;
Sb = Sb + n * MiM' * MiM;
endfor
[W_lda, D] = eig(Sw\Sb);
[D, i] = sort(diag(D), 'descend');
W_lda = W_lda(:,i);
endfunction
function X_proj = project(X, W)
X_proj = X*W;
endfunction
function X = reconstruct(X_proj, W)
X = X_proj * W';
endfunction
Loading the Wine Dataset is easy in GNU Octave with the dlmread function:
% http://archive.ics.uci.edu/ml/datasets/Wine
data = dlmread("wine.data",",");
y = data(:,1);
X = data(:,2:end);
Everything's ok?
> size(X) ans = 178 13 > size(y) ans = 178 1 > length(unique(y)) ans = 3
Yes! 178 instances with 13 attributes from 3 classes.
With the functions it's easy to perform a PCA:
[DPca, Wpca] = pca(X); Xm = bsxfun(@minus, X, mean(X)); Xproj = project(Xm, Wpca(:,1:2));
According to the eigenvalues, the first component already accounts for 99% of the total variance.
> cumsum(DPca) / sum(DPca) ans = 0.99809 0.99983 0.99992 0.99997 0.99998 0.99999 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
But the plot
wine1 = Xproj(find(y==1),:);
wine2 = Xproj(find(y==2),:);
wine3 = Xproj(find(y==3),:);
figure;
plot(wine1(:,1), wine1(:,2),"ro", "markersize", 10, "linewidth", 3); hold on;
plot(wine2(:,1), wine2(:,2),"go", "markersize", 10, "linewidth", 3);
plot(wine3(:,1), wine3(:,2),"bo", "markersize", 10, "linewidth", 3);
title("PCA (original data)")
shows that the clusters are not clearly separated
If we apply Fisher's Discriminant Analysis on the Wine Dataset:
[D, W_lda] = lda(X,y); Xproj = project(Xm, W_lda(:,1:2));
and project it on the first two components, the classes are way better separated:
The PCA performs bad. Why is that? Because the features are all on a different scale. A common trick is to scale the input to zero mean and unit variance, also called z-scores. If the normalization is applied on the data:
Xm = zscore(X);
then the first principal component is responsible for only 36% of the total variance. We have to take 8 components to rise above 90% variance.
> cumsum(DPca) / sum(DPca) ans = 0.36199 0.55406 0.66530 0.73599 0.80162 0.85098 0.89337 0.92018 0.94240 0.96170 0.97907 0.99205 1.00000
On normalized data the clusters are better separated by the PCA:
While the projection found by Fisher's Discriminant Analysis only changes in scale:
Something I didn't mention is the close relation between a Singular Value Decomposition (SVD) and a Principle Component Analysis. A nice tutorial for the SVD is given here and a real-life example with Ruby is given here.
Let's decompose the data matrix:
[U, S, V] = svd(Xm); Xproj = project(Xm,V(:,1:2)); DSvd = diag(S).^2;
and you can compare yourself. The diagonal elements squared are the eigenvalues found by the PCA. We need to normalize both vectors first:
function N = normalize(I, l, h)
% set default values
if(nargin < 3)
l=0;
h=1;
endif
minI = min(I(:));
maxI = max(I(:));
%% normalize between [0...1]
N = I - minI;
N = N ./ (maxI - minI);
%% scale between [l...h]
N = N .* (h-l);
N = N + l;
end
and then we can compare both:
all((normalize(DPca) - normalize(DSvd)) < eps)
Does it return 1? Hooray:
1
Performing a PCA and LDA is easy, as long as you got a solver! You have seen that the PCA is likely to fail under certain conditions. While the first example was constructed to fail, the second example showed some other limitations you can encounter in real life.
So why is this important for object recognition? The PCA finds a coordinate system, which points into the direction of maximum variance. It doesn't use any class labels and so it doesn't know wether an axis is actually discriminative or not. In face recognition and general object recognition, you'll often run into situations where most of the variance is generated from external sources; let it be the illumination or the background. If this is the case, the axes you identify with a PCA will describe the variance in light or background instead of describing discriminative parts of your objects. The classes will be smeared together in the lower dimensional representation, your classification results are bad.
The LDA can help by learning a subspace, where the scatter within a class is minimized and the scatter between the classes is maximized. With this formulation, the instances will be clearer separated in the subspace. In my blog you can see it works and generates good recognition results. But don't get too excited! The LDA has some limitations I can't discuss in the scope of this article, that makes its usage in real life almost impossible. If you want to research on this topic, go for Small Sample Size Problem on Google Scholar.
If you don't like GNU Octave for some reasons, you can also use R for the above examples. Please note that there is already the sophisticated function princomp in R. Anyway, here is the first example translated to R:
repmat <- function(data, nrows, ncols, byrow=T) {
ncols <- length(data) * ncols
rep <- matrix(rep(data, nrows), nrow=nrows, ncol=ncols, byrow=byrow)
return(rep)
}
data <- c(2,3,3,4,4,5,5,6,5,7,2,1,3,2,4,2,4,3,6,4,7,6)
y <- c(1,1,1,1,1,2,2,2,2,2,2);
X <- matrix(data, ncol=2, byrow=T)
mean <- colMeans(X)
Xm <- X - repmat(mean, nrow(X), 1)
C <- cov(Xm)
eig <- eigen(C)
z <- Xm%*%eig$vectors[,1]
p <- z%*%eig$vectors[,1]
p <- p + repmat(mean, nrow(p), 1)
plot(p[y==1,], col="red", xlim=c(0,8), ylim=c(0,8))
points(p[y==2,], col="green")
title(main="PCA projection", xlab="X", ylab="Y")
Discussion
very nice tutorial! congratulations
Thanks! I had fun writing it.
Is it possible to edit your code? You have a few stylistic/numerical errors, such as inv(a)*b instead of a\b (cf. http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/ ) and some of your repmat uses should really be replaced with broadcasting: http://www.gnu.org/software/octave/doc/interpreter/Broadcasting.html
No problem. I have edited the code accordingly.
Excellent tutorial, its refreshing to have this kind of documentation.
Thanks!