Principal components analysis

This page describes how to run a PCA using various methods. The main aim is to make explicit which matrices each method returns. I'll use the following notation throughout.

Let the observed data be O(N, K): N observations of K variables. The columns of O will generally be correlated. PCA provides a description of O as weighted sums of uncorrelated latent variables L. There are two kinds of weight matrices:

Observed-to-latent weights, O2L
Latent-to-observed weights, L2O

Then O = L * L2O and L = O * O2L, with O de-meaned (written Om below). In ICA terminology, L2O is the mixing matrix and O2L is the unmixing matrix: the components are considered to be mixed and then observed as signals.

Calculating principal components



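O = randn(100, 9);                      % example data: N = 100 observations, K = 9 variables
Om = O - ones(size(O, 1), 1) * mean(O); % de-meaned data (column means subtracted)
[O2L, L, EV] = princomp(O);             % weights, latent variables, eigenvalues
% In newer Matlab releases, pca(O) returns the same first three outputs.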

Then Om * O2L = L.

Note that the de-meaned Om must be used for the equality; princomp does the de-meaning automatically.

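The L2O mixing matrix can be calculated from the O2L matrix as its pseudo-inverse. Because the columns of O2L are orthonormal, this reduces to the transpose O2L':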

L2O = inv(O2L' * O2L) * O2L';

Then L * L2O = Om.

The eigenvalues (EV) are equal to the variances of the latent variables, i.e. of the columns of L.
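The relationships above can be checked numerically. The following is a minimal sketch, not the page's original code: it assumes the PCA is obtained from an SVD of the de-meaned data (base Matlab) instead of princomp, so that it runs without the Statistics Toolbox. The variable names errPinv, errRecon and errVar are made up for this illustration.

```matlab
% Sketch: verify the O2L / L2O / EV relationships using an SVD-based PCA.
% Assumption: svd is swapped in for princomp; up to column signs the
% weights should agree with those of princomp.
N = 100; K = 9;
O = randn(N, K) * randn(K, K);          % example data with correlated columns
Om = O - ones(N, 1) * mean(O);          % de-meaned data

[U, S, V] = svd(Om, 'econ');            % Om = U * S * V'
O2L = V;                                % observed-to-latent weights
L = Om * O2L;                           % latent variables
EV = diag(S).^2 / (N - 1);              % eigenvalues of cov(O), high to low

L2O = inv(O2L' * O2L) * O2L';           % mixing matrix as given in the text

errPinv  = max(max(abs(L2O - O2L')));   % L2O versus the plain transpose
errRecon = max(max(abs(L * L2O - Om))); % reconstruction of Om
errVar   = max(abs(var(L)' - EV));      % eigenvalues versus variances of L
```

All three error values should be at the level of floating-point round-off.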

Eigendecomposition of the covariance matrix

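Using the eigendecomposition of the covariance matrix of O gives the same results, except that the columns are typically sorted in reverse (from low to high eigenvalue) and the weights of a component may be multiplied by -1. Note that eig returns the eigenvalues as a diagonal matrix rather than as a vector:

Ocov = cov(O);
[O2L, EV] = eig(Ocov);
EV = diag(EV); % extract the eigenvalues from the diagonal matrix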

A proof of why this eigendecomposition works is given on Wikipedia.
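To make the eigendecomposition output easier to compare with other methods, the columns can be re-sorted and given a fixed sign. The sketch below is one way to do this; the sign convention (largest-magnitude weight of each column made positive) is an assumption chosen for illustration, and the names idx, imax and offDiag are hypothetical.

```matlab
% Sketch: sort eigenvectors by descending eigenvalue and fix their signs.
O = randn(100, 9) * randn(9, 9);        % example data with correlated columns
[V, D] = eig(cov(O));                   % D is a diagonal matrix of eigenvalues

[EV, idx] = sort(diag(D), 'descend');   % eigenvalues from high to low
O2L = V(:, idx);                        % reorder the weight columns to match

% Assumed sign convention: flip each column so that its
% largest-magnitude weight is positive.
[~, imax] = max(abs(O2L), [], 1);
lin = sub2ind(size(O2L), imax, 1:size(O2L, 2));
s = sign(O2L(lin));                     % row vector of +1 / -1
O2L = O2L .* (ones(size(O2L, 1), 1) * s);

% The latent variables should be uncorrelated, with variances EV.
Om = O - ones(size(O, 1), 1) * mean(O);
L = Om * O2L;
C = cov(L);
offDiag = max(max(abs(C - diag(diag(C)))));  % should be near zero
```

Flipping the sign of a column of O2L flips the sign of the corresponding column of L, but leaves the eigenvalues and the reconstruction of Om unchanged.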

NIPALS algorithm

Implemented in teg_PCA, NIPALS works differently: it finds the L2O matrix iteratively. The O2L matrix is subsequently calculated from the L2O matrix as O2L = L2O' * inv(L2O * L2O').

[L, L2O, O2L, EV] = teg_PCA(O);

Then L * L2O = Om. The main advantage here is that the covariance matrix doesn't need to be held in memory, and the home-made function can be further optimized for memory usage.
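For readers unfamiliar with NIPALS, the following is a minimal sketch of the textbook iteration with deflation. It is not the teg_PCA implementation, whose internals are not described here; the settings nComp, maxIter and tol and the choice of starting column are assumptions made for this illustration.

```matlab
% Sketch of textbook NIPALS: extract one component at a time, then
% deflate the data. Only the data matrix is held in memory; the
% covariance matrix is never formed.
O = randn(100, 9) * randn(9, 9);        % example data with correlated columns
[N, K] = size(O);
Om = O - ones(N, 1) * mean(O);          % de-meaned data

nComp = K; maxIter = 1000; tol = 1e-10; % hypothetical settings
X = Om;                                 % residual data, deflated per component
L = zeros(N, nComp);
L2O = zeros(nComp, K);
for c = 1:nComp
    [~, j] = max(var(X));
    t = X(:, j);                        % start from the highest-variance column
    for it = 1:maxIter
        p = X' * t / (t' * t);          % regress the columns of X on the scores
        p = p / norm(p);                % unit-length weights
        tNew = X * p;                   % updated scores
        done = norm(tNew - t) <= tol * norm(tNew);
        t = tNew;
        if done, break; end
    end
    L(:, c) = t;                        % latent variable c
    L2O(c, :) = p';                     % row c of the mixing matrix
    X = X - t * p';                     % deflate: remove this component
end

O2L = L2O' * inv(L2O * L2O');           % unmixing matrix, as in the text
EV = var(L)';                           % eigenvalues = variances of L
errRecon = max(max(abs(L * L2O - Om))); % should be near zero when nComp = K
errUnmix = max(max(abs(Om * O2L - L))); % should be near zero
```

Convergence of the inner loop can be slow when two eigenvalues are close together, which is why the sketch caps the number of iterations.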

VARIMAX rotation

O2Lr = rotatefactors(O2L, 'method', 'varimax') gives the varimax rotation. This maximizes the variance of the squared weights within each column of the new O2L matrix, so that each column has a few large weights and many near-zero ones. That is, the resulting rotated latent variables in Lr = Om * O2Lr are based on simple combinations of variables.

Quartimax maximizes this variance row-wise instead, so that each variable has large weights on only a few components. Equimax combines the two criteria. However, it seems that most often the goal will be to have components that can be easily understood in terms of variables, which is what varimax provides.

After finding a rotation matrix, the following relationship holds:

Om * O2Lr = Lr
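The relationship above can be sketched as follows. This example assumes the Statistics Toolbox is available for rotatefactors, obtains O2L from an SVD instead of princomp, and rotates only the first nKeep = 3 components; that number is a hypothetical choice for illustration. Rotating the full square O2L matrix would tend to rotate back towards the original variables, so a subset of components is normally selected first.

```matlab
% Sketch: varimax rotation of the weights of the first few components.
O = randn(100, 9) * randn(9, 9);        % example data with correlated columns
Om = O - ones(size(O, 1), 1) * mean(O); % de-meaned data

[~, ~, V] = svd(Om, 'econ');            % assumption: SVD-based PCA weights
nKeep = 3;                              % hypothetical number of retained components
O2L = V(:, 1:nKeep);
L = Om * O2L;                           % unrotated latent variables

% The second output T is the rotation matrix, with O2Lr = O2L * T
% according to the rotatefactors documentation.
[O2Lr, T] = rotatefactors(O2L, 'Method', 'varimax');
Lr = Om * O2Lr;                         % rotated latent variables

errRot = max(max(abs(L * T - Lr)));     % should be near zero
```

Note that the rotated latent variables in Lr are in general no longer uncorrelated, because the unrotated components have different variances.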

The scree criterion (elbow criterion)

An objective criterion for selecting the "top" components, i.e. those with the highest eigenvalues, is the scree criterion. This determines the point at which the slope of the ordered eigenvalues passes through 1 in magnitude, that is, where the descending curve flattens out (after normalizing the eigenvalues and index numbers to a [0 1] range). Matlab code here.
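As an illustration of the criterion described above, the following is a minimal sketch. It is not the code linked from this page; the example eigenvalues are made up, and the choice to keep all components up to the start of the first shallow segment is an assumed convention.

```matlab
% Sketch: scree (elbow) criterion on normalized eigenvalues.
EV = [4.0; 2.5; 1.2; 0.5; 0.3; 0.2; 0.15; 0.1; 0.05]; % made-up eigenvalues, high to low
K = numel(EV);

x = ((1:K)' - 1) / (K - 1);                   % index numbers scaled to [0 1]
y = (EV - min(EV)) / (max(EV) - min(EV));     % eigenvalues scaled to [0 1]

slope = diff(y) ./ diff(x);                   % slope of each segment (negative)
flat = find(abs(slope) < 1, 1, 'first');      % first segment shallower than 1

if isempty(flat)
    nKeep = K;                                % no elbow found: keep everything
else
    nKeep = flat;                             % keep the components before the flat part
end
```

For these example values the first segment with a slope magnitude below 1 is the one between components 4 and 5, so nKeep is 4.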