Extreme Components Analysis

TL;DR: Both Principal Components Analysis (PCA) and Minor Components Analysis (MCA) can be used for dimensionality reduction, identifying low-dimensional subspaces of interest as those which have the greatest variation in the original data (PCA), or those which have the least variation in the original data (MCA). As real data will contain both directions of unusually high variance and directions of unusually low variance, using just PCA or just MCA will lead to biased estimates of the low-dimensional subspace. The 2003 NeurIPS paper from Welling et al unifies PCA and MCA into a single probabilistic model, XCA (Extreme Components Analysis). This post explains the XCA paper of Welling et al and demonstrates the XCA algorithm using simulated data. Code for the demonstration is available from https://github.com/dchoyle/xca_post

A deadline

This post arose because of a deadline I have to meet. I don’t know when the deadline is, I just know there is a deadline. Okay, it is a self-imposed deadline, but it will start to become embarrassing if I don’t hit it.

I was chatting with a connection, Will Faithfull, at a PyData Manchester Leaders meeting almost a year ago. I mentioned that one of my areas of expertise was Principal Components Analysis (PCA), or more specifically, the use of Random Matrix Theory to study the behaviour of PCA when applied to high-dimensional data.

A recap of PCA

In PCA we are trying to approximate a d-dimensional dataset by a reduced number of dimensions k < d. Obviously we want to retain as much of the structure and variation of the original data as possible, so we choose our k-dimensional subspace such that the variance of the original data in the subspace is as high as possible. Given a mean-centered data matrix \underline{\underline{X}} consisting of N observations, we can calculate the sample covariance matrix \hat{\underline{\underline{C}}} as [1],

\hat{\underline{\underline{C}}} = \frac{1}{N-1} \underline{\underline{X}}^{\top} \underline{\underline{X}}

Once we have the (symmetric) matrix \hat{\underline{\underline{C}}} we can easily compute its eigenvectors \underline{v}_{i}, i=1,\ldots, d, and their corresponding eigenvalues \lambda_{i}.

The optimal k-dimensional PCA subspace is then spanned by the k eigenvectors of \hat{\underline{\underline{C}}} that correspond to the k largest eigenvalues of \hat{\underline{\underline{C}}}. These eigenvectors are the directions of greatest variance in the original data. Alternatively, one can just do a Singular Value Decomposition (SVD) of the original data matrix \underline{\underline{X}}, and work with the singular values of \underline{\underline{X}} instead of the eigenvalues of \hat{\underline{\underline{C}}}.
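To make that recipe concrete, here is a minimal NumPy sketch (my own code, not from any particular library) that mean-centres the data, forms the sample covariance and keeps the k leading eigenvalue/eigenvector pairs. An equivalent SVD-based version would work directly on the centred data matrix.

import numpy as np

def pca_top_k(X, k):
    """Return the k largest eigenvalues of the sample covariance of X and their eigenvectors.

    X is an (N, d) data matrix with one observation per row.
    """
    N = X.shape[0]
    Xc = X - X.mean(axis=0)                   # explicit mean-centering
    C_hat = Xc.T @ Xc / (N - 1)               # sample covariance, 1/(N-1) convention
    eigvals, eigvecs = np.linalg.eigh(C_hat)  # eigh returns eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    return eigvals[idx], eigvecs[:, idx]      # columns of eigvecs[:, idx] span the PCA subspace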

That is a heuristic derivation/justification of PCA (minus the detailed maths) that goes back to Harold Hotelling in 1933 [2]. There is a probabilistic model-based derivation due to Tipping and Bishop (1999), which we will return to later.

MCA

Will responded that as part of his PhD, he’d worked on a problem where he was more interested in the directions in the dataset along which the variation is least. The problem Will was working on was “unsupervised change detection in multivariate streaming data”. The solution Will developed was a modular one, chaining together several univariate change detection methods, each of which monitored a single feature of the input space. This was combined with an MCA feature extraction and selection pre-processing step. The solution was tested against a problem of unsupervised endogenous eye blink detection.

The idea behind Will’s use of MCA was that, for the streaming data he was interested in, the inter-class variances of various features were likely to be much smaller than the intra-class variances. Any principal components were therefore likely to be dominated by what the classes had in common rather than what had changed, so the directions of greatest variance weren’t very useful for his change detection algorithm.

I’ve put a link here to Will’s PhD in case you are interested in the details of the problem and solution – yes, Will I have read your PhD.

Directions of least variance in a dataset can be found from the same eigen-decomposition of the sample covariance matrix, by selecting the components with the smallest non-zero eigenvalues. Unsurprisingly, focusing on directions of least variance in a dataset is called Minor Components Analysis (MCA) [3, 4]. Where the data shows the least variation it is effectively constrained, so MCA is good for identifying/modelling invariants or constraints within a dataset.
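Given the PCA sketch above, a minimal MCA variant (again my own code) differs only in which end of the spectrum it keeps:

def mca_bottom_k(X, k):
    """Return the k smallest non-zero eigenvalues of the sample covariance of X and their eigenvectors."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (N - 1))   # ascending order
    nonzero = eigvals > 1e-12 * eigvals.max()                # drop numerically zero eigenvalues
    idx = np.where(nonzero)[0][:k]                           # the k smallest of the remainder
    return eigvals[idx], eigvecs[:, idx]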

At this point in the conversation, I recalled the last time I’d thought about MCA. That was when an academic colleague and I had a paper accepted at the NeurIPS conference in 2003. Our paper was on kernel PCA applied to high-dimensional data, in particular the eigenvalue distributions that result. As I was moving job and house at the time I was unable to go to the conference, so my co-author, Magnus Rattray (now Director of the Institute for Data Science and Artificial Intelligence at the University of Manchester), went instead. On returning, Magnus told me of an interesting conversation he’d had at the conference with Max Welling about our paper. Max also had a paper at the conference, on XCA – Extreme Components Analysis. Max and his collaborators had managed to unify PCA and MCA into a single framework.

I mentioned the XCA paper to Will at the PyData Manchester Leaders meeting and said I’d write something up explaining XCA. It would also give me an excuse to revisit something that I hadn’t looked at since 2003. That conversation with Will was nearly a year ago. Another PyData Manchester Leaders meeting came and went and another will be coming around sometime soon. To avoid having to give a lame apology I thought it was about time I wrote this post.

XCA

Welling et al rightly point out that if we are modelling a dataset as lying in some reduced dimensionality subspace then we consider the data as being a combination of variation and constraint. We have variation of the data within a subspace and a constraint that the data does not fall outside the subspace. So we can model the same dataset focusing either on the variation (PCA) or on the constraints (MCA).

Note that in my blog post I have used a different, more commonly used notation for the number of features and the number of components than that used in the Welling et al paper. The mapping between the two notations is given below,

  • Number of features: My notation = d, Welling et al notation = D
  • Number of components: My notation = k, Welling et al notation = d

Probabilistic PCA and MCA

PCA and MCA both have probabilistic formulations, PPCA and PMCA [5] respectively. Welling et al state that, “probabilistic PCA can be interpreted as a low variance data cloud which has been stretched in certain directions. Probabilistic MCA on the other hand can be thought of as a large variance data cloud which has been pushed inward in certain directions.” In both probabilistic models a d-dimensional datapoint \underline{x} is considered as coming from a zero-mean multivariate Gaussian distribution. In probabilistic PCA the covariance matrix of the Gaussian is modelled as,

\underline{\underline{C}}_{PCA} = \sigma^{2}_{0}\underline{\underline{I}}_{d} + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}

The matrix \underline{\underline{A}} is d \times k and its columns are the principal components that span the low dimensional subspace we are trying to model.

In probabilistic MCA the inverse covariance matrix is modelled as,

\underline{\underline{C}}_{MCA}^{-1} = \sigma^{-2}_{0}\underline{\underline{I}}_{d} + \underline{\underline{W}}^{\top}\,\underline{\underline{W}}

The matrix \underline{\underline{W}} is (d-k) \times d and its rows are the minor components that define the (d-k)-dimensional subspace where we want as little variation as possible.

Since in real data we probably have both exceptional directions whose variance is greater than the bulk and exceptional directions whose variance is less than the bulk, both PCA and MCA would lead to biased estimates for these datasets. The problem is that if we use PCA we lump the low variation eigenvalues (minor components) in with our estimate of the isotropic noise, thereby underestimating the true noise variance and consequently biasing our estimate of the large variation PC subspace. Likewise, if we use MCA, we lump all the large variation eigenvalues (principal components) into our estimate of the noise and overestimate the true noise variance, thereby biasing our estimate of the low variation MC subspace.

Probabilistic XCA

In XCA we don’t have that problem. In XCA we include both large variation and small variation directions in our reduced dimensionality subspace. In fact we just have a set of orthogonal directions \underline{a}_{i}\;,\;i=1,\ldots,k that span a low-dimensional subspace and again form the columns of a matrix \underline{\underline{A}}. These are our directions of interest in the data. Some of them, say, k_{PC}, have unusually large variance, some of them, say k_{MC}, have unusually small variance. The overall number of extreme components (XC) is k = k_{PC} + k_{MC}.

As with probabilistic PCA, we then add an isotropic noise component on top of the overall covariance matrix. However, the clever trick used by Welling et al was to realize that adding noise always increases variances, so adding noise to all features would make the minor components undetectable, as the minor components have, by definition, variances below that of the bulk noise. To circumvent this, Welling et al only added noise to the subspace orthogonal to the subspace spanned by the vectors \underline{a}_{i}. They do this by introducing a projection operator {\cal{P}}_{A}^{\perp} = \underline{\underline{I}}_{d} - \underline{\underline{A}} \left ( \underline{\underline{A}}^{\top}\,\underline{\underline{A}}\right )^{-1}\underline{\underline{A}}^{\top}. Again we model the data as coming from a zero-mean multivariate Gaussian, but for XCA the final covariance matrix is then of the form,

\underline{\underline{C}}_{XCA}  = \sigma^{2}_{0} {\cal{P}}_{A}^{\perp}  + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}

and the  XCA model is,

\underline{x} = \underline{\underline{A}}\,\underline{y} + {\cal{P}}_{A}^{\perp} \underline{n}\;\;\;,\;\;\; \underline{y} \sim {\cal{N}}\left ( \underline{0}, \underline{\underline{I}}_{k}\right )\;\;\;,\;\;\;\underline{n} \sim {\cal {N}}\left ( \underline{0},  \sigma^{2}_{0}\underline{\underline{I}}_{d}\right )
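To make the generative picture concrete, the following sketch (my own construction, not code from the Welling et al paper) draws samples from the XCA model for a given matrix \underline{\underline{A}} and noise level \sigma^{2}_{0}:

import numpy as np

def sample_xca(A, sigma2_0, n_samples, seed=None):
    """Draw samples x = A y + P_A_perp n from the XCA generative model.

    A is a (d, k) matrix whose columns span the extreme-component subspace.
    """
    rng = np.random.default_rng(seed)
    d, k = A.shape
    # projection onto the orthogonal complement of the column space of A
    P_perp = np.eye(d) - A @ np.linalg.solve(A.T @ A, A.T)
    y = rng.standard_normal((n_samples, k))                        # y ~ N(0, I_k)
    n = np.sqrt(sigma2_0) * rng.standard_normal((n_samples, d))    # n ~ N(0, sigma_0^2 I_d)
    return y @ A.T + n @ P_perp                                    # P_perp is symmetric; rows are samples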

We can also start from the MCA side, by defining a projection operator {\cal{P}}_{W}^{\perp} = \underline{\underline{I}}_{d} - \underline{\underline{W}}^{\top} \left ( \underline{\underline{W}}\,\underline{\underline{W}}^{\top}\right )^{-1}\underline{\underline{W}}, where the rows of the k\times d matrix \underline{\underline{W}} span the k dimensional XC subspace we wish to identify. From this MCA-based approach Welling et al derive the probabilistic model for XCA as zero-mean multivariate Gaussian distribution with inverse covariance,

\underline{\underline{C}}_{XCA}^{-1}  = \frac{1}{\sigma^{2}_{0}} {\cal{P}}_{W}^{\perp}  + \underline{\underline{W}}^{\top}\,\underline{\underline{W}}

The two probabilistic forms of XCA are equivalent, and so one finds that the matrices \underline{\underline{A}} and \underline{\underline{W}} are related via \underline{\underline{A}} = \underline{\underline{W}}^{\top}\left ( \underline{\underline{W}}\,\underline{\underline{W}}^{\top} \right )^{-1}.

If we also look at the two ways in which Welling et al derived a probabilistic model for XCA, we can see that they are very similar to the formulations of PPCA and PMCA respectively, just with \underline{\underline{I}}_{d} replaced by {\cal{P}}_{A}^{\perp} in the PPCA formulation, and by {\cal{P}}_{W}^{\perp} in the PMCA formulation. So with just a redefinition of how we add the noise in the probabilistic model, Welling et al derived a single probabilistic model that unifies PCA and MCA.

Note that we are now defining the minor components subspace through directions of unusually low variance, so we only need a few dimensions, i.e. k_{MC} \ll d, whereas previously, when we defined the minor components subspace as the subspace that the data is constrained away from, we needed k_{MC} = d - k_{PC} directions. The probabilistic formulation of XCA is a very natural and efficient way to express MCA.

Maximum Likelihood solution for XCA

The model likelihood is easily written down and the maximum likelihood solution identified.  As one might anticipate the maximum-likelihood estimates for the vectors \underline{a}_{i} are just eigenvectors of \underline{\underline{\hat{C}}}, but we need to work out which ones. We can use the likelihood value at the maximum likelihood solution to do that for us.

Let’s say we want to retain k=6 extreme components overall, and we’ll use {\cal{C}} to denote the corresponding set of eigenvalues of \hat{\underline{\underline{C}}} that are retained. The maximum likelihood value for k extreme components (XC) is given by,

\log L_{ML} = - \frac{Nd}{2}\log \left ( 2\pi e\right )\;-\;\frac{N}{2}\sum_{i\in {\cal{C}}}\log \lambda_{i}\;-\;\frac{N(d-k)}{2}\log \left ( \frac{1}{d-k}\left [ {\rm tr}\hat{\underline{\underline{C}}} - \sum_{i\in {\cal{C}}} \lambda_{i}\right ]\right )

All we need to do is evaluate the above equation for all possible subsets {\cal{C}} of size k selected from the d eigenvalues \lambda_{i}\, , i=1,\ldots,d of \hat{\underline{\underline{C}}}. Superficially, this looks like a nasty combinatorial optimization problem, of exponential complexity. But as Welling et al point out, we know from a result proved in the PPCA paper of Tipping and Bishop that in the maximum likelihood solution the non-extreme components have eigenvalues that form a contiguous group, and so the optimal choice of subset {\cal{C}} reduces to determining where to split the ordered eigenvalue spectrum of \hat{\underline{\underline{C}}}. Since we have k = k_{PC} + k_{MC}, that reduces to simply determining the optimal number of the largest \lambda_{i} to retain. That makes the optimization problem linear in k.

For example, in our hypothetical example we have said we want k = 6, but that could be a 3 PCs + 3 MCs split, or a 2 PCs + 4 MCs split, and so on. To determine which, we simply compute the maximum likelihood value for all the possible values of k_{PC} from k_{PC} = 0 to k_{PC} = k, each time keeping the largest k_{PC} values of \lambda_{i} and the smallest k - k_{PC} values of \lambda_{i} in our set {\cal{C}}.

Some of the terms in \log L_{ML} don’t change as we vary k_{PC} and can be dropped. Welling et al introduce a quantity {\cal{K}} defined by,

{\cal{K}}\;=\;\sum_{i\in {\cal{C}}}\log \lambda_{i}\; + \;(d-k)\log \left ( {\rm tr}\hat{\underline{\underline{C}}} - \sum_{i\in {\cal{C}}} \lambda_{i}\right )

{\cal{K}} is the negative of \log L_{ML}, up to an irrelevant constant and scale. If we then compute {\cal{K}}\left ( k_{PC}\right ) for all values of k_{PC} = 0 to k_{PC} = k and select the minimum, we can determine the optimal split of k = k_{PC} + k_{MC}.
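A minimal sketch of this search (my own code, not the notebook in the repository), assuming the eigenvalues are supplied in descending order:

import numpy as np

def xca_split(eigvals, k):
    """Choose the split k = k_PC + k_MC that minimises K.

    eigvals: all d sample covariance eigenvalues, sorted in descending order.
    k: total number of extreme components to retain.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    d = len(eigvals)
    trace = eigvals.sum()                               # tr(C_hat)
    best_k_pc, best_K = 0, np.inf
    for k_pc in range(k + 1):
        k_mc = k - k_pc
        # retain the k_pc largest and k_mc smallest eigenvalues
        retained = np.concatenate([eigvals[:k_pc], eigvals[d - k_mc:]])
        K = np.sum(np.log(retained)) + (d - k) * np.log(trace - retained.sum())
        if K < best_K:
            best_k_pc, best_K = k_pc, K
    return best_k_pc, k - best_k_pc

For the simulated data described below, calling xca_split with the sorted sample covariance eigenvalues and k=20 should recover the 10 PCs + 10 MCs split reported later in the post.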

PCA and MCA as special cases

Potentially, we could find that k_{PC} = k, in which case all the selected extreme components would correspond to principal components, and so the XCA algorithm becomes equivalent to PCA. Likewise, we could get k_{PC} = 0, in which case all the selected extreme components would correspond to minor components and the XCA algorithm becomes equivalent to MCA. So XCA contains pure PCA and pure MCA as special cases. But when do these special cases arise? Obviously, it will depend upon the precise values of the sample covariance eigenvalues \lambda_{i}, or rather the shape of the eigen-spectrum, but Welling et al also give some insight here, namely,

  • A log-convex sample covariance eigen-spectrum will give PCA
  • A log-concave sample covariance eigen-spectrum will give MCA
  • A sample covariance eigen-spectrum that is neither log-convex nor log-concave will yield both principal components and minor components

In layman’s terms, if the plot of the (sorted) eigenvalues on a log scale only bends upwards (has positive second derivative) then XCA will give just principal components, whilst if the plot of the (sorted) eigenvalues on a log scale only bends downwards (has negative second derivative) then we’ll get just minor components. If the plot of the log-eigenvalues has places where the second derivative is positive and places where it is negative, then XCA will yield a mixture of principal and minor components.
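One crude way to see this numerically (a rough heuristic of my own, not something from the paper) is to look at the discrete second differences of the sorted log-eigenvalues:

import numpy as np

def log_spectrum_curvature(eigvals):
    """Discrete second differences of the sorted log-eigenvalues.

    Positive values indicate locally log-convex (PCA-like) regions,
    negative values locally log-concave (MCA-like) regions.
    """
    log_lam = np.log(np.sort(np.asarray(eigvals, dtype=float)))
    return np.diff(log_lam, n=2)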

Experimental demonstration

To illustrate the XCA theory I produced a Jupyter notebook that generates simulated data containing a known number of principal components and a known number of minor components. The simulated data is drawn from a zero-mean multivariate Gaussian distribution with population covariance matrix \underline{\underline{C}} whose eigenvalues \Lambda_{i} have been set to the following values,

\begin{array}{cclcl} \Lambda_{i} & = & 3\sigma^{2}\left ( 1 + \frac{i}{3k_{PC}}\right ) & , &  i=1,\ldots, k_{PC}\\ \Lambda_{i} & = &\sigma^{2}  & , & i=k_{PC}+1,\ldots, d - k_{MC}\\ \Lambda_{d-k_{MC}+j} & = & 0.1\sigma^{2}\left ( 1 + \frac{j}{k_{MC}} \right ) & , & j=1,\ldots, k_{MC}\end{array}

The first k_{PC} eigenvalues represent principal components, as their variance is considerably higher than the ‘noise’ eigenvalues, which are represented by eigenvalues i=k_{PC}+1 to i=d - k_{MC}. The last k_{MC} eigenvalues represent minor components, as their variance is considerably lower than the ‘noise’ eigenvalues. Note, I have scaled both the PC and MC population eigenvalues by the ‘noise’ variance \sigma^{2}, so that \sigma^{2} just sets an arbitrary (user-chosen) scale for all the variances. I have chosen a large value of \sigma^{2}=20, so that when I plot the minor component eigenvalues of \hat{\underline{\underline{C}}} I can easily distinguish them from zero (without having to plot on a logarithmic scale).

We would expect the eigenvalues of the sample covariance matrix to follow a similar pattern to the eigenvalues of the population covariance matrix that we used to generate the data, i.e. we expect a small group of noticeably low-valued eigenvalues, a small group of noticeably high-valued eigenvalues, and the bulk (majority) of the eigenvalues to form a continuous spectrum of values.

I generated a dataset consisting of N=2000 datapoints, each with d=200 features. For this dataset I chose k_{PC} = k_{MC} = 10. From the simulated data I computed the sample covariance matrix \hat{\underline{\underline{C}}} and then calculated the eigenvalues of \hat{\underline{\underline{C}}}.
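The full code is in the notebook, but the construction is short enough to sketch here (a stripped-down version of the same idea; variable names are my own):

import numpy as np

rng = np.random.default_rng(42)
N, d, k_pc, k_mc, sigma2 = 2000, 200, 10, 10, 20.0

# population eigenvalues: PC group, noise bulk, MC group
Lambda = np.concatenate([
    3.0 * sigma2 * (1.0 + np.arange(1, k_pc + 1) / (3.0 * k_pc)),
    sigma2 * np.ones(d - k_pc - k_mc),
    0.1 * sigma2 * (1.0 + np.arange(1, k_mc + 1) / k_mc),
])

# random orthonormal eigenvectors, then Gaussian data with covariance V diag(Lambda) V^T
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
X = (rng.standard_normal((N, d)) * np.sqrt(Lambda)) @ V.T

# sample covariance eigenvalues
Xc = X - X.mean(axis=0)
sample_eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (N - 1))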

The left-hand plot below shows all the eigenvalues (sorted from lowest to highest), and I have also zoomed in on just the smallest values (middle plot) and just the largest values (right-hand plot).  We can clearly see that the sample covariance eigenvalues follow the pattern we expected.

Plots of the eigenvalues of the sample covariance matrix of the simulated data.

All the code (with explanations) for my calculations is in a Jupyter notebook, freely available from the GitHub repository https://github.com/dchoyle/xca_post

From the left-hand plot we can see that there are places where the sample covariance eigenspectrum bends upwards and places where it bends downwards, indicating that we would expect the XCA algorithm to retain both principal and minor components. In fact, we can clearly see from the middle and right-hand plots the distinct group of minor component eigenvalues and the distinct group of principal component eigenvalues, and how these correspond to the distinct groups of extreme components in the population covariance eigenvalues. However, it would be interesting to see how the XCA algorithm performs in selecting a value for the number of principal components.

For the eigenvalues above I have calculated the minimum value of {\cal{K}} for a total of k=20 extreme components. The minimum value of {\cal{K}} occurs at k_{PC} = 10, indicating that the method of Welling et al estimates that there are 10 principal components in this dataset, and by definition there are then k - k_{PC} = 20 - 10 = 10 minor components in the dataset. In this case, the XCA algorithm has identified the dimensionalities of the PC and MC subspaces exactly.

Final comments

That is the post done. I can look Will in the eye when we meet for a beer at the next PyData Manchester Leaders meeting. The post ended up being longer (and more fun) than I expected, and you may have got the impression from my post that the Welling et al paper has completely solved the problem of selecting interesting low-dimensional subspaces in Gaussian distributed data. Not quite true. There are still challenges with XCA, as there are with PCA. For example, we have not said how we choose the total number of extreme components k. That is a whole other model selection problem, and one that is particularly interesting for PCA when we have high-dimensional data. This is one of my research areas – see for example my JMLR paper Hoyle2008.

Another challenge that is particularly relevant for high-dimensional data is the question of whether we will see distinct groups of principal and minor component sample covariance eigenvalues at all, even when we have distinct groups of population covariance eigenvalues. I chose very carefully the settings used to generate the simulated data in the example above. I ensured that we had many more samples than features, i.e. N \gg d, and that the extreme component population covariance eigenvalues were distinctly different from the ‘noise’ population eigenvalues. This ensured that the sample covariance eigenvalues separated into three clearly visible groups.

However, in PCA when we have N < d and/or weak signal strengths for the extreme components of the population covariance, then the extreme component sample covariance eigenvalues may not be separated from the bulk of the other eigenvalues. As we increase the ratio \alpha = N/d we observe a series of phase transitions at which each of the extreme components becomes detectable – again this is another of my areas of research expertise [HoyleRattray2003, HoyleRattray2004, HoyleRattray2007].

Footnotes
  1. I have used the usual divide-by-(N-1) Bessel correction in the definition of the sample covariance. This is because I have assumed any data matrix will have been explicitly mean-centered. In many of the analyses of PCA the starting assumption is that the data is drawn from a mean-zero distribution, so that the sample mean of any feature is zero only in expectation, not as a constraint. Consequently, most formal analyses of PCA will define the sample covariance matrix with a 1/N factor. Since I have to deal with real data, I will never presume the data has been drawn from a population distribution that has zero mean, and so to model the data with a zero-mean distribution I will explicitly mean-center the data. Therefore, I use the 1/(N-1) definition of the sample covariance. Strictly speaking, that means the various theories and analyses I discuss later in the post are not applicable to the data I’ll work with. It is possible to modify the various analyses to explicitly take into account the mean-centering step, but it is tedious to do so. In practice (for large N) the difference is largely inconsequential, and formulae derived from analysis of zero-mean distributed data can be accurate for mean-centered data, so we’ll stick with using the 1/(N-1) definition for \hat{\underline{\underline{C}}}.

  2. Hotelling, H. “Analysis of a complex of statistical variables into principal components”. Journal of Educational Psychology, 24:417-441 and 24:498–520, 1933. https://dx.doi.org/10.1037/h0071325

  3. Oja, E. “Principal components, minor components, and linear neural networks”. Neural Networks, 5(6):927-935, 1992. https://doi.org/10.1016/S0893-6080(05)80089-9

  4. Luo, F.-L., Unbehauen, R. and Cichocki, A. “A Minor Component Analysis Algorithm”. Neural Networks, 10(2):291-297, 1997. https://doi.org/10.1016/S0893-6080(96)00063-9

  5. See for example, Williams, C.K.I. and Agakov, F.V. “Products of gaussians and probabilistic minor components analysis”. Neural Computation, 14(5):1169-1182, 2002. https://doi.org/10.1162/089976602753633439


Practical PCA

[Figure: PCA schematic]

Principal Component Analysis (PCA) is a commonly applied algorithm in statistics and data science. Because it is so easy to understand at a high-level, and because it is so easy to apply, PCA has become ubiquitous. It is often applied without much thought and with the output rarely questioned. Typically, the questions I ask when applying PCA are,

  1. Do I need to do any transformation of the data before applying PCA?
  2. How many principal components should I select?
  3. How do I interpret the loadings (and scores)?

Unfortunately, I have seen a number of talks and presentations recently where PCA has been used and the impact on the analysis of not having thought about these questions was clear. Ok, I’m biased, as the behaviour of PCA, particularly when applied to high-dimensional data, is one of my areas of research. Whilst the research papers on PCA can be very complex, they do however provide some useful insight and guides on how to apply PCA to real data. In this post I’m going to look at those three questions in turn. In the following I’m also going to assume you are already familiar with PCA and that you are aware that the principal components are the eigenvectors of the sample covariance matrix corresponding to the largest eigenvalues. For a good introduction to PCA see the blog post by Laura Hamilton and also the classic book by Jolliffe.

TL;DR – the short answers to the questions above are: i) check for outliers and/or heavy tails in your data before applying PCA, ii) use the ‘knee’ in the eigenvalue scree plot to select the number of components, iii) make sure you look at the loadings of the selected principal components and that you can explain any major patterns in those loadings.

1. Transformation of the data

The original derivations of PCA, such as that by Hotelling (1933), are heuristic and make no explicit assumptions other than the requirement that the selected components retain as much of the variance in the original data as possible.  The formal derivation of PCA as the maximum likelihood estimate of parameters of a probabilistic model assumes additive Gaussian noise – see for example the paper by Tipping and Bishop. In practice, where the distributions of latent factors and additive noise are still reasonably symmetric and decay sufficiently fast we would expect deviations from perfect Gaussians to have only a minor effect. Despite this I have seen a number of talks with PCA plots similar to that shown on the left below.

[Figure: PC scores for the heavy-tailed example – scatter of the scores (left) and estimated density of the 1st PC scores (right)]

I have based this example on a real commercial data set I was shown, but for privacy and confidentiality reasons I have generated a simulated data set that reproduces the issue. The majority of PC1 scores are squashed into the left-hand side of the plot, with a few scores reaching much higher values than the rest. This can be seen more clearly if we look at the estimated density of the 1st PC scores – see the right-hand plot above. A heavy tail is clearly present, meaning we are deviating significantly from the assumptions under which the eigenvectors of the sample covariance matrix are the optimal estimators of the signal directions in the raw data.

So an obvious first step would be to take a quick look at the distribution of the raw data that we are trying to decompose using PCA. If that distribution has a heavy tail or significant outliers then there is an argument for applying a transformation (e.g. logarithm) before applying PCA. If we take the log of the example data above then we obtain much more reasonable distributions for the PC scores – see below.

[Figure: PC scores for the heavy-tailed example after a log transformation of the data]

Tip: If you see an elongated distribution of scores along the PCs you have selected, then it may be worth going back and looking at the distribution of the raw data going into the PCA – you should have already looked at the distribution of your raw data anyway as part of EDA best practice.
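To make the tip concrete, here is a minimal scikit-learn sketch on a simulated stand-in for that kind of heavy-tailed, strictly positive data (the data and names here are my own, not the commercial data set mentioned above):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# log-normal stand-in data driven by two latent factors
latent = rng.standard_normal((1000, 2))
X = np.exp(1.0 + latent @ rng.standard_normal((2, 20)) + 0.5 * rng.standard_normal((1000, 20)))

scores_raw = PCA(n_components=2).fit_transform(X)          # PC1 scores show a heavy tail
scores_log = PCA(n_components=2).fit_transform(np.log(X))  # much better behaved after a log transform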

2. Selection of the number of components

There are two common methods for selecting the number of components that most people will be familiar with or encounter. Those are,

  1. Select the number of components so that a fixed proportion, e.g. 90%, of the total variance is retained.
  2. Look at a scree plot of the eigenvalues and locate the ‘elbow’ (or ‘knee’ depending on your interpretation of anatomy).

Of those two methods it is the second that I always prefer. This is because it has a sound theoretical underpinning and is more robust when applied to the kind of high-dimensional data sets that are commonplace nowadays. Let me explain why. The goal of PCA is to select a small number of directions in the data that we believe capture the signal within the data. The first approach to PCA model selection is effectively based upon the assumption that the ‘signal’ contribution to the total variance is considerably greater than the noise contribution to the total variance, and thus by selecting to retain the majority of the total variance in the original data we believe we are effectively selecting components that represent signal.

If our data points are p dimensional, then we have p sample covariance eigenvalues, \lambda_{i},\;i=1,\ldots,p. We consider the signal part of the data is represented by a small number, k, of eigenvectors/eigenvalues, and we have the p-k-1 remaining non-zero eigenvalues that correspond to pure noise in the original data.

Typically, there is no reason to believe that the noise process affects any of the original features/variables more strongly than the others – i.e. it is reasonable to consider the noise process to be isotropic. If not, then the preferred directions are essentially some form of signal and not noise. This means we expect the noise eigenvalues to be approximately the same, and let’s say they have an average value \sigma^{2}. As we look at data sets of increasing dimension p, unless the number of signal eigenvalues, k, or the eigenvalues, \lambda_{1},\ldots,\lambda_{k}, themselves grow with p, then the variance explained by the signal values, \sum_{i=1}^{k}\lambda_{i}, will remain relatively static whilst the noise eigenvalues contribute (p-k-1)\sigma^{2} to the total variance. Thus we can see that the fraction of total variance explained by the signal components is essentially,

{\rm Fraction\;of\;variance\;explained}\;=\;\frac{\sum_{i=1}^{k}\lambda_{i}}{(p-k-1)\sigma^{2}\;+\;\sum_{i=1}^{k}\lambda_{i}}\;\;,

and so decreases as the data dimension p increases. Consequently, for high-dimensional data, where p is very large (often in the thousands) the percentage of variance explained by the true signal components can be very very small. Conversely, if we select the number of components so as to retain say 90% of the total variance, we will be including a lot of noise in the retained components and not reducing the dimensionality as efficiently as we could be.
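As a purely hypothetical worked example: with p = 1000 features, k = 5 signal eigenvalues each equal to 10\sigma^{2}, and \sigma^{2} = 1, the signal components explain roughly 50/(994 + 50) \approx 5\% of the total variance, so a rule of ‘retain 90% of the total variance’ would pull in hundreds of essentially pure noise directions.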

In contrast, if the true eigenvalues are split into a small number of ‘signal’ eigenvalues and a much larger number of  ‘noise’ eigenvalues and we expect the noise process to be isotropic (i.e. we have a single highly degenerate noise eigenvalue), then the observed (sample) eigenvalues will also consist of a bulk of eigenvalues clustered around a single value and a small number of larger eigenvalues separated from this bulk. In other words we expect to see the distribution of sample covariance eigenvalues look something like the plot on the left below.

[Figure: Distribution of sample covariance eigenvalues for S&P 500 daily returns (left) and scree plot of the top 20 eigenvalues (right)]

The eigenvalues in this example have been obtained from the sample covariance matrix of inter-day returns of closing prices over the 4-year period 2010 to 2013 for S&P500 stocks. We have actually omitted the largest eigenvalue, as this is on a different scale and represents a ‘market-mode’ – see the next section.

Note that the bulk of the sample eigenvalues show some dispersion even though we had only one true, highly degenerate, noise eigenvalue. This is due to sampling variation, i.e. the fact that we have only a finite sized sample. If we plot the top 20 eigenvalues from the distribution above, ranked from largest to smallest, we get a scree plot looking like the plot on the right above.

Clearly, the sample eigenvalues corresponding to the bulk are very similar to each other, and so we see only small decreases with increasing rank in the sample eigenvalues in the bulk. In contrast, where there is an atypical jump in eigenvalue as we decrease the rank, this represents the point at which a signal eigenvalue can be detected as being separate from the bulk. This point also represents the ‘knee’ of the scree plot.

Statistical models that produce this kind of scree-plot are called ‘spiked-covariance’ models, so-called because the true population eigenvalue distribution is concentrated (or spiked) around just a small number of values. For these models we consider the data to have been produced by a small number of latent factors with isotropic noise. That is, our data point {\bf x}_{i} is given by,

{\bf x}_{i}\;=\;\sum_{j=1}^{k}z_{ij}\lambda_{j}{\bf B}_{j}\;+\; \boldsymbol{\epsilon}_{i}\;\;\;,\;\;\;\boldsymbol{\epsilon}_{i}\sim {\cal N}\left ( {\bf 0}, \sigma^{2}{\bf I} \right)\;\;\;,\;\;\;z_{ij}\sim {\cal N}\left ( 0, 1\right)\;\;\;,

with the vectors {\bf B}_{1},\ldots,{\bf B}_{k} forming an orthonormal set.

From the mathematical form of the spiked-covariance models we can work out the expected distribution of sample covariance eigenvalues when our data consists of N data points (each of dimension p). A large amount of research has been done in this area over the last 10-15 years. I won’t try to summarize it here, instead I’ll point you to the excellent review by Johnstone, and the original work by Paul. This research allows us to devise methods for the automatic detection of the number of principal components. Such methods have the advantage that they can be automated, i.e. programmed as part of an algorithm. On the practical side you can always “eyeball” the scree plot. For an isolated piece of analysis this is always advisable, as it takes so little time to do.

Also on the pragmatic side, I have often found that simple ‘knee’ detection algorithms work surprisingly well, particularly for the ‘real-world’ data sets that I encounter as part of my day-to-day work. The simplest of such algorithms involves finding the maximum in the value of a discrete approximation to the second derivative of the ranked eigenvalue plot. That is, we choose k as,

k \;=\; \underset{i}{\mathrm{argmax}} \left ( \lambda_{i+1} + \lambda_{i-1} - 2\lambda_{i}\right )

Improved approaches to ‘knee’ detection are based upon discrete approximations of the curvature. The paper by Satopää et al gives a good introduction to such methods. Again, these simple ‘knee’ detection approaches have the advantage that they can be coded and hence included as part of some automated process.
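A minimal sketch of the second-difference version (my own code; see the Satopää et al paper for the curvature-based refinements):

import numpy as np

def knee_second_difference(eigvals):
    """Rank (1-based, largest eigenvalue first) at which the discrete second difference
    lambda_{i+1} + lambda_{i-1} - 2*lambda_i of the ranked eigenvalues is largest."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # rank from largest to smallest
    second_diff = lam[2:] + lam[:-2] - 2.0 * lam[1:-1]      # defined for interior ranks i = 2, ..., p-1
    return int(np.argmax(second_diff)) + 2                  # convert array position back to rank i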

Note that in situations where the signal eigenvalues (those distinct from the bulk) do contribute the majority of the variance in the original data, then selecting the number of PCs on the basis of detecting a ‘knee’ in a scree plot will have the same advantages as selecting on the basis of wanting to retain a certain fixed percentage of the total variance. Consequently, there is very little reason not to use the scree plot for selection in all cases.

3. Interpretation of the loadings

I have also found it is not uncommon for PCA to be blindly applied and the scores, i.e. the values of the new, lower-dimensional features, to be plotted or fed into some downstream process without any further curiosity being applied. Where the leading sample eigenvalue is very large – that is, on a different scale to the other principal components retained or to the bulk of the sample eigenvalues – I always take a look at the loadings. This is the case for the S&P500 data discussed above. The loadings tell us the contribution each of the original variables makes to the principal component. The loadings for the 1st principal component of the S&P500 data are shown in the plot below,

[Figure: Loadings of the 1st principal component of the S&P 500 returns data]

Where the 1st sample eigenvalue is very large (compared to the others) it is not uncommon to see a loading pattern like that above – every loading value has the same sign (and in this case is of comparable size). The large 1st eigenvalue tells us that variation along the 1st principal component is the dominant mode of variation present in the original data. The loading pattern tells us that in this mode of variation all the original variables increase together, or decrease together. Such a pattern is often termed a ‘global mode’, i.e. it is a mode of variation that largely has the same effect globally on all the original variables. There are several scenarios and situations where the presence of a global mode can be naturally explained or is to be expected. For example,

  • Market modes in stock prices. This is where a rising or falling market causes all stock prices to rise or fall together.
  • In gene expression data obtained from model organisms exposed to a large environmental perturbation or insult. Here, for model organisms, e.g. yeast cultures, we can shock the biological system being studied without any ethical concerns, e.g. starve the organism of its primary food/fuel source or other essential nutrients. Consequently we see a system-wide response to the starvation.
  • Price sensitivities of a collection of products sold by a retailer in a store. Here we expect price elasticities of products will reflect, to a large part, the economic conditions of the local geography. Consequently, a predominant part of the variation in product elasticities will be due to store-to-store variation in economic conditions and may show up as a global mode.

Where we see a global mode in the loadings, we should ask whether we can identify a credible mechanism behind the global mode. If not, then this should make us cautious about the appropriateness of the 1st principal component and hence the complete decomposition.
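A quick, admittedly crude, check I find useful here (the function is my own, purely illustrative) is whether nearly all the loadings of the leading component share the same sign:

import numpy as np

def looks_like_global_mode(loadings, tol=0.05):
    """Heuristic check for a 'global mode': nearly all loadings of a component have one sign.

    loadings: 1-D array of loadings for a single principal component.
    tol: fraction of loadings allowed to have the opposite sign.
    """
    signs = np.sign(loadings[np.abs(loadings) > 1e-12])   # ignore numerically zero loadings
    frac_positive = np.mean(signs > 0)
    return frac_positive > 1.0 - tol or frac_positive < tol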

Finally, it is worth mentioning that if a sparse loading pattern is more naturally expected or convenient, then sparse versions of PCA can be used. The lecture notes by Rob Tibshirani provide a good introduction to sparse PCA.
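For completeness, scikit-learn ships an implementation that can serve as a starting point; a minimal sketch (the data here is just a random stand-in):

import numpy as np
from sklearn.decomposition import SparsePCA

X = np.random.default_rng(0).standard_normal((200, 30))   # stand-in data matrix
X = X - X.mean(axis=0)                                    # mean-centre

sparse_pca = SparsePCA(n_components=5, alpha=1.0, random_state=0)  # alpha controls the sparsity penalty
scores = sparse_pca.fit_transform(X)        # low-dimensional scores
loadings = sparse_pca.components_           # rows are the sparse loading vectors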