Featured

Data Science Notes: 3. Choosing good sentinel values

Summary

  • A sentinel value is a special value of a variable that is used to signify something, or to draw the user’s attention to something.
  • A sentinel value is often used to distinguish between two states: “it worked ok”, “it didn’t work ok”.
  • Don’t choose a sentinel value that can inadvertently be interpreted or processed as a normal value of the variable in which it is stored.
  • Choose a sentinel value such that if it becomes corrupted it can’t accidentally still be interpreted as indicating either of the states it is intended to distinguish between.
  • If a sentinel value is processed by a downstream calculation, the sentinel value should be such that the downstream calculation generates an exception.
  • If processing a sentinel value generates an exception, the choice of sentinel value should be such that the exception is generated as close as possible (in code) to the place where the sentinel value is first inappropriately processed.

Introduction

All the posts in my Data Science Notes series are based on techniques I use regularly, or on conversations I’ve had with other Data Scientists where I’ve explained why experience has taught me to do something in a particular way. This post is no exception. It stemmed from someone asking why I initialize all my arrays with NaNs, particularly when I’m going to later reinitialize each array element to zero when I start adding to it. It’s because I want the NaN value to act as a sentinel value.
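
To make this concrete, here is a minimal sketch of the pattern in NumPy. The function and the data are purely illustrative (they are not from any real project of mine); the point is simply that any array element that is never explicitly touched keeps its NaN sentinel.

import numpy as np

def accumulate_totals(contributions, n_features=5):
    # Initialize with NaN, not zero: NaN is the sentinel meaning "never updated".
    totals = np.full(n_features, np.nan)
    for j, value in contributions:
        if np.isnan(totals[j]):
            totals[j] = 0.0   # first touch: re-initialize this element to zero
        totals[j] += value
    return totals

print(accumulate_totals([(0, 1.5), (0, 2.5), (3, 4.0)]))
# Features 1, 2 and 4 were never updated, and the NaNs in the output tell us so.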

What is a sentinel value, I can hear you asking? Well, a sentinel value watches. It warns. A bit like a lighthouse. It communicates that something has happened or gone wrong. Okay, but isn’t that just a flag? Can’t I just use a Boolean variable to record when something went wrong? Here’s the thing: often we want to use or store a sentinel value in a variable that is already processed as part of a calculation.

Here’s an example. Say I have an array of variances that I need to calculate. That array of variances might be the variances of a set of features I’m using in a predictive model. But for some of the features I might not be able to calculate the variance. The feature value might be missing across all but one of the training data points. No problem, I’ll just set the variance of that feature equal to -1. Since every Data Scientist knows a variance has to be non-negative, a value of -1 clearly indicates that the variance for that feature has not been calculated, and it is not available for any downstream calculation. That is a sentinel value in action. The -1 is the sentinel value. It has communicated to me that the variance calculation has not worked, but it is not a flag variable. It is just a different value stored inside a normal variable, the feature variance in this case.
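
As a rough sketch of that pattern (illustrative only, and assuming the features arrive as a NumPy array with NaNs marking missing values):

import numpy as np

def feature_variances(X):
    # X has shape (n_samples, n_features); NaN marks a missing value.
    n_features = X.shape[1]
    variances = np.full(n_features, -1.0)   # sentinel: -1 means "variance not available"
    for j in range(n_features):
        observed = X[~np.isnan(X[:, j]), j]
        if observed.size >= 2:              # need at least two points for a sample variance
            variances[j] = np.var(observed, ddof=1)
    return variances

X = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [3.0, 5.0]])
print(feature_variances(X))   # the second feature's variance comes back as the sentinel -1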

In the example above there are clear and obvious reasons why I chose -1 as the sentinel value for my array of variances. Now, here’s the thing; choosing good sentinel values can be a bit of an art learnt through experience. To demonstrate this, I’m going to give two real anecdotes from my career as a Data Scientist where good sentinel values were not used. From these two anecdotes we can distil a number of useful lessons about what makes a good sentinel value.

First anecdote:

On one project I worked on we had a piece of code that produced some parameter estimates for a predictive model, and some associated parameter uncertainties. Interestingly, the parameter uncertainties had a sentinel value of 9999.99. A parameter uncertainty of 9999.99 was supposed to indicate that we hadn’t been able to estimate the parameter – there were a number of genuine reasons why this could happen. The value of 9999.99 was considered too large to be mistakenly confused for a genuine parameter uncertainty. So whenever we saw a parameter uncertainty of 9999.99 we knew that the parameter had in fact not been estimated. And whenever we saw a parameter uncertainty less than 9999.99 we knew that the parameter had been estimated. A few years later I discovered that it was possible for a downstream processing step, under certain circumstances, to modify those 9999.99 values to something different. The final result was a parameter uncertainty which was large, e.g. 9548.17, but not 9999.99. We’d lost the ability to interpret not being 9999.99 as an indication of an estimated parameter. Fortunately, there were ways other than the sentinel value to tell whether the parameter had been estimated, but the lessons learnt from this anecdote are two-fold,

  1. Don’t choose sentinel values that can inadvertently be interpreted as something different to what the sentinel value was intended to indicate. In this case the 9999.99 was intended to be implausibly large for a parameter uncertainty. However, although implausible, it is not impossible, and so it can be accidentally processed in a way it shouldn’t be. And accidents will happen. In fact, you should assume that if an accident can happen, it will happen. The toy illustration just after this list shows how easily this can occur.
  2. The second lesson learnt here is that a sentinel value flags a condition or state, so we use a sentinel value to infer which of two situations we are in. In this example, even with the mistaken downstream processing of the sentinel value, a final result of 9999.99 would still have indicated an un-estimated parameter, so it would look like the sentinel value of 9999.99 was doing its job. However, we couldn’t reliably infer the converse case. A large parameter uncertainty that was less than 9999.99 could not be used on its own to reliably infer that the parameter had been estimated. So when designing a good sentinel value we need to think about how it will be used to infer either of the two scenarios, and whether correct inference of either scenario is always possible.
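
To see how easily lesson 1 bites, here is a toy illustration. The rescaling step is hypothetical and not the actual downstream processing from the anecdote; it just stands in for any innocent-looking transformation that touches the whole array.

import numpy as np

SENTINEL = 9999.99   # intended meaning: "parameter not estimated"

uncertainties = np.array([0.12, 0.34, SENTINEL])

# A hypothetical downstream step that rescales uncertainties, e.g. a unit conversion.
rescaled = uncertainties * 0.955

print(rescaled)                # roughly [0.115, 0.325, 9549.99]
print(rescaled == SENTINEL)    # [False False False]
# "equals 9999.99" no longer flags the un-estimated parameter, and
# "less than 9999.99" no longer guarantees an estimated one.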

Second anecdote:

This second anecdote is not about what happens when you have a bad choice of sentinel value, but about what happens when you don’t have a sentinel value at all, and what that reveals about what you want to happen when a sentinel value is accidentally processed.

I was working on an academic project with a software developer. The developer was a C++ expert with many years of experience. We were modifying an existing very large academic codebase that analyzed Genome-Wide Association Study (GWAS) data. I was supplying the additional equations needed; they were implementing them. During a particular three-week period the developer had been chasing down a corner-case bug in the existing code. After three weeks they had managed to track down the bug to a particular subroutine producing -inf values in its output for this particular corner-case. They wanted advice on how to handle this scenario from a scientific perspective. What, if anything, should we replace the -inf values with? The first thing I suggested was just looking at what had been the input to the subroutine in this case. Yep, it was -inf as well. In fact, we traced the -inf values back through three further subroutine calls. The -inf values had first arisen from zero values being present in an array where they shouldn’t have been. When first processed, the zeros had generated the -inf values, which then got further processed by three subroutine calls. Okay, part of the issue here is that there should have been a sentinel value used in place of the default value of zero. Ideally, one would want that sentinel value to be incapable of being processed by any of the downstream subroutine calls. But here’s the thing: you would have thought that the -inf value, once created, would serve as some sort of sentinel value for later subroutine calls. The lessons I learnt from this anecdote were also two-fold, namely,

  1. A good sentinel value that is accidentally processed by a function should cause your program to crash. Sorry, “ahem, cough”, cause your program to generate an instance of the custom exception class you’ve beautifully written, which your code then catches and gracefully handles.
  2. The exception raised by the inadvertent/inappropriate processing of a well-chosen sentinel value should occur as close as possible to the point where the sentinel value is first inappropriately processed, not three subroutine calls down the line. That way the sentinel value also provides useful debugging information. The sketch just below shows the idea.
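
Here is a minimal sketch of what lesson 2 can look like in practice. This is not the GWAS codebase from the anecdote; the subroutine and the zero-count scenario are hypothetical stand-ins for “a value that should never reach this calculation”.

import numpy as np

def log_counts(counts):
    # Fail fast: raise at the first point where the bad value would be processed,
    # rather than letting -inf propagate through several more subroutine calls.
    counts = np.asarray(counts, dtype=float)
    if np.any(counts <= 0):
        bad = np.flatnonzero(counts <= 0)
        raise ValueError(f"non-positive counts at indices {bad.tolist()}")
    return np.log(counts)

try:
    log_counts([120.0, 0.0, 45.0])
except ValueError as err:
    print(f"caught close to the source: {err}")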

Conclusion

Choosing good sentinel values is a bit of an art form, learnt from hard-won experience. But there are some general, high-level rules and guidelines we can give. In order of importance these are,

  • Don’t choose a sentinel value that can inadvertently be interpreted or processed as a normal value of the variable in which it is stored. A sentinel value is meant to be exceptional, not just different.
  • A sentinel value is often used to distinguish between two states. Choose a sentinel value such that if it becomes corrupted it can’t accidentally still be interpreted as indicating either of those states. If a sentinel value becomes corrupted it should become meaningless.
  • If a sentinel value is processed by a downstream calculation, the sentinel value should be such that the downstream calculation generates an exception.
  • If processing a sentinel value generates an exception, the choice of sentinel value should be such that the exception is generated as close as possible (in code) to the place where the sentinel value is first inappropriately processed.

Try using sentinel values in your coding. The only way to get better at using them is to try.

© 2026 David Hoyle. All Rights Reserved

Testing your models before you build them

Introduction

TL;DR: There are tests on models you can do even before you have done any training of the model. These are tests of the model form, and are more mathematical in nature. These tests stop you from putting a model with a flawed mathematical form into production.

My last blogpost was on using simulation data to test a model. I was asked if there are other tests I do for models, to which I replied, “other than the obvious, it depends on the model and the circumstances”. Then it occurred to me that “the obvious” tests might not be so obvious, so I should explain them here.

Personally, I broadly break down model tests into two categories:

  1. Tests on a model before training/estimation of the model parameters.
  2. Tests on a model after training/estimation of the model parameters.

The first category (pre-training) is typically made up of tests on model form – does the model make sense, does it include features in a sensible way? These are the tests that get omitted most often, and that the majority of Data Scientists don’t have in their toolkit. However, these are the tests that will spot the big costly problems before the model makes it into production.

The second category of tests (post-training) is typically made up of tests on the numerical values of the model parameters and various goodness-of-fit measures. These are the tests that most Data Scientists will know about and will use regularly. Because of this I’m not going to go into details of any tests in this second category. What I want to focus on is tests in the first category, as this is where I think there is a gap in most Data Scientists’ toolkit.

The tests in the first category are largely mathematical, so I’m not going to give code examples. Instead, I’m just going to give a short description of each type of test and what it tries to achieve. Let’s start.

Pre-training tests of model form

Asymptotic behaviour tests:

One of the easiest ways to test a model form is to look at its output in circumstances which are easy to understand. In a model with many features and interacting parts this is best done by seeing what happens when you make one of the variables or parameters as large as possible (or as small as possible). In these circumstances the other variables will often become irrelevant and so the behaviour of the model is easier to spot. For example, in a demand model that predicts how much of a grocery product you’re going to sell, does putting up the price to infinity cause the predicted sales volume to drop to zero? If not, you’ve got a problem with your model.
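
As a toy worked example (using a hypothetical saturating demand form, not a model from any of my projects), suppose the predicted sales volume at price p is,

q(p) \;=\; \frac{A}{1 + (p/p_{0})^{\gamma}}\;,\qquad A, p_{0}, \gamma > 0

Then, by inspection,

\lim_{p \to \infty} q(p) \;=\; 0\;,\qquad \lim_{p \to 0} q(p) \;=\; A

so an infinite price kills demand and A is the zero-price demand ceiling, both of which make sense. A proposed model form for which either limit came out non-zero or unbounded would fail this test.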

Asymptotic behaviour tests are not limited to scenarios in which variables/parameters become very large or very small. In some cases the appropriate asymptotic scenario might be a parameter approaching a finite value at which a marked change in behaviour is expected. It should be clear that identifying asymptotic scenarios for which we can easily predict what should happen can require some domain knowledge. If you aren’t confident of your understanding of the application domain, then a good start is to make variables/parameters very large and/or very small one at a time and see if the resulting behaviour makes sense.

Typically, working out the behaviour of your model form in some asymptotic limit can be done simply by visual inspection of the mathematical form of your model, or with a few lines of pen-and-paper algebra. This gives us the leading order asymptotic behaviour. With a bit more pen-and-paper work we can also work out a formula for the next-to-leading order term in the asymptotic expansion of the model output. The next-to-leading order term tells us how quickly the model output approaches its asymptotic value – does it increase to the asymptotic value as we increase the variable, or does it decrease to the asymptotic value? We can also see which other variables and parameters affect the rate of this approach to the asymptotic value, again allowing us to identify potential flaws in the model form.
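
Continuing the hypothetical demand example from above, a one-line expansion for large p gives,

q(p) \;=\; A\left(\frac{p_{0}}{p}\right)^{\gamma} \;-\; A\left(\frac{p_{0}}{p}\right)^{2\gamma} \;+\; O\!\left( (p_{0}/p)^{3\gamma} \right)

The leading-order term says demand decays like a power law with exponent \gamma, and the next-to-leading order term says the true curve sits just below that power law, with p_{0} and \gamma controlling how quickly the asymptote is approached. A fitted \gamma that implied an implausibly slow decay of demand with price would be a prompt to revisit the model form.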

The asymptotic expansion approach to testing a model form can be continued to even higher orders, although I rarely do so. Constructing asymptotic expansions requires some experience with specific analysis techniques, e.g. saddle-point expansions. So I would recommend the following approach,

  1. Always do the asymptotic limit (leading order term) test(s) as this is easy and usually requires minimal pen-and-paper work.
  2. Only derive the next-to-leading order behaviour if you have experience with the right mathematical techniques. Don’t sweat if you don’t have the skills/experience to do this, as you will still get a huge amount of insight from just doing step 1.

Stress tests/Breakdown tests:


These are similar in spirit to the asymptotic analysis tests. You’re looking to see if there are any scenarios in which the model breaks down. And by “break down”, I mean it gives a nonsensical answer, such as predicting a negative value for a quantity that in real life can only be positive. How a model breaks down can be informative. For example, does the scenario in which the model breaks down clearly reflect an obvious limitation of the model assumptions? In that case the breakdown is entirely expected and nothing to worry about. The breakdown is telling you what you already know: that in this scenario the assumptions don’t hold or are inappropriate, and so we expect the model to be inaccurate or not work at all. If the breakdown scenario doesn’t reflect known weaknesses of the model assumptions, you’ve either uncovered a flaw in the mathematical form of your model, which you can now fix, or you’ve uncovered an extra hidden assumption you didn’t know about. Either way, you’ve made progress.
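
A simple hypothetical example of a breakdown that is expected: a linear demand approximation,

q(p) \;=\; a \;-\; b\,p\;,\qquad a, b > 0

predicts negative sales volumes for any price p > a/b. That breakdown just reflects the known assumption that the linear form is only a local approximation around typical prices, so it is expected and manageable. If instead a model that was supposed to hold at all prices went negative, that would point to a genuine flaw in its mathematical form.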

Recover known behaviours:

This is another test that has similarities to the asymptotic analysis and the stress tests. For example, your model may be a generalization of a more specialized model. It may contain extra parameters that capture non-linear effects. If we set those extra parameters to zero, in the model or in any downstream mathematical analysis we have performed, then we would expect to recover the same behaviour as the purely linear model. Is this what happens? If not, you’ve got a problem with your model or the downstream analysis. Again, this is using the known, expected behaviour of a nested sub-case as a check on the general model.
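
For instance (a hypothetical nested pair, purely for illustration), take a quadratic extension of a linear model,

f(x) \;=\; \beta_{0} + \beta_{1} x + \beta_{2} x^{2}

Setting \beta_{2} = 0 should recover the linear model everywhere, and any downstream quantity we have derived should behave consistently in that limit. For example, the stationary point x^{*} = -\beta_{1}/(2\beta_{2}) should move off to infinity as \beta_{2} \to 0, matching the fact that the linear model has no interior optimum. If a derived formula stays finite in that limit, something has gone wrong in the algebra or in the model form.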

Coefficients before fitting:

You’re probably familiar with the idea of checking the parameters of a model after fitting, to see that those parameter values make sense. Here, I’m talking about models with small numbers of features, and hence parameters, which also have some easy interpretation. Because we can interpret the parameters, we can probably come up with what we think are reasonable ball-park values for them even before training the model. This gives us, i) a check on the final fitted parameter values, and ii) a check on what scale of output we think is reasonable from the model. We can then compare what we think should be the scale of the model output against what is needed to explain the response data. If there is a mismatch of an order of magnitude or more, then we have a problem. Our model will either be incapable of explaining the training data in its current mathematical form, or one or more of the parameters is going to have an exceptional value. Either way, it is probably wise to look at the mathematical form of your model again.

Dimensional analysis:


In high school you may have encountered dimensional analysis in physics lessons. There you checked that the left-hand and right-hand sides of a formula were consistent when expressed in dimensions of Mass (M), Length (L), and Time (T). However, we can extend the idea to any sets of dimensions. If the right-hand side of a formula consists of clicks divided by spend, and so has units of \rm{[currency]}^{-1}, then so must the left-hand side. Similarly, arguments to transcendental functions such as exp or sin and cos must be dimensionless. These checks are a quick and easy way to spot if a formula is inadvertently missing a dimensionful factor.
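
As a small illustration of the transcendental-function rule (using a made-up response curve, not a model from this post), a form such as,

q(p) \;=\; A \exp\left( -\,p/p_{0} \right)

only makes dimensional sense if p_{0} carries the same units of \rm{[currency]} as the price p, so that the argument of the exponential is dimensionless. Equivalently, if someone writes q(p) = A\exp(-\beta p) then \beta must have units of \rm{[currency]}^{-1}; a dimensionless \beta would signal a missing scale factor in the formula.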

Conclusion:

These tests of the mathematical form of a model ensure that a model is robust and its output is sensible when used in scenarios outside of its training data. And let’s be realistic here; in commercial Data Science all models get used beyond the scope for which they are technically valid. Not having a robust and sensible mathematical form for your model means you run the risk of it outputting garbage.

© 2025 David Hoyle. All Rights Reserved

Extreme Components Analysis

TL;DR: Both Principal Components Analysis (PCA) and Minor Components Analysis (MCA) can be used for dimensionality reduction, identifying low-dimensional subspaces of interest as those which have the greatest variation in the original data (PCA), or those which have the least variation in the original data (MCA). As real data will contain both directions of unusually high variance and directions of unusually low variance, using just PCA or just MCA will lead to biased estimates of the low-dimensional subspace. The 2003 NeurIPS paper from Welling et al unifies PCA and MCA into a single probabilistic model, XCA (Extreme Components Analysis). This post explains the XCA paper of Welling et al and demonstrates the XCA algorithm using simulated data. Code for the demonstration is available from https://github.com/dchoyle/xca_post

A deadline

This post arose because of a deadline I have to meet. I don’t know when the deadline is, I just know there is a deadline. Okay, it is a self-imposed deadline, but it will start to become embarrassing if I don’t hit it.

I was chatting with a connection, Will Faithfull, at a PyData Manchester Leaders meeting almost a year ago. I mentioned that one of my areas of expertise was Principal Components Analysis (PCA), or more specifically, the use of Random Matrix Theory to study the behaviour of PCA when applied to high-dimensional data.

A recap of PCA

In PCA we are trying to approximate a d-dimensional dataset by a reduced number of dimensions k < d. Obviously we want to retain as much of the structure and variation of the original data as possible, so we choose our k-dimensional subspace such that the variance of the original data in the subspace is as high as possible. Given a mean-centered data matrix \underline{\underline{X}} consisting of N observations, we can calculate the sample covariance matrix \hat{\underline{\underline{C}}} as [1],

\hat{\underline{\underline{C}}} = \frac{1}{N-1} \underline{\underline{X}}^{\top} \underline{\underline{X}}

Once we have the (symmetric) matrix \hat{\underline{\underline{C}}} we can easily compute its eigenvectors \underline{v}_{i}, i=1,\ldots, d, and their corresponding eigenvalues \lambda_{i}.

The optimal k-dimensional PCA subspace is then spanned by the k eigenvectors of \hat{\underline{\underline{C}}} that correspond to the k largest eigenvalues of \hat{\underline{\underline{C}}}. These eigenvectors are the directions of greatest variance in the original data. Alternatively, one can just do a Singular Value Decomposition (SVD) of the original data matrix \underline{\underline{X}}, and work with the singular values of \underline{\underline{X}} instead of the eigenvalues of \hat{\underline{\underline{C}}}.
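
Here is a minimal NumPy sketch of that recipe (my own illustration with random data, not the notebook that accompanies the demonstration later in this post):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # N=500 observations, d=20 features
X = X - X.mean(axis=0)                    # mean-centre the data matrix

N = X.shape[0]
C_hat = X.T @ X / (N - 1)                 # sample covariance, 1/(N-1) convention

eigvals, eigvecs = np.linalg.eigh(C_hat)  # eigh returns eigenvalues in ascending order

k = 3
pc_directions = eigvecs[:, -k:]           # the k directions of greatest variance
pc_scores = X @ pc_directions             # project the data onto the PCA subspace

# Equivalent route via the SVD of X: the singular values s relate to the
# eigenvalues of C_hat through lambda = s**2 / (N - 1)
_, s, _ = np.linalg.svd(X, full_matrices=False)
assert np.allclose(np.sort(s**2 / (N - 1)), eigvals)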

That is a heuristic derivation/justification of PCA (minus the detailed maths) that goes back to Harold Hotelling in 1933 [2]. There is a probabilistic model-based derivation due to Tipping and Bishop (1999), which we will return to later.

MCA

Will responded that as part of his PhD, he’d worked on a problem where he was more interested in the directions in the dataset along which the variation is least. The problem Will was working on was “unsupervised change detection in multivariate streaming data”. The solution Will developed was a modular one, chaining together several univariate change detection methods, each of which monitored a single feature of the input space. This was combined with an MCA feature extraction and selection pre-processing step. The solution was tested against a problem of unsupervised endogenous eye blink detection.

The idea behind Will’s use of MCA was that, for the streaming data he was interested in, the inter-class variances of various features were likely to be much smaller than the intra-class variances, and so any principal components were likely to be dominated by what the classes had in common rather than what had changed. The directions of greatest variance therefore weren’t very useful for his change detection algorithm.

I’ve put a link here to Will’s PhD in case you are interested in the details of the problem and solution – yes, Will I have read your PhD.

Directions of least variance in a dataset can be found from the same eigen-decomposition of the sample covariance matrix, by selecting the components with the smallest non-zero eigenvalues. Unsurprisingly, focusing on directions of least variance in a dataset is called Minor Components Analysis (MCA) [3,4]. Where we have the least variation in the data, the data is effectively constrained, so MCA is good for identifying/modelling invariants or constraints within a dataset.
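
In the NumPy sketch from the PCA recap above (and assuming N > d, so that none of the sample eigenvalues are exactly zero), the minor components come from the other end of the same eigen-decomposition:

k_mc = 3
mc_directions = eigvecs[:, :k_mc]   # the k_mc directions of least variance (MCA)
mc_scores = X @ mc_directions       # projections onto the minor-component subspace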

At this point in the conversation, I recalled the last time I’d thought about MCA. That was when an academic colleague and I had a paper accepted at the NeurIPS conference in 2003. Our paper was on kernel PCA applied to high-dimensional data, in particular the eigenvalue distributions that result. As I was moving job and house at the time I was unable to go to the conference, so my co-author, Magnus Rattray (now Director of the Institute for Data Science and Artificial Intelligence at the University of Manchester), went instead. On returning, Magnus told me of an interesting conversation he’d had at the conference with Max Welling about our paper. Max also had a paper at the conference, on XCA – Extreme Components Analysis. Max and his collaborators had managed to unify PCA and MCA into a single framework.

I mentioned the XCA paper to Will at the PyData Manchester Leaders meeting and said I’d write something up explaining XCA. It would also give me an excuse to revisit something that I hadn’t looked at since 2003. That conversation with Will was nearly a year ago. Another PyData Manchester Leaders meeting came and went and another will be coming around sometime soon. To avoid having to give a lame apology I thought it was about time I wrote this post.

XCA

Welling et al rightly point out that if we are modelling a dataset as lying in some reduced dimensionality subspace then we consider the data as being a combination of variation and constraint. We have variation of the data within a subspace and a constraint that the data does not fall outside the subspace. So we can model the same dataset focusing either on the variation (PCA) or on the constraints (MCA).

Note that in my blog post I have used a different, more commonly used notation for the number of features and the number of components than that used in the Welling et al paper. The mapping between the two notations is given below,

  • Number of features: My notation = d, Welling et al notation = D
  • Number of components: My notation = k, Welling et al notation = d

Probabilistic PCA and MCA

PCA and MCA both have probabilistic formulations, PPCA and PMCA [5] respectively. Welling et al state that, “probabilistic PCA can be interpreted as a low variance data cloud which has been stretched in certain directions. Probabilistic MCA on the other hand can be thought of as a large variance data cloud which has been pushed inward in certain directions.” In both probabilistic models a d-dimensional datapoint \underline{x} is considered as coming from a zero-mean multivariate Gaussian distribution. In probabilistic PCA the covariance matrix of the Gaussian is modelled as,

\underline{\underline{C}}_{PCA} = \sigma^{2}_{0}\underline{\underline{I}}_{d} + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}

The matrix \underline{\underline{A}} is d \times k and its columns are the principal components that span the low-dimensional subspace we are trying to model.

In MCA the covariance matrix is modelled as,

\underline{\underline{C}}_{MCA}^{-1} = \sigma^{-2}_{0}\underline{\underline{I}}_{d} + \underline{\underline{W}}^{\top}\,\underline{\underline{W}}

The matrix \underline{\underline{W}} is (d-k) \times d and its rows are the minor components that define the (d-k)-dimensional subspace in which we want as little variation as possible.

Since in real data we probably have both exceptional directions whose variance is greater than the bulk and exceptional directions whose variance is less than the bulk, both PCA and MCA would lead to biased estimates for these datasets. The problem is that if we use PCA we lump the low variation eigenvalues (minor components) in with our estimate of the isotropic noise, thereby underestimating the true noise variance and consequently biasing our estimate of the large variation PC subspace. Likewise, if we use MCA, we lump all the large variation eigenvalues (principal components) into our estimate of the noise and overestimate the true noise variance, thereby biasing our estimate of the low variation MC subspace.

Probabilistic XCA

In XCA we don’t have that problem. In XCA we include both large variation and small variation directions in our reduced dimensionality subspace. In fact we just have a set of orthogonal directions \underline{a}_{i}\;,\;i=1,\ldots,k that span a low-dimensional subspace and again form the columns of a matrix \underline{\underline{A}}. These are our directions of interest in the data. Some of them, say, k_{PC}, have unusually large variance, some of them, say k_{MC}, have unusually small variance. The overall number of extreme components (XC) is k = k_{PC} + k_{MC}.

As with probabilistic PCA, we then add on top an isotropic noise component to the overall covariance matrix. However, the clever trick used by Welling et al was to realize that adding noise always increases variances, and so adding noise to all features would make the minor components undetectable, as the minor components have, by definition, variances below that of the bulk noise. To circumvent this, Welling et al only added noise to the subspace orthogonal to the subspace spanned by the vectors \underline{a}_{i}. They do this by introducing a projection operator {\cal{P}}_{A}^{\perp} = \underline{\underline{I}}_{d} - \underline{\underline{A}} \left ( \underline{\underline{A}}^{\top}\,\underline{\underline{A}}\right )^{-1}\underline{\underline{A}}^{\top}. Again we model the data as coming from a zero-mean multivariate Gaussian, but for XCA the final covariance matrix is then of the form,

\underline{\underline{C}}_{XCA}  = \sigma^{2}_{0} {\cal{P}}_{A}^{\perp}  + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}

and the  XCA model is,

\underline{x} \sim \underline{\underline{A}}\,\underline{y} + {\cal{P}}_{A}^{\perp} \underline{n}\;\;\;,\;\; \underline{y} \sim {\cal{N}}\left ( \underline{0}, \underline{\underline{I}}_{k}\right )\;\;\;,\;\;\underline{n} \sim {\cal {N}}\left ( \underline{0},  \sigma^{2}_{0}\underline{\underline{I}}_{d}\right )
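
As a sanity check on this generative form, here is a small NumPy sketch (my own illustration, separate from the notebook linked later in the post) that draws data from it and compares the empirical covariance with \sigma^{2}_{0} {\cal{P}}_{A}^{\perp} + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}:

import numpy as np

rng = np.random.default_rng(1)

d, k = 50, 4       # number of features, number of extreme components
sigma0 = 1.0

# Orthogonal directions a_i: two with large variance ("principal-like") and two
# with small variance ("minor-like"). Column i of A has squared norm equal to
# the variance along a_i.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
A = Q * np.sqrt(np.array([5.0, 4.0, 0.05, 0.02]))

# Projector onto the orthogonal complement of the span of the columns of A
P_perp = np.eye(d) - A @ np.linalg.inv(A.T @ A) @ A.T

N = 5000
y = rng.normal(size=(N, k))                  # y ~ N(0, I_k)
n = rng.normal(scale=sigma0, size=(N, d))    # n ~ N(0, sigma0^2 I_d)
X = y @ A.T + n @ P_perp                     # x = A y + P_perp n, one row per datapoint

C_emp = X.T @ X / (N - 1)
C_model = sigma0**2 * P_perp + A @ A.T
print(np.abs(C_emp - C_model).max())         # small, up to sampling noise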

We can also start from the MCA side, by defining a projection operator {\cal{P}}_{W}^{\perp} = \underline{\underline{I}}_{d} - \underline{\underline{W}}^{\top} \left ( \underline{\underline{W}}\,\underline{\underline{W}}^{\top}\right )^{-1}\underline{\underline{W}}, where the rows of the k\times d matrix \underline{\underline{W}} span the k dimensional XC subspace we wish to identify. From this MCA-based approach Welling et al derive the probabilistic model for XCA as zero-mean multivariate Gaussian distribution with inverse covariance,

\underline{\underline{C}}_{XCA}^{-1}  = \frac{1}{\sigma^{2}_{0}} {\cal{P}}_{W}^{\perp}  + \underline{\underline{W}}^{\top}\,\underline{\underline{W}}

The two probabilistic forms of XCA are equivalent, and so one finds that the matrices \underline{\underline{A}} and \underline{\underline{W}} are related via \underline{\underline{A}} = \underline{\underline{W}}^{\top}\left ( \underline{\underline{W}}\,\underline{\underline{W}}^{\top} \right )^{-1}.

If we also look at the two ways in which Welling et al derived a probabilistic model for XCA, we can see that they are very similar to the formulations of PPCA and PMCA respectively, just with \underline{\underline{I}}_{d} replaced by {\cal{P}}_{A}^{\perp} in the PPCA formulation, and \underline{\underline{I}}_{d} replaced by {\cal{P}}_{W}^{\perp} in the PMCA formulation. So with just a redefinition of how we add the noise in the probabilistic model, Welling et al derived a single probabilistic model that unifies PCA and MCA.

Note that we are now defining the minor components subspace as directions of unusually low variance, so we only need a few dimensions, i.e. k_{MC} \ll d, whilst previously when we defined the minor components subspace as the subspace where we wanted to constrain the data away from, we needed k_{MC} = d - k_{PC} directions. The probabilistic formulation of XCA is a very natural and efficient way to express MCA.

Maximum Likelihood solution for XCA

The model likelihood is easily written down and the maximum likelihood solution identified. As one might anticipate, the maximum-likelihood estimates for the vectors \underline{a}_{i} are just eigenvectors of \hat{\underline{\underline{C}}}, but we need to work out which ones. We can use the likelihood value at the maximum likelihood solution to do that for us.

Let’s say we want to retain k=6 extreme components overall, and we’ll use {\cal{C}} to denote the corresponding set of eigenvalues of \hat{\underline{\underline{C}}} that are retained. The maximum likelihood value for k extreme components (XC) is given by,

\log L_{ML} = - \frac{Nd}{2}\log \left ( 2\pi e\right )\;-\;\frac{N}{2}\sum_{i\in {\cal{C}}}\log \lambda_{i}\;-\;\frac{N(d-k)}{2}\log \left ( \frac{1}{d-k}\left [ {\rm tr}\hat{\underline{\underline{C}}} - \sum_{i\in {\cal{C}}} \lambda_{i}\right ]\right )

All we need to do is evaluate the above equation for all possible subsets {\cal{C}} of size k selected from the d eigenvalues \lambda_{i}\, , i=1,\ldots,d of \hat{\underline{\underline{C}}}. Superficially, this looks like a nasty combinatorial optimization problem, of exponential complexity. But as Welling et al point out, we know from a result proved in the PPCA paper of Tipping and Bishop that in the maximum likelihood solution the non-extreme components have eigenvalues that form a contiguous group, and so the optimal choice of subset {\cal{C}} reduces to determining where to split the ordered eigenvalue spectrum of \hat{\underline{\underline{C}}}. Since we have k = k_{PC} + k_{MC} that reduces to simply determining the optimal number of the largest \lambda_{i} to retain. That makes the optimization problem linear in k.

For example, in our hypothetical example we have said we want k = 6, but that could be a 3 PCs + 3 MCs split, or a 2 PCs + 4 MCs split, and so on. To determine which, we simply compute the maximum likelihood value for all the possible values of k_{PC} from k_{PC} = 0 to k_{PC} = k, each time keeping the largest k_{PC} values of \lambda_{i} and the smallest k - k_{PC} values of \lambda_{i} in our set {\cal{C}}.

Some of the terms in \log L_{ML} don’t change as we vary k_{PC} and can be dropped. Welling et al introduce a quantity {\cal{K}} defined by,

{\cal{K}}\;=\;\sum_{i\in {\cal{C}}}\log \lambda_{i}\; + \;(d-k)\log \left ( {\rm tr}\hat{\underline{\underline{C}}} - \sum_{i\in {\cal{C}}} \lambda_{i}\right )

{\cal{K}} is the negative of \log L_{ML}, up to an irrelevant constant and scale. If we then compute {\cal{K}}\left ( k_{PC}\right ) for all values of k_{PC} = 0 to k_{PC} = k and select the minimum, we can determine the optimal split of k = k_{PC} + k_{MC}.
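
A sketch of that selection loop in NumPy (again my own illustration, not the notebook code), using {\cal{K}} as written above:

import numpy as np

def optimal_split(sample_eigvals, k):
    # sample_eigvals: the d eigenvalues of the sample covariance matrix.
    # Returns the split (k_PC, k_MC) that minimises K over k_PC = 0, ..., k.
    lam = np.sort(np.asarray(sample_eigvals, dtype=float))   # ascending order
    d = lam.size
    trace = lam.sum()
    best_kpc, best_K = 0, np.inf
    for k_pc in range(k + 1):
        k_mc = k - k_pc
        # retain the k_pc largest and the k_mc smallest eigenvalues
        retained = np.concatenate([lam[:k_mc], lam[d - k_pc:]])
        K = np.sum(np.log(retained)) + (d - k) * np.log(trace - retained.sum())
        if K < best_K:
            best_kpc, best_K = k_pc, K
    return best_kpc, k - best_kpc

# For the simulated dataset described later in the post, one would expect
# optimal_split(sample_eigvals, k=20) to return the (10, 10) split.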

PCA and MCA as special cases

Potentially, we could find that k_{PC} = k, in which case all the selected extreme components would correspond to principal components, and so the XCA algorithm becomes equivalent to PCA. Likewise, we could get k_{PC} = 0, in which case all the selected extreme components would correspond to minor components and the XCA algorithm becomes equivalent to MCA. So XCA contains pure PCA and pure MCA as special cases. But when do these special cases arise? Obviously, it will depend upon the precise values of the sample covariance eigenvalues \lambda_{i}, or rather the shape of the eigen-spectrum, but Welling et al also give some insight here, namely,

  • A log-convex sample covariance eigen-spectrum will give PCA
  • A log-concave sample covariance eigen-spectrum will give MCA
  • A sample covariance eigen-spectrum that is neither log-convex nor log-concave will yield both principal components and minor components

In layman’s terms, if the plot of the (sorted) eigenvalues on a log scale only bends upwards (has positive second derivative) then XCA will give just principal components, whilst if the plot of the (sorted) eigenvalues on a log scale only bends downwards (has negative second derivative) then we’ll get just minor components. If the plot of the log-eigenvalues has places where the second derivative is positive and places where it is negative, then XCA will yield a mixture of principal and minor components.

Experimental demonstration

To illustrate the XCA theory I produced a Jupyter notebook that generates simulated data containing a known number of principal components and a known number of minor components. The simulated data is drawn from a zero-mean multivariate Gaussian distribution with population covariance matrix \underline{\underline{C}} whose eigenvalues \Lambda_{i} have been set to the following values,

\begin{array}{cclcl} \Lambda_{i} & = & 3\sigma^{2}\left ( 1 + \frac{i}{3k_{PC}}\right ) & , & i=1,\ldots, k_{PC}\\ \Lambda_{i} & = & \sigma^{2} & , & i=k_{PC}+1,\ldots, d - k_{MC}\\ \Lambda_{i} & = & 0.1\sigma^{2}\left ( 1 + \frac{i - (d - k_{MC})}{k_{MC}} \right ) & , & i=d - k_{MC} + 1,\ldots, d\end{array}

The first k_{PC} eigenvalues represent principal components, as their variance is considerably higher than the ‘noise’ eigenvalues, which are represented by eigenvalues i=k_{PC}+1 to i=d - k_{MC}. The last k_{MC} eigenvalues represent minor components, as their variance is considerably lower than the ‘noise’ eigenvalues. Note, I have scaled both the PC and MC population eigenvalues by the ‘noise’ variance \sigma^{2}, so that \sigma^{2} just sets an arbitrary (user-chosen) scale for all the variances. I have chosen a large value of \sigma^{2}=20, so that when I plot the minor component eigenvalues of \hat{\underline{\underline{C}}} I can easily distinguish them from zero (without having to plot on a logarithmic scale).

We would expect the eigenvalues of the sample covariance matrix to follow a similar pattern to the eigenvalues of the population covariance matrix that we used to generate the data, i.e. we expect a small group of noticeably low-valued eigenvalues, a small group of noticeably high-valued eigenvalues, and the bulk (majority) of the eigenvalues to form a continuous spectrum of values.

I generated a dataset consisting of N=2000 datapoints, each with d=200 features. For this dataset I chose k_{PC} = k_{MC} = 10. From the simulated data I computed the sample covariance matrix \hat{\underline{\underline{C}}} and then calculated the eigenvalues of \hat{\underline{\underline{C}}}.
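
For concreteness, here is a condensed NumPy sketch of that data-generation and eigenvalue step. This is my own re-creation of the set-up described above, not necessarily the code in the accompanying notebook, which is linked just below.

import numpy as np

rng = np.random.default_rng(42)

d, N = 200, 2000
k_pc, k_mc = 10, 10
sigma2 = 20.0

# Population covariance eigenvalues following the recipe above
Lambda = np.full(d, sigma2)
Lambda[:k_pc] = 3.0 * sigma2 * (1.0 + np.arange(1, k_pc + 1) / (3.0 * k_pc))
Lambda[d - k_mc:] = 0.1 * sigma2 * (1.0 + np.arange(1, k_mc + 1) / k_mc)

# Random orthonormal eigenvectors, then N zero-mean Gaussian datapoints
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
C_pop = V @ np.diag(Lambda) @ V.T
X = rng.multivariate_normal(np.zeros(d), C_pop, size=N)

# Sample covariance matrix and its eigenvalues (ascending)
X = X - X.mean(axis=0)
C_hat = X.T @ X / (N - 1)
sample_eigvals = np.linalg.eigvalsh(C_hat)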

The left-hand plot below shows all the eigenvalues (sorted from lowest to highest), and I have also zoomed in on just the smallest values (middle plot) and just the largest values (right-hand plot). We can clearly see that the sample covariance eigenvalues follow the pattern we expected.

Plots of the eigenvalues of the sample covariance matrix of the simulated data.

All the code (with explanations) for my calculations is in a Jupyter notebook, freely available from the GitHub repository https://github.com/dchoyle/xca_post

From the left-hand plot we can see that there are places where the sample covariance eigenspectrum bends upwards and places where it bends downwards, indicating that we would expect the XCA algorithm to retain both principal and minor components. In fact, we can clearly see from the middle and right-hand plots the distinct group of minor component eigenvalues and the distinct group of principal component eigenvalues, and how these correspond to the distinct groups of extreme components in the population covariance eigenvalues. However, it would be interesting to see how the XCA algorithm performs in selecting a value for the number of principal components.

For the eigenvalues above I have calculated the minimum value of {\cal{K}} for a total of k=20 extreme components. The minimum value of {\cal{K}} occurs at k_{PC} = 10, indicating that the method of Welling et al estimates that there are 10 principal components in this dataset, and by definition there are then k - k_{PC} = 20 - 10 = 10 minor components in the dataset. In this case, the XCA algorithm has identified the dimensionalities of the PC and MC subspaces exactly.

Final comments

That is the post done. I can look Will in the eye when we meet for a beer at the next PyData Manchester Leaders meeting. The post ended up being longer (and more fun) than I expected, and you may have got the impression from my post that the Welling et al paper has completely solved the problem of selecting interesting low-dimensional subspaces in Gaussian distributed data. Not quite true. There are still challenges with XCA, as there are with PCA. For example, we have not said how we choose the total number of extreme components k. That is a whole other model selection problem and one that is particularly interesting for PCA when we have high-dimensional data. This is one of my research areas – see for example my JMLR paper Hoyle2008.

Another challenge that is particularly relevant for high-dimensional data is the question of whether we will see distinct groups of principal and minor component sample covariance eigenvalues at all, even when we have distinct groups of population covariance eigenvalues. I chose very carefully the settings used to generate the simulated data in the example above. I ensured that we had many more samples than features, i.e. N \gg d, and that the extreme component population covariance eigenvalues were distinctly different from the ‘noise’ population eigenvalues. This ensured that the sample covariance eigenvalues separated into three clearly visible groups.

However, in PCA when we have N < d and/or weak signal strengths for the extreme components of the population covariance, then the extreme component sample covariance eigenvalues may not be separated from the bulk of the other eigenvalues. As we increase the ratio \alpha = N/d we observe a series of phase transitions at which each of the extreme components becomes detectable – again this is another of my areas of research expertise [HoyleRattray2003, HoyleRattray2004, HoyleRattray2007].

Footnotes
  1. I have used the usual divide by N-1 Bessel correction in the definition of the sample covariance. This is because I have assumed any data matrix will have been explicitly mean-centered. In many of the analyses of PCA the starting assumption is that the data is drawn from a mean-zero distribution, so that the sample mean of any feature is zero only under expectation, not as a constraint. Consequently, most formal analysis of PCA will define the sample covariance matrix with a 1/N factor. Since I have to deal with real data, I will never presume the data has been drawn from a population distribution that has zero mean, and so to model the data with a zero-mean distribution I will explicitly mean-center the data. Therefore, I use the 1/(N-1) definition of the sample covariance. Strictly speaking, that means the various theories and analyses I discuss later in the post are not applicable to the data I’ll work with. It is possible to modify the various analyses to explicitly take into account the mean-centering step, but it is tedious to do so. In practice (for large N), the difference is largely inconsequential, and formulae derived from analysis of zero-mean distributed data can be accurate for mean-centered data, so we’ll stick with using the 1/(N-1) definition for \hat{\underline{\underline{C}}}.

  2. Hotelling, H. “Analysis of a complex of statistical variables into principal components”. Journal of Educational Psychology, 24:417-441 and also 24:498–520, 1933.
    https://dx.doi.org/10.1037/h0071325

  3. Oja, E. “Principal components, minor components, and linear neural networks”. Neural Networks, 5(6):927-935, 1992. https://doi.org/10.1016/S0893-6080(05)80089-9

  4. Luo, F.-L., Unbehauen, R. and Cichocki, A. “A Minor Component Analysis Algorithm”. Neural Networks, 10(2):291-297, 1997. https://doi.org/10.1016/S0893-6080(96)00063-9

  5. See for example, Williams, C.K.I. and Agakov, F.V. “Products of Gaussians and probabilistic minor components analysis”. Neural Computation, 14(5):1169-1182, 2002. https://doi.org/10.1162/089976602753633439

© 2025 David Hoyle. All Rights Reserved

The Manchester football team that conquered the world – but you probably haven’t heard of them

Q: Which was the first Manchester football team to win a European Championship?

A: Manchester United right? Won the European Cup in 1968. No, not quite right.

This is a story about a Manchester team that swept all before them, winning a European Championship in 1957, before going on world tours, beating all the competition they encountered and playing in front of crowds of 50,000 – 60,000 spectators along the way.

The story has humble beginnings – Fog Lane Park in south Manchester. The photo above is of the duck ponds in the park, and yes, they are relevant to the story.

The founding of Manchester Corinthians

In 1949 Percy Ashley, a football scout for Bolton Wanderers, set up a women’s football team, Manchester Corinthians LFC, primarily so that his daughter Doris could play football. This was in an era when the Football Association (FA) had prohibited women from playing on pitches of FA affiliated clubs – effectively a ban on women’s football that lasted from 1921 up until 1971.

The Corinthians home ‘ground’ was Fog Lane Park. They did their training sessions there on Sundays. On poor quality muddy pitches, typical of a municipal park. Despite this, within a few years Corinthians were beating every team they encountered. By 1951 they had won the Southern Cup, the Manchester Area Cup, the Sports Magazine Cup, the Roses Trophy, the Midland Trophy, the Cresswell Trophy, the Odeon Championship Trophy, the Belle Vue Trophy, and the Festival of Great Britain Trophy.

In 1957 Corinthians were invited to represent England in a European championship, which they duly won, beating Germany 4-0 in the final. Bert Trautmann acted as Corinthians’ interpreter during the championship – yes, that Bert Trautmann, the Manchester City goalkeeper who famously broke his neck partway through the 1956 FA Cup Final and carried on playing the rest of the match.

A tour of Portugal and a 3-month tour of South America followed. In Portugal Corinthians played against Benfica, with a 50,000 crowd watching. The South American tour saw crowds of 50,000 – 60,000. Former Corinthian player Margaret Whitworth recalls having to wait for the President of the country they were playing in (possibly Venezuela) to arrive by helicopter before their match could start.

The success of Corinthians was such that to find a team of similar standard against whom they could play, they had to create another Fog Lane Park based team, Nomads, formed in 1957. Nomads were just as brilliant as the Corinthians. The Corinthians and Nomads were interchangeable, rather than being a 1st team and a reserve team. As this 1961 snippet below from the Crewe Chronicle newspaper indicates, organizers of women’s football matches would arrange for Corinthians to play Nomads, possibly because they were the two best teams in the area.

Advertisement for a Corinthians vs Nomads match. Taken from the Crewe Chronicle, May 1961.

By 1965, Corinthians had played 394 matches, winning 353 of them and losing only 20. That is a phenomenal win rate! Likewise, Nomads had played 68, won 47 and lost 11.

The success continued. 1970 saw Corinthians being invited to France to compete in a tournament against top continental European clubs, such as Italian club Juventus, Czechoslovakian champions Slavia Kaplice, and French hosts Stade de Reims. Corinthians won the tournament, beating Juventus 1-0 in a tightly contested final.

At this point Corinthians are arguably the best women’s team in England and among the best in Europe. This was at a time when women’s football in France and Italy was not banned but actively supported. The European clubs Corinthians were competing against had significant advantages in terms of quality of facilities and pitches.

In stark contrast, Corinthian players recall having to carry buckets of water from the coach’s house on Fog Lane for washing after training sessions and matches. And if the buckets of water weren’t available, they had to wash in the duck ponds (in the picture), sometimes even having to break the ice on the ponds first.

One of the Corinthian players, Jan Lyons, later went on to play for the Italian club Juventus and has remarked on the differences in facilities between women’s football in England and women’s football in Italy at the time.

The Corinthians and Nomads continued up until 1982 and were two of the 44 founding teams when the Women’s Football Association was formed in 1969/1970, meaning that just under 5% of founding teams came from a small municipal park in south Manchester.

An almost forgotten story

Surprisingly, this is a story I hadn’t heard until a few months ago, even though I’ve lived for over 15 years on the doorstep of where it all took place – the duck ponds are 150m as the crow flies from where I’m sitting writing this in my kitchen.

It is also surprising, as my eldest daughter played football in Fog Lane Park for nearly 10 years, on those same pitches, for the local team, Didsbury Juniors, who are based out of the park. During this time my daughter was one of only two female players in what was unofficially a boys’ team. My daughter then went on to play for the Manchester City Girls team and Stockport County LFC, yet in all my time connected to girls’ football in Manchester I never came across the Corinthians story.

I found out about the story when The Friends of Fog Lane Park group was contacted by football historian Gary James about putting up a plaque to commemorate the achievements of the Corinthians club. I was astounded when I heard this piece of local history – like many people in the Fog Lane/Didsbury area that I’ve spoken to since and who similarly knew nothing of the Corinthians story.

I’ve only given a small excerpt here about the Corinthians story. What I’ve written cannot really do justice to this fascinating and incredible story, so please do read more about it. For my source material I’ve used the excellent works by Gary James who has written extensively about football in the Manchester area including numerous articles about the Corinthians, and the comprehensive recent book1 and 2019 academic paper2 by Jean Williams. I’ve also used various other websites, such as here, here and here.

A permanent commemoration

The Friends of Fog Lane Park group, along with Manchester City Council and MCRactive, are raising money for a permanent reminder and celebration of the achievements of the Corinthians and Nomads teams. This is being done with the help and support of Gary James. The aim is for this commemoration to hopefully coincide with the UEFA Women’s Euros 2022, which are taking place in England during July this year. If you wish, you can donate to the fundraising at this JustGiving page. Your support would be very much appreciated.

As part of this effort to raise funds I had the pleasure of meeting with three former Corinthians (see photo below) in May this year. Naturally, the meeting took place in Fog Lane Park.

Former Corinthian players in May 2022. L-R are Jan Lyons, Margaret Shepherd (Tiny), and Margaret Whitworth (Whitty).

During the meeting the players recounted their memories of Percy and Doris Ashley, having to wash in the duck ponds, and playing for Juventus after transferring from Corinthians. One of the players also recounted that when buying football boots, they had to pretend that the boots were for hockey or lacrosse, in order for the shopkeeper to be willing to sell the boots to them, such was the social stigma against women’s football that the men’s FA had created during the period.

My thanks in writing this post go to Pam Barnes and Alice Moody for supplying material related to the Corinthians story, introducing me to the story, and many interesting conversations about it.

Footnote 1: J. Williams, The History of Women’s Football, 2019. Pen & Sword Books Ltd, Barnsley, UK. https://www.pen-and-sword.co.uk/The-History-of-Womens-Football-Hardback/p/19214.

Footnote 2: J. Williams, ‘We’re the lassies from Lancashire’: Manchester Corinthians Ladies FC and the use of overseas tours to defy the FA ban on women’s football, Sport in History 39(4):395-417, 2019. https://doi.org/10.1080/17460263.2019.1678068

© 2022 David Hoyle. All Rights Reserved.