
The Royal Statistical Society Conference and Data Science

This year the UK’s Royal Statistical Society (RSS) held its annual international conference in Aberdeen between the 12th and 15th September 2022.

You may think that the society’s main conference doesn’t hold that much relevance for you as a Data Scientist. Yes, you have an interest in Data Science with a statistical flavour, but surely the main conference is all clinical trials analysis and the like, isn’t it? My job over the next 980 words is to persuade you otherwise.

Statistics is about the whole data life cycle

Go to the RSS website or look at an official email from the RSS and you’ll see that the RSS strapline is “Data | Evidence | Decisions”. This accurately reflects the breadth of topics covered at the conference – in the session talks, the posters, and the plenary lectures. Statistics is about data, and modern statistics now concerns itself with all aspects related to data – how it is collected, how it is analysed, how models are built from that data, how inferences are made from those models, and how decisions are made off the back of those inferences. A modern general statistics conference now has to reflect the full end-to-end lifecycle of data and also the computational and engineering workflows that go with it. This year’s RSS conference did just that.

A Strong Data Science focus

Over the three main days of the conference there were 7 sessions dedicated to Data Science, totalling 8 hours and 20 minutes of talks. The full list below shows the breadth of topics covered in the Data Science sessions.

  • Novel applications and Data Sets
  • Introduction to MLOps
  • The secret sauce of Open Source
  • Data Science for Health Equity
  • The UK’s future data research infrastructure
  • Epidemiological applications of Data Science
  • Algorithmic bias and ethical considerations in Data Science

On top of this there were Data Science topics in the 8 rapid fire talk sessions and in the 110 accepted posters. Example Data Science related topics included MLOps, Decentralized finance, Genetic algorithms, Kernels for optimal compression of distributions, Changepoint detection, Quantifying the Shannon entropy of a histogram, Digital Twins, Joint node degree estimation in Erdos-Renyi networks, Car club usage prediction, and Deep hierarchical classification of crop types from satellite images.

A growing Data Science presence

I’ve been involved with the conference board this year and last (Manchester 2021) and my perception is that the size of the conference is increasing, in terms of the number of submissions and attendees, the range of topics, and the amount of Data Science represented. However, I only have two data points here. One of those was just as the UK was coming out of its first Covid-19 lockdown, so it will probably not provide a representative baseline. So I’m not going to stick my neck out too much here, but I do expect a further increase in the Data Science presence at next year’s conference.

Other relevant sessions

If like me you work primarily as a Data Scientist in a commercial environment, then there were also many talks from other Sections of the RSS that were highly relevant. The Business, Industry and Finance section had talks on Explainable AI, Novel Applications of Statistics in Business, and Democratisation of Statistics in GlaxoSmithKline, whilst the Professional Development section had talks on Linked Open Data, programming in R and Python, and the new Quarto scientific publishing system.

The Future of the Data Science Profession

Of particular relevance to Data Scientists was the Professional Development section’s talk on the new Alliance for Data Science Professionals accreditations of which the RSS is part. The session walked through the various paths to accreditation and the collaborative nature of the application process. This was backed up by a Data Science ‘Beer and Pizza’ event hosted by Brian Tarran (former Significance magazine editor and now RSS Head of Data Science Platform) and Ricky McGowan (RSS Head of Standards and Corporate Relations) who both explained some of the RSS long-term plans for Data Science.

Diversity of topics across the whole conference

Diversity of topics was a noticeable theme emerging from the conference as a whole, not just in the Data Science and commercial statistics streams. For me, this reflects the broader desire of the RSS to embrace Data Scientists and any practitioners who are involved with analysing and handling data. It is a healthy antidote to the ‘two cultures of statistical modelling’ divide identified and discussed by Leo Breiman many years ago.

For example, the range of plenary talks was as impressive as the diversity of topics in the various sessions. Like many Data Scientists, my original background was a PhD in Theoretical Physics. So a talk from Ewain Gwynne on Random Surfaces and Liouville Quantum Gravity – see picture below – took me back 30 years and also gave me an enjoyable update on what has happened in the field in the intervening years.

Ewain Gwynne talking about Random Surfaces and Liouville Quantum Gravity.

Other plenary highlights for me were Ruth King’s Barnett lecture on statistical ecology and Adrian Raftery’s talk on the challenges of forecasting world populations out to the year 2100 and as far as 2300 – see below.

Adrian Raftery talking about Bayesian Demography.

A friendly conference

The conference is not a mega-conference. We’re not talking NeurIPS or ICML. There were around 600 attendees – big enough not to be too insular and focused on only one or two topics, but still small enough to be welcoming, friendly and very sociable. There were social events on every evening of the conference. And to top it all, it was even sunny in Aberdeen for the whole week.

I also got to play pool against the person who led the UK’s COVID-19 dashboard work, reporting the UK government’s official daily COVID-19 stats to the general public. I lost 2-1. I now hold a grudge.

Next year – Harrogate 2023

Next year’s conference is in Harrogate, 4th – 7th September 2023. I will be going. Between now and then I will be practising my pool for a revenge match. I will also be involved with the conference board again, helping to shape the Data Science content. I can promise a wide range of Data Science contributions and talks on other statistical topics Data Scientists will find interesting. I can’t promise sunshine, but that’s Yorkshire for you.

© 2022 David Hoyle. All Rights Reserved

How many iterations are needed for the bisection algorithm?

<TL;DR>

  • The bisection algorithm is a very simple algorithm for finding the root of a 1-D function.
  • The number of iterations needed to locate the root to within a specified tolerance can be worked out with a very simple little hack, which I explain here.
  • Things get more interesting when we consider variants of the bisection algorithm, where we cut an interval into unequal portions.

</TL;DR>

A little while ago a colleague mentioned that they were repeatedly using an off-the-shelf bisection algorithm to find the root of a function. The algorithm required the user to specify the number of iterations to run the bisection for. Since my colleague was running the algorithm repeatedly they wanted to set the number of iterations efficiently and also to achieve a guaranteed level of accuracy, but they didn’t know how to do this.

I mentioned that it was very simple to do this and it was a couple of lines of arithmetic in a little hack that I’d used many times. Then I realised that the hack was obvious and known to me because I was old – I’d been doing this sort of thing for years. My colleague hadn’t. So I thought the hack would be a good subject for a short blog post.

The idea behind a bisection algorithm is simple and illustrated in Figure 1 below.

Figure 1: Schematic of how the bisection algorithm works

At each iteration we determine whether the root is to the right of the current mid-point, in the right-hand interval, or to the left of the current mid-point, in the left-hand interval. In either case, the range within which we locate the root halves. We have gone from knowing it was in the interval [x_{lower}, x_{upper}], which has width x_{upper}-x_{lower}, to knowing it is in an interval of width \frac{1}{2}(x_{upper}-x_{lower}). So with every iteration we reduce our uncertainty of where the root is located by half. After N iterations we have reduced our initial uncertainty by (1/2)^{N}. Given our initial uncertainty is determined by the initial bracketing of the root, i.e.  an interval of width (x_{upper}^{(initial)}-x_{lower}^{(initial)}), we can now work out that after N iterations we have narrowed down the root to an interval of width {\rm initial\;width} \times \left ( \frac{1}{2}\right ) ^{N}. Now if we want to locate the root to within a tolerance {\rm tol}, we just have to keep iterating until the uncertainty reaches {\rm tol}. That is, we run for N iterations where N satisfies,

\displaystyle N\;=\; -\frac{\ln({\rm initial\;width/tol})}{\ln\left (\frac{1}{2} \right )}

Strictly speaking we need to run for \lceil N \rceil iterations. Usually I will add on a few extra iterations, e.g. 3 to 5, as an engineering safety factor.

As a means of easily and quickly determining the number of iterations to run a bisection algorithm for, the calculation above is simple, easy to understand, and a great little hack to remember.
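To make the hack concrete, here is a minimal Python sketch (the function names and the default safety margin of 3 iterations are illustrative choices of my own, not taken from any particular library):

```python
import math

def n_iterations(initial_width, tol, safety=3):
    """Iterations needed to locate the root to within tol, plus a few extra
    iterations as an engineering safety factor."""
    n = -math.log(initial_width / tol) / math.log(0.5)
    return math.ceil(n) + safety

def bisect(f, x_lower, x_upper, tol=1e-6):
    """Plain bisection, assuming f(x_lower) and f(x_upper) bracket a root."""
    for _ in range(n_iterations(x_upper - x_lower, tol)):
        x_mid = 0.5 * (x_lower + x_upper)
        # keep the half-interval whose end points still bracket the root
        if f(x_lower) * f(x_mid) <= 0:
            x_upper = x_mid
        else:
            x_lower = x_mid
    return 0.5 * (x_lower + x_upper)

print(n_iterations(1.0, 1e-6))     # ceil(19.93) + 3 = 23
print(bisect(math.cos, 1.0, 2.0))  # root of cos(x) in [1, 2] is pi/2 = 1.5707963...
```

The same iteration count can be passed straight to an off-the-shelf routine that asks you to specify a maximum number of iterations.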

Is bisection optimal?

The bisection algorithm works by dividing our current estimate of the interval in which the root lies into two. Dividing the interval in two is efficient. It is like we are playing the childhood game “Guess Who”, where we ask questions about the characters’ features in order to eliminate them.

Asking about a feature that approximately half the remaining characters possess is the most efficient – it has a reasonable probability of applying to the target character and eliminates half of the remaining characters. If we have a single question, with a binary outcome and a probability p of one of those outcomes, then the question with p = \frac{1}{2} maximizes the expected information (the entropy), -p\ln (p)\;-\; (1-p)\ln(1-p).
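As a quick numerical illustration (a throwaway snippet of my own, not part of the original argument), the entropy of a binary question does indeed peak at p = 1/2:

```python
import math

def binary_entropy(p):
    """Expected information from a binary question answered 'yes' with probability p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_entropy(p), 3))  # peaks at p = 0.5, where it equals ln 2 = 0.693...
```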

Dividing the interval unequally

When we first played “Guess Who” as kids we learnt that asking questions with a much lower probability p of being correct didn’t win the game. Is the same true for our root finding algorithm? If instead we divide each interval into unequal portions, is the root finding less efficient than when we bisect the interval?

Let’s repeat the derivation but with a different cut-point, e.g. 25% of the way along the current interval bracketing the root. In general we can test whether the root is to the left or right of a point that is a proportion \phi along the current interval, meaning the cut-point is x_{lower} + \phi (x_{upper}-x_{lower}). At each iteration we don’t know in advance on which side of the cut-point the root lies until we test for it, so in trying to determine in advance the number of iterations we need to run, we have to assume the worst-case scenario, in which the root is always in the larger of the two intervals. The reduction in uncertainty per iteration is then {\rm max}\{\phi, 1-\phi\}. Repeating the derivation we find that we have to run at least,

\displaystyle N_{Worst\;Case}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm max}\{\phi, 1 - \phi\} \right )}

iterations to be guaranteed that we have located the root to within tol.
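As a small illustrative sketch of that worst-case count (the function name and example inputs are my own):

```python
import math

def n_worst_case(initial_width, tol, phi):
    """Worst-case iteration count when the interval is cut a fraction phi along
    its length: assume the root always lands in the larger of the two pieces."""
    shrink = max(phi, 1.0 - phi)  # worst-case reduction factor per iteration
    return math.ceil(-math.log(initial_width / tol) / math.log(shrink))

print(n_worst_case(1.0, 1e-6, 0.5))   # 20 -- ordinary bisection
print(n_worst_case(1.0, 1e-6, 0.25))  # 49 -- the 75% piece could be retained every time
```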

Now to determine the cut-point \phi that minimizes the upper bound on the number of iterations required, we simply differentiate the expression above with respect to \phi. Doing so we find,

\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; -\frac{\ln({\rm initial\;width/tol})}{ (1-\phi) \left ( \ln (1 - \phi) \right )^{2}} \;\;,\;\; \phi < \frac{1}{2}

and

\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; \frac{\ln({\rm initial\;width/tol})}{\phi \left ( \ln (\phi) \right)^{2}} \;\;,\;\; \phi > \frac{1}{2}

The minimum of N_{Worst\;Case} is at \phi =\frac{1}{2}, although \phi=\frac{1}{2} is not a stationary point of the upper bound N_{Worst\;Case}, as N_{Worst\;Case} has a discontinuous gradient there.
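If you would rather not do the differentiation by hand, the \phi > \frac{1}{2} branch can be checked symbolically, for example with sympy (a sketch of my own, with w standing in for {\rm initial\;width/tol}):

```python
import sympy as sp

phi, w = sp.symbols('phi w', positive=True)

# Worst-case count for phi > 1/2, with w = initial width / tol
N_upper_branch = -sp.log(w) / sp.log(phi)

print(sp.simplify(sp.diff(N_upper_branch, phi)))  # log(w)/(phi*log(phi)**2)
```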

That is the behaviour of the worst-case scenario. A similar analysis can be applied to the best-case scenario – we simply replace max with min in all the above formulae. That is, in the best-case scenario the number of iterations required is given by,

\displaystyle N_{Best\;Case}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm min}\{\phi, 1 - \phi\} \right )}

Here, the maximum of the best-case number of iterations occurs when \phi = \frac{1}{2}.

That’s the worst-case and best-case scenarios, but how many iterations do we expect to use on average? Let’s look at the expected reduction in uncertainty in the root location after N iterations. In a single iteration, a root that is randomly located within our interval will lie, with probability \phi, in the segment to the left of our cut-point, which leads to a reduction in the uncertainty by a factor of \phi. Similarly, we get a reduction in uncertainty of 1-\phi with probability 1-\phi if our randomly located root is to the right of the cut-point. So after N iterations the expected reduction in uncertainty is,

\displaystyle {\rm Expected\;reduction}\;=\;\left ( \phi^{2}\;+\;(1-\phi)^{2}\right )^{N}

Using this as an approximation to determine the typical number of iterations, we get,

\displaystyle N_{Expected\;Reduction}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ( \phi^{2} + (1-\phi)^{2} \right )}
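The best-case and expected-reduction counts can be sketched in the same way as the worst-case function earlier (again, naming of my own):

```python
import math

def n_best_case(initial_width, tol, phi):
    """Best-case count: the root lands in the smaller piece at every iteration."""
    shrink = min(phi, 1.0 - phi)
    return math.ceil(-math.log(initial_width / tol) / math.log(shrink))

def n_expected_reduction(initial_width, tol, phi):
    """Count based on the expected per-iteration reduction phi**2 + (1 - phi)**2."""
    shrink = phi**2 + (1.0 - phi)**2
    return math.ceil(-math.log(initial_width / tol) / math.log(shrink))

print(n_best_case(1.0, 1e-6, 0.25))           # 10
print(n_expected_reduction(1.0, 1e-6, 0.25))  # 30
```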

This still isn’t the expected number of iterations, but to see how it compares, Figure 2 below shows simulation estimates of \mathbb{E}\left ( N \right ) plotted against \phi when the root is random and uniformly distributed within the original interval.

Figure 2: Number of iterations required for the different root finding methods.

For Figure 2 we have set {\rm initial\;width/tol} = 100, i.e. a tolerance equal to 1% of the initial bracketing width. Also plotted in Figure 2 are our three theoretical estimates, \lceil N_{Worst\;Case}\rceil, \lceil N_{Best\;Case}\rceil, and \lceil N_{Expected\;Reduction}\rceil. The stepped structure in these three integer quantities is clearly apparent, as is how many more iterations are required under the worst-case method when \phi \neq \frac{1}{2}.
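For anyone wanting to reproduce a plot like Figure 2, a rough Monte Carlo sketch of \mathbb{E}(N) (my own construction, assuming a root uniformly distributed on a unit initial interval and a tolerance of 1% of that width) might look like this:

```python
import random

def iterations_to_locate(root, phi, tol, x_lower=0.0, x_upper=1.0):
    """Shrink the bracketing interval, cutting a fraction phi along it each time,
    until its width falls below tol; return the number of iterations used."""
    n = 0
    while (x_upper - x_lower) > tol:
        cut = x_lower + phi * (x_upper - x_lower)
        if root <= cut:
            x_upper = cut
        else:
            x_lower = cut
        n += 1
    return n

def expected_iterations(phi, tol=0.01, n_sims=10_000):
    """Monte Carlo estimate of E(N) for a root uniform on [0, 1]."""
    return sum(iterations_to_locate(random.random(), phi, tol)
               for _ in range(n_sims)) / n_sims

for phi in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(phi, expected_iterations(phi))  # compare against the three theoretical estimates
```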

The expected number of iterations required, \mathbb{E}( N ), actually shows a rich structure that isn’t clear unless you zoom in. Some aspects of that structure were unexpected, and require some more involved mathematics to understand. I may save that for a follow-up post at a later date.

© 2022 David Hoyle. All Rights Reserved