Using your own algorithms with hyperparameter optimization in AWS SageMaker

TL;DR

  • AWS SageMaker provides a number of standard Machine Learning algorithms in containerized form, so you can pull those algorithms down onto a large EC2 instance and just run, with minimal effort.
  • AWS SageMaker also provides a hyperparameter optimization functionality that pretty much runs ‘out-of-the-box’ with the algorithms provided.
  • You can run your own algorithms within SageMaker if you containerize your algorithm code.
  • I wanted to find out if it was possible to easily combine the ‘run-your-own-containerized-algorithm’ functionality with the ‘out-of-the-box’ hyperparameter optimization functionality in SageMaker. It is. It was a straightforward, but slightly lengthy, process.

Introduction

<DISCLAIMER> This is a blog post I started back in Autumn/Winter 2019. I knew it would be a fairly lengthy post, but one I was keen to write. But then, well, a pandemic got in the way and it's taken a while to get back to writing blog posts. I still believe there are some useful learnings here – I hope you do too </DISCLAIMER>.


Back in 2019 I was using SageMaker a lot, including running an AWS Machine Learning Immersion Day at Infinity Works. One of the things I like about SageMaker is how the resources used to do any heavy lifting in training a model are separated from the resources supporting the Jupyter notebook. The SageMaker service provides several standard Machine Learning algorithms (e.g. Random Forests, XGBoost) in containers. This means it is possible to explore a dataset and develop a modelling approach in a Jupyter notebook that runs on one EC2 instance, and then when we want to scale up the training process to the full dataset we can pull down the relevant container from ECR and run the training process on a separate, much larger instance. Provisioning of the heavier infrastructure needed for training on the full dataset is only done when it is needed, and you only pay for what you use of those larger EC2 instances. A Data Scientist like me doesn't have to worry about the provisioning of the larger EC2 instance; it is handled through a few simple configuration options when setting up the training job. It is also possible to configure a hyperparameter optimization job in a similar way, so that multiple training jobs (with different hyperparameter values) can be easily run, potentially in parallel, on large EC2 instances just by adjusting a few lines of json config.

So far, so good. As a Data Scientist the pain of getting access to or configuring compute resource has been removed, and training on really large datasets is almost as easy as exploring a smaller dataset in a Jupyter notebook running on my local machine. But are we restricted to only using the algorithms that AWS has containerized? This is where it gets more interesting and fun. You can use any algorithm that is available as a container in ECR. That means you can develop/code up your own algorithm/training process, containerize it, and then run that algorithm using multiple large EC2 instances with minimal config.

AWS have an example of how to containerize your own algorithm and deploy it to an endpoint. The git repo is here. The AWS team use the example of a scikit-learn decision tree trained on the Iris dataset (I know – why do examples never use something more original than the Iris dataset?).

What I wanted to explore was,

  • How easy was it to actually containerize my own algorithm for use in SageMaker,
  • How easy was it to combine my containerized algorithm with the easy to configure hyperparameter optimization capability already present in SageMaker.

The rest of this post is about what I learnt in exploring those two questions, in particular the second of them. The first question is essentially already answered by the original AWS repo. What I wanted to learn was whether I could easily use my own algorithm with the out-of-the-box hyperparameter optimization functionality that SageMaker provides, or whether that easy-to-use functionality is essentially restricted to the built-in SageMaker algorithms. What I'll cover is,

  1. The choice of algorithm we’re going to containerize
  2. The basics of building the Docker container
  3. Pushing the container to the AWS container registry
  4. Using the containerized algorithm within a SageMaker notebook
  5. Running hyperparameter optimization jobs using the containerized algorithm.

If you want to follow the technical details, I would suggest that you first become familiar with the basics of AWS SageMaker – tutorial here. You may also want to look at the basics of hyperparameter tuning for one of the standard machine learning algorithms within SageMaker, as I’ll be assuming some of this background knowledge is known to you or at least you can pick it up quickly – to fully explain all the SageMaker background material would make this an even longer blog. You can find explanations of how to configure and run a SageMaker hyperparameter tuning job here and here.

Now, let’s start with the first of our questions.

Algorithm choice

I wanted to use an algorithm that wasn’t already available within SageMaker, otherwise what would be the point of going through this exercise? I have been doing some work recently on Gaussian Processes (GPs), in particular with kernel functions that are composite functions.

I won't explicitly cover the basics of GPs here – the blog post is long enough already. Instead I will point you towards the excellent book by Carl Rasmussen and Chris Williams and this tutorial from Neil Lawrence. However, I will say briefly what my interest in GPs is. Gaussian Processes have an interesting connection with large (wide) Neural Networks. This connection was discovered by Chris Williams and Radford Neal. I wrote some GP code, on the basis of Williams' paper, that made it into commercial software (a first for me) back in 1999 (yes – I am that old, and have been working in Machine Learning that long). More recently, the connection has been extended to link Deep Learning Neural Networks and Gaussian Processes (see for example, here and here). Cho & Saul did some nice early work in this area, using dot-product kernels that are composite functions. It is the dot-product kernels derived by Cho & Saul that I'll use here for my example algorithm, as the kernels are of relatively simple form, and yet are specified in terms of a few simple parameters that we can regard as hyperparameters. For the purposes of this blog on AWS SageMaker it is not important to know what the Cho & Saul kernels might represent, merely how they are defined mathematically. So let's start there.

For this illustration we are focusing on datapoints on the surface of the unit hypersphere, i.e. {\bf x} \in \mathbb{R}^{d} with ||{\bf x}||_{2}^{2}\;=\;1 . We then consider a set of kernels, K_{q,l}\left (  {\bf x}_{1}, {\bf x}_{2} \right) , defined via,

K_{q,l}\left ({\bf x}_{1}, {\bf x}_{2} \right )\;=\;  k_{q,l}\left ( {\bf x}_{1}\cdot {\bf x}_{2} \right )

The dot-product kernels k_{q,l}(t) are defined iteratively,

k_{q,l+1}(t)\;=\;k_{q,0}\left ( k_{q, l}(t)\right )

The base kernels k_{q,0}(t) are constructed from,

k_{q,0}(t)\;=\; J_{q}\left ( \arccos (t)\right ) / J_{q}\left ( 0\right )

with,

J_{q}\left ( \theta \right )\;=\;(-1)^{q}\left (\sin\theta\right)^{2q+1}\left ( \frac{1}{\sin\theta}\frac{\partial}{\partial \theta}\right )^{q} \left ( \frac{\pi-\theta}{\sin\theta}\right )

Choosing a particular kernel then boils down to making a choice for q and l. Once we have made a choice of kernel, we can train our model. For simplicity, I have defined the model training here to be simply the process of constructing the Gram matrix from the training data, i.e. the process of calculating the matrix elements,

M_{ij} = \sigma^{2}\delta_{ij}\;+\;K_{q,l}\left ( {\bf x}_{i}, {\bf x}_{j}\right)

Here, σ² is the variance of the additive Gaussian noise that we consider present in the response variable, and {\bf x}_{i}\;,\; i=1,2,\ldots,N , are the feature vectors for the N datapoints in the training set. Along with the training feature vectors we also have the response variable values, y_{i} .

Whilst it may not match the more traditional concept of model training – there is no iterative process to minimize some cost function – I am using the training data to construct a mathematical object required for calculating the expectation of the response variable conditional on the input features. Within a Gaussian Process it is usual to optimize any parameters of the covariance kernel as part of the model training. In this case, for simplicity, and for the purposes of illustrating the hyperparameter tuning capabilities of SageMaker, I wanted to treat the kernel parameters q, l and σ² as hyperparameters, essentially leaving no remaining kernel parameters to be optimized during the model training.

Once we have the matrix {\bf M} defined, we can calculate a prediction for the response variable at a new feature vector {\bf x}_{\star} via the formula,

\mathbb{E}\left ( y\left ( {\bf x}_{\star}\right )\right )\;=\;{\bf v}\left ( {\bf x}_{\star}\right )^{\top}{\bf M}^{-1}{\bf y}\,\,

where {\bf y} is the vector of response values in the training set, and the vector {\bf v}\left ( {\bf  x}_{\star} \right )\;=\; (v_{1}, v_{2}, \ldots, v_{N}), with the element v_{i}\left ( {\bf x}_{\star}\right ) given by,

v_{i}\;=\; k_{q,l}\left ( {\bf x}_{\star}\cdot {\bf x}_{i}\right )

Now that we have given the mathematical definition of our algorithm, we need to focus on code. Following the example in the original AWS repo we need Python code that,

  1. Defines a class for a trained GP model. I have called my class, unsurprisingly, trainedGPModel. Instantiating an instance of this class, by passing the training data to the class constructor method, runs the Gram matrix calculation process mentioned earlier. Within my trainedGPModel class I also have a method predict(xstar) that returns the predicted expectation of the response variable given an input datapoint xstar. The code for the trainedGPModel class implements the linear algebra formulae given above and so is straightforward – a minimal sketch of what such a class can look like is given just after this list.
  2. We also need code that runs the training process. This code is held in a file called train. I made minimal modifications to the train module in the original AWS repo. The main change I made was including code to make predictions on a validation dataset, and from that calculating the Root-Mean-Squared-Error (RMSE) on the validation dataset. The validation RMSE is the metric I will use for hyperparameter tuning and so I have to write the validation RMSE value to stdout so that it can get picked up by the SageMaker hyperparameter tuning process. I had to write the RMSE value with a string prefix and delimiter, e.g.
print( "validation:RMSE=" + str(RMSE_validation) + ";" )

with a corresponding matching regex in the configuration of the hyperparameter tuning job – see the later section on running the containerized algorithm in a SageMaker notebook. It wasn't obvious that I needed to write the validation metric in this way, and it took a bit of googling to work out. Most SageMaker links on hyperparameter tuning point to this page, but the detail on how the metric is passed between your algorithm code and the SageMaker hyperparameter optimization code is actually explained in this SageMaker documentation page.
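
To make the trainedGPModel structure concrete, here is a minimal sketch of what such a class could look like for the q=0 kernel. This is an illustration written for this post rather than the exact code in my repo, and the helper name k_q0 is just a label I have chosen:

import numpy as np

def k_q0(t, l):
    # Base kernel for q=0: J_0(theta) = pi - theta, so k_{0,0}(t) = (pi - arccos(t)) / pi.
    # The iterated kernel k_{0,l} applies the base kernel a further l times,
    # following k_{q,l+1}(t) = k_{q,0}( k_{q,l}(t) ).
    k = (np.pi - np.arccos(np.clip(t, -1.0, 1.0))) / np.pi
    for _ in range(int(l)):
        k = (np.pi - np.arccos(np.clip(k, -1.0, 1.0))) / np.pi
    return k

class trainedGPModel:
    def __init__(self, X_train, y_train, l, noise):
        # "Training" is just building the Gram matrix M = sigma^2 I + K and
        # solving against the training responses once, so that predictions
        # only need a dot product.
        self.X = X_train                 # rows assumed scaled to unit length
        self.l = l
        K = k_q0(self.X @ self.X.T, l)
        M = noise * np.eye(self.X.shape[0]) + K
        self.Minv_y = np.linalg.solve(M, y_train.reshape(-1, 1))

    def predict(self, xstar):
        # E[y(x_star)] = v(x_star)^T M^{-1} y, with v_i = k_{q,l}(x_star . x_i)
        v = k_q0(self.X @ xstar.reshape(-1, 1), self.l)
        return float(v.T @ self.Minv_y)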

Docker basics

Now let's talk about putting our code in a container. We need to construct a Dockerfile. For a refresher on Docker I found this tutorial by Márk Takács to be really helpful. I actually use a Windows machine for my work, so I'm running Docker Desktop. However, I also use WSL (Windows Subsystem for Linux) for when I want a Linux-like environment. Although you can install a Docker client under WSL, you still have to make use of the native Docker daemon of Docker Desktop. I found this guide from Nick Janetakis on getting the WSL Docker client working with Docker Desktop invaluable, particularly the configuring of where WSL mounts the Windows file system (by editing the /etc/wsl.conf file) so that I can then easily mount any sub-directory of my Windows file system to any point I choose in the container image when testing the Dockerfile locally.

I won't go through the aspects of testing the container locally – you can read the original AWS repo to see that. Instead we'll just go through the Dockerfile for building the final SageMaker container. The Dockerfile is fairly simple and, other than changing it to use a Python 3 runtime (see lines 9 & 10), we have not changed anything else from the Dockerfile in the original AWS repo. Line 36 of the Dockerfile is where we copy our algorithm code into the pre-specified directory in the image that SageMaker will look in when running the containerized algorithm.


# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:18.04

RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
python3 \
python3-pip \
nginx \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN pip3 install numpy scipy scikit-learn pandas flask gevent gunicorn && \
(cd /usr/local/lib/python3.6/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
rm -rf /root/.cache

RUN pip3 install setuptools

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set up the program in the image
COPY gaussian_processes /opt/program
WORKDIR /opt/program

Pushing the container to AWS

We can now push our Docker container image to AWS ECR (Elastic Container Registry). This is simple using the AWS CLI (command line interface) and the build_and_push.sh shell script provided in the original AWS repo. Within the shell script we have just modified lines 16 and 17 to reference the name of the top-level directory in which our training and prediction code resides,

image=$1

if [ "$image" == "" ]
then
    echo "Usage: $0 "
    exit 1
fi

chmod +x gaussian_processes/train
chmod +x gaussian_processes/serve

Then we just run the shell script, passing the name of the container we have just built as a command line argument,

./build_and_push.sh gpsagemaker

After running the shell script we can see the container present in the AWS ECR,

Screenshot of our Gaussian Process SageMaker Docker container in AWS Elastic Container Registry (ECR) – ready to use within a SageMaker notebook.

Using the containerized algorithm in SageMaker

Now that we have the container with our GP code in AWS ECR, we can use it within a SageMaker notebook. Let's do so. For this I'm just going to adapt the notebook within the original AWS repo. I go to SageMaker under 'ML' in the list of AWS services and from there I can create/start my SageMaker notebook instance. Once the notebook instance is ready I can open up a Jupyter notebook as usual,

The first main difference is that we’ll create some simple small-scale simulated training and validation data. Our goal here is to test how easy it is to containerize and use our own algorithm, not build a perfect model. Our generative model is a simple one – a linear model, dependent on just two features (with coefficients that we have chosen as 1.5 and 5.2 respectively). We use this simple model to create the response variable values and then add some Gaussian random noise (of unit variance).


# create training and validation sets
import numpy as np
import pandas as pd

nTrain = 100
X_train = np.random.randn( nTrain, 2 )
y_train = (1.5 * X_train[:, 0]) + (5.2*X_train[:,1]) + np.random.randn( nTrain )
y_train.shape = (nTrain, 1)
data_train = np.concatenate( (y_train, X_train), axis=1)
df_data_train = pd.DataFrame( data_train )

nValidation = 50
X_validation = np.random.randn( nValidation,2 )
y_validation = (1.5 * X_validation[:, 0]) + (5.2*X_validation[:,1]) + np.random.randn( nValidation )
y_validation.shape = ( nValidation, 1 )
data_validation = np.concatenate( (y_validation, X_validation), axis=1)
df_data_validation = pd.DataFrame( data_validation )

We then specify our account details and also the image that contains our Gaussian Process algorithm.


import boto3

account = boto3.client('sts').get_caller_identity()['Account']
region = boto3.session.Session().region_name
image = '{}.dkr.ecr.{}.amazonaws.com/gpsagemaker:latest'.format(account, region)

The next cell in our notebook then uploads the training and validation data to our s3 bucket,


# write training and validation sets to s3
from io import StringIO # python3; python2: BytesIO 
import boto3

bucket = mybucket # mybucket holds the name of your S3 bucket

# write training set
csv_buffer = StringIO()
df_data_train.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket).Object('train/train_data.csv').put(Body=csv_buffer.getvalue())
csv_buffer.close()

# write validation set
csv_buffer = StringIO()
df_data_validation.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket).Object('validation/validation_data.csv').put(Body=csv_buffer.getvalue())
csv_buffer.close()

Running a single training job

So first of all let’s just configure and run a single simple training job. Note the validation metric being specified along with the regex.


create_training_params = \
{
    "RoleArn": role,
    "TrainingJobName": job_name,
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File",
        "MetricDefinitions":[{"Name":"validation:RMSE",
                              "Regex":"validation:RMSE=(.*?);"    
        }]
    },

We also set values for the hyperparameters, which are static since we are just running a single training job and not doing any hyperparameter optimization yet.


    "HyperParameters": {
        "q":"0",
        "l":"2",
        "noise":"0.1"
    },
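
The snippets above are only part of the request body. A minimal sketch of the remaining fields, together with the call that actually launches the job, might look like the following – the instance type, volume size and S3 prefixes here are placeholder assumptions on my part rather than what was in the original notebook, and sm is the boto3 SageMaker client (as used later for the tuning job):

import boto3

sm = boto3.client('sagemaker')

create_training_params.update({
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/train/".format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            }
        },
        {
            "ChannelName": "validation",
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/validation/".format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            }
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://{}/output/".format(bucket)},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m5.xlarge", "VolumeSizeInGB": 10},
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600}
})

# launch the training job and wait for it to finish
sm.create_training_job(**create_training_params)
sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)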

We can then run a training job using our containerized Gaussian Process code, just as we would any other algorithm available in SageMaker. We can see the training job running in the AWS Management console – click under “Training jobs” on the left hand side of the console. We can see the current training job ‘in progress’ and also an earlier completed training job that I ran.

Screenshot of a single SageMaker training job running using our GP algorithm code.

Running a hyperparameter tuning job

That appears to run OK. Now that we have our algorithm running in SageMaker, we can configure the SageMaker hyperparameter optimization wrapper and run one of the out-of-the-box SageMaker hyperparameter optimization algorithms over what we have specified as hyperparameters in our Gaussian Process code. The config for the hyperparameter tuning job is below – we have largely just modified slightly the examples in the original AWS repo and also followed the guidance. You can see that we have specified the RMSE metric on the validation set as the metric to optimize with respect to the hyperparameters. For illustration purposes we have specified that we want to optimize only over the q and l hyperparameters. The σ² hyperparameter we have kept static at σ²=0.1. You can also see that we have specified to run 10 training jobs in total, i.e. we will evaluate the validation metric at 10 different combinations of the two hyperparameters, but we only run 3 training jobs in parallel at any one time.


# Define HyperParameterTuningJob
# We tune the kernel hyperparameters q and l by minimizing the RMSE on the
# validation set. The hyperparameter search is a random one, using a sample of
# 10 training jobs - better methods for searching the hyperparameter space are 
# available, but for simplicity and demonstration purposes we will use the 
# random search method. Run a max of 3 training jobs in parallel
job_name = "gpsmbyo-hp-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
response = sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name,
    HyperParameterTuningJobConfig={
        'Strategy': 'Random',
        'HyperParameterTuningJobObjective': {
            'Type': 'Minimize',
            'MetricName': 'validation:RMSE'
        },
        'ResourceLimits': {
            'MaxNumberOfTrainingJobs': 10,
            'MaxParallelTrainingJobs': 3
        },
        'ParameterRanges': {
            'IntegerParameterRanges': [
            {
              "Name": "q",
              "MaxValue": "4",
              "MinValue": "0",
              "ScalingType": "Auto"
            },
            {
              "Name": "l",
              "MaxValue": "4",
              "MinValue": "1",
              "ScalingType": "Auto"
            }    
        ]}
    },
    TrainingJobDefinition={
        'StaticHyperParameters': {
            "noise":"0.1"
        },
        'AlgorithmSpecification': {
        'TrainingImage': image,
        'TrainingInputMode': "File",
        'MetricDefinitions':[{"Name":"validation:RMSE",
                              "Regex":"validation:RMSE=(.*?);"
                             }]
        },
        # remaining TrainingJobDefinition fields (RoleArn, InputDataConfig,
        # OutputDataConfig, ResourceConfig, StoppingCondition) follow the same
        # pattern as for the single training job shown earlier
    }
)

If we then look at our AWS console (screenshot below) we can see the hyperparameter tuning job running, along with previous completed tuning jobs.

Screen shot of AWS console showing current and previous hyperparameter tuning jobs.

We can also see the individual training jobs, corresponding to that tuning job, running (screenshot below). Remember that the hyperparameter tuning job is just a series of individual evaluations of the validation metric, run at combinations of (q,l) specified by the tuning algorithm. From the screenshot we can see that there are 3 training jobs running, in accordance with what we specified in the tuning job config.

Screenshot of the 3 training jobs running as part of the hyperparameter tuning job.

Once the tuning job has completed, we can retrieve the validation metric values for the 10 different hyperparameter combinations that were tried, to see which combination of q and l gave the smallest RMSE on the validation set.
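
One way to do that retrieval, using the same sm client that launched the tuning job, is sketched below (my own illustration rather than a snippet from the original notebook):

# pull back the per-job objective values once the tuning job has finished
summaries = sm.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name,
    MaxResults=10
)['TrainingJobSummaries']

for s in summaries:
    print(s['TunedHyperParameters'],
          s['FinalHyperParameterTuningJobObjectiveMetric']['Value'])

# or ask directly for the best hyperparameter combination found
best = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name
)['BestTrainingJob']
print(best['TunedHyperParameters'],
      best['FinalHyperParameterTuningJobObjectiveMetric']['Value'])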

Summary

The two questions I was trying to address were,

  1. How difficult is it to create your own algorithm to use in SageMaker?
  2. How easy is it to use the hyperparameter optimization algorithms available in SageMaker with your new algorithm?

The answer to both questions is, "it is a relatively easy but lengthy process". That it is a lengthy process is understandable – SageMaker gives you the functionality to apply out-of-the-box hyperparameter tuning to an algorithm/code that it knows nothing about until runtime. Therefore there has to be a lot of standardized syntax in specifying how that algorithm is structured and called as a piece of code. Fortunately, all the details of how to structure your algorithm and create the Docker container are in the excellent example given in the AWS repo and the documentation. The only complaint I would have is that it would be good for the repo to have an example showing how your own algorithm can utilize the hyperparameter optimization functionality of SageMaker – hence this blog. Working out the few remaining steps to get the hyperparameter optimization working with my Gaussian Process code was not very difficult, but not easy either.

The example algorithm I have chosen is very simplistic – the training process literally only involves the calculation and inversion of a matrix. A full training process could involve optimization of, say, the log-likelihood with respect to the parameters of the kernel function, but explaining the extra details would make this blog even longer. Secondly, we only needed a simple/minimal training process to address the two questions above. Likewise, we have not illustrated our new trained algorithm being used to serve predictions – this is very well illustrated in the original repo and I would not be adding anything new with my Gaussian Process example.


The different types of data I have encountered as a Data Scientist

TL;DR

Where Data or Data Science is the business model or primary purpose of an organization you can expect data and the data eco-system to be properly invested in. In scientific research and industrial settings this will be common. In the commercial world this is not always the case. As a Data Scientist in the commercial world you should learn to ask, ‘Does this company need data? Does this company actually need Data Science?’ If the answers are not a resounding yes, then beware. Sometimes you won’t be able to answer those questions until you’re up close and inside the organization, but there are indicators that suggest upfront whether an organization is likely to have good quality data and a functioning data eco-system, or whether it will be a pile of trash. In the long read below I outline the different kinds of data I’ve encountered across the commercial and non-commercial sectors I’ve worked in, and what signs I’ve learnt to look for.

Long version

Over 20 years ago, as I was just starting research in the Bioinformatics field, a colleague explained to me the challenges of working with high-throughput biological experimental data. The data, he commented, was typically high-dimensional and very messy. "Messy" was new to me. With a PhD in Theoretical Physics and post-doctoral research in materials modelling, the naïve physicist in me thought, '…ah…messy…you mean Gaussian errors with large variance'. I was very wrong.

I learnt that “messy” could mean a high-frequency of missing data, missing meta-data, and mis-labelled data. Sometimes the data could be mis-labelled at source because the original biological material had been mis-labelled – I remember one interesting experience having to persuade my experimental collaborator to go back to the -80°C freezer to check the lung tissue sample because a mis-labelled sample was the only remaining explanation for the anomalous pattern we were seeing in a PCA plot, and yes something had gone wrong in the LIMS because the bar-code associated with the assay data did not match the bar-code on the sample in the freezer.

However, as I moved from the academic sphere into the commercial realm (10 years in the commercial sector now in 2021), I've learnt that data errors and issues can be much more varied and challenging, and that they vary from sector to sector, from domain to domain. BUT… I have seen that there are broad classes that explain the patterns of data issues I have experienced over 30 years of mathematical and statistical modelling. In this piece I am going to outline what those broad classes and patterns are, and more importantly how to recognize when they are likely to arise. The broad classes of data are,

  1. Scientific or Experimental – here, data is collected for the purposes of being analysed, and so therefore is optimized towards those purposes.
  2. Sensor data – here, data is collected for the purposes of detecting signals, possibly diagnostic, i.e., working out what the cause of a problem is, or being monitored, i.e., detecting a problem in the first place. It is intended primarily to be automatically monitored/processed, but not necessarily analysed by a human (Data Scientist) in the loop. The data lends itself to large-scale analysis but may not be in a form that is optimal or friendly for doing data science.
  3. Commercial operational data – this is data that is stored, rather than actively collected. It is stored, usually initially on a temporary (short-term) basis, for the purposes of running the business. It could be server event data from applications supporting online platforms or retail sites, or customer transaction data, Google ad-click data, marketing spend data, or financial data.

It is this last category that is perhaps the most interesting – quite obviously if you are a Data Scientist working in a commercial sector. For commercial operational data, not only can the data contain the usual errors but there can be a range of additional challenges – columns/fields in tables containing completely different data than they are supposed to because a field has been over-loaded, or re-purposed for a different need without going through a modification/re-design of the schema. Without an accurate updated data-dictionary, this knowledge about the true content of the fields in a table resides as, ‘folk knowledge’ in the heads of the longer serving members of staff – until they leave the company, and that knowledge is lost.

I have my own favourite stories about data issues I have encountered in various organizations I have worked for – such as the time I worked with some server-side event data whose unique id for each event turned out only to be unique for a day, because the original developer didn’t think anyone would be interested in the data beyond a day.

Every commercial data scientist will be able to tell similar ‘war stories’, often exchanged over a few beers with colleagues after work. Such war stories will not go away any time soon. The more important point is how do we recognize the situations where they are more likely to occur? The contrast I have already highlighted above, between the different classes of data, gives us a clue – where data is not seen as the most valuable part of the process, nor the primary purpose of the process, it will be relatively less invested in and typically of lower quality.

For example, in the first two categories, and in the scientific realm in particular, data is generated with the idea, from the outset, that it is a valuable asset; the data is often the end itself. For example, the data collected from a scientific experiment provides the basis for scientific insight and discoveries, and potentially future IP; the data collected to monitor complex technical systems saves operating costs by minimizing downtime of those systems and potentially identifying improved efficiencies. Poor quality data, or inadequate systems for handling data has an immediate and often critical impact upon the primary goal of the organization or project, and so there is an immediate incentive to rectify errors and data quality issues and to invest in systems that enable efficient capture and processing of the data.

Some commercial organizations also effectively fall into these first two classes. For companies where the data and/or data science is the product from the outset, or the potential value of a data stream has been realized early on, there is an incentive to invest in efficient and effective data eco-systems, with the consequent benefit in terms of data quality – if sufficient investment in data and data systems is not made, a company whose main revenue stream is the data will quickly go out of business.

In contrast, for most commercial organizations, the potential value of commercial data may be secondary to its original purpose, and so data quality from the perspective of these secondary uses may be poor, even if the value or future revenue streams attached to these secondary uses may be much greater than the original primary use of the data. For such companies, the general importance of data may be recognized, but that does not mean that data eco-systems are getting the right sort of investment. For example, for a company providing a B2C platform, data on the behaviour of consumers can provide potentially actionable and important insight, but ultimately it is the platform itself (and its continued operation 24/7) that is of prime importance. Similarly, for an online retail site, the primary concern is volume of transactions and shipping. Consequently, for these organizations the data issues that arise are richer, more colorful, and often more challenging. This is because poor quality data and systems do not immediately threaten the viability of the business, or main revenue streams. Long-term, yes, there will be an impact upon UX and overall consumer satisfaction, but many operations are happy to take that hit and counter it by attempting to increase transaction volume.

For these organizations it may be recognized that data is important enough for capital to be spent on data eco-systems, but that capital investment may be poorly thought through or not joined-up. Again, there are several symptoms I have learnt to recognize as red flags. Dividing these into symptoms related to strategy and tactical related symptoms, they are,

  • Strategy related symptoms:
    1. Lack of a Data Strategy. The value of data, even the potential of value has not been recognized or realistically thought through. Consequently, there is unlikely to be any strategy underpinning the design of the data eco-system.
    2. Any ‘strategy’ is at best aspirational, being driven top-down by an Exec. The capital investment is there, but no joined-up plan. The organization sets up a Data Science function because other organizations are – typified by the hiring of a ‘trophy Data Scientist’. Dig deeper and you will find no real pull from the business areas for a Data Science function. Business areas have been sold Data Science as a panacea, so have superficially bought into the aspirational strategy – who in a business function would not agree when told that the new junior Data Science hires will analyse the poorly curated or non-existent data and come up with new product ideas that have a perfect market fit and will be coded into a fully productionized model? I have seen product owners who have been sold a vision and believe that they can now just put their feet up because the new Data Scientist will do it all.
    3. The organization cannot really articulate why it needs a Data Science function when asked – this is one of my favourites. Asking ‘why’ here is the equivalent of applying the ‘Five Whys’ technique at the enterprise-level rather than at a project-level. Many companies may respond to the question with the clichéd answer, ‘because we want to understand our customers better, which will help us be more profitable’. OK, but where are the examples of when not understanding this particular aspect of your customers has actually hurt the business? If there is no answer given, you know that the need for a machine learning based solution to this task is perceived rather than evidenced.
    4. A mentality exists that data is solely for processing/transacting, not analysing – hence there is little investment in analytics platforms and tooling, resulting in significant friction in getting analytics tooling close to data (or vice-versa). Possibly there is not even a recognition that there is a distinction between analytics systems and reporting systems. Business areas may believe that only data querying skills and capability are needed, so the idea that you want to make the building of predictive models as frictionless as possible for the Data Science team is an alien one.
    5. Lack of balance in investments between Data Engineering and Data Science – this is not optimal whichever function gets the bigger share of the funding. This is often mirrored by a lack of linking up between the Data Science team and the Data Engineering team, sometimes resulting in open hostility between the two.
    6. Lack of Data Literacy in business area owners or product owners. Despite this, organizations often have a belief that they will be able to roll-out a ‘self-serve’ analytics solution across the enterprise that will somehow deliver magical business gains.
  • Tactical related symptoms:
    1. Users in business areas are hacking together processes, which then take on a life of their own. This leads to multiple conflicting solutions, most of which are very fragile, and some of which get re-invented multiple times.
    2. Business and product owners ask for Data Science/Machine Learning driven solutions but can’t answer how they are going to consume the outputs of the solution.
    3. The analytics function, whether ad-hoc or formal, primarily focuses on getting the numbers ready for the month-end report. These numbers are then only briefly discussed, or ignored by senior management, because they conflict with data from other ad-hoc reports built from other internal data sources, or they are too volatile to be credible, or they conflict with the prior beliefs of senior management.
    4. Business and product owners consider it acceptable/normal to use a Data Analyst or Data Scientist for, ‘can you just pull this data for me’ tasks.

As a Data Scientist, what can you do if you find yourself in a scenario where these symptoms arise? There are two potential ways to approach this,

How do we respond to it – what can we do reactively? Flag it up – complain when systems don't work and highlight when they do. Confession here – I've not always been good at complaining myself, so this is definitely a case of me recognizing what I should have done, not what I did. On the less political front, you can try to always construct your analytical processes to minimize the impact of any upstream processes not under your control. By this I mean making downstream analysis pipelines more robust by building in automated checks for data quality and domain-specific consistency. Where this is not possible, then stop or refuse to work on those processes that are so impacted as to be worthless. This last point is about making others share in your pain or shifting/transferring the pain elsewhere – onto business/product owners, from insight to operational teams. If other teams also experience the pain of poor-quality data, or poorly designed data eco-systems and analytical platforms, then they have an incentive to help you try and fix the issues. Again, on a less political front, make sure expectations take into account the reality of what you have to deal with. Get realistic SLAs re-negotiated/agreed that reflect the reality of the data you are having to work with – this will protect you and highlight what could be done if the data were better.

How can we prevent it – what can we do proactively? The most effective way for any organization to avoid such issues is to have a clear and joined-up Data Strategy and Analytics Strategy. It is important for an organization to recognize that having a Data Strategy does not mean it has an Analytics Strategy. Better still, an organization should have a clear understanding of what informed decisions underpin the future operations of its business model. Understanding this will drive/focus the identification of the correct data needed to support those decisions and hence will naturally lead (via pull) to investment in high-quality data capture processes and efficient data engineering/data analysis infrastructure. Formulating the Data Strategy may not formally be part of your brief, but it is likely that, as a Data Scientist and therefore an end-user, you can influence it. This is particularly important in order to make sure that the analytical capability is a first-class and integrated citizen of any Data Strategy. This can be done by giving examples of the cost/pain when it is not so. Very often, a CDO or CTO formulating the Data Strategy will come from an Engineering background and so the Data Strategy will be biased towards Ops. Any awareness of a need for the Data Strategy to support analytics will be focused on near-term analytics, e.g., supporting real-time reporting capabilities. The need for the Data Strategy to support future Ops by supporting insight, innovation and discovery capabilities may be omitted. If a Data Science function is considered, it will be as a bolt-on – possibly a well-funded Data Science bolt-on, but still a bolt-on.

Demand Forecasting at Amazon

AmazonDemandModelBlog_FrontPicture

This is a post that I’ve been meaning to write for a while. Having worked on demand forecasting in the past, I was intrigued when I saw this paper posted on the arXiv pre-print archive from one of the research teams at Amazon.

Although it was obvious why Amazon would be interested in forecasting demand, I was intrigued that Amazon chose to use a state space model approach. About 6 months later I attended the ISBIS2018 conference, at which Lindsay Berry from Duke University presented this paper that also used a state-space model approach to modelling the demand for supermarket goods. I also subsequently became aware of this technical report from Ivan Svetunkov at Lancaster University’s Management School.

With three pre-prints on demand forecasting that all utilised a state-space modelling approach I thought it would be interesting to do a post summarizing the work from the Amazon team. I may get round to doing a further post on the other two pre-prints at a later date.

At this point it is worth explaining a bit about demand models in general. Demand models are statistical models that are usually built from 2-5 years' worth of historical sales data. A demand model enables us to forecast how many units of a product we will sell given the price of the product and a number of other variables. The models allow us to forecast a number of ‘what-if’ scenarios, e.g. what will happen to my sales if I reduce the product price by 20%? Ultimately, the demand model can enable us to determine the optimal price we should charge for a product depending on what business KPI we want to optimize. Some traditional approaches to demand modelling use a log-log model, with the log of the demand of an item being linear in the log of the price of the item. These models are of the Working-Leser type of demand models [1,2]. For goods with large demand volumes, a log-log model form will be a reasonable assumption as the natural quantum of demand (a single unit) is much smaller than the typical level of demand, and so we can consider the demand of such goods as effectively being continuous random variables.
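
As a toy illustration of the log-log form (nothing to do with the Amazon paper itself), simulated sales generated with a constant price elasticity can have that elasticity recovered by a simple least-squares fit on the logs:

import numpy as np

rng = np.random.default_rng(0)

# simulate a constant-elasticity (log-log) demand curve:
# log(demand) = a + b*log(price) + noise, with elasticity b = -1.8
price = rng.uniform(5.0, 50.0, size=1000)
log_demand = 8.0 - 1.8 * np.log(price) + 0.2 * rng.normal(size=1000)

# ordinary least squares on the logs recovers the intercept and elasticity
A = np.column_stack([np.ones_like(price), np.log(price)])
coefs, *_ = np.linalg.lstsq(A, log_demand, rcond=None)
print(coefs)  # approximately [8.0, -1.8]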

The actual problem the Amazon team tackled was forecasting of intermittent demand, i.e. for what are commonly called ‘slow-moving goods’, whose pattern of sales is bursty. We might typically only sell, on average, fewer than say 10 units a week for these kinds of products. There may be no sales for a week or two, followed by a burst of sales concentrated in a few days. Snyder et al give a good modern review of the problem of intermittent demand forecasting [3].

For such products, the traditional log-log type demand models can perform poorly, as we are dealing with products that sell only a few units per time period. However, there is no consensus approach to modelling such products, which means it is an area ripe for novel and innovative methods. The paper by Seeger et al combines 3 interesting ideas,

  1. A multi-stage model – this means decomposing the modelling of demand into several models that cover different demand sizes. In this case separate models are constructed for when the expected demand is 0 units, 1 unit, and >1 unit.
  2. The combining of the multi-stage model with a state-space model. This has the effect of introducing exponential smoothing and hence some temporal continuity to the modelled demand.
  3. The use of a Kalman-filter approach to locating the mode when using a Laplace approximation to approximate a marginal posterior. This third innovation is the most technical, but, for me, also the most interesting.

The first of these innovations is not necessarily that much of a step-change. Other attempts to model slow-moving goods have also considered a mixture of distributions/processes to allow for the zero-inflation that one has in the weekly observed sales of a slow-moving good. Seeger et al use a three stage model, so that we have 3 latent functions,

y_{t}^{(0)}(x), which is used in modelling the probability of zero sales at time point t

y_{t}^{(1)}(x), which is used in modelling the probability of a single unit being sold at time point t

y_{t}^{(2)}(x), which is used in modelling the distribution of units sold at time point t, given the number of units is greater than 1.

The second innovation is an interesting one. Whilst I had come across the use of self-excitation (Hawkes processes) to model the bursty behaviour of intermittent demand, I hadn’t seen temporal continuity enforced via a latent state contribution to the linear predictors of the mixture components. For demand greater than a single unit Seeger et al model the demand {z_{t}} at time point t as following a Poisson distribution,

P\left ( z_{t}-2 | y^{(2)}_{t}\right )\;=\; \frac{1}{(z_{t}-2)!}\lambda( y^{(2)}_{t}  )^{z_{t}-2}\exp\left ( -\lambda ( y^{(2)}_{t} )\right )\;\;.

Here \lambda(\cdot) is a transfer function. The latent function y^{(2)}_{t} depends upon a latent state {\boldsymbol  l}_{t} and it is this latent state that is governed by a Kalman filter. Overall the latent process is,

y^{(2)}_{t}\;=\; {\boldsymbol a}^{\top}{\boldsymbol l}_{t-1}\;+\; b_{t}\;\;,\;\;b_{t}\;=\;{\boldsymbol \omega}^{\top}{\boldsymbol x}_{t}\;\;,\;\;{\boldsymbol l}_{t}\;=\;{\boldsymbol F}{\boldsymbol l}_{t-1}\;+\;{\boldsymbol g}_{t}\epsilon_{t}\;\;,\;\;\epsilon_{t}\sim N(0,1)

The latent variables \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0} have to be integrated out to yield a marginal posterior distribution that can then be maximized to obtain parameter estimates for the parameters that control the innovation vectors {\boldsymbol g}_{t}\;,t=1,\ldots,T-1.

It is the marginalization over \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0} that the third interesting technical innovation of Seeger et al is concerned with. The integration over \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0} is approximated using a Laplace approximation. The Laplace approximation simply replaces the exponent of the integrand by its second order Taylor expansion approximation, in order to approximate a complicated integration by an integration of a Gaussian. It is the simplest of a family of saddlepoint expansion techniques for obtaining asymptotic expansions of integrals (see for example the classic book by Wong).

The main task in a Laplace approximation is locating the maximum of the exponent of the integrand. Seeger et al do this via a Newton-Raphson procedure, i.e. expand the exponent to second order around the current estimate of the mode and then find the maximum of that second order approximation.

Consider a 1-dimensional example. Let q(x) be the function whose maximum, x_{*}, we are trying to locate. If the expansion of q(x) around our current estimate {\hat x}_{*} of x_{*} is,

q(x) \;=\; q( {\hat x}_{*} )\;+\; ( x - {\hat x}_{*}) q^{(1)}({\hat x}_{*})\;+\; \frac{1}{2}(x- {\hat x}_{*} )^{2}q^{(2)}({\hat x}_{*})\;+\; O\left ( (x-{\hat x}_{*})^{3}\right )

The updated estimate of x_{*} is then determined by maximizing the second order expansion above, and is given by,

{\hat x}_{*} \rightarrow {\hat x}_{*} \;-\; \frac{q^{(1)}( {\hat x}_{*})}{q^{(2)}( {\hat x}_{*})}
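
In code, the one-dimensional version of this update is only a few lines; the sketch below uses a toy concave function rather than anything from the Seeger et al paper:

def newton_raphson_max(grad, hess, x0, tol=1e-10, max_iter=50):
    # repeatedly maximize the local second-order expansion of q(x)
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# toy example: q(x) = -(x - 2)^2 + 3 has its maximum at x = 2
x_star = newton_raphson_max(grad=lambda x: -2.0 * (x - 2.0),
                            hess=lambda x: -2.0,
                            x0=0.0)
print(x_star)  # 2.0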

The schematic below depicts how the iterative Newton-Raphson procedure locates the maximum of a one-dimensional function.

NewtonRaphsonSchematic

The multi-dimensional equivalent update rule when we are maximizing a function q({\boldsymbol x}) of a vector {\boldsymbol x} is,

\hat{\boldsymbol x}_{*} \rightarrow \hat{\boldsymbol x}_{*} \;-\; {\boldsymbol H}^{-1}( \hat{\boldsymbol x}_{*})\,\nabla q (\hat {\boldsymbol x}_{*} )\;\;,

where {\boldsymbol H}( \hat{\boldsymbol x}_{*}) is the Hessian of q({\boldsymbol x}) evaluated at \hat{\boldsymbol x}_{*}\;\;.

As Seeger et al are marginalizing the posterior over \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0}, the Taylor expansion around any point is necessarily multi-variate, and so ordinarily, finding the maximum of that second order approximation would involve inverting the Hessian of the log-posterior evaluated at the current estimate of the mode. As the latent variables we are trying to marginalize over by doing the Laplace approximation are the T-1 innovations, \epsilon_{1},\;\ldots\;, \epsilon_{T-1} and {\boldsymbol l}_{0}, this means that each step of the Newton-Raphson procedure would involve the inversion of a T\times T matrix, i.e. an O\left(T^{3}\right) operation for each Newton-Raphson step. However, Seeger et al point out that once we have replaced the log-posterior by a second-order approximation, finding the maximum of that approximation is equivalent to finding the posterior mean of a linear-Gaussian state-space model, and this can be done using Kalman smoothing. This means in each Newton-Raphson step we need only run a Kalman filter calculation, an O\left( T \right) calculation, rather than a Hessian inversion calculation which would be O\left(T^{3}\right). When training on, say, 2 years of daily sales data with T=730, the speed-up will be significant. Seeger et al do point out that this trick of reducing the computation to one that scales linearly in T is already known within the statistics literature [4], but not widely known within machine learning.

Seeger et al apply their methodology to a number of real-world scale datasets, for example to a ~40K item dataset with almost a year of historical data at the day-level. Overall run-times for the parameter learning are impressive (typically a few seconds for each of the separate demand model stages), though admittedly this is when running on a 150 node Spark cluster.

References

  1. Working, H. (1943) Statistical laws of family expenditure. J. Am. Statist. Ass., 38:43–56.
  2. Leser, C. E. V. (1963) Forms of Engel functions. Econometrica, 31, 694–703.
  3. Snyder, R., Ord, J., and Beaumont, A. (2012), Forecasting the intermittent demand for slow-moving inventories: A modelling approach. International Journal on Forecasting, 28:485-496
  4. Durbin, J. and Koopman, S. (2012), Time Series Analysis by State Space Methods. Oxford Statistical Sciences. Oxford University Press, 2nd Edition.

Defensive & Offensive Predictive Analytics

Harvard_business_review

This article from the Harvard Business Review was a short but interesting read. The article talks about defensive and offensive data management strategies – defensive being about minimizing downside risk, and being more common in regulated industries, whilst offensive data management strategies focus on supporting business objectives and are more common in un-regulated or less regulated industries. The authors rightly point out that the correct mix of defensive and offensive strategies will be specific to each individual organization.

Having worked in a number of commercial sectors, my experience is that the use of predictive analytics within organizations also divides into defensive and offensive activities, irrespective of whether that predictive analytics and data science activity is enabled by a well thought out data management strategy. There are good reasons for this, and again it is largely determined by whether or not the activity of an organization carries a large downside risk.

Consider a company, such as a bank, whose activity has a large downside risk, and where losses due to bad loans can almost wipe out a balance sheet very quickly. My experience of doing analytics in a UK retail bank is that the predictive analytics focus is on modelling that downside risk with a view to understanding it, forecasting it and ultimately mitigating against it. The analytics effort focuses on risk minimization (still an optimization), whilst optimization of the profit side of the P&L is less computationally based, e.g. by committees of human subject matter experts deciding mortgage or savings rates for the coming year.

In contrast, in companies where the downside risk is lower, such as those where transactions with the organization’s customers are on much shorter timescales than a bank, then the use of predictive modelling tends to focus more on the optimization of revenue and profits, rather than minimization of losses from liabilities. Take grocery supermarkets, where predictive demand models are used to set product prices in order to optimize profit. Whilst getting the pricing strategy wrong will impact revenues, it does not lead to the organization holding a long-term liability and is ultimately reversible. Mistakes when using predictive models in this domain are unlikely to take a company down.  

From what I have seen the use of predictive modelling within a business is typically almost binary, i.e. either predominantly on the downside risk, or predominantly on optimizing the business objectives, even though most businesses will have both upsides and downsides to their activity. I haven't seen that many medium-scale organizations where predictive modelling is used at a mature level across the majority of the business areas or tasks. Even rarer in my experience are situations where predictive modelling of both downside risks and business objectives is done concurrently, with the optimization taking into account both sides of the P&L. It would be interesting to find good examples outside, say, the largest 50 companies in the FTSE100, Dow Jones, Nasdaq, or S&P500, where a more joined-up approach is taken to using predictive analytics for optimizing the P&L.

pracma R package

I’ve used the pracma package in R for a while now. My main use, I’ll confess, is because it provides a convenient method for sampling orthonormal matrices. The first time I was faced with this task (16 years ago) I had to code up my own version of the Stewart algorithm in Java. The algorithm works by iteratively applying a series of Householder transformations – see for example the blog post by Rick Wicklin for a description and implementation of the algorithm. However, now that I predominantly use R and python the implementation of the Stewart algorithm in pracma is extremely handy and follows a similar form to the other random variate sampling functions in base R – see below,

library( pracma )

nDim <- 100 # set dimension of matrix
orthoMatrix <- rortho( nDim )

More recently I've been exploring the pracma package further, and I've been continually amazed at how many useful little utility methods are available in the package – all the little methods for doing numerical scientific calculations that I was taught as an undergraduate – methods that I have ended up coding from scratch multiple times. OK, the package describes itself as 'practical numerical math routines' so I guess I shouldn't be surprised.

Practical PCA

PCA_Schematic

Principal Component Analysis (PCA) is a commonly applied algorithm in statistics and data science. Because it is so easy to understand at a high-level, and because it is so easy to apply, PCA has become ubiquitous. It is often applied without much thought and with the output rarely questioned. Typically, the questions I ask when applying PCA are,

  1. Do I need to do any transformation of the data before applying PCA?
  2. How many principal components should I select?
  3. How do I interpret the loadings (and scores)?

Unfortunately, I have seen a number of talks and presentations recently where PCA has been used and the impact on the analysis of not having thought about these questions was clear. Ok, I’m biased, as the behaviour of PCA, particularly when applied to high-dimensional data, is one of my areas of research. Whilst the research papers on PCA can be very complex, they do however provide some useful insight and guides on how to apply PCA to real data. In this post I’m going to look at those three questions in turn. In the following I’m also going to assume you are already familiar with PCA and that you are aware that the principal components are the eigenvectors of the sample covariance matrix corresponding to the largest eigenvalues. For a good introduction to PCA see the blog post by Laura Hamilton and also the classic book by Jolliffe.

TL;DR – the short answers to the questions above are, i) check for outliers and/or heavy tails in your data before applying PCA, ii) use the ‘knee’ in the eigenvalue scree plot to select the number of components, iii) make sure you look at the loadings of the selected principal components and that you can explain any major patterns in those loadings.

1. Transformation of the data

The original derivations of PCA, such as that by Hotelling (1933), are heuristic and make no explicit assumptions other than the requirement that the selected components retain as much of the variance in the original data as possible.  The formal derivation of PCA as the maximum likelihood estimate of parameters of a probabilistic model assumes additive Gaussian noise – see for example the paper by Tipping and Bishop. In practice, where the distributions of latent factors and additive noise are still reasonably symmetric and decay sufficiently fast we would expect deviations from perfect Gaussians to have only a minor effect. Despite this I have seen a number of talks with PCA plots similar to that shown on the left below.

PCAScores_HeavyTailExample1

I have based this example on a real commercial data set I was shown, but for privacy and confidentiality reasons I have generated a simulated data set that reproduces the issue. The majority of the PC1 scores are squashed into the left-hand side of the plot, with a few scores reaching much higher values than the rest. This can be seen more clearly if we look at the estimated density of the 1st PC scores – see the right-hand plot above. A heavy tail is clearly present, meaning we are deviating significantly from the assumptions under which the eigenvectors of the sample covariance matrix are the optimal estimators of the signal directions in the raw data.

So an obvious first step would be to take a quick look at the distribution of the raw data that we are trying to decompose using PCA. If that distribution has a heavy tail or significant outliers then there is an argument for applying a transformation (e.g. logarithm) before applying PCA. If we take the log of the example data above then we obtain much more reasonable distributions for the PC scores – see below.

PCAScores_HeavyTailExample2

Tip: If you see an elongated distribution of scores along the PCs you have selected, then it may be worth going back and looking at the distribution of the raw data going into the PCA – you should have already looked at the distribution of your raw data anyway as part of EDA best practice.
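
To make that tip concrete, below is a minimal sketch of the kind of check and transformation I have in mind. The data here is simulated (log-normal, so strictly positive and heavy tailed) and the variable names are purely illustrative,

import numpy as np
from scipy.stats import skew
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# illustrative heavy-tailed, strictly positive data (log-normal)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 20))

# EDA check: large positive skewness suggests a heavy right tail in the raw features
print("median feature skewness:", np.median(skew(X, axis=0)))

# apply a log transform before PCA (only sensible for strictly positive data)
X_log = np.log(X)

scores_raw = PCA(n_components=2).fit_transform(X)
scores_log = PCA(n_components=2).fit_transform(X_log)

# compare the shape of the 1st PC score distributions with and without the transform
print("PC1 score skewness (raw):", skew(scores_raw[:, 0]))
print("PC1 score skewness (log):", skew(scores_log[:, 0]))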

2. Selection of the number of components

There are two common methods for selecting the number of components that most people will be familiar with or encounter. Those are,

  1. Select the number of components so that a fixed proportion, e.g. 90%, of the total variance is retained.
  2. Look at a scree plot of the eigenvalues and locate the ‘elbow’ (or ‘knee’ depending on your interpretation of anatomy).

Of those two methods, it is the second that I always prefer. This is because it has a sound theoretical underpinning and is more robust when applied to the kind of high-dimensional data sets that are commonplace nowadays. Let me explain why. The goal of PCA is to select a small number of directions in the data that we believe capture the signal within it. The first approach to model selection is effectively based upon the assumption that the ‘signal’ contribution to the total variance is considerably greater than the noise contribution, and thus, by choosing to retain the majority of the total variance in the original data, we believe we are effectively selecting components that represent signal.

If our data points are p dimensional, then we have p sample covariance eigenvalues, \lambda_{i},\;i=1,\ldots,p. We consider the signal part of the data to be represented by a small number, k, of eigenvectors/eigenvalues, with the p-k-1 remaining non-zero eigenvalues corresponding to pure noise in the original data.

Typically, there is no reason to believe that the noise process affects any of the original features/variables more strongly than the others – i.e. it is reasonable to consider the noise process to be isotropic. If not, then the preferred directions are essentially some form of signal and not noise. This means we expect the noise eigenvalues to be approximately the same; let’s say they have an average value \sigma^{2}. As we look at data sets of increasing dimension p, unless the number of signal eigenvalues, k, or the signal eigenvalues \lambda_{1},\ldots,\lambda_{k} themselves, grow with p, the variance explained by the signal components, \sum_{i=1}^{k}\lambda_{i}, will remain relatively static whilst the noise eigenvalues contribute (p-k-1)\sigma^{2} to the total variance. Thus we can see that the fraction of total variance explained by the signal components is essentially,

{\rm Fraction\;of\;variance\;explained}\;=\;\frac{\sum_{i=1}^{k}\lambda_{i}}{(p-k-1)\sigma^{2}\;+\;\sum_{i=1}^{k}\lambda_{i}}\;\;,

and so decreases as the data dimension p increases. Consequently, for high-dimensional data, where p is very large (often in the thousands), the percentage of variance explained by the true signal components can be very small indeed. Conversely, if we select the number of components so as to retain, say, 90% of the total variance, we will be including a lot of noise in the retained components and not reducing the dimensionality as efficiently as we could.
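
To put some purely illustrative numbers on this, suppose we have k=5 signal eigenvalues each equal to 10 and noise variance \sigma^{2}=1. For p=100 the fraction of variance explained by the signal components is 50/(94+50)\approx 0.35, whilst for p=1000 it falls to 50/(994+50)\approx 0.05 – so a rule that retains, say, 90% of the total variance is forced to include a large number of noise components.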

In contrast, if the true eigenvalues are split into a small number of ‘signal’ eigenvalues and a much larger number of  ‘noise’ eigenvalues and we expect the noise process to be isotropic (i.e. we have a single highly degenerate noise eigenvalue), then the observed (sample) eigenvalues will also consist of a bulk of eigenvalues clustered around a single value and a small number of larger eigenvalues separated from this bulk. In other words we expect to see the distribution of sample covariance eigenvalues look something like the plot on the left below.

SP500_ScreePlot1

The eigenvalues in this example have been obtained from the sample covariance matrix of inter-day returns of closing prices over the 4-year period 2010 to 2013 for S&P500 stocks. We have actually omitted the largest eigenvalue, as this is on a different scale and represents a ‘market mode’ – see the next section.

Note that the bulk of the sample eigenvalues show some dispersion even though we had only one true, highly degenerate, noise eigenvalue. This is due to sampling variation, i.e. the fact that we have only a finite sized sample. If we plot the top 20 eigenvalues from the distribution above, ranked from largest to smallest, we get a scree plot looking like the plot on the right above.

Clearly, the sample eigenvalues corresponding to the bulk are very similar to each other, and so we see only small decreases with increasing rank in the sample eigenvalues in the bulk. In contrast, where there is an atypical jump in eigenvalue as we decrease the rank, this represents the point at which a signal eigenvalue can be detected as being separate from the bulk. This point also represents the ‘knee’ of the scree plot.

Statistical models that produce this kind of scree-plot are called ‘spiked-covariance’ models, so-called because the true population eigenvalue distribution is concentrated (or spiked) around just a small number of values. For these models we consider the data to have been produced by a small number of latent factors with isotropic noise. That is, our data point {\bf x}_{i} is given by,

{\bf x}_{i}\;=\;\sum_{j=1}^{k}z_{ij}\lambda_{j}{\bf B}_{j}\;+\; \boldsymbol{\epsilon}_{i}\;\;\;,\;\;\;\boldsymbol{\epsilon}_{i}\sim {\cal N}\left ( {\bf 0}, \sigma^{2}{\bf I} \right)\;\;\;,\;\;\;z_{ij}\sim {\cal N}\left ( 0, 1\right)\;\;\;,

with the vectors {\bf B}_{1},\ldots,{\bf B}_{k} forming an orthonormal set.
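
To make the spiked-covariance picture concrete, here is a small simulation sketch. The parameter values are my own illustrative choices (not the S&P500 analysis), and note that I scale the j-th factor by \sqrt{\lambda_{j}} so that the population covariance eigenvalues are \lambda_{j}+\sigma^{2} – an assumption on my part about the intended convention,

import numpy as np

rng = np.random.default_rng(0)

p, N, k = 200, 500, 3                  # dimension, sample size, number of spikes
sigma2 = 1.0                           # isotropic noise variance
spikes = np.array([25.0, 15.0, 8.0])   # illustrative signal eigenvalues lambda_1, ..., lambda_3

# orthonormal signal directions B_1, ..., B_k via a QR decomposition of a Gaussian matrix
B, _ = np.linalg.qr(rng.standard_normal((p, k)))

# latent factor model: each factor scaled by sqrt(lambda_j), so that the
# population covariance is sum_j lambda_j B_j B_j^T + sigma2 * I
Z = rng.standard_normal((N, k))
X = (Z * np.sqrt(spikes)) @ B.T + np.sqrt(sigma2) * rng.standard_normal((N, p))

# sample covariance eigenvalues, ranked largest to smallest
evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.round(evals[:6], 2))   # expect 3 eigenvalues clearly separated from a dispersed noise bulk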

From the mathematical form of the spiked-covariance models we can work out the expected distribution of sample covariance eigenvalues when our data consists of N data points (each of dimension p). A large amount of research has been done in this area over the last 10-15 years. I won’t try to summarize it here; instead I’ll point you to the excellent review by Johnstone, and the original work by Paul. This research allows us to devise methods for the automatic detection of the number of principal components. Such methods have the advantage that they can be automated, i.e. programmed as part of an algorithm. On the practical side you can always “eyeball” the scree plot. For an isolated piece of analysis this is always advisable, as it takes so little time to do.

Also on the pragmatic side, I have often found that simple ‘knee’ detection algorithms work surprisingly well, particularly for the ‘real-world’ data sets that I encounter as part of my day-to-day work. The simplest of such algorithms involves finding the maximum of a discrete approximation to the second derivative of the ranked eigenvalue plot. That is, we choose k as,

k \;=\; \underset{i}{\mathrm{argmax}} \left ( \lambda_{i+1} + \lambda_{i-1} - 2\lambda_{i}\right )
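
As a concrete illustration, a minimal (and deliberately naive) implementation of this second-difference detector is sketched below, with a toy spectrum of my own making. For that spectrum the maximum lands at rank 4, i.e. at the first eigenvalue of the bulk, immediately after the three signal eigenvalues,

import numpy as np

def knee_second_difference(eigenvalues):
    """Rank of the 'knee' in a scree plot, chosen as the maximum of the
    discrete second difference lambda_{i+1} + lambda_{i-1} - 2*lambda_{i}."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # rank largest to smallest
    second_diff = lam[2:] + lam[:-2] - 2.0 * lam[1:-1]          # interior ranks i = 2, ..., p-1
    return int(np.argmax(second_diff)) + 2                      # convert back to the 1-based rank i

# toy spectrum: 3 'signal' eigenvalues well separated from a noise bulk near 1
rng = np.random.default_rng(1)
evals = np.concatenate([[26.0, 16.0, 9.0], 1.0 + 0.1 * rng.standard_normal(97)])
print(knee_second_difference(evals))   # prints 4 here: the first eigenvalue of the bulk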

Improved approaches to ‘knee’ detection are based upon discrete approximations of the curvature. The paper by Satopää et al. gives a good introduction to such methods. Again, these simple ‘knee’ detection approaches have the advantage that they can be coded up and hence included as part of an automated process.

Note that in situations where the signal eigenvalues (those distinct from the bulk) do contribute the majority of the variance in the original data, selecting the number of PCs on the basis of detecting a ‘knee’ in a scree plot will have the same advantages as selecting on the basis of wanting to retain a certain fixed percentage of the total variance. Consequently, there is very little reason not to use the scree plot for selection in all cases.

3. Interpretation of the loadings

I have also found it is not uncommon for PCA to be applied blindly, with the scores, i.e. the values of the new, lower-dimensional features, plotted or fed into some downstream process without any further curiosity being applied. Where the leading sample eigenvalue is very large – that is, on a different scale to the other principal components retained or to the bulk of the sample eigenvalues – I always take a look at the loadings. This is the case for the S&P500 data discussed above. The loadings tell us the contribution each of the original variables makes to the principal component. The loadings for the 1st principal component of the S&P500 data are shown in the plot below,

SP500_1stPC_Loadings

Where the 1st sample eigenvalue is very large (compared to the others) it is not uncommon to see a loading pattern like that above – all the loading values are of the same sign (and, in this case, of comparable size). The large 1st eigenvalue tells us that variation along the 1st principal component is the dominant mode of variation present in the original data. The loading pattern tells us that in this mode of variation all the original variables increase together, or decrease together. Such a pattern is often termed a ‘global mode’, i.e. it is a mode of variation that has largely the same effect globally on all the original variables. There are several scenarios and situations where the presence of a global mode can be naturally explained or is to be expected. For example,

  • Market modes in stock prices. This is where a rising or falling market causes all stock prices to rise or fall together.
  • In gene expression data obtained from model organisms exposed to a large environmental perturbation or insult. Here, for model organisms, e.g. yeast cultures, we can shock the biological system being studied without any ethical concerns, e.g. starve the organism of its primary food/fuel source or other essential nutrients. Consequently we see a system wide response to the starvation.
  • Price sensitivities of a collection of products sold by a retailer in a store. Here we expect the price elasticities of products to reflect, in large part, the economic conditions of the local geography. Consequently, a predominant part of the variation in product elasticities will be due to store-to-store variation in economic conditions and may show up as a global mode.

Where we see a global mode in the loadings, we should ask whether we can identify a credible mechanism behind the global mode. If not, then this should make us cautious about the appropriateness of the 1st principal component and hence the complete decomposition.
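
A quick and admittedly crude check for a global mode is simply to look at the sign pattern of the 1st principal component’s loadings. A minimal sketch, assuming a data matrix X with observations in rows and variables in columns (both function names below are mine),

import numpy as np

def first_pc_loadings(X):
    """Loadings (eigenvector) of the 1st principal component of X,
    with observations in rows and variables in columns."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return evecs[:, -1]                     # eigenvector of the largest eigenvalue

def looks_like_global_mode(loadings, frac_same_sign=0.95):
    """Crude indicator: nearly all loadings share the same sign."""
    signs = np.sign(loadings[np.abs(loadings) > 1e-12])
    return max(np.mean(signs > 0), np.mean(signs < 0)) >= frac_same_sign

# e.g. loadings = first_pc_loadings(returns_matrix)   # returns_matrix is hypothetical
# print(looks_like_global_mode(loadings))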

Finally, it is worth mentioning that if a sparse loading pattern is more naturally expected, or more convenient, then sparse versions of PCA can be used. The lecture notes by Rob Tibshirani provide a good introduction to sparse PCA.

Log Partition Function of RBMs

I have been doing some work recently on Restricted Boltzmann Machines (RBMs). Specifically, I have been looking at the evaluation of the log of the partition function.

RBMs consist of a layer of visible nodes and a layer of hidden nodes, with the number of hidden nodes typically being less than the number of visible nodes.  Where both visible and hidden nodes have binary states we can think of the RBM as performing a discrete-to-discrete dimensionality reduction. Stacked RBMs provided some of the earliest examples of deep learning neural networks – see for example the work of Hinton and Salakhutdinov.

RBM

The partition function Z is the normalizing constant for the joint distribution over the states of the visible and hidden nodes, and is often used for model selection, i.e. when we want to control or select the right level of complexity in the RBM.  I wanted to test out some of the ideas behind the message passing algorithm of Huang and Toyoizumi (arxiv version here).  Huang and Toyoizumi use a Bethe approximation to develop a mean-field approximation for log Z. As usual, the self-consistent mean-field equations lead to a set of coupled equations for the expected magnetization, which are solved iteratively leading to the passing of information on local field strengths between nodes – the so-called message passing. To test the Huang and Toyoizumi algorithm I need to know the true value of log Z.

A standard, non mean-field, method for evaluation of the log-partition function is the Annealed Importance Sampling (AIS) algorithm of Salakhutdinov and Murray, who base their derivation on the generic AIS work of Neal (arxiv version). The AIS algorithm is a Monte Carlo-based approach that samples from a series of RBMs, going from completely decoupled (no visible-to-hidden node interactions) to the fully coupled RBM of interest.

I have pushed my implementations of the Huang and Toyoizumi message passing algorithm and the Salakhutdinov and Murray AIS algorithm to github. However, there is still the question of how I test the implementations, given that there are no simple closed-form analytical expressions for log Z when we have visible-to-hidden node coupling. Fortunately, as the RBMs are of finite size, then for sufficiently small hidden and visible layers we can evaluate log Z ‘exactly’ via complete enumeration of all the states of the visible and hidden layers. I say ‘exactly’ as some numerical approximation can be required when combining terms in the partition function whose energies are on very different scales. I have also included code in the github repository to do the ‘exact’ evaluation.
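
To give a flavour of the ‘exact’ evaluation, here is a minimal brute-force sketch. I assume binary \{0,1\} states and the energy parameterization E({\bf v},{\bf h}) = -{\bf a}^{T}{\bf v} - {\bf b}^{T}{\bf h} - {\bf v}^{T}W{\bf h}, and use log-sum-exp for numerical stability; the code in the repository differs in its details,

import itertools
import numpy as np
from scipy.special import logsumexp

def exact_log_partition(W, a, b):
    """Brute-force log Z for a small binary RBM with energy
    E(v, h) = -a.v - b.h - v.W.h (feasible only for ~20 or fewer total nodes)."""
    nv, nh = W.shape
    log_terms = []
    for v in itertools.product([0, 1], repeat=nv):
        v = np.array(v)
        for h in itertools.product([0, 1], repeat=nh):
            h = np.array(h)
            log_terms.append(a @ v + b @ h + v @ W @ h)   # this is -E(v, h)
    return logsumexp(log_terms)

# tiny illustrative example
rng = np.random.default_rng(0)
nv, nh = 6, 3
W = 0.1 * rng.standard_normal((nv, nh))
logZ = exact_log_partition(W, np.zeros(nv), np.zeros(nh))
print(logZ)   # with zero biases and weak weights this is close to (nv + nh) * log(2)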

China invests big in AI

This week saw interesting news and career-guide articles in Nature highlighting the Chinese government’s plans for its AI industry. The goal of the Chinese government is to become a world leader in AI by 2030. China forecasts that the value of its core AI industries will be US$157.7 billion in 2030 (based on the exchange rate at 2018/01/19). How realistic that goal is will obviously depend upon the momentum already present within China’s AI sector, but even so I was struck and impressed by the ambition of the goal – 2030 is only 12 years away, which is not long in research and innovation terms. The Nature articles are worth a read (and are not behind a paywall).

ChinaAI_screenshot
Nature news article on China’s ‘New Generation of Artificial Intelligence Development Plan’

What will be the effect of China’s investment in AI? Attempting to make technology based predictions about the future can be ill-advised, but I will speculate anyway, as the articles, for me, prompted three immediate questions:

  • How likely is China to be successful in achieving its goal?
  • What sectors will it achieve most influence in?
  • What are competitor countries doing?

How successful will China be?

Whatever your opinions on the current hype surrounding AI, Machine Learning, and Data Science, there tends to be a consensus that Machine Learning will emerge from its current hype-cycle with some genuine gains and progress. This time it is different. The fact that serious investment in AI is being made not just by corporations but by governments (including the UK) could be taken as an indicator that we are looking beyond the hype. Data volumes, compute power, and credible business models are all present simultaneously in this current AI/Machine Learning hype-cycle, in ways that they weren’t in the 1980s neural network boom-and-bust and other AI Winters. Machine Learning and Data Science are becoming genuinely commoditized. Consequently, the goal China has set itself is about building capacity, i.e. about the transfer of knowledge from a smaller innovation ecosystem (such as the academic community and a handful of large corporate labs) to produce a larger but highly-skilled bulk of practitioners. A capacity-building exercise such as this should be a known quantity and so investments will scale – i.e. you will see proportional returns on those investments. The Nature news article does comment that China may face some challenges in strengthening the initial research base in AI, but this may be helped by the presence of large corporate players such as Microsoft and Google, who have established AI research labs within the country.

What sectors will be influenced most?

One prominent area for applications of AI and Machine Learning is commerce, and China provides a large potential market place. However, access to that market can be difficult for Western companies and so Chinese data science solution providers may face limited external competition on their home soil. Equally, Chinese firms wishing to compete in Western markets, using expertise of the AI-commerce interface gained from their home market, may face tough challenges from the mature and experienced incumbents present in those Western markets. Secondly, it may depend precisely on which organizations in China develop the beneficial experience in the sector. The large US corporates (Microsoft, Google) that have a presence in China are already main players in AI and commerce in the West, and so may not see extra dividends beyond the obvious ones of access to the Chinese market and access to emerging Chinese talent. Overall, it feels that whilst China’s investment in this sector will undoubtedly be a success, and Chinese commerce firms will be a success, China’s AI investment may not significantly change the direction the global commerce sector would have taken anyway with regard to its use and adoption of AI.

Perhaps more intriguing will be newer, younger sectors in which China has already made significant investment. Obvious examples, such as genomics, spring to mind, given the scale of activity by organizations such as BGI (including the AI-based genomic initiative of the BGI founder Jun Wang). Similarly, robotics is another field highlighted within the Nature articles.

What are China’s competitors doing in this area?

I will restrict my comments to the UK, which, being my home country, I am more familiar with. Like China, the UK has picked out AI, Robotics, and a Data-Driven Economy as areas that will help enable a productive economy. Specifically, the UK Industrial Strategy announced last year identifies AI for one of its first ‘Sector Deals’ and also as one of four Grand Challenges. The benefits of AI are even called out in other Sector Deals, for example in the Sector Deal for the Life Sciences. This is on top of existing UK investment in Data Science, such as the Alan Turing Institute (ATI) and last year’s announcement by the ATI that it is adding four additional universities as partners. In addition we have capacity-building calls from research councils, such as the EPSRC call for proposals for Centres for Doctoral Training (CDTs). From my quick reading, 4 of the 30 priority areas that the EPSRC has highlighted for CDTs make explicit reference to AI, Data Science, or Autonomous Systems; the number with some implicit dependence on AI or Data Science will be greater. Overall, the scale of the UK investment is, naturally, unlikely to match that of China – the original Nature report on the Chinese plans notes that no level of funding is mentioned. However, the likely scale of the Chinese governmental investment in AI will ultimately give that country an edge, or at least a higher probability of success. Does that mean the UK needs to re-think and up its investment?


Faa di Bruno and derivatives of an iterated function

I have recently needed to do some work evaluating high-order derivatives of composite functions. Namely, given a function f(t), evaluate the n^{th} derivative of the composite function  \underbrace{\left (f\circ f\circ f \circ \ldots\circ f \right )}_{l\text{ terms}}(t). That is, we define f_{l}(t) to be the function obtained by iterating the base function f(t)=f_{1}(t) l-1 times. One approach is to make recursive use of the Faa di Bruno formula,

\displaystyle \frac{d^{n}}{dx^{n}}f(g(x))\;=\;\sum_{k=1}^{n}f^{(k)}(g(x))B_{n,k}\left (g'(x), g''(x), \ldots, g^{(n-k+1)}(x) \right )      Eq.(1)

The fact that the exponential partial Bell polynomials B_{n,k}\left (x_{1}, x_{2},\ldots, x_{n-k+1} \right ) are available within the {\tt sympy} symbolic algebra Python package makes this initially an attractive route to evaluating the required derivatives. In particular, I am interested in evaluating the derivatives at t=0 and I am focusing on odd functions of t, for which t=0 is obviously a fixed point. This means I only have to supply numerical values for the derivatives of my base function f(t) evaluated at t=0, rather than supplying a function that evaluates derivatives of f(t) at any point t.

Given the Taylor expansion of f(t) about t=0 we can easily write code to implement the Faa di Bruno formula using sympy. A simple bit of pseudo-code to represent an implementation might look like,

  1. Generate symbols.
  2. Generate and store partial Bell polynomials up to known required order using the symbols from step 1.
  3. Initialize coefficients of Taylor expansion of the base function.
  4. Substitute numerical values of derivatives from previous iteration into symbolic representation of polynomial.
  5. Sum required terms to get numerical values of all derivatives of current iteration.
  6. Repeat steps 4 & 5.

I show python code snippets below implementing the idea. First we generate and cache the Bell polynomials,

import sympy

nMax = 15  # highest derivative order required (example value)
symbols_tmp = sympy.symbols( 'x1:%d' % (nMax + 1) )  # x1, ..., x{nMax}

# generate and cache the partial Bell polynomials B_{n,k}
bellPolynomials = {}
for n in range(1, nMax+1):
    for k in range(1, n+1):
        bp_tmp = sympy.bell(n, k, symbols_tmp)
        bellPolynomials[str(n) + '_' + str(k)] = bp_tmp

Then we iterate over the levels of function composition, substituting the numerical values of the derivatives of the base function into the Bell polynomials,

for iteration in range(nIterations):
    if( verbose ):
        print( "Evaluating derivatives for function iteration " + str(iteration+1) )

    for n in range(1, nMax+1):
        sum_tmp = 0.0
        for k in range(1, n+1):
            # kth derivative of the base (outer) function, evaluated at the fixed point
            f_k_tmp = derivatives_atFixedPoint_tmp[0, k-1]

            # substitute the previous iteration's derivatives into the Bell polynomial
            bellPolynomials_key = str( n ) + '_' + str( k )
            bp_tmp = bellPolynomials[bellPolynomials_key]
            replacements = [ ( symbols_tmp[i], derivatives_atFixedPoint_tmp[iteration, i] )
                             for i in range(n-k+1) ]
            sum_tmp = sum_tmp + ( f_k_tmp * bp_tmp.subs(replacements) )

        derivatives_atFixedPoint_tmp[iteration+1, n-1] = sum_tmp

Okay – this isn’t really making use of true recursion, merely looping, but the principle is the same. The problem one encounters is that the manipulation of the symbolic representation of the polynomials is slow, and run-times increase significantly for n > 15.

However, the n^{th} derivative can alternatively be expressed as a sum over partitions of n as,

\displaystyle \frac{d^{n}}{dx^{n}}f(g(x))\;=\;\sum \frac{n!}{m_{1}!m_{2}!\ldots m_{n}!} f^{(m_{1}+m_{2}+\ldots+m_{n})}\left ( g(x)\right )\prod_{j=1}^{n}\left ( \frac{g^{(j)}(x)}{j!}\right )^{m_{j}}   Eq.(2)

where the sum is taken over all tuples of non-negative integers m_{1}, m_{2},\ldots, m_{n} that satisfy 1\cdot m_{1}+ 2\cdot m_{2}+\ldots+ n\cdot m_{n}\;=\; n. That is, the sum is taken over all partitions of n. Fairly obviously, the Faa di Bruno formula is just a re-arrangement of the above equation, obtained by collecting together the terms involving f^{(k)}(g(x)), and that rearrangement gives the fundamental definition of the partial Bell polynomial.

I’d shied away from the more fundamental form of Eq.(2) in favour of Eq.(1), as I believed that having a number of terms already collected together in the form of the Bell polynomials would make any implementation that used them quicker. However, the advantage of the form in Eq.(2) is that the summation can be entirely numeric, provided an efficient generator of the partitions of n is available to us. Fortunately, sympy also contains a method for iterating over partitions. Below are code snippets that implement the evaluation of f_{l}^{(n)}(0) using Eq.(2). First we generate and store the partitions,

from sympy.utilities.iterables import partitions

# store the partitions of 1, ..., n (n = highest derivative order required)
pStore = {}
for k in range( n ):
    # the iterator re-uses its dictionary, so take a copy of each partition
    pIterator = partitions(k+1)
    pStore[k] = [p.copy() for p in pIterator]
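
As a quick sanity check on what the iterator returns: each partition comes back as a dictionary mapping a part to its multiplicity, e.g. the five partitions of 4 are 4, 3+1, 2+2, 2+1+1 and 1+1+1+1,

from sympy.utilities.iterables import partitions

for p in partitions(4):
    print(dict(p))   # e.g. {4: 1}, {3: 1, 1: 1}, {2: 2}, {2: 1, 1: 2}, {1: 4}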

After initializing arrays to hold the derivatives of the current function iteration, we then loop over each iteration, retrieving each partition and evaluating the product in the summand of Eq.(2). It is relatively easy to work on the log scale, which helps avoid overflow from the factorials, as shown in the code snippet below,

import numpy as np
from scipy.special import gammaln

# loop over function iterations
for iteration in range( nIterations ):

    if( verbose==True ):
        print( "Evaluating derivatives for function iteration " + str(iteration+1)  )

    for k in range( n ):
        faaSumLog = float( '-Inf' )
        faaSumSign = 1

        # get partitions
        partitionsK = pStore[k]
        for pidx in range( len(partitionsK) ):
            p = partitionsK[pidx]
            sumTmp = 0.0
            sumMultiplicty = 0
            parityTmp = 1
            for i in p.keys():
                value = float(i)
                multiplicity = float( p[i] )
                sumMultiplicty += p[i]
                sumTmp += multiplicity * currentDerivativesLog[i-1]
                sumTmp -= gammaln( multiplicity + 1.0 )
                sumTmp -= multiplicity * gammaln( value + 1.0 )
                parityTmp *= np.power( currentDerivativesSign[i-1], multiplicity )	

            sumTmp += baseDerivativesLog[sumMultiplicty-1]
            parityTmp *= baseDerivativesSign[sumMultiplicty-1]

            # now update faaSum on log scale
            if( sumTmp > float( '-Inf' ) ):
                if( faaSumLog > float( '-Inf' ) ):
                    diffLog = sumTmp - faaSumLog
                    if( np.abs(diffLog) < thresholdForExp ):
                        if( diffLog >= 0.0 ):
                            # new term has the larger magnitude: re-base the running sum on it
                            faaSumLog = sumTmp
                            faaSumLog += np.log( 1.0 + (float(parityTmp*faaSumSign) * np.exp( -diffLog )) )
                            faaSumSign = parityTmp
                        else:
                            # running sum has the larger magnitude: fold the new term into it
                            faaSumLog += np.log( 1.0 + (float(parityTmp*faaSumSign) * np.exp( diffLog )) )
                    else:
                        # magnitudes differ too much for a stable exp(): keep the dominant term
                        if( diffLog > thresholdForExp ):
                            faaSumLog = sumTmp
                            faaSumSign = parityTmp
                else:
                    faaSumLog = sumTmp
                    faaSumSign = parityTmp

        nextDerivativesLog[k] = faaSumLog + gammaln( float(k+2) )
        nextDerivativesSign[k] = faaSumSign

Now let’s run both implementations, evaluating up to the 15th derivative for 4 function iterations. Here my base function is f(t)\;=\;1\;-\;\frac{2}{\pi}\arccos t. A plot of the base function is shown below in Figure 1.

FaaDiBrunoExample_BaseFunction
Figure 1: Plot of the base function f(t)\;=\;1\;-\;\frac{2}{\pi}\arccos(t)

The base function has a relatively straight forward Taylor expansion about t=0,

\displaystyle f(t)\;=\;\frac{2}{\pi}\sum_{k=0}^{\infty}\frac{\binom{2k}{k}t^{2k+1}}{4^{k}\left ( 2k+1 \right )}\;\;\;,\;\;\;|t| \leq 1 \;\;,    Eq.(3)
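
For reference, the base-function derivatives at zero follow immediately from Eq.(3): f^{(2k+1)}(0) = (2k+1)!\,a_{2k+1}, where a_{2k+1} is the Taylor coefficient, and all even-order derivatives vanish. A short sketch of how the log-scale arrays used in the snippets above could be populated (the exact array handling in my gist differs),

import numpy as np
from scipy.special import gammaln, comb

nMax = 15                                            # highest derivative order required
baseDerivativesLog = np.full(nMax, float('-Inf'))    # log |f^(m)(0)| for m = 1, ..., nMax
baseDerivativesSign = np.ones(nMax, dtype=int)       # all non-zero derivatives are positive here

for k in range((nMax - 1) // 2 + 1):
    m = 2 * k + 1                                    # only odd-order derivatives are non-zero
    # Taylor coefficient a_m = (2/pi) * C(2k, k) / (4^k * (2k+1)); derivative f^(m)(0) = m! * a_m
    logCoeff = (np.log(2.0 / np.pi) + np.log(comb(2 * k, k, exact=True))
                - k * np.log(4.0) - np.log(2.0 * k + 1.0))
    baseDerivativesLog[m - 1] = logCoeff + gammaln(m + 1.0)

print(np.round(np.exp(baseDerivativesLog[:5]), 4))   # f'(0), ..., f^(5)(0); even orders appear as 0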

and so supplying the derivatives, f^{(k)}(0), of the base function is easy. The screenshot below shows a comparison of f_{l}^{(15)}(0) for l\in \{2, 3, 4, 5\}. As you can see we obtain identical output whether we use sympy’s Bell polynomials or sympy’s partition iterator.

FaaDiBruno_TimingComparison

The comparison of the implementations is not really a fair one. One implementation is generating a lot of symbolic representations that aren’t really needed, whilst the other is keeping to entirely numeric operations. However, it did highlight several points to me,

  • Directly working with partitions, even up to moderate values of n, e.g. n=50, can be tractable using the sympy package in python.
  • Sometimes the implementation of the more concisely expressed representation (in this case, in terms of Bell polynomials) can have significantly longer run-times, even if that representation can be implemented in fewer lines of code.
  • The history of the Faa di Bruno formula, and of the various associated polynomials and equivalent formalisms (such as the Jabotinsky matrix formalism), is a fascinating one.

I’ve put the code for both methods of evaluating the derivatives of an iterated function as a gist on github.

At the moment the functions take an array of Taylor expansion coefficients, i.e. they assume the point at which derivatives are requested is a fixed point of the base function. At some point I will add methods that take a user-supplied function for evaluating the k^{th} derivative, f^{(k)}(t), of the base function at any point t and will return the derivatives, f_{l}^{(k)}(t) of the iterated function.

I haven’t yet explored whether, for reasonable values of n (say n \leq 50), I need to work on the log scale, or whether direct evaluation of the summand will be sufficiently accurate and not result in overflow error.

Manchester R User Group Meetup – May 2017

At the latest Manchester R User Group meeting (organized by Mango Solutions) Leanne Fitzpatrick from HelloSoda gave a talk on Deploying Models in a Machine Learning Environment.

Leanne spoke about how the use of Docker had speeded up the deployment of machine learning models into the production environment, and had also enabled easier monitoring and updating of the models.

One of the additional benefits, and Leanne hinted that this may even have been the original motivation, was reducing the barriers between the data scientists and software engineers in the company. Data Science is an extremely broad church, encompassing a wide range of skill-sets and disciplines. Inevitably, there can be culture clashes between those who consider themselves to be from the ‘science’ side of Data Science and those from the engineering side. Scientists are people who like to explore data and develop proof-of-concept projects, but who are often not the most disciplined in code writing and organization, and for whom operational deployment of a model is the last stage in their thinking. Scientists break things. Scientists like to break things. Scientists learn by breaking things.

xkcd_the_difference
Scientists are different (taken from xkcd.com)

Data Scientists who break things can be seen as an annoyance to those responsible for maintaining the operational infrastructure.

Obviously, in a commercial environment the data scientists and software engineers/developers need to work as efficiently together as possible. The conclusion that Leanne presented in her talk suggested that HelloSoda have taken some steps towards solving this problem through their use of containerization of the models.  I say, ‘some steps’, as I can’t believe that any organization can completely remove all such barriers. Having worked in inter-disciplinary teams in both the commercial world and in academic research I’ve seen some teams work well together and others not. What tools and protocols an organization can use to generally reduce the barriers between investigative Data Science and operational Data Science is something that intrigues me – something for a longer post maybe.