
Testing your models before you build them

Introduction

TL;DR: There are tests on models you can do even before you have done any training of the model. These are tests of the model form, and are more mathematical in nature. These tests stop you from putting a model with a flawed mathematical form into production.

My last blogpost was on using simulation data to test a model. I was asked if there are other tests I do for models, to which I replied,  “other than the obvious, it depends on the model and the circumstances”. Then it occurred to me that “the obvious” tests might not be so obvious, so I should explain them here.

Personally, I broadly break down model tests into two categories:

  1. Tests on a model before training/estimation of the model parameters.
  2. Tests on a model after training/estimation of the model parameters.

The first category (pre-training) consists of tests on the model form – does the model make sense, does it include features in a sensible way? These are the tests that get omitted most often, and that the majority of Data Scientists don’t have in their toolkit. However, they are the tests that will spot the big, costly problems before the model makes it into production.

The second category (post-training) consists of tests on the numerical values of the model parameters and on various goodness-of-fit measures. These are the tests that most Data Scientists will know about and use regularly, so I’m not going to go into the details of any tests in this second category. What I want to focus on is the first category, as this is where I think there is a gap in most Data Scientists’ toolkits.

The tests in the first category are largely mathematical, so I’m not going to give code examples. Instead, I’m just going to give a short description of each type of test and what it tries to achieve, with the odd toy example. Let’s start.

Pre-training tests of model form

Asymptotic behaviour tests:

One of the easiest ways to test a model form is to look at its output in circumstances which are easy to understand. In a model with many features and interacting parts this is best done by seeing what happens when you make one of the variables or parameters as large as possible (or as small as possible). In these circumstances the other variables will often become irrelevant, and so the behaviour of the model is easier to spot. For example, in a demand model that predicts how much of a grocery product you’re going to sell, does putting the price up to infinity cause the predicted sales volume to drop to zero? If not, you’ve got a problem with your model.

Asymptotic behaviour tests are not limited to scenarios in which variables/parameters become very large or very small. In some cases the appropriate asymptotic scenario might be a parameter approaching a finite value at which a marked change in behaviour is expected. It should be clear that identifying asymptotic scenarios for which we can easily predict what should happen can require some domain knowledge. If you aren’t confident of your understanding of the application domain, then a good start is to make variables/parameters very large and/or very small one at a time and see if the resulting behaviour makes sense.

Typically, working out the behaviour of your model form in some asymptotic limit can be done simply by visual inspection of the mathematical form of your model, or with a few lines of pen-and-paper algebra. This gives us the leading order asymptotic behaviour. With a bit more pen-and-paper work we can also work out a formula for the next-to-leading order term in the asymptotic expansion of the model output. The next-to-leading order term tells us how quickly the model output approaches its asymptotic value – does it increase to the asymptotic value as we increase the variable, or does it decrease to the asymptotic value? We can also see which other variables and parameters affect the rate of this approach to the asymptotic value, again allowing us to identify potential flaws in the model form.
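
To make this concrete, here is a toy example of my own (not taken from any real model). Suppose predicted sales volume is modelled as a logistic function of price p,

Q(p)\;=\;\frac{K}{1 + e^{b(p - p_{0})}}\;\;\;\;,\;\; b > 0

As p \to \infty the leading order behaviour is Q(p) \to 0, so the model passes the basic sanity check. The next-to-leading order behaviour is

Q(p)\;\approx\; K e^{-b(p - p_{0})}\;\;\;\;{\rm as}\;\; p \to \infty

which tells us the predicted sales volume decays to zero exponentially fast, at a rate set by b. If b did not appear in that decay rate, or the decay depended on a variable that should be irrelevant at extreme prices, we would have spotted a flaw in the model form.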

The asymptotic expansion approach to testing a model form can be continued to even higher orders, although I rarely do so. Constructing asymptotic expansions requires some experience with specific analysis techniques, e.g. saddle-point expansions. So I would recommend the following approach,

  1. Always do the asymptotic limit (leading order term) test(s) as this is easy and usually requires minimal pen-and-paper work.
  2. Only derive the next-to-leading order behaviour if you have experience with the right mathematical techniques. Don’t sweat if you don’t have the skills/experience to do this as you will still get a huge amount of insight from just doing 1.

Stress tests/Breakdown tests:


These are similar in spirit to the asymptotic analysis tests. You’re looking to see if there are any scenarios in which the model breaks down. And by “break down”, I mean it gives a nonsensical answer, such as predicting a negative value for a quantity that in real life can only be positive. How a model breaks down can be informative. Does the scenario in which the model breaks down clearly reflect an obvious limitation of the model assumptions? In that case the breakdown is entirely expected and nothing to worry about: it is telling you what you already know, that in this scenario the assumptions don’t hold or are inappropriate, and so we expect the model to be inaccurate or not work at all. If the breakdown scenario doesn’t reflect known weaknesses of the model assumptions, you’ve either uncovered a flaw in the mathematical form of your model, which you can now fix, or you’ve uncovered an extra hidden assumption you didn’t know about. Either way, you’ve made progress.

Recover known behaviours:

Another test that has similarities to the asymptotic analysis and stress tests. For example, your model may be a generalization of a more specialized model – it may contain extra parameters that capture non-linear effects. If we set those extra parameters to zero in the model, or in any downstream mathematical analysis we have performed, then we would expect to recover the behaviour of the purely linear model. Is this what happens? If not, you’ve got a problem with your model or the downstream analysis. Again, this is using the known, expected behaviour of a nested sub-case as a check on the general model.

Coefficients before fitting:

You’re probably familiar with the idea of checking the parameters of a model after fitting, to check that those parameter values make sense. Here, I’m talking about models with small numbers of features, and hence parameters, which also have some easy interpretation. Because we can interpret the parameters, we can probably come up with what we think are reasonable ball-park values for them even before training the model. This gives us, i) a check on the final fitted parameter values, and ii) a check on what scale of output we think is reasonable from the model. We can then compare what we think should be the scale of the model output against what is needed to explain the response data. If there is a mismatch of an order of magnitude or more then we have a problem. Our model will either be incapable of explaining the training data in its current mathematical form, or one or more of the parameters is going to take an exceptional value. Either way, it is probably wise to look at the mathematical form of your model again.

Dimensional analysis:


In high school you may have encountered dimensional analysis in physics lessons. There you checked that the left-hand and right-hand sides of a formula were consistent when expressed in dimensions of Mass (M), Length (L), and Time (T). However, we can extend the idea to any sets of dimensions. If the right-hand side of a formula consists of clicks divided by spend, and so has units of \rm{[currency]}^{-1}, then so must the left-hand side. Similarly, arguments to transcendental functions such as exp or sin and cos must be dimensionless. These checks are a quick and easy way to spot if a formula is inadvertently missing a dimensionful factor.
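
As a toy illustration (my own example, not from a real project): suppose a media-mix model contains a saturation term of the form

\beta \log \left ( 1 + \frac{S}{S_{0}} \right )

where S is advertising spend. For the argument of the log to be dimensionless, S_{0} must also carry units of \rm{[currency]}. If someone instead writes \beta \log \left ( 1 + S \right ), the model output silently depends on whether spend is recorded in pounds, pence or dollars – a sure sign that a dimensionful factor is missing.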

Conclusion:

These tests of the mathematical form of a model ensure that a model is robust and its output is sensible when used in scenarios outside of its training data. And let’s be realistic here; in commercial Data Science all models get used beyond the scope for which they are technically valid. Not having a robust and sensible mathematical form for your model means you run the risk of it outputting garbage.

© 2025 David Hoyle. All Rights Reserved


Data Science Notes: 1. Bland-Altman plots

Introduction

Summary: If you are using a scatter plot to compare two datasets, rotate your data.

Three times in the last six months I’ve explained to different colleagues and former colleagues what a Bland-Altman (BA) plot is. Admittedly, the last of those explanations was because I remarked to a colleague that I’d been talking about BA-plots and they then wanted to know what they were.

BA-plots are a really simple idea. I like them because they highlight how a human’s ability to perceive patterns in data can be markedly affected by relatively small changes in how that data is presented; rotating the data in this case.

I also like them because they are from the statisticians Martin Bland and Doug Altman, who produced a well-known series of short articles, “Statistics Notes”, in the BMJ in the 1990s. Each article focused on a simple, basic, but very important statistical concept. The series ran to nearly 70 articles, and the idea was to explain ‘statistical thinking’ to a medical audience. You can find the articles at Martin Bland’s website here. Interestingly, BA-plots were not actually part of this series of BMJ articles, as Bland and Altman’s work on BA-plots had been published in earlier articles. However, I’d still thoroughly recommend having a browse of the BMJ series.

Since I’ve had to explain BA-plots three times recently, I thought I’d give it another go in a blogpost. Also, inspired by the Bland-Altman series, I’m going to attempt a series of 10 or so short blogposts on simple, basic Data Science techniques and concepts that I find useful and/or interesting. The main criterion for inclusion in my series is whether I think I can explain it in a short post, not whether I think it is important.

What is a Bland-Altman plot?

BA-plots are used for comparing similar sets of data. The original use-case was to test how reproducible a measurement process was: take two samples of data that ideally you would want to be identical and compare them using a BA-plot. For example, this could be clinical measurements made by two different clinicians across the same set of patients, where what we want to know is how reproducible the clinical measurement is when made by two different clinicians.

Perhaps the first way of visually comparing two datasets on the same objects would be to just do a scatter plot – one dataset values on the x-axis, the other dataset values on the y-axis. I’ve got an example in the plot below. In fact, I’ve taken this data from Bland and Altman’s original 1986 Lancet paper. You can see the plotted points are pretty close to the 45-degree line (shown as a black dashed line), indicating the two datasets are measuring the same thing with some scatter, perhaps due to measurement error.

Scatter plot of the original Bland-Altman PEFR data, taken from the 1986 Lancet paper.

Now, here’s the neat idea. I can do exactly the same plot, but I’m just going to rotate it clockwise by 45-degrees. A little bit of high-school/college linear algebra will convince you that I can do that by creating two new features,

\frac{1}{\sqrt{2}} \left ( y + x \right )
\frac{1}{\sqrt{2}} \left ( y - x \right )

Here x and y are our starting features, or values from the two datasets we are comparing. Typically, the pre-factors of 1/\sqrt{2} are dropped and we simply define our new features as,

A = \frac{1}{2} \left ( y + x \right )
M = \left ( y - x \right )

Now we plot M against A. I’ve shown the new plot below.

Bland-Altman plot of the PEFR data, taken from the 1986 Lancet paper.
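
If you want to try this yourself, a minimal sketch of the construction is below (with made-up x and y values; the notebook linked at the end of this post has the full version for the PEFR data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Two made-up "measurements" of the same underlying quantity
truth = rng.uniform(100, 700, size=60)
x = truth + rng.normal(0, 20, size=60)
y = truth + rng.normal(0, 20, size=60)

# Bland-Altman features: A = mean of each pair, M = difference of each pair
A = 0.5 * (y + x)
M = y - x

fig, ax = plt.subplots()
ax.scatter(A, M)
ax.axhline(0.0, color="black", linestyle="--")  # line of perfect agreement
ax.set_xlabel("A = (x + y) / 2")
ax.set_ylabel("M = y - x")
ax.set_title("Bland-Altman plot")
plt.show()
```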

Now a couple of things become clearer. Firstly, A is the mean of x and y and so it gives us a better estimate of any common underlying value than just x on its own or y on its own. It gives us a good estimate of the size of the ‘thing’ we are interested in. Secondly, M is the difference between x and y. M tells us how different x and y are. Plotting M against A as I’ve done above shows me how reproducible the measurement is because I can easily see the scale of any discrepancies against the new vertical axis. I also get to see if there is any pattern in the level of discrepancy as the size of the ‘thing’ varies on the new horizontal axis. This was the original motivation for the Bland-Altman plot – to see the level of discrepancy between two sets of measurements as the true underlying value changes.

What the eye doesn’t see

What I really like about BA-plots, though, is how much easier I find it to pick out whether there is any systematic pattern to the differences between the two datasets. I haven’t looked into the psychological theory of visual perception, but it makes sense to me that humans would find it easier to look for differences by moving their eyes along one dimension – the horizontal axis – compared to moving their eyes across two dimensions – both the horizontal and vertical axes – when trying to scan the 45-degree line.

I first encountered BA-plots 25 years ago in the domain of microarray analysis. In that domain they were referred to as MA-plots (for obvious reasons). The choice of the symbols M and A also had a logic behind it. M and A are constructed as linear combinations of x and y: we “Add” them when constructing A and “Minus” them when constructing M. Hence the symbols M and A even tell you how to calculate the new features. You will also see BA-plots referred to as Tukey mean-difference plots (again for obvious reasons).

In microarray analysis we were typically measuring the levels of mRNA gene expression for every gene in an organism across two different environmental conditions. We expected some genes to show differences in expression, and so a few data points were expected to show deviations from zero on the vertical M-axis. However, we didn’t expect broad systematic differences across all the genes, so we expected a horizontal data cloud on the MA-plot. Any broad systematic deviation from a horizontal data cloud was indicative of a systematic bias in the experimental set-up that needed to be corrected for. The MA-plots gave an easy way to visually detect any bias, and also suggested an easy way to correct it: fit a non-linear trendline through the data cloud, say using a non-parametric method like lowess. The vertical difference between a datapoint and the trendline was then our estimate of the bias-corrected value of M for that datapoint.

To illustrate this point I’ve constructed a synthetic example below. The left-hand plot shows the raw data in a standard scatterplot. The scatterplot suggests there is good agreement between the two samples – maybe a bit of disagreement but not much. However, when we look at the same data as a Bland-Altman plot (right-hand plot) we see a different picture. We can clearly see a systematic pattern to the discrepancy between the two samples. I’ve also estimated this systematic variation by fitting a non-linear trendline (in red) using the lowess function in the Python statsmodels package.

Scatterplot (left) and Bland-Altman plot (right) for the second example dataset.
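
A sketch of that trendline correction, using the lowess function from the Python statsmodels package mentioned above (the variable and function names here are my own, not necessarily those used in the notebook):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def ba_bias_correct(A, M, frac=0.5):
    """Fit a lowess trendline of M against A and return the bias-corrected M values."""
    # lowess returns an array of (sorted A, fitted M) pairs
    fitted = lowess(M, A, frac=frac, return_sorted=True)
    # Interpolate the trendline back onto the original A values
    trend = np.interp(A, fitted[:, 0], fitted[:, 1])
    return M - trend, trend

# Reusing A and M from the earlier sketch:
# M_corrected, trend = ba_bias_correct(A, M)
```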

Sometimes we may expect a global systematic shift between our paired data samples, i.e. a constant vertical shift on the M-axis, or at least we can explain/interpret such a shift. Or there may be other patterns of shift we can comfortably interpret. This widens the application domains we can use BA-plots for. In commercial Data Science I’ve seen BA-plots used to assess the reproducibility of metrics on TV streaming advertising, and also the calibration of transaction data across different supermarket stores. Next time you’re using a vanilla scatterplot to compare two data series, think about rotating it and making a BA-plot.

All the code for the examples I’ve given in this post is in the Jupyter notebook DataScienceNotes1_BlandAltmanPlots.ipynb which can be found in the public GitHub repository  https://github.com/dchoyle/datascience_notes. Feel free to clone the repository and play with the notebook. I’ll be adding to the repository as I add further “Data Science Notes” blogposts.

© 2025 David Hoyle. All Rights Reserved

The need for simulation

TL;DR: Poor mathematical design and testing of models can lead to significant problems in production. Finding suitable ground-truth data for testing models can be difficult. Yet many Data Science models make it into production without appropriate testing. In these circumstances testing with simulated data can be hugely valuable. In this post I explain why and how. In fact, I argue that testing Data Science models with simulated data should be non-negotiable.

Introduction

Imagine a scenario. You’re the manager of a Premier League soccer team. You wouldn’t sign a new striker without testing whether they could actually kick a ball. Would you?

In the bad old days before VAR it was not uncommon for a big centre-back to openly punch a striker in the face if the referee and assistant referees weren’t looking. Even today, just look at any top-flight soccer match and you’ll see the blatant holding and shirt-pulling that goes on. Real-world soccer matches are dirty. A successful striker has to deal with all these realities of the game, whilst also being able to kick the ball into the net. At the very least, when signing a new striker you’d want to test whether they could score under ideal, benign conditions. Wouldn’t you? You’d put the ball on the penalty spot, with an open goal, and see if your new striker could score. Wouldn’t you? Passing this test wouldn’t tell you that your striker will perform well in a real game, but failing this “ideal conditions” test would tell you that they won’t perform well in real circumstances. I call this the “Harry Redknapp test” – some readers will understand the reference¹. If you don’t, then read the footnote for an explanation.

How is this relevant to Data Science? One of the things I routinely do when implementing an algorithm is to test that implementation on simulated data. However, a common reaction I get from other Data Scientists is, “oh I don’t test on simulated data, it’s not real data. It’s not useful. It doesn’t tell you anything.” Oh yes it does! It tells you whether the algorithm you’ve implemented is accurate under the ideal conditions it was designed for. If your implementation performs badly on simulated data, you have a big problem! Your algorithm or your implementation of it has failed the “Harry Redknapp test”.

“Yeah, but I will have some ground-truth data I can test my implementation on instead, so I don’t need simulated data.” Not always. Are you 100% sure that ground-truth data is correct? And what if you’re working on an unsupervised problem?

“Ok, but the chances of an algorithm implemented by experienced Data Scientists making it into production untested and with really bad performance characteristics is small”. Really!? I know of at least one implemented algorithm in production at a large organization that is actually an inconsistent estimator. An inconsistent estimator is one of the biggest sins an algorithm can commit. It means that even as we give the algorithm more and more ideal training data, it doesn’t produce the correct answer. It fails the “Harry Redknapp test”. I won’t name the organization in order to protect the guilty. I’ll explain more about inconsistent estimators later on.

So maybe I’ve convinced you that simulated data can be useful. But what can it give you, what can’t it give you, and how do you go about it?

What simulation will give you and what it won’t

To begin, we need to highlight some general but very important points about using simulated data:

  1. Because we want to generate data, we need a model of the data generation process, i.e. we need a generative model².
  2. Because we want to mimic the stochastic nature of real data, our generative model of the data will be a probabilistic one.
  3. Because we are generating data from a model, what we can test are algorithms and processes that use that data, e.g. a parameter estimation process. We cannot test the model itself. Our conclusions are conditional on the model form being appropriate.

With those general points emphasized, let’s look in detail at what we can get from testing with simulated data.

What simulated data will give you

We can get a great deal from simulated data. As we said above, what we get is insight into the performance of algorithms that process the data, such as the parameter estimation process. Specifically, we can check whether our parameter estimation algorithm is, under ideal conditions,

  • Consistent
  • Biased
  • Efficient
  • Robust

I’ll explain each of these in detail below. We can also get insight into how fast our parameter estimation process runs or how much storage it requires. Running tests using simulated data can be extremely useful.

Consistency check

As a Data Scientist you’ll be familiar with the idea that if we have only a small amount of training data, our parameter estimates for our trained model will not be accurate. However, if we have a lot of training data that matches the assumptions on which our parameter estimation algorithm is based, then we expect the trained parameter estimates to be close to their true values, i.e. close to the values which generated the data. As we increase the amount of training data, we expect our parameter estimates to get more and more accurate, converging ultimately to the true values in the limit of an infinite amount of training data. This is consistency.

In statistics, a formula or algorithm for estimating the parameters of a model is called an estimator. There can be multiple different estimators for the same model, some better than others. A consistent estimator is one whose estimates converge to the true values as the amount of training data grows towards infinity. An inconsistent estimator is one whose estimates do not converge to the true values, no matter how much training data you give it. Think about that for a moment,

An inconsistent estimator is an algorithm that doesn’t get better even when we give it a load more training data.

That is a bad algorithm! That is why I say constructing an inconsistent estimator is one of the worst sins a Data Scientist can commit. Very occasionally (rarely), an inconsistent estimator is constructed deliberately because it has other useful properties. But in general, if you encounter an inconsistent estimator you should take it as a sign of incompetence on the part of the Data Scientist who constructed it.

“Okay, okay, I get it. Inconsistent estimators are bad. But I don’t have an infinite amount of training data, so how can I actually check if my algorithm produces a consistent estimator? Surely, it can’t be done?” Yes, it can be done. What we’re looking for is convergence, i.e. parameter estimates getting closer and closer to the true values as we increase the training set size. I’ll give a demonstration of this in the next section when I show how to set up some simulation tests.

Bias check

Along with the concept of consistency comes the concept of bias. Consistency tells us what happens in the limit of an infinite amount of training data. It doesn’t mean the estimator’s expected value equals the true value for a finite amount of training data. It is possible to have a consistent estimator that is biased. This means the estimator, on average, will differ from the true value when we use a finite amount of training data. For a consistent estimator, any such bias will disappear as we continually increase the amount of training data.

As you might have guessed, the best algorithms produce estimators that are consistent and unbiased. Knowing if your estimator is biased and by how much is extremely useful. Again, we can assess bias using simulated data, and I’ll show how to do this in the next section when I show how to set up some simulation tests.

Efficiency check

So far, we have spoken about the expectation, or average, properties of an algorithm/estimator. But what about its variance? It is all very well telling me that across lots of different instances of training datasets my algorithm would, on average, get the right answer, or near the right answer, under ideal conditions, but in the real world I have only one training dataset. Am I going to be lucky, with my particular training data giving parameter estimates close to the average behaviour of the algorithm? I’ll never know. But what I can know is how variable the parameter estimates from my algorithm are. I can do this by calculating the variance of the parameter estimates over lots of training datasets. A small variance will tell me that my one real-world dataset is likely to give performance close to the mean behaviour of the algorithm. I may still be unlucky, with my particular training data giving parameter estimates a long way from the average estimates, but it is unlikely. However, a large variance tells me that parameter estimates obtained from a single training dataset will often be a long way from the average estimates.

How can I calculate this variance of parameter estimates over training datasets? Simple, get lots of different training datasets produced under identical controlled conditions. How could I do that? Yep, you guessed it. Simulation. With a simulation process coded up, we can easily generate multiple instances of training datasets of the same size and generated under identical conditions. Again, I’ll demonstrate this in the next section.

Sensitivity check – robustness to contamination

Our message about simulated data is that it allows you to test your algorithm under conditions that match the assumptions made by the algorithm, i.e. under ideal conditions. But you can also use simulation to test how well your algorithm performs in non-ideal conditions. For example, we can introduce contamination into the simulated data, such as drawing some response variable values from a non-Gaussian distribution when our algorithm assumes the response variable is purely Gaussian distributed. We can produce multiple simulated datasets with different percentages of contamination and so test how sensitive or robust our estimation algorithm is to the level of contamination, i.e. how sensitive it is to non-ideal data.

In the first few pages of the first chapter of his classic textbook on Robust Statistics, Peter Huber describes analysis of an experiment originally due to John Tukey. The analysis reveals that even having just 2% of “bad” datapoints being drawn from a different Gaussian distribution (with a 3-fold larger standard deviation) is enough to markedly change the properties and efficiency of common statistical estimators. And yet, defining “bad” data as being drawn from a larger variance Gaussian is wonderfully simplistic. Real-world data is so much nastier.
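
To give a flavour of this kind of sensitivity check, here is a small sketch of my own (not Huber’s exact set-up): it contaminates a standard Gaussian sample with a small fraction of points drawn from a Gaussian with a 3-fold larger standard deviation, and compares the variability of two location estimators, the sample mean and the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator_variances(contamination, n=100, n_sims=5000):
    """Variance of the sample mean and sample median under a contaminated Gaussian."""
    means, medians = [], []
    for _ in range(n_sims):
        is_bad = rng.random(n) < contamination    # which points are "bad"
        scale = np.where(is_bad, 3.0, 1.0)        # bad points have 3x the std dev
        sample = rng.normal(0.0, scale)
        means.append(sample.mean())
        medians.append(np.median(sample))
    return np.var(means), np.var(medians)

for c in [0.0, 0.02, 0.10]:
    var_mean, var_median = estimator_variances(c)
    print(f"contamination={c:.2f}: var(mean)={var_mean:.4f}, var(median)={var_median:.4f}")
```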

What form should the data contamination take? There are multiple ways in which data can become contaminated. There can be changes in statistical properties, like the simple example we used above, or drift in statistical properties such as a non-stationary mean or a non-stationary variance. But you can get more complicated errors creeping into your data. These typically take two forms,

  • Human induced data contamination: These can be misspelling or mis-(en)coding errors that result from not using controlled and validated vocabularies for human data-entry tasks. You’ll recognize these sorts of errors when you see multiple different variants of the name of the same country, US county or UK city, say. You might think it is difficult to simulate such errors, but there are some excellent packages to do so – check out the messy R package produced by Dr. Nicola Rennie, which allows you to take a clean dataset and introduce these sorts of encoding errors into it. Spotting these errors can be as simple as plotting distributions of unique values in a table column, i.e. looking for unusual distributions. In R there are a number of packages to help you do this.
  • Machine induced errors: These are errors that arise from the processing or transferring of data. These can be as simple as incorrect datetime stamps on rows in a database table, or can be as complex as repeating blocks of rows in a table. These errors are less about contamination and more about alteration. The common element here is that there is a pattern to how the data has become altered or modified and so spotting the errors involves visual inspection of the individual rows of the table, combined with plotting lagged or offset data values. The machine induced errors arise because of bugs in processing code, and these can be either coding errors, e.g. a typo in the code, or unintended behaviour, e.g. datetime processing code that hasn’t been designed properly to correctly handle daylight saving switchovers.

What kind of data contamination should I simulate? This is a “how long is a piece of string” kind of question. It very much depends on what aspect of your algorithm or implementation you want to test for robustness, and only you can know that. You may have to write some bespoke code to simulate the sorts of errors that arise in the processes you use or are exposed to. Broadly speaking, the robustness of an estimator is tested by changes in the statistical properties of the input data, which can be simulated through data drift or human-induced contamination. Machine-induced errors, on the other hand, imply you have some sort of deployed pipeline, so simulating machine-corrupted data is best when you want to stress-test your end-to-end pipeline.

Runtime scaling

There are also checks that simulated data allows you to perform that aren’t necessarily directly connected to the accuracy or efficiency of the parameter estimates. Because we can produce as much simulated data as we want, we can easily test how long our estimation algorithm takes for different sized datasets. Similarly, we can also use simulated data to test the memory and storage requirements of the algorithm.
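
A quick runtime check can be as simple as the following sketch, which times an OLS fit on increasingly large simulated datasets (what you actually time will, of course, be your own estimation code):

```python
import time
import numpy as np

rng = np.random.default_rng(0)

for n in [10_000, 100_000, 1_000_000]:
    X = rng.normal(size=(n, 10))
    y = X @ rng.normal(size=10) + rng.normal(size=n)
    start = time.perf_counter()
    np.linalg.lstsq(X, y, rcond=None)        # the estimation step being timed
    print(n, round(time.perf_counter() - start, 3), "seconds")
```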

We can continue this theme. Because we can tune and tweak the generation of the simulated data, this can also allow us to generate data to test very specific scenarios – corner cases – for which we don’t have real test data. The ability to generate simulated data increases the test coverage we can perform.

What simulated data won’t give you

Identify model mis-specification

Using simulated data will tell you how well your model training algorithm performs on data that matches precisely the form of the model you have used. It won’t tell you if your model form is correct or appropriate for the real data you will ultimately apply it to. It won’t tell you if you’ve omitted an important feature or if you’ve put non-linearity into your model in an incorrect way. Getting the model form right can only come from i) domain expertise, ii) testing on real ground-truth data. Again, what this highlights is that we use simulated data to test the training process, not the model.

This can trip up even experienced researchers. I recently saw a talk from an academic researcher who tested two different model forms using simulated data generated from one of the models. When the model form used to generate the data fitted the simulated data better, they confidently claimed that this model was better and more correct. Well, of course it was – for this simulated data!

Accuracy of your model on real data

For simulated data we have the ground-truth values of the response variable, so we can assess the prediction accuracy, either on training data or on holdout test data. However, unless our simulation process produces very realistic data, including the various contamination processes, the test-set accuracy on simulated data cannot be used as a precise measure of the predictive accuracy of the trained model on real unseen data.

How to simulate

When producing simulated data for testing an algorithm related to a model there are two things we need to generate – the features and the response. There are two ways we can approach this,

  1. Simulate the features and then simulate the response given the feature values we just produced.
  2. Simulate just the response value given some pre-existing feature values.

Of these, 2 sounds easier, but I will discuss 1 first as it leads us naturally into discussing where we might get pre-existing feature values from.

Simulating features and response

As we said above, in this approach we simulate the features first, and this allows us to construct the distribution of the response variable conditional on the features. We can then sample a value from that conditional distribution. Our basic recipe is

  1. Sample the feature values from a distribution.
  2. Use the sampled feature values and the model form to construct the distribution of the response variable conditional on the features.
  3. Sample the response variable from the conditional distribution constructed in 2.

How complex we want to make the feature distribution depends on how realistic we need our features to be and what aspect of the estimation/training algorithm we are wanting to test.

For real-world problems, it is unlikely that the features follow a Gaussian distribution. Take demand modelling, an area I have worked in a lot. The main feature we use is the price of the product whose demand we are trying to predict. Prices are definitely not Gaussian distributed. Retailers repeatedly switch between a regular and promotional price over a long period of time, so that we have a sample distribution of prices that is represented by two Dirac-delta functions. A more interesting price time series may introduce a few more price points, but it is still definitely not Gaussian. Similarly, real data has correlations between features.

When simulating a feature, we have to decide how important the real distribution is to the aspect of the estimation/training algorithm that we want to test. If we want to simulate with realistically distributed features, this can be problematic. We’ll return to this issue, and to real data, later on, but for now we emphasize that we can still test whether our estimator is consistent, or assess its bias, using features drawn from independent Gaussian distributions. So there are still useful tests of our estimation algorithm we can carry out. Let’s see how we can do that.

Linear model example

We’ll use a simple linear model that depends on three features, x_{1}, x_{2}, x_{3}. The response variable y is given by,

y\; =\;\beta_{1} x_{1}\;+\; \beta_{2}x_{2}\;+\;\beta_{3}x_{3} \;+\;\epsilon\;\;\;\;,\;\; \epsilon\;\sim\; {\cal{N}}\left ( 0, \sigma^{2}_{\epsilon}\right )

From which you can see both the linear dependence on the features and that y contains Gaussian additive noise \epsilon.

Simulating data is now easy once we have the structure of our probabilistic model. Given a user-specified mean \mu_{1} and variance \sigma^{2}_{1} we can easily sample a value for x_{1} from {\cal{N}}\left ( \mu_{1}, \sigma^{2}_{1}\right ). Similarly, given user-specified means \mu_{2}, \mu_{3} and variances \sigma^{2}_{2}, \sigma^{2}_{3}, we can generate values for x_{2} and x_{3}.  If we have user-specified values of \beta_{1}, \beta_{2}, \beta_{3} we can then easily generate a value for y by sampling from {\cal{N}}\left ( \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}, \sigma^{2}_{\epsilon} \right ), where \sigma^{2}_{\epsilon} is the variance of the additive noise that we want to add to our response variable. To simulate N datapoints we repeat that recipe N times. Let’s apply that recipe to assess an estimator of the model parameters \beta_{1}, \beta_{2}, \beta_{3}. We’ll assess the standard Ordinary Least Squares (OLS) estimator for a linear model.
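
A minimal sketch of that recipe is below (my own stripped-down version; the full code is in the notebook linked at the end of this example):

```python
import numpy as np

rng = np.random.default_rng(123)

def simulate_linear_data(n, beta, feature_means, feature_sds, noise_sd):
    """Simulate n datapoints: Gaussian features, then y = X @ beta + Gaussian noise."""
    d = len(beta)
    # Step 1: sample the features from independent Gaussians
    X = rng.normal(loc=feature_means, scale=feature_sds, size=(n, d))
    # Steps 2 and 3: construct the conditional mean and sample the response
    y = X @ beta + rng.normal(0.0, noise_sd, size=n)
    return X, y

# Example: three features with user-specified true parameters
beta_true = np.array([2.0, -1.0, 0.5])
X, y = simulate_linear_data(1000, beta_true,
                            feature_means=[0.0, 1.0, -1.0],
                            feature_sds=[1.0, 2.0, 0.5],
                            noise_sd=1.0)
```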

Assessing the OLS Estimator for a linear model

Given a feature matrix \underline{\underline{X}} (the ith row of the matrix is the feature values for the ith observation) and vector \underline y = \left ( y_{1}, y_{2},\ldots,y_{N}\right ) that represents the N observations of the response variable, then the Ordinary Least Squares (OLS) estimator \hat{\beta} of the true model parameters \underline{\beta} = \left ( \beta_{1}, \beta_{2}, \beta_{3}\right ) is given by the formula,

\underline{\hat{\beta}}\;=\; \left ( \underline{\underline{X}}^{\top}\, \underline{\underline{X}}\right ) ^{-1} \underline{\underline{X}}^{\top} \underline{y}\;\;\;\;\;\;\;{\rm Eq.1}

Note that the OLS estimator is a linear combination of the observations y_{1}, y_{2}, \ldots, y_{N}, with a weight matrix \left ( \underline{\underline{X}}^{\top}\, \underline{\underline{X}}\right )^{-1} \underline{\underline{X}}^{\top}. We’ll come back to this point in a moment.

What we want to know is how close the estimate \underline{\hat{\beta}} is to \underline{\beta}. Is the OLS estimator in Eq. 1 above a biased estimator of \underline{\beta}, and is it a consistent estimator?

The plots below show the bias (mean error) for each of the model parameters, plotted against training dataset size N. I constructed the plots by initializing a true model parameter vector \underline{\beta} and then generating 1000 simulated training datasets for each of the different values of N. For each simulated training dataset I computed the OLS parameter estimate \hat{\underline{\beta}} and then computed the parameter estimate errors \hat{\underline{\beta}} - \underline{\beta}. From the errors I then calculated their sample means and variances (over the simulations) for each value of N.
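
The assessment loop behind those plots can be sketched as follows (re-using the simulate_linear_data sketch from above; treat it as illustrative rather than the exact notebook code):

```python
import numpy as np

def assess_ols(beta_true, n_values, n_sims=1000):
    """Sample mean and variance of the OLS parameter errors over simulated datasets."""
    results = {}
    for n in n_values:
        errors = np.empty((n_sims, len(beta_true)))
        for s in range(n_sims):
            X, y = simulate_linear_data(n, beta_true,
                                        feature_means=[0.0, 1.0, -1.0],
                                        feature_sds=[1.0, 2.0, 0.5],
                                        noise_sd=1.0)
            # OLS estimate (Eq. 1), computed via a numerically stable least-squares solve
            beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
            errors[s] = beta_hat - beta_true
        results[n] = (errors.mean(axis=0), errors.var(axis=0))  # bias and variance per parameter
    return results

results = assess_ols(np.array([2.0, -1.0, 0.5]), n_values=[50, 100, 500, 1000])
```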

You can see from the plots that whilst the mean error fluctuates it doesn’t systematically change with N. Furthermore, it fluctuates around zero, suggesting that the OLS estimator is unbiased. And indeed it is. It is possible to mathematically show that the OLS estimator is unbiased at any finite value of N. The reason we get a non-zero value in this case is because we have estimated \mathbb{E}\left ( \hat{\beta}_{i}\right ) using a sample average taken over 1000 simulated datasets. If we had used a larger number of simulated datasets we would have got even smaller sample average parameter errors.

Contrast this behaviour with how the variances of the parameter estimate errors change with N in the plots below.

The decrease, with N, in the variance of \hat{\underline{\beta}} - \underline{\beta} is marked. In fact, it looks like a power-law decrease, so I have plotted the same data on a log-scale below,

We can see from those log-log plots that the variances of \hat{\beta}_{i} - \beta_{i},\; i=1,2,3 decrease as N^{-1}. That implies that as we use larger and larger training sets any single instance of \hat{\underline{\beta}} will get closer and closer to \underline{\beta}. At large N we have a low probability of being unlucky and our particular training set giving a poor estimate of \underline{\beta}.

How efficient is the OLS estimator in Eq.1? Is the rate at which {\rm Var}\left ( \hat{\beta}_{i} - \beta_{i}\right ) decreases with N good or bad? It turns out that the OLS estimator in Eq.1 is the Best Linear Unbiased Estimator (BLUE). For an unbiased estimator of \underline{\beta} that is constructed as a linear combination of the observations \underline{y}, you cannot do better than the OLS estimator in Eq. 1.

All the code for the linear model example is available in the Jupyter notebook NeedForSimulation_Blogpost.ipynb in the GitHub repository https://github.com/dchoyle/simulation_blogpost.

A linear model is a relatively simple structure, but the example is a good demonstration of the power of simulated data. Next, we’ll use a more complex model architecture and build a feed-forward neural network.

Neural network example

Our simulated neural network output has the form,

y\;=\; f\left( \underline{x}| \underline{\theta} \right ) \;+\; \epsilon

Again, we’ll use zero-mean Gaussian additive noise, \epsilon \sim {\cal{N}}\left (0, \sigma^{2}_{\epsilon}\right ).

The function f\left( \underline{x}| \underline{\theta} \right ) represents our neural network function, with \underline{x} being the vector of input features and \underline{\theta} being a vector holding all the network parameters. For this demo, I’m going to use a 3 input-node, 2 hidden-layer feed-forward network, with 10 nodes in each of the hidden layers. The output layer consists of a single node, representing the variable y. For the non-linear transfer (activation) functions I’m going to use \tanh functions. So, schematically, my network looks like the figure below,

I’m going to use a teacher network of the form above to generate simulated data, which I’ll then use to train a student network of the same form. What I want to test is: does my training process produce a trained student network whose predictions on a test set get more and more accurate as I increase the amount of training data? If not, I have a problem. If my training process doesn’t produce accurate trained networks on ideal data, it isn’t going to produce accurate networks when using real data. I’m less interested in comparing trained student network parameters to the teacher network parameters because, a) there are a lot of them to compare, and b) since the output of a network is invariant to within-layer permutation of the hidden layer node labels and connections, defining a one-to-one comparison of network parameters is not straightforward here. Node 1 in the first hidden layer of the student isn’t necessarily equivalent to node 1 in the first hidden layer of the teacher network, and so on.

The details of how I’ve coded up the networks and set up the evaluation are lengthy, so I’ll just show the final result here. All the details can be found in the Jupyter notebook NeedForSimulation_Blogpost.ipynb in the freely accessible GitHub repository.
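
For orientation, here is a minimal sketch of the teacher-student idea, using scikit-learn’s MLPRegressor as a stand-in for the student (this is my own simplification, not the implementation in the notebook):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
noise_sd = 0.1

# Teacher network: a fixed, randomly initialized 3-10-10-1 tanh network
W1, b1 = rng.normal(size=(3, 10)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 10)), rng.normal(size=10)
w3, b3 = rng.normal(size=10), rng.normal()

def teacher(X):
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h2 @ w3 + b3

def make_dataset(n):
    X = rng.normal(size=(n, 3))
    return X, teacher(X) + rng.normal(0.0, noise_sd, size=n)

X_test, y_test = make_dataset(2000)

for n_train in [100, 500, 2000, 8000]:
    X_train, y_train = make_dataset(n_train)
    student = MLPRegressor(hidden_layer_sizes=(10, 10), activation="tanh",
                           max_iter=5000, random_state=0)
    student.fit(X_train, y_train)
    mse_student = np.mean((student.predict(X_test) - y_test) ** 2)
    mse_teacher = np.mean((teacher(X_test) - y_test) ** 2)   # baseline, roughly the noise variance
    print(n_train, mse_student / mse_teacher)                # relative test-set MSE
```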

Below, in the left-hand plot, I’ve plotted the average Mean Square Error (MSE) made by the trained student network on the test sets, against the training dataset size. The average MSE is the average over the simulations for that training set size. For comparison, I have also calculated the average test-set MSE of the teacher network. Since the test-set data contains additive Gaussian noise, the teacher network won’t make perfect predictions on the test-set data, even though the teacher network generated the systematic part of the test-set response values. The average test-set MSE of the teacher network therefore provides a benchmark or baseline against which we can assess the trained student network, and we have a ready intuition about the relative test-set MSE (the student MSE divided by the teacher MSE). We expect the relative test-set MSE to be significantly above 1 at small values of N, as the student network struggles to learn the teacher network output. As the amount of training data N increases, we expect the relative test-set MSE to approach 1 from above. The average relative test-set MSE is plotted in the right-hand plot below.

We can see from both plots above that the prediction error of a trained student network typically decreases as the amount of training data increases. My network training process has passed this basic test. The test was quick to set up and gives me confidence I can run my code over real data.

Sampling features from more realistic distributions

In our previous examples we have used independent features, sampled from simple but naive distributions, to test the convergence properties of an estimator. But what happens if you want to assess the quantitative performance of an estimator for more realistic feature patterns? Well, we use more realistic feature patterns. This is a variant of our previous basic recipe, but where we have access to a real dataset. The modified recipe is,

  1. Sample an observation from the real dataset and keep the features.
  2. Use the sampled feature values and the model form to construct the distribution of the response variable conditional on the features.
  3. Sample the response variable from the conditional distribution constructed in 2.

This seems like a small modification of the recipe. However, it does have some big implications. We can’t generate simulated datasets of arbitrarily large size as we are limited by the size of the real dataset. We can obviously generate simulated datasets of smaller size than the real data, but this can make testing of the convergence properties of an estimator difficult.

That said, this is one of my favourite approaches. Often, steps 2 and 3 are easy to implement. You’ll have a function for the conditional mean of the response variable already coded up for prediction purposes, so it is just a question of pushing some feature values through that code. I find the overhead of writing extra functions to simulate realistic-looking feature values is significant, both in terms of time and in thinking about what ‘realistic’ should look like. The recipe above gets round this easily. Simply pick a row from your existing real dataset at random and there you go – you have some realistic feature values. As before, the recipe allows me to generate response values with known ground-truth parameter values. So overall I can compare parameter estimates to ground-truth parameter values on realistic feature values, allowing me to check that my estimation algorithm is at least semi-accurate on realistic feature values. You can also choose, in step 1 of the recipe, whether you want to sample a row of feature values with or without replacement.
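
A sketch of this modified recipe is below. It assumes the real features sit in a pandas DataFrame called real_df, and that you already have a function for the conditional mean of the response given the features; both names are mine and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_from_real_features(real_df, conditional_mean_fn, n, noise_sd, replace=True):
    """Step 1: resample rows of real features; steps 2 and 3: generate the response."""
    sampled = real_df.sample(n=n, replace=replace, random_state=1)   # real feature rows
    mu = conditional_mean_fn(sampled)                 # conditional mean given the sampled features
    y_sim = mu + rng.normal(0.0, noise_sd, size=n)    # add the response noise
    return sampled, y_sim
```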

Simulating the response only

You could argue that simulating response values with feature values sampled from an existing real dataset is an example of just simulating the response. After all, only the response value is computer generated. I still tend to think of it as simulating the features because, i) I am still sampling the features from a distribution function, the empirical distribution function in this case, and ii) I have broken some of the link between the features and the response in the real data because I have sampled the feature values separately. However, sometimes we want to keep as much of the link between features and response values in the real data as possible. We can do this by only making additions to the real data, which by necessity means only adding to the response value. This may sound very restrictive, but in fact there are many situations where this is precisely the kind of data we need to test an estimation algorithm – for example, changepoint detection or unconditional A/B testing. In these situations we take the real data, identify the split point where we want to increase the response value (the changepoint or the A/B grouping) and simply increase the response. Hey presto, we have realistic data with a guaranteed increase in the response variable at a known location. By changing the size of the increase in the response variable we can use this approach to assess the statistical power of the changepoint or A/B test algorithm.

The plots below show an example of introducing a simple shift in level at timepoint 53 into a real dataset. We have only shown the process as a simple schematic, but coding it up yourself is only a matter of a line or two of code, so I haven’t given any code details.

In the above example I simply increased the response variable by the same amount (8.285 in this case) at and after timepoint 53. If, instead, you only want to increase the average value of the response variable, it is a simple modification of the process to include some additional zero-mean noise after the changepoint location.
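
For example, the ‘line or two’ for the level shift above might look like the following (using the shift of 8.285 at timepoint 53 from the example, and assuming the series is held in a NumPy array y):

```python
import numpy as np

shift, changepoint = 8.285, 53

y_shifted = y.copy()
y_shifted[changepoint:] += shift     # constant level shift at and after the changepoint

# To shift only the *average* level, add zero-mean noise on top of the shift
rng = np.random.default_rng(0)
y_noisy_shift = y.copy()
y_noisy_shift[changepoint:] += shift + rng.normal(0.0, 1.0, size=len(y) - changepoint)
```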

Conclusions

Simulated data is extremely useful. It can give you lots of insight into the performance of your training/estimation algorithm (including bug detection). Its main advantages are that it:

  • Is easy to produce in large volumes.
  • Can be produced in a user-controlled way.
  • Gives you ground-truth values.
  • Gives you a way to assess the performance of your training algorithm when you have no real ground-truth data.
  • Stops you releasing a poor, untested training algorithm into production.

If you don’t want to sign an absolutely useless striker for your data science model team, test with simulated data at the very minimum.

Footnotes
  1. Harry Redknapp is a former English Premier League football manager. Whilst Redknapp was manager of Tottenham Hotspur he had a reputation for being willing to sign players on the flimsiest of evidence of footballing skills. At a time when there was a large influx of overseas players into the Premier League, due to their reputation for superior technical football skills, it was joked that he would sign a player simply because of how their name sounded, without any checks on the player at all.
  2. The term generative model precedes its usage in Generative AI. Broadly speaking, a generative model is a machine learning model that learns the underlying probability distribution of the data and can generate new, similar data instances. The usage of the term was popular around the early 2000s, particularly when discussing different forms of classifiers, which were described as being either generative or discriminative.

© 2025 David Hoyle. All Rights Reserved

Extreme Components Analysis

TL;DR: Both Principal Components Analysis (PCA) and Minor Components Analysis (MCA) can be used for dimensionality reduction, identifying low-dimensional subspaces of interest as those which have the greatest variation in the original data (PCA), or those which have the least variation in the original data (MCA). As real data will contain both directions of unusually high variance and directions of unusually low variance, using just PCA or just MCA will lead to biased estimates of the low-dimensional subspace. The 2003 NeurIPS paper from Welling et al unifies PCA and MCA into a single probabilistic model, XCA (Extreme Components Analysis). This post explains the XCA paper of Welling et al and demonstrates the XCA algorithm using simulated data. Code for the demonstration is available from https://github.com/dchoyle/xca_post

A deadline

This post arose because of a deadline I have to meet. I don’t know when the deadline is, I just know there is a deadline. Okay, it is a self-imposed deadline, but it will start to become embarrassing if I don’t hit it.

I was chatting with a connection, Will Faithfull, at a PyData Manchester Leaders meeting almost a year ago. I mentioned that one of my areas of expertise was Principal Components Analysis (PCA), or more specifically, the use of Random Matrix Theory to study the behaviour of PCA when applied to high-dimensional data.

A recap of PCA

In PCA we are trying to approximate a d-dimensional dataset by a reduced number of dimensions k < d. Obviously we want to retain as much of the structure and variation of the original data as possible, so we choose our k-dimensional subspace such that the variance of the original data within the subspace is as high as possible. Given a mean-centered data matrix \underline{\underline{X}} consisting of N observations, we can calculate the sample covariance matrix \hat{\underline{\underline{C}}} as¹,

\hat{\underline{\underline{C}}} = \frac{1}{N-1} \underline{\underline{X}}^{\top} \underline{\underline{X}}

Once we have the (symmetric) matrix \hat{\underline{\underline{C}}} we can easily compute its eigenvectors \underline{v}_{i}, i=1,\ldots, d, and their corresponding eigenvalues \lambda_{i}.

The optimal k-dimensional PCA subspace is then spanned by the k eigenvectors of \hat{\underline{\underline{C}}} that correspond to the k largest eigenvalues of \hat{\underline{\underline{C}}}. These eigenvectors are the directions of greatest variance in the original data. Alternatively, one can just do a Singular Value Decomposition (SVD) of the original data matrix \underline{\underline{X}}, and work with the singular values of \underline{\underline{X}} instead of the eigenvalues of \hat{\underline{\underline{C}}}.
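
As a quick reference, that computation might be sketched in NumPy as follows (an illustrative sketch, not code from the post’s repository):

```python
import numpy as np

def pca_top_k(X, k):
    """Top-k principal components of a mean-centred data matrix X (N x d)."""
    n = X.shape[0]
    C = (X.T @ X) / (n - 1)                 # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh: ascending eigenvalues of a symmetric matrix
    order = np.argsort(eigvals)[::-1]       # re-order to descending
    return eigvals[order[:k]], eigvecs[:, order[:k]]
```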

That is a heuristic derivation/justification of PCA (minus the detailed maths) that goes back to Harold Hotelling in 1933². There is a probabilistic model-based derivation due to Tipping and Bishop (1999), which we will return to later.

MCA

Will responded that, as part of his PhD, he’d worked on a problem where he was more interested in the directions in the dataset along which the variation is least. The problem Will was working on was “unsupervised change detection in multivariate streaming data”. The solution Will developed was a modular one, chaining together several univariate change detection methods, each of which monitored a single feature of the input space. This was combined with an MCA feature extraction and selection pre-processing step. The solution was tested against a problem of unsupervised endogenous eye-blink detection.

The idea behind Will’s use of MCA was that, for the streaming data he was interested in, the inter-class variances of various features were likely to be much smaller than the intra-class variances. Any principal components would therefore be dominated by what the classes had in common rather than by what had changed, so the directions of greatest variance weren’t very useful for his change detection algorithm.

I’ve put a link here to Will’s PhD in case you are interested in the details of the problem and solution – yes, Will I have read your PhD.

Directions of least variance in a dataset can be found from the same eigen-decomposition of the sample covariance matrix, by selecting the components with the smallest non-zero eigenvalues. Unsurprisingly, focusing on directions of least variance in a dataset is called Minor Components Analysis (MCA)³,⁴. Where we have the least variation, the data is effectively constrained, so MCA is good for identifying/modelling invariants or constraints within a dataset.
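
A minor-components version of the PCA sketch above just selects from the other end of the eigenvalue spectrum, keeping only the non-zero eigenvalues (again, an illustrative sketch):

```python
import numpy as np

def mca_bottom_k(X, k, tol=1e-12):
    """Bottom-k (minor) components of a mean-centred data matrix X (N x d)."""
    n = X.shape[0]
    C = (X.T @ X) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues
    keep = eigvals > tol                    # drop numerically-zero directions
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    return eigvals[:k], eigvecs[:, :k]
```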

At this point in the conversation, I recalled the last time I’d thought about MCA. That was when an academic colleague and I had a paper accepted at the NeurIPS conference in 2003. Our paper was on kernel PCA applied to high-dimensional data, in particular the eigenvalue distributions that result. As I was moving job and house at the time I was unable to go to the conference, so my co-author, Magnus Rattray (now Director of the Institute for Data Science and Artificial Intelligence at the University of Manchester), went instead. On returning, Magnus told me of an interesting conversation he’d had at the conference with Max Welling about our paper. Max also had a paper at the conference, on XCA – Extreme Components Analysis. Max and his collaborators had managed to unify PCA and MCA into a single framework.

I mentioned the XCA paper to Will at the PyData Manchester Leaders meeting and said I’d write something up explaining XCA. It would also give me an excuse to revisit something that I hadn’t looked at since 2003. That conversation with Will was nearly a year ago. Another PyData Manchester Leaders meeting came and went and another will be coming around sometime soon. To avoid having to give a lame apology I thought it was about time I wrote this post.

XCA

Welling et al rightly point out that if we are modelling a dataset as lying in some reduced dimensionality subspace then we consider the data as being a combination of variation and constraint. We have variation of the data within a subspace and a constraint that the data does not fall outside the subspace. So we can model the same dataset focusing either on the variation (PCA) or on the constraints (MCA).

Note that in my blog post I have used a different, more commonly used notation for the number of features and the number of components than that used in the Welling et al paper. The mapping between the two notations is given below,

  • Number of features: my notation = d, Welling et al notation = D
  • Number of components: my notation = k, Welling et al notation = d

Probabilistic PCA and MCA

PCA and MCA both have probabilistic formulations, PPCA and PMCA5 respectively. Welling et al state that, “probabilistic PCA can be interpreted as a low variance data cloud which has been stretched in certain directions. Probabilistic MCA on the other hand can be thought of as a large variance data cloud which has been pushed inward in certain directions.” In both probabilistic models a d-dimensional datapoint \underline{x} is considered as coming from a zero-mean multivariate Gaussian distribution. In PPCA the covariance matrix of the Gaussian is modelled as,

\underline{\underline{C}}_{PCA} = \sigma^{2}_{0}\underline{\underline{I}}_{d} + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}

The matrix \underline{\underline{A}} is d \times k and its columns are the principal components that span the low-dimensional subspace we are trying to model.

In PMCA the covariance matrix is modelled as,

\underline{\underline{C}}_{MCA}^{-1} = \sigma^{-2}_{0}\underline{\underline{I}}_{d} + \underline{\underline{W}}^{\top}\,\underline{\underline{W}}

The matrix \underline{\underline{W}} is (d-k) \times d and its rows are the minor components that define the (d-k)-dimensional subspace in which we want as little variation as possible.

Since in real data we probably have both exceptional directions whose variance is greater than the bulk and exceptional directions whose variance is less than the bulk, both PCA and MCA would lead to biased estimates for these datasets. The problem is that if we use PCA we lump the low variation eigenvalues (minor components) in with our estimate of the isotropic noise, thereby underestimating the true noise variance and consequently biasing our estimate of the large variation PC subspace. Likewise, if we use MCA, we lump all the large variation eigenvalues (principal components) into our estimate of the noise and overestimate the true noise variance, thereby biasing our estimate of the low variation MC subspace.

Probabilistic XCA

In XCA we don’t have that problem. In XCA we include both large variation and small variation directions in our reduced dimensionality subspace. In fact we just have a set of orthogonal directions \underline{a}_{i}\;,\;i=1,\ldots,k that span a low-dimensional subspace and again form the columns of a matrix \underline{\underline{A}}. These are our directions of interest in the data. Some of them, say, k_{PC}, have unusually large variance, some of them, say k_{MC}, have unusually small variance. The overall number of extreme components (XC) is k = k_{PC} + k_{MC}.

As with probabilistic PCA, we then add an isotropic noise component on top, to give the overall covariance matrix. However, the clever trick of Welling et al was to realize that adding noise always increases variances, and so adding noise to all features would make the minor components undetectable, as the minor components have, by definition, variances below that of the bulk noise. To circumvent this, Welling et al only added noise to the subspace orthogonal to the subspace spanned by the vectors \underline{a}_{i}. They do this by introducing a projection operator {\cal{P}}_{A}^{\perp} = \underline{\underline{I}}_{d} - \underline{\underline{A}} \left ( \underline{\underline{A}}^{\top}\,\underline{\underline{A}}\right )^{-1}\underline{\underline{A}}^{\top}. Again we model the data as coming from a zero-mean multivariate Gaussian, but for XCA the final covariance matrix is of the form,

\underline{\underline{C}}_{XCA}  = \sigma^{2}_{0} {\cal{P}}_{A}^{\perp}  + \underline{\underline{A}}\,\underline{\underline{A}}^{\top}

and the  XCA model is,

\underline{x} \sim \underline{\underline{A}}\,\underline{y} + {\cal{P}}_{A}^{\perp} \underline{n}\;\;\;,\;\; \underline{y} \sim {\cal{N}}\left ( \underline{0}, \underline{\underline{I}}_{k}\right )\;\;\;,\;\;\underline{n} \sim {\cal {N}}\left ( \underline{0},  \sigma^{2}_{0}\underline{\underline{I}}_{d}\right )
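As a quick illustration of this generative model, here is a minimal numpy sketch that builds {\cal{P}}_{A}^{\perp} and \underline{\underline{C}}_{XCA} and draws a sample (the sizes d, k and the value of \sigma_{0} below are arbitrary illustrative choices, not values from the paper),

import numpy as np

rng = np.random.default_rng(42)

d, k = 10, 3        # illustrative sizes only
sigma0 = 1.0

# A is d x k; its columns are the extreme-component directions
A = rng.normal(size=(d, k))

# Projection onto the subspace orthogonal to the columns of A
P_perp = np.eye(d) - A @ np.linalg.inv(A.T @ A) @ A.T

# XCA covariance: sigma0^2 * P_perp + A A^T
C_xca = sigma0**2 * P_perp + A @ A.T

# Draw a datapoint according to x = A y + P_perp n
y = rng.normal(size=k)
n = rng.normal(scale=sigma0, size=d)
x = A @ y + P_perp @ n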

We can also start from the MCA side, by defining a projection operator {\cal{P}}_{W}^{\perp} = \underline{\underline{I}}_{d} - \underline{\underline{W}}^{\top} \left ( \underline{\underline{W}}\,\underline{\underline{W}}^{\top}\right )^{-1}\underline{\underline{W}}, where the rows of the k\times d matrix \underline{\underline{W}} span the k dimensional XC subspace we wish to identify. From this MCA-based approach Welling et al derive the probabilistic model for XCA as zero-mean multivariate Gaussian distribution with inverse covariance,

\underline{\underline{C}}_{XCA}^{-1}  = \frac{1}{\sigma^{2}_{0}} {\cal{P}}_{W}^{\perp}  + \underline{\underline{W}}^{\top}\,\underline{\underline{W}}

The two probabilistic forms of XCA are equivalent, and so one finds that the matrices \underline{\underline{A}} and \underline{\underline{W}} are related via \underline{\underline{A}} = \underline{\underline{W}}^{\top}\left ( \underline{\underline{W}}\,\underline{\underline{W}}^{\top} \right )^{-1}.

If we look at the two ways in which Welling et al derived a probabilistic model for XCA, we can see that they are very similar to the formulations of PPCA and PMCA respectively, just with {\cal{P}}_{A}^{\perp} substituted for \underline{\underline{I}}_{d} in the PPCA formulation, and {\cal{P}}_{W}^{\perp} substituted for \underline{\underline{I}}_{d} in the PMCA formulation. So, with just a redefinition of how we add the noise in the probabilistic model, Welling et al derived a single probabilistic model that unifies PCA and MCA.

Note that we are now defining the minor components subspace as the directions of unusually low variance, so we only need a few dimensions, i.e. k_{MC} \ll d, whereas previously, when we defined the minor components subspace as the subspace we wanted to constrain the data away from, we needed k_{MC} = d - k_{PC} directions. The probabilistic formulation of XCA is a very natural and efficient way to express MCA.

Maximum Likelihood solution for XCA

The model likelihood is easily written down and the maximum likelihood solution identified. As one might anticipate, the maximum-likelihood estimates for the vectors \underline{a}_{i} are just eigenvectors of \hat{\underline{\underline{C}}}, but we need to work out which ones. We can use the likelihood value at the maximum likelihood solution to do that for us.

Let’s say we want to retain k=6 extreme components overall, and we’ll use {\cal{C}} to denote the corresponding set of eigenvalues of \hat{\underline{\underline{C}}} that are retained. The maximum likelihood value for k extreme components (XC) is given by,

\log L_{ML} = - \frac{Nd}{2}\log \left ( 2\pi e\right )\;-\;\frac{N}{2}\sum_{i\in {\cal{C}}}\log \lambda_{i}\;-\;\frac{N(d-k)}{2}\log \left ( \frac{1}{d-k}\left [ {\rm tr}\hat{\underline{\underline{C}}} - \sum_{i\in {\cal{C}}} \lambda_{i}\right ]\right )

All we need to do is evaluate the above equation for all possible subsets {\cal{C}} of size k selected from the d eigenvalues \lambda_{i}\, , i=1,\ldots,d of \hat{\underline{\underline{C}}}. Superficially, this looks like a nasty combinatorial optimization problem of exponential complexity. But as Welling et al point out, we know from a result proved in the PPCA paper of Tipping and Bishop that in the maximum likelihood solution the non-extreme components have eigenvalues that form a contiguous group, and so the optimal choice of subset {\cal{C}} reduces to determining where to split the ordered eigenvalue spectrum of \hat{\underline{\underline{C}}}. Since we have k = k_{PC} + k_{MC}, that reduces to simply determining the optimal number of the largest \lambda_{i} to retain. That makes the optimization problem linear in k.

For example, in our hypothetical example we have said we want k = 6, but that could be a 3 PCs + 3 MCs split, or a 2 PCs + 4 MCs split, and so on. To determine which, we simply compute the maximum likelihood value for all the possible values of k_{PC} from k_{PC} = 0 to k_{PC} = k, each time keeping the largest k_{PC} values of \lambda_{i} and the smallest k - k_{PC} values of \lambda_{i} in our set {\cal{C}}.

Some of the terms in \log L_{ML} don’t change as we vary k_{PC} and can be dropped. Welling et al introduce a quantity {\cal{K}} defined by,

{\cal{K}}\;=\;\sum_{i\in {\cal{C}}}\log \lambda_{i}\; + \;(d-k)\log \left ( {\rm tr}\hat{\underline{\underline{C}}} - \sum_{i\in {\cal{C}}} \lambda_{i}\right )

{\cal{K}} is the negative of \log L_{ML}, up to an irrelevant constant and scale. If we then compute {\cal{K}}\left ( k_{PC}\right ) for all values of k_{PC} = 0 to k_{PC} = k and select the minimum, we can determine the optimal split of k = k_{PC} + k_{MC}.
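In code, that selection step might look something like the following minimal sketch (my own illustrative implementation of the procedure just described, not the code from my notebook),

import numpy as np

def optimal_kpc(eigvals, k):
    """Find the split k = k_PC + k_MC that minimizes the quantity K defined above.

    eigvals: all d eigenvalues of the sample covariance matrix.
    Returns the optimal k_PC and the array of K values for k_PC = 0,...,k.
    """
    lam = np.sort(eigvals)[::-1]       # sort descending: largest eigenvalues first
    d = len(lam)
    trace = lam.sum()
    K = np.empty(k + 1)
    for k_pc in range(k + 1):
        # retain the k_pc largest and the (k - k_pc) smallest eigenvalues
        retained = np.concatenate([lam[:k_pc], lam[d - (k - k_pc):]])
        K[k_pc] = np.log(retained).sum() + (d - k) * np.log(trace - retained.sum())
    return int(np.argmin(K)), K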

PCA and MCA as special cases

Potentially, we could find that k_{PC} = k, in which case all the selected extreme components would correspond to principal components, and so the XCA algorithm becomes equivalent to PCA. Likewise, we could get k_{PC} = 0, in which case all the selected extreme components would correspond to minor components and the XCA algorithm becomes equivalent to MCA. So XCA contains pure PCA and pure MCA as special cases. But when do these special cases arise? Obviously, it will depend upon the precise values of the sample covariance eigenvalues \lambda_{i}, or rather the shape of the eigen-spectrum, but Welling et al also give some insight here, namely,

  • A log-convex sample covariance eigen-spectrum will give PCA
  • A log-concave sample covariance eigen-spectrum will give MCA
  • A sample covariance eigen-spectrum that is neither log-convex nor log-concave will yield both principal components and minor components

In layman’s terms, if the plot of the (sorted) eigenvalues on a log scale only bends upwards (has positive second derivative) then XCA will give just principal components, whilst if the plot of the (sorted) eigenvalues on a log scale only bends downwards (has negative second derivative) then we’ll get just minor components. If the plot of the log-eigenvalues has places where the second derivative is positive and places where it is negative, then XCA will yield a mixture of principal and minor components.
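If you want a quick, rough check of which regime a given eigen-spectrum falls into, you can look at the discrete second differences of the sorted log-eigenvalues. The sketch below is a crude diagnostic only (sample eigenvalues are noisy, so in practice the second differences will rarely all be of one sign),

import numpy as np

def spectrum_shape(eigvals):
    """Crude check of whether the sorted log-eigenvalue spectrum is log-convex, log-concave or mixed."""
    log_lam = np.log(np.sort(eigvals))        # ascending order
    second_diff = np.diff(log_lam, n=2)       # discrete second derivative
    if np.all(second_diff >= 0):
        return "log-convex: expect pure PCA"
    if np.all(second_diff <= 0):
        return "log-concave: expect pure MCA"
    return "mixed: expect both principal and minor components"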

Experimental demonstration

To illustrate the XCA theory I produced a Jupyter notebook that generates simulated data containing a known number of principal components and a known number of minor components. The simulated data is drawn from a zero-mean multivariate Gaussian distribution with population covariance matrix \underline{\underline{C}} whose eigenvalues \Lambda_{i} have been set to the following values,

\begin{array}{cclcl} \Lambda_{i} & = & 3\sigma^{2}\left ( 1 + \frac{i}{3k_{PC}}\right ) & , &  i=1,\ldots, k_{PC}\\ \Lambda_{i} & = &\sigma^{2}  & , & i=k_{PC}+1,\ldots, d - k_{MC}\\ \Lambda_{i} & = & 0.1\sigma^{2}\left ( 1 + \frac{i - (d - k_{MC})}{k_{MC}} \right ) & , & i=d - k_{MC} + 1,\ldots, d\end{array}

The first k_{PC} eigenvalues represent principal components, as their variance is considerably higher than the ‘noise’ eigenvalues, which are represented by eigenvalues i=k_{PC}+1 to i=d - k_{MC}. The last k_{MC} eigenvalues represent minor components, as their variance is considerably lower than the ‘noise’ eigenvalues. Note, I have scaled both the PC and MC population eigenvalues by the ‘noise’ variance \sigma^{2}, so that \sigma^{2} just sets an arbitrary (user-chosen) scale for all the variances. I have chosen a large value of \sigma^{2}=20, so that when I plot the minor component eigenvalues of \hat{\underline{\underline{C}}} I can easily distinguish them from zero (without having to plot on a logarithmic scale).

We would expect the eigenvalues of the sample covariance matrix to follow a similar pattern to the eigenvalues of the population covariance matrix that we used to generate the data, i.e. we expect a small group of noticeably low-valued eigenvalues, a small group of noticeably high-valued eigenvalues, and the bulk (majority) of the eigenvalues to form a continuous spectrum of values.

I generated a dataset consisting of N=2000 datapoints, each with d=200 features. For this dataset I chose k_{PC} = k_{MC} = 10. From the simulated data I computed the sample covariance matrix \hat{\underline{\underline{C}}} and then calculated the eigenvalues of \hat{\underline{\underline{C}}}.
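If you want to play with a self-contained version of this simulation before opening the notebook, the sketch below follows the recipe just described (it is an illustrative re-implementation, not the notebook code itself – the GitHub link below has the real thing),

import numpy as np

rng = np.random.default_rng(0)

N, d = 2000, 200
k_pc, k_mc = 10, 10
sigma2 = 20.0

# Population eigenvalues: principal components, isotropic 'noise' bulk, minor components
pc_vals    = 3.0 * sigma2 * (1.0 + np.arange(1, k_pc + 1) / (3.0 * k_pc))
noise_vals = sigma2 * np.ones(d - k_pc - k_mc)
mc_vals    = 0.1 * sigma2 * (1.0 + np.arange(1, k_mc + 1) / k_mc)
pop_eigvals = np.concatenate([pc_vals, noise_vals, mc_vals])

# Random orthogonal matrix supplies the population eigenvectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
C_pop = Q @ np.diag(pop_eigvals) @ Q.T

# Draw N datapoints from a zero-mean Gaussian with covariance C_pop
X = rng.multivariate_normal(np.zeros(d), C_pop, size=N)

# Mean-center explicitly, form the sample covariance and get its eigenvalues
Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / (N - 1)
sample_eigvals = np.sort(np.linalg.eigvalsh(C_hat))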

The left-hand plot below shows all the eigenvalues (sorted from lowest to highest), and I have also zoomed in on just the smallest values (middle plot) and just the largest values (right-hand plot).  We can clearly see that the sample covariance eigenvalues follow the pattern we expected.

Plots of the eigenvalues of the sample covariance matrix of the simulated data.

All the code (with explanations) for my calculations is in a Jupyter notebook, freely available from the GitHub repository https://github.com/dchoyle/xca_post

From the left-hand plot we can see that there are places where the sample covariance eigenspectrum bends upwards and places where it bends downwards, indicating that we would expect the XCA algorithm to retain both principal and minor components. In fact, we can clearly see from the middle and right-hand plots the distinct group of minor component eigenvalues and the distinct group of principal component eigenvalues, and how these correspond to the distinct groups of extreme components in the population covariance eigenvalues. However, it would be interesting to see how the XCA algorithm performs in selecting a value for the number of principal components.

For the eigenvalues above I have calculated the minimum value of {\cal{K}} for a total of k=20 extreme components. The minimum value of {\cal{K}} occurs at k_{PC} = 10, indicating that the method of Welling et al estimates that there are 10 principal components in this dataset, and by definition there are then k - k_{PC} = 20 - 10 = 10 minor components in the dataset. In this case, the XCA algorithm has identified the dimensionalities of the PC and MC subspaces exactly.

Final comments

That is the post done. I can look Will in the eye when we meet for a beer at the next PyData Manchester Leaders meeting. The post ended up being longer (and more fun) than I expected, and you may have got the impression from it that the Welling et al paper has completely solved the problem of selecting interesting low-dimensional subspaces in Gaussian distributed data. Not quite true. There are still challenges with XCA, as there are with PCA. For example, we have not said how we choose the total number of extreme components k. That is a whole other model selection problem, and one that is particularly interesting for PCA when we have high-dimensional data. This is one of my research areas – see for example my JMLR paper Hoyle2008.

Another challenge that is particularly relevant for high-dimensional data is the question of whether we will see distinct groups of principal and minor component sample covariance eigenvalues at all, even when we have distinct groups of population covariance eigenvalues. I chose very carefully the settings used to generate the simulated data in the example above. I ensured that we had many more samples than features, i.e. N \gg d, and that the extreme component population covariance eigenvalues were distinctly different from the ‘noise’ population eigenvalues. This ensured that the sample covariance eigenvalues separated into three clearly visible groups.

However, in PCA when we have N < d and/or weak signal strengths for the extreme components of the population covariance, the extreme component sample covariance eigenvalues may not be separated from the bulk of the other eigenvalues. As we increase the ratio \alpha = N/d we observe a series of phase transitions at which each of the extreme components becomes detectable – again, this is another of my areas of research expertise [HoyleRattray2003, HoyleRattray2004, HoyleRattray2007].

Footnotes
  1. I have used the usual divide-by-(N-1) Bessel correction in the definition of the sample covariance. This is because I have assumed any data matrix will have been explicitly mean-centered. In many of the analyses of PCA the starting assumption is that the data is drawn from a mean-zero distribution, so that the sample mean of any feature is zero only in expectation, not as a constraint. Consequently, most formal analyses of PCA will define the sample covariance matrix with a 1/N factor. Since I have to deal with real data, I will never presume the data has been drawn from a population distribution that has zero mean, and so to model the data with a zero-mean distribution I will explicitly mean-center the data. Therefore, I use the 1/(N-1) definition of the sample covariance. Strictly speaking, that means the various theories and analyses I discuss later in the post are not applicable to the data I’ll work with. It is possible to modify the various analyses to explicitly take into account the mean-centering step, but it is tedious to do so. In practice (for large N) the difference is largely inconsequential, and formulae derived from analysis of zero-mean distributed data can be accurate for mean-centered data, so we’ll stick with using the 1/(N-1) definition for \hat{\underline{\underline{C}}}.

  2. Hotelling, H. “Analysis of a complex of statistical variables into principal components”. Journal of Educational Psychology, 24:417-441 and also 24:498–520, 1933.
    https://dx.doi.org/10.1037/h0071325

  3. Oja, E. “Principal components, minor components, and linear neural networks”. Neural Networks, 5(6):927-935, 1992. https://doi.org/10.1016/S0893-6080(05)80089-9

  4. Luo, F.-L., Unbehauen, R. and Cichocki, A. “A Minor Component Analysis Algorithm”. Neural Networks, 10(2):291-297, 1997. https://doi.org/10.1016/S0893-6080(96)00063-9

  5. See for example, Williams, C.K.I. and Agakov, F.V. “Products of gaussians and probabilistic minor components analysis”. Neural Computation, 14(5):1169-1182, 2002. https://doi.org/10.1162/089976602753633439

© 2025 David Hoyle. All Rights Reserved

A book on language models and a paper on the mathematics of transformers

Quick introductions to the maths of transformers

TL;DR: Having a high-level understanding of the mathematics of transformers is important for any Data Scientist. The two sources I recommend below are excellent short introductions to the maths of transformers and modern language models.

A colleague asked me, about two months back, if I could recommend any articles on the mathematics of Large Language Models (LLMs). They then clarified that they meant transformers, as they were primarily interested in the algorithms on which LLM apps are based. Yes, they’d skim-read the original “Attention Is All You Need” paper from Vaswani et al, but they had done so just after the paper came out in 2017. They were looking to get back up to date with LLMs and even revisit the original Vaswani paper. Firstly, they wanted an accessible explanation which they could use to construct a high-level mental model of how transformers work, the idea being that this high-level mental model would serve as a scaffold on which to hang and compartmentalize the many new concepts and advances that have appeared since the Vaswani paper. Secondly, my colleague is very mathematically able, so they were looking for mathematical detail, but the right mathematical detail, and in a relatively short read.

I’ve listed below the recommendations I gave to my colleague because I think they are good recommendations (and I explain why below as well). I also believe it is important for all Data Scientists to have at least a high-level understanding of how transformers, and the LLMs which are built on them, work – again I explain why, below.

The recommendations

What I recommended was one paper and one book. The article is free to access, and the book has a “set your own price” option for access to the electronic version of the book.

  • The article is, “An Introduction to Transformers” by Richard Turner from the Dept. of Engineering at the University of Cambridge and Microsoft Research in Cambridge (UK). This arXiv paper can be found here.  The paper focuses on how transformers work but not on training them. That way the reader focuses on the structure of the transformers without getting lost in the details of the arcane and dark art of training transformers. This is why I like this paper. It gives you an overview of what transformers are and how they work without getting into the necessary but separate nitty-gritty of how you get them to work. To read the paper does require some prior knowledge of mathematics but the level is not that high – see the last line of the abstract of the paper. The whole paper is only six pages long, making it a very succinct explanation of transformer maths that you can consume in one sitting.
  • The book is “The Hundred-Page Language Models Book” by Andriy Burkov. This is the latest in the series of books from Burkov that includes “The Hundred-Page Machine Learning Book” and “Machine Learning Engineering”. I have a copy of the hundred-page machine learning book and I think it is ok, but I prefer the LLMs book. I think part of the reason for this is that, like everybody else, I have only been using and playing with LLMs for the last three years or so, whilst I have been doing Data Science for a lot longer – I have been doing some form of mathematical or statistical modelling for over 30 years – and so I didn’t really learn anything new from the machine learning book. In contrast, I learnt a lot from the book on LLMs. The whole book works through simple examples, both in code (Python) and in terms of the maths. I semi-skim read the book in two sittings. The code examples I skipped, not because they were simplistic but because I wanted to digest the theory and algorithm explanations end-to-end first and then return to trying the code examples at a later date. Overall, the book is packed with useful nuggets. It is a longer read than the Turner paper, but can still easily be consumed in a day if you skip bits. The book assumes less prior mathematical knowledge than the Turner paper and explains the new bits of maths it introduces, but given the whirlwind nature of a 100-page introduction to LLMs I would still recommend you have some basic familiarity with linear algebra, statistics & probability, and machine learning concepts.

Why learn the mathematics of transformers?

Having to think about which short articles I would recommend on the maths of transformers and LLMs made me think more broadly about whether there is any benefit from having a high-level understanding of transformer maths. My colleague was approaching it out of curiosity, and I knew that. They simply wanted to learn, not because they had to, nor because they thought that understanding the mathematical basis of transformers was the way to approach using LLMs as a tool.

However, given the exorbitant financial cost of building foundation models and the need to master a vast amount of engineering detail, most people won’t be building their own foundation models. Instead they will be using 3rd party models simply as a tool and focusing on developing skills and familiarity in prompting them. So, are there any benefits to understanding the maths behind LLMs? In other words, could I honestly recommend the two sources listed above to anybody other than my colleague, who was interested mainly out of curiosity?

The benefits of learning the maths of transformers and the risks of not doing so

The answer to the question above, in my opinion, is yes. But you probably could have guessed that from the fact I’ve written this post. So, what do I think are the benefits to a Data Scientist in having a high-level understanding of the mathematics of transformers? And equally important, what are the downsides and risks of not having that high-level understanding?

  1. Having even a high-level understanding of the maths behind transformers de-mystifies LLMs since it forces you to focus on what is inside LLMs. Without this understanding you risk putting an unnecessary veneer of complexity or mysticism on top of LLMs, a veneer that prevents you using LLMs effectively.
  2. You will understand why LLMs hallucinate. You will understand that LLMs build a model of the high-dimensional conditional probability distribution of the next token given the preceding context. And that distribution can have a large dispersion if the training data is limited in the high-dimensional region that corresponds to the current context. That large dispersion results in the sampled next token having a high probability of being inappropriate. If you understand what LLMs are modelling and how they model it, hallucinations will not be a surprise to you (they may still be annoying) and you will understand strategies to mitigate them. If you don’t understand how LLMs are modelling the conditional probability of the next token, you will always be surprised, annoyed, and impacted by LLM hallucinations.
  3. It helps you understand where LLMs excel and where they don’t because you have a grounded understanding of their strengths and weaknesses. This makes it easier to identify potential applications of LLMs. The downside? Not having a fundamental understanding of the strengths and weaknesses of the algorithms behind LLMs risks you building LLM-based applications that were doomed to failure from the start because they have been mis-matched to the capabilities of LLMs.
  4. By having a high-level mental model of transformers on which to hang later advances in LLMs, you can more easily identify what is important and relevant (or not) in any new advance. The downside to not having this well-founded mental-model is that you get blown about by the winds of over-hyped LLM announcements from companies stating that their new tool or app is a “paradigm shift”, and consequently you waste time getting into the detail of what are trivial or inconsequential improvements.

What to do?

What should you do if you are a Data Scientist and I have managed to convince you that having a high-level understanding of the mathematics of transformers is important? Simple, access the two sources I’ve recommended above. Happy reading.

© 2025 David Hoyle. All Rights Reserved

Comparison of Benford's Law and the proportion of first digits from file sizes of files on my laptop hard drive.

A Christmas Cracker Puzzle – Part 2

Before Christmas I set a little puzzle. The challenge was to calculate the proportion of file sizes on your hard drive that start with the digit 1. I predicted that the proportion you would get was around 30%. I’ll now explain why.

Benford’s Law

The reason why around 30% of all the file sizes on your hard disk start with 1 is because of Benford’s Law. Computer file sizes approximately follow Benford’s Law.

What is Benford’s Law?

Benford’s Law says that for many datasets the first digits of the numbers in the dataset follow a particular distribution. Under Benford’s Law, the probability of the first digit being equal to d is,

\log_{10} ( 1 + \frac{1}{d} )\;\;\;.\;\;\;\;\;\; Eq.1

So, in a dataset that follows Benford’s Law, the probability that a number starts with a 1 is around 30%. Hence, the percentage of file sizes that start with 1 is around 30%.

The figure below shows a comparison of the distribution in Eq.1 and the distribution of first digits of file sizes for files in the “Documents” folder of my hard drive. The code I used to calculate the empirical distribution in the figure is given at the end of this post. You can see that the distribution derived from my files is in close, but not perfect, agreement with the distribution predicted by Eq.1 – more on that later.

Benford’s Law is named after Frank Benford, who discovered the law in 1938 – see the later section for some of the long history of Benford’s Law. Because Benford’s Law is concerned with the distribution of first digits in a dataset, it is also commonly referred to as ‘the first digit law’ or ‘the significant digit law’.

Benford’s Law is more than just a mathematical curiosity or Christmas cracker puzzle. It has some genuine applications – see later. It has also fascinated mathematicians and statisticians because it applies to so many diverse datasets. Benford’s Law has been shown to apply to datasets as different as the size of rivers and election results.

What’s behind Benford’s Law

The intuition behind Benford’s Law is that if we think there are no a priori constraints on what value a number can take then we can make the following statements,

  1. We expect the numbers to be uniformly distributed in some sense – more on that in a moment.
  2. There is no a priori scale associated with the distribution of numbers. So, I should be able to re-scale all my numbers and have a distribution with the same properties.

Overall, this means we expect the numbers to be uniformly distributed on a log scale.

This intuition helps us identify when we should expect a dataset to follow Benford’s Law, and it also gives us a hand-waving way of deriving the form of Benford’s Law in Eq.1.

Deriving Benford’s Law

First, we restrict our numbers to be positive and derive Benford’s Law. The fact that Benford’s Law would also apply to negative numbers once we ignore their sign should be clear.

We’ll also have to restrict our positive numbers to lying in some range [\frac{1}{x_{max}}, x_{max}] so that the probability distribution of x is properly normalized. We’ll then take the limit x_{max}\rightarrow\infty at the end. For convenience, we’ll take x_{max} to be of the form x_{max} = 10^{k_{max}}, and so the limit x_{max}\rightarrow\infty corresponds to the limit k_{max}\rightarrow\infty.

Now, if our number x lies in [\frac{1}{x_{max}}, x_{max}] and is uniformly distributed on a log-scale, then the probability density for x is,

p(x) = \frac{1}{2\ln x_{max}} \frac{1}{x}\;\;.

The probability of getting a number between, say 1 and 2 is then,

{\rm{Prob}} \left ( 1 \le x < 2 \right ) = \frac{1}{2\ln x_{max}} \int_{1}^{2} \frac{1}{x} dx \;\;.

Now numbers which start with a digit d are of the form a10^{k} with a \in [d, d+1) and k = -k_{max},\ldots,-2,-1,0,1,\ldots,k_{max}-1, giving 2k_{max} possible values of k. So, the total probability of getting such a number is,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \sum_{k=-k_{max}}^{k_{max}-1}\frac{1}{2\ln x_{max}}\int_{d10^{k}}^{(d+1)10^{k}} \frac{1}{x} dx\;\; ,

and so after performing the integration and summation we have,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \frac{2k_{max}}{2\ln x_{max}} \left [ \ln ( d+1 ) - \ln d\right ]\;\;.

Recalling that we have chosen x_{max} =10^{k_{max}}, we get,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \frac{1}{\ln 10} \left [ \ln ( d+1 ) - \ln d\right ]\;=\;\log_{10} \left ( 1 + \frac{1}{d}\right )\;\;.

Finally, taking the limit k_{max}\rightarrow\infty gives us Benford’s Law for positive valued numbers which are uniformly distributed on a log scale.
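As a quick numerical sanity check of that derivation, we can sample numbers uniformly on a log scale and compare the empirical first-digit frequencies with Eq.1. The snippet below is an illustrative check only (the sample size and range are arbitrary choices),

import numpy as np

rng = np.random.default_rng(1)

# Sample numbers uniformly on a log scale spanning 12 orders of magnitude
k_max = 6
x = 10.0 ** rng.uniform(-k_max, k_max, size=100_000)

# Extract the first significant digit via the base-10 mantissa
mantissa = 10.0 ** (np.log10(x) % 1.0)     # lies in [1, 10)
first_digits = np.floor(mantissa).astype(int)

# Compare empirical frequencies with the Benford's Law prediction of Eq.1
empirical = np.bincount(first_digits, minlength=10)[1:] / len(x)
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))
print(np.round(empirical, 3))
print(np.round(benford, 3))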

Ted Hill’s proof of Benford’s Law

The derivation above is a very loose (lacking formal rigour) derivation of Benford’s Law. Many mathematicians have attempted to construct a broadly applicable and rigorous proof of Benford’s Law, but it was not until 1995 that a widely accepted proof was derived by Ted Hill of the Georgia Institute of Technology. You can find Hill’s proof here. Ted Hill’s proof also seemed to reinvigorate interest in Benford’s Law. In the following years there were various popular science articles on Benford’s Law, such as this one from 1999 by Robert Matthews in New Scientist. This article was where I first learned about Benford’s Law.

Benford’s Law is base invariant

The approximate justification we gave above for why Benford’s Law works made no explicit reference to the fact that we were working with numbers expressed in base 10. Consequently, that justification would be equally valid if we were working with numbers expressed in base b. This means that Benford’s Law is base invariant, and a similar derivation can be made for base b.

The distribution of first digits given in Eq.1 is for numbers expressed in base 10. If we express those same numbers in base b and write a number x in base b  as,

x = x_{1}x_{2}x_{3}\ldots x_{K}\;\;,

then the first digit x_{1} is in the set \{1,2,3,\ldots,b-1\} and the probability that the first digit has value d is given by,

{\rm{Prob}} \left ( x_{1}\;=\;d \right ) = \log_{b}\left ( 1 + \frac{1}{d}\right )\;\;\;,\; d\in\{1,2,3,\ldots,b-1\}\;\;\;.\;\;\;\;\; Eq.2

When should a dataset follow Benford’s Law?

The broad intuition behind Benford’s Law that we outlined above also gives us an intuition about when we should expect Benford’s Law to apply. If we believe the process generating our data is not restricted in the scale of the values that can be produced and there are no particular values that are preferred, then a large dataset drawn from that process will be well approximated by Benford’s Law. These considerations apply to many different types of data generating processes, and so with hindsight it should not come as a surprise to us that many different datasets appear to follow Benford’s Law closely.

This requires that no scale emerges naturally from the data generating process. This means the data values shouldn’t be clustered around a particular value, or particular values. So, data drawn from a Gaussian distribution would not conform to Benford’s law. From the perspective of the underlying processes that produce the data, this means there shouldn’t be any equilibrium process at work, as that would drive the measured data values to the value corresponding to the equilibrium state of the system. Likewise, there should be no constraints on the system generating the data, as the constraints will drive the data towards a specific range of values.

However, no real system is truly scale-free; the finite system size always imposes a scale. But if we have data that can vary over many orders of magnitude then, from a practical point of view, we can regard that data as effectively scale-free. I have found that data that varies over 5 orders of magnitude is usually well approximated by Benford’s Law.

Because a real-world system cannot really ever truly satisfy the conditions for Benford’s Law, we should expect that most real-world datasets will only show an approximate agreement with Benford’s Law. The agreement can be very close, but we should still expect to see deviations from Benford’s Law – just as we saw in the figure at the top of this post. Since real-world datasets won’t follow Benford’s Law precisely, we also shouldn’t expect to see a real-world dataset follow the base-invariant form of Benford’s Law in Eq.2 for every choice of base. In practice, this means that there is usually a particular base b for which we will see closer agreement with the distribution in Eq.2, compared to other choices of base.

Why computer file sizes follow Benford’s Law

Why does Benford’s Law apply to the sizes of the files on your computer? Those file sizes can span over several orders of magnitude – from a few hundred bytes to several hundred megabytes. There is also no reason why my files should cluster around a particular file size – I have photos, scientific and technical papers, videos, small memos, slides, and so on. It would be very unusual if all those different types of files, from different use cases, end up having very similar sizes. So, I expect Benford’s Law to be a reasonable description of the distribution of first digits of the file sizes of files in my “Documents” folder.

However, if the folder I was looking at just contained, say, daily server logs, from a server that ran very unexciting applications, I would expect the server log file size to be very similar from one day to the next. I would not expect Benford’s Law to be a good fit to those server log file sizes.

In fact, a significant deviation from Benford’s Law in the file size distribution would indicate that we have a file size generation process that is very different from a normal human user going about their normal business. That may be entirely innocent, or it could be indicative of some fraudulent activity. Indeed, fraud detection is one of the practical applications of Benford’s Law.

The history of Benford’s Law

Simon Newcomb

One of the reasons why the fascination with Benford’s Law endures is the story of how it was discovered. With such an intriguing mathematical pattern, it is perhaps no surprise to learn that Frank Benford was not the first scientist to spot the first-digit pattern. The astronomer Simon Newcomb had published a paper on “the frequency of use of the different digits in natural numbers” in 1881. Before the advent of modern computers, scientists and mathematicians computed logarithms by looking them up in mathematical tables – literally books of logarithm values. I still have such a book from when I was at high school. The story goes that Newcomb noticed that in a book of logarithms the pages were grubbier, i.e. used more, for numbers whose first significant digit was 1. From this Newcomb inferred that numbers whose first significant digit is 1 must be more common, and he supposedly even inferred the approximate frequency of such numbers from the relative grubbiness of the pages in the book of logarithms.

In more recent years the first-digit law is also referred to as the Newcomb-Benford Law, although Benford’s Law is still more commonly used because of Frank Benford’s work in popularizing it.

Benford’s discovery

Frank Benford rediscovered the law in 1938, but also showed that data from many diverse datasets – from the surface area of rivers to population sizes of US counties – appeared to follow the distribution in Eq.1. Benford then published his now famous paper, “The law of anomalous numbers”.

Applications of Benford’s Law

There are several books on Benford’s Law. One of the most recent, and perhaps the most comprehensive is Benford’s Law: Theory and Applications. It is divided into 2 sections on General Theory (a total of 6 chapters) and 4 sections on Applications (a total of 13 chapters). Those applications cover the following:

  • Detection of accounting fraud
  • Detection of voter fraud
  • Measurement of the quality of economic statistics
  • Uses of Benford’s Law in the natural sciences, clinical sciences, and psychology.
  • Uses of Benford’s Law in image analysis.

I like the book because it has extensive chapters on applications written by practitioners and experts on the uses of Benford’s Law, but the application chapters make links back to the theory.

Many of the applications, such as fraud detection, are based on the idea of detecting deviations from the Benford’s Law distribution in Eq.1. If we have data that we expect to span several orders of magnitude, and we have no reason to suspect the data values should naturally cluster around a particular value, then we might expect it to follow Benford’s Law closely. This could be sales receipts from a business that has clients of very different sizes and sells goods or services that vary over a wide range of values. Any large deviation from Benford’s Law in the sales receipts would then indicate the presence of a process that produces very specific receipt values. That process could be data fabrication, i.e. fraud. Note, this doesn’t prove the data has been fraudulently produced, it just means that the data has been produced by a process we wouldn’t necessarily expect.

My involvement with Benford’s Law

In 2001 I published a paper demonstrating that data from high-throughput gene expression experiments tended to follow Benford’s Law. The reasons why mRNA levels should follow Benford’s Law are ultimately those we have already outlined – mRNA levels can range over many orders of magnitude and there are no a priori molecular biology reasons why, across the whole genome, mRNA levels should be centered around a particular value.

In December 2007 a conference on Benford’s Law was organized by Prof. Steven Miller from Brown University and others. The conference was held in a hotel in Santa Fe and was sponsored/funded by Brown University, the University of New Mexico, the Universidade de Vigo, and the IEEE. Because of my 2001 paper, I received an invitation to talk at the workshop.

For me, the workshop was very memorable for many reasons,

  1. I had a very bad cold at the time.
  2. Due to snow in both the UK and US, I was snowbound overnight in Chicago airport (sleeping in the airport), and only just made a connecting flight in Dallas to Albuquerque. Unfortunately, my luggage didn’t make the flight connection and ended up getting lost in all the flight re-arrangements and didn’t show up in Santa Fe until 2 days later.
  3. It was the first time I’d seen snow in the desert – this really surprised me. I don’t know why, it just did.
  4. Because of my cold, I hadn’t finished writing my presentation. So I stayed up late the night before my talk to finish writing my slides. To sustain my energy levels through the night whilst I was finishing writing my slides, I bought a Hershey bar thinking it would be similar to chocolate bars in the UK. I took a big bite from the Hershey bar. Never again.
  5. But this was all made up for by the fact I got to sit next to Ted Hill during the workshop dinner. Ted was one of the most genuine and humble scientists I have had the pleasure of talking to. Secondly, the red wine at the workshop dinner was superb.

From that workshop Steven Miller organized and edited the book on Benford’s Law I referenced above. That is why I think the book is one of the best on Benford’s Law, although I am biased as I contributed one of the chapters – Chapter 16, on “Benford’s Law in the Natural Sciences”.

My Python code solution

To end this post, I have given below the code I used to calculate the distribution of first digits of the file sizes on my laptop hard drive.

import numpy as np

# We'll use the pathlib library to recurse over
# directories
from pathlib import Path

# Specify the top level directory
start_dir = "C:\\Users\\David\\Documents"

# Use a list comprehension to recursively loop over all sub-directories
# and get the file sizes. We keep only actual files (not directories)
# with a non-zero size, so that every size has a meaningful first digit.
filesizes = [path.lstat().st_size
             for path in Path(start_dir).glob('**/*')
             if path.is_file() and path.lstat().st_size > 0]

# Now count what proportion of the file sizes start with a 1
proportion1 = np.sum([str(size)[0] == '1' for size in filesizes]) / len(filesizes)

# Print the result
print("Proportion of filesizes starting with 1 = " + str(proportion1))
print("Number of files = " + str(len(filesizes)))

# Calculate the first-digit proportions for all digits 1 to 9
proportions = np.zeros(9)
for size in filesizes:
    proportions[int(str(size)[0]) - 1] += 1.0

proportions /= len(filesizes)
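If you also want to compare those proportions with the Benford’s Law prediction of Eq.1, a short extension of the code above (an extra illustration, not part of my original solution) is,

# Benford's Law prediction from Eq.1 for digits 1 to 9
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))

for digit in range(1, 10):
    print(f"Digit {digit}: observed = {proportions[digit - 1]:.3f}, "
          f"Benford = {benford[digit - 1]:.3f}")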

© 2025 David Hoyle. All Rights Reserved

A Christmas Cracker Puzzle – Part 1

Here’s an interesting little coding challenge for you, with a curious data twist at the end. Less of an Advent of Code challenge and more of a Christmas Cracker puzzle.

Without knowing anything about you, I will make a prediction about the size of the files you have on your computer. Christmas magic? Maybe. Read on to find out.

There are two parts to this puzzle

Part 1 – Obtain a list of file sizes (in bytes) of all the files on your computer that are in a high-level directory and its sub-directories. We’re talking here about files with a non-zero file size. We are also talking about actual files, so the folders/directories themselves should not be included, just the files they contain. If this is your own laptop, the high-level directory could be the Documents directory (if on Windows). Obviously, you’ll need to have permission to recurse over all the sub-directories. You should choose a starting directory that has a reasonably large number of files in total across all its sub-directories.

Part 2 – Calculate the proportion of files whose file size starts with a 1. So, if you had 10 files of sizes, 87442, 78922, 3444, 9653, 197643, 26768, 8794787, 22445, 7654, 56573, then the proportion of files whose file size starts with a 1 is 0.1 or 10%.

The overall goal

The goal is to write code that performs the two parts of the challenge in as efficient a way as possible. You can use whatever programming languages you want to perform the two parts – you might use the command line for the first part and a high-level programming language for the second, or you might use a high-level programming language for both parts.

I used Python for both parts and used the Documents directory on my laptop as my starting directory. I had 96351 files in my Documents folder and its sub-directories. The proportion of my files that have a size starting with 1 is approximately 0.31, or 31%.

The twist

Now for the curious part. You should get a similar proportion, that is, around 30%. I will explain why in a separate post (part 2) in the new year, and also explain why, if you don’t, that can tell us something interesting about the nature of the files on your computer.

© 2024 David Hoyle. All Rights Reserved

You’re going to need a bigger algorithm – Amdahl’s Law and your responsibilities as a Data Scientist

You have some prototype Data Science code based on an algorithm you have designed. The code needs to be productionized, and so sped up to meet the specified production run-times. If you stick to your existing technology stack then, unless the runtimes of your prototype code are already within a factor of 1000 of your target production runtimes, you’ll need a bigger, better algorithm. There is a limit to the speed up your technology stack can achieve. Why is this? Read on and I’ll explain. And I’ll explain what you can do if you need more than a 1000-fold speed up of your prototype.

Speeding up your code with your current tech stack

There are two ways in which you can speed up your prototype code,

  1. Improve the efficiency of the language constructs used, e.g. in Python replacing for loops with list comprehensions or maps, refactoring subsections of the code etc. (see the small sketch just after this list).
  2. Horizontal scaling of your current hardware, e.g. adding more nodes to a compute cluster, adding more executors to the pool in a Spark cluster.
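As a trivial, purely illustrative example of point 1 (a toy snippet of my own, not taken from any production code),

import numpy as np

data = np.random.default_rng(0).normal(size=100_000)

# Explicit Python loop
squares = []
for v in data:
    squares.append(v * v)

# List-comprehension version - same result, less interpreter overhead
squares_lc = [v * v for v in data]

# Vectorized numpy version - usually the fastest of the three
squares_np = data ** 2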

Point 2 assumes that your calculation is compute bound and not memory bound, but we’ll stick with that assumption for this article. We also exclude the possibility that the productionization team can invent or buy a new technology that is sufficiently different from, or better than, your current tech stack – it would be an unfair ask of the ML engineers to have to invent a whole new technology just to compensate for your poor prototype. They may be able to, but we are talking solely about using your current tech stack, and we assume that it does have some capacity to be horizontally scaled.

So what speed ups can we expect from points 1 and 2 above? Point 1 is always possible. There are always opportunities for improving code efficiency that you or another person will spot when looking at the code for a second time. A more experienced programmer reviewing the code can definitely help. But let’s assume that you’re a reasonably experienced Data Scientist yourself. It is unlikely that your code is so bad that a review by someone else would speed it up by more than a factor of 10 or so.

So if the most we expect from code efficiency improvements is a factor 10 speed up, what speed up can we additionally get from horizontal scaling of your existing tech stack? A factor of 100 at most. Where does this limit of 100 come from? Amdahl’s law.

Amdahl’s law

Amdahl’s law is a great little law. Its origins are in High Performance Computing (HPC), but it has a very intuitive basis and so is widely applicable. Because of that it is worth explaining in detail.

Imagine we have a task that currently takes time T to run. Part of that task can be divided up and performed by separate workers or resources such as compute nodes. Let’s use P to denote the fraction of the task that can be divided up. We choose the symbol P because this part of the overall task can be parallelized. The fraction that can’t be divided up we denote by S, because it is the non-parallelizable or serial part of the task. The serial part of the task represents things like unavoidable overhead and operations in manipulating input and output data-structures and so on.

Obviously, since we’re talking about fractions of the overall runtime T, the fractions P and S must sum to 1, i.e.

S + P = 1\;\;\;\;\;\; Eq.1

The parallelizable part of the task takes time TP to run, whilst the serial part takes time TS to run.

What happens if we do parallelize that parallelizable component P? We’ll parallelize it using N workers or executors. When N=1, the parallelizable part took time TP to run, so with N workers it should (in an ideal world) take time TP/N to run. Now our overall run time, as a function of N is,

T(N) = TS + \frac{TP}{N} = T\left ( S + \frac{P}{N} \right )\;\;\;\;\;\; Eq.2

This is Amdahl’s law1. It looks simple but let’s unpack it in more detail. We can write the speed up factor in going from T(N=1) to T(N) as,

\frac{T(N=1)}{T(N)} = \frac{1}{S + \frac{P}{N}}\;\;\;\;\;\; Eq.3

The figure below shows plots of the speed-up factor against N, for different values of S.

Plots of the speed-up factor T(N=1)/T(N) against the number of workers N, for different values of the serial fraction S.

From the plot in the figure, you can see that the speed up factor initially looks close to linear in N and then saturates. The speed up at saturation depends on the size of the serial component S. There is clearly a limit to the amount of speed up we can achieve. When N is large, we can approximate the speed up factor in Eq.3 as,
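If you want to reproduce a plot of this kind yourself, a minimal matplotlib sketch is below (the values of S are illustrative choices, not necessarily those used in my figure),

import numpy as np
import matplotlib.pyplot as plt

N = np.arange(1, 1025)

for S in [0.01, 0.05, 0.1, 0.2]:
    P = 1.0 - S
    speedup = 1.0 / (S + P / N)      # Eq.3
    plt.plot(N, speedup, label="S = " + str(S))

plt.xscale("log")
plt.xlabel("Number of workers N")
plt.ylabel("Speed-up factor T(N=1)/T(N)")
plt.legend()
plt.show()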

\frac{T(N=1)}{T(N)} \approx \frac{1}{S}\left ( 1 - \frac{P}{SN} \right )\;\;\;\;\;\; Eq.4

From Eq.4 (or from Eq.3) we can see the limiting speed up factor is 1/S. The mathematical approximation in Eq.4 hides the intuition behind the result. The intuition is this; if the total runtime is,

T(N) = T\left ( S + \frac{P}{N} \right )\;\;\;\;\;\; Eq.5

then at some point we will have made N big enough that P/N is smaller than S. This means we have reduced the runtime of the parallelizable part to below that of the serial part. The largest contribution to the overall runtime is now the serial part, not the parallelizable part. Increasing N further won’t change this. We have hit a point of rapidly diminishing returns. And by definition we can’t reduce S by any horizontal scaling. This means that when P/N becomes comparable to S, there is little point in increasing N further and we have effectively reached the saturation speed up.

How small is S?

This is the million-dollar question, as the size of S determines the limiting speed up factor we can achieve through horizontal scaling. A larger value of S means a smaller speed up factor limit. And here’s the depressing part – you’ll be very lucky to get S close to 1%, which would give you a speed up factor limit of 100.

A real-world example

To explain why S = 0.01 is around the lowest serial fraction you’ll observe in a real calculation, I’ll give you a real example. I first came across Amdahl’s law in 2007/2008, whilst working on a genomics project, processing very high-dimensional data sets2. The calculations I was doing were statistical hypothesis tests run multiple times.

This is an example of an “embarrassingly parallel” calculation since it just involves splitting up a dataframe into subsets of rows and sending the subsets to the worker nodes of the cluster. There is no sophistication to how the calculation is parallelized, it is almost embarrassing to do – hence the term “embarrassingly parallel”.

The dataframe I had was already sorted in the appropriate order, so parallelization consisted of taking a small number of rows off the top of the dataframe, sending them to a worker node, and repeating. Mathematically, on paper, we had S=0. Timing actual calculations with different numbers of compute nodes, and fitting an Amdahl’s law curve to those timings, revealed we had something between S=0.01 and S=0.05.
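As an illustration of that fitting step, here is a minimal sketch using scipy’s curve_fit (the timings below are made-up numbers for illustration, not the actual timings from the genomics project),

import numpy as np
from scipy.optimize import curve_fit

def amdahl_runtime(N, T1, S):
    """Runtime predicted by Amdahl's law: T(N) = T1 * (S + (1 - S)/N)."""
    return T1 * (S + (1.0 - S) / N)

# Hypothetical measured runtimes (seconds) for different numbers of worker nodes
n_workers = np.array([1, 2, 4, 8, 16, 32])
runtimes = np.array([1000.0, 520.0, 275.0, 155.0, 95.0, 65.0])

(T1_fit, S_fit), _ = curve_fit(amdahl_runtime, n_workers, runtimes, p0=[1000.0, 0.05])
print("Estimated serial fraction S =", round(S_fit, 3),
      "-> limiting speed-up ~", round(1.0 / S_fit))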

A value of S=0.01 gave us a maximum speed up factor of 100 from horizontal scaling. And this was for a problem that on paper had S=0. In reality, there is always some code overhead in manipulating the data. A more realistic limit on S for an average complexity piece of Data Science code would be S=0.05 or S=0.1, meaning we should expect limits on the speed up factor of between 10 and 20.

What to do?

Disappointing, isn’t it? Horizontal scaling will speed up our calculation by at most a factor of 100, and more likely only by a factor of 10-20. What does this mean for productionizing our prototype code? If we also include the improvements in code efficiency, the most we’re likely to be able to speed up our prototype code by is a factor of 1000 overall. It means that as a Data Scientist you have a responsibility to ensure the runtime of your initial prototype is within a factor of 1000 of the production runtime requirements.

If a speed up of 1000 isn’t enough to hit the production run-time requirements, what can we do? Don’t despair. You have several options. Firstly, you can always change the technology underpinning your tech stack. Despite what I said at the beginning of this post, if you are repeatedly finding that horizontal scaling of your current tech stack does not give you the speed-up you require, then there may be a case for either vertically scaling the runtime performance of each worker node or using a superior tech stack if one exists.

If improvement by vertical scaling of individual compute nodes is not possible, then there are still things you can do to mitigate the situation. Put the coffee on, sharpen your pencil, and start work on designing a faster algorithm. There are two approaches you can use here,

  • Reduce the performance requirements: This could mean lowering the accuracy through approximations that are simpler and quicker to calculate. For example, if your code involves significant matrix inversion operations you may be able to approximate a matrix by its diagonal and explicitly hard-code the calculation of its inverse, rather than performing an expensive numerical inversion of the full matrix (see the tiny sketch just after this list).
  • Construct a better algorithm: There are no easy recipes here. You can get some hints on where to focus your effort and attention by identifying the runtime bottlenecks in your initial prototype. This can be done using code profiling tools. Once a bottleneck has been identified, you can then progress by simplifying the problem and constructing a toy problem that has the same mathematical characteristics as the original bottleneck. By speeding up the toy problem you will learn a lot. You can then apply those learnings, even if only approximately, to the original bottleneck problem.
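As a toy illustration of the diagonal-approximation idea in the first bullet above (my own made-up example, not code from a real project),

import numpy as np

rng = np.random.default_rng(0)
d = 1000

# A strongly diagonally-dominant matrix, standing in for a matrix that is
# expensive to invert exactly in the real calculation
M = np.diag(rng.uniform(5.0, 10.0, size=d)) + 0.01 * rng.normal(size=(d, d))

# Exact (expensive) inverse
M_inv_exact = np.linalg.inv(M)

# Cheap approximation: invert only the diagonal
M_inv_approx = np.diag(1.0 / np.diag(M))

# Crude check of the quality of the approximation
print("Max absolute difference:", np.abs(M_inv_exact - M_inv_approx).max())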

  1. When I first stumbled across Amdahl’s law, I mentioned it to a colleague working on the same project as I was. They were a full-stack software developer and immediately, said “oh, you mean Amdahl’s law about limits on the speed you can write to disk?”. It turns out there is another Amdahl’s Law, often called “Amdahl’s Second Law”, or “Amdahl’s Other Law”, or “Amdahl’s Lesser Law”, or “Amdahl’s Rule-Of-Thumb”. See this blog post, for example, for more details on Amdahl’s Second Law.
  2. Hoyle et. al, “Shared Genomics: High Performance Computing for distributed insights in genomic medical research”, Studies in Health Technology & Informatics 147:232-241, 2009.

© 2024 David Hoyle. All Rights Reserved

The Past, The Future and The Infrequent: Four books on forecasting

Unsurprisingly, given my day job, I ended up reading several books about forecasting in 2023. On reflection, what was more surprising to me was the variety. The four books I have read (admittedly, not all of them cover-to-cover) span the range from a history of general technology forecasting, the current state of the art and near future of forecasting methods in business, right through to the specific topic of intermittent demand modelling. Hence the title of this blogpost, “The past, the future, and the infrequent”, which is a play on the spaghetti western “The Good, The Bad, and The Ugly”. It also gives me an opportunity to play around with my Stability AI prompting skills to create the headline image above.

The Books

Since I really enjoyed all four of the books, I thought I’d post a short summary and review of each of them. The four books are,

Cover images of the four books reviewed in this post.

What the books are about

  1. (Top left) – “A History of the Future: Prophets of Progress from H.G. Wells to Isaac Asimov”, by Peter J. Bowler. Published by Cambridge University Press, 2017. ISBN 978-1-316-60262-1.
    • This is not really a book about forecasting in the way that a Data Scientist would use the word “forecasting”. It is a book about the history of “futurology” – the practice of making predictions about what the world and society will be like in the future, how it will be shaped by technological innovations, and what new technological innovations might emerge. The book reviews the successes and failures of futurologists from the 20th century and what themes were present in those predictions and forecasts. What is interesting is how the forecasts were often shaped by the background and training of the forecaster – forecasts from people with a scientific training or background tended to be more optimistic than those from people with more arts or literary backgrounds. I did read this book from end-to-end.
  2. (Top right) – “Histories of the Future: Milestones in the last 100 years of business forecasting”, by Jonathon P. Karelse. Published by Forbes Books, 2022. ISBN 978-1-955884-26-6.
    • This is another book about the history of forecasting. As one of the reviewers, Professor Spyros Makridakis, says on the inside cover of the book, this is not a “how to” guide. However, each chapter of the book does focus on a prominent forecasting method that is used widely in business settings – Chapter 3 covers exponential smoothing, Chapter 5 covers Holt-Winters, Chapter 7 covers Delphi methods – but each method is introduced and discussed from the historical perspective of how it arose and was used in genuine operational business settings. Consequently, the methods discussed do tend to be the simpler but more robust methods that have stood the test of time in real-world business use, although the final chapter does discuss AI and ML forecasting methods. This is another book I did read end-to-end.
  3. (Bottom left) – “Demand Forecasting for Executives and Professionals”, by Stephan Kolassa, Bahman Rostami-Tabar, and Enno Siemsen. Published by CRC Press, 2023. ISBN 978-1-032-50772-9.
    • This is a technical book. However, it has relatively few equations, and those it does contain are simple and understandable by anyone with high-school maths, or who has taken a maths module in the first year of a Bachelor’s degree. That is deliberate. As the book says in the preface, it “is a high-level introduction to demand forecasting. It will, by itself, not turn you into a forecaster.” The book is aimed at executives and IT professionals whose responsibilities include managing forecasting systems. It is designed to give an overview of the forecasting process as a whole. My only criticism is that, even given the focus on delivering a high-level overview of forecasting and how it should be used and implemented as a process, the topics covered are still ambitious. My experience is that senior managers, even technical ones, won’t have the time to read about ARIMA modelling even at the level it is covered in this book. That said, the breadth of the book (in under 250 pages) and its focus on forecasting as a process is what I like about it. It emphasizes the human element of forecasting via the interaction and involvement that a forecaster or consumer of a forecast has with the forecasting process. These are things you won’t get from a technical book on statistical forecasting methods and that you usually only learn the hard way in practice. If I had an executive or senior IT manager who did want to learn more about forecasting and I could recommend only one book to them, this would be it. As a Data Scientist this is still an interesting book, and there is material in it I have read and learnt from, but it is a book I have only dipped in and out of.
  4. (Bottom right) – “Intermittent Demand Forecasting: Context, Methods and Application”, by John E. Boylan and Aris A. Syntetos. Published by Wiley, 2021. ISBN 978-1-119-97608-0.
    • Professor John Boylan passed away in July 2023. I was fortunate enough to attend a webinar in February 2023 that John Boylan gave about intermittent demand forecasting. I learnt a lot from the webinar. It also meant that I was already familiar with a lot of the context when reading the book, which made it more enjoyable. In fact, the webinar was where I first came across the book. The book is technical. It is the most technical and focused of the four books reviewed here. It is a book on the best statistical models and methodologies for forecasting intermittent demand, particularly for inventory-management applications. It is an in-depth “how-to” book. As far as I am aware, it is the most up-to-date, comprehensive, and authoritative book on intermittent demand forecasting there is. Since it is a technical book, it is one I have dipped in and out of, rather than read end-to-end.

I can genuinely recommend all four books. I enjoyed the first two the most because, personally, I find that reading about the history of how scientific methods and algorithms arise gives extra insight into the nuances of the methods and when and where they work best. The last two books are more “how-to” books – you can find similar material on the internet, in various blog articles, academic papers, etc. However, it is always great to have methods explained by practitioners who are also experts in those methods.

The content of the last three books will be more recognizable to your typical working Data Scientist. The first book is more of a book for historians, but I enjoyed it because the subject matter it addresses, long-range forecasting, is in a domain relevant to me.


© 2024 David Hoyle. All Rights Reserved

Multiplication palindromes

A bit of fun mathematics here. On a cross-country train journey recently, I saw this post from @abakcus on X (formerly Twitter) showing the calculation 111111111 x 111111111 = 12345678987654321.

Be careful if you do the above calculation yourself. The result has 17 digits, so it is beyond the range of integers that a double-precision 64-bit floating-point number can represent exactly (integers up to 2^53, which has 16 digits). This means you may get errors in the least significant digits of the calculation, depending on which language you use. Python integers use arbitrary-precision (“bignum”) arithmetic, so we don’t have to worry about precision issues when multiplying integers in Python. To do the calculation in R I used the Ryacas package, which allows me to do exact, arbitrary-precision calculations. The R code snippet below shows how,

library(Ryacas)
yac_str("111111111^2")

which gives us the output below.

"12345678987654321"

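For comparison, here is a minimal Python check (my own snippet, not from the original post). The integer calculation is exact, whereas forcing the same calculation through 64-bit floats can corrupt the least significant digits,

# Python integers are arbitrary precision, so this is exact
print(111111111 ** 2)              # 12345678987654321

# The same calculation via 64-bit floats exceeds 2**53 and may lose precision
print(int(float(111111111) ** 2))  # can differ in the last digit(s)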
The result of the multiplication, with its palindromic structure, is intriguing. But then you remember that we also have,

11 x 11 = 121

This made me wonder if there was a general pattern here. So I tried a few multiplications and got,

1 x 1 = 1
11 x 11 = 121
111 x 111 = 12321
1111 x 1111 = 1234321
11111 x 11111 = 123454321
111111 x 111111 = 12345654321
1111111 x 1111111 = 1234567654321
11111111 x 11111111 = 123456787654321
111111111 x 111111111 = 12345678987654321

Pretty interesting and I wondered why I had never spotted or been shown this pattern before.

On the return train journey a few days later I decided to occupy a bit of the time by looking for an explanation. With hindsight the explanation was obvious (as is anything in hindsight), and I’m sure many other people have worked this out before. So my apologies if what I’m about to explain seems trivial, but it was something I hadn’t spotted or encountered before.

The first question that sprang to mind is, does it generalize to other bases? That is, in base n do we always get a pattern,

1ₙ x 1ₙ = 1ₙ
11ₙ x 11ₙ = 121ₙ
111ₙ x 111ₙ = 12321ₙ

and so on. Obviously, there is a limitation here. In base n, the highest digit we can see in any position is n-1. So a more precise statement of the question is: for any base n and for any integer k < n, do we always have,

11…1ₙ x 11…1ₙ = 123…(k-1)k(k-1)…321ₙ

where the number being squared consists of k 1s?

Let’s try it out. Remember that for each base n we get n-1 examples (k = 1, 2, …, n-1). The base 2 case is trivial, but it is still worth stating explicitly to highlight the pattern,

1₂ x 1₂ = 1₂

The base 3 case gives two examples. Converting to a base 10 calculation in the middle helps us work out what the final result on the right-hand side should be,

1₃ x 1₃ = 1₃
11₃ x 11₃ = 4 x 4 = 16 = 121₃

The base 4 case gives three examples,

1₄ x 1₄ = 1₄
11₄ x 11₄ = 5 x 5 = 25 = 121₄
111₄ x 111₄ = 21 x 21 = 441 = 12321₄

The base 5 case gives four examples,

1₅ x 1₅ = 1₅
11₅ x 11₅ = 6 x 6 = 36 = 121₅
111₅ x 111₅ = 31 x 31 = 961 = 12321₅
1111₅ x 1111₅ = 156 x 156 = 24336 = 1234321₅
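If you want to check these results, or try higher bases, without writing out the place values by hand, Python’s int(string, base) constructor is handy. This is my own quick check, not part of the original calculations,

# Does 1111 (base 5) squared equal 1234321 (base 5)?
x = int("1111", 5)                  # parse "1111" as a base-5 numeral
print(x)                            # 156
print(x * x)                        # 24336
print(x * x == int("1234321", 5))   # True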

Ok, so there does appear to be a pattern emerging. Can we prove the general case? Let’s do the equivalent of what we have done above, i.e. convert the calculations from a place-value representation to a numeral. The base n number consisting of k 1s is,

1 + n + n^2 + … + n^(k-1) = (n^k - 1)/(n - 1)    (Eq.1)

Meaning that when we square the base n number that consists of k 1s, we get,

(n^k - 1)^2/(n - 1)^2    (Eq.2)

Similarly, we can write the base n number 1234…k….4321 as,

1 + 2n + 3n^2 + … + k n^(k-1) + (k-1) n^k + … + 2n^(2k-3) + n^(2k-2)
  = (1 + n + n^2 + … + n^(k-1))^2
  = (n^k - 1)^2/(n - 1)^2    (Eq.3)

The middle equality in Eq.3 holds because the coefficient of n^m in the expansion of the square is the number of ways of writing m = i + j with both i and j between 0 and k-1, which is exactly the palindromic digit sequence 1, 2, …, k, …, 2, 1 (and k < n guarantees there are no carries). The final line in Eq.3 is the same as Eq.2, so yes, we have proved the pattern holds for any base n and for any k < n.
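As a quick numerical sanity check of Eq.2, here is a short Python snippet of my own, using base 10 and k = 9,

n, k = 10, 9
repunit = (n**k - 1) // (n - 1)   # the base-10 number made of nine 1s
print(repunit)                    # 111111111
print(repunit**2)                 # 12345678987654321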

However, the above calculation feels a bit disappointing. By converting from the original place-value representation we hide what is really going on. Is there a way we can prove the result just using the place-value representations?

Remember we can break down any multiplication a x b into the sum of separate multiplications, each of which involves just one of the digits of b multiplying a. So we can write 11…1 x 11…1 (with k 1s in each number) as,

11…1 x 11…1 = 11…1 x 00…001 + 11…1 x 00…010 + 11…1 x 00…100 + … + 11…1 x 10…000

We can make the idea more concrete by looking at a specific example. Take the five-digit number 11111. We can write the multiplication of 11111 by itself in the following way,

11111 x 11111 = 11111 x 00001
              + 11111 x 00010
              + 11111 x 00100
              + 11111 x 01000
              + 11111 x 10000

Note we haven’t said what base we’re in. That is because the decomposition above holds for any base. Now remember that when we multiply by a number of the form 00010, we shift the digits of whatever number we’re multiplying one place to the left. So, 11111 x 00010 = 111110. Likewise, when we multiply by 00100, we shift all the digits two places to the left, and when we multiply by 01000, we shift them three places to the left.

Now, with all that shifting of digits to the left we’re going to need a bigger register to align all of our multiplications. We’ll pad each product with zeros at the front. For example, 11111 x 00100 = 1111100 now becomes 11111 x 00100 = 001111100. With the extra zeros in place, we can finally write 11111 x 11111 as,

11111 x 11111 = 000011111
              + 000111110
              + 001111100
              + 011111000
              + 111110000
              = 123454321

You can now see how all the 1s on the right-hand side line up in columns, allowing us to count how many we’ve got. Now we must impose the restriction that we are in a base n > 5, so that when we add up all the 1s in a column we don’t get any carry over. If we put the above result in a table format, the column totals become clearer – see below.

Place value      n^8  n^7  n^6  n^5  n^4  n^3  n^2  n^1  n^0
11111 x 00001     0    0    0    0    1    1    1    1    1
11111 x 00010     0    0    0    1    1    1    1    1    0
11111 x 00100     0    0    1    1    1    1    1    0    0
11111 x 01000     0    1    1    1    1    1    0    0    0
11111 x 10000     1    1    1    1    1    0    0    0    0
Column total      1    2    3    4    5    4    3    2    1

Now it becomes clearer why we get the result 11111 x 11111 = 123454321 for any base n > 5. It is due to the shifting and aligning of multiple copies of the original starting number 11111.
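As a final check of the column-counting argument, here is a minimal Python sketch; the function name repunit_square_digits is my own, not something from the original working. It builds the digits of the square by summing the k shifted copies column by column, which gives the correct digits in any base n > k (so that there are no carries),

def repunit_square_digits(k):
    # Sum the k shifted copies of 11...1 column by column (no carries if base > k)
    columns = [0] * (2 * k - 1)
    for shift in range(k):          # one shifted copy per digit of the multiplier
        for pos in range(k):        # each copy contributes a 1 to k consecutive columns
            columns[shift + pos] += 1
    return columns[::-1]            # most significant column first

print(repunit_square_digits(5))     # [1, 2, 3, 4, 5, 4, 3, 2, 1]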

The generalization to any number of 1s in the number we are squaring is obvious. This means we can also extend the palindromic pattern beyond base 10, for example to hexadecimal, and multiply the hexadecimal number 111111111111111₁₆ (fifteen 1s) by itself to get,

111111111111111₁₆ x 111111111111111₁₆ = 123456789ABCDEFEDCBA987654321₁₆

Which means we get a pleasing palindrome that includes all of the alphabetic hexadecimal digits A, B, C, D, E, F. Try it in Python using the line of code below,

hex(0x111111111111111 * 0x111111111111111)

You should get the output ‘0x123456789abcdefedcba987654321’.

© 2023 David Hoyle. All Rights Reserved