
Data Science Notes: 2. Log-Sum-Exp

Summary

  • Summing up many probabilities that are on very different scales often involves calculation of quantities of the form \log\left ( \sum_{k} \exp\left( a_{k}\right )\right ). This calculation is called log-sum-exp.
  • Calculating log-sum-exp the naive way can lead to numerical instabilities. The solution to this numerical problem is the “log-sum-exp” trick.
  • The scipy.special.logsumexp function provides a very useful implementation of the log-sum-exp trick.
  • The log-sum-exp function also has uses in machine learning, as it is a smooth, differentiable approximation to the {\rm max} function.

Introduction

This is the second in my Data Science Notes series. The first, on Bland-Altman plots, can be found here. This post is on a very simple numerical trick that ensures accuracy when adding lots of probability contributions together. The trick is so simple that implementations of it exist in standard Python packages, so you only need to call the appropriate function. However, you still need to understand why you can’t just naively code up the calculation yourself, and why you need to use the numerical trick. As with the Bland-Altman plots, this is something I’ve had to explain to another Data Scientist in the last year.

The log-sum-exp trick

Sometimes you’ll need to calculate a sum of the form \sum_{k} \exp\left ( a_{k}\right ), where you have values for the a_{k}. Really? Will you? Yes, most likely when you are calculating a log-likelihood, or a contribution to a log-likelihood, so the actual calculation you want to do is of the form,

\log\left ( \sum_{k} \exp \left ( a_{k}\right )\right )

These sorts of calculations arise where you have log-likelihood or log-probability values a_{k} for individual parts of an overall likelihood calculation. If you come from a physics background you’ll also recognise the expression above as the calculation of a log-partition function.

So what we need to do is exponentiate the a_{k} values, sum them, and then take the log at the end. Hence the expression “log-sum-exp”. But why a blogpost on “log-sum-exp”? Surely, it’s an easy calculation. It’s just np.log(np.sum(np.exp(a))), right? Not quite.

It depends on the relative values of the a_{k}. Do the naïve calculation np.log(np.sum(np.exp(a))) and it can be horribly inaccurate. Why? Because of overflow and underflow errors.

If we have two values a_{1} and a_{2} and a_{1} is much bigger than a_{2}, when we add \exp(a_{1}) to \exp(a_{2}) we are using floating point arithmetic to try and add a very large number to a much smaller number. Most likely we will get an overflow error. It would be much better if we’d started with \exp(a_{1}) and asked how much adding \exp(a_{2}) to it would change things. In fact, we could pre-compute a_{2} - a_{1}, which would be very negative, and from this we could easily infer that adding \exp(a_{2}) to \exp(a_{1}) would make very little difference. Indeed, we could just approximate \exp(a_{1}) + \exp(a_{2}) by \exp(a_{1}).

But how negative does a_{2} - a_{1} have to be before we ignore the addition of \exp(a_{2})? We can set some pre-specified threshold, chosen to avoid overflow or underflow errors. From this, we can see how to construct a little Python function that takes an array of values a_{1}, a_{2},\ldots, a_{N} and computes an accurate approximation to \log\left ( \sum_{k=1}^{N}\exp(a_{k})\right ) without encountering underflow or overflow errors.

In fact, we can go further and approximate the whole sum by first of all identifying the maximum value in an array a = [a_{1}, a_{2}, \ldots, a_{N}]. Let’s say, without loss of generality, the maximum value is a_{1}. We could ensure this by first sorting the array, but it isn’t necessary to actually do this to get the log-sum-exp trick to work. We can then subtract a_{1} from all the other values of the array, and we get the result,

\log \left ( \sum_{k=1}^{N} \exp \left ( a_{k} \right )\right )\;=\; a_{1} + \log \left ( 1\;+\;\sum_{k=2}^{N} \exp \left ( a_{k} - a_{1} \right )\right )

The values a_{k} - a_{1} are all negative for k \ge 2, so we can easily approximate the logarithm on the right-hand side of the equation by a suitable expansion of \log (1 + x). This is the “log-sum-exp” trick.
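
To make the trick concrete, here’s a minimal hand-rolled sketch of it (my own illustrative code; in practice you should prefer the SciPy implementation discussed below),

# import numpy
import numpy as np

# a minimal sketch of the log-sum-exp trick: shift by the maximum value
def log_sum_exp(a):
    a = np.asarray(a, dtype=float)
    a_max = np.max(a)
    return a_max + np.log(np.sum(np.exp(a - a_max)))

# values large enough that the naive calculation overflows
a = np.array([1000.0, 999.0, 998.0])

print(log_sum_exp(a))             # approximately 1000.41
print(np.log(np.sum(np.exp(a))))  # overflows and prints inf (with a warning)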

The great news is that this “log-sum-exp” calculation is so common in different scientific fields that there are already Python functions written to do this for us. There is a very convenient “log-sum-exp” function in the SciPy package, which I’ll demonstrate in a moment.

The log-sum-exp function

The sharp-eyed amongst you may have noticed that the last formula above gives us a way of providing upper and lower bounds for the {\rm max} function. We can simply re-arrange the last equation to get,

{\rm max} \left ( a_{1}, a_{2},\ldots, a_{N} \right ) \;\le\; \log \left ( \sum_{k=1}^{N}\exp \left ( a_{k}\right )\right )

The logarithm calculation on the right-hand side of the inequality above is what we call the log-sum-exp function (lse for short). So we have, 

{\rm max} \left ( a_{1}, a_{2},\ldots, a_{N} \right ) \;\le\; {\rm lse}\left ( a_{1}, a_{2}, \ldots, a_{N}\right )

This gives us an upper bound for the {\rm max} function. Since a_{k} \le {\rm max}\left ( a_{1}, a_{2},\ldots,a_{N}\right ), it is also relatively easy to show that,

{\rm lse} \left ( a_{1}, a_{2}, \ldots, a_{N}\right )\;\le\; {\rm max}\left ( a_{1}, a_{2},\ldots, a_{N} \right )\;+\;\log N

and so {\rm lse} - \log N gives us a lower bound for the {\rm max} function. So the log-sum-exp function allows us to compute lower and upper bounds for the maximum of an array of real values, and it can provide an approximation of the maximum function. The advantage is that the log-sum-exp function is smooth and differentiable. In contrast, the maximum function itself is neither smooth nor differentiable everywhere, and so is less convenient to work with mathematically. For this reason the log-sum-exp is often called the “real-soft-max” function because it is a “soft” version of the maximum function. It is often used in machine learning settings to replace a maximum calculation.
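
A quick numerical check of the two bounds (a small sketch with arbitrary values, using the same max-shift trick as above),

import numpy as np

# arbitrary example values
a = np.array([2.0, 1.5, -0.3, 0.7])

# numerically stable log-sum-exp via the max-shift trick
lse = np.max(a) + np.log(np.sum(np.exp(a - np.max(a))))

# check that max(a) <= lse <= max(a) + log(N)
print(np.max(a), lse, np.max(a) + np.log(len(a)))
print(np.max(a) <= lse <= np.max(a) + np.log(len(a)))  # True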

Calculating log-sum-exp in Python

So how do we calculate the log-sum-exp function in Python? As I said, we can use the SciPy implementation, which is in scipy.special. All we need to do is pass an array-like set of values a_{k}. I’ve given a simple example below,

# import the packages and functions we need
import numpy as np
from scipy.special import logsumexp

# create the array of a_k values 
a = np.array([70.0, 68.9, 20.3, 72.9, 40.0])

# Calculate log-sum-exp using the scipy function 
lse = logsumexp(a)

# look at the result
print(lse)

This will give the result 72.9707742189605.

The example above and several more can be found in the Jupyter notebook DataScienceNotes2_LogSumExp.ipynb in the GitHub repository https://github.com/dchoyle/datascience_notes

The great thing about the SciPy implementation of log-sum-exp is that it allows us to include signed scale factors, i.e. we can compute,

\log \left ( \sum_{k=1}^{N} b_{k}\exp\left ( a_{k}\right ) \right )

where the values b_{k} are allowed to be negative. This means that when we are using the SciPy log-sum-exp function to perform the log-sum-exp trick, we can actually use it to calculate numerically stable estimates of sums of the form,

\log \left ( \exp\left ( a_{1}\right ) \;-\; \exp\left ( a_{2}\right ) \;-\;\exp\left ( a_{3}\right )\;+\;\exp\left ( a_{4}\right )\;+\ldots\; + \exp\left ( a_{N}\right )\right ).

Here’s a small code snippet illustrating the use of scipy.special.logsumexp with signed contributions,

# Create the array of the a_k values
a = np.array([10.0, 9.99999, 1.2])
b = np.array([1.0, -1.0, 1.0])

# Use the scipy.special log-sum-exp function
lse = logsumexp(a=a, b=b)

# Look at the result
print(lse)

This will give the result 1.2642342014146895.

If you look at the output of the example above you’ll see that the final result is much closer to the value of the last array element a_{3} = 1.2. This is because the first two contributions, \exp(a_{1}) and \exp(a_{2}), almost cancel each other out, since the contribution \exp(a_{2}) is multiplied by a factor of -1. What we are left with is something close to \log(\exp(a_{3}))\;=\; a_{3}.

There is also a small subtlety in using the SciPy logsumexp function with signed contributions. If the subtraction of some terms leads to an overall negative result, scipy.special.logsumexp will return NaN as the result. In order to get it to always return a result for us, we have to tell it to return the sign of the final summation as well, by setting the return_sign argument of the function to True. Again, you can find the code example above and others in the notebook DataScienceNotes2_LogSumExp.ipynb in the GitHub repository https://github.com/dchoyle/datascience_notes.
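
Here’s a small sketch of what that looks like (the values are illustrative; the first term is smaller than the subtracted second term, so the signed sum is negative),

import numpy as np
from scipy.special import logsumexp

# signed contributions whose overall sum is negative: exp(1.0) - exp(2.0) < 0
a = np.array([1.0, 2.0])
b = np.array([1.0, -1.0])

# return_sign=True gives us log(|sum|) and the sign of the sum
lse, sign = logsumexp(a=a, b=b, return_sign=True)
print(lse, sign)  # approximately 1.5413 and -1.0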

When you are having to combine lots of different probabilities that are on very different scales, and you need to subtract some of them and add others, the SciPy log-sum-exp function is very useful.

© 2026 David Hoyle. All Rights Reserved

The need for simulation

TL;DR: Poor mathematically-based design and testing of models can lead to significant problems in production. Finding suitable ground truth data for testing of models can be difficult. Yet, many Data Science models make it into production without appropriate testing. In these circumstances testing with simulated data can be hugely valuable. In this post I explain why and how. In fact, I argue that testing Data Science models with simulated data should be non-negotiable.

Introduction

Imagine a scenario. You’re the manager of a Premier League soccer team. You wouldn’t sign a new striker without testing if they could actually kick a ball. Would you?

In the bad old days before VAR it was not uncommon for a big centre-back to openly punch a striker in the face if the referee and assistant referees weren’t looking. Even today, just look at any top-flight soccer match and you’ll see the blatant holding and shirt-pulling that goes on. Real-world soccer matches are dirty. A successful striker has to deal with all these realities of the game, whilst also being able to kick the ball in the net. At the very least, when signing a new striker you’d want to test whether they could score under ideal benign conditions. Wouldn’t you? You’d put the ball on the penalty spot, with an open goal, and see if your new striker could score. Wouldn’t you? Passing this test wouldn’t tell you that your striker will perform well in a real game, but if they fail this “ideal conditions” test it will tell you that they won’t perform well in real circumstances. I call this the “Harry Redknapp test” – some readers will understand the reference1. If you don’t, then read the footnote for an explanation.

How is this relevant to Data Science? One of the things I routinely do when implementing an algorithm is to test that implementation on simulated data. However, a common reaction I get from other Data Scientists is, “oh I don’t test on simulated data, it’s not real data. It’s not useful. It doesn’t tell you anything.” Oh yes it does! It tells you whether the algorithm you’ve implemented is accurate under the ideal conditions it was designed for. If your implementation performs badly on simulated data, you have a big problem! Your algorithm or your implementation of it has failed the “Harry Redknapp test”.

“Yeah, but I will have some ground-truth data I can test my implementation on instead, so I don’t need simulated data.” Not always. Are you 100% sure that that ground-truth data is correct? And what if you’re working on an unsupervised problem?

“Ok, but the chances of an algorithm implemented by experienced Data Scientists making it into production untested and with really bad performance characteristics are small”. Really!? I know of at least one implemented algorithm in production at a large organization that is actually an inconsistent estimator. An inconsistent estimator is one of the biggest sins an algorithm can commit. It means that even as we give the algorithm more and more ideal training data, it doesn’t produce the correct answer. It fails the “Harry Redknapp test”. I won’t name the organization in order to protect the guilty. I’ll explain more about inconsistent estimators later on.

So maybe I’ve convinced you that simulated data can be useful. But what can it give you, what can’t it give you, and how do you go about it?

What simulation will give you and what it won’t

To begin, we need to highlight some general but very important points about using simulated data:

  1. Because we want to generate data, we need a model of the data generation process, i.e. we need a generative model2.
  2. Because we want to mimic the stochastic nature of real data, our generative model of the data will be a probabilistic one.
  3. Because we are generating data from a model, what we can test are algorithms and processes that use that data, e.g. a parameter estimation process. We cannot test the model itself. Our conclusions are conditional on the model form being appropriate.

With those general points emphasized, let’s look in detail at what we can get from testing with simulated data.

What simulated data will give you

We can get a great deal from simulated data. As we said above, what we get is insight into the performance of algorithms that process the data, such as the parameter estimation process. Specifically, we can check whether our parameter estimation algorithm is, under ideal conditions,

  • Consistent
  • Unbiased
  • Efficient
  • Robust

I’ll explain each of these in detail below. We can also get insight into how fast our parameter estimation process runs or how much storage it requires. Running tests using simulated data can be extremely useful.

Consistency check

As a Data Scientist you’ll be familiar with the idea that if we have only a small amount of training data our parameter estimates for our trained model will not be accurate. However, if we have a lot of training data that matches the assumptions on which our parameter estimation algorithm is based, then we expect the trained parameter estimates to be close to their true values, i.e. close to the values which generated the data. As we increase the amount of training data, we expect our parameter estimates to get more and more accurate, converging ultimately to the true values in the limit of an infinite amount of training data. This is consistency.

In statistics, a formula or algorithm for estimating the parameters of a model is called an estimator. There can be multiple different estimators for the same model, some better than others. A consistent estimator is one whose estimates converge to the true value in the limit of an infinite amount of training data. An inconsistent estimator is one whose estimates don’t converge to the true value, no matter how much training data we give it. Think about that for a moment,

An inconsistent estimator is an algorithm that doesn’t get better even when we give it a load more training data.

That is a bad algorithm! That is why I say constructing an inconsistent estimator is one of the worst sins a Data Scientist can commit. Very occasionally (rarely), an inconsistent estimator is constructed because it has other useful properties. But in general, if you encounter an inconsistent estimator you should take it as a sign of incompetence on the part of the Data Scientist who constructed it.

“Okay, okay, I get it. Inconsistent estimators are bad. But I don’t have an infinite amount of training data, so how can I actually check if my algorithm produces a consistent estimator? Surely, it can’t be done?” Yes, it can be done. What we’re looking for is convergence, i.e. parameter estimates getting closer and closer to the true values as we increase the training set size. I’ll give a demonstration of this in the next section when I show how to set up some simulation tests.

Bias check

Along with the concept of consistency comes the concept of bias. We said that a consistent estimator converges to the true value in the limit of an infinite amount of training data. However, that doesn’t mean a consistent estimator has an expectation value that is equal to the true value for a finite amount of training data. It is possible to have a consistent estimator that is biased. This means the estimator, on average, will differ from the true value when we use a finite amount of training data. For a consistent estimator, if it is biased, that bias will disappear as we continually increase the amount of training data.

As you might have guessed, the best algorithms produce estimators that are consistent and unbiased. Knowing if your estimator is biased and by how much is extremely useful. Again, we can assess bias using simulated data, and I’ll show how to do this in the next section when I show how to set up some simulation tests.

Efficiency check

So far, we have spoken about the expectation or average properties of an algorithm/estimator. But what about its variance? It is all very well telling me that across lots of different instances of training datasets my algorithm would, on average, get the right answer, or near the right answer, under ideal conditions, but in the real world I have only one training dataset. Am I going to be lucky, such that my particular training data gives parameter estimates close to the average behaviour of the algorithm? I’ll never know. But what I can know is how variable the parameter estimates from my algorithm are. I can do this by calculating the variance of the parameter estimates over lots of training datasets. A small variance will tell me that my one real-world dataset is likely to have performance close to the mean behaviour of the algorithm. I may still be unlucky with my particular training data and the parameter estimates are a long way from the average estimates, but it is unlikely. However, a large variance tells me that parameter estimates obtained from a single training dataset will often be a long way from the average estimates.

How can I calculate this variance of parameter estimates over training datasets? Simple, get lots of different training datasets produced under identical controlled conditions. How could I do that? Yep, you guessed it. Simulation. With a simulation process coded up, we can easily generate multiple instances of training datasets of the same size and generated under identical conditions. Again, I’ll demonstrate this in the next section.

Sensitivity check – robustness to contamination

Our message about simulated data is that it allows you to test your algorithm under conditions that match the assumptions made by the algorithm, i.e. under ideal conditions. But you can also use simulation to test how well your algorithm performs in non-ideal conditions. We can introduce contamination into the simulated data, for example drawing some response variable values from a non-Gaussian distribution if our algorithm has assumed the response variable is purely Gaussian distributed. We can produce multiple simulated datasets with different percentages of contamination and so test how sensitive or robust our estimation algorithm is to the level of contamination, i.e. how sensitive it is to non-ideal data.

In the first few pages of the first chapter of his classic textbook on Robust Statistics, Peter Huber describes analysis of an experiment originally due to John Tukey. The analysis reveals that even having just 2% of “bad” datapoints being drawn from a different Gaussian distribution (with a 3-fold larger standard deviation) is enough to markedly change the properties and efficiency of common statistical estimators. And yet, defining “bad” data as being drawn from a larger variance Gaussian is wonderfully simplistic. Real-world data is so much nastier.
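
As a small illustration of that style of contamination (my own sketch, not code from Huber’s example), we can draw a user-controlled fraction of values from a Gaussian with a 3-fold larger standard deviation and see how a simple estimator responds,

import numpy as np

rng = np.random.default_rng(42)

# draw n values from a standard Gaussian, with a fraction `contamination`
# drawn instead from a Gaussian with a 3-fold larger standard deviation
def contaminated_sample(n, contamination=0.02, scale_factor=3.0):
    is_bad = rng.random(n) < contamination
    x = rng.normal(0.0, 1.0, size=n)
    x[is_bad] = rng.normal(0.0, scale_factor, size=is_bad.sum())
    return x

# compare a simple estimator (the sample standard deviation) at
# different contamination levels
for eps in [0.0, 0.02, 0.1]:
    x = contaminated_sample(100_000, contamination=eps)
    print(eps, x.std())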

What form should the data contamination take? There are multiple ways in which data can become contaminated. There can be changes in statistical properties, like the simple example we used above, or drift in statistical properties such as a non-stationary mean or a non-stationary variance. But you can get more complicated errors creeping into your data. These typically take two forms,

  • Human induced data contamination: These can be misspelling or mis-(en)coding errors that result from not using controlled and validated vocabularies for human data-entry tasks. You’ll recognize these sorts of errors when you see multiple different variants for the name of the same country, US county or UK city, say. You might think it is difficult to simulate such errors, but there are some excellent packages to do so – check out the messy R package produced by Dr. Nicola Rennie that allows you to take a clean dataset and introduce these sorts of encoding errors into it. Spotting these errors can be as simple as plotting distributions of unique values in a table column, i.e. looking for unusual distributions. In R there are a number of packages to help you do this.
  • Machine induced errors: These are errors that arise from the processing or transferring of data. These can be as simple as incorrect datetime stamps on rows in a database table, or can be as complex as repeating blocks of rows in a table. These errors are less about contamination and more about alteration. The common element here is that there is a pattern to how the data has become altered or modified and so spotting the errors involves visual inspection of the individual rows of the table, combined with plotting lagged or offset data values. The machine induced errors arise because of bugs in processing code, and these can be either coding errors, e.g. a typo in the code, or unintended behaviour, e.g. datetime processing code that hasn’t been designed properly to correctly handle daylight saving switchovers.

What kind of data contamination should I simulate? This is a “how long is a piece of string” kind of question. It very much depends on what aspect of your algorithm or implementation you want to test for robustness, and only you can know that. You may have to write some bespoke code to simulate the sorts of errors that arise in the processes you use or are exposed to. Broadly speaking, the robustness of an estimator is tested by changes in the statistical properties of the input data, and these can be simulated by changes in the distributions of the data due to data drift or human-induced contamination. Machine-induced errors, on the other hand, imply you have some sort of deployed pipeline, and so simulating machine-corrupted data is best when you want to stress-test your end-to-end pipeline.

Runtime scaling

There are also checks that simulated data allows you to perform that aren’t necessarily directly connected to the accuracy or efficiency of the parameter estimates. Because we can produce as much simulated data as we want, we can easily test how long our estimation algorithm takes for different sized datasets. Similarly, we can also use simulated data to test the memory and storage requirements of the algorithm.

We can continue this theme. Because we can tune and tweak the generation of the simulated data, this can also allow us to generate data to test very specific scenarios – corner cases – for which we don’t have real test data. The ability to generate simulated data increases the test coverage we can perform.

What simulated data won’t give you

Identify model mis-specification

Using simulated data will tell you how well your model training algorithm performs on data that matches precisely the form of the model you have used. It won’t tell you if your model form is correct or appropriate for the real data you will ultimately apply it to. It won’t tell you if you’ve omitted an important feature or if you’ve put non-linearity into your model in an incorrect way. Getting the model form right can only come from i) domain expertise, ii) testing on real ground-truth data. Again, what this highlights is that we use simulated data to test the training process, not the model.

This can trip up even experienced researchers. I recently saw a talk from an academic researcher who tested two different model forms using simulated data generated from one of the models. When the model form used to generate the data fitted the simulation data better, they confidently claimed that this model was better and more correct. Well, of course it was for this simulated data!

Accuracy of your model on real data

For simulated data we have the ground-truth values of the response variable, so we can assess the prediction accuracy, either on training data or on holdout test data. However, unless our simulation process produced very realistic data, including the various contamination processes, the test set accuracy on simulated data cannot be used as a precise measure of the predictive accuracy of the trained model on real unseen data.

How to simulate

When producing simulated data for testing an algorithm related to a model there are two things we need to generate – the features and the response. There are two ways we can approach this,

  1. Simulating the features and then simulating the response given the feature values we just produced.
  2. Simulating just the response value given some pre-existing feature values.

Of these, 2 sounds easier, but I will discuss 1 first as it leads us naturally into discussing where we might get pre-existing feature values from.

Simulating features and response

As we said above, in this approach we simulate the features first, and this allows us to construct the distribution of the response variable conditional on the features. We can then sample a value from that conditional distribution. Our basic recipe is

  1. Sample the feature values from a distribution.
  2. Use the sampled feature values and the model form to construct the distribution of the response variable conditional on the features.
  3. Sample the response variable from the conditional distribution constructed in 2.

How complex we want to make the feature distribution depends on how realistic we need our features to be and what aspect of the estimation/training algorithm we are wanting to test.

For real-world problems, it is unlikely that the features follow a Gaussian distribution. Take demand modelling, an area I have worked in a lot. The main feature we use is the price of the product whose demand we are trying to predict. Prices are definitely not Gaussian distributed. Retailers repeatedly switch between a regular and promotional price over a long period of time, so that we have a sample distribution of prices that is represented by two Dirac-delta functions. A more interesting price time series may introduce a few more price points, but it is still definitely not Gaussian. Similarly, real data has correlations between features.

When simulating a feature, we have to decide how important the real distribution is to the aspect of the estimation/training algorithm that we want to test. If we want to simulate with realistically distributed features, this can be problematic. We’ll return to this issue and real data later on, but for now we emphasize that we can still test whether our estimator is consistent or assess its bias using features drawn from independent Gaussian distributions. So there are still useful tests of our estimation algorithm we can carry out. Let’s see how we can do that.

Linear model example

We’ll use a simple linear model that depends on three features, x_{1}, x_{2}, x_{3}. The response variable y is given by,

y\; =\;\beta_{1} x_{1}\;+\; \beta_{2}x_{2}\;+\;\beta_{3}x_{3} \;+\;\epsilon\;\;\;\;,\;\; \epsilon\;\sim\; {\cal{N}}\left ( 0, \sigma^{2}_{\epsilon}\right )

From which you can see both the linear dependence on the features and that y contains Gaussian additive noise \epsilon.

Simulating data is now easy once we have the structure of our probabilistic model. Given a user-specified mean \mu_{1} and variance \sigma^{2}_{1} we can easily sample a value for x_{1} from {\cal{N}}\left ( \mu_{1}, \sigma^{2}_{1}\right ). Similarly, given user-specified means \mu_{2}, \mu_{3} and variances \sigma^{2}_{2}, \sigma^{2}_{3}, we can generate values for x_{2} and x_{3}.  If we have user-specified values of \beta_{1}, \beta_{2}, \beta_{3} we can then easily generate a value for y by sampling from {\cal{N}}\left ( \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}, \sigma^{2}_{\epsilon} \right ), where \sigma^{2}_{\epsilon} is the variance of the additive noise that we want to add to our response variable. To simulate N datapoints we repeat that recipe N times. Let’s apply that recipe to assess an estimator of the model parameters \beta_{1}, \beta_{2}, \beta_{3}. We’ll assess the standard Ordinary Least Squares (OLS) estimator for a linear model.
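
A minimal sketch of that data-generation recipe for this linear model might look like the following (the parameter values, feature means and variances are illustrative choices of my own),

import numpy as np

rng = np.random.default_rng(0)

# illustrative true parameter values and feature/noise settings
beta_true = np.array([2.0, -1.0, 0.5])
mu = np.array([0.0, 1.0, -2.0])      # feature means mu_1, mu_2, mu_3
sigma = np.array([1.0, 0.5, 2.0])    # feature standard deviations
sigma_eps = 0.3                      # standard deviation of the additive noise

# simulate n datapoints by repeating the three-step recipe n times (vectorized)
def simulate_dataset(n):
    X = rng.normal(loc=mu, scale=sigma, size=(n, 3))
    y = X @ beta_true + rng.normal(0.0, sigma_eps, size=n)
    return X, y

X, y = simulate_dataset(1000)
print(X.shape, y.shape)  # (1000, 3) (1000,)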

Assessing the OLS Estimator for a linear model

Given a feature matrix \underline{\underline{X}} (the ith row of the matrix is the feature values for the ith observation) and vector \underline y = \left ( y_{1}, y_{2},\ldots,y_{N}\right ) that represents the N observations of the response variable, then the Ordinary Least Squares (OLS) estimator \hat{\beta} of the true model parameters \underline{\beta} = \left ( \beta_{1}, \beta_{2}, \beta_{3}\right ) is given by the formula,

\underline{\hat{\beta}}\;=\; \left ( \underline{\underline{X}}^{\top}\, \underline{\underline{X}}\right ) ^{-1} \underline{\underline{X}}^{\top} \underline{y}\;\;\;\;\;\;\;{\rm Eq.1}

Note that the OLS estimator is a linear combination of the observations y_{1}, y_{2}, \ldots, y_{N}, with a weight matrix \left ( \underline{\underline{X}}^{\top}\, \underline{\underline{X}}\right )^{-1} \underline{\underline{X}}^{\top}. We’ll come back to this point in a moment.

What we want to know is how close is the estimate \underline{\hat{\beta}} to \underline{\beta}. Is the OLS estimator in the Eq.1 above a biased estimator of \underline{\beta}, and is it a consistent estimator?

The plots below show the bias (mean error)  for each of the model parameters, plotted against training dataset size N. I constructed the plots by initializing a true model parameter vector \underline{\beta} and then generating 1000 simulated training datasets for each of the different values of N. For each simulated training dataset I computed the OLS parameter estimate \hat{\underline{\beta}} and then computed the parameter estimate errors \hat{\underline{\beta}} - \underline{\beta}. From the errors I then calculated their sample means and variances (over the simulations) for each value of N.
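
A compact sketch of that assessment loop is shown below (I’ve used fewer simulated datasets than the 1000 used for the plots, and my own illustrative parameter values, but the structure is the same),

import numpy as np

rng = np.random.default_rng(1)

beta_true = np.array([2.0, -1.0, 0.5])
sigma_eps = 0.3
n_sims = 200   # number of simulated training datasets per value of N

for n in [50, 100, 500, 1000]:
    errors = np.empty((n_sims, 3))
    for s in range(n_sims):
        # generate one simulated training dataset of size n
        X = rng.normal(size=(n, 3))
        y = X @ beta_true + rng.normal(0.0, sigma_eps, size=n)
        # OLS estimate via the normal equations (Eq.1)
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        errors[s] = beta_hat - beta_true
    # sample mean (bias) and variance of the parameter errors at this n
    print(n, errors.mean(axis=0), errors.var(axis=0))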

You can see from the plots that whilst the mean error fluctuates it doesn’t systematically change with N. Furthermore, it fluctuates around zero, suggesting that the OLS estimator is unbiased. And indeed it is. It is possible to mathematically show that the OLS estimator is unbiased at any finite value of N. The reason we get a non-zero value in this case is because we have estimated \mathbb{E}\left ( \hat{\beta}_{i}\right ) using a sample average taken over 1000 simulated datasets. If we had used a larger number of simulated datasets we would have got even smaller sample average parameter errors.

Contrast this behaviour with how the variances of the parameter estimate errors change with N in the plots below.

The decrease, with N, in the variance of \hat{\underline{\beta}} - \underline{\beta} is marked. In fact, it looks like a power-law decrease, so I have plotted the same data on a log-scale below,

We can see from those log-log plots that the variances of \hat{\beta}_{i} - \beta_{i},\; i=1,2,3 decrease as N^{-1}. That implies that as we use larger and larger training sets any single instance of \hat{\underline{\beta}} will get closer and closer to \underline{\beta}. At large N we have a low probability of being unlucky and our particular training set giving a poor estimate of \underline{\beta}.

How efficient is the OLS estimator in Eq.1? Is the rate at which {\rm Var}\left ( \hat{\beta}_{i} - \beta_{i}\right ) decreases with N good or bad? It turns out that the OLS estimator in Eq.1 is the Best Linear Unbiased Estimator (BLUE). For an unbiased estimator of \underline{\beta} that is constructed as a linear combination of the observations \underline{y}, you cannot do better than the OLS estimator in Eq. 1.

All the code for the linear model example is available in the Jupyter notebook NeedForSimulation_Blogpost.ipynb in the GitHub repository https://github.com/dchoyle/simulation_blogpost.

A linear model is a relatively simple structure, but the example was a good demonstration of the power of simulated data. Next, we’ll use a more complex model architecture and build a feed-forward neural network.

Neural network example

Our simulated neural network output has the form,

y\;=\; f\left( \underline{x}| \underline{\theta} \right ) \;+\; \epsilon

Again, we’ll use zero-mean Gaussian additive noise, \epsilon \sim {\cal{N}}\left (0, \sigma^{2}_{\epsilon}\right ).

The function f\left( \underline{x}| \underline{\theta} \right ) represents our neural network function, with \underline{x} being the vector of input features and \underline{\theta} being a vector holding all the network parameters. For this demo, I’m going to use a 3 input-node, 2 hidden-layer feed-forward network, with 10 nodes in each of the hidden layers. The output layer consists of a single node, representing the variable y. For the non-linear transfer (activation) functions I’m going to use \tanh functions. So, schematically, my network looks like the figure below,

I’m going to use a teacher network of the form above to generate simulated data, which I’ll then use to train a student network of the same form. What I want to test is, does my training process produce a trained student network whose predictions on a test set get more and more accurate as I increase the amount of training data? If not, I have a problem. If my training process doesn’t produce accurate trained networks on ideal data, the training process isn’t going to produce accurate networks when using real data. I’m less interested in comparing trained student network parameters to the teacher network parameters as, a) there are a lot of them to compare, b) since the output of a network is invariant to within-layer permutation of the hidden layer node labels and connections, defining a one-to-one comparison of network parameters is not straightforward here. Node 1 in the first hidden layer of the student isn’t necessarily equivalent to node 1 in the first hidden layer of the teacher network, and so on.
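
To make the data-generating (teacher) side concrete, here is a minimal numpy sketch of a network of the form just described; it is my own illustrative code, not the implementation from the notebook,

import numpy as np

rng = np.random.default_rng(2)

# teacher parameters: 3 inputs -> 10 hidden -> 10 hidden -> 1 output, tanh units
W1, b1 = rng.normal(size=(3, 10)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 10)), rng.normal(size=10)
W3, b3 = rng.normal(size=(10, 1)), rng.normal(size=1)
sigma_eps = 0.1

# forward pass of the teacher network f(x | theta)
def teacher(X):
    h1 = np.tanh(X @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return (h2 @ W3 + b3).ravel()

# simulated training data: Gaussian input features, teacher output plus noise
X_train = rng.normal(size=(1000, 3))
y_train = teacher(X_train) + rng.normal(0.0, sigma_eps, size=1000)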

The details of how I’ve coded up the networks and set up the evaluation are lengthy, so I’ll just show the final result here. All the details can be found in the Jupyter notebook NeedForSimulation_Blogpost.ipynb in the freely accessible GitHub repository.

Below, in the left-hand plot, I’ve plotted the average Mean Square Error (MSE) made by the trained student network on the test-sets. I’ve plotted the average MSE against the training dataset size. The average MSE is the average over the simulations of that training set size. For comparison, I have also calculated the average test-set MSE of the teacher network. Since the test-set data contains additive Gaussian noise, the teacher network won’t make perfect predictions on the test-set data even though the teacher network generated the systematic part of the test-set response values. The average test-set MSE of the teacher network provides a benchmark or baseline against which we can assess the trained student network. We have a ready intuition about the relative test-set MSE value, i.e. the ratio of the student network’s test-set MSE to the teacher network’s. We expect the relative test-set MSE to be significantly above 1 at small values of N, as the student network struggles to learn the teacher network output. As the amount of training data N increases we expect the relative test-set MSE value to approach 1 from above. The average relative test-set error is plotted in the right-hand plot below.

We can see from both plots above that the prediction error of a trained student network typically decreases with an increasing amount of training data. My network training process has passed this basic test. The test was quick to set up and gives me confidence I can run my code over real data.

Sampling features from more realistic distributions

In our previous examples we have used independent features, sampled from simple but naive distributions, to test the convergence properties of an estimator. But what happens if you want to assess the quantitative performance of an estimator for more realistic feature patterns? Well, we use more realistic feature patterns. This is a variant of our previous basic recipe, but where we have access to a real dataset. The modified recipe is,

  1. Sample an observation from the real dataset and keep the features.
  2. Use the sampled feature values and the model form to construct the distribution of the response variable conditional on the features.
  3. Sample the response variable from the conditional distribution constructed in 2.

This seems like a small modification of the recipe. However, it does have some big implications. We can’t generate simulated datasets of arbitrarily large size as we are limited by the size of the real dataset. We can obviously generate simulated datasets of smaller size than the real data, but this can make testing of the convergence properties of an estimator difficult.

That said, this is one of my favourite approaches. Often, steps 2 and 3 are easy to implement. You’ll have a function for the conditional mean of the response variable already coded up for prediction purposes, so it is just a question of pushing some feature values through that code. I find the overhead of writing extra functions to simulate realistic looking feature values is significant, both in terms of time and thinking about what ‘realistic’ should look like. The recipe above gets round this easily. Simply pick a row from your existing real dataset at random and there you go, you have some realistic feature values. As before, the recipe allows me to then generate response values with known ground-truth parameter values. So overall I can compare parameter estimates to ground-truth parameter values on realistic feature values, allowing me to check that my estimation algorithm is at least semi-accurate on realistic feature values. You can also choose, in step 1 of the recipe, whether you want to sample a row of feature values with or without replacement.
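
A small sketch of this modified recipe, assuming you already have a real dataset loaded as a pandas DataFrame with feature columns x1, x2, x3 (here I create a stand-in DataFrame just so the snippet runs, and the ground-truth parameters are illustrative),

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# stand-in for a real dataset of feature values; in practice load your own data
real_df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])

beta_true = np.array([2.0, -1.0, 0.5])  # illustrative ground-truth parameters
sigma_eps = 0.3

# step 1: sample feature rows from the real data (with replacement here)
X = real_df.sample(n=200, replace=True, random_state=0).to_numpy()

# steps 2 and 3: construct the conditional mean and sample the response
y = X @ beta_true + rng.normal(0.0, sigma_eps, size=len(X))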

Simulating the response only

You could argue that simulating response values with feature values sampled from an existing real dataset is an example of just simulating the response. After all, only the response value is computer generated. I still tend to think of it as simulating the features because, i) I am still sampling the features from a distribution function, the empirical distribution function in this case, and ii) I have broken some of the link between the features and the response in the real data because I have sampled the feature values separately. However, sometimes we want to keep as much of the link between features and response values in the real data as possible. We can do this by only making additions to the real data. By necessity this means only adding to the response value. This may sound very restrictive, but in fact there are many situations where this is precisely the kind of data we need to test an estimation algorithm, for example changepoint detection or unconditional A/B testing. In these situations we take the real data, identify the split point where we want to increase the response value (the changepoint or the A/B grouping) and simply increase the response. Hey presto, we have realistic data with a guaranteed increase in the response variable at a known location. By changing the level of increase in the response variable we can use this approach to assess the statistical power of the changepoint or A/B test algorithm.

The plots below show an example of introducing a simple shift in level at timepoint 53 into a real dataset. We have only shown the process as a simple schematic, but coding it up yourself is only a matter of a line or two of code, so I haven’t given any code details.

In the above example I simply increased the response variable, by the same amount (8.285 in this case), at and after timepoint 53. If instead, you only want to increase the average value of the response variable, it is a simple modification of the process to include some additional zero-mean noise after the changepoint location.
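
For completeness, a sketch of that level-shift injection (using a stand-in series and 0-based indexing; the shift value and timepoint match the example above),

import numpy as np

rng = np.random.default_rng(4)

# stand-in for a real time series; in practice y_real comes from your real data
y_real = rng.normal(10.0, 1.0, size=100)

changepoint, shift = 53, 8.285
y_sim = y_real.copy()
y_sim[changepoint:] += shift   # increase the response at and after the changepoint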

Conclusions

Simulated data is extremely useful. It can give you lots of insight into the performance of your training/estimation algorithm (including bug detection). Its main advantages are that it:

  • Is easy to produce in large volumes.
  • Can be produced in a user-controlled way.
  • Gives you ground-truth values.
  • Gives you a way to assess the performance of your training algorithm when you have no real ground-truth data.
  • Stops you releasing a poor, untested training algorithm into production.

If you don’t want to sign an absolutely useless striker for your data science model team, test with simulated data at the very minimum.

Footnotes
  1. Harry Redknapp is a former English Premier League football manager. Whilst Redknapp was manager of Tottenham Hotspur he had a reputation of being willing to sign players on the flimsiest of evidence of footballing skills. At a time when there was a large influx of overseas players into the Premier League, due to their reputation for superior technical football skills, it was joked that he would sign a player simply because of how their name sounded and without any checks on the player at all.
  2. The term generative model precedes its usage in Generative AI. Broadly speaking, a generative model is a machine learning model that learns the underlying probability distribution of the data and can generate new, similar data instances. The usage of the term was popular around the early 2000s, particularly when discussing different forms of classifiers, which were described as either being generative or discriminative.

© 2025 David Hoyle. All Rights Reserved


Quick introductions to the maths of transformers

TL;DR: Having a high-level understanding of the mathematics of transformers is important for any Data Scientist. The two sources I recommend below are excellent short introductions to the maths of transformers and modern language models.

A colleague asked me, about two months back, if I could recommend any articles on the mathematics of Large Language Models (LLMs). They then clarified that they meant transformers, as they were primarily interested in the algorithms on which LLM apps are based. Yes, they’d skim-read the original “Attention Is All You Need” paper from Vaswani et al, but they’d done so just after the paper came out in 2017. They were looking to get back up to date with LLMs and even revisit the original Vaswani paper. Firstly, they wanted an accessible explanation which they could use to construct a high-level mental model of how transformers worked, the idea being that the high-level mental model would serve as a construct on which to hang and compartmentalize the many new concepts and advances that had happened since the Vaswani paper. Secondly, my colleague is very mathematically able, so they were looking for mathematical detail, but the right mathematical detail, and in a relatively short read.

I’ve listed below the recommendations I gave to my colleague because I think they are good recommendations (and I explain why below as well). I also believe it is important for all Data Scientists to have at least a high-level understanding of how transformers, and the LLMs which are built on them, work – again I explain why, below.

The recommendations

What I recommended was one paper and one book. The article is free to access, and the book has a “set your own price” option for access to the electronic version of the book.

  • The article is, “An Introduction to Transformers” by Richard Turner from the Dept. of Engineering at the University of Cambridge and Microsoft Research in Cambridge (UK). This arXiv paper can be found here.  The paper focuses on how transformers work but not on training them. That way the reader focuses on the structure of the transformers without getting lost in the details of the arcane and dark art of training transformers. This is why I like this paper. It gives you an overview of what transformers are and how they work without getting into the necessary but separate nitty-gritty of how you get them to work. To read the paper does require some prior knowledge of mathematics but the level is not that high – see the last line of the abstract of the paper. The whole paper is only six pages long, making it a very succinct explanation of transformer maths that you can consume in one sitting.
  • The book is “The Hundred-Page Language Models Book” by Andriy Burkov. This is the latest in the series of books from Burkov, which includes “The Hundred-Page Machine Learning Book” and “Machine Learning Engineering”. I have a copy of the hundred-page machine learning book and I think it is ok, but I prefer the LLMs book. I think part of the reason for this is that, like everybody else, I have only been using and playing with LLMs for the last three years or so, whilst I have been doing Data Science for a lot longer – I have been doing some form of mathematical or statistical modelling for over 30 years – and so I didn’t really learn anything new from the machine learning book. In contrast, I learnt a lot from the book on LLMs. The whole book works through simple examples, both in code (Python) and in terms of the maths. I semi-skim-read the book in two sittings. The code examples I skipped, not because they were simplistic but because I wanted to digest the theory and algorithm explanations end-to-end first and then return to trying the code examples at a later date. Overall, the book is packed with useful nuggets. It is a longer read than the Turner paper, but can still easily be consumed in a day if you skip bits. The book assumes less prior mathematical knowledge than the Turner paper and explains the new bits of maths it introduces, but given the whirlwind nature of a 100-page introduction to LLMs I would still recommend you have some basic familiarity with linear algebra, statistics & probability, and machine learning concepts.

Why learn the mathematics of transformers?

Having to think about which short articles I would recommend on the maths of transformers and LLMs made me think more broadly about whether there is any benefit from having a high-level understanding of transformer maths. My colleague was approaching it out of curiosity, and I knew that. They simply wanted to learn, not because they had to, nor because they thought that understanding the mathematical basis of transformers was the way to approach using LLMs as a tool.

However, given the exorbitant financial cost of building foundation models and the need to master a vast amount of engineering detail, most people won’t be building their own foundation models. Instead they will be using 3rd party models simply as a tool and focusing on developing skills and familiarity in prompting them. So, are there any benefits then to understanding the maths behind LLMs? In other words, could I honestly recommend the two sources listed above to anybody other than my colleague, who was interested mainly out of curiosity?

The benefits of learning the maths of transformers and the risks of not doing so

The answer to the question above, in my opinion, is yes. But you probably could have guessed that from the fact I’ve written this post. So, what do I think are the benefits to a Data Scientist in having a high-level understanding of the mathematics of transformers? And equally important, what are the downsides and risks of not having that high-level understanding?

  1. Having even a high-level understanding of the maths behind transformers de-mystifies LLMs since it forces you to focus on what is inside LLMs. Without this understanding you risk putting an unnecessary veneer of complexity or mysticism on top of LLMs, a veneer that prevents you using LLMs effectively.
  2. You will understand why LLMs hallucinate. You will understand that LLMs build a model of the high-dimensional conditional probability distribution of the next token given the preceding context. And that distribution can have a large dispersion if the training data is limited in the high-dimensional region that corresponds to the current context. That large dispersion results in the sampled next token having a high probability of being inappropriate. If you understand what LLMs are modelling and how they model it, hallucinations will not be a surprise to you (they may still be annoying) and you will understand strategies to mitigate them. If you don’t understand how LLMs are modelling the conditional probability of the next token, you will always be surprised, annoyed, and impacted by LLM hallucinations.
  3. It helps you understand where LLMs excel and where they don’t because you have a grounded understanding of their strengths and weaknesses. This makes it easier to identify potential applications of LLMs. The downside? Not having a fundamental understanding of the strengths and weaknesses of the algorithms behind LLMs risks you building LLM-based applications that were doomed to failure from the start because they have been mis-matched to the capabilities of LLMs.
  4. By having a high-level mental model of transformers on which to hang later advances in LLMs, you can more easily identify what is important and relevant (or not) in any new advance. The downside to not having this well-founded mental-model is that you get blown about by the winds of over-hyped LLM announcements from companies stating that their new tool or app is a “paradigm shift”, and consequently you waste time getting into the detail of what are trivial or inconsequential improvements.

What to do?

What should you do if you are a Data Scientist and I have managed to convince you that having a high-level understanding of the mathematics of transformers is important? Simple, access the two sources I’ve recommended above. Happy reading.

© 2025 David Hoyle. All Rights Reserved

You’re going to need a bigger algorithm – Amdahl’s Law and your responsibilities as a Data Scientist

You have some prototype Data Science code based on an algorithm you have designed. The code needs to be productionized, and so sped up to meet the specified production run-times. If you stick to your existing technology stack, unless the runtimes of your prototype code are within a factor of 1000 of your target production runtimes, you’ll need a bigger, better algorithm. There is a limit to what speed up your technology stack can achieve. Why is this? Read on and I’ll explain. And I’ll explain what you can do if you need more than a 1000-fold speed up of your prototype.

Speeding up your code with your current tech stack

There are two ways in which you can speed up your prototype code,

  1. Improve the efficiency of the language constructs used, e.g. in Python replacing for loops with list comprehensions or maps, refactoring subsections of the code etc.
  2. Horizontal scaling of your current hardware, e.g. adding more nodes to a compute cluster, adding more executors to the pool in a Spark cluster.

Point 2 assumes that your calculation is compute bound and not memory bound, but we’ll stick with that assumption for this article. We also exclude the possibility that the productionization team can invent or buy a new technology that is sufficiently different or better than your current tech stack – it would be an unfair ask of the ML engineers to have to invent a whole new technology just to compensate for your poor prototype. They may be able to, but we are talking solely about using your current tech stack, and we assume that it does have some capacity to be horizontally scaled.

So what speed ups can we expect from points 1 and 2 above? Point 1 is always possible. There are always opportunities for improving code efficiency that you or another person will spot when looking at the code for a second time. A more experienced programmer reviewing the code can definitely help. But let’s assume that you’re a reasonably experienced Data Scientist yourself. It is unlikely that your code is so bad that a review by someone else would speed it up by more than a factor of 10 or so.

So if the most we expect from code efficiency improvements is a factor 10 speed up, what speed up can we additionally get from horizontal scaling of your existing tech stack? A factor of 100 at most. Where does this limit of 100 come from? Amdahl’s law.

Amdahl’s law

Amdahl’s law is a great little law. Its origins are in High Performance Computing (HPC), but it has a very intuitive basis and so is widely applicable. Because of that it is worth explaining in detail.

Imagine we have a task that currently takes time T to run. Part of that task can be divided up and performed by separate workers or resources such as compute nodes. Let’s use P to denote the fraction of the task that can be divided up. We choose the symbol P because this part of the overall task can be parallelized. The fraction that can’t be divided up we denote by S, because it is the non-parallelizable or serial part of the task. The serial part of the task represents things like unavoidable overhead and operations in manipulating input and output data-structures and so on.

Obviously, since we’re talking about fractions of the overall runtime T, the fractions P and S must sum to 1, i.e.

P \;+\; S \;=\; 1\;\;\;\;\;\;\;{\rm Eq.1}

The parallelizable part of the task takes time TP to run, whilst the serial part takes time TS to run.

What happens if we do parallelize that parallelizable component P? We’ll parallelize it using N workers or executors. When N=1, the parallelizable part took time TP to run, so with N workers it should (in an ideal world) take time TP/N to run. Now our overall run time, as a function of N is,

T(N) \;=\; T\left ( S \;+\; \frac{P}{N} \right )\;\;\;\;\;\;\;{\rm Eq.2}

This is Amdahl’s law1. It looks simple but let’s unpack it in more detail. We can write the speed up factor in going from T(N=1) to T(N) as,

{\rm Speedup}(N) \;=\; \frac{T(N=1)}{T(N)} \;=\; \frac{1}{S \;+\; P/N}\;\;\;\;\;\;\;{\rm Eq.3}

The figure below shows plots of the speed-up factor against N, for different values of S.

[Figure: the speed-up factor of Eq.3 plotted against N, for different values of the serial fraction S]

From the plot in the figure, you can see that the speed up factor initially looks close to linear in N and then saturates. The speed up at saturation depends on the size of the serial component S. There is clearly a limit to the amount of speed up we can achieve. When N is large, we can approximate the speed up factor in Eq.3 as,

{\rm Speedup}(N) \;\approx\; \frac{1}{S}\left ( 1 \;-\; \frac{P}{SN} \right )\;\;\;\;\;\;\;{\rm Eq.4}

From Eq.4 (or from Eq.3) we can see the limiting speed up factor is 1/S. The mathematical approximation in Eq.4 hides the intuition behind the result. The intuition is this; if the total runtime is,

T(N) \;=\; T\left ( S \;+\; \frac{P}{N} \right )\;\;\;\;\;\;\;{\rm Eq.5}

then at some point we will have made N big enough that P/N is smaller than S. This means we have reduced the runtime of the parallelizable part to below that of the serial part. The largest contribution to the overall runtime is now the serial part, not the parallelizable part. Increasing N further won’t change this. We have hit a point of rapidly diminishing returns. And by definition we can’t reduce S by any horizontal scaling. This means that when P/N becomes comparable to S, there is little point in increasing N further and we have effectively reached the saturation speed up.
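
A tiny Python sketch that evaluates Eq.3 makes the saturation at 1/S concrete (my own illustrative code),

# speed-up factor from Eq.3: 1 / (S + P/N), with P = 1 - S
def amdahl_speedup(n_workers, serial_fraction):
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_workers)

for n in [1, 10, 100, 1000, 10000]:
    print(n, round(amdahl_speedup(n, serial_fraction=0.01), 1))

# the speed-up saturates at 1/S = 100 as N grows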

How small is S?

This is the million-dollar question, as the size of S determines the limiting speed up factor we can achieve through horizontal scaling. A larger value of S means a smaller speed up factor limit. And here’s the depressing part – you’ll be very lucky to get S close to 1%, which would give you a speed up factor limit of 100.

A real-world example

To explain why S = 0.01 is around the lowest serial fraction you’ll observe in a real calculation, I’ll give you a real example. I first came across Amdahl’s law in 2007/2008, whilst working on a genomics project, processing very high-dimensional data sets2. The calculations I was doing were statistical hypothesis tests run multiple times.

This is an example of an “embarrassingly parallel” calculation, since it just involves splitting up a dataframe into subsets of rows and sending the subsets to the worker nodes of the cluster. There is no sophistication to how the calculation is parallelized; it is almost embarrassing to do – hence the term “embarrassingly parallel”.

The dataframe I had was already sorted in the appropriate order, so parallelization consisted of taking a small number of rows off the top of the dataframe, sending them to a worker node, and repeating. Mathematically, on paper, we had S=0. Timings of actual calculations with different numbers of compute nodes, and fitting an Amdahl’s law curve to those timings, revealed we had something between S=0.01 and S=0.05.

A value of S=0.01 gave us a maximum speed-up factor of 100 from horizontal scaling. And this was for a problem that on paper had S=0. In reality, there is always some code overhead in manipulating the data. A more realistic limit on S for an average-complexity piece of Data Science code would be S=0.05 or S=0.1, meaning we should expect limits on the speed-up factor of between 10 and 20.

What to do?

Disappointing, isn’t it? Horizontal scaling will speed up our calculation by at most a factor of 100, and more likely only by a factor of 10-20. What does this mean for productionizing our prototype code? Even if we also include improvements in code efficiency, the most we’re likely to be able to speed up our prototype code by is a factor of 1000 overall. It means that as a Data Scientist you have a responsibility to ensure the runtime of your initial prototype is within a factor of 1000 of the production runtime requirements.

If a speed-up of 1000 isn’t enough to hit the production run-time requirements, what can we do? Don’t despair. You have several options. Firstly, you can always change the technology underpinning your tech stack. Despite what I said at the beginning of this post, if you repeatedly find that horizontal scaling of your current tech stack does not give you the speed-up you require, then there may be a case for either vertically scaling the runtime performance of each worker node or using a superior tech stack if one exists.

If improvement by vertical scaling of individual compute nodes is not possible, then there are still things you can do to mitigate the situation. Put the coffee on, sharpen your pencil, and start work on designing a faster algorithm. There are two approaches you can use here,

  • Reduce the performance requirements: This could mean lowering the accuracy through approximations that are simpler and quicker to calculate. For example, if your code involves significant matrix inversion operations you may be able to approximate a matrix by its diagonal and explicitly hard-code the calculation of its inverse, rather than performing an expensive numerical inversion of the full matrix (a toy sketch of this idea follows after this list).
  • Construct a better algorithm: There are no easy recipes here. You can get some hints on where to focus your effort and attention by identifying the runtime bottlenecks in your initial prototype. This can be done using code profiling tools. Once a bottleneck has been identified, you can then progress by simplifying the problem and constructing a toy problem that has the same mathematical characteristics as the original bottleneck. By speeding up the toy problem you will learn a lot. You can then apply those learnings, even if only approximately, to the original bottleneck problem.
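As promised, here is a toy sketch of the diagonal-approximation idea from the first bullet point; the matrix below is a random, diagonally dominant stand-in for whatever matrix is the real bottleneck in your code.

import numpy as np

rng = np.random.default_rng(42)

# Toy diagonally dominant matrix, standing in for the matrix whose full
# inversion is the runtime bottleneck
n = 5
A = np.diag(rng.uniform(5.0, 10.0, size=n)) + 0.1 * rng.standard_normal((n, n))

exact_inverse = np.linalg.inv(A)                 # expensive full inversion
diag_approx_inverse = np.diag(1.0 / np.diag(A))  # cheap diagonal approximation

# Quantify the accuracy we trade away for speed
relative_error = (np.linalg.norm(exact_inverse - diag_approx_inverse)
                  / np.linalg.norm(exact_inverse))
print(f"Relative error of diagonal approximation: {relative_error:.3f}")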

  1. When I first stumbled across Amdahl’s law, I mentioned it to a colleague working on the same project as I was. They were a full-stack software developer and immediately said “oh, you mean Amdahl’s law about limits on the speed you can write to disk?”. It turns out there is another Amdahl’s Law, often called “Amdahl’s Second Law”, or “Amdahl’s Other Law”, or “Amdahl’s Lesser Law”, or “Amdahl’s Rule-Of-Thumb”. See this blog post, for example, for more details on Amdahl’s Second Law.
  2. Hoyle et al., “Shared Genomics: High Performance Computing for distributed insights in genomic medical research”, Studies in Health Technology & Informatics, 147:232-241, 2009.

© 2024 David Hoyle. All Rights Reserved

How many iterations are needed for the bisection algorithm?

<TL;DR>

  • The bisection algorithm is a very simple algorithm for finding the root of a 1-D function.
  • The number of iterations required to locate the root within a specified tolerance can be determined from a very simple little hack, which I explain here.
  • Things get more interesting when we consider variants of the bisection algorithm, where we cut an interval into unequal portions.

</TL;DR>

A little while ago a colleague mentioned that they were repeatedly using an off-the-shelf bisection algorithm to find the root of a function. The algorithm required the user to specify the number of iterations to run the bisection for. Since my colleague was running the algorithm repeatedly they wanted to set the number of iterations efficiently and also to achieve a guaranteed level of accuracy, but they didn’t know how to do this.

I mentioned that it was very simple to do this and it was a couple of lines of arithmetic in a little hack that I’d used many times. Then I realised that the hack was obvious and known to me because I was old – I’d been doing this sort of thing for years. My colleague hadn’t. So I thought the hack would be a good subject for a short blog post.

The idea behind a bisection algorithm is simple and illustrated in Figure 1 below.

Figure 1: Schematic of how the bisection algorithm works

At each iteration we determine whether the root is to the right of the current mid-point, in the right-hand interval, or to the left of the current mid-point, in the left-hand interval. In either case, the range within which we locate the root halves. We have gone from knowing it was in the interval [x_{lower}, x_{upper}], which has width x_{upper}-x_{lower}, to knowing it is in an interval of width \frac{1}{2}(x_{upper}-x_{lower}). So with every iteration we reduce our uncertainty of where the root is located by half. After N iterations we have reduced our initial uncertainty by (1/2)^{N}. Given our initial uncertainty is determined by the initial bracketing of the root, i.e.  an interval of width (x_{upper}^{(initial)}-x_{lower}^{(initial)}), we can now work out that after N iterations we have narrowed down the root to an interval of width {\rm initial\;width} \times \left ( \frac{1}{2}\right ) ^{N}. Now if we want to locate the root to within a tolerance {\rm tol}, we just have to keep iterating until the uncertainty reaches {\rm tol}. That is, we run for N iterations where N satisfies,

\displaystyle N\;=\; -\frac{\ln({\rm initial\;width/tol})}{\ln\left (\frac{1}{2} \right )}

Strictly speaking we need to run for \lceil N \rceil iterations. Usually I will add on a few extra iterations, e.g. 3 to 5, as an engineering safety factor.
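As a concrete illustration, a minimal Python version of the hack might look like this; the function name and the default safety margin of 3 iterations are my own choices.

import math

def bisection_iterations(initial_width, tol, safety=3):
    # N = -ln(initial_width / tol) / ln(1/2), rounded up, plus a small
    # engineering safety margin
    n = -math.log(initial_width / tol) / math.log(0.5)
    return math.ceil(n) + safety

# e.g. a bracketing interval of width 1 and a tolerance of 1e-6
print(bisection_iterations(1.0, 1e-6))  # ceil(19.93) + 3 = 23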

As a means of quickly determining the number of iterations to run a bisection algorithm, the calculation above is easy to understand and a great little hack to remember.
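If your off-the-shelf routine exposes a maximum-iteration setting, for example the maxiter argument of scipy.optimize.bisect, you could feed the result of the sketch above straight into it; the function f and the bracket below are just examples.

from scipy.optimize import bisect

f = lambda x: x**3 - 2.0   # example function, root at 2**(1/3)
a, b = 1.0, 2.0            # initial bracket of width 1

root = bisect(f, a, b, xtol=1e-6, maxiter=bisection_iterations(b - a, 1e-6))
print(root)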

Is bisection optimal?

The bisection algorithm works by dividing in two the current interval in which we know the root lies. Dividing the interval in two is efficient. It is like we are playing the childhood game “Guess Who”, where we ask questions about the characters’ features in order to eliminate them.

Asking about a feature that approximately half the remaining characters possess is the most efficient – it has a reasonable probability of applying to the target character and eliminates half of the remaining characters. If we have a single question, with a binary outcome and a probability p of one of those outcomes, then the question that has p = \frac{1}{2} maximizes the expected information (the entropy), -p\ln (p)\;-\; (1-p)\ln(1-p).
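Setting the derivative of the entropy to zero confirms that the maximum is indeed at p = \frac{1}{2},

\displaystyle \frac{\partial}{\partial p}\left [ -p\ln(p) \;-\; (1-p)\ln(1-p)\right ] \;=\; \ln\left ( \frac{1-p}{p}\right ) \;=\; 0 \;\;\Rightarrow\;\; p \;=\; \frac{1}{2}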

Dividing the interval unequally

When we first played “Guess Who” as kids we learnt that asking questions with a much lower probability p of being correct didn’t win the game. Is the same true for our root finding algorithm? If instead we divide each interval into unequal portions is the root finding less efficient than when we bisect the interval?

Let’s repeat the derivation but with a different cut-point, e.g. 25% of the way along the current interval bracketing the root. In general we can test whether the root is to the left or right of a point that is a proportion \phi along the current interval, meaning the cut-point is x_{lower} + \phi (x_{upper}-x_{lower}). At each iteration we don’t know in advance which side of the cut-point the root lies until we test for it, so in trying to determine in advance the number of iterations we need to run, we have to assume the worst-case scenario, namely that the root always lies in the larger of the two sub-intervals. The worst-case reduction in uncertainty per iteration is then {\rm max}\{\phi, 1-\phi\}. Repeating the derivation we find that we have to run at least,

\displaystyle N_{Worst\;Case}\;=\; -\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm max}\{\phi, 1 - \phi \}\right )}

iterations to be guaranteed that we have located the root to within tol.
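Extending the earlier iteration-count sketch to an arbitrary cut-point gives a feel for how much the worst case costs; again, the helper function is just an illustration.

import math

def worst_case_iterations(initial_width, tol, phi):
    # Worst-case per-iteration reduction factor is max(phi, 1 - phi)
    reduction = max(phi, 1.0 - phi)
    return math.ceil(-math.log(initial_width / tol) / math.log(reduction))

print(worst_case_iterations(1.0, 1e-6, 0.5))   # 20 -- the standard bisection result
print(worst_case_iterations(1.0, 1e-6, 0.25))  # 49 -- unequal cuts can cost far more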

Now to determine the cut-point \phi that minimizes the upper bound on number of iterations required, we simply differentiate the expression above with respect to \phi. Doing so we find,

\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; -\frac{\ln({\rm initial\;width/tol})}{ (1-\phi) \left ( \ln (1 - \phi) \right )^{2}} \;\;,\;\; \phi < \frac{1}{2}

and

\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; \frac{\ln({\rm initial\;width/tol})}{\phi \left ( \ln (\phi) \right)^{2}} \;\;,\;\; \phi > \frac{1}{2}

The minimum of N_{Worst\;Case} is at \phi =\frac{1}{2}, although \phi=\frac{1}{2} is not a stationary point of the upper bound N_{Worst\;Case}, as N_{Worst\;Case} has a discontinuous gradient there.

That is the behaviour of the worst-case scenario. A similar analysis can be applied to the best-case scenario – we simply replace max with min in all of the above formulae. That is, in the best-case scenario the number of iterations required is given by,

\displaystyle N_{Best\;Case}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm min}\{\phi, 1 - \phi \}\right )}

Here, the maximum of the best-case number of iterations occurs when \phi = \frac{1}{2}.

That’s the worst-case and best-case scenarios, but how many iterations do we expect to use on average? Let’s look at the expected reduction in uncertainty in the root location after N iterations. In a single iteration, a root that is randomly located within our interval will lie, with probability \phi, in the segment to the left of our cut-point, leading to a reduction in the uncertainty by a factor of \phi. Similarly, we get a reduction in uncertainty of 1-\phi with probability 1-\phi if our randomly located root is to the right of the cut-point. So after N iterations the expected reduction in uncertainty is,

\displaystyle {\rm Expected\;reduction}\;=\;\left ( \phi^{2}\;+\;(1-\phi)^{2}\right )^{N}

Using this as an approximation to determine the typical number of iterations, we get,

\displaystyle N_{Expected\;Reduction}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ( \phi^{2} + (1-\phi)^{2} \right )}

This still isn’t the expected number of iterations, but to see how it compares, Figure 2 below shows simulation estimates of \mathbb{E}\left ( N \right ) plotted against \phi when the root is random and uniformly distributed within the original interval.
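The simulation code itself isn’t reproduced here, but a minimal sketch of the idea might look like the following; the function name, the unit interval, and the trial count are my own choices.

import numpy as np

rng = np.random.default_rng(0)

def simulate_mean_iterations(phi, tol, n_trials=10_000):
    # Monte Carlo estimate of E(N): iterations needed to locate a uniformly
    # distributed root in [0, 1) to within tol, cutting a fraction phi along
    # the current interval at each iteration
    counts = np.empty(n_trials)
    for trial in range(n_trials):
        root = rng.uniform()
        lower, upper = 0.0, 1.0
        n_iterations = 0
        while (upper - lower) > tol:
            cut = lower + phi * (upper - lower)
            if root < cut:
                upper = cut
            else:
                lower = cut
            n_iterations += 1
        counts[trial] = n_iterations
    return counts.mean()

print(simulate_mean_iterations(0.5, tol=0.01))   # exactly 7: the width halves each time
print(simulate_mean_iterations(0.25, tol=0.01))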

Figure 2: Number of iterations required for the different root finding methods.

For Figure 2 we have set w = ({\rm tol/initial\;width}) = 0.01. Also plotted in Figure 2 are our three theoretical estimates, \lceil N_{Worst\;Case}\rceil, \lceil N_{Best\;Case}\rceil, \lceil N_{Expected\;Reduction}\rceil. The stepped structure in these 3 integer quantities is clearly apparent, as is how many more iterations are required under the worst-case method when \phi \neq \frac{1}{2}.

The expected number of iterations required, \mathbb{E}( N ), actually shows a rich structure that isn’t clear unless you zoom in. Some aspects of that structure were unexpected, and understanding them requires some more involved mathematics. I may save that for a follow-up post at a later date.

© 2022 David Hoyle. All Rights Reserved