Data Science Notes: 5. Trimmed variances and getting MAD

On May 29, 2026May 29, 2026 By dchoyleIn Data Science, Data Science Notes, StatisticsLeave a comment

Summary

Estimating the variance of a distribution when we have outliers can create a circular problem, particularly if we want to use the estimated variance in an algorithm to detect the outliers in the first place.
If we are prepared to assume, that at most, a fraction $\alpha$ of our data are outliers, then we can use a trimmed-variance to estimate the population variance without having to explicitly identify the outliers.
We incorporate a correction factor to make our variance estimate consistent with the parametric distribution we have assumed our data has been drawn from.
The median absolute deviation from the median (MAD) also uses a similar idea and tends to have better performance (efficiency) for outlier-contaminated Normal data.
Calculating the Normal-consistent MAD-based estimate of the standard deviation is ridiculously simple.

Introduction

This is one of those little hacks that I really like. So simple, so easy, and yet it solves a real problem. It solves a circular problem. How to estimate the standard deviation of a distribution from a sample of data, without being impacted by outliers when we haven’t yet identified which datapoints are outliers. We may even want to use that estimate of the standard deviation to help detect those outliers – in an appropriate algorithm that takes the sample size into account of course (see my previous Data Science Notes post What is an outlier?). It looks like we have a circular problem; the outliers affect the estimation of the standard deviation, but we can’t identify and deal with those outliers until we have calculated the standard deviation.

We can drop the data points from the tails of the sample to remove outliers. The trouble is the standard deviation relates to the spread of values in a distribution, i.e. it relates to the tails of the distribution. It looks like we are going to lose signal about the standard deviation if we do this.

If only there was a way to estimate a standard deviation from just the data values around the mean. That is, can we estimate the standard deviation from the data points that have come from the bulk of the distribution and not its tails? There is, if we are prepared to make a parametric assumption about the shape of the distribution. Once we make a parametric assumption about the shape, any datapoint gives us signal about the parameters of the distribution. And once we have signal about the parameters of the distribution we have signal about its standard deviation.

Theory

Let’s use the Normal distribution as an example. We know that the probability density is given by,

$\phi (x)\;=\; \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp \left ( -\frac{1}{2\sigma^{2}}\left ( x - \mu\right )^{2}\right )$ Eq.1

In Eq.1 $\mu$ is the mean of the probability distribution and $\sigma$ is its standard deviation. The part of the probability distribution that corresponds to the central 90% of the probability distribution is given by the interval $\left [ -\sigma x_{5} + \mu , \sigma x_{5} + \mu \right ]$ , where $x_{5}$ corresponds to the 5^th centile of the standard Normal distribution and so is given by,

$x_{5} \;=\; -\sqrt{2} {\rm erf}^{-1}\left ( 0.95 \right )$ Eq.2

If we calculate the 2^nd moment of $x$ about the mean $\mu$ and over that central 90% portion of the probability distribution we can write that,

$\frac{1}{\sqrt{2\pi\sigma^{2}}}\int_{-x_{5} }^{x_{5}} x^{2}\,\exp \left ( -\frac{x^{2}}{2\sigma^{2}}\right )\, dx\;=\; 0.9\sigma^{2}\left [ 1 + \frac{2}{0.9}\Phi^{-1}\left ( 0.05\right ) \phi \left ( \Phi^{-1}\left ( 0.05 \right )\right ) \right ]$ Eq. 3

In Eq.3 $\Phi(z)$ is the CDF of the Standard Normal distribution and so $\Phi(z) = \frac{1}{2}{\rm erfc}\left ( -z/\sqrt{2} \right )$ . The right-hand side of Eq.3 is 0.9 times what we would get if we had an infinite sample of data from a Normal distribution with standard deviation of $\sigma$ and removed the upper and lower 5% of the datapoints and then calculated the sample standard deviation $s^{2}$ .

Let’s denote our sample of data by the vector $\underline{x}$ . Eq.3 tell us that once we have removed the upper and lower 5% of datapoints from our sample $\underline{x}$ and then calculated $s^{2}$ , we have,

$s^{2} \simeq \sigma^{2}\left [ 1 + \frac{2}{0.9}\Phi^{-1}\left ( 0.05\right ) \phi \left ( \Phi^{-1}\left ( 0.05 \right )\right ) \right ]$ Eq.4

A simple re-arrangement of Eq.4 gives us a means of estimating $\sigma^{2}$ from our trimmed data sample. We just use,

$\hat{\sigma}^{2}\simeq \frac{s^{2}}{\left [ 1 + \frac{2}{0.9}\Phi^{-1}\left ( 0.05\right ) \phi \left ( \Phi^{-1}\left ( 0.05 \right )\right ) \right ] }$ Eq.5

There you have it. A simple way to estimate the standard deviation of a Normal distribution whilst excluding the smallest and largest 5% of values. If we want to exclude $\alpha \times 100\%$ of values, we can easily generalize the formula in Eq.5. I’ll leave it to you as an exercise to do the derivation. I’m just going to quote the final result below,

$\hat{\sigma}^{2}\;=\; \frac{s^{2}}{\left [ 1 + \frac{2}{1 - \alpha}\Phi^{-1}\left ( \frac{1}{2}\alpha\right ) \phi \left ( \Phi^{-1}\left ( \frac{1}{2}\alpha \right )\right ) \right ] }$ Eq.6

Practice

That’s the theory. How do we use this simple idea in practice? Like all simple but great ideas someone has already coded it up for you. In Python the trimmed variance of an array a of numbers can be calculated using the trimmed_var function in scipy.stats.mstats, as follows,

			
from scipy.stats.mstats import trimmed_var
trimmed_var = trimmed_var(a,
                          limits=(0.05, 0.05),
                          inclusive=(1, 1),
                          relative=True,
                          axis=None,
                          ddof=1)

		

In the notebook DataScienceNotes5_TrimmedVariances.ipynb in the public GitHub repository github.com/dchoyle/datascience_notes, I have used the trimmed_var function to run a number of simulations. The plot below (Figure 1) shows the average estimated variance, averaged over 1000 simulation datasets, using the formula in Eq.6 and where I have used the scipy trimmed_var function to calculate $s^{2}$ for each simulation sample. The plot shows the average (over simulation datasets) estimated population variance at a number of different sample sizes $N$ and for three different values of $\alpha$ .

Figure 1: Plot of the simulation means for the trimmed-variance estimates against log N.

On the y-axis I’ve actually plotted the ratio of the average estimated variance to the true population variance (with which the data was generated) minus 1, so that it is easier to see how accurate the estimated variance is. A value of zero on the y-axis indicates a perfect estimate. You can see that as the starting sample size $N$ increases, each of the different $\alpha$ curves converges to zero, i.e. a perfect estimate, on average, of the true variance. However, the rate of convergence appears to be diffferent for the different values of $\alpha$ . In part, this is illusory, and is because we haven’t adjusted the x-axis for the effective sample size. Conisder if our starting sample size was $N=10000$ . At $\alpha=0.1$ we are actually estimating the variance from 900 data points. The effective sample size is 900. Whilst for $\alpha=0.4$ the variance is estimated from a trimmed sample consisting of 600 data points. In order to compare like effective sample sizes with like effectve sample sizes I simply need to adjust the x-axis values in Fig.1 by $\log(1-\alpha)$ . This I have done in Figure 2 below.

Figure 2: Plot of the simulation means of the trimmed-variance estimates plotted against the adjusted sample size.

You can see that in Fig.2 all three different $\alpha$ curves are nearly identical, particularly at the larger values of $N$ . In Fig.2 I have also included error bars on the $\alpha=0.1$ curve. These correspond to $\pm 2$ times the standard errors of the means in the $\alpha=0.1$ curve. I haven’t included the standard errors for the other two curves simply because they will be of similar scale and will make the plot too crowded. From Fig.2, with the standard errors visible we see that there is a bias in the variance estimates at smaller values of $N$ , i.e. the average estimate is systematically different from the true population level. Deriving the mathematical form of the bias is a lengthy and challenging mathematical calculation, so I’m not going to go into it here.

Getting MAD

So far we have been calculating trimmed variances. However, the idea of using the central portion of a data sample to estimate a population quantity is more widely applicable. One of my favourite and most well known applications of this idea is the median absolute deviation, or MAD for short.

MAD is the median absolute deviation of $x$ from some measure of central tendency of $x$ , such as the median or mean. Usually, the default is to use the median as the measure of central tendency, so that the MAD is calculated as the median absolute deviation from the median. That is, we take our sample of data, calculate its median, calculate the absolute difference between each data point and that median, and calculate the median of all those absolute values.

The great thing about MAD (when using the median central tendency measure) is that for a large sample of data drawn from a Normal distribution, there is a very simple relationship between the expectation value of MAD and the population standard deviation $\sigma$ . That relationship is,

$\mathbb{E}\left ( {\rm MAD} \right ) \;\rightarrow\; \sigma \Phi \left ( 3/4 \right )\;=\; 0.6745\sigma\;\;\;,\;\; {\rm as} \;N\rightarrow\infty$ Eq.7

Let’s unpack what Eq.7 is telling us. It says that, on average, the MAD calculated from a sample of (Normally distributed) data is a simple constant times the true standard deviation. This means we can invert Eq. 7 to get a extremely simple way of estimating the population standard deviation that is robust to the presence of outliers (provided they make up less than 50% of the sample). That simple way is,

$\hat{\sigma}\;=\; 1.4826 \times {\rm MAD} \left ( \underline{x} \right )$ Eq. 8

Calculating the MAD is such a common task that once again you will find that most programming languages used for numerical analysis will contain a MAD function either as part of the base distribution, or in a commonly used package. In R we can use the mad function which is part of the base R distribution. In Python we can use scipy.stats.median_abs_deviation. In both cases the use of MAD to estimate the standard deviation is such a common use of MAD that both the R and Python functions have the correction factor of 1.4826 built in. In R, it is included by default, meaning that if you call mad, it will return the Normal adjusted estimate of $\sigma$ (also called the Normal consistent estimate). Be aware of this in case what you actually wanted was just the MAD value for a sample of data, rather than the Normal consistent estimate of $\sigma$ calculated from MAD. In Python, for the scipy.stats.median_abs_deviation function we have to explicitly tell it that we want the correction factor applied. We do that by setting the scale argument of the function equal to ‘normal’. The code-snippet below illustrates the scipy.stats.median_abs_deviation function in action.

			
from scipy.stats import median_abs_deviation
mad_sd_estimate = median_abs_deviation(a, scale='normal')

For the simulated data used to produce Fig.1 and Fig.2, I also calculated the MAD based estimate of $\sigma$ . The plot below (Figure 3) shows the simulation average of the Normal consistent MAD-based estimates of the true standard deviation $\sigma$ . The simulation averages are plotted against $\log N$ . Again, I have actually plottted the ratio of the simulation average of $\hat{\sigma}$ to $\sigma$ then minus 1. The dashed line is at a value of zero on the y-axis and is used to indicate when we have a bias in our estimate. The points are the means of the simulation results, i.e. the mean of the estimates of $\sigma$ over all the simulation datasets for the particular value of $N$ . The error bars correspond to $\pm$ 2 standard errors of the simulation means, with the standard error simply estimated from the variance over the simulation results.

Figure 3: Plot of the simulation means of the MAD-based estimates of the standard deviation, plotted against log N.

From Fig.3 we can see that the accuarcy of the MAD-based estimate of $\sigma$ is very good. This would appear (from the plot) to be a consistent estimator. Again we can see evidence of the bias in the estimator at the smaller values of $N$ . However, we can’t yet do a comparison with the trimmed-variance estimator, as we are estimating a different quantity. The trimmed-variance estimates the underlying population variance, whilst the MAD-based estimator estimates the underlying population standard deviation. Fortunately, when I computed the MAD-based estimates, I also stored the effective estimates of the variance, so we can do a like-for-like comparison with the trimmed-variance estimator. The MAD-based estimates of $\sigma^{2}$ are shown in the plot in Figure 4 below.

Figure 4: Plot of Normal-consistent MAD-based estimates of the population variance plotted against log N.

The accuarcy of the MAD-based estimator appears to be, at least for these simulation results, superior to the trimmed-variance estimator, with the errors being around $\pm$ 1%. The number of simulation runs is not sufficient in this set of results to properly detect the bias even for the smallest values of $N$ . However, given the plot in Fig.3, we would expect that if we increased the number of simulation runs appropriately, we would be able to resolve this, i.e. it is simply that the standard errors of the mean of $\hat{\sigma}^{2}/\sigma^{2}$ are larger than the standard errors of the mean of $\hat{\sigma}/\sigma$ (for this MAD-based estimator).

Other distributions

So far we have been discussing outlier-contaminated Normal data. But what happens if we are certain our data wasn’t drawn from a Normal distribution with an additional outlier process on top? What happens if we think our data is Gamma-distributed with outliers? We can still apply the same ideas, but it can get trickier and also more computationally intensive. Specifically, we need to distinguish two cases:

The distribution we assume is still symmetric about its mean – in this case calculating the trimmed variance with an appropriate correction factor, or calculating MAD with an appropriate correction factor is still relatively straight forward. The correction factors may not be expressable in terms of common special functions, or in the worst case you may have to evaluate them numerically, once you have reduced the calculation down to some canonical form.
The assumed distribution is not symmetric about its mean – in these circumstances it can become a lot trickier. Obviously, now things such as the skewness of the distribution affect the value of MAD or the trimmed-variance value, and so calculation of the correction factors is now affected by the shape of the distribution. This means we typically need to estimate a ‘shape’ parameter of the parametric distribution as well as estimating the ‘scale’ parameter of the distribution, which is the thing we are predominantly interested in. This will mean solving for shape and scale parameters simultaneously, and we are probably going to have to do that root finding numerically, as it is unlikely we can do it analytically. Calculating the appropriate adjustment factors is usually possible in these circumstances, but is typically more computationally intensive and/or we have to introduce additional approximations.

Conclusions

In many situations we want to estimate the standard deviation of a distribution from a sample of data, but we know there are outliers present in the sample.
We can’t use our usual estimate of the standard deviation to detect the outliers, because that estimate is affected by the outliers.
By making an assumption about the parametric form of the distribution whose standard deviation we are trying to estimate, we can estimate the standard deviation using data from any part of the distribution. This means we can use just the middle part of the sample data.
If we throw away a fraction $\alpha$ of the data, we can easily compute the correction factor needed, based upon our parametric assumption of the distribution shape.
In R and Python there are convenient functions for calculating the trimmed-variance of a sample and also the Normal-consistent MAD-based estimate of the population standard deviation.
Calculation of trimmed-variance and MAD-based estimates of variance is possible for distributions other than the Normal but will be more computationally intensive, particularly for asymmetric distributions.

Data Science Notes: 4. What is an outlier?

On March 9, 2026March 9, 2026 By dchoyleIn Data Science, Data Science Notes, Statistics1 Comment

Summary

What defines an outlier is something that causes a great deal of confusion amongst Data Scientists. Most Data Scientists get this wrong.
An outlier is not just simply a datapoint that has a very large or very small value.
My personal definition of an outlier is, “a datapoint that comes from a different statistical process than the one I’m currently modelling”.
Since real world data is typically made from multiple statistical processes, it is better to think not in terms of a dataset being comprised of “ok data + outliers”, but instead to think of data as being comprised from data from process 1, plus data from process 2, and so on.
Throwing away outliers is not a good idea. They made up part of your real dataset so you are throwing away insight into how your data gets generated. It is ok however to exclude “outliers” from estimation of the parameters of a specific statistical process if you have confidently identified those “outliers” as coming from a different mechanism.

Introduction

The image above is a well known meme, right. It’s quite funny, right? Yes, except I have seen this precise situation happen. 25 years ago I went to a talk where a scientist was presenting some experimental gene expression work they’d been doing. As they were using a new assay technology called ‘microarrays’, which they were concerned could provide noisy measurements, they’d decided under advice from another scientist to throw away any datapoints that were more than +/- 2 standard deviations from the mean of their data. My colleague and I, who were working on the theory of statistical analysis of microarray data, asked how much data they’d discarded. “It’s quite weird. We always seem to end up removing about 5% of the total data, no matter what experiment we do” was the reply. I’ve remembered that reply word for word since then.

What do you think an outlier is?

Take a look at the plot below. It is a dataset of 150 datapoints, with the values plotted on the y-axis and the index of each datapoint on the x-axis. Which datapoints do you think might be outliers? Possibly the ones I’ve circled in red, as they are clearly much larger than the bulk of the data points. Perhaps we should discard them?

Plot of example data points with two data points circled in red as possible outliers — An example dataset with possible outliers highlighted in red.

Okay, what about this second plot below? Again, a dataset of 150 datapoints. Probably no outliers right? All the data points are reasonably clustered together on the y-axis. Yes, some are more towards the edges but you would think you could easily model the distribution of this data using a single Gaussian distribution.

An example dataset with no outliers highlighted — A second example dataset

The two plots are of the same data.

How is this so? The data is drawn (simulated) from a log-normal distribution. The second plot shows the data values on a log scale, and as we’ve already commented we would be happy modelling the distribution of the data using a Gaussian distribution, using all the data points to estimate the mean and variance of that Gaussian distribution. The first plot shows the data on the original scale – sometimes we refer to this as the ‘normal scale’, but not to be confused with ‘normal’ as in ‘normal distribution’, we just mean we haven’t taken the logarithm. And yet, now we know that the appropriate distribution to use to model this data is a log-normal distribution, we are happy to use all the datapoints in the first plot to estimate the parameters of that log-normal distribution. There is no need to discard or throw away any datapoints.

What have we learnt from this little experiment? Two things:

Whether a datapoint is an outlier or not has nothing to do with how big the data value is.
If a datapoint is consistent with the statistical distribution or process we are trying to model, then we can use it. It is not an outlier, no matter how big or small it is.

The second point is my favourite way of defining an outlier. An outlier is a datapoint that comes from a statistical process other than that process which we are currently trying to model. Okay, that definition sounds a bit dry and technical, but I like it because it emphasizes that an outlier is not ‘wrong’, ‘incorrect’ or ‘deficient’ in any sense. It has not ‘corrupted’ or ‘contaminated’ the dataset in any true sense, even though those are common terms used in statistics when discussing robust parameter estimation methods. Instead, saying that a datapoint is an outlier just means we are saying it comes from a different mechanism than the one we are currently modelling. All the datapoints in the two plots above came from the same statistical process. I know because that is how I generated them.

The definition of an outlier I have just given also tells us why we are justified in omitting (not throwing away) outliers when estimating the parameters associated with the statistical mechanism we want to model. If a datapoint hasn’t come from the statistical mechanism we want to model, including it in any estimation or training algorithm will lead to incorrect parameter estimates. So once we are confident we have correctly identified any outliers, omitting them from the parameter estimation process is justified.

An illustration

Let’s we want to estimate the typical energy usage per square metre of rooms in a residential house. I can get data on a load of houses; the total energy use, the number of rooms, and the total floor aera of those rooms. It’s an easy calculation to then work out the average energy consumption per square metre. But if I notice that several of the ‘houses’ in my dataset have over 150 rooms each, I‘m probably going to think that they are not typical residential houses. Most likely they are hotels, or mansions of some tech billionaires. Either way, the pattern of energy usage for the rooms in those hotels/mansions is likely to be very different from that of a room in a typical residential house. The energy usage for those hotel rooms comes from a different statistical process to that for the residential house rooms. The datapoints from those hotels are outliers to what I’m trying to model. I should therefore exclude them from my estimation algorithm. That’s easy to do now that I’m am confident I have identified the outliers appropriately. Job done!

Life (and data) is always made up of multiple processes

The other reason I like the outlier definition I have given is that it emphasizes that real data is almost always made up of data coming from multiple mechanisms or processes. The outlier definition I’ve used highlights that an outlier isn’t wrong, it’s just different. Sometimes we only want to model in depth just one of the statistical mechanisms contributing to our dataset, so we omit the outliers and run our training algorithm. Other times we need to model all of the data in our dataset and so we break down our dataset into the different generative processes/mechanisms that we think are at play, select the appropriate subset of data for each process, and run an appropriate training algorithm on that subset. Instead of thinking of the data as being made up of ‘good’ data plus ‘outliers’, I now think of it as, ‘data from mechanism 1’ plus ‘data from mechanism 2’ plus ‘data from mechanism 3’, and so on.

Let’s revisit our energy consumption example to illustrate this. Say, now I want to model energy use per square metre for different types of buildings – residential houses and hotels. Now it is just a question of filtering the data to the datapoints matching the building type. As we said before, the energy use patterns will probably be different for houses and for hotels, i.e. we have different statistical processes generating the different subtypes of data. Consequently, we choose to use different distribution types to model the different building types. So what we have is not outliers, just a different training process for each of the different building types.

How to detect outliers?

So far I have only discussed how to think about what an outlier is. I’ve said that you are justified in excluding genuine outliers from parameter estimation once you have correctly identified them. What I have not discussed is how you actually do the outlier detection. That is deliberate, as it is a much bigger topic. It is also one I will touch upon, in part, in later posts in the Data Science Notes series.

What you will hopefully have grasped is that if we are to use something like standard deviation (SD) to determine whether a datapoint is an outlier, we need to replace the naive “|x – mean | > 2SD” filter shown in the meme image at the start and replace it with a question that asks whether the datapoint is consistent with a statistical process that has the given mean and SD. So we need to replace the meme with a question more like, “what is the probability of the highest value in a dataset of size N being greater than x?” We can see that outlier detection now takes into account the sample size and brings us into the realm of extreme value theory (EVT) – also a topic for a later Data Science Note.

Conclusions

An outlier is a datapoint that has come from a statistical process different to the one we are currently modelling.
Excluding outliers from parameter estimation for the statistical process we are currently modelling is valid because they don’t, by definition, conform to that statistical process and so would bias the parameter estimation/training algorithm.
Most datasets are made up of data from multiple different statistical processes, and so we should get used to thinking not in terms of “ok data + outliers”, but data from process 1, data from process 2, and so on.
If we want to model a complete dataset we may need to appropriately subset the data into its different statistical process and run parameter estimation algorithms for each subset/process.
By breaking down a dataset into its component processes, we are handling in an appropriate way what we previously thought of as the “outliers”. We haven’t had to throw away any datapoints, merely just temporarily exclude them when necessary.

Abstraction is your friend

On February 13, 2026March 9, 2026 By dchoyleIn Algorithms, Coding, Data ScienceLeave a comment

Summary

When constructing a coding solution to an implememtation problem, or an algorithm/theory for a mathematical problem, we can get bogged down in detail and lose our way.
Abstraction helps us to move up a level and leave the detail behind.
Once we temporarily leave the detail behind, what we need to do becomes clearer.
Now we know what we have to do or implement, we can successfully put the detail back in because we have a clear ‘north star’ of what we need to implement.

Introduction

The computer scientist Edsger Dijkstra once said,

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

This is one of my favourite computing-related quotes, not least because it comes from Dijkstra. Edsger Dijkstra was a prodigious and extremely impactful computer scientist. In 1972 he was awarded the ACM A.M. Turing prize, widely regarded as the “Nobel prize for computer science”. Dijkstra was also “the computer scientist’s computer scientist”, providing rigorous solutions to tough practical problems, such as finding the shortest path between any two nodes on a network.

I also like the Dijkstra quote because I can relate to it. Whenever I want clarity on a problem, I abstract. Abstraction allows me to identify and focus on the “what” by taking the details of the “how” out of the discussion (for the time being). Once I know “what” it is I need to do I can put the details of the “how” back in and it becomes a straightforward technical implementation task (with debugging of course).

Yes, I hear you say, but what do you actually mean by all this high-level philosophical talk? I’ll give you two example. Yes, they are concocted examples, but they help illustrate what I’m driving at. Once I’ve discussed the two examples, we can then distil a more general approach of, i) when to recognize that we need more abstraction, and ii) how to do it.

Example 1:

Imagine you have the following equation,

$G + \Lambda g = \kappa T$ Eq.1

In Eq.1 $\Lambda$ and $\kappa$ are constants. With just your high-school mathematics, you can see that Eq. 1 is just a linear equation. With just high-school maths you can confidently understand that as $T$ increases then so does $G$ .

I will now tell you that Eq.1 represents the field equations for Einstein’s theory of general relativity. With just high-schools maths you’ve been able to understand general relativity. $G$ is the curvature tensor – it describes how spacetime is curved, $g$ is called the metric tensor, and $T$ is the stress-energy tensor. The simple high-level form of Eq.1 allows us to understand that energy, or equivalently mass (remember $E = mc^{2}$ ) affects how curved spacetime is. The more mass we have or the higher the energy density, then the more spacetime will be curved. And the curvature of spacetime affects the dynamics of anything moving in that spacetime. This simple high-level picture is so easy to grasp that the famous cosmologist John Archibald Wheeler could summarize Einstein’s theory in one short sentence, “Spacetime tells matter how to move; matter tells spacetime how to curve”.

All of the physics concepts of Einstein’s theory are in that simple equation of Eq.1 and likewise in the simple sentence from Wheeler. At this abstract level, the “what”, i.e. the physics, is crystal clear. However, when we start getting into the technical detail we have to go deeper. To “implement” a calculation we have to start to get to grips with the “how” and learn about Christoffel symbols, Ricci tensors, covariant and contravariant tensors, and much more. It requires mathematical training.

The details are also a lot messier. In the way I have presented the example, we started with the abstraction and then introduced some of the detail. When we are actually doing research, it is usually the reverse. We start with some existing details and want to construct a simplifying, unifying theory, but we get lost in the details because of the deep technical nature of those details.

A similar thing can happen when we are coding a Data Science solution. When we are deep in the technical details we often fail to spot the high-level patterns and so we fail to spot the simple description. We end-up producing lots of corner-cases and edge-cases. Our code starts to become one big series of “if-else if” statements or a big “switch” statement. Our second example illustrates that in a Data Science context.

Example 2:

In our first example we established that we are comfortable with linear models. In statistics a linear model is of the form,

$\mathbb{E}\left ( y \right ) \;=\; \underline{\beta}^{\top}\underline{x}$ Eq.2

The left-hand side of Eq.2 is the mean of our target variable $y$ and the right-hand side is what we call the “linear predictor” because it is a linear combination of the various predictive features and it predicts the average value of $y$ given the feature vector $\underline{x}$ .

What about non-linear models? Imagine that we have a non-linear relationship between our target variable and a single explanatory feature $x$ . One approach I’ve seen some Data Scientists take is to partition the values of $x$ into a number of sub-ranges and build a linear model for each sub-range. Sticking with linear models seems comfortable. But we end up with whole zoo of different linear models (yes, I really have seen someone do this).

Let’s move away from the details about how we do the model fitting and focus on what we want. Let’s abstract. We want to model the mean of the target variable. When we state the problem in these simple terms we realise we can just write the mean of the target variable as a non-linear transformation of the linear predictor in Eq.2. We write this as,

$\mathbb{E}\left( y \right )\;=\;g^{-1}\left ( \underline{\beta}^{\top}\underline{x} \right )$ . Eq.3

This is what statisticians call a Generalized Linear Model (GLM). Again, the left-hand side of Eq.3 is how we represent the mean of the target variable $y$ and the non-linear transformation function is traditionally represented as $g^{-1}(\cdot)$ . The function $g$ is called the “link function” and is monotonic. So we can read Eq.3 as “mean of y = simple monotonic non-linear transformation of $\underline{\beta}^{\top}\underline{x}$ ”. You can see where the name Generalized Linear Model comes from. It is just a non-linear transformation of the same thing we focus on when build a linear model.

The way we introduced Eq.3 seemed very natural and intuitive. And yet so many Data Scientists seem scared of GLMs. The jargon of link functions and the notation of expectation values $\mathbb{E}( y)$ can be off-putting when you first encounter them. Consequently, many Data Scientists don’t persevere with GLMs. When you understand what Eq.3 is saying at high-level they are very easy to understand. When we step away from the messy detail of “how” to code the non-linearity and instead focus on the “what” of what we want to achieve – introduce a non-linear relationship between the mean of our target variable and a linear combination of predictive features – then Eq.3 becomes the intuitive and obvious way to do it. Coding it up then becomes easy.

What can we learn more generally from those two examples?

How to use abstraction as a practical tool

What the two examples above will hopefully have convinced you about is:

Focusing at a high-level on the ‘things’ in our problem and what we want them to do helps us to identify the main actors in problem, the relationships between them and what we need to do to them. We discover the “what”.
Once we have the what, the “how”, i.e. the implementation is typically easy.

But how to use this in practice? Whenever I find myself trying to solve a coding problem and I’m getting bogged down in the details of structuring the problem, flipping between different choices of data-structures to use, then I know I’m too close to the problem. I know I need to step away from the keyboard and get the pen and paper out.

I start by representing the main objects in my problem by some sort of symbol, e.g. a circle or a square. I then sketch out what interactions happen between those objects. By this point I’m beginning to describe the interactions in terms of high-level language such as, “I have matrix of type X. It is processed by a transform of type Y that is parameterized by mathematical object of type A”. You’ll spot that I’m already beginning to describe the interactions more in terms of interfaces and method signatures. That is, I’m focusing on the abstraction of the problem and not on the implementation details that would occur behind those interfaces and method signatures. This is because I have no other choice. I have only pen and paper. I can’t code, so I can’t bogged down again in implementation details and discussions on what data structures to use.

Once I have finished a couple of iterations of pen-and-paper sketching I have the architecture of my solution nailed. And the best thing is the structure typically mimics the structure of the actual mathematical calculation I’m trying to code. At this point I have identified what interfaces, classes, and methods I need. I now return to the keyboard and the implementation is easy.

Conclusion

If you’re getting bogged down in working out implementation details, it is a good sign you don’t actually know “what” it is you need to implement.

Step back, move up a level of abstraction and start identifying what are the things/actors in your problem and what are the interactions between them. Doing this using pen-and-paper will aid you a lot, because it forces you away from the keyboard and the temptation to continue implementing.

Once you have sketched out the necessary objects and their interactions you will pretty much have the classes, interfaces, and methods you need. You have the “what”. Now you can implement. It will now be a lot faster.

Log Normal Bias and the infinite capacity of mathematics to surpise

On January 17, 2026January 31, 2026 By dchoyleIn Data Science, Mathematical Analysis, MathematicsLeave a comment

Summary

Always confirm your mathematical intuition with an explicit calculation, no matter how simple the calculation.
Always confirm your mathematical intuition with an explicit calculation, no matter how well you know the particular area of mathematics.
Don’t leave it 25 years to check your mathematical intuition with a calculation, no matter how confident you are in that intuition.

Introduction

One of the things about mathematics is that it’s not afraid to slap you in the face and stand there laughing at you. Metaphorically speaking, I mean. Mathematics has an infinite capacity to surprise even in areas where you think you can’t be surprised. And you end up looking or feeling like an idiot, with mathematics laughing at you.

This is a salutary lesson about how I thought I understood an area of statistics, but didn’t. The area in question is the log-normal distribution. This is a distribution I first encountered 25 years ago and have worked with almost continuously since.

The problem which slapped me in the face

The particular aspect of the log-normal distribution in question, was what happens when you go from models or estimates on the log-scale, to models or estimates on the ‘normal’ or exponentiated scale. For a log-normal random variable $Y$ with $\sigma^{2}$ being the variance of $\log Y$ , we have the well-known result,

$\mathbb{E}\left ( Y \right ) \;=\; \exp\left( \mathbb{E}\left (\log Y \right ) \right ) \times \exp\left ( \frac{1}{2}\sigma^{2}\right )$

This is an example of Jensen’s inequality at work. If you apply a non-linear function $f$ to a random variable $Y$ , then the average of the function is not equal to the function of the average. That is, $\mathbb{E}\left ( f(Y)\right )\;\neq\; f\left ( \mathbb{E}\left ( Y\right )\right )$ .

In the case of the log-normal distribution, the first equation above tells us that we can relate the $\mathbb{E}\left ( \log Y\right )$ to $\mathbb{E}\left (Y)\right)$ via a simple bias-correction factor $\exp ( \frac{1}{2}\sigma^{2} )$ . So, if we had a sample of $N$ observations of $\log Y$ we could estimate the expectation of $Y$ from an estimate of the expectation of $\log Y$ by multiplying by a bias correction factor of $\exp\left ( \frac{1}{2} \sigma^{2}\right )$ . We can estimate $\sigma^{2}$ from the residuals of our estimate of the expectation of $\log Y$ . In this case, I’m going to assume our model is just an estimation of the mean, $\mathbb{E}\left ( \log Y \right )$ , which I’ll denote by $\mu$ , and I’m not going to try to decompose $\mu$ any further. If our estimate of $\mu$ is just the sample mean of $\log Y$ , then our estimate of $\sigma^{2}$ would simply be the sample variance, $s^{2}$ . If we denote our observations by $y_{i}\;,\; i=1,\ldots,N$ then $s^{2}$ is given by the formula,

$s^{2}\;=\; \frac{1}{N-1}\sum_{i} \log^{2} y_{i} \;-\; \frac{1}{N(N-1)}\left( \sum_{i}\log y_{i}\right )^{2}$

Note the usual Bessel correction that I’ve applied in the calculation of $s^{2}$ .

We can see from this discussion that the bias correction in going from the log-scale to the ‘normal’ scale is multiplicative in this case. The bias correction factor $B_{1}$ is given by,

$B_{1} \;=\;\exp\left ( \frac{1}{2} s^{2}\right )$

If we had suspected that any bias correction needs to be multiplicative, which seems reasonable given we are going from a log-scale to a normal-scale, then we could easily have estimated a bias correction factor as simply the ratio of the mean of the observations on the normal scale to the exponential of the mean of the observations on the log scale. That is, we can construct a second estimate, $B_{2}$ , of the bias correction factor needed, with $B_{2}$ given by,

$B_{2} \; =\; \frac{\frac{1}{N}\sum_{i}y_{i}}{\exp\left ( \frac{1}{N}\sum_{i}\log y_{i}\right )}$

I tend to refer to the $B_{2}$ method as a non-parametric method because it is how you would estimate $\mathbb{E}\left ( Y \right )$ from an estimate of $\mathbb{E}\left ( \log Y \right )$ and you only knew the bias correction needed was multiplicative. You wouldn’t explicitly need to make any assumptions about the distribution of the data. In contrast, $B_{1}$ is specific to the data coming from a log-normal distribution, so I call it a parametric estimate. For these reasons, I tend to prefer using the $B_{2}$ way of estimating any necessary bias correction factor, as obviously any real data won’t be perfectly log-normal and so the $B_{2}$ estimator will be more robust whilst also being very simple to calculate.

A colleague asks an awkward question

So far everything I’ve described is well known and pretty standard. However, a few months ago a colleague asked me if there would be any differences between the estimates $B_{1}$ and $B_{2}$ ? “Of course”, I said, if the data is not log-normal. “No”, my colleague said. They meant, even if the data was log-normal would there be any differences between the estimates $B_{1}$ and $B_{2}$ . Understanding the size of any differences when the parametric assumptions are met would help understand how big the differences could be when the parametric assumptions are violated. “Yeah, there will obviously still be differences between $B_{1}$ and $B_{2}$ even if we do have log-normal data, because of sampling variation affecting the two formulae differently. But, I would expect those differences to be small even for moderate sample sizes”, I confidently replied.

I’ve been using those formulae for $B_{1}$ and $B_{2}$ for over 25 years and never considered the question my colleague asked. I’d never confirmed my intuition. So 25 years later, I thought I’d do a quick calculation to confirm my intuition. My intuition was very very wrong! I thought it would be interesting to explain the calculations I did and what, to me, was the surprising behaviour of $B_{1}$ . Note, with hindsight the behaviour I uncovered shouldn’t have been surprising. It was obvious once I’d begun to think about it. It was just that I’d left thinking about it till 25 years too late.

The mathematical detail

The question my colleague was asking was this. If we have $N$ i.i.d. observations drawn from a log-normal distribution, i.e.,

$x_{i}\;=\;\log y_{i} \sim {\cal{N}}\left (\mu, \sigma^{2}\right )\;\;\; , i = 1,\ldots,N\;\;\;,$

what are the expectation values of the bias correction factors $B_{1}$ and $B_{2}$ at any finite value of $N$ . Do $\mathbb{E}\left ( B_{1} \right )$ and $\mathbb{E}\left ( B_{2}\right )$ converge to $\exp(\frac{1}{2}\sigma^{2})$ as $N\rightarrow\infty$ , and if so how fast?

Since the $\log y_{i}$ are Gaussian random variables we can express these expectation values in terms of simple integrals and we get,

$\mathbb{E}\left ( B_{1}\right )\;=\; \left( 2\pi\sigma^{2}\right)^{-\frac{N}{2}}\int_{{\mathbb{R}}^{N}} d{\underline{x}}\exp\left [-\frac{1}{2\sigma^{2}}\left ( \underline{x} - \mu\underline{1}_{N}\right )^{\top}\left ( \underline{x} - \mu\underline{1}_{N}\right )\right ] \exp\left[ \frac{1}{2} \underline{x}^{\top}\underline{\underline{M}}\underline{x}\right ]$ Eq.1

where the matrix $\underline{\underline{M}}$ is given by,

$\underline{\underline{M}}\;=\;\frac{1}{N-1}\underline{\underline{I}}_{N}\;-\;\frac{1}{N(N-1)}\underline{1}_{N}\underline{1}_{N}^{\top}$ .

Here ${\underline{1}}_{N}$ is an N-dimensional vector consisting of all ones, and $\underline{\underline{I}}_{N}$ is the $N\times N$ identity matrix.

We can similarly express the expectation of $B_{2}$ and we get,

$\mathbb{E}\left ( B_{2}\right )\;=\;\frac{1}{N}\sum_{i=1}^{N} \left( 2\pi\sigma^{2}\right)^{-\frac{N}{2}}\int_{{\mathbb{R}}^{N}} d{\underline{x}}\exp\left [-\frac{1}{2\sigma^{2}}\left ( \underline{x} - \mu\underline{1}_{N}\right )^{\top}\left ( \underline{x} - \mu\underline{1}_{N}\right )\right ] \exp \left [ \sum_{j=1}^{N}\left ( \delta_{ij} - \frac{1}{N}\right )x_{j}\right ]$ Eq.2

The matrices in the exponents of the integrands of Eq.1 and Eq.2 are easily diagonalized analytically and so the integrals are relatively easy to evaluate analytically. From this we find,

$\mathbb{E}\left ( B_{1}\right )\;=\; \left ( 1\;\;-\;\frac{\sigma^{2}}{N-1}\right )^{-\frac{(N-1)}{2}}$ Eq.3

$\mathbb{E}\left ( B_{2}\right )\;=\; \exp \left [ \frac{\sigma^{2}}{2}\left ( 1\;-\;\frac{1}{N}\right )\right ]$ Eq.4

We can see that as $N\rightarrow\infty$ both expressions for the bias correction tend to the correct value of $\exp(\sigma^2/2)$ . However, what is noticeable is at finite $N$ the value of $\mathbb{E}\left( B_{1}\right))$ has a divergence. Namely, as $\sigma^{2} \rightarrow N-1$ , we have $\mathbb{E}\left( B_{1}\right ) \rightarrow \infty$ and for $\sigma^{2} \ge N-1$ , we have that $\mathbb{E}\left( B_{1}\right )$ doesn’t exist. This means that for a finite sample size there are situations where the parametric form of the bias correction will be very very wrong.

At first sight I was shocked and surprised. The divergence shouldn’t have been there. But it was. My intuition was wrong. I was expecting there to be differences between $\mathbb{E}\left ( B_{1}\right)$ and $\mathbb{E}\left ( B_{2}\right )$ , but very small ones. I’d largely always thought of any large differences that arise in practice between the two methods to be due to the distributional assumption made – because a real noise process won’t follow a log-normal distribution we’ll always get some differences. But the derivations above are analytic. They show that there can be a large difference between the two bias-correction estimation methods even when all assumptions about the data have been met.

With hindsight, the reason for the large difference between the two expectations is obvious. The second, $\mathbb{E}\left ( B_{2}\right )$ , involves the expectation of the exponential of a sum of Gaussian random variables, whilst the first involves the expectation of the exponential of a sum of the squares of those same Gaussian random variables. The expectation in Eq.1 is a Gaussian integral and if $\sigma^{2}$ is large enough the coefficient of the quadratic term in the exponent becomes positive, leading to the expectation value becoming undefined.

Even though in hindsight the result I’d just derived was obvious, I still decided to simulate the process to check it. As part of the simulation I also wanted to check the behaviour of the variances of $B_{1}$ and $B_{2}$ , so I needed to derive analytical results for ${\rm Var}\left ( B_{1}\right )$ and ${\rm Var}\left ( B_{2}\right )$ . These calculations are similar to the evaluation of $\mathbb{E}\left ( B_{1}\right )$ and $\mathbb{E}\left ( B_{2}\right )$ , so I’ll just state the results below,

${\rm Var}\left ( B_{1}\right)\;=\; \left ( 1 - \frac{2\sigma^{2}}{N-1}\right )^{-\frac{(N-1)}{2}}\;-\;\left ( 1 - \frac{\sigma^{2}}{N-1} \right )^{-(N-1)}$ Eq.5

${\rm Var}\left ( B_{2}\right)\;=\; \left ( 1 - \frac{1}{N}\right )\exp \left [ \sigma^{2}\left ( 1 - \frac{2}{N}\right )\right ]\;+\;\frac{1}{N}\exp \left [ 2\sigma^{2}\left ( 1 - \frac{1}{N}\right )\right ]\;-\; \exp \left [ \sigma^{2}\left ( 1 - \frac{1}{N}\right )\right ]$ Eq.6

Note that the variance of $B_{1}$ also diverges, but when $\frac{\sigma^{2}}{N-1} = \frac{1}{2}$ , so much before $\mathbb{E}\left ( B_{1}\right )$ diverges.

To test Eq.3 – Eq.6 I ran some simulations. First, I introduced a scaled variance $\phi = \sigma^{2}/(N-1)$ . The divergence in $\mathbb{E}\left ( B_{1}\right )$ occurs at $\phi = 1$ , so $\phi$ gives us a way of measuring how problematic the value of $\sigma^{2}$ will be for the given value of $N$ . The divergence in ${\rm Var}\left ( B_{1} \right )$ occurs at $\phi = 0.5$ . I chose to set $N=20$ in my simulations, so quite a small sample but there are good reasons that I’ll explain later. For each value of $\phi$ I generated $N_{sim} = 100000$ sample datasets, each of $N$ log-normal datapoints, and calculated $B_{1}$ and $B_{2}$ . From the simulation sample of $B_{1}$ and $B_{2}$ values I then calculated sample estimates of $\mathbb{E}\left ( B_{1}\right )$ , $\mathbb{E}\left ( B_{2}\right )$ , ${\rm Var}\left ( B_{1}\right )$ , and ${\rm Var}\left ( B_{2}\right )$ . The simulation code is in the Jupyter notebook LogNormalBiasCorrection.ipynb in the GitHub repository here.

Below in Figure 1 I’ve shown a plot of the expectation of $B_{1}$ , relative to the true bias correction factor $\exp\left( \frac{1}{2}\sigma^{2}\right )$ , as a function of $\phi$ . The red line is based on the theoretical calculation in Eq.3, whilst the blue points are simulation estimates,

A plot showing how the average relative error made by the parametric bias correction factor increases as the variance of the log-normal data increases. The plot shows both simulation and theory estimates of the relative error. The theory is in close agreement with the simulation results. — Figure 1: Plot of theory and simulation estimates of the average relative error made by the parametric bias correction factor, as we increase the variance of the log-normal data.

The agreement between simulation and theory is very good. But even for these small values of $\phi$ we can see the error that $B_{1}$ makes in estimating the correct bias correction factor is significant – nearly 40% relative error at $\phi = 0.25$ . For $N=20$ , the value of $\sigma$ is 2.24 at $\phi=0.25$ . Whilst this is large for the additive log-scale noise, it is not ridiculously large. It shows that the presence of the divergence at $\phi = 1$ is strongly felt even a long way away from that divergence point.

The error bars on the simulation estimates correspond to $\pm 2$ standard-errors of the mean. In calculating the standard-errors of the mean I’ve used the theoretical calculation of ${\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ (derived from Eq.5), rather than the simulation sample estimate of it. The reason being that once one goes above about $N=20$ and above $\phi = 0.25$ , the variance (over different sets of simulation runs) of ${\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ becomes large itself, and so obtaining a stable accurate simulation estimate of ${\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ requires such a prohibitively large number of simulations (way greater than 100000) that estimating ${\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ from a simulation sample becomes impractical. So I used the theoretical result instead. Hence also why the plot below is restricted to $\phi \le 0.25$ .

For the range $\phi\le 0.25$ , where the simulation sample size is sufficiently large enough that we can trust our simulation sample based estimates of ${\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ , we can compare them with the theoretical values. This comparison is plotted in Figure 2 below.

A plot showing how variance of the relative error made by the parametric bias correction factor increases as the variance of the log-normal data increases. The plot shows both simulation and theory estimates of the relative error. The theory is in close agreement with the simulation results. — Figure 2: Plot of theory and simulation estimates of the variance of the relative error made by the parametric bias correction factor, as we increase the variance of the log-normal data.

We can see that the simulation estimates of ${\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ confirm the accuracy of the theoretical result in Eq.5, and consequently confirm the validity of the standard error estimates in Figure 1.

Okay, so that’s confirmed the ‘surprising’ behaviour of the bias-correction estimate $B_{1}$ , but what about $B_{2}$ ? How do simulation estimates of $\mathbb{E}\left ( B_{2}\right )$ and ${\rm Var}\left ( B_{2}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right )$ compare to the theoretical results in Eq.4 and Eq.6 ? I’ve plotted these in Figures 3 and 4 below.

A plot showing how the average relative error made by the non-parametric bias correction factor changes as the variance of the log-normal data increases. The plot shows both simulation and theory estimates of the relative error. The theory is in close agreement with the simulation results. — Figure 3: Plot of theory and simulation estimates of the average relative error made by the non-parametric bias correction factor, as we increase the variance of the log-normal data.

The accuracy of $B_{2}$ is much better than that of $B_{1}$ . By comparison $B_{2}$ only makes a 10% relative error (under-estimate) on average at $\phi=0.25$ , compared to the 40% over-estimate of $B_{1}$ (on average). We can also see that the magnitude of the relative error increases slowly as $\phi$ increases, again in constrast with the rapid increase in the relative error made by $B_{1}$ (in Figure 1).

A plot showing how variance of the relative error made by the non-parametric bias correction factor increases as the variance of the log-normal data increases. The plot shows both simulation and theory estimates of the relative error. The theory is in close agreement with the simulation results. — Figure 4: Plot of theory and simulation estimates of the variance of the relative error made by the non-parametric bias correction factor, as we increase the variance of the log-normal data.

Conclusion

So, the theoretical derivations and analysis I did of $B_{1}$ and $B_{2}$ turned out to be correct. What did I learn from this? Three things. Firstly, even areas of mathematics that you understand well, have worked with for a long time, and are relatively simple in mathematical terms (e.g. just a univariate distribution) have the capacity to surprise you. Secondly, given this capacity to surprise it is always a good idea to check your mathematical hunches, even with a simple calculation, no matter how confident you are in those hunches. Thirdly, don’t leave it 25 years to check your mathematical hunches. The evaluation of those integrals defining $\mathbb{E}\left ( B_{1}\right )$ and $\mathbb{E}\left ( B_{2}\right )$ only took me a couple of minutes. If I had calculated those integrals 25 years ago, I wouldn’t be feeling so stupid now.

An equation showing the mathematical form of the log sum exp calculation.

Data Science Notes: 2. Log-Sum-Exp

On January 1, 2026January 31, 2026 By dchoyleIn Algorithms, Coding, Data Science, Data Science Notes, Numerical Analysis, pythonLeave a comment

Summary

Summing up many probabilities that are on very different scales often involves calculation of quantities of the form $\log\left ( \sum_{k} \exp\left( a_{k}\right )\right )$ . This calculation is called log-sum-exp.
Calcuating log-sum-exp the naive way can lead to numerical instabilities. The solution to this numerical problem is the “log-sum-exp” trick.
The scipy.special.logsumexp function provides a very useful implementation of the log-sum-exp trick.
The log-sum-exp function also has uses in machine learning, as it is a smooth, differentiable approximation to the ${\rm max}$ function.

Introduction

This is the second in my series of Data Science Notes series. The first on Bland-Altman plots can be found here. This post is on a very simple numerical trick that ensures accuracy when adding lots of probability contributions together. The trick is so simple that implementations of it exist in standard Python packages, so you only need to call the appropriate function. However, you still need to understand why you can’t just naively code-up the calculation yourself, and why you need to use the numerical trick. As with the Bland-Altman plots, this is something I’ve had to explain to another Data Scientist in the last year.

The log-sum-exp trick

Sometimes you’ll need to calculate a sum of the form, $\sum_{k} \exp\left ( a_{k}\right )$ , where you have values for the $a_{k}$ . Really? Will you? Yes, it will probably be calculating a log-likelihood, or a contribution to a log-likelihood, so the actual calculation you want to do is of the form,

$\log\left ( \sum_{k} \exp \left ( a_{k}\right )\right )$

These sorts of calculations arise where you have log-likelihood or log-probability values $a_{k}$ for individual parts of an overall likelihood calculation. If you come from a physics calculation you’ll also recognise the expression above as the calculation of a log-partition function.

So what we need to do is exponentiate the $a_{k}$ values, sum them, and then take the log at the end. Hence the expression “log-sum-exp”. But why a blogpost on “log-sum-exp”? Surely, it’s an easy calculation. It’s just np.log(np.sum(np.exp(a))) , right ? Not quite.

It depends on the relative values of the $a_{k}$ . Do the naïve calculation np.log(np.sum(np.exp(a))) and it can be horribly inaccurate. Why? Because of overflow and underflow errors.

If we have two values $a_{1}$ and $a_{2}$ and $a_{1}$ is much bigger than $a_{2}$ , when we add $\exp(a_{1})$ to $\exp(a_{2})$ we are using floating point arithmetic to try and add a very large number to a much smaller number. Most likely we will get an overflow error. If would be much better if we’d started with $\exp(a_{1})$ and try to add $\exp(a_{2})$ to it. In fact, we could pre-compute $a_{2} - a_{1}$ , which would be very negative and from this we could easily infer that adding $\exp(a_{2})$ to $\exp(a_{1})$ would make very little difference. In fact, we could just approximate $\exp(a_{1}) + \exp(a_{2})$ by $\exp(a_{1})$ .

But how negative does $a_{2} - a_{1}$ have to be before we ignore the addition of $\exp(a_{2})$ ? We can set some pre-specified threshold, chosen to avoid overflow or underflow errors. From this, we can see how to construct a little Python function that takes an array of values $a_{1}, a_{2},\ldots, a_{N}$ and computes an accurate approximation to $\sum_{k=1}^{N}\exp(a_{k})$ without encountering underflow or overflow errors.

In fact we can go further and approximate the whole sum by first of all identifying the maximum value in an array $a = [a_{1}, a_{2}, \ldots, a_{N}]$ . Let’s say, without loss of generality, the maximum value is $a_{1}$ . We could ensure this by first sorting the array, but it isn’t necessary actually do this to get the log-sum-exp trick to work. We can then subtract $a_{1}$ from all the other values of the array, and we get the result,

$\log \left ( \sum_{k=1}^{N} \exp \left ( a_{k} \right )\right )\;=\; a_{1} + \log \left ( 1\;+\;\sum_{k=2}^{N} \exp \left ( a_{k} - a_{1} \right )\right )$

The values $a_{k} - a_{1}$ are all negative for $k \ge 2$ , so we can easily approximate the logarithm on the right-hand side of the equation by a suitable expansion of $\log (1 + x)$ . This is the “log-sum-exp” trick.

The great news is that this “log-sum-exp” calculation is so common in different scientific fields that there are already Python functions written to do this for us. There is a very convenient “log-sum-exp” function in the SciPy package, which I’ll demonstrate in a moment.

The log-sum-exp function

The sharp-eyed amongst you may have noticed that the last formula above gives us a way of providing upper and lower bounds for the ${\rm max}$ function. We can simply re-arrange the last equation to get,

${\rm max} \left ( a_{1}, a_{2},\ldots, a_{N} \right ) \;\le\; \log \left ( \sum_{k=1}^{N}\exp \left ( a_{k}\right )\right )$

The logarithm calculation on the right-hand side of the inequality above is what we call the log-sum-exp function (lse for short). So we have,

${\rm max} \left ( a_{1}, a_{2},\ldots, a_{N} \right ) \;\le\; {\rm lse}\left ( a_{1}, a_{2}, \ldots, a_{N}\right )$

This gives us an upper bound for the ${\rm max}$ function. Since $a_{k} \le {\rm max}\left ( a_{1}, a_{2},\ldots,a_{N}\right )$ , it is also relatively easy to show that,

${\rm lse} \left ( a_{1}, a_{2}, \ldots, a_{N}\right )\;\le\; {\rm max}\left ( a_{1}, a_{2},\ldots, a_{N} \right )\;+\;\log N$

and so we have have a lower bound for the ${\rm max}$ function. So the log-sum-exp function allows us to compute lower and upper bounds for the maximum of an array of real values, and it can provide an approximation of the maximum function. The advantage is that the log-sum-exp function is smooth and differentiable. In contrast, the maximum function itself is not smooth nor differentiable everywhere, and so is less convenient to work with mathematically. For this reason the log-sum-exp is often called the “real-soft-max” function because it is a “soft” version of the maximum function. It is often used in machine learning settings to replace a maximum calculation.

Calculating log-sum-exp in Python

So how do we calculate the log-sum-exp function in Python. As I said, we can use the SciPy implementation which is in scipy.special. All we need to do is pass an array-like set of values $a_{k}$ . I’ve given a simple example below,

			
# import the packages and functions we need
import numpy as np
from scipy.special import logsumexp
# create the array of a_k values 
a = np.array([70.0, 68.9, 20.3, 72.9, 40.0])
# Calculate log-sum-exp using the scipy function 
lse = logsumexp(a)
# look at the result
print(lse)

		

This will give the result 72.9707742189605

The example above and several more can be found in the Jupyter notebook DataScienceNotes2_LogSumExp.ipynb in the GitHub repository https://github.com/dchoyle/datascience_notes

The great thing about the SciPy implementation of log-sum-exp is that it allows us to include signed scale factors, i.e. we can compute,

$\log \left ( \sum_{k=1}^{N} b_{k}\exp\left ( a_{k}\right ) \right )$

where the values $b_{k}$ are allowed to be negative. This means, that when we are using the SciPy log-sum-exp function to perform the log-sum-exp trick, we can actually use it to calculate numerically stable estimates of sums of the form,

$\log \left ( \exp\left ( a_{1}\right ) \;-\; \exp\left ( a_{2}\right ) \;-\;\exp\left ( a_{3}\right )\;+\;\exp\left ( a_{4}\right )\;+\ldots\; + \exp\left ( a_{N}\right )\right )$ .

Here’a small code snippet illustrating the use of the scipy.special.logsumexp with signed contributions,

			
# Create the array of the a_k values
a = np.array([10.0, 9.99999, 1.2])
b = np.array([1.0, -1.0, 1.0])
# Use the scipy.special log-sum-exp function
lse = logsumexp(a=a, b=b)
# Look at the result
print(lse)

		

This will give the result 1.2642342014146895.

If you look at the output of the example above you’ll see that the final result is much closer to the value of the last array element $a_{3} = 1.2$ . This is because the first two contributions, $\exp(a_{1})$ and $\exp(a_{2})$ almost cancel each other out because the contribution $\exp(a_{2})$ is pre-fixed by a factor of -1. What we get left with is something close to $\log(\exp(a_{3}))\;=\; a_{3}$ .

There is also a small subtlety in using the SciPy logsumexp function with signed contributions. If the substraction of some terms had led to an overall negative result, scipy.special.logsumexp will rerturn NaN as the result. In order to get it to always return a result for us, we have to tell it to return the sign of the final summation as well, by setting the return_sign argument of the function to True. Again, you can find the code example above and others in the notebook DataScienceNotes2_LogSumExp.ipynb in the GitHub repository https://github.com/dchoyle/datascience_notes.

When you are having to combine lots of different probabilities, that are on very different scales, and you need to subtract some of them and add others, the SciPy log-sum-exp function is very very useful.

The shape of mathematics to come

On December 21, 2025January 20, 2026 By dchoyleIn AI, Artificial Intelligence, Data Science, MathematicsLeave a comment

Summary

Using AI for mathematics research is a much less hyped field than GenAI in general.
The pre-print, “The Shape of Math To Come”, from Alex Kontorovich gives an excellent recent, readable, and accessible discussion of the potential for using AI to scale-up and automate formal mathematics research.

Introduction

Like the rest of the universe I’ve been drawn into experimenting with GenAI for general productivity tasks. However, one of my other interests is how GenAI gets used in scientific discovery. Even more interesting is how GenAI can be used in mathematical discovery. If you want to know where the combination of mathematics and AI is going or could go, then I would strongly recommend reading this arXiv pre-print from Alex Kontorovich, titled “The Shape of Math To Come”. The pre-print was uploaded to the archive in October of this year. With some free evenings last week, I finally got round to finish reading and digesting it.

GenAI for rigorous mathematics? Really?

It’s important to point out that I’m not talking about the mathematics behind GenAI, but about how GenAI is used or could be used as a tool to support the formal development of rigorous mathematical proofs, derivations and statements. Because of that rigour, there is actually less hype (compared to mainstream applications of GenAI) around the use of AI in this way. In the pre-print Kontorovich outlines a credible and believable road ahead for using GenAI in formal mathematics. For me the roadmap is credible because of i)Kontorovich’s credentials as Distinguished Professor of mathematics at Rutger’s University, ii) the significant progress that has been made in this area over the last couple of years, particularly the AlphaProof system from DeepMind, iii) and the existence of powerful automated theorem provers such as Lean.

The pre-print is very readable and even if you don’t have a formal mathematical background but have, say an undergraduate science or engineering degree, you will be able to follow the majority of the discussion – there is only one place where specialized knowledge of analysis is used and even then it is not essential for following Kontorovich’s argument.

The main potential sticking point that Kontorovich highlights in using AI to do maths is the tension between the stochastic nature of Large Language Models (LLMs) and the deterministic nature of mathematical proof. Kontorovich’s argument is that the use of automated theorem checkers such as Lean, in combination with a large and increasing library (MathLib) that formally encodes our current rigorous mathematical knowledge base, will provide a checking mechanism to mitigate the stochastic nature of LLM outputs. In this scenario, human mathematicians are elevated to the role of orchestrating and guiding the LLMs in their attempts to construct new proofs and derivations (using Lean and aided by the knowledge base encoded in MathLib). Whether human mathematicians choose to adopt this approach will depend on how fruitful and easy to use the new GenAI-based mathematics tools are. By analogy with the take-up of the LaTeX typesetting language in the early 1990s, Kontorovich says that this will be when “the ratio of time required to discover and formalize a mathematical result (from initial idea through verified formal proof) to the time required to discover and write up the same result in natural language mathematics” falls below one. When this happens there will also be accelerated ‘network’ effects – increased number of verified proofs encoded into MathLib will increase the ability of computer + GenAI assisted workflows to construct further verified proofs.

Less formal mathematics research

What interests me as a Data Scientist is, if such approaches are successful in advancing how mathematicians do formal mathematics, how much of those approaches can be adopted or modified to advance how non-mathematicians do non-formal mathematics. I’m not talking about the “closed-loop” GenAI-based “robot scientist” approaches that are being discussed in drug-discovery and materials development fields. I’m talking about the iterative, exploratory process of developing a new model training algorithm for a new type of data, or new application domain, or new type of problem. Data Science is typically about approximation. Yes, there is formal derivation in places, and rigorous proofs in others, but there is also a lot of heuristics. Those heuristics can be expressed mathematically, but they are still heuristics. This is not a criticism of using such heuristics – sometimes they are used out of ignorance that better approaches exist, but often they are used out of necessity arising from constraints imposed by the scale of the problem or runtime and storage constraints. What I’m driving at is that the Data Science development process is a messy workflow, with some steps being highly mathematical and imposing a high degree of rigour, some less so. Having access to proven tools that accelerate the exploration of the mathematical steps of that workflow can obviously accelerate the overall Data Science development process. At the moment, switching between mathematical and non-mathematical steps introduces context-switching and hence friction, so my tool use for the mathematical steps is limited. I might use Mathematica for an occasional tedious expansion and where tracking and collating expansion terms is error-prone. For my day-to-day work my mathematical steps will be limited to, say, 40 pages of A4, and are more typically between 5 – 15 pages of A4, and occasionally 1-2 pages. At such scales, I can just stick to pen-and-paper. If however, mathematical support tools developed from doing formal mathematics research became more seamlessly integrated into the Data Science development process, I would change my way of working. However, there is another problem. And that is the messy nature of the Data Science development process I outlined. Not all the steps in the process are rigorous. That means the Data Science development workflow doesn’t have access to the self-correcting process that formal mathematics does. In Alex Kontorovich’s paper he makes an optimistic case for how the use of formal theorem checkers, in combination with GenAI, can genuinely advance formal mathematics. I think of this as akin to how physics-based engines are used to provide rigorous environments for reinforcement-learning. The messy Data Science workflow typically has no such luxury. Yes, sometimes we have access to the end-point ground-truth, or we can simulate such truths under specified assumptions. More often than not, our check on the validity of a specific Data Science development workflow is an ad-hoc combination of data-driven checks, commercial vertical domain expertise, technical vertical expertise, and hard-won experience of ‘what works’. That needs to get better and faster. I would love for some of the tools and processes outlined by Kontorovich to be adapted to less formal but still mathematically-intensive fields such as the algorithm development side of Data Science. Will it happen? I don’t know. Can it be done? I don’t know, as yet.

A Bland-Altman plot of the peak expiratory flow rate data taken from the 1986 Lancet paper of Bland and Altman.

Data Science Notes: 1. Bland-Altman plots

On December 6, 2025January 31, 2026 By dchoyleIn Data Science, Data Science Notes, Data Visualization1 Comment

Introduction

Summary: If you are using a scatter plot to compare two datasets, rotate your data.

Three times in the last six months I’ve explained to different colleagues and former colleagues what a Bland-Altman (BA) plot is. Admittedly, the last of those explanations was because I remarked to a colleague that I’d been talking about BA-plots and they then wanted to know what they were.

BA-plots are a really simple idea. I like them because they highlight how a human’s ability to perceive patterns in data can be markedly affected by relatively small changes in how that data is presented; rotating the data in this case.

I also like them because they are from the statisticians Martin Bland and Doug Altman who produced a well-known series of short articles, “Statistics Notes”, in the BMJ in the 1990s. Each article focused on a simple, basic, but very important statistical concept. The series ran over nearly 70 articles and the idea was to explain to a medical audience about ‘statistical thinking’. You can find the articles at Martin Bland’s website here. Interestingly, BA-plots were not actually part of this series of BMJ articles as their work on BA-plots had been published in earlier articles. However, I’d still thoroughly recommend having a browse of the BMJ series.

Since I’ve had to explain BA-plots three times recently, I thought I’d give it another go in a blogpost. Also, inspired by the Bland-Altman series, I’m going to attempt a series of 10 or so short blogposts on simple, basic Data Science techniques and concepts that I find useful and/or interesting. The main criterion for inclusion in my series is whether I think I can explain it in a short post, not whether I think it is important.

What is a Bland-Altman plot?

BA-plots are used for comparing similar sets of data. The original use-case was to test how reproducible a process was. Take two samples of data that ideally you would want to be identical and compare them using a BA plot. This could be comparing clinical measurements made by two different clinicians across the same set of patients. What we want to know is how reproducible is a clinical measurement if made by two different clinicians.

Perhaps the first way of visually comparing two datasets on the same objects would be to just do a scatter plot – one dataset values on the x-axis, the other dataset values on the y-axis. I’ve got an example in the plot below. In fact, I’ve taken this data from Bland and Altman’s original 1986 Lancet paper. You can see the plotted points are pretty close to the 45-degree line (shown as a black dashed line), indicating the two datasets are measuring the same thing with some scatter, perhaps due to measurement error.

A scatter plot of peak expiratory flow rate data taken from the original Bland Altman paper in the Lancet from 1986. — Scatter plot of original Bland Altman PEFR data

Now, here’s the neat idea. I can do exactly the same plot, but I’m just going to rotate it clockwise by 45-degrees. A little bit of high-school/college linear algebra will convince you that I can do that by creating two new features,

$\frac{1}{\sqrt{2}} \left ( y + x \right )$
$\frac{1}{\sqrt{2}} \left ( y - x \right )$

Here $x$ and $y$ are our starting features or values from the two datsets we are comparing. Typically, the pre-factors of $\sqrt{2}$ are omitted and we simply define our new features as,

$A = \frac{1}{2} \left ( y + x \right )$
$M = \left ( y - x \right )$

Now we plot $M$ against $A$ . I’ve shown the new plot below.

Now a couple of things become clearer. Firstly, $A$ is the mean of $x$ and $y$ and so it gives us a better estimate of any common underlying value than just $x$ on its own or $y$ on its own. It gives us a good estimate of the size of the ‘thing’ we are interested in. Secondly, $M$ is the difference between $x$ and $y$ . $M$ tells us how different $x$ and $y$ are. Plotting $M$ against $A$ as I’ve done above shows me how reproducible the measurement is because I can easily see the scale of any discrepancies against the new vertical axis. I also get to see if there is any pattern in the level of discrepancy as the size of the ‘thing’ varies on the new horizontal axis. This was the original motivation for the Bland-Altman plot – to see the level of discrepancy between two sets of measurements as the true underlying value changes.

What the eye doesn’t see

What I really like about BA-plots though, is how much easier I find it to pick out if there is any systematic pattern to the differences between the two datasets. I haven’t looked into the psychological theory of visual perception, but it makes sense to me that humans would find it easier looking for differences as we move our eyes across one dimension – the horizontal axis – compared to moving our eyes across two dimensions – both the horizontal and vertical axes – when trying to scan the 45-degree line.

I first encountered BA-plots 25 years ago in the domain of microarray analysis. In that domain they were referred to as MA-plots (for obvious reasons). The choice of the symbols $M$ and $A$ also had a logic behind it. $M$ and $A$ are constructed as linear combinations of $x$ and $y$ , and in this case we “Add” them when constructing $A$ and “Minus” them when constructing $M$ . Hence the symbols $M$ and $A$ even tell you how to calculate the new features. You will also see BA-plots referred to Tukey mean-difference plots (again for obvious reasons).

In microarray analysis we were typically measuring the levels of mRNA gene expression in every gene in an organism across two different environmental conditions. We expected some genes to show differences in expression and so a few data points were expected to show deviations from zero on the vertical $M$ -axis. However, we didn’t expect broad systematic differences across all the genes, so we expected a horizontal data cloud on the MA-plot. Any broad systematic deviations from a horizontal data cloud were indicative of a systematic bias in the experimental set-up that needed to be corrected for. The MA plots gave an easy way to both visually detect any bias but also suggested an easy way to correct it. To correct it we just needed to fit a non-linear trendline through the data cloud, say using a non-parametric fit method like lowess. The vertical difference between a datapoint and the trendline was our estimate of the bias-corrected value of $M$ for that datapoint.

To illustrate this point I’ve constructed a synthetic example below. The left-hand plot shows the raw data in a standard scatterplot. The scatterplot suggests there is good agreement between the two samples – maybe a bit of disagreement but not much. However, when we look at the same data as a Bland-Altman plot (right-hand plot) we see a different picture. We can clearly see a systematic pattern to the discrepancy between the two samples. I’ve also estimated this systematic variation by fitting a non-linear trendline (in red) using the lowess function in the Python statsmodels package.

Two plots. The left hand plot shows a standard scatterplot, whilst the right-hand plot shows the corresponding Bland-Altman plot. — Scatterplot and Bland-Altman plot for the second example dataset.

Sometimes we may expect a global systematic shift between our paired data samples, i.e. a constant vertical shift on the $M$ axis. Or at least we can explain/interpret such a shift. Or there may be other patterns of shift we can comfortably interpret. This widens the applications domains we can use BA-plots for. In commercial Data Science I’ve seen BA-plots used to assess reproducibility of metrics on TV streaming advertising, and also calibration of transaction data across different supermarket stores. Next time you’re using a vanilla scatterplot to compare two data series, think about rotating and making a BA-plot.

All the code for the examples I’ve given in this post is in the Jupyter notebook DataScienceNotes1_BlandAltmanPlots.ipynb which can be found in the public GitHub repository https://github.com/dchoyle/datascience_notes. Feel free to clone the repository and play with the notebook. I’ll be adding to the repository as I add further “Data Science Notes” blogposts.

The need for simulation

On August 4, 2025October 7, 2025 By dchoyleIn Algorithms, Data Science, Machine LearningLeave a comment

TL;DR: Poor mathematical-based design and testing of models can lead to significant problems in production. Finding suitable ground truth data for testing of models can be difficult. Yet, many Data Science models make it into production without appropriate testing. In these circumstances testing with simulated data can be hugely valuable. In this post I explain why and how. In fact, I argue that testing with Data Science models should be non-negotiable.

Introduction

Imagine a scenario. You’re the manager of a Premier League soccer team. You wouldn’t sign a new striker without testing if they could actually kick a ball. Wouldn’t you?

In the bad old days before VAR it was not uncommon for a big centre-back to openly punch a striker in the face if the referee and assistant referees weren’t looking. Even today, just look at any top-flight soccer match and you’ll see the blatant holding and shirt-pulling that goes on. Real-world soccer matches are dirty. A successful striker has to deal with all these realities of the game, whilst also being able to kick the ball in the net. At the very least when signing a new striker you’d want to test whether they could score under ideal benign conditions. Wouldn’t you? You’d put the ball on the penalty spot, with an open goal, and see if your new striker could score. Wouldn’t you? Passing this test, wouldn’t tell you that your striker will perform well in a real game, but if they fail this “ideal conditions” test it will tell you that they won’t perform well in real circumstances. I call this the “Harry Redknapp test” – some readers will understand the reference¹. If you don’t then read the footnote for an explanation.

How is this relevant to Data Science? One of the things I routinely do when implementing an algorithm is to test that implementation on simulated data. However, a common reaction I get from other Data Scientists is, “oh I don’t test on simulated data, it’s not real data. It’s not useful. It doesn’t tell you anything.” Oh yes it does! It tells you whether the algorithm you’ve implemented is accurate under the ideal conditions it was designed for. If your implementation performs badly on simulated data, you have a big problem! Your algorithm or your implementation of it has failed the “Harry Redknapp test”.

“Yeah, but I will have some ground-truth data I can test my implementation on instead, so I don’t need simulated data.” Not always. Are you 100% sure that that ground-truth data is correct? And what if you’re working on an unsupervised problem.

“Ok, but the chances of an algorithm implemented by experienced Data Scientists making it into production untested and with really bad performance characteristics is small”. Really!? I know of at least one implemented algorithm in production at a large organization that is actually an inconsistent estimator. An inconsistent estimator is one of the biggest sins an algorithm can commit. It means that even as we give the algorithm more and more ideal training data, it doesn’t produce the correct answer. It fails the “Harry Redknapp test”. I won’t name the organization in order to protect the guilty. I’ll explain more about inconsistent estimators later on.

So maybe I convinced you that simulated data can be useful. But what can it give you, what can’t it give you, and how do you go about it?”

What simulation will give you and what it won’t

To begin, we need to highlight some general but very important points about using simulated data:

Because we want to want to generate data, we need a model of the data generation process, i.e. we need a generative model².
Because we want to mimic the stochastic nature of real data, our generative model of the data will be a probabilistic one.
Because we are generating data from a model, what we can test are algorithms and processes that use that data, e.g. a parameter estimation process. We cannot test the model itself. Our conclusions are conditional on the model form being appropriate.

With those general points emphasized, let’s look in detail what we can get testing with simulated data.

What simulated data will give you

We can get a great deal from simulated data. As we said above, what we get is insight into the performance of algorithms that process the data, such as the parameter estimation process. Specifically, we can check whether our parameter estimation algorithm is, under ideal conditions,

Consistent
Biased
Efficient
Robust

I’ll explain each of these in detail below. We can also get insight into how fast our parameter estimation process runs or how much storage it requires. Running tests using simulated data can be extremely useful.

Consistency check

As a Data Scientist you’ll be familiar with the idea that if we have only a small amount of training data our parameter estimates for our trained model will not be accurate. However, if we have a lot of training data that matches the assumptions on which our parameter estimation algorithm is based, then we expect the trained parameter estimates to be close to their true values, i.e. close to the values which generated the data. As we increase the amount of training data, we expect our parameters estimates to get more and more accurate, converging ultimately to the true values in the limit of an infinite amount of training data. This is consistency.

In statistics, a formula or algorithm for estimating the parameters of a model is called an estimator. There can be multiple different estimators for the same model, some better than others. A consistent estimator is one whose expected value converges to the true value in the limit of an infinite amount of training data. An inconsistent estimator is one whose expected value doesn’t converge to the true value in the limit of an infinite amount of training data. Think about that for a moment,

An inconsistent estimator is an algorithm that doesn’t get better even when we give it a load more training data.

That is a bad algorithm! That is why I say constructing an inconsistent estimator is one of the worst sins a Data Scientist can commit. Very occasionally (rarely), an inconsistent estimator is constructed because it has other useful properties. But in general, it you encounter an inconsistent estimator you should take it as a sign of incompetence on the part of the Data Scientist who constructed it.

“Okay, okay, I get it. Inconsistent estimators are bad. But I don’t have an infinite amount of training data, so how can I actually check if my algorithm produces a consistent estimator? Surely, it can’t be done?” Yes, it can be done. What we’re looking for is convergence, i.e. parameter estimates getting closer and closer to the true values as we increase the training set size. I’ll give a demonstration of this in the next section when I show how to set up some simulation tests.

Bias check

Along with the concept of consistency comes the concept of bias. We said that a consistent estimator was one whose expectation value converges to the true value in the limit of an infinite amount of training data. However, that doesn’t mean a consistent estimator has an expectation value that is equal to the true value for a finite amount of training data. It is possible to have a consistent estimator that is biased. This means the estimator, on average, will differ from the true value when we use a finite amount of training data. For a consistent estimator, if it is biased that bias will disappear as we continually increase the amount of training data.

As you might have guessed, the best algorithms produce estimators that are consistent and unbiased. Knowing if your estimator is biased and by how much is extremely useful. Again, we can assess bias using simulated data, and I’ll show how to do this in the next section when I show how to set up some simulation tests.

Efficiency check

So far, we have spoken about the expectation or average properties of an algorithm/estimator. But what about its variance. It is all very well telling me that across lots of different instances of training datasets my algorithm would, on average get the right answer, or near the right answer, under ideal conditions, but in the real world I have only one training dataset. Am I going to be lucky and my particular training data will give parameter estimates close to the average behaviour of the algorithm? I’ll never know. But what I can know is how variable the parameter estimates from my algorithm are. I can do this by calculating the variance of the parameter estimates over lots of training datasets. A small variance will tell me that my one real-world dataset is likely to have performance close to the mean behaviour of the algorithm. I may still be unlucky with my particular training data and the parameter estimates are a long way from the average estimates, but it is unlikely. However, a large variance tells me that parameter estimates obtained from a single training dataset will often be a long way from the average estimates.

How can I calculate this variance of parameter estimates over training datasets? Simple, get lots of different training datasets produced under identical controlled conditions. How could I do that? Yep, you guessed it. Simulation. With a simulation process coded up, we can easily generate multiple instances of training datasets of the same size and generated under identical conditions. Again, I’ll demonstrate this in the next section.

Sensitivity check – robustness to contamination

Our message about simulated data is that it allows you to test your algorithm under conditions that match the assumptions made by the algorithm, i.e. under ideal conditions. But you can use simulation to test how well your algorithm performs in non-ideal conditions. We can also introduce contamination into the simulated data, for example drawing some response variable values from a non-Gaussian distribution if our algorithm has assumed the response variable is purely Gaussian distributed. We can produce multiple simulated datsets with different percentages of contamination and so test how sensitive or robust our estimation algorithm is to the level of contamination, i.e. how sensitive it is to non-ideal data.

In the first few pages of the first chapter of his classic textbook on Robust Statistics, Peter Huber describes analysis of an experiment originally due to John Tukey. The analysis reveals that even having just 2% of “bad” datapoints being drawn from a different Gaussian distribution (with a 3-fold larger standard deviation) is enough to markedly change the properties and efficiency of common statistical estimators. And yet, defining “bad” data as being drawn from a larger variance Gaussian is wonderfully simplistic. Real-world data is so much nastier.

What form should the data contamination take? There are multiple ways in which data can become contaminated. There can be changes in statistical properties, like the simple example we used above, or drift in statistical properties such as a non-stationary mean or a non-stationary variance. But you can get more complicated errors creeping into your data. These typically take two forms,

Human induced data contamination: These can be misspelling or mis-(en)coding errors that result from not using controlled and validated vocabularies for human data-entry tasks. You’ll recognize these sorts of errors when you see multiple different variants for the name of the same country, US county or UK city, say. You might think it is difficult to simulate such errors, but there are some excellent packages to do so – checkout the messy R package produced by Dr. Nicola Rennie that allows you to take a clean dataset and introduce these sorts of encoding errors into it. Spotting these errors can be as simple as plotting distributions of unique values in a table column, i.e. looking for unusual distributions. In R there are a number of packages to help you do this.
Machine induced errors: These are errors that arise from the processing or transferring of data. These can be as simple as incorrect datetime stamps on rows in a database table, or can be as complex as repeating blocks of rows in a table. These errors are less about contamination and more about alteration. The common element here is that there is a pattern to how the data has become altered or modified and so spotting the errors involves visual inspection of the individual rows of the table, combined with plotting lagged or offset data values. The machine induced errors arise because of bugs in processing code, and these can be either coding errors, e.g. a typo in the code, or unintended behaviour, e.g. datetime processing code that hasn’t been designed properly to correctly handle daylight saving switchovers.

What kind of data contamination should I simulate? This is a “how long is a piece of string” kind of question. It very much depends on what aspect of your algorithm or implementation you want to test for robustness, and only you can know that. You may have to write some bespoke code to simulate the sorts of errors that arise in the processes you use or are exposed to. Broadly speaking, robustness of an estimator will be tested by changes in the statistical properties of the input data and these can be simulated by changes in the distributions of data due to data drift or human-induced contamination, whilst machine-induced errors imply you have some sort of deployed pipeline and so simulating machine corrupted data is best when you want to stress-test your end-to-end pipeline.

Runtime scaling

There are also checks that simulated data allows you to perform that aren’t necessarily directly connected to the accuracy or efficiency of the parameter estimates. Because we can produce as much simulated data as we want, we can easily test how long our estimation algorithm takes for different sized datasets. Similarly, we can also use simulated data to test the memory and storage requirements of the algorithm.

We can continue this theme. Because we can tune and tweak the generation of the simulated data, this can also allow us to generate data to test very specific scenarios – corner cases – for which we don’t have real test data. The ability to generate simulated data increases the test coverage we can perform.

What simulated data won’t give you

Identify model mis-specification

Using simulated data will tell you how well your model training algorithm performs on data that matches precisely the form of the model you have used. It won’t tell you if your model form is correct or appropriate for the real data you will ultimately apply it to. It won’t tell you if you’ve omitted an important feature or if you’ve put non-linearity into your model in an incorrect way. Getting the model form right can only come from i) domain expertise, ii) testing on real ground-truth data. Again, what this highlights is that we use simulated data to test the training process, not the model.

This can trip up even experience researchers. I recently saw a talk from an academic researcher who tested two different model forms using simulated data generated from one of the models. When the model form used to generate the data fitted the simulation data better, they confidently claimed that this model was better and more correct. Well, of course it was for this simulated data!

Accuracy of your model on real data

For simulated data we have the ground-truth values of the response variable so we can assess the prediction accuarcy, either on training data or on holdout test data. However, unless our simulation process produced very realistic data, including the various contamination processes, the test set accuracy on simulated data cannot be used as a precise measure of the predictive accuracy of the trained model on real unseen data.

How to simulate

When producing simulated data for testing an algorithm related to a model there are two things we need to generate – the features and the response. There are two ways we can approach this,

Simulating the features and then simulating the response given the feature values we just produced.
Simulate just the response value given some pre-existing feature values.

Of these, 2 sounds easier, but I will discuss 1 first as it leads us naturally into discussing where we might get pre-existing feature values from.

Simulating features and response

As we said above, in this approach we simulate the features first, and this allows us to construct the distribution of the response variable conditional on the features. We can then sample a value from that conditional distribution. Our basic recipe is

Sample the feature values from a distribution.
Use the sampled feature values and the model form to construct the distribution of the response variable conditional on the features.
Sample the response variable from the conditional distribution constructed in 2.

How complex we want to make the feature distribution depends on how realistic we need our features to be and what aspect of the estimation/training algorithm we are wanting to test.

For real-world problems, it is unlikely that the features follow a Gaussian distribution. Take demand modelling, an area I have worked in a lot. The main feature we use is the price of the product whose demand we are trying to predict. Prices are definitely not Gaussian distributed. Retailers repeatedly switch between a regular and promotional price over a long period of time, so that we have a sample distribution of prices that is represented by two Dirac-delta functions. A more interesting price time series may introduce a few more price points, but it is still definitely not Gaussian. Similarly, real data has correlations between features.

When simulating a feature, we have to decide how important the real distribution is to the aspect of the estimation/training algorithm that we want to test. If we want to simulate with realistically distributed features. this can be problematic. We’ll return to this issue and real data later on, but for now we emphasize tha we can still test whether our estimator is consistent or assess its bias using features drawn from independent Gaussian distributions. So there are still useful tests of our estimation algorithm we can carry out. Let’s see how we can do that.

Linear model example

We’ll use a simple linear model that depends on three features, $x_{1}, x_{2}, x_{3}$ . The response variable $y$ is given by,

$y\; =\;\beta_{1} x_{1}\;+\; \beta_{2}x_{2}\;+\;\beta_{3}x_{3} \;+\;\epsilon\;\;\;\;,\;\; \epsilon\;\sim\; {\cal{N}}\left ( 0, \sigma^{2}_{\epsilon}\right )$

From which you can see both the linear dependence on the features and that $y$ contains Gaussian additive noise $\epsilon$ .

Simulating data is now easy once we have the structure of our probabilistic model. Given a user-specified mean $\mu_{1}$ and variance $\sigma^{2}_{1}$ we can easily sample a value for $x_{1}$ from ${\cal{N}}\left ( \mu_{1}, \sigma^{2}_{1}\right )$ . Similarly, given user-specified means $\mu_{2}, \mu_{3}$ and variances $\sigma^{2}_{2}, \sigma^{2}_{3}$ , we can generate values for $x_{2}$ and $x_{3}$ . If we have user-specified values of $\beta_{1}, \beta_{2}, \beta_{3}$ we can then easily generate a value for $y$ by sampling from ${\cal{N}}\left ( \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}, \sigma^{2}_{\epsilon} \right )$ , where $\sigma^{2}_{\epsilon}$ is the variance of the additive noise that we want to add to our response variable. To simulate $N$ datapoints we repeat that recipe $N$ times. Let’s apply that recipe to assess an estimator of the model parameters $\beta_{1}, \beta_{2}, \beta_{3}$ . We’ll assess the standard Ordinary Least Squares (OLS) estimator for a linear model.

Assessing the OLS Estimator for a linear model

Given a feature matrix $\underline{\underline{X}}$ (the i^th row of the matrix is the feature values for the i^th observation) and vector $\underline y = \left ( y_{1}, y_{2},\ldots,y_{N}\right )$ that represents the $N$ observations of the response variable, then the Ordinary Least Squares (OLS) estimator $\hat{\beta}$ of the true model parameters $\underline{\beta} = \left ( \beta_{1}, \beta_{2}, \beta_{3}\right )$ is given by the formula,

$\underline{\hat{\beta}}\;=\; \left ( \underline{\underline{X}}^{\top}\, \underline{\underline{X}}\right ) ^{-1} \underline{\underline{X}}^{\top} \underline{y}\;\;\;\;\;\;\;{\rm Eq.1}$

Note that the OLS estimator is a linear combination of the observations $y_{1}, y_{2}, \ldots, y_{N}$ , with a weight matrix $\left ( \underline{\underline{X}}^{\top}\, \underline{\underline{X}}\right )^{-1} \underline{\underline{X}}^{\top}$ . We’ll come back to this point in a moment.

What we want to know is how close is the estimate $\underline{\hat{\beta}}$ to $\underline{\beta}$ . Is the OLS estimator in the Eq.1 above a biased estimator of $\underline{\beta}$ , and is it a consistent estimator?

The plots below show the bias (mean error) for each of the model parameters, plotted against training dataset size $N$ . I constructed the plots by initializing a true model parameter vector $\underline{\beta}$ and then generating 1000 simulated training datasets for each of the different values of $N$ . For each simulated training dataset I computed the OLS parameter estimate $\hat{\underline{\beta}}$ and then computed the parameter estimate errors $\hat{\underline{\beta}} - \underline{\beta}$ . From the errors I then calculated their sample means and variances (over the simulations) for each value of $N$ .

You can see from the plots that whilst the mean error fluctuates it doesn’t systematically change with $N$ . Furthermore, it fluctuates around zero, suggesting that the OLS estimator is unbiased. And indeed it is. It is possible to mathematically show that the OLS estimator is unbiased at any finite value of $N$ . The reason we get a non-zero value in this case is because we have estimated $\mathbb{E}\left ( \hat{\beta}_{i}\right )$ using a sample average taken over 1000 simulated datasets. If we had used a larger number of simulated datasets we would have got even smaller sample average parameter errors.

Contrast this behaviour with how the variances of the parameter estimate errors change with $N$ in the plots below.

The decrease, with $N$ , in the variance of $\hat{\underline{\beta}} - \underline{\beta}$ is marked. In fact, in looks like a power-law decrease, so I have plotted the same data on a log-scale below,

We can see from those log-log plots that the variances of $\hat{\beta}_{i} - \beta_{i},\; i=1,2,3$ decrease as $N^{-1}$ . That implies that as we use larger and larger training sets any single instance of $\hat{\underline{\beta}}$ will get closer and closer to $\underline{\beta}$ . At large $N$ we have a low probability of being unlucky and our particular training set giving a poor estimate of $\underline{\beta}$ .

How efficient is the OLS estimator in Eq.1? Is the rate at which ${\rm Var}\left ( \hat{\beta}_{i} - \beta_{i}\right )$ decreases with $N$ good or bad? It turns out that the OLS estimator in Eq.1 is the Best Linear Unbiased Estimator (BLUE). For an unbiased estimator of $\underline{\beta}$ that is constructed as a linear combination of the observations $\underline{y}$ , you cannot do better than the OLS estimator in Eq. 1.

All the code for the linear model example is available in the Jupyter notebook NeedForSimulation_Blogpost.ipynb in the GitHub repository https://github.com/dchoyle/simulation_blogpost.

A linear model is relatively simple structure but the example was a good demonstration of the power of simulated data. Next, we’ll use a more complex model architecture and build a feed-forward neural network.

Neural network example

Our simulated neural network output has the form,

$y\;=\; f\left( \underline{x}| \underline{\theta} \right ) \;+\; \epsilon$

Again, we’ll use zero-mean Gaussian additive noise, $\epsilon \sim {\cal{N}}\left (0, \sigma^{2}_{\epsilon}\right )$ .

The function $f\left( \underline{x}| \underline{\theta} \right )$ represents our neural network function, with $\underline{x}$ being the vector of input features and $\underline{\theta}$ being a vector holding all the network parameters. For this demo, I’m going to use a 3 input-node, 2 hidden-layer feed-forward network, with 10 nodes in each of the hidden layers. The output layer consists of a single node, representing the variable $y$ . For the non-linear transfer (activation) functions I’m going to use $\tanh$ functions. So, schematically, my networks looks like the figure below,

I’m going to use a teacher network of the form above to generate simulated data, which I’ll then use to train a student network of the same form. What I want to test is, does my training process produce a trained student network whose predictions on a test set get more and more accurate as I increase the amount of training data? If not, I have a problem. If my training process doesn’t produce accurate trained networks on ideal data, the training process isn’t going to produce accurate networks when using real data. I’m less interested in comparing trained student network parameters to the teacher network parameters as, a) there are a lot of them to compare, b) since the output of a network is invariant to within-layer permutation of the hidden layer node labels and connections, defining a one-to-one comparison of network parameters is not straight forward here. Node 1 in the first hidden layer of the student isn’t necessarily equivalent to node 1 in the first hidden layer of the teacher network, and so on.

The details of how I’ve coded up the networks and set-up the evaluation are lengthy, so I’ll just show the final result here. All the details can be found in the Jupyter notebook NeedForSimulation_Blogpost.ipynb in the freely accesible github repository.

Below in left-hand plot I’ve plotted the average Mean Square Error (MSE) made by the trained student network on the test-sets. I’ve plotted the average MSE against the training dataset size. The average MSE is the average over the simulations of that training set size. For comparison, I have also calculated the average test-set MSE of the teacher network. Since the test-set data contains additive Gaussian noise, the teacher network won’t make perfect predictions on the test-set data even though the teacher network generated the systematic part of the test-set response values. The average test-set MSE of the teacher network provides a benchmark or baseline against which we can asses the trained student network. We have a ready intuition about the relative test-set MSE value. We expect the relative test-set MSE to be significantly above 1 at small values of $N$ , as the student network struggles to learn the teacher network output. As the amount of training data $N$ increases we expect the relative test-set MSE value to approach 1 from above. The average relative test-set error is plotted in the right-hand plot below.

We can see from both plots above that the prediction accuracy of a trained student network typically decreases with increasing amount of training data. My network training process has passed this basic test. The test was quick to set up and gives me confidence I can run my code over real data.

Sampling features from more realistic distributions

In our previous examples we have used independent features, sampled from simple but naive distributions, to test the convergence properties of an estimator. But what happens if you want to assess the quantitative performance of an estimator for more realistic feature patterns? Well, we use more realistic feature patterns. This is a variant of our previous basic recipe, but where we have access to a real dataset. The modified recipe is,

Sample an observation from the real dataset and keep the features.
Use the sampled feature values and the model form to construct the distribution of the response variable conditional on the features.
Sample the response variable from the conditional distribution constructed in 2.

This seems like a small modification of the recipe. However, it does have some big implications. We can’t generate simulated datasets of arbitrarily large size as we are limited by the size of the real dataset. We can obviously generate simulated datasets of smaller size than the real data, but this can make testing of the convergence properties of an estimator difficult.

That said, this is one of my faviourite approaches. Often, steps 2 and 3 are easy to implement. You’ll have a function for the conditional mean of the response variable already coded up for prediction purpose, so it is just a question of pushing some feature values through that code. I find the overhead of writing extra functions to simulate realistic looking feature values is significant, both in terms of time and thinking about what ‘realistic’ should look like. The recipe above gets round this easily. Simply pick a row from your existing real dataset at random and there you go, you have some realistic feature values. As before, the recipe allows me to then generate response values with known ground-truth parameters values. So overall I can compare parameter estimates to ground truth parameter values on realistic feature values, allowing me to check that my estimation algorithm is at least semi-accurate on realistic feature values. You can also choose in step 1 of the recipe, whether you want to sample a row of feature values with or without replacement.

Simulating the response only

You could argue that simulating response values with feature values sampled from an existing real dataset is an example of just simulating the response. After all, only the response value is computer generated. I still tend to think of it as simulating the features because, i) I am still sampling the features from a distribution function, the empirical distribution function in this case, and ii) I have broken some of the link between the features and the response in the real data because I have sampled the features values separately. However, sometimes we want to keep as much as of the links between features and response values in the real data as possible. We can do this by only making additions to the real data. By necessity this means only adding to the response value. This may sound very restrictive, but in fact there are many situations where this is precisely the kind of data we need to test an estimation algorithm. For example, changepoint detection or unconditional A/B testing. In these situations we take the real data, identify the split point where we want to increase the response value (the changepoint or the A/B grouping) and simply increase the response. Hey presto, we have realistic data with a guaranteed increase in the response variable at a know location. By changing the level of increase in the response variable we can use this approach to assess the statistical power of the changepoint or A/B test algorithm.

The plots below show an example of introducing a simple shift in level at timepoint 53 into a real dataset. We have only shown the process as a simple schematic, but coding it up yourself is only a matter of a line or two of code, so I haven’t given any code details.

In the above example I simply increased the response variable, by the same amount (8.285 in this case), at and after timepoint 53. If instead, you only want to increase the average value of the response variable, it is a simple modification of the process to include some additional zero-mean noise after the changepoint location.

Conclusions

Simulated data is extremely useful. It can give you lots of insight into the performance of your training/estimation algorithm (including bug detection). Its main advantages are it is,

Easy to produce in large volumes.
Can be produced in a user-controlled way.
Gives you ground-truth values.
Gives you a way to assess the performance of your training algorithm when you have no real ground-truth data.
Stops you releasing a poor untested training algorithm into production.

If you don’t want to sign an absolutely useless striker for your data science model team, test with simulated data at the very minimum.

Footnotes

Harry Redknapp is a former English Premier League football manager. Whilst Redknapp was manager of Tottenham Hotspur he had a reputation of being willing to sign players on the flimiest of evidence of footballing skills. At a time when there was a large influx of overseas players into the Premier league, due to their reputation for superior technical football skills, it was joked that he would sign a player simply because of how their name sounded and without any checks on the player at all.
The term generative model preceeds its useage in Generative AI. Broadly speaking, a generative model is a machine learning model that learns the underlying probability distribution of the data and can generate new, similar data instances. The useage of the term was popular around the early 2000’s, particularly when discussing different forms of classifiers, which were described as either being generative or discriminative.

A book on language models and a paper on the mathematics of transformers

Quick introductions to the maths of transformers

On May 26, 2025June 21, 2025 By dchoyleIn AI, Algorithms, Artificial Intelligence, Data Science, Deep Learning, Machine LearningLeave a comment

TL;DR: Having a high-level understanding of the mathematics of transformers is important for any Data Scientist. The two sources I recommend below are excellent short introductions to the maths of transformers and modern language models.

A colleague asked me, about two months back, if I could recommend any articles on the mathematics of Large Language Models (LLMs). They then clarified that they meant transformers, as they were primarily interested in the algorithms on which LLM apps are based. Yes, they’d skim read the original “Attention Is All You Need” paper from Vaswani et al, but they done so just after the paper came out in 2017. They were looking to get back up to date with LLMs and even revisit the original Vaswani paper. Firstly, they wanted an accessible explanation which they could use to construct a high-level mental model of how transformers worked, the idea being that the high-level mental model would serve as a construct on which to hang and compartmentalize the many new concepts and advances that had happened since the Vaswani paper. Secondly, my colleague is very mathematically able, so they were looking for mathematical detail, but the right mathematical detail, and in a relatively short read.

I’ve listed below the recommendations I gave to my colleague because I think they are good recommendations (and I explain why below as well). I also believe it is important for all Data Scientists to have at least a high-level understanding of how transformers, and the LLMs which are built on them, work – again I explain why, below.

The recommendations

What I recommended was one paper and one book. The article is free to access, and the book has a “set your own price” option for access to the electronic version of the book.

The article is, “An Introduction to Transformers” by Richard Turner from the Dept. of Engineering at the University of Cambridge and Microsoft Research in Cambridge (UK). This arXiv paper can be found here. The paper focuses on how transformers work but not on training them. That way the reader focuses on the structure of the transformers without getting lost in the details of the arcane and dark art of training transformers. This is why I like this paper. It gives you an overview of what transformers are and how they work without getting into the necessary but separate nitty-gritty of how you get them to work. To read the paper does require some prior knowledge of mathematics but the level is not that high – see the last line of the abstract of the paper. The whole paper is only six pages long, making it a very succinct explanation of transformer maths that you can consume in one sitting.
The book is “The Hundred-Page Language Models Book” by Andriy Burkov. This is the latest in the series of books from Burkov, that include “The Hundred-Page Machine Learning Book” and “Machine Learning Engineering”. I have a copy of the hundred-page machine learning book and I think it is ok, but I prefer the LLMs book. I think part of the reason for this is that, like everybody else I have been only been using and playing with LLMs for the last three years or so, whilst I have been doing Data Science for a lot longer – I have been doing some form of mathematical or statistical modelling for over 30years – and so I didn’t really learn anything new from the machine learning book. In contrast, I learnt a lot from the book on LLMs. The whole book works through simple examples, both in code (Python) and in terms of the maths. I semi-skim read the book in two sittings. The code examples I skipped, not because they were simplistic but because I wanted to digest the theory and algorithm explanations end-to-end first and then return to trying the code examples at a later date. Overall, the book is packed with useful nuggets. It is a longer read than the Turner paper, but can still easily be consumed in a day if you skip bits. The book assumes less prior mathematical knowledge than the Turner paper and explains the new bits of maths it introduces, but given the whirlwind nature of a 100-page introduction to LLMs I would still recommend you have some basic familiarity with linear algebra, statistics & probability, and machine learning concepts.

Why learn the mathematics of transformers?

Having to think about which short articles I would recommend on the maths of transformers and LLMs made me think more broadly about whether there is any benefit from having a high-level understanding of transformer maths. My colleague was approaching it out of curiosity, and I knew that. They simply wanted to learn, not because they had to, nor because they thought that understanding the mathematical basis of transformers was the way to approach using LLMs as a tool.

However, given the exorbitant financial cost of building foundation models and the need to master a vast amount of engineering detail, most people won’t be building their own foundation models. Instead they will be using 3^rd party models simply as a tool and focusing on developing skills and familiarity in prompting them. So, are there any benefits then to understanding the maths behind LLMs? In other words, could I honestly recommend the two sources listed above to anybody else other than my colleague who was interested mainly out of curiousity?

The benefits of learning the maths of transformers and the risks of not doing so

The answer to the question above, in my opinion, is yes. But you probably could have guessed that from the fact I’ve written this post. So, what do I think are the benefits to a Data Scientist in having a high-level understanding of the mathematics of transformers? And equally important, what are the downsides and risks of not having that high-level understanding?

Having even a high-level understanding of the maths behind transformers de-mystifies LLMs since it forces you to focus on what is inside LLMs. Without this understanding you risk putting an unnecessary veneer of complexity or mysticism on top of LLMs, a veneer that prevents you using LLMs effectively.
You will understand why LLMs hallucinate. You will understand that LLMs build a model of the high-dimensional conditional probability distribution of the next token given the preceding context. And that distribution can have a large dispersion if the the training data is limited in the high-dimensional region that corresponds to the current context. That large dispersion results in the sampled next token having a high probability of being inappropriate. If you understand what LLMs are modelling and how they model it, hallucinations will not be a surprise to you (they may still be annoying) and you will understand strategies to mitigate them. If you don’t understand how LLMs are modelling the conditional probability of the next token, you will always be surprised, annoyed, and impacted by LLM hallucinations.
It helps you understand where LLMs excel and where they don’t because you have a grounded understanding of their strengths and weaknesses. This makes it easier to identify potential applications of LLMs. The downside? Not having a fundamental understanding of the strengths and weaknesses of the algorithms behind LLMs risks you building LLM-based applications that were doomed to failure from the start because they have been mis-matched to the capabilities of LLMs.
By having a high-level mental model of transformers on which to hang later advances in LLMs, you can more easily identify what is important and relevant (or not) in any new advance. The downside to not having this well-founded mental-model is that you get blown about by the winds of over-hyped LLM announcements from companies stating that their new tool or app is a “paradigm shift”, and consequently you waste time getting into the detail of what are trivial or inconsequential improvements.

What to do?

What should you do if you are a Data Scientist and I have managed to convince you that having a high-level understanding of the mathematics of transformers is important? Simple, access the two sources I’ve recommended above. Happy reading.

Comparison of Benford's Law and the proportion of first digits from file sizes of files on my laptop hard drive.

A Christmas Cracker Puzzle – Part 2

On January 19, 2025May 26, 2025 By dchoyleIn Data Science, Fun Mathematics, Miscellaneous, Scientific computation1 Comment

Before Christmas I set a little puzzle. The challenge was to calculate the proportion of file sizes on your hard drive that start with the digit 1. I predicted that the proportion you got was around 30%. I’ll now explain why.

Benford’s Law

The reason why around 30% of all the file sizes on your hard disk start with 1 is because of Benford’s Law. Computer file sizes approximately follow Benford’s Law.

What is Benford’s Law?

Benford’s Law says that for many datasets the first digits of the numbers in the dataset follow a particular distribution. Under Benford’s Law, the probability of the first digit being equal to d is,

$\log_{10} ( 1 + \frac{1}{d} )\;\;\;.\;\;\;\;\;\;$ Eq.1

So, in a dataset that follows Benford’s Law, the probability that a number starts with a 1 is around 30%. Hence, the percentage of file sizes that start with 1 is around 30%.

The figure below shows a comparison of the distribution in Eq.1 and the distribution of first digits of file sizes for files in the “Documents” folder of my hard drive. The code I used to calculate the empirical distribution in the figure is given at the end of this post. You can see the distribution derived from my files is in very close agreement with the distribution predicted by Eq.1. The agreement between the two distributions in the figure is close but not perfect – more on that later.

Benford’s Law is named after Frank Benford, whose discovered the law in 1938 – see the later section for some of the long history of Benford’s Law. Because Benford’s Law is concerned with the distribution of first digits in a dataset, it is also commonly referred to as, ‘the first digit law’ and also the ‘significant digit law’.

Benford’s Law is more than just a mathematical curiosity or Christmas cracker puzzle. It has some genuine applications – see later. It has also fascinated mathematicians and statisticians because it applies to so many diverse datasets. Benford’s Law has been shown to apply to datasets as different as the size of rivers and election results.

What’s behind Benford’s Law

The intuition behind Benford’s Law is that if we think there are no a priori constraints on what value a number can take then we can make the following statements,

We expect the numbers to be uniformly distributed in some sense – more on that in a moment.
There is no a priori scale associated with the distribution of numbers. So, I should be able to re-scale all my numbers and have a distribution with the same properties.

Overall, this means we expect the numbers to be uniformly distributed on a log scale.

This intuition helps us identify when we should expect a dataset to follow Benford’s Law, and it also gives us a hand-waving way of deriving the form of Benford’s Law in Eq.1.

Deriving Benford’s Law

First, we restrict our numbers to be positive and derive Benford’s Law. The fact that Benford’s Law would also apply to negative numbers once we ignore their sign should be clear.

We’ll also have to restrict our positive numbers to lying in some range $[\frac{1}{x_{max}}, x_{max}]$ so that the probability distribution of $x$ is properly normalized. We’ll then take the limit $x_{max}\rightarrow\infty$ at the end. For convenience, well take $x_{max}$ to be of the form $x_{max} = 10^{k_{max}}$ , and so the limit $x_{max}\rightarrow\infty$ corresponds to the limit $k_{max}\rightarrow\infty$ .

Now, if our number $x$ lies in $[\frac{1}{x_{max}}, x_{max}]$ and is uniformly distributed on a log-scale, then the probability density for $x$ is,

$p(x) = \frac{1}{2\ln x_{max}} \frac{1}{x}\;\;.$

The probability of getting a number between, say 1 and 2 is then,

${\rm{Prob}} \left ( 1 \le x < 2 \right ) = \frac{1}{2\ln x_{max}} \int_{1}^{2} \frac{1}{x} dx \;\;.$

Now numbers which start with a digit $d$ are of the form $a10^{k}$ with $a \in [d, d+1)$ and $k = -k_{max},\ldots,-2,-1,0,1,2,\ldots,k_{max}$ . So, the total probability of getting such a number is,

${\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \sum_{k=-k_{max}}^{k_{max}}\frac{1}{2\ln x_{max}}\int_{d10^{k}}^{(d+1)10^{k}} \frac{1}{x} dx\;\; ,$

and so after performing the integration and summation we have,

${\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \frac{2k_{max}}{2\ln x_{max}} \left [ \ln ( d+1 ) - \ln d\right ]\;\;.$

Recalling that we have chosen $x_{max} =10^{k_{max}}$ , we get,

${\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \frac{1}{\ln 10} \left [ \ln ( d+1 ) - \ln d\right ]\;=\;\log_{10} \left ( 1 + \frac{1}{d}\right )\;\;.$

Finally, taking the limit $k_{max}\rightarrow\infty$ gives us Benford’s Law for positive valued numbers which are uniformly distributed on a log scale.

Ted Hill’s proof of Benford’s Law

The derivation above is a very loose (lacking formal rigour) derivation of Benford’s Law. Many mathematicians have attempted to construct a broadly applicable and rigorous proof of Benford’s Law, but it was not until 1995 that a widely accepted proof was derived by Ted Hill from Georgia Institute of Technology. You can find Hill’s proof here. Ted Hill’s proof also seemed to reinvigorate interest in Benford’s Law. In the following years there were various popular science articules on Benford’s Law, such as this one from 1999 by Robert Matthews in The New Scientist. This article was where I first learned about Benford’s Law.

Benford’s Law is base invariant

The approximate justification we gave above for why Benford’s Law works made no explicit reference to the fact that we were working with numbers expressed in base 10. Consequently, that justification would be equally valid if we were working with numbers expressed in base $b$ . This means that Benford’s Law is base invariant, and a similar derivation can be made for base $b$ .

The distribution of first digits given in Eq.1 is for numbers expressed in base 10. If we express those same numbers in base $b$ and write a number $x$ in base $b$ as,

$x = x_{1}x_{2}x_{3}\ldots x_{K}\;\;,$

then the first digit $x_{1}$ is in the set $\{1,2,3,\ldots,b-1\}$ and the probability that the first digit has value $d$ is given by,

${\rm{Prob}} \left ( x_{1}\;=\;d \right ) = \log_{b}\left ( 1 + \frac{1}{d}\right )\;\;\;,\; d\in\{1,2,3,\ldots,b-1\}\;\;\;.\;\;\;\;\;$ Eq.2

When should a dataset follow Benford’s Law?

The broad intuition behind Benford’s Law that we outlined above also gives us an intuition about when we should expect Benford’s Law to apply. If we believe the process generating our data is not restricted in the scale of the values that can be produced and there are no particular values that are preferred, then a large dataset drawn from that process will be well approximated by Benford’s Law. These considerations apply to many different types of data generating processes, and so with hindsight it should not come as a surprise to us that many different datasets appear to follow Benford’s Law closely.

This requires that no scale emerges naturally from the data generating process. This means the data values shouldn’t be clustered around a particular value, or particular values. So, data drawn from a Gaussian distribution would not conform to Benford’s law. From the perspective of the underlying processes that produce the data, this means there shouldn’t be any equilibrium process at work, as that would drive the measured data values to the value corresponding to the equilibrium state of the system. Likewise, there should be no constraints on the system generating the data, as the constraints will drive the data towards a specific range of values.

However, no real system is truly scale-free. The finite system size always imposes a scale. However, if we have data that can vary over many orders of magnitude then from a practical point of view, we can regard that data as effectively scale free. I have found that data that varies over 5 orders of magnitude is usually well approximated by Benford’s law.

Because a real-world system cannot really ever truly satisfy the conditions for Benford’s Law, we should expect that most real-world datasets will only show an approximate agreement with Benford’s Law. The agreement can be very close, but we should still expect to see deviations from Benford’s Law – just as we saw in the figure at the top of this post. Since real-world datasets won’t follow Benford’s Law precisely, we also shouldn’t expect to see a real-world dataset follow the base-invariant form of Benford’s Law in Eq.2 for every choice of base. In practice, this means that there is usually a particular base $b$ for which we will see closer agreement with the distribution in Eq.2, compared to other choices of base.

Why computer file sizes follow Benford’s Law

Why does Benford’s Law apply to the sizes of the files on your computer? Those file sizes can span over several orders of magnitude – from a few hundred bytes to several hundred megabytes. There is also no reason why my files should cluster around a particular file size – I have photos, scientific and technical papers, videos, small memos, slides, and so on. It would be very unusual if all those different types of files, from different use cases, end up having very similar sizes. So, I expect Benford’s Law to be a reasonable description of the distribution of first digits of the file sizes of files in my “Documents” folder.

However, if the folder I was looking at just contained, say, daily server logs, from a server that ran very unexciting applications, I would expect the server log file size to be very similar from one day to the next. I would not expect Benford’s Law to be a good fit to those server log file sizes.

In fact a significant deviation from Benford’s Law in the file size distribution would indicate that we have a file size generation process that is very different from a normal human user going about their normal business. That may be entirely innocent, or it could be indicative of some fraudulent activity. In fact, fraud detection is one of the practical applications of Benford’s Law.

The history of Benford’s Law

Simon Newcomb

One of the reasons why the fascination with Benford’s Law endures is the story of how it was discovered. With such an intriguing mathematical pattern, it is perhaps no surprise to learn that Frank Benford was not the first scientist to spot the first-digit pattern. The astronomer Simon Newcomb had also published a paper on, “the frequency of use of the different digits in natural numbers” in 1881. Before the advent of modern computers, scientists and mathematicians computed logarithms by looking them up in mathematical tables – literally books of logarithm values. I still have such a book from when I was in high-school. The story goes that Newcomb noticed that in a book of logarithms the pages were grubbier, i.e. used more, for numbers whose first significant digit was 1. From this Newcomb inferred that numbers whose first significant digit was 1 must be more common, and he supposedly even inferred the approximate frequency of such numbers from the relative grubbiness of the pages in the book of logarithms.

In more recent years the first-digit law is also referred to as the Newcomb-Benford Law, although Benford’s Law is still more commonly used because of Frank Benford’s work in popularizing it.

Benford’s discovery

Frank Benford rediscovered the law in 1938, but also showed that data from many diverse datasets – from the surface area of rivers to population sizes of US counties – appeared to follow the distribution in Eq.1. Benford then published his now famous paper, “The law of anomalous numbers”.

Applications of Benford’s Law

There are several books on Benford’s Law. One of the most recent, and perhaps the most comprehensive is Benford’s Law: Theory and Applications. It is divided into 2 sections on General Theory (a total of 6 chapters) and 4 sections on Applications (a total of 13 chapters). Those applications cover the following:

Detection of accounting fraud
Detection of voter fraud
Measurement of the quality of economic statistics
Uses of Benford’s Law in the natural sciences, clinical sciences, and psychology.
Uses of Benford’s Law in image analysis.

I like the book because it has extensive chapters on applications written by practitioners and experts on the uses of Benford’s Law, but the application chapters make links back to the theory.

Many of the applications, such as fraud detection, are based on the idea of detecting deviations from the Benford Law distribution in Eq.1. If we have data that we expect to span several orders of magnitude and we have no reasons to suspect the data values should naturally cluster around a particular value, then we might expect it to follow Benford’s Law closely. This could be sales receipts from a business that has clients of very different sizes and sells goods or services that vary over a wide range of values. Any large deviation from Benford’s Law in the sales recipts would then indicate the presence of a process that produces very specific receipt values. That process could be data fabrication, i.e. fraud. Note, this doesn’t prove the data has been fraudulently produced, it just means that the data has been produced by a process we wouldn’t necessarily expect.

My involvement with Benford’s Law

In 2001 I published a paper demonstrating that data from high-throughput gene expression experiments tended to follow Benford’s Law. The reasons why mRNA levels should follow Benford’s Law is ultimately those we have already outlined – mRNA levels can range over many orders of magnitude and there are no a priori molecular biology reasons why, across the whole genome, mRNA levels should be centered around a particular value.

In December 2007 a conference on Benford’s Law was organized by Prof. Steven Miller from Brown University and others. The conference was held in a hotel in Sante Fe and was sponsored/funded by Brown University, the University of New Mexico, Universidade de Vigo, and the IEEE. Because of my 2001 paper, I received an invitation to talk at the workshop.

For me, the workshop was very memorable for many reasons,

I had a very bad cold at the time.
Due to snow in both the UK and US, I was snowbound overnight in Chicago airport (sleeping in the airport), and only just made a connecting flight in Dallas to Albuquerque. Unfortunately, my luggage didn’t make the flight connection and ended up getting lost in all the flight re-arrangements and didn’t show up in Santa Fe until 2 days later.
It was the first time I’d seen snow in the desert – this really surprised me. I don’t know why, it just did.
Because of my cold, I hadn’t finished writing my presentation. So I stayed up late the night before my talk to finish writing my slides. To sustain my energy levels through the night whilst I was finishing writing my slides, I bought a Hershey bar thinking it would be similar to chocolate bars in the UK. I took a big bite from the Hershey bar. Never again.
But this was all made up for by the fact I got to sit next to Ted Hill during the workshop dinner. Ted was one of the most genuine and humble scientists I have had the pleasure of talking to. Secondly, the red wine at the workshop dinner was superb.

From that workshop Steven Miller organized and edited the book on Benford’s Law I referenced above. Hence why I think that book is one of the best on Benford’s Law, although I am biased as I contributed one of the chapters – Chapter 16 on “Benford’s Law in the Natural Sciences”

My Python code solution

To end this post, I have given below the code I used to calculate the distribution of first digits of the file sizes on my laptop hard drive.

import numpy as np

# We'll use the pathlib library to recurse over
# directories
from pathlib import Path

# Specify the top level directory
start_dir = "C:\\Users\\David\\Documents"

# Use a list comprehension to recursively loop over all sub-directories 
# and get the file sizes
filesizes = [path.lstat().st_size for path in list(Path(start_dir).glob('**/*' )) if path.is_file()]

# Now count how many file sizes start with a 1
proportion1 = np.sum([int(str(i)[0])==1 for i in filesizes])/len(filesizes)

# Print the result
print("Proportion of filesizes starting with 1 = " + str(proportion1))
print("Number of files = " + str(len(filesizes)))

# Calculate the first-digit proportions for all digits 1 to 9
proportions = np.zeros(9)
for i in range(len(filesizes)):
    proportions[int(str(filesizes[i])[0])-1] += 1.0 
    
proportions /= len(filesizes)

You’re going to need a bigger algorithm – Amdahl’s Law and your responsibilities as a Data Scientist

On February 2, 2024December 17, 2024 By dchoyleIn Algorithms, Data Science, Mathematical Analysis, Scientific computationLeave a comment

You have some prototype Data Science code based on an algorithm you have designed. The code needs to be productionized, and so sped up to meet the specified production run-times. If you stick to your existing technology stack, unless the runtimes of your prototype code are within a factor of 1000 of your target production runtimes, you’ll need a bigger, better algorithm. There is a limit to what speed up your technology stack can achieve. Why is this? Read on and I’ll explain. And I’ll explain what you can do if you need more than a 1000-fold speed up of your prototype.

Speeding up your code with your current tech stack

There are two ways in which you can speed up your prototype code,

Improve the efficiency of the language constructs used, e.g. in Python replacing for loops with list comprehensions or maps, refactoring subsections of the code etc.
Horizontal scaling of your current hardware, e.g. adding more nodes to a compute cluster, adding more executors to the pool in a Spark cluster.

Point 2 assumes that your calculation is compute bound and not memory bound, but we’ll stick with that assumption for this article. We also exclude the possibility that the productionization team can invent or buy a new technology that is sufficiently different or better than your current tech stack – it would be an unfair ask of the ML engineers to have to invent a whole new technology just to compensate for your poor prototype. They may be able to to, but we are talking solely about using your current tech stack and we assume that it does have some capacity to be horizontally scaled.

So what speed ups can we expect from points 1 and 2 above? Point 1 is always possible. There are always opportunities for improving code efficiency that you or another person will spot when looking at the code for a second time. A more experienced programmer reviewing the code can definitely help. But let’s assume that you’re a reasonably experienced Data Scientist yourself. It is unlikely that your code is so bad that a review by someone else would speed it up by more than a factor of 10 or so.

So if the most we expect from code efficiency improvements is a factor 10 speed up, what speed up can we additionally get from horizontal scaling of your existing tech stack? A factor of 100 at most. Where does this limit of 100 come from? Amdahl’s law.

Amdahl’s law

Amdahl’s law is a great little law. Its origins are in High Performance Computing (HPC), but it has a very intuitive basis and so is widely applicable. Because of that it is worth explaining in detail.

Imagine we have a task that currently takes time T to run. Part of that task can be divided up and performed by separate workers or resources such as compute nodes. Let’s use P to denote the fraction of the task that can be divided up. We choose the symbol P because this part of the overall task can be parallelized. The fraction that can’t be divided up we denote by S, because it is the non-parallelizable or serial part of the task. The serial part of the task represents things like unavoidable overhead and operations in manipulating input and output data-structures and so on.

Obviously, since we’re talking about fractions of the overall runtime T, the fractions P and S must sum to 1, i.e.

The parallelizable part of the task takes time TP to run, whilst the serial part takes time TS to run.

What happens if we do parallelize that parallelizable component P? We’ll parallelize it using N workers or executors. When N=1, the parallelizable part took time TP to run, so with N workers it should (in an ideal world) take time TP/N to run. Now our overall run time, as a function of N is,

This is Amdahl’s law¹. It looks simple but let’s unpack it in more detail. We can write the speed up factor in going from T(N=1) to T(N) as,

The figure below shows plots of the speed-up factor against N, for different values of S.

From the plot in the figure, you can see that the speed up factor initially looks close to linear in N and then saturates. The speed up at saturation depends on the size of the serial component S. There is clearly a limit to the amount of speed up we can achieve. When N is large, we can approximate the speed up factor in Eq.3 as,

From Eq.4 (or from Eq.3) we can see the limiting speed up factor is 1/S. The mathematical approximation in Eq.4 hides the intuition behind the result. The intuition is this; if the total runtime is,

then at some point we will have made N big enough that P/N is smaller than S. This means we have reduced the runtime of the parallelizable part to below that of the serial part. The largest contribution to the overall runtime is now the serial part, not the parallelizable part. Increasing N further won’t change this. We have hit a point of rapidly diminishing returns. And by definition we can’t reduce S by any horizontal scaling. This means that when P/N becomes comparable to S, there is little point in increasing N further and we have effectively reached the saturation speed up.

How small is S?

This is the million-dollar question, as the size of S determines the limiting speed up factor we can achieve through horizontal scaling. A larger value of S means a smaller speed up factor limit. And here’s the depressing part – you’ll be very lucky to get S close to 1%, which would give you a speed up factor limit of 100.

A real-world example

To explain why S = 0.01 is around the lowest serial fraction you’ll observe in a real calculation, I’ll give you a real example. I first came across Amdahl’s law in 2007/2008, whilst working on a genomics project, processing very high-dimensional data sets². The calculations I was doing were statistical hypothesis tests run multiple times.

This is an example of an “embarrassingly parallel” calculation since it just involves splitting up a dataframe into subsets of rows and sending the subsets to the worker nodes of the cluster. There is no sophistication to how the calculation is parallelized, it is almost embarrassing to do – hence the term “embarrassingly parallel”.

The dataframe I had was already sorted in the appropriate order, so parallelization consisted of taking a small number of rows off the top of dataframe and sending to a worker node and repeating. Mathematically, on paper, we had S=0. Timings of actual calculations with different numbers of compute nodes and fitting an Amdahl’s law curve to those timings revealed we had something between S=0.01 and S=0.05.

A value of S=0.01 gaves us a maximum speed up factor of 100 from horizontal scaling. And this was for a problem that on paper had S=0. In reality, there is always some code overhead in manipulating the data. A more realistic limit on S for an average complexity piece of Data Science code would be S=0.05 or S=0.1, meaning we should expect limits on the speed up factor of between 10 and 20.

What to do?

Disappointing isn’t it!? Horizontal scaling will speed up our calculation by at most a factor of 100, and more likely only a factor of 10-20. What does it mean for productionizing our prototype code? If we also include the improvements in the code efficiency, the most we’re likely to be able to speed up our prototype code by is a factor of 1000 overall. It means that as a Data Scientist you have a responsibility to ensure the runtime of your initial prototype is within a factor of 1000 of the production runtime requirements.

If a speed up of 1000 isn’t enough to hit the production run-time requirements, what can we do? Don’t despair. You have several options. Firstly, you can always change the technology underpinning your tech stack. Despite what I said at the beginning of this post, if you are repeatedly finding that horizontal scaling of your current tech stack does not give you the speed-up you require, then there may be a case for either vertical scaling the runtime performance of each worker node or using a superior tech stack if one exists.

If improvement by vertical scaling of individual compute nodes is not possible, then there are still things you can do to mitigate the situation. Put the coffee on, sharpen your pencil, and start work on designing a faster algorithm. There are two approaches you can use here,

Reduce the performance requirements: This could be lowering the accuracy through approximations that are simpler and quicker to calculate. For example, if your code involves significant matrix inversion operations you may be able to approximate a matrix by its diagonal and explicitly hard code the calculation of its inverse rather than performing expensive numerical inversion of the full matrix.
Construct a better algorithm: There are no easy recipes here. You can get some hints on where to focus your effort and attention by identifying the runtime bottlenecks in your initial prototype. This can be done using code profiling tools. Once a bottleneck has been identified, you can then progress by simplifying the problem and constructing a toy problem that has the same mathematical characteristics as the original bottleneck. By speeding up the toy problem you will learn a lot. You can then apply those learnings, even if only approximately, to the original bottleneck problem.

When I first stumbled across Amdahl’s law, I mentioned it to a colleague working on the same project as I was. They were a full-stack software developer and immediately, said “oh, you mean Amdahl’s law about limits on the speed you can write to disk?”. It turns out there is another Amdahl’s Law, often called “Amdahl’s Second Law”, or “Amdahl’s Other Law”, or “Amdahl’s Lesser Law”, or “Amdahl’s Rule-Of-Thumb”. See this blog post, for example, for more details on Amdahl’s Second Law.
Hoyle et. al, “Shared Genomics: High Performance Computing for distributed insights in genomic medical research”, Studies in Health Technology & Informatics 147:232-241, 2009.

The Royal Statistical Society Conference and Data Science

On September 24, 2022December 9, 2023 By dchoyleIn Conferences, Data Science, MeetUpsLeave a comment

This year the UK’s Royal Statistical Society (RSS) held its annual international conference in Aberdeen between the 12^th and 15^th September 2022.

You may think that the society’s main conference doesn’t hold that much relevance for you as a Data Scientist. Yes, you have an interest in Data Science with a statistical flavour, but surely the main conference is all clinical trials analysis and the like, isn’t it? My job over the next 980 words is to persuade you otherwise.

Statistics is about the whole data life cycle

Go to the RSS website or look at an official email from the RSS and you’ll see that the RSS strapline is “Data | Evidence | Decisions”. This accurately reflects the breadth of topics covered at the conference – in the session talks, the posters, and the plenary lectures. Statistics is about data, and modern statistics now concerns itself with all aspects related to data – how it is collected, how it is analysed, how models are built from that data, how inferences are made from those models, and how decisions are made off the back of those inferences. A modern general statistics conference now has to reflect the full end-to-end lifecycle of data and also the computational and engineering workflows that go with it. This year’s RSS conference did just that.

A Strong Data Science focus

Over the three main days of the conference there were 7 specific sessions dedicated to Data Science, totalling 8hrs and 20mins of talks. You can see from the full list below the breadth covered in the Data Science sessions.

Novel applications and Data Sets
Introduction to MLOps
The secret sauce of Open Source
Data Science for Health Equity
The UK’s future data research infrastructure
Epidemiological applications of Data Science
Algorithmic bias and ethical considerations in Data Science

On top of this there were Data Science topics in the 8 rapid fire talk sessions and in the 110 accepted posters. Example Data Science related topics included MLOps, Decentralized finance, Genetic algorithms, Kernels for optimal compression of distributions, Changepoint detection, Quantifying the Shannon entropy of a histogram, Digital Twins, Joint node degree estimation in Erdos-Renyi networks, Car club usage prediction, and Deep hierarchical classification of crop types from satellite images.

A growing Data Science presence

I’ve been involved with the conference board this year and last (Manchester 2021) and my perception is the size of the conference in increasing, in terms of number of submissions and attendees, the range of topics, and the amount of Data Science represented. However, I only have two datapoints here. One of those was just as the UK was coming out of its first Covid-19 lockdown, so will probably not provide a representative baseline. So I’m not going to stick my neck out too much here, but I do expect further increases in the amount of Data Science presence at next year’s conference.

Other relevant sessions

If like me you work primarily as a Data Scientist in a commercial environment, then there were also many talks from other Sections of the RSS that were highly relevant. The Business, Industry and Finance section had talks on Explainable AI, Novel Applications of Statistics in Business, and Democratisation of Statistics in GlaxoSmithKline, whilst the Professional Development section had talks on Linked Open Data, programming in R and Python, and the new Quarto scientific publishing system.

The Future of the Data Science Profession

Of particular relevance to Data Scientists was the Professional Development section’s talk on the new Alliance for Data Science Professionals accreditations of which the RSS is part. The session walked through the various paths to accreditation and the collaborative nature of the application process. This was backed up by a Data Science ‘Beer and Pizza’ event hosted by Brian Tarran (former Significance magazine editor and now RSS Head of Data Science Platform) and Ricky McGowan (RSS Head of Standards and Corporate Relations) who both explained some of the RSS long-term plans for Data Science.

Diversity of topics across the whole conference

Diversity of topics was a noticeable theme emerging from the conference as a whole, not just in the Data Science and commercial statistics streams. For me, this reflects the broader desire of the RSS to embrace Data Scientists and any practitioners who are involved with analysing and handling data. It reflects a healthy antidote to the ‘Two cultures of statistical modelling‘ divide identified and discussed by Leo Breiman many years ago.

For example, the range of plenary talks was equally impressive as the diversity of topics in the various sessions. Like many Data Scientists my original background was a PhD in Theoretical Physics. So, a talk from Ewain Gwynne on Random Surfaces and Liouville Quantum Gravity – see picture below – took me back 30 years and also gave me an enjoyable update on what has happened in the field in those intervening years.

Ewain Gwynne talking about Random Surfaces and Liouville Quantum Gravity.

Other plenary highlights for me were Ruth King’s Barnett lecture on statistical ecology and Adrian Raftery’s talk on the challenges of forecasting world populations out to the year 2100 and as far as 2300 – see below.

Adrian Raftery talking about Bayesian Demography.

A friendly conference

The conference is not a mega-conference. We not talking NeurIPS or ICML. It was around 600 attendees – big enough not to be too insular and focused only on one or two topics, but still small enough to be welcoming, friendly and very sociable. There were social events on every evening of the conference. And to top it all, it was even sunny in Aberdeen for the whole week.

I also got to play pool against the person who led the UK’s COVID-19 dashboard work, reporting the UK government’s official daily COVID-19 stats to the general public. I lost 2-1. I now hold a grudge.

Next year – Harrogate 2023

Next year’s conference is in Harrogate, 4^th – 7^th September 2023. I will be going. Between now and then I will be practicing my pool for a revenge match. I will also be involved with the conference board again, helping to shape the Data Science content. I can promise a wide range of Data Science contributions and talks on other statistical topics Data Scientists will find interesting. I can’t promise sunshine, but that’s Yorkshire for you.

How many iterations are needed for the bisection algorithm?

On August 9, 2022September 24, 2022 By dchoyleIn Algorithms, Data Science, Mathematical Analysis, Scientific computationLeave a comment

<TL;DR>

The bisection algorithm is a very simple algorithm for finding the root of a 1-D function.
Working out the number of iterations of the algorithm required to determine the root location within a specified tolerance can be determined from a very simple little hack, which I explain here.
Things get more interesting when we consider variants of the bisection algorithm, where we cut an interval into unequal portions.

</TL;DR>

A little while ago a colleague mentioned that they were repeatedly using an off-the-shelf bisection algorithm to find the root of a function. The algorithm required the user to specify the number of iterations to run the bisection for. Since my colleague was running the algorithm repeatedly they wanted to set the number of iterations efficiently and also to achieve a guaranteed level of accuracy, but they didn’t know how to do this.

I mentioned that it was very simple to do this and it was a couple of lines of arithmetic in a little hack that I’d used many times. Then I realised that the hack was obvious and known to me because I was old – I’d been doing this sort of thing for years. My colleague hadn’t. So I thought the hack would be a good subject for a short blog post.

The idea behind a bisection algorithm is simple and illustrated in Figure 1 below.

How the bisection algorithm works — Figure 1: Schematic of how the bisection algorithm works

At each iteration we determine whether the root is to the right of the current mid-point, in the right-hand interval, or to the left of the current mid-point, in the left-hand interval. In either case, the range within which we locate the root halves. We have gone from knowing it was in the interval $[x_{lower}, x_{upper}]$ , which has width $x_{upper}-x_{lower}$ , to knowing it is in an interval of width $\frac{1}{2}(x_{upper}-x_{lower})$ . So with every iteration we reduce our uncertainty of where the root is located by half. After $N$ iterations we have reduced our initial uncertainty by $(1/2)^{N}$ . Given our initial uncertainty is determined by the initial bracketing of the root, i.e. an interval of width $(x_{upper}^{(initial)}-x_{lower}^{(initial)})$ , we can now work out that after $N$ iterations we have narrowed down the root to an interval of width ${\rm initial\;width} \times \left ( \frac{1}{2}\right ) ^{N}$ . Now if we want to locate the root to within a tolerance ${\rm tol}$ , we just have to keep iterating until the uncertainty reaches ${\rm tol}$ . That is, we run for $N$ iterations where $N$ satisfies,

$\displaystyle N\;=\; -\frac{\ln({\rm initial\;width/tol})}{\ln\left (\frac{1}{2} \right )}$

Strictly speaking we need to run for $\lceil N \rceil$ iterations. Usually I will add on a few extra iterations, e.g. 3 to 5, as an engineering safety factor.

As a means of easily and quickly determining the number of iterations to run a bisection algorithm the calculation above is simple, easy to understand and a great little hack to remember.

Is bisection optimal?

The bisection algorithm works by dividing into two our current estimate of the interval in which the root lies. Dividing the interval in two is efficient. It is like we are playing the childhood game “Guess Who”, where we ask questions about the characters’ features in order to eliminate them.

Asking about a feature that approximately half the remaining characters possess is the most efficient – it has a reasonable probability of applying to the target character and eliminates half of the remaining characters. If we have single question, with a binary outcome and a probability $p$ of one of those outcomes, then the question that has $p = \frac{1}{2}$ maximizes the expected information (the entropy), $p\ln (p)\;+\; (1-p)\ln(1-p)$ .

Dividing the interval unequally

When we first played “Guess Who” as kids we learnt that asking questions with a much lower probability $p$ of being correct didn’t win the game. Is the same true for our root finding algorithm? If instead we divide each interval into unequal portions is the root finding less efficient than when we bisect the interval?

Let’s repeat the derivation but with a different cut-point e.g. 25% along the current interval bracketing the root. In general we can test whether the root is to the left of right of a point that is a proportion $\phi$ along the current interval, meaning the cut-point is $x_{lower} + \phi (x_{upper}-x_{lower})$ . At each iteration we don’t know in advance which side of the cut-point the root lies until we test for it, so in trying to determine in advance the number of iterations we need to run, we have to assume the worst case scenario and assume that the root is still in the larger of the two intervals. The reduction in uncertainty is then, ${\rm max}\{\phi, 1-\phi\}$ . Repeating the derivation we find that we have to run at least,

$\displaystyle N_{Worst\;Case}\;=\;\ -\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm max}\{\phi, 1 - \phi \right \})}$

iterations to be guaranteed that we have located the root to within $tol$ .

Now to determine the cut-point $\phi$ that minimizes the upper bound on number of iterations required, we simply differentiate the expression above with respect to $\phi$ . Doing so we find,

$\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; -\frac{\ln({\rm initial\;width/tol})}{ (1-\phi) \left ( \ln (1 - \phi) \right )^{2}} \;\;,\;\; \phi < \frac{1}{2}$

and

$\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; \frac{\ln({\rm initial\;width/tol})}{\phi \left ( \ln (\phi) \right)^{2}} \;\;,\;\; \phi > \frac{1}{2}$

The minimum of $N_{Worst\;Case}$ is at $\phi =\frac{1}{2}$ , although $\phi=\frac{1}{2}$ is not a stationary point of the upper bound $N_{Worst\;Case}$ , as $N_{Worst\;Case}$ has a discontinuous gradient there.

That is the behaviour of the worst-case scenario. A similar analysis can be applied to the best-case scenario – we simply replace $max$ with $min$ in all the above formula. That is, in the best-case scenario the number of iterations required is given by,

$\displaystyle N_{Best\;Case}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm min}\{\phi, 1 - \phi \right \})}$

Here, the maximum of the best-case number of iterations occurs when $\phi = \frac{1}{2}$ .

That’s the worst-case and best-case scenarios, but how many iterations do we expect to use on average? Let’s look at the expected reduction in uncertainty in the root location after $N$ iterations. In a single iteration a root that is randomly located within our interval will lie, with probability $\phi$ , in segement to the left of our cut-point and leads to a reduction in the uncertainty by a factor of $\phi$ . Similarly, we get a reduction in uncertainty of $1-\phi$ with probability $1-\phi$ if our randomly located root is to the right of the cut-point. So after $N$ iterations the expected reduction in uncertainty is,

$\displaystyle {\rm Expected\;reduction}\;=\;\left ( \phi^{2}\;+\;(1-\phi)^{2}\right )^{N}$

Using this as an approximation to determine the typical number of iterations, we get,

$\displaystyle N_{Expected\;Reduction}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ( \phi^{2} + (1-\phi)^{2} \right )}$

This still isn’t the expected number of iterations, but to see how it compares Figure 2 belows shows simulation estimates of $\mathbb{E}\left ( N \right )$ plotted against $\phi$ when the root is random and uniformly distributed within the original interval.

The number of iterations needed for the bisection algorithm — Number of iterations required for the different root finding methods.

For Figure 2 we have set $w = ({\rm initial\;width/tol}) = 0.01$ . Also plotted in Figure 2 are our three theoretical estimates, $\lceil N_{Worst\;Case}\rceil, \lceil N_{Best\;Case}\rceil, \lceil N_{Expected\;Reduction}\rceil$ . The stepped structure in these 3 integer quantities is clearly apparent, as is how many more iterations are required under the worst case method when $\phi \neq \frac{1}{2}$ .

The expected number of iterations required, $\mathbb{E}( N )$ , actually shows a rich structure that isn’t clear unless you zoom in. Some aspects of that structure were unexpected, but requires some more involved mathematics to understand. I may save that for a follow-up post at a later date.

Part 3 – What does the future hold?: Using forecasting in a commercial environment

On February 15, 2022December 3, 2023 By dchoyleIn Data Science, Forecasting, TimeSeriesLeave a comment

This is the last in my series of posts on forecasting. The posts have focused on the ‘why’ of forecasting and also some of the practicalities of forecasting. This last post is going to be shorter. It is simply some links to forecasting resources that I’ve found useful or think could be useful – books, articles, online tutorials and software, as well as my opinions on which bits you should focus on learning.

Books/Articles

Forecasting: Methods and Applications by Spyros Makridakis, Steven Wheelwright and Rob Hyndman. This was the book that was recommended to me by a colleague when I started in commercial forecasting. Rob Hyndman (one of the authors) says it is out of date, and recommends his later textbook (see next), but I still find it useful.
Forecasting: Principles and Practice by Rob Hyndman and George Athanasopoulos is considered one of the modern bibles on classical forecasting techniques. It is now in its 3^rd edition and also available online for free.
Introductory Time Series with R by Paul Cowpertwait and Andrew Metcalfe. I found this short but concise Springer book (in the Use R! series) on classic time-series analysis in R a great help. It was useful both from an R perspective, but also for short practical introductions and explanations of the various ARIMA concepts. Some of the links to the datasets used are now broken apparently, but I have seen comments that the resources are not hard to find with a google search.
This recent and comprehensive review article in the International Journal of Forecasting is great (arxiv version here). It has short readable paragraphs and sections on a large number of concepts and forecasting topics, so you can simply pick the topic you’re interested in and read just that. Or read the whole article end-to-end if you want.

Blogs

Rob Hyndman’s blog is the main blog I tend to routinely look at. It is always an excellent read and contains links to blogs that Hyndman recommends (although these tend to be more econometrics and statistics focused).

Software

I’m only going to give links to free open-source software. There are some other excellent commercial applications available, but not everyone will be able to get access to them, so I won’t list them.

R: I have tended to do most of my classical time-series analysis in R. The in-built arima functions and also the forecast package (created by Rob Hyndman and Yeasmin Khandakar) provide a great deal of functionality and are my go-to packages/functions in R for time-series.
The statsmodels package in python provides a similar model building experience to building models in R. Consequently, its time-series functionality provides similar capabilities to that found in R.
Darts package in python: I have done less time-series analysis in python than I have in R. When I have done exploratory time-series analysis in python I have tended to use statsmodels. Having said that, the Darts package, and the Kats package (from Facebook) look like useful python packages from the bits I have read.
Prophet package: The Prophet package, from Facebook, is open-sourced, flexible and very powerful. I have tried it for a couple of tasks. Under the hood it is based upon the Stan probabilistic programming language (PPL), which I have used a lot (both in and outside of my main employment). Prophet is fully automated but I would still recommend you have a basic grasp of classical time-series analysis concepts before you use Prophet, to guard against those situations where a fitted model is inappropriate or clearly wrong.
The engineering team at Uber have also released their own forecasting package, Orbit, which performs Bayesian forecasts using various PPLs under the hood (similar to the way the Prophet package uses Stan).

Methods/Concepts/Techniques you should know about

ARIMA: You should definitely become familiar with the classical approaches to time-series analysis, namely ARIMA. The name is a combination of acronyms, and I’ve given a breakdown of the full acronym and what I think are the important aspects to know about.
- AR: Auto-Regressive. These are the ‘lag’ terms in the time-series model equation, whereby the response variable value at timepoint t can depend on the value of the response variable at previous time-points. It is important to understand, i) how the value of the lag coefficients affect the long-run mean and variance of the response variable, including how this determines whether a process is stationary or not, ii) how to determine the order of the AR terms, e.g. by looking at a Partial Auto-Correlation Function (PACF) plot, iii) how the AR terms are infinite impulse response (IIR) terms, in contrast to the finite impulse response (FIR) moving average terms.
- I: Integrated. This is the concept within ARIMA that most Data Scientists are least familiar with but a very important one, particularly when dealing with quantities that we naturally expect to grow over time, for example by accumulating increments that increase on average. It is important to understand, i) how to run unit-root tests to test for integrated series – beware the difference between Phillips-Perron (PP) and KPSS tests, as the null hypothesis is different between them, ii) Cointegration and spurious regression (spurious correlation) – for which Clive Granger won a Nobel memorial prize in Economics (shared with Robert Engle) in 2003.
- MA: Moving Average. These are the ‘error’ terms in the time-series model equation, whereby the response variable value at timepoint t can be affected by the stochastic error not just at t, but at previous timepoints as well. This allows the response variable to be affected by short timescale perturbations of finite duration (hence the classification of the moving average terms as being finite impulse response terms).
Even if you intend to use only neural network approaches to forecasting, or want to use, say, the Prophet package as an AutoML forecasting solution, it is still a good idea to get a good grasp of ARIMA models. Investing some time in getting to grips with the basics of ARIMA and doing some hands-on playing with ARIMA models will pay huge dividends.
Error Correction Models (ECMs). You may never have need to use Error Correction Models, but they are useful for incorporating the transient departures from a long-term equilibrium. I found these University of Oxford summer school lectures notes given by Prof. Robin Best from Binghamton University (SUNY) a very good and accessible introduction to error correction models. The lecture notes also give excellent explanations of the concepts of integration and co-integration in time-series analysis. I used these lecture notes when I was having to develop an ECM for a long-range stress-testing model.
Neural Network and other Machine Learning techniques: Neural networks have been applied to time-series for a long time, but being blunt, until the last 7 years or so they weren’t very good and didn’t outperform the classical time series approaches (in my opinion). In part that was probably because machine learning practitioners tended to view time-series analysis and forecasting as ‘just another prediction problem’, and so the approaches didn’t really take into account the time-series nature of the data, e.g. the auto-regressive structure, the very things that make time-series what they are. Coupled with the fact that the classical time-series analysis approaches have a very solid theoretical underpinning in ARIMA (expressed in terms of the lag or backshift operator), this meant that machine learning approaches didn’t make as many inroads into time-series analysis as they did in other fields. However, with the advent of Deep Learning approaches using Recurrent Neural Networks and LSTM units, machine learning approaches to time-series analysis have really begun to make their mark. Models such as the DeepAR model from Salinas et al are now considered to outperform classical approaches for specific tasks. Hyndman’s book, “Forecasting: Principles and Practice” contains a chapter on machine learning approaches to time-series analysis, but in my opinion it is only very basic. The extensive review article by Petropoulos et al has a section on ‘Data-Driven Methods’ that includes sub-sections on Neural Networks and also Deep Probabilistic Forecasting Models. However, given the comprehensive nature of the whole review article these sub-sections are necessarily short. Other more extensive resources that I have found useful, which also cover the Deep Learning approaches include,
- This survey paper by Lim and Zohren.
- Lina Faik has an excellent and very up-to-date two-part blog-post on DeepAR and other seq2seq based approaches (Part 1 here, Part 2 here – on Medium) .
- There is a nice introductory TensorFlow tutorial on time-series forecasting with Neural Networks (including notebooks).
- This blog-post from Google on interpretable deep learning for time series forecasting.

That is it. I hope you have enjoyed this series of posts on forecasting. As with anything in Data Science, forecasting isn’t a spectator sport. The best way to learn is to download some datasets and start playing. You will make mistakes, but that is how you learn.

Part 2 – What does the future hold? : Using forecasting in a commercial environment

On January 11, 2022July 4, 2022 By dchoyleIn Data Science, Forecasting, TimeSeriesLeave a comment

<TL;DR>

This is part 2 of 3 about producing forecasts in real-world situations. Part 1 was more about the ‘what’ of forecasting, and specifically about different forecast horizons and how those different horizons shape how you do the forecast and what you can do with it. Part 2 is advice to help you avoid common mistakes when producing a forecast.

There are multiple distinct stages to producing and using a forecast. At the simplest level, we can list these different stages as,

Planning a forecast.
Executing a forecast.
Assessing a forecast.
Taking actions or decisions informed by a forecast.
Updating a forecast.
Deploying a forecasting process.

I have found that overwhelmingly the majority of mistakes I’ve made or seen made, are in the planning stage of producing a forecast. In fact, mistakes that I’ve seen in many of the later stages can ultimately be traced back to a failure to plan appropriately. That is, mistakes were made and spotted in one of the later stages, but if we’d thought about it properly, we could have anticipated that the error or issue would occur due to the way the forecasting process had been planned. This introduces our main takeaway,

Put a lot more time and effort into planning the forecasting process than you were initially going to do

</TL;DR>

Let’s get started. Since the majority of errors I’ve seen (and therefore opportunities for learning) are in the planning stage, that is where I’m going to focus most of my discussion. In fact, I am going to simplify the 6 stages outlined to above to just 3 broad areas of discussion,

Mistakes to avoid when planning a forecast
Mistakes to avoid when executing a forecast
Mistakes to avoid when assessing a forecast

In each of those broad areas I’ll introduce a couple of common issues or mistakes that tend to get made, and also provide some hints on how to solve the issues or avoid the mistakes – the issues and solutions will be underlined to highlight them. I’ll also drop in a couple of real-world examples where I’ve seen these mistakes made, or where I made them myself.

Planning

Model Scope:

Issue: The initial information supplied to you is never sufficient to perform an appropriate forecast. This is an unwritten first rule of forecasting¹.

There will always be important/crucial things the person requesting the forecast has not told you – out of ignorance or absent-mindedness. This is the time to ask those extra questions, such as,
1. Why do you need the forecast? What problem are you actually trying to solve? More importantly, what decision are you trying to make using the information the forecast will give?
2. How are you going to consume the forecast? Is it for insight – identifying the drivers that have the biggest impact on a medium-range outcome? Or is it for strategic planning? Or is it to be incorporated into a machine learning pipeline with an action automatically determined from the result of the forecast, e.g. changing an offer to a segment of customers.
3. What is the forecast horizon over which you need the forecast?
4. At what level of time granularity and segmentation do you need the forecast?
Solution: The answers to a) – d) above are inter-related, i.e. the answer to one may uniquely determine the answer to one of the others, but you should still ask each of those questions individually.

By understanding the scope of a system and how the forecast output is actually going to be used we avoid errors such as, failing to identify when the use-case does not justify the time and effort to develop the proposed forecasting model. A good example of this I’ve seen was a model developed for a national social housing charity that needed to predict the future costs of repairs to its housing stock. Due to various operational sensitivities, actions off the back of this prediction could only be taken at the regional level – the charity only needed to predict what the next month’s total repairs costs would be for each region in the country. But …the solution that was built used xgboost to predict the likely repair costs for each house in a region, given details about each house, and then simply aggregated the total predictions to regional level. Over the time horizons being forecasted the housing stock in each region was stable, so an equally accurate forecast could be obtained by using just the actual historical total monthly repair costs. As the historical total monthly repair costs just displayed seasonality and trend, a simple piece of SQL gave a prediction as accurate on a holdout sample as the xgboost based model.

Model Inputs:

Issue: Will you actually know all the input values and model parameter values at forecast time? Check that the values of all the exogenous variables will be known ahead of running the main forecasting model. If the exogenous variables you’ve used in your model include macro-economic quantities, their future values will not be known and you will either need a separate forecasting model for these, or their values will need to be part of the forecast scenario specification. This may be what you intended all along, but you’d be surprised how often someone builds a forecasting model and only afterwards realizes the challenges in specifying the input variables.

This problem can occur even in seemingly benign situations. For example, one mistake I’ve made in the past is using a set of dummy variables to model cohort fixed effects for a model of default rates in a loans portfolio. The only problem was that the loan book was still open – new cohorts were still coming onto the loan book – so to forecast the future default rate of the portfolio required assigning fixed effects to future, as yet unobserved, cohorts. In this instance we chose to make assignment of the future cohort effects part of the scenario specification – scenarios designated future cohorts as ‘high risk’, ‘medium risk’ or ‘low risk’ with the effects values being calculated from an appropriate centile of the historic cohort effects estimates. An alternative approach might have been to treat the cohort effects as random effects and when forecasting marginalize over the random effects of future cohorts. However, the two takeaways from this are, i) when planning a forecast model, think ahead to when you’re going to use it to produce the forecast and make sure you know how you’re going to obtain the input variables, ii) be cautious about including fixed cohort effects when producing forecasts for a changing cohort mix.

Solution: Mentally run through the forecasting process in your head, or on paper, before you start estimating your models. This will flush out issues with the model form or forecasting technique before you have committed to building them.
Issue: Variables/features not included in the model in a sensible or correct form.

I’ve a seen a model built by a marketing team that predicted the response to marketing activity. It was suspected that weather had an impact on how effective the marketing was. This was not an unreasonable hypothesis – when the summer weather is hot and sunny (not that common in the UK) most people want to be outside, in the park or at the beach, not paying close attention to some TV advert. The only problem was that ‘weather’ had been included in the predictive model as the average monthly temperature across the whole of England. There are so many things wrong with this,

The temperature in England on a single day can vary hugely from one place to another. It can be sunny and 25°C in London whilst it is raining and 15°C in Manchester and hailing and 10°C in Newcastle. The England-wide average is a meaningless feature to try and reflect how likely anybody in a specific geography is to respond to marketing. The people in London will be relaxing in the park, ignoring the TV adverts, whilst people in Newcastle will be putting an extra jumper on and hunkering down in front of soap re-runs on the TV.
In a similar vein, the use of a monthly average temperature is pointless. It was believed that the impact of weather on the marketing effectiveness was because of un-seasonal sunny weather over a few days. A monthly average will not reflect this. The monthly England-wide average temperature will reflect just seasonal patterns, not the particular effect the stakeholder was trying to understand.
Temperature is the wrong feature to use here. Hours of sunlight may be better, since the hypothesis was that it was the un-expected very sunny summer weather that was reducing the effectiveness of the marketing. Even better, a feature that captured the presence of unusually hot, dry weather on summer days would be preferable to include in the model. Note that we have now moved from discussing temperature to talking about ‘dry’ summer days, i.e., the absence of any precipitation. When including weather effects in a model it can even be crucial to think what form of precipitation is relevant here. A former colleague told me about some work he’d done for a British mobile phone operator. The mobile company was interested in the impact of weather as they’d noticed that call volumes increased sometimes when there was precipitation. The analysis revealed that, yes, precipitation had an impact, but the form of the precipitation is hugely important. If it’s raining the impact is small, but lower the temperature so that the precipitation comes in the form of snow and the call volumes spike – everybody is phoning home or phoning work to say they are delayed because of snow-blocked roads or trains not running because of ice and snow on the rails. The lesson here is that the precise way in which weather affects our outcome of interest needs to be understood.
Lastly, weather is a prime example of a variable that we won’t automatically know when it comes to producing the forecast. We may have a brilliant forecasting model, but we need to forecast the specific weather feature as well and we may not be able to do that accurately enough to get the benefit of our main forecasting model.

It was suspected that the person who built the model I’ve described just threw ‘weather’ into the model because they were told to. They simply got hold of the easiest or most accessible single weather variable they could find.

Solution: Spend time thinking through the form or particular variant of the feature you are putting into your forecasting model. Will it actually be capable of reflecting the actual effect you are trying to capture in your model? Is it at the right temporal and spatial granularity to be able to do that?

Issue: Insufficient length of training data. Make sure the length of historical data you use to build your forecasting model is sufficiently long. How long is ‘sufficiently long’? Paul Saffo in this 2007 Harvard Business Review article on the 6 Rules For Effective Forecasting says that you should “look back twice as far as you look forward”. I would say you should look back even further. The point here, however, is not to give a precise rule for how much historical data you need for a given forecast horizon, but more to emphasize that the length of historical data you should use is always longer than you think. It should be sufficiently long enough to display several examples of the phenomena you need to capture with your model. To give an example – if you are modelling the effects of macro-economic climate on a financial metric, e.g., loan default rates, then you will want to include several business cycles and more importantly several recessionary periods in your historical training data. How far back in history is still a matter of subjective judgement – for example, was the recession resulting from the 2008 financial crash typical or atypical of the dynamics you want to model and forecast? This highlights two points, i) how far you go back in history can require a detailed discussion and review of the historical data – it is not simply, ‘let’s just include the last two recessions’, ii) you need to have a good idea of what sort of phenomena and/or dynamics you need to model for your forecasts to be representative of the scenarios you are trying to understand. Use the wrong data and the usefulness of your forecasting model may be short-lived. For long-range forecasts you’ll never really know in advance the full range of phenomena that your model needs to capture, as unpredictable and impactful phenomena are always capable of arising within the forecast horizon of a long-range forecast – what are called ‘Black-Swan’ events in parts of the popular science literature. In Part 1 we explained that by considering a very wide range of scenarios we mitigate against this, to a degree. But it means that a model used for long-range forecasting has to have captured dynamics and behaviour appropriate to a very wide range of phenomena – and that can require an awful lot of historical data. Anecdotally, I’ve seen that the length of training data required increases super-linearly with the length of the forecast horizon.

Solution: Think about the kind of phenomena or scenarios you want your model to be capable of forecasting. Does your training data contain adequate examples of such phenomena or scenarios? If yes, then you probably have enough training data. If no, then get additional appropriate training data, or shorten your forecast horizon (see Part I for why you should do this).

Model Form:

Issue: Using a complex or unusual generative modelling technique and believing you can just include the predictive features into the ‘model’ as you would when building a linear model or GLM.

I have seen an agent-based model used to attempt to forecast and identify emergent phenomena that it was believed would occur in response to a macro-economic change or shock. The agent-based simulation was used to mimic the microscopic interactions between the agents and their external environment – the general economy. A macro-economic variable, I think it was unemployment rate, was directly coupled to each agent’s propensity to spend. Lo-and-behold, when the unemployment rate increased the forecast showed that the total expenditure in the system went down. This was hailed as a new finding, showing emergent behaviour. No! It just reflected how the macro-economic variables had been included in the modelling. At this point over 12 months (at 3 FTE, I believe) had been expended on this project.

How the exogenous influences are coupled to a forecasting technique is critically important if we want to identify genuine emergent phenomena. Genuinely emergent phenomena are typically a global property of the system, often resulting from a global constraint. For example, in a physical system it could be a requirement to minimize the overall free energy of the system. In a financial system it could be a requirement to maximize total profit. Ideally, we should think about how the exogenous influences interact with these global constraints when including the exogenous variables in our modelling. If instead the exogenous variables are coupled directly to a metric we will later measure, we should not be surprised when that metric changes when the exogenous variables do.

Solution: The more complex the technique used you more you’ll need to think about how you put the predictive features into the ‘model’.
Issue: Computational optimization of forecast model outputs will exploit the weaknesses in your model. Be aware of the potential future downstream uses of your forecasting process. The forecasting technique you’ve used may not be robust enough to support likely (and anticipatable) downstream uses. It is likely, you’ve set up your forecasting process as an automatable, reproducible codebase. That codebase can then be included in a downstream automated process, such as finding the optimal value of one the actionable input variables. The optimization process will optimize the output of the forecasting model and because forecasts are, almost by definition, ‘out-of-sample’, there is the potential for the optimization process to drive the model to a region of the input space where the model output is non-sensical. This is because the optimization process does not know any better – we have not ensured that the forecasting model output has sensible and credible behaviour for all scenarios or for all sets of input values. To do this we need to build sensible structural constraints or principles into our forecasting model. Such constraints or principles usually come from domain knowledge, e.g. from economic principles when building forecasting models that include macro-economic inputs. These constraints or principles represent assumptions – we are assuming that our system of interest will or should obey classical economic principles. If those assumptions are incorrect, we will be guilty of produced a biased forecast. How do we know when to include such constraints or principles? We don’t know precisely, but we can think before constructing the forecasting model whether the benefits of including them outweigh the dis-advantages, and we are forewarned as to the potential bias. The main point here is, again, think and plan.

Solution: Understand if you will always be in control of the uses of your model. If not, then think whether your model needs to be robust to use-cases you can’t control.

Model Estimation:

Issue: The model estimation process is not set up to reflect what you’re actually trying to model. Use a cost-function that reflects the outcome you care about. When fitting a forecasting model, we will typically be minimizing some cost-function. Choose the cost-function appropriately. If forecast accuracy is going to be assessed using a different cost-function you may want to rethink the cost-function you use for fitting. Or in simple terms, fit your model to optimize the outcome you actually care about.

Solution: Think through each step of the proposed estimation process. Is it ideal for the thing you’re trying to capture.

Execution

Issue: How do you gauge whether your forecast is credible? Always run a baseline calculation or baseline scenario. For a short-range forecast you may have characterized very precisely the scenario you wish to predict, but your baseline scenario can still be something like a Business-As-Usual (BAU) scenario. For long-range forecasting, you can also use a BAU scenario as your baseline scenario but the definition of BAU may be more subjective and contain significant movement in exogenous influences – although it probably won’t be what you consider to be the most extreme scenario. The main benefit of running a baseline scenario is that it allows you to compute realistic ‘deltas’ even if there are far-from-perfect assumptions in your forecasting methodology. Remember the quote from George Box – ‘All models are wrong, but some are useful.’ The skill as a Data Scientist/Statistician is in knowing how to extract the useful insight and information from a ‘wrong’ model. With a baseline calculation you can compute how much worse or better the outcome is under scenario X compared to the baseline scenario. As a human you may then have a feel for what is incorrect in the baseline calculation and correct it or down-weight it. The model based estimate of the delta between scenario X and baseline scenario can then be applied to the human corrected baseline.

Solution: Running a baseline scenario where you have a more intuitive feel for how the system responds will help you assess any other scenario.
Issue: We have a forecast but not a quantitative measure of how confident we are with it. Always produce measures of uncertainty, e.g. confidence intervals for your forecasts. There is limited value in just a point estimate. How sensitive is that estimate to the stochastic component of the response variable dynamics? Then on top that we have uncertainty in the forecast due to parameter uncertainty and potentially also input uncertainty. Sensitivity analysis can help us quantify the impact on the forecast from both parameter uncertainty and input uncertainty, so that we can identify which we need to improve most. Don’t assume that just because values of exogenous variables have been specified for the forecast scenario that they are accurate. Forecast scenarios that are, upfront, specified very precisely can still be mis-specified or specified inappropriately, or even subject to change – it is not unusual for a company to execute a different BAU scenario to what they said they would at the time the forecast was produced.

Solution: Quantify the impact on a forecast from all the major sources of uncertainty. Doing so is essential for framing and qualifying any decisions or actions taken on the back of the forecast output.

Issue: Distinguishing outputs from different forecasts gets messy when you have lots of them. Create a system for time-stamping forecast outputs, the associated input data and meta-data and also the codebase used to produce the forecast. This creates a disciplined process for running, re-running, and changing forecasts whilst knowing which run produced which forecast. You’ll be surprised how often you’ll end up asking yourself questions similar to , ‘now, did I include 5 or 6 years of training data and was it a 2month gap between the end of the training data and the beginning of the forecast horizon?’ Set-up a system to accurately capture what each forecast used, did, and what it was about.

Solution: Set up a system from the outset for timestamping and logging your forecast outputs along with the details on the inputs.

Assessment

Issue: How do you assess the accuracy of your forecasting method if the inputs may also be uncertain? If your forecasting model includes an input feature that itself is forecasted, then always perform holdout tests on your model with and without perfect hindsight.
1. Testing with perfect hindsight is when we perform the holdout test using the actual observed values of all the input features.
2. Testing without perfect hindsight is when we perform the holdout test using the forecasted values of any input features that need to be forecasted when actually running in production.
The clear value of performing the two different versions of the holdout tests is that it helps identify where the biggest bottleneck in forecast accuracy is. There is no point in trying to further improve a main forecasting model that is already accurate, say over a 1 year time horizon, if it is only accurate if we know all the inputs precisely and our forecasts of one of the input features is woeful – put the effort into improving the feature forecasting model.

Solution: Assess holdout forecast accuracy with and without perfect hindsight on the input variables.
Issue: How do know whether your complex forecasting technique is adding any value? Always run a baseline technique when assessing holdout accuracy. This is true of any use of machine learning. When building predictive classification models we often build a simple classifier, such as a naïve Bayes classifier, to provide a baseline against which to judge our more complex and sophisticated models and to check that the extra complexity is warranted. Similarly, when producing a forecast for a scenario using our chosen forecasting technique, we should include the forecast from a much simpler technique, such as exponential smoothing. In fact, this commentary, in the International Journal of Forecasting, on the recent M5 forecasting competition suggests 92.5% of the time you won’t beat a simple exponential smoothing model.

Solution: Include a simple baseline forecasting technique.
Issue: Over confidence in the forecast output.

Another of my favourite quotes from the famous statistician George Box,

‘Statisticians, like artists, have the bad habit of falling in love with their models.
George Box

I have been guilty of this myself. We become blind to the possibility that the output from a model can still be garbage even when we have provided high-quality input data. We believe that because we have circumvented the ‘garbage-in, garbage-out’ issue we must have a credible forecast because of the elegance and sophistication of the forecasting technique we have used. We have become seduced by the elegance of the forecasting technique. We have forgotten that ‘all models are wrong’. Well, if a model can be wrong, it can be completely and utterly wrong, and we should remember that. The first rule of forecast assessment – always doubt your forecast. Look at the forecast and see if you can explain to yourself why the forecast has that shape given the inputs.

Solution: Always be prepared to doubt and, if necessary, overrule your forecasts.

That’s Part 2. I hope it has been helpful. In Part 3 I’ll list some forecasting learning resources and tools that I’ve found useful.

Footnote 1: I seem to recall seeing this rule written in a blogpost or paper somewhere, but I’m unable to locate it. If anybody is aware of an original source for it, please let me know.

What does the future hold? : Using forecasting in a commercial environment

On November 14, 2021January 12, 2022 By dchoyleIn Data Science, Forecasting, TimeSeries2 Comments

TL;DR: Forecasting is a process, not just a forecasting model. The over-whelming majority of textbooks will teach you how to build a particular type of forecasting model, but not about how, when and where to use a forecasting process. These things are often learnt only through experience. In this series of blog posts I detail what I have learnt about building forecasts and the forecasting process, through 10 years of commercial Data Science roles. The main takeaways – before you build any forecasting models think long and hard about why you need a forecast, what you are going to do with it, at what granularity and over what time horizon is the forecast needed – long-range forecasting is very different from short-range forecasting. You’ll always need some human involvement in the forecasting process even when using automated short-range forecasts, where it is still advisable to include a human oversight step in the decision making process. The longer the range of the forecast, the more human involvement is advisable.

Why forecast?

Almost all my Data Science roles in the commercial sector have been focused on some form of forecasting – from my first role outside of academia, where I was building long-range ‘stress-testing’ models for the UK’s largest retail bank, through building models that predicted the website clicks for AutoTraderUK in response to a TV advertising campaign, to the demand models I build at dunnhumby to forecast demand for the world’s largest grocery retailers. The focus on forecasting is perhaps no surprise. The ultimate use of the models in business is to help optimize some aspect of the business, be it helping determine the correct Tier 1 capital required to underpin the bank’s risk-weighted-assets, or to determine the best mix of TV channels and timings given a TV marketing budget, through to determining the optimal prices for products in a supermarket category. In all these examples it is the future performance of the business that we want to optimize. The use of forecasting models for business optimization is very much at the ‘prescriptive’ end of the Gartner analytics ascendency staircase. Businesses that use Data Science and ML models in this way are attempting to influence the future towards an outcome that is beneficial for them.

Why the need for this series of posts?

Not all businesses do use Data Science and Machine Learning in this way, or are able to, and so are less in control and more subject to the random winds of chance. Businesses that use forecasting models to optimize business operations tend to be both data and analytics mature. Typically, they have been using analytics in this way for a long time. It is not a new endeavour for those businesses. For other businesses that are new to forecasting there will be a temptation to believe that learning to forecast just requires learning the various forecasting techniques. During my various commercial roles I obviously had to learn the technical details of various forecasting techniques – ARIMA, Holt-Winters, etc. BUT….this article is not another introduction to how to use those various techniques to build models – there are plenty of excellent textbooks and online educational resources that will show you how to do that better than I can¹. Instead, this is a series of blog posts about what I have learnt about forecasting along the way. Things which typically aren’t explained in the technical textbooks or technical online articles. Some of these things I’ve learnt the hard way – by making mistakes. Other things I have learnt after the forecast models have been built – when the real challenges of utilizing the models for the actual use case emerge.

Overall, the focus is on how to use forecasting, not how to build specific forecasting models. It will be on understanding what forecasting can do, where you should use it, what it can’t do, and how to get the best out of the forecasting process. I’m going to break this down into a series of 3 posts,

Part1: What is forecasting? What can forecasting be used for?
Part2: How to organize a forecasting process. What to do and what not to do.
Part3: Links to resources and further reading.

Part 1

What is forecasting?

In this era of machine learning and AI can’t we just regard forecasting as just another form of prediction, and forecasting models should be constructed and interpreted just like any other machine learning model? The answer is no.

Forecasting vs Prediction

The high-level distinction between forecasting and prediction is the temporal element. When we forecast we are extrapolating into the future. When we build a predictive model we are usually interpolating within the training set from which the model has been built.

The term ‘projection’ is also used when talking about forecasting. Some organizations, such as the International Panel on Climate Change (IPCC) refer to a projection as a forward looking prediction under a particular scenario, whilst a forecast is the projection (scenario) that is considered most likely.

This also highlights that forecasts acknowledge the inherent element of uncertainty within them. Nate Silver in his book, ‘The Signal and the Noise’ notes that some fields such as seismology strongly emphasize this aspect, distinguishing,

A prediction is a definitive and specific statement about when and where an earthquake will strike…whereas a forecast is a probabilistic statement, usually over a longer timescale.
Nate Silver

Nate Silver states that, ‘The United States Geological Survey’s official position is that earthquakes cannot be predicted. They can, however, be forecasted’ – more details from the USGS here.

Recognizing the importance of the probabilistic nature of forecasts, many modern forecasts are now built directly from probabilistic models, with the uncertainty communicated, for example visually through the use of fan charts.

It’s about time

The temporal element of forecasting is key. It impacts two important aspects of any forecasting model we construct – i) The nature of the variables used in the forecasting model, ii) The time-horizon over which we forecast and what we can use those forecasts for. Let’s look at those two aspects.

Endogenous vs exogenous factors

The temporal element of forecasting means it naturally involves trying to model and/or understand how a system evolves. The factors that influence that evolution can be internal to the system itself -what we call endogenous factors. These are variables that are determined or created by, or emerge from the system itself. An endogenous variable could be as simple as the lagged response variable itself. Other factors that can influence a system’s evolution are external to the system – what we call exogenous factors – such as the broader macro-economic climate when modelling the short-term dynamics of demand for goods or services in a small geographical region.

Forecast horizon

There are multiple temporal components/dimensions/concepts we may need to consider when building a forecasting model,

The time-period used to train a forecasting model.
The time-period over which the forecasting model is tested.
The temporal granularity at which the forecasts are made, e.g. daily, weekly, monthly, etc.
The time increments we use when advancing training/testing windows during the evaluation of the forecasting model.
The time increments we use that set the frequency of the forecasting process when deployed.
The time gap between when the forecasts are made and the date of the first forecast period, i.e., the gap between when the forecasts are made to when they are used.

Figure 1: Some of the different temporal concepts involved in defining a forecast.

Some of these concepts are illustrated in Figure 1 above, but perhaps the most important temporal component, and the one I want to focus on, is the length of the forecast horizon – how far into the future are we attempting to forecast? That is, are we making forecasts for the short-term, medium-term, or long-term. The forecast horizon is strongly linked to what a forecast model can be used for (or should be used for), and how it is used. More specifically,

The appropriateness of different forecasting models and techniques is different over the different horizons.
The accuracy of a forecasting model is different over different horizons.
The factors or variables that influence the response variable being forecasted differ over different horizons.
Even how the system being forecasted responds or evolves can be different over different horizons.

The net effect of all this is that the uses of forecasting are and should be different over different forecast horizons. So how do we define the forecast horizon? What defines a short-term horizon, versus a medium-term or long-term horizon? Ultimately, those concepts should be defined in terms of the characteristics or response of the system being forecast, and not the forecasting technique used.

However, there is no universally agreed definition of short-range versus long-range, as this discussion on CrossValidated testifies to. Below I’ll give my own definition and discuss in detail what distinguishes a short-range forecast from a medium-range or long-range forecast. As well as giving a definition based upon the dynamical characteristics of the system being forecasted and the factors that influence it, I’ll also give a second practical definition based upon how we intend to use the forecasts.

Short-range forecasts:

At a trivial level a short-range or short-term forecast is a forecast of the system we are interested in, but over a short period into the future – yes, a very unhelpful definition. So, what precisely defines short-term? It is more helpful to realize that what we mean by ‘short-term’ can vary from system to system. By short-term, we ultimately imply that we expect the behaviour of the system in the immediate past to be reasonable guide to its behaviour in the short-term future – we don’t expect to frequently see massive changes in level of the response variable, and the recent historical values of the response variable alone can enable us to produce a decent forecasting model. We are in the realm where ARIMA models do well. Over a short-term horizon the influence of exogenous variables has not yet begun to kick in, primarily because the important exogenous variables have not changed significantly – they are effectively constant – on these short-term timescales. For a high-street retailer interested in forecasting how many items they will sell, short-term may mean a few days ahead and up to two weeks ahead, whilst for a financial trader involved in ultra-high frequency trading, short-term is measured in milli-seconds and up to only a second or so. For a system whose dynamics are almost entirely endogenous, or systems whose exogenous influences evolve on timescales of years, e.g. climate systems, then short-term can be measured in multiples of years.

The fact that over a short-term horizon any exogenous influences may not vary illustrates that over different forecast horizons the dynamics of our system can be very different. Over short timescales the dynamics is endogenously controlled, over long timescales the dynamics is exogenously controlled.

In complex systems the dynamics over short and long timescales may be different for reasons other than just how the forecast horizon timescale compares to the timescale on which the exogenous influences vary. In complex systems we are likely to have multiple endogenous timescales defined or emergent. This is particularly true for economic systems and we see it in how those systems respond to a shock or a perturbation. Economists have a rule of thumb – in the short-term people are price inelastic (price-insensitive)², meaning that after a change to a economic system, e.g. supply chain shocks, consumers may not have had time to adapt their behaviour to the new circumstances or the new prices/information available, and so still purchase in a similar manner to before even though prices may have risen. Over longer timescales, people adapt their behaviour – they find cheaper substitutes for the now more expensive item they used to purchase, or they find cheaper suppliers, and so consumers become more price-sensitive again over longer timescales.

The short-term horizon is defined by the shortest timescale process that has an appreciable/relevant influence on the response variable. In our system of interest there may be both exogenous and endogenous processes, and so timescales defined exogenously and endogenously. In a complex system, we will have multiple endogenous timescales, possibly varying by orders of magnitude.

This economic example also highlights that in complex systems and over long timescales we should probably regard everything as ultimately being endogenous due to the degree of inter-connectedness of the various sub-systems. Or in other words, no sub-component of the complex system can be considered on its own as a closed system or independent of other sub-systems, and we should always study the complex system as a whole – but very likely with a lot of simplifying assumptions.

Over the short-term we would expect a forecast to be accurate, or rather capable of being accurate. This doesn’t mean a short-term forecast can’t be massively inaccurate; we could have a extraordinary event and perturbations occur after the the forecast was made, i.e., an assumption that we are forecasting a stable system (a stationary process in statistical language) may turn out to be incorrect due to circumstances that could not have possibly been foreseen – think of a retailer making supply chain forecasts four or six weeks prior to the stock-piling panics that occurred as a consequence of the first Covid-19 lockdowns. Or it may simply be the case that the short-range forecasting model has been poorly built.

Putting unforeseen circumstances and model building competence aside, we would expect a short-range forecast to be reasonably accurate. It can be used for making accurate predictions for very specific scenarios. In contrast, a long-range forecast cannot. This gives us a second practical means of defining forecast horizon. Practically, short-term means the time horizon over which we can use the forecasting model for making detailed specific predictions. Note the emphasis on the word ‘use’. The use-case/business model will define the level of accuracy we require and so can effectively define what is short-range and what is long-range. The recent example of Zillow which exited from US house price forecasting is a case point – see here, here, and here for more detailed discussions. Zillow was using forecasting models over time horizons for which the accuracy was not sufficient to support the particular business model. Zillow was effectively relying on long-range forecasts for detailed predictions, even though the time horizon of 6months ahead may have appeared to be short-term.

The Zillow example illustrates again the difficulty in forecasting complex systems such as markets, particularly if actions taken on the back of the forecast are intended to be part of the market making process. It highlights that perhaps for complex systems we should regards almost all forecasts as long-range.

Medium-range forecasts:

As you might have guessed we can define a medium-term forecast as a forecast over a horizon over which any exogenous influences begin to show significant variation. This is true also for a long-range forecast, but for a medium range forecast horizon we may have a reasonable idea what the values of those exogenous influences may be, or we may even be in control of them, for example they may correspond to actionable variables such as marketing activity variables³. For typical business use-cases medium-range can mean anything from 3-6months out to as much as 18months in the future.

Because exogenous influences start to show significant variation, they can’t simply be absorbed into the intercept of any model, and the typical modelling techniques used are of the form, ‘technique A + X’, meaning that we include the exogenous variables X much like we would when building a standard regression model. Over the medium-term techniques such as ARIMA+X and SARIMA+X are useful.

Long-range forecasts:

By contrast to the definition of what constitutes short-range and medium-range forecasts, a long-range forecast is a forecast over a time-horizon in which exogenous factors have a significant influence and display a significant variance. This could be, for example, macro-economic factors evolving through several business cycles. For the stress-testing models I had to build we were interested in forecasting bank-loan default rates with unemployment rate and central bank base-rate as inputs into the forecasting model. The models were used to produce forecasts with a 5yr forecast horizon. Future unemployment and interest rates were obviously unknown, and so required their own additional forecasting models to predict them.

This macro-economic example highlights that the exogenous influences are themselves subject to variation that is difficult to know in advance. Long-range forecasts have an additional element of uncertainty that increases the final uncertainty of our end forecasts – namely that we probably don’t know all of the inputs to our main forecasting model to a high degree of accuracy. To a large extent this is to be expected. We are forecasting multiple years into the future. In that time many unforeseen circumstances can play-out, e.g., a referendum to leave a major trading block not going the way many people expected, or a global pandemic occurring.

Forecasting exogenous inputs, which encapsulate the influence of national and international contexts, can only be suggestive at best – a reflection of what we think might happen to those exogenous variables, all other things being stable. But….major random, global events do happen. Since we cannot always confidently know what the true future values of the exogenous variable will be, a long-range forecast can only ever be viewed as a ‘what-if’ forecast – what would be the loan default rate if the macro-economic conditions were X. More importantly, a long-range forecast should only ever be used as a ‘what-if’. An individual, specific long-range forecast shouldn’t be used to plan the operation of an organization, or its tactical response to a particular situation.

Does this mean long-range forecasting is useless? No! Far from it! Long-range forecasts won’t tell us what will happen, they tell us what might happen. And so long-range forecasting can be used to help an organization plan strategically. Okay, I hear you say that all forecasts only tell us what might happen, because all forecasts have some uncertainty. What I mean here is that, because the validity of a long-range forecast is dependent on the validity of the input values, we don’t even know if we are looking at an appropriate input scenario. So instead of producing a long-range forecast for a single input scenario, we should always produce a range of long-range forecasts from a range, or ensemble, of input scenarios. The output from an ensemble of long-range forecast then might reveal some behaviours we weren’t expecting, which the business or organization can plan an appropriate response or intervention to. Or alternatively, an ensemble or long-range forecasts may reveal that a particular output metric is largely insensitive to the input scenario, and therefore although we don’t know which scenario will ultimately play-out, we can be confident we know what the value of the metric will be. In our bank stress-testing example we may see that for a wide-ranging ensemble of input scenarios the long-range forecasts indicate that in all the scenarios considered a bank has sufficient capital to withstand the likely increased loan default rates. The bank executives may be confident that no significant capital needs to be raised to protect the bank against whatever the future holds.

You may scoff at last example given financial crisis of 2008, and you may question whether large banks are ever well-prepared for whatever economic future transpires. This may be true, but what it also highlights is that there should always be some discussion around whether the ensemble of input scenarios considered has been wide-ranging enough. Has a big enough stress been applied in the ‘what-if’ scenarios during the stress-testing exercise? This illustrates that long-range forecasting has a high degree of human involvement – to discuss both inputs and interpret outputs. How successful a long-range forecasting exercise is can depend on how an organization approaches it, and how the human contributions/element are brought into play – the excellent non-technical book, Uncharted by Margaret Heffernan discusses these points in depth – I discovered the book through this review by Tim Harford in the Financial Times.

Figure 2: The different forecasting time horizons

Human involvement

The need for significant human involvement in producing some forecasts may be surprising in this era of Data Science and Machine Learning, but direct human involvement in producing forecasts has a long history. Prior to the development of rigorous time series analysis techniques such as ARIMA, this is to be expected. It is interesting to go back and read old articles such as this 1971 Harvard Business Review article on forecasting in business. Putting aside the very gendered language, it is intriguing to see the emphasis upon judgmental forecasting methods and forecasting by analogy. The value of judgmental methods has been re-discovered over the 15 years or so. Specific techniques such as the Delphi method are still widely used in fields as diverse as public transport planning and health. Techniques such as the Delphi method excel at getting a wide range of opinions and inputs to help reduce the uncertainty in understanding complex, many-layered situations. This is what humans are good at. It is also exactly the situation we often face when making long-range forecasts for complex systems. It is unsurprising then that the ability of humans to handle nuanced, ambiguous and complex scenarios is used in other techniques such as prediction markets and superforecasting.

What can we use forecasting for?

The challenges of making long-range forecasts illustrate that there can be markedly different uses of forecasting. Long-range forecasts are about reducing uncertainty through gaining qualitative and semi-quantitative understanding of what might happen. Short-range forecasts are about quantitative prediction of what we think will happen.

There are also other dimensions that differentiate what we can use forecasting for. Three of these that are worth highlighting are,

Insight vs Prediction: This is illustrated well already above by the contrast between short-range forecasts and long-range forecasts, but it is also applicable to short-range or medium-range forecasts on their own. We can use a medium-range forecasting model to make predictions of what will happen in the future, but also to extract insight from the values of the parameters of that model as to what are the relative influences of the different factor upon that future.

Prediction vs Prescription: In my opening paragraphs I highlighted the Gartner analytic ascendency staircase and how some companies have the data and analytic maturity that enable them to use computational forecasting models they’ve built in a prescriptive way – they are used to determine the optimal course of action as opposed to merely forecasting the current baseline scenario.

Different levels of aggregation: When we build any predictive model we have to decide upon the response variable we are going to model. This typically involves a choice about what level of aggregation we going to use – should we build models of individual units (e.g. consumers), groups of units (e.g. a cohort of customers), or the entire population/collection of units (e.g. the entire customer base of an enterprise)? Generally speaking, we should model at the lowest level of granularity at which we first expect to see a homogenous response over the time-horizon of the forecasting exercise. Think of the example of modelling the future default rate of a loans portfolio; if the portfolio is made up of a heterogenous mix of different customer (loan) segments whose response to economic conditions differs across the segments, then as we change the segment mix of that portfolio we will get very different forecast. Modelling the default rate of each homogenous segment separately will allow us to flex that segment mix when exploring different forecast scenarios, whilst modelling the default rate of the portfolio in a single model will not.

Summary:

Forecasting involves extrapolation into the future. This makes it different to other predictive models you might build – these typically involve interpolation within a training dataset. The granularity at which you model is important. Even more important is the horizon over which you are forecasting. A long-term forecast should only be used for exploring behaviour under a range of hypothetical scenarios, whilst short-term and medium-term forecasts can be used to make detailed predictions about specific and highly likely scenarios. Long-range forecasts inform strategic planning, short and medium-term forecast inform tactical responses. This means that when forecasting we should identify what kind of forecast we want and only then choose our forecasting technique appropriately.

In the next part of this series of blog posts I will cover some of the do’s and don’ts of forecasting. It won’t be about how to use a particular forecast model building technique. It will be about the common mistakes made – including ones I’ve made or I’ve seen made – so that you can avoid them (in the future).

Footnotes

This very recent review by Petropoulos et al (to appear in International Journal of Forecasting) gives both a comprehensive coverage of the different forecasting techniques available and also a comprehensive set of case studies. The case studies illustrate the practice and challenges of forecasting in individual sectors and so touch in part on some of the issues I’ll be discussing. I’ll also be aiming to give broad general advice (not sector specific) on the practice of forecasting.
See for example, Milgrom, Paul and Roberts, John. “The LeChatelier Principle.” American Economic Review, March 1996, 86(1):173-179.
For the purposes of this blog and for simplicity I’m going to ignore the subtle distinction of whether price and marketing variables in demand models are exogenous or endogenous. I’m going to consider them here as exogenous since they are being imposed or set by the marketer or retailer. However, price drives demand and demand drives the price, so it is common to consider price to be an endogenous variable over longer timescales and within the more complex system consisting jointly of the retailer and the consumer.

Using your own algorithms with hyperparameter optimization in AWS SageMaker

On August 9, 2021August 9, 2021 By dchoyleIn AI, Artificial Intelligence, Data Science, Deep Learning, Machine LearningLeave a comment

TL;DR

AWS SageMaker provides a number of standard Machine Learning algorithms in containerized form, so you can pull those algorithms down onto a large EC2 instance and just run, with minimal effort.
AWS SageMaker also provides a hyperparameter optimization functionality that pretty much runs ‘out-of-the-box’ with the algorithms provided.
You can run your own algorithms within SageMaker if you containerize your algorithm code.
I wanted to find out if it was possible to easily combine the ‘run-your-own-containerized-algorithm’ functionality with the ‘out-the-box’ hyperparameter optimization functionality in SageMaker. It is. It was a straight-forward, but slightly lengthy process.

Introduction

<DISCLAIMER> This is a blog-post I started back in Autumn/Winter 2019. I knew it would be a fairly length post but one I was keen to write. But then, well, a pandemic got in the way and its taken a while to get back to writing blog posts. I still believe there are some useful learnings here – I hope you do too </DISCLAIMER>.

Back in 2019 I was using SageMaker a lot, including running an AWS Machine Learning Immersion Day at Infinity Works. One of the things I like about SageMaker is how the resources used to do any heavy lifting in training a model are separated from the resources supporting the Jupyter notebook. The SageMaker service provides several standard Machine Learning algorithms (e.g. Random Forests, XGBoost) in containers. This means it is possible to explore a dataset and develop an modelling approach in a Jupyter notebook that runs on one EC2 instance, and then when we want to scale-up the training process to the full dataset we can pull down the relevant container from ECR and run the training process on a separate much larger instance. Provisioning of heavier infrastructure needed for training on the full large dataset is only done when it is needed and you only pay for what you use of those larger EC2 instances. A Data Scientist like me doesn’t have to worry about the provisioning of the larger EC2 instance, it is handled by through simple configuration options when configuring the training job. It is also possible to configure a hyper-parameter optimization job in a similar way, so that multiple training jobs (with different hyper-parameter values) can be easily run, potentially in parallel, on large EC2 instances just by adjusting a few lines of json config.

So far, so good. As a Data Scientist the pain of getting access to or configuring compute resource has been removed and training on really large datasets is almost as easy as exploring a smaller dataset in a Jupyter notebook running on my local machine. But are we restricted to only using the algorithms that AWS has containerized? This is where it get more interesting and fun. You can use any algorithm that is available in a container in ECS. That means you can develop/code up your own algorithm/training process, containerize it, and then run that algorithm using multiple large EC2 instances with minimal config.

AWS have an example of how to containerize your own algorithm and deploy it to an endpoint. The git repo is here. The AWS team use the example of a scikit-learn decision tree trained on the Iris dataset (I know, why do examples not use something more original than the Iris dataset).

What I wanted to explore was,

How easy was it to actually containerize my own algorithm for use in SageMaker,
How easy was it to combine my containerized algorithm with the easy to configure hyperparameter optimization capability already present in SageMaker.

The rest of this post is about what I learnt in exploring those two questions, in particular the second of those. The first question is essentially already answered by the original AWS repo. What I wanted to learn was could I easily use my own algorithm with the out-the-box hyperparameter optimization functionality that SageMaker provided, or was the easy-to-use hyperparameter optimization functionality essentially restricted to the in-built SageMaker algorithms? What I’ll cover is,

The choice of algorithm we’re going to containerize
The basics of building the Docker container
Pushing the container to the AWS container registry
Using the containerized algorithm within a SageMaker notebook
Running hyperparameter optimization jobs using the containerized algorithm.

If you want to follow the technical details, I would suggest that you first become familiar with the basics of AWS SageMaker – tutorial here. You may also want to look at the basics of hyperparameter tuning for one of the standard machine learning algorithms within SageMaker, as I’ll be assuming some of this background knowledge is known to you or at least you can pick it up quickly – to fully explain all the SageMaker background material would make this an even longer blog. You can find explanations of how to configure and run a SageMaker hyperparameter tuning job here and here.

Now, let’s start with the first of our questions.

Algorithm choice

I wanted to use an algorithm that wasn’t already available within SageMaker, otherwise what would be the point of going through this exercise? I have been doing some work recently on Gaussian Processes (GPs), in particular with kernel functions that are composite functions.

I won’t explicitly cover the basics of GPs here – the blog post is long enough already. Instead I will point you towards the excellent book by Carl Rasmussen and Chris Williams and this tutorial from Neil Lawrence. However, I will say briefly what my interest in GPs is. Gaussian Processes have an interesting connection with large (wide) Neural Networks. This connection was discovered by Chris Williams and Radford Neal. I wrote some GP code, on the basis of the Williams’ paper, that made it into commercial software (my first ever example) back in 1999 (yes – I am that old, and have been working in Machine Learning that long). More recently, the connection has been extended to link Deep Learning Neural Networks and Gaussian Processes (see for example, here and here). Cho & Saul did some nice early work in this area, using dot-product kernels that are composite functions. It is the dot-product kernels derived by Cho & Saul that I’ll use here for my example algorithm, as the kernels are of relatively simple form, and yet are specified in terms of a few simple parameters that we can regard as hyper-parameters. For the purposes of this blog on AWS SageMaker it is not important to know what the Cho & Saul kernels might represent, merely how they are defined mathematically. So let’s start there,

For this illustration we are focusing on datapoints on the the surface of the unit hypersphere, i.e ${\bf x} \in \mathbb{R}^{d}$ with $||{\bf x}||_{2}^{2}\;=\;1$ . We then consider a set of kernels, $K_{q,l}\left ( {\bf x}_{1}, {\bf x}_{2} \right)$ , defined via,

$K_{q,l}\left ({\bf x}_{1}, {\bf x}_{2} \right )\;=\; k_{q,l}\left ( {\bf x}_{1}\cdot {\bf x}_{2} \right )$

The dot-product kernels $k_{q,l}(t)$ are defined iteratively,

$k_{q,l+1}(t)\;=\;k_{q,0}\left ( k_{q, l}(t)\right )$

The base kernels $k_{q,0}(t)$ are constructed from,

$k_{q,0}(t)\;=\; J_{q}\left ( \arccos (t)\right ) / J_{q}\left ( 0\right )$

with,

$J_{q}\left ( \theta \right )\;=\;(-1)^{q}\left (\sin\theta\right)^{2q+1}\left ( \frac{1}{\sin\theta}\frac{\partial}{\partial \theta}\right )^{q} \left ( \frac{\pi-\theta}{\sin\theta}\right )$

Choosing a particular kernel then boils down to making a choice for q and l. Once we have made choice of kernel, we can train our model. For simplicity, I have defined the model training here to be simply the process of constructing the Gram matrix from the training data, i.e. the process of calculating the matrix elements,

$M_{ij} = \sigma^{2}\delta_{ij}\;+\;K_{q,l}\left ( {\bf x}_{i}, {\bf x}_{j}\right)$

Here, σ² is the variance of the additive Gaussian noise that we consider present in the response variable, and ${\bf x}_{i}\;,\; i=1,2,\ldots,N$ , are the feature vectors for the N datapoints in the training set. Along with the training feature vectors we also have the response variable values, $y_{i}$ .

Whilst it may not match the more traditional concept of model training – there is no iterative process to minimize some cost function – I am using the training data to construct a mathematical object required for calculating the expectation of the response variable conditional on the input features. Within a Gaussian Process it is considered usual to optimize any parameters of the covariance kernel as part of the model training. In this case, for simplicity, and for purposes of illustrating the hyperparameter tuning capabilities of SageMaker, I wanted to consider the kernel parameters q,l and σ² as hyperparameters, essentially leaving no remaining kernel parameters to be optimized during the model training.

Once we have the matrix ${\bf M}$ defined, we can calculate a prediction for the response variable at a new feature vector ${\bf x}_{\star}$ via the formula,

$\mathbb{E}\left ( y\left ( {\bf x}_{\star}\right )\right )\;=\;{\bf v}\left ( {\bf x}_{\star}\right )^{\top}{\bf M}^{-1}{\bf y}\,\,$

where ${\bf y}$ is the vector of response values in the training set, and the vector ${\bf v}\left ( {\bf x}_{\star} \right )\;=\; (v_{1}, v_{2}, \ldots, v_{N})$ , with the element $v_{i}\left ( {\bf x}_{\star}\right )$ given by,

$v_{i}\;=\; k_{q,l}\left ( {\bf x}_{\star}\cdot {\bf x}_{i}\right )$

Now we have given the mathematical definition of our algorithm, we need to focus on code. Following the example in the original AWS repo we need python code that,

Defines a class for a trained GP model. I have called my class, unsurprisingly, trainedGPModel . Instantiating an instance of this class by passing the training data to the class constructor method, runs the Gram matrix calculation process mentioned earlier. Within my trainedGPModel class I also have a method predict(xstar) that returns the predicted expectation of the response variable given an input datapoint xstar. The code for the trainedGPModel class implements the linear algebra formulae given above and so is straight-forward.
We also need code that runs the training process. This code is held in a file called train. I made minimal modifications to the train module in the original AWS repo. The main change I made was including code to make predictions on a validation dataset, and from that calculating the Root-Mean-Squared-Error (RMSE) on the validation dataset. The validation RMSE is the metric I will use for hyperparameter tuning and so I have to write the validation RMSE value to stdout so that it can get picked up by the SageMaker hyperparameter tuning process. I had to write the RMSE value with a string prefix and delimiter, e.g.

print( "validation:RMSE=" + str(RMSE_validation) + ";" )

with a corresponding matching regex in the configuration of the hyperparameter tuning job – see later section on running the containerized algorithm in a SageMaker notebook. It wasn’t obvious that I needed to write the validation metric in this way, and it took a bit of googling to work out. Most SageMaker links on hyperparameter tuning point to this page , but the detail on how the metric is passed between your algorithm code and the SageMaker hyperparameter optimization code is actually explained in this SageMaker documentation page.

Docker basics

Now let’s talk about putting our code in a container. We need to construct a Docker compose file. For a refresher on Docker I found this tutorial by Márk Takács to be really helpful. I actually use a Windows machine for my work, so I’m running Docker Desktop. However, I also use WSL (Windows Subsystem for Linux) for when I want a linux like environment. Although you can install a Docker client under WSL, you still have to make use of the native Docker daemon of Docker Desktop. I found this guide from Nick Janetakis on getting the WSL Docker client working with Docker Desktop invaluable, particularly the configuring of where WSL mounts the Windows file system (by editing the /etc/wsl.config file) so that I can then easily mount any sub-directory of my Windows file system to any point I choose in the container image when testing the Docker file locally.

I won’t go through the aspects of testing the container locally – you can read the original AWS repo to see that. Instead we’ll just go through the Docker file for building the final SageMaker container. The Docker file is fairly simple and other that changing it to use a python3 runtime (see lines 9&10) we have not changed anything else in the Docker file in the original AWS repo. Line 36 of the Docker file is where we copy across our algorithm code into pre-specified directory in the image that SageMaker will look for when running the containerized algorithm.


# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:18.04

RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
python3 \
python3-pip \
nginx \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN pip3 install numpy scipy scikit-learn pandas flask gevent gunicorn &amp;amp;&amp;amp; \
(cd /usr/local/lib/python3.6/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
rm -rf /root/.cache

RUN pip3 install setuptools

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set up the program in the image
COPY gaussian_processes /opt/program
WORKDIR /opt/program

Pushing the container to AWS

We can now push our Docker container to the AWS ECR (Elastic Container Registry). This is simple using the AWS CLI (command line interface) and the build_and_push.sh shell script provided in the original AWS repo. Within the shell script we have just modified on lines 16 and 17 the name of the top-level directory in which our training and prediction code resides,

image=$1

if [ "$image" == "" ]
then
    echo "Usage: $0 "
    exit 1
fi

chmod +x gaussian_processes/train
chmod +x gaussian_processes/serve

Then we just run shell script, passing the name of the container we have just built as a command line argument,

./build_and_push.sh gpsagemaker

After running the shell script we can see the container present in the AWS ECR,

Screenshot of our Gaussian Process SageMaker Docker container in AWS Elastic Container Registry (ECR) – ready to use within a SageMaker notebook.

Using the containerized algorithm in SageMaker

Now we have the container, that has our GP code, in AWS ECR we can use it within a SageMaker notebook. Let’s do so. For this I’m just going to adapt the notebook within the original AWS repo. I go to the Sagemaker under ‘ML’ in the list of AWS services and from there I can start/create my SageMaker notebook instance. Once the notebook instance is ready I can open up a Jupyter notebook as usual,

The first main difference is that we’ll create some simple small-scale simulated training and validation data. Our goal here is to test how easy it is to containerize and use our own algorithm, not build a perfect model. Our generative model is a simple one – a linear model, dependent on just two features (with coefficients that we have chosen as 1.5 and 5.2 respectively). We use this simple model to create the response variable values and then add some Gaussian random noise (of unit variance).


# create training and validation sets
nTrain = 100
X_train = np.random.randn( nTrain, 2 )
y_train = (1.5 * X_train[:, 0]) + (5.2*X_train[:,1]) + np.random.randn( nTrain )
y_train.shape = (nTrain, 1)
data_train = np.concatenate( (y_train, X_train), axis=1)
df_data_train = pd.DataFrame( data_train )

nValidation = 50
X_validation = np.random.randn( nValidation,2 )
y_validation = (1.5 * X_validation[:, 0]) + (5.2*X_validation[:,1]) + np.random.randn( nValidation )
y_validation.shape = ( nValidation, 1 )
data_validation = np.concatenate( (y_validation, X_validation), axis=1)
df_data_validation = pd.DataFrame( data_validation )

We then specify our account details and also the image that contains our Gaussian Process algorithm.


account = boto3.client('sts').get_caller_identity()['Account']
region = boto3.session.Session().region_name
image = '{}.dkr.ecr.{}.amazonaws.com/gpsagemaker:latest'.format(account, region)

The next cell in our notebook then uploads the training and validation data to our s3 bucket,


# write training and validation sets to s3
from io import StringIO # python3; python2: BytesIO 
import boto3

bucket = mybucket

# write training set
csv_buffer = StringIO()
df_data_train.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket).Object('train/train_data.csv').put(Body=csv_buffer.getvalue())
csv_buffer.close()

# write validation set
csv_buffer = StringIO()
df_data_train.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket).Object('validation/validation_data.csv').put(Body=csv_buffer.getvalue())
csv_buffer.close()

Running a single training job

So first of all let’s just configure and run a single simple training job. Note the validation metric being specified along with the regex.


create_training_params = \
{
    "RoleArn": role,
    "TrainingJobName": job_name,
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File",
        "MetricDefinitions":[{"Name":"validation:RMSE",
                              "Regex":"validation:RMSE=(.*?);"    
        }]
    },

We also set values for the hyperparameters, which are static since we are just running a single training job and not doing any hyperparameter optimization yet.


    "HyperParameters": {
        "q":"0",
        "l":"2",
        "noise":"0.1"
    },

We can then run a training job using our containerized Gaussian Process code, just as we would any other algorithm available in SageMaker. We can see the training job running in the AWS Management console – click under “Training jobs” on the left hand side of the console. We can see the current training job ‘in progress’ and also an earlier completed training job that I ran.

Screenshot of a single SageMaker training job running using our GP algorithm code.

Running a hyperparameter tuning job

So that appears to run ok. So now we have our algorithm running in SageMaker ok, we can now just configure the SageMaker hyperparameter optimization wrapper and run one of the out-of-box SageMaker hyperparameter optimization algorithms over what we have specified as hyperparameter in our Gaussian Process code. The config for the hyperparameter tuning job is below – we have largely just modified slightly the examples in the original AWS repo and also followed the guidance. You can see that we have specified the RMSE metric on the validation set as the metric to optimize with respect to the hyperparameters. For illustration purposes we have specified that we want to optimize only over the q and l hyperparameters. The σ² hyperparameter we have kept static at σ²=0.1. You can also see that we have specified to run 10 training jobs in total, i.e. we will evaluate the validation metric at 10 different combinations of the two hyperparameters, but we only run 3 training jobs in parallel at any one time.


# Define HyperParameterTuningJob
# We will only tune the learning rate by maximizing the AUC value of the 
# validation set. The hyperparameter search is a random one, using a sample of
# 10 training jobs - better methods for searching the hyperparameter space are 
# available, but for simplicty and demonstration purposes we will use the 
# random search method. Run a max of 3 training jobs in parallel
job_name = "gpsmbyo-hp-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
response = sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name,
    HyperParameterTuningJobConfig={
        'Strategy': 'Random',
        'HyperParameterTuningJobObjective': {
            'Type': 'Minimize',
            'MetricName': 'validation:RMSE'
        },
        'ResourceLimits': {
            'MaxNumberOfTrainingJobs': 10,
            'MaxParallelTrainingJobs': 3
        },
        'ParameterRanges': {
            'IntegerParameterRanges': [
            {
              "Name": "q",
              "MaxValue": "4",
              "MinValue": "0",
              "ScalingType": "Auto"
            },
            {
              "Name": "l",
              "MaxValue": "4",
              "MinValue": "1",
              "ScalingType": "Auto"
            }    
        ]}
    },
    TrainingJobDefinition={
        'StaticHyperParameters': {
            "noise":"0.1"
        },
        'AlgorithmSpecification': {
        'TrainingImage': image,
        'TrainingInputMode': "File",
        'MetricDefinitions':[{"Name":"validation:RMSE",
                              "Regex":"validation:RMSE=(.*?);"
                             }]
        }

If we then look at our AWS console (screenshot below) we can see the hyperparameter tuning job running, along with previous completed tuning jobs.

Screen shot of AWS console showing current and previous hyperparameter tuning jobs.

We can also see the individual training jobs, corresponding to that tuning job, running (screenshot below). Remember that the hyperparameter tuning job is just a series of individual evaluations of the validation metrics, run at combinations of (q,l) specified the tuning algorithm. From the screenshot we can see that there are 3 training jobs running, in accordance with what we specified in the tuning job config.

Screenshot of the 3 training jobs running as part of the hyperparameter tuning job.

Once the tuning job has completed, we can retrieve the validation metric values for the 10 different hyperparameter combinations that were tried, to see which combination of q and l gave the smallest RMSE on the validation set.

Summary

The two questions I was trying to address were,

How difficult is it to create your own algorithm to use in SageMaker?
How easy is it to use the hyperparameter optimization algorithms available in SageMaker with your new algorithm?

The answer to both questions is, “it is a relatively easy but lengthy process”. That it is a lengthy process is understandable – SageMaker gives you a functionality to apply out-the-box hyperparameter tuning on an algorithm/code that it knows nothing about until runtime. Therefore there has to be a lot of standardized syntax in specifying how that algorithm is structured and called as a piece of code. Fortunately, all the details of how to structure your algorithm and create the Docker container are in the excellent example given in the AWS repo and the documentation. The only complaint I would have is that it would be good for the repo to have an example showing how your own algorithm can utilize the hyperparameter optimization functionality of SageMaker – hence this blog. Working out the few remaining steps to get the hyperparameter optimization working with my Gaussian Process code was not very difficult, but not easy either.

The example algorithm I have chosen is very simplistic – the training process literally only involves the calculation and inversion of a matrix. A full training process could involve optimization of, say, the log-likelihood with respect to the parameters of the kernel function, but explaining the extra details would make this blog even long. Secondly, we only needed a simple/minimal training process to address the two questions above. Likewise, we have not illustrated our new trained algorithm being used to serve predictions – this is very well illustrated in the original repo and I would not be adding any new with my Gaussian Process example.

The different types of data I have encountered as a Data Scientist

On March 31, 2021January 12, 2022 By dchoyleIn Data Management, Data Science1 Comment

TL;DR

Where Data or Data Science is the business model or primary purpose of an organization you can expect data and the data eco-system to be properly invested in. In scientific research and industrial settings this will be common. In the commercial world this is not always the case. As a Data Scientist in the commercial world you should learn to ask, ‘Does this company need data? Does this company actually need Data Science?’ If the answers are not a resounding yes, then beware. Sometimes you won’t be able to answer those questions until you’re up close and inside the organization, but there are indicators that suggest upfront whether an organization is likely to have good quality data and a functioning data eco-system, or whether it will be a pile of trash. In the long read below I outline the different kinds of data I’ve encountered across the commercial and non-commercial sectors I’ve worked in, and what signs I’ve learnt to look for.

Long version

Over 20 years ago, as I was just starting research in the Bioinformatics field, a colleague explained to me the challenges of working with high-throughput biological experimental data. The data, he commented, was typically high-dimensional and very messy. “Messy” was new to me. With a PhD in Theoretical Physics and post-doctoral research in materials modelling, the naïve physicist in me thought, ‘…ah…messy…you mean Gaussian errors with large variance”. I was very wrong.

I learnt that “messy” could mean a high-frequency of missing data, missing meta-data, and mis-labelled data. Sometimes the data could be mis-labelled at source because the original biological material had been mis-labelled – I remember one interesting experience having to persuade my experimental collaborator to go back to the -80°C freezer to check the lung tissue sample because a mis-labelled sample was the only remaining explanation for the anomalous pattern we were seeing in a PCA plot, and yes something had gone wrong in the LIMS because the bar-code associated with the assay data did not match the bar-code on the sample in the freezer.

However, as I moved from the academic sphere into the commercial realm (10 years in the commercial sector now in 2021), I’ve learnt that data errors and issues can be much more varied and challenging, and varies from sector to sector, from domain to domain. BUT…. I have seen that there are broad classes that explain the patterns of data issues I have experienced over 30yrs of mathematical and statistical modelling. In this piece I am going to outline what those broad classes and patterns are, and more importantly how to recognize when they are likely to arise. The broad classes of data are,

Scientific or Experimental – here, data is collected for the purposes of being analysed, and so therefore is optimized towards those purposes.
Sensor data – here, data is collected for the purposes of detecting signals, possibly diagnostic, i.e., working out what the cause of a problem is, or being monitored, i.e., detecting a problem in the first place. It is intended primarily to be automatically monitored/processed, but not necessarily analysed by a human (Data Scientist) in the loop. The data lends itself to large-scale analysis but may not be in a form that is optimal or friendly for doing data science.
Commercial operational data – this is data that is stored, rather than actively collected. It is stored, usually initially on a temporary (short-term), for the purposes of running the business. It could be server event data from applications supporting online platforms or retail sites, or customer transaction data, Google ad-click data, marketing spend data, or financial data.

It is this last category that is perhaps the most interesting – quite obviously if you are a Data Scientist working in a commercial sector. For commercial operational data, not only can the data contain the usual errors but there can be a range of additional challenges – columns/fields in tables containing completely different data than they are supposed to because a field has been over-loaded, or re-purposed for a different need without going through a modification/re-design of the schema. Without an accurate updated data-dictionary, this knowledge about the true content of the fields in a table resides as, ‘folk knowledge’ in the heads of the longer serving members of staff – until they leave the company, and that knowledge is lost.

I have my own favourite stories about data issues I have encountered in various organizations I have worked for – such as the time I worked with some server-side event data whose unique id for each event turned out only to be unique for a day, because the original developer didn’t think anyone would be interested in the data beyond a day.

Every commercial data scientist will be able to tell similar ‘war stories’, often exchanged over a few beers with colleagues after work. Such war stories will not go away any time soon. The more important point is how do we recognize the situations where they are more likely to occur? The contrast I have already highlighted above, between the different classes of data, gives us a clue – where data is not seen as the most valuable part of the process, nor the primary purpose of the process, it will be relatively less invested in and typically of lower quality.

For example, in the first two categories, and in the scientific realm in particular, data is generated with the idea, from the outset, that it is a valuable asset; the data is often the end itself. For example, the data collected from a scientific experiment provides the basis for scientific insight and discoveries, and potentially future IP; the data collected to monitor complex technical systems saves operating costs by minimizing downtime of those systems and potentially identifying improved efficiencies. Poor quality data, or inadequate systems for handling data has an immediate and often critical impact upon the primary goal of the organization or project, and so there is an immediate incentive to rectify errors and data quality issues and to invest in systems that enable efficient capture and processing of the data.

Some commercial organizations also effectively fall into these first two classes. For companies where the data and/or data science is the product from the outset, or the potential value of a data stream has been realized early on, then there is an incentive to invest in efficient and effective data eco-systems, with the consequence benefit in terms of data quality – if sufficient investment in data and data systems is not made, a company whose main revenue stream is the data will quickly go out of business.

In contrast, for most commercial organizations, the potential value of commercial data may be secondary to its original purpose, and so data quality from the perspective of these secondary uses may be poor, even if the value or future revenue streams attached to these secondary uses may be much greater than the original primary use of the data. For such companies, the general importance of data may be recognized, but that does not mean that data eco-systems are getting the right sort of investment. For example, for a company providing a B2C platform, data on the behaviour of consumers can provide potentially actionable and important insight, but ultimately it is the platform itself (and its continued operation 24/7) that is of prime importance. Similarly, for an online retail site, the primary concern is volume of transactions and shipping. Consequently, for these organizations the data issues that arise are richer, more colorful, and often more challenging. This is because poor quality data and systems do not immediately threaten the viability of the business, or main revenue streams. Long-term, yes, there will be an impact upon UX and overall consumer satisfaction, but many operations are happy to take that hit and counter it by attempting to increase transaction volume.

For these organizations it may be recognized that data is important enough for capital to be spent on data eco-systems, but that capital investment may be poorly thought through or not joined-up. Again, there are several symptoms I have learnt to recognize as red flags. Dividing these into symptoms related to strategy and tactical related symptoms, they are,

Strategy related symptoms:
1. Lack of a Data Strategy. The value of data, even the potential of value has not been recognized or realistically thought through. Consequently, there is unlikely to be any strategy underpinning the design of the data eco-system.
2. Any ‘strategy’ is at best aspirational, being driven top-down by an Exec. The capital investment is there, but no joined up plan. The organization sets up a Data Science function because other organizations are – typified by hiring of a ‘trophy Data Scientist’. Dig deeper and you will find no real pull from the business areas for a Data Science function. Business areas have been sold Data Science as a panacea, so have superficially bought into the aspirational strategy – who in a business function would not agree when told that the new junior Data Science hires will analyse the poorly curated or non-existent data and will come up with new product ideas that have a perfect market fit and will be coded into a fully productionized model. I have seen product owners who have been sold a vision and believe that they can now just put their feet up because the new Data Scientist will do it all.
3. The organization cannot really articulate why it needs a Data Science function when asked – this is one of my favourites. Asking ‘why’ here is the equivalent of applying the ‘Five Whys’ technique at the enterprise-level rather than at a project-level. Many companies may respond to the question with the cliched answer, ‘because we want to understand our customers better, which will help us be more profitable’. Ok, but where are the examples of when not understanding this particular aspect your customers has actually hurt the business. If there is no answer given, you know that the need for an machine learning based solution to this task is perceived rather than evidenced.
4. A mentality exists that data is solely for processing/transacting, not analysing – hence there is little investment in analytics platforms and tooling, resulting in significant friction in getting analytics tooling close to data (or vice-versa). Possibly there is not even a recognition that there is distinction between analytics systems and reporting systems. Business areas may believe that only data querying skills and capability are needed, so the idea that you want to make building of predictive models as frictionless as possible for the Data Science team, is an alien one.
5. Lack of balance in investments between Data Engineering and Data Science – this is not optimal whichever function gets the bigger share of the funding. This is often mirrored by a lack of linking up between the Data Science team and the Data Engineering team, sometimes resulting in open hostility between the two.
6. Lack of Data Literacy in business area owners or product owners. Despite this, organizations often have a belief that they will be able to roll-out a ‘self-serve’ analytics solution across the enterprise that will somehow deliver magical business gains.
Tactical related symptoms:
1. Users in business areas are hacking together processes, which then take on a life of their own. This leads to multiple conflicting solutions, most of which are very fragile, and some of which get re-invented multiple times.
2. Business and product owners ask for Data Science/Machine Learning driven solutions but can’t answer how they are going to consume the outputs of the solution.
3. The analytics function, whether ad-hoc or formal, primarily focuses on getting the numbers ready for the month-end report. These numbers are then only briefly discussed, or ignored by senior management, because they conflict with data from other ad-hoc reports built from other internal data sources, or they are too volatile to be credible, or they conflict with the prior beliefs of senior management.
4. Business and product owners consider it acceptable/normal to use a Data Analyst or Data Scientist for, ‘can you just pull this data for me’ tasks.

As a Data Scientists what can you do if you find yourself in a scenario where these symptoms arise? There are two potential ways to approach this,

How do we respond to it – what can we do reactively? Flag it up – complain when systems don’t work and highlight when it they do. Confession here – I’ve not always been good at complaining myself, so this is definitely a case of me recognizing what I should have done, not what I did. On the less political front, you can try and always construct your analytical processes to minimize the impact of any upstream processes not under your control. By this I mean making downstream analysis pipelines more robust by building in automated checks for data quality and domain specific consistency. Where this is not possible, then stop or refuse to work on those processes that are so impacted as to be worthless. This last point is about making others share in your pain or shifting/transferring the pain elsewhere – onto business/product owners, from insight to operational teams. If other teams also experience the pain of poor-quality data, or poorly designed data-ecosystems and analytical platforms, then they have an incentive to help you try and fix the issues. Again, on a less political front, make sure expectations take into account the reality of what you have to deal with. Get realistic SLAs re-negotiated/agreed that reflect the reality of the data you are having to work with – this will protect you and highlight what could be done if the data were better.

How can we prevent it – what can we do proactively? The most effective way for any organization to avoid such issues is to have a clear and joined up Data Strategy and Analytics Strategy. It is important for an organization to recognize that having a Data Strategy does not mean it has an Analytics Strategy. Better still, an organization should have a clear understanding of what informed decisions underpin the future operations of its business model. Understanding this will drive/focus the identification of the correct data needed to support those decisions and hence will naturally lead (via pull) to investment in high-quality data capture processes and efficient data engineering/data analysis infrastructure. Formulating the Data Strategy may not formally be part of your brief, but it is likely that, as a Data Scientist and therefore end-user, you can influence it. This is particularly important in order to make sure that the analytical capability is a first-class and integrated citizen of any Data Strategy. This can be done by giving examples of the cost/pain when it is not so. Very often, a CDO or CTO formulating the Data Strategy will come from an Engineering background and so the Data Strategy will be biased towards Ops. Any awareness of a need for the Data Strategy to support analytics will be focused on near-term analytics, e.g., supporting real-time reporting capabilities. The need to for the Data Strategy to support future Ops by supporting insight, innovation and discovery capabilities may be omitted. If a Data Science function is considered, it will be as a bolt-on – possibly a well-funded Data Science bolt-on, but still a bolt-on.

Demand Forecasting at Amazon

On July 17, 2019 By dchoyleIn Data Science, Demand Forecasting, Machine LearningLeave a comment

This is a post that I’ve been meaning to write for a while. Having worked on demand forecasting in the past, I was intrigued when I saw this paper posted on the arXiv pre-print archive from one of the research teams at Amazon.

Although it was obvious why Amazon would be interested in forecasting demand, I was intrigued that Amazon chose to use a state space model approach. About 6 months later I attended the ISBIS2018 conference, at which Lindsay Berry from Duke University presented this paper that also used a state-space model approach to modelling the demand for supermarket goods. I also subsequently became aware of this technical report from Ivan Svetunkov at Lancaster University’s Management School.

With three pre-prints on demand forecasting that all utilised a state-space modelling approach I thought it would be interesting to do a post summarizing the work from the Amazon team. I may get round to doing a further post on the other two pre-prints at a later date.

At this point it is worth explaining a bit about demand models in general. Demand models are statistical models that are usually built from 2-5yrs worth of historical sales data. A demand model enables us to forecast how many units of a product we will sell given the price of the product and a number of other variables. The models allow us to forecast a number of ‘what-if’ scenarios, e.g. what will happen to my sales if I reduce the product price by 20%? Ultimately, the demand model can enable us to determine what is the optimal price we should charge for a product depending on what business KPI we want to optimize. Some traditional approaches to demand modelling use a log-log model, with the log of the demand of an item being linear in the log of the price of the item. These models are of the Working-Leser type of demand models^1,2. For goods with large demand volumes, a log-log model form will be a reasonable assumption as the natural quantum of demand (a single unit) is much smaller than the typical level of demand and so we can consider the demand of such goods as effectively being continuous random variables.

The actual problem the Amazon team tackled was forecasting of intermittent demand, i.e. for what are commonly called, ‘slow-moving goods’, whose pattern of sales are bursty. We might typically only sell, on average, less than say 10 units a week for these kinds of products. There may be no sales for a week or two, followed by a burst of sales concentrated in a few days. Snyder et al give a good modern review of the problem of intermittent demand forecasting³.

For such products, the traditional log-log type demand models can perform poorly, as we are dealing with products that sell only a few units per time period. However, there is no consensus approach to modelling such products, but this means it is an area ripe for novel and innovative methods. The paper by Seeger et al combines 3 interesting ideas,

A multi-stage model – this means decomposing the modelling of demand into several models that cover different demand sizes. In this case separate models are constructed for when the expected demand is 0 units, 1 unit, and >1 unit.
The combining of the multi-stage model with a state-space model. This has the effect of introducing exponential smoothing and hence some temporal continuity to the modelled demand.
The use of a Kalman-filter approach to location of the mode when using a Laplace approximation to approximate a marginal posterior. This third innovation is the most technical, but, for me, also the most interesting.

The first of these innovations is not necessarily that much of a step-change. Other attempts to model slow-moving goods have also considered a mixture of distributions/processes to allow for the zero-inflation that one has in the weekly observed sales of a slow-moving good. Seeger et al use a three stage model, so that we have 3 latent functions,

$y_{t}^{(0)}(x)$ , which is used in modelling the probability of zero sales at time point t

$y_{t}^{(1)}(x)$ , which is used in modelling the probability of a single unit being sold at time point t

$y_{t}^{(2)}(x)$ , which is used in modelling the distribution of units sold at time point t, given the number of units is greater than 1.

The second innovation is an interesting one. Whilst I had come across the use of self-excitation (Hawkes processes) to model the bursty behaviour of intermittent demand, I hadn’t seen temporal continuity enforced via a latent state contribution to the linear predictors of the mixture components. For demand greater than a single unit Seeger et al model the demand ${z_{t}}$ at time point t as following a Poisson distribution,

$P\left ( z_{t}-2 | y^{(2)}_{t}\right )\;=\; \frac{1}{(z_{t}-2)!}\lambda( y^{(2)}_{t} )^{z_{t}-2}\exp\left ( -\lambda ( y^{(2)}_{t} )\right )\;\;.$

Here $\lambda(\cdot)$ is a transfer function. The latent function $y^{(2)}_{t}$ depends upon a latent state ${\boldsymbol l}_{t}$ and it is this latent state that is governed by a Kalman filter. Overall the latent process is,

$y^{(2)}_{t}\;=\; {\boldsymbol a}^{\top}{\boldsymbol l}_{t-1}\;+\; b_{t}\;\;,\;\;b_{t}\;=\;{\boldsymbol \omega}^{\top}{\boldsymbol x}_{t}\;\;,\;\;{\boldsymbol l}_{t}\;=\;{\boldsymbol F}{\boldsymbol l}_{t-1}\;+\;{\boldsymbol g}_{t}\epsilon_{t}\;\;,\;\;\epsilon_{t}\sim N(0,1)$

The latent variables $\epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0}$ have to be integrated out to yield a marginal posterior distribution that can then be maximized to obtain parameter estimates for the parameters than control the innovation vectors ${\boldsymbol g}_{t}\;,t=1,\ldots,T-1$ .

It is the marginalization over $\epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0}$ that the third interesting technical innovation of Seeger et al is concerned with. The integration over $\epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0}$ is approximated using a Laplace approximation. The Laplace approximation simply replaces the exponent of the integrand by its second order Taylor expansion approximation, in order to approximate a complicated integration by an integration of a Gaussian. It is the simplest of a family of saddlepoint expansion techniques for obtaining asymptotic expansions of integrals (see for example the classic book by Wong).

The main task in a Laplace approximation is locating the maximum of the exponent of the integrand. Seeger et al do this via a Newton-Raphson procedure, i.e. expand the exponent to second order around the current estimate of the mode and then find the maximum of that second order approximation.

Consider a 1-dimensional example. Let $q(x)$ be the function whose maximum, $x_{*}$ , we are trying to locate. If the expansion of $q(x)$ around our current estimate ${\hat x}_{*}$ of $x_{*}$ is,

$q(x) \;=\; q( {\hat x}_{*} )\;+\; ( x - {\hat x}_{*}) q^{(1)}({\hat x}_{*})\;+\; \frac{1}{2}(x- {\hat x}_{*} )^{2}q^{(2)}({\hat x}_{*})\;+\; O\left ( (x-{\hat x}_{*})^{3}\right )$

The updated estimate of $x_{*}$ is then determined by maximizing the second order expansion above, and is given by,

${\hat x}_{*} \rightarrow {\hat x}_{*} \;-\; \frac{q^{(1)}( {\hat x}_{*})}{q^{(2)}( {\hat x}_{*})}$

The schematic below depicts how the iterative Newton-Raphson procedure locates the maximum of a one-dimensional function.

The multi-dimensional equivalent update rule when we are maximizing a function $q({\boldsymbol x})$ of a vector ${\boldsymbol x}$ is,

$\hat{\boldsymbol x}_{*} \rightarrow \hat{\boldsymbol x}_{*} \;-\; {\boldsymbol H}^{-1}( \hat{\boldsymbol x}_{*})\,\nabla q (\hat {\boldsymbol x}_{*} )\;\;,$

where ${\boldsymbol H}( \hat{\boldsymbol x}_{*})$ is the Hessian of $q({\boldsymbol x})$ evaluated at $\hat{\boldsymbol x}_{*}\;\;.$

As Seeger et al are marginalizing the posterior over $\epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0}$ , the Taylor expansion around any point is necessarily multi-variate, and so ordinarily, finding the maximum of that second order approximation would involve inverting the Hessian of the log-posterior evaluated at the current estimate of the mode. As the latent variables we are trying to marginalize over by doing the Laplace approximation are the $T-1$ innovations, $\epsilon_{1},\;\ldots\;, \epsilon_{T-1}$ and ${\boldsymbol l}_{0}$ , this means that each step of the Newton-Raphson procedure would involve the inversion of a $T\times T$ matrix, i.e. an $O\left(T^{3}\right)$ operation for each Newton-Raphson step. However, Seeger et al point out that once we have replaced the log-posterior by a second-order approximation, finding the maximum of that approximation is equivalent to finding the posterior mean of a linear-Gaussian state-space model, and this can be done using Kalman smoothing. This means in each Newton-Raphson step we need only run a Kalman filter calculation, an $O\left( T \right)$ calculation, rather than a Hessian inversion calculation which would be $O\left(T^{3}\right)$ . When training on, say, 2 years of daily sales data with $T=730$, the speed-up will be significant. Seeger et al do point out that this trick of reducing the computation to one that scales linearly in $T$ is already known within the statistics literature⁴, but not widely known within machine learning.

Seeger et al apply their methodology to a number of real-world scale datasets, for example to a ~40K item dataset with almost a year of historical data at the day-level. Overall run-times for the parameter learning are impressive (typically a few seconds for each of the separate demand model stages), though admittedly this is when running on a 150 node Spark cluster.

References

Working, H. (1943) Statistical laws of family expenditure. J. Am. Statist. Ass., 38:43–56.
Leser, C. E. V. (1963) Forms of Engel functions. Econometrica, 31, 694–703.
Snyder, R., Ord, J., and Beaumont, A. (2012), Forecasting the intermittent demand for slow-moving inventories: A modelling approach. International Journal on Forecasting, 28:485-496
Durbin, J. and Koopman, S. (2012), Time Series Analysis by State Space Methods. Oxford Statistical Sciences. Oxford University Press, 2nd Edition.

Defensive & Offensive Predictive Analytics

On September 17, 2018 By dchoyleIn Data Management, Data ScienceLeave a comment

Harvard_business_review

This article from the Harvard Business Review was a short but interesting read. The article talks about defensive and offensive data management strategies – defensive being about minimizing downside risk, and being more common in regulated industries, whilst offensive data management strategies focus on supporting business objectives and are more common in un-regulated or less regulated industries. The authors rightly point out that the correct mix of defensive and offensive strategies will be specific to each individual organization.

Having worked in a number of commercial sectors, my experience is that the use of predictive analytics within organizations also divides into defensive and offensive activities, irrespective of whether that predictive analytics and data science activity is enabled by a well thought out data management strategy. There are good reasons for this, and again it is largely determined by whether or not the activity of an organization carries a large downside risk.

Consider a company, such as a bank, whose activity has a large downside risk, and where losses due to bad loans can almost wipe out a balance sheet very quickly. My experience of doing analytics in a UK retail bank is that the predictive analytics focus is on modelling that downside risk with a view to understanding it, forecasting it and ultimately mitigating against it. The analytics effort focuses on risk minimization (still an optimization), whilst optimization of the profit side of the P&L is less computationally based, e.g. by committees of human subject matter experts deciding mortgage or savings rates for the coming year.

In contrast, in companies where the downside risk is lower, such as those where transactions with the organization’s customers are on much shorter timescales than a bank, then the use of predictive modelling tends to focus more on the optimization of revenue and profits, rather than minimization of losses from liabilities. Take grocery supermarkets, where predictive demand models are used to set product prices in order to optimize profit. Whilst getting the pricing strategy wrong will impact revenues, it does not lead to the organization holding a long-term liability and is ultimately reversible. Mistakes when using predictive models in this domain are unlikely to take a company down.

From what I have seen the use of predictive modelling within a business is typically almost binary, i.e. either predominantly on the downside risk, or predominantly on optimizing the business objectives, even though most businesses will have both upsides and downsides to their activity. I haven’t seen that many medium-scale organizations where predictive modelling is used at a mature level across the majority of the business areas or tasks. Even rarer in my experience are situations where predictive modelling of both downside risks and business objectives are done concurrently with the optimization taking into account both sides of the P&L. It would be interesting to find good examples outside, say the largest 50 companies in the FTSE100, DowJones, Nasdaq, or S&P500, where a more joined up approach is taken to using predictive analytics for optimizing the P&L.

Practical PCA

On February 19, 2018 By dchoyleIn Data Science, Dimensionality Reduction, PCALeave a comment

PCA_Schematic

Principal Component Analysis (PCA) is a commonly applied algorithm in statistics and data science. Because it is so easy to understand at a high-level, and because it is so easy to apply, PCA has become ubiquitous. It is often applied without much thought and with the output rarely questioned. Typically, the questions I ask when applying PCA are,

Do I need to do any transformation of the data before applying PCA?
How many principal components should I select?
How do I interpret the loadings (and scores)?

Unfortunately, I have seen a number of talks and presentations recently where PCA has been used and the impact on the analysis of not having thought about these questions was clear. Ok, I’m biased, as the behaviour of PCA, particularly when applied to high-dimensional data, is one of my areas of research. Whilst the research papers on PCA can be very complex, they do however provide some useful insight and guides on how to apply PCA to real data. In this post I’m going to look at those three questions in turn. In the following I’m also going to assume you are already familiar with PCA and that you are aware that the principal components are the eigenvectors of the sample covariance matrix corresponding to the largest eigenvalues. For a good introduction to PCA see the blog post by Laura Hamilton and also the classic book by Jolliffe.

TL;DR – the the short answers to the questions above are, i) Check for outliers and/or heavy tails in your data before applying PCA, ii) use the ‘knee’ in the eigenvalue scree plot to select the number of components, iii) make sure you look at the loadings of the selected principal components and that you can explain any major patterns in those loadings.

1. Transformation of the data

The original derivations of PCA, such as that by Hotelling (1933), are heuristic and make no explicit assumptions other than the requirement that the selected components retain as much of the variance in the original data as possible. The formal derivation of PCA as the maximum likelihood estimate of parameters of a probabilistic model assumes additive Gaussian noise – see for example the paper by Tipping and Bishop. In practice, where the distributions of latent factors and additive noise are still reasonably symmetric and decay sufficiently fast we would expect deviations from perfect Gaussians to have only a minor effect. Despite this I have seen a number of talks with PCA plots similar to that shown on the left below.

PCAScores_HeavyTailExample1

I have based this example on a real commercial data set I was shown, but for privacy and confidentiality reasons I have generated a simulated data set that reproduces the issue. The majority of PC1 scores are very squashed into the left hand side of the plot, with a few scores reaching much higher values than the rest. This can been seen more clearly if we look at the estimated density of the 1st PC scores – see the right hand plot above. A heavy tail is clearly present, meaning we are deviating significantly from the assumptions under which the eigenvectors of the sample covariance matrix are the optimal estimators of the signal directions in the raw data.

So an obvious first step would be to take a quick look at the distribution of the raw data that we are trying to decompose using PCA. If that distribution has a heavy tail or significant outliers then there is an argument for applying a transformation (e.g. logarithm) before applying PCA. If we take the log of the example data above then we obtain much more reasonable distributions for the PC scores – see below.

PCAScores_HeavyTailExample2

Tip: If you see an elongated distribution of scores along the PCs you have selected, then it may be worth going back and looking at the distribution of the raw data going into the PCA – you should have already looked at the distribution of your raw data anyway as part of EDA best practice.

2. Selection of the number of components

There are two common methods for selecting the number of components that most people will be familiar with or encounter. Those are,

Select the number of components so that a fixed proportion, e.g. 90%, of the total variance is retained.
Look at a scree plot of the eigenvalues and locate the ‘elbow’ (or ‘knee’ depending on your interpretation of anatomy).

Of those two methods it is the second that I always prefer. This is because it has a sound theoretical underpinning and is more robust when applied to the kind of high-dimensional data sets that are commonplace nowadays. Let me explain why. The goal of PCA is to select a small number of directions in the data that we believe capture the signal within the data. The first approach to PCA model selection is effectively based upon the assumption that the ‘signal’ contribution to the total variance is considerably greater than the noise contribution to the total variance, and thus by selecting to retain the majority of the total variance in the original data we believe we are effectively selecting components that represent signal.

If our data points are $p$ dimensional, then we have $p$ sample covariance eigenvalues, $\lambda_{i},\;i=1,\ldots,p$ . We consider the signal part of the data is represented by a small number, $k$ , of eigenvectors/eigenvalues, and we have the $p-k-1$ remaining non-zero eigenvalues that correspond to pure noise in the original data.

Typically, there is no reason to believe that the noise process affects any of the original features/variables more strongly than the others – i.e. it is reasonable to consider the noise process to be isotropic. If not, then the preferred directions are essentially some form of signal and not noise. This means we expect the noise eigenvalues to be approximately the same, and let’s say they have an average value $\sigma^{2}$ . As we look at data sets of increasing dimension $p$ , unless the number of signal eigenvalues, $k$ , or the eigenvalues, $\lambda_{1},\ldots,\lambda_{k}$ , themselves grow with $p$ , then the variance explained by the signal values, $\sum_{i=1}^{k}\lambda_{i}$ , will remain relatively static whilst the noise eigenvalues contribute $(p-k-1)\sigma^{2}$ to the total variance. Thus we can see that the fraction of total variance explained by the signal components is essentially,

${\rm Fraction\;of\;variance\;explained}\;=\;\frac{\sum_{i=1}^{k}\lambda_{i}}{(p-k-1)\sigma^{2}\;+\;\sum_{i=1}^{k}\lambda_{i}}\;\;,$

and so decreases as the data dimension $p$ increases. Consequently, for high-dimensional data, where $p$ is very large (often in the thousands) the percentage of variance explained by the true signal components can be very very small. Conversely, if we select the number of components so as to retain say 90% of the total variance, we will be including a lot of noise in the retained components and not reducing the dimensionality as efficiently as we could be.

In contrast, if the true eigenvalues are split into a small number of ‘signal’ eigenvalues and a much larger number of ‘noise’ eigenvalues and we expect the noise process to be isotropic (i.e. we have a single highly degenerate noise eigenvalue), then the observed (sample) eigenvalues will also consist of a bulk of eigenvalues clustered around a single value and a small number of larger eigenvalues separated from this bulk. In other words we expect to see the distribution of sample covariance eigenvalues look something like the plot on the left below.

SP500_ScreePlot1

The eigenvalues in this example have been obtained from the sample covariance matrix of inter-day returns of closing prices over the 4 year period 2010 to 2013 for S&P500 stocks. We have actually omitted the largest eigenvalue, as this of a different scale and represents a ‘market-mode’ – see the next section.

Note that the bulk of the sample eigenvalues show some dispersion even though we had only one true, highly degenerate, noise eigenvalue. This is due to sampling variation, i.e. the fact that we have only a finite sized sample. If we plot the top 20 eigenvalues from the distribution above, ranked from largest to smallest, we get a scree plot looking like the plot on the right above.

Clearly, the sample eigenvalues corresponding to the bulk are very similar to each other, and so we see only small decreases with increasing rank in the sample eigenvalues in the bulk. In contrast, where there is an atypical jump in eigenvalue as we decrease the rank, this represents the point at which a signal eigenvalue can be detected as being separate from the bulk. This point also represents the ‘knee’ of the scree plot.

Statistical models that produce this kind of scree-plot are called ‘spiked-covariance’ models, so-called because the true population eigenvalue distribution is concentrated (or spiked) around just a small number of values. For these models we consider the data to have been produced by a small number of latent factors with isotropic noise. That is, our data point ${\bf x}_{i}$ is given by,

${\bf x}_{i}\;=\;\sum_{j=1}^{k}z_{ij}\lambda_{j}{\bf B}_{j}\;+\; \boldsymbol{\epsilon}_{i}\;\;\;,\;\;\;\boldsymbol{\epsilon}_{i}\sim {\cal N}\left ( {\bf 0}, \sigma^{2}{\bf I} \right)\;\;\;,\;\;\;z_{ij}\sim {\cal N}\left ( 0, 1\right)\;\;\;,$

with the vectors ${\bf B}_{1},\ldots,{\bf B}_{k}$ forming an orthonormal set.

From the mathematical form of the spiked-covariance models we can work out the expected distribution of sample covariance eigenvalues when our data consists of $N$ data points (each of dimension $p$ ). A large amount of research has been done in this area over the last 10-15 years. I won’t try to summarize it here, instead I’ll point you to the excellent review by Johnstone, and the original work by Paul. This research allows us to devise methods for the automatic detection of the number of principal components. Such methods have the advantage that they can be automated, i.e. programmed as part of an algorithm. On the practical side you can always “eyeball” the scree plot. For an isolated piece of analysis this is always advisable, as it takes so little time to do.

Also on the pragmatic side I have often found that simple ‘knee’ detection algorithms work surprisingly well, particularly for real ‘real-world’ data sets that I encounter as part of my day-to-day work. The simplest of such algorithms involves finding the maximum in the value of a discrete approximation to the second derivative of the ranked eigenvalue plot. That is, we choose $k$ as,

$k \;=\; \underset{i}{\mathrm{argmax}} \left ( \lambda_{i+1} + \lambda_{i-1} - 2\lambda_{i}\right )$

Improved approaches to ‘knee’ detection are based upon discrete approximations of the curvature. The paper by Satopää et al gives a good introduction to such methods. Again, these simple ‘knee’ detection approaches have the advantage that they can be coded and hence included as part of some automated process.

Note that in situations where the signal eigenvalues (those distinct from the bulk) do contribute the majority of the variance in the original data, then selecting the number of PCs on the basis of detecting a ‘knee’ in a scree plot will have the same advantages as selecting on the basis of wanting to retain a certain fixed percentage of the total variance. Consequently, there is very little reason not use the scree plot for selection in all cases.

3. Interpretation of the loadings

I have also found it is not uncommon for PCA to be blindly applied and the scores, i.e. the values of the new, lower-dimensional, features to be plotted, fed into some downstream process without any further curiosity being applied. Where the leading sample eigenvalue is very large – that is on a different scale to the other principal components retained or the bulk of the sample eigenvalues – I always take a look at the loadings. This is the case for the S&P500 data discussed above. The loadings tell us the contribution each of the original variables make to the principal component. The loadings for the 1st principal component of the S&P500 data is shown in the plot below,

SP500_1stPC_Loadings

Where the 1st sample eigenvalue is very large (compared to the others) it is not uncommon to see a loading pattern like that above – each loading value is of the same sign (and in this case of comparable size). The large 1st eigenvalue tells us that variation along the 1st principal component is the dominant mode of variation present in the original data. The loading pattern tells us that in this mode of variation all the original variables increase together, or decrease together. Such a pattern is often termed a ‘global mode’, i.e. it is a mode of variation that largely has the same effect globally on all the original variables. There are several scenarios and situations were the presence of a global mode can be naturally explained or is to be expected. For example,

Market modes in stock prices. This is where a rising or falling market causes all stock prices to rise or fall together.
In gene expression data obtained from model organisms exposed to a large environmental perturbation or insult. Here, for model organisms, e.g. yeast cultures, we can shock the biological system being studied without any ethical concerns, e.g. starve the organism of its primary food/fuel source or other essential nutrients. Consequently we see a system wide response to the starvation.
Price sensitivities of a collection of products sold by a retailer in a store. Here we expect price elasticities of products will reflect, to a large part, the economic conditions of the local geography. Consequently, a predominant part of the variation in product elasticities will be due to store-to-store variation in economic conditions and may show up as a global mode.

Where we see a global mode in the loadings, we should ask whether we can identify a credible mechanism behind the global mode. If not, then this should make us cautious about the appropriateness of the 1st principal component and hence the complete decomposition.

Finally it is worth mentioning that, if a sparse loading pattern is more naturally expected or convenient then sparse versions of PCA can be used. The lecture notes by Rob Tibshirani provide a good introduction to sparse PCA.

China invests big in AI

On January 21, 2018 By dchoyleIn AI, Artificial Intelligence, Data Science, Machine Learning, Science PolicyLeave a comment

This week saw interesting news and career guide articles in Nature highlighting Chinese government plans for its AI industry. The goal of the Chinese government is to become a world leader in AI by 2030. China forecasts that the value of its core AI industries will be US$157.7Billion in 2030 (based on exchange rate at 2018/01/19). How realistic that goal is will obviously depend upon what momentum there already is within China’s AI sector, but even so I was still struck and impressed by the ambition of the goal – 2030 is only 12 years away, which is not long in research and innovation terms. The Nature articles are worth a read (and are not behind a paywall).

ChinaAI_screenshot — Nature news article on China’s ‘New Generation of Artificial Intelligence Development Plan’

What will be the effect of China’s investment in AI? Attempting to make technology based predictions about the future can be ill-advised, but I will speculate anyway, as the articles, for me, prompted three immediate questions:

How likely is China to be successful in achieving its goal?
What sectors will it achieve most influence in?
What are competitor countries doing?

How successful will China be?

Whatever your opinions on the current hype surrounding AI, Machine Learning, and Data Science, there tends to a consensus that Machine Learning will emerge from its current hype-cycle with some genuine gains and progress. This time it is different. The fact that serious investment in AI is being made not just by corporations but by governments (including the UK) could be taken as an indicator that we are looking beyond the hype. Data volumes, compute power, and credible business models are all present simultaneously in this current AI/Machine Learning hype-cycle, in ways that they weren’t in the 1980s neural network boom-and-bust and other AI Winters. Machine Learning and Data Science is becoming genuinely commoditized. Consequently, the goal China has set itself is about building capacity, i.e. about the transfer of knowledge from a smaller innovation ecosystem (such as the academic community and a handful of large corporate labs) to produce a larger but highly-skilled bulk of practitioners. A capacity building exercise such as this should be a known quantity and so investments will scale – i.e. you will see proportional returns on those investments. The Nature news article does comment that China may face some challenges in strengthening the initial research base in AI, but this may be helped by the presence of large corporate players such as Microsoft and Google, who have established AI research labs within the country.

What sectors will be influenced most?

One prominent area for applications of AI and Machine Learning is commerce, and China provides a large potential market place. However, access to that market can be difficult for Western companies and so Chinese data science solution providers may face limited external competition on their home soil. Equally, Chinese firms wishing to compete in Western markets, using expertise of the AI-commerce interface gained from their home market, may face tough challenges from the mature and experienced incumbents present in those Western markets. Secondly, it may depend precisely on which organizations in China develop the beneficial experience in the sector. The large US corporates (Microsoft, Google) that have a presence in China are already main players in AI and commerce in the West, and so may not see extra dividends beyond the obvious ones of access to the Chinese market and access to emerging Chinese talent. Overall, it feels that whilst China’s investment in this sector will undoubtedly be a success, and Chinese commerce firms will be a success, China’s AI investment may not significantly change the direction the global commerce sector would have taken anyway with regard to its use and adoption of AI.

Perhaps more intriguing will be newer, younger sectors in which China has already made significant investment. Obvious examples, such as genomics, spring to mind, given the scale of activity by organizations such as BGI (including the AI-based genomic initiative of the BGI founder Jun Wang). Similarly, robotics is another field highlighted within the Nature articles.

What are China’s competitors investing in this area?

I will restrict my comments to the UK, which, being my home country, I am more familiar with. Like China, the UK has picked out AI, Robotics, and a Data Driven Economy as areas that will help enable a productive economy. Specifically, the UK Industrial Strategy announced last year identifies AI for one of its first ‘Sector Deals’ and also as one of four Grand Challenges. The benefits of AI is even called out in other Sector Deals, for example in the Sector Deal for the Life Sciences. This is on top of existing UK investment in Data Science, such as the Alan Turing Institute (ATI) and last year’s announcement by the ATI that it is adding four additional universities as partners. In addition we have capacity-building calls from research councils, such as the EPSRC call for proposals for Centres for Doctoral Training (CDTs). From my quick reading, 4 of the 30 priority areas that the EPSRC has highlighted for CDTs make explicit reference to AI, Data Science, or Autonomous Systems. The number of priority areas that will have some implicit dependence on AI or Data Science will be greater. Overall the scale of the UK investment is, naturally, unlikely to match that of China – the original Nature report on the Chinese plans says that no mention of level of funding is made. However, the likely scale of the Chinese governmental investment in AI will ultimately give that country an edge, or at least a higher probability of success. Does that mean the UK needs to re-think and up its investment?

Notes:

An English-language government summary of the plan, published on the 20th July 2017, can be found here.
A more detailed outline of the plan’s goals and policy is given in a Medium blog article by Jia He.
An English-language copy of the plan can be found on the website of the non-profit organization the Foundation for Law and International Affairs (FLIA).

Summary

Introduction

Theory

Practice

Getting MAD

Other distributions

Conclusions

Share this:

Summary

Introduction

What do you think an outlier is?

An illustration

Life (and data) is always made up of multiple processes

How to detect outliers?

Conclusions

Share this:

Summary

Introduction

Example 1:

Example 2:

How to use abstraction as a practical tool

Conclusion

Share this:

Summary

Introduction

The problem which slapped me in the face

A colleague asks an awkward question

The mathematical detail

Conclusion

Share this:

Summary

Introduction

The log-sum-exp trick

The log-sum-exp function

Calculating log-sum-exp in Python

Share this:

Summary

Introduction

GenAI for rigorous mathematics? Really?

Less formal mathematics research

Share this:

Introduction

What is a Bland-Altman plot?

What the eye doesn’t see

Share this:

Share this:

Introduction

What simulation will give you and what it won’t

What simulated data will give you

Consistency check

Bias check

Efficiency check

Sensitivity check – robustness to contamination

Runtime scaling

What simulated data won’t give you

Identify model mis-specification

Accuracy of your model on real data

How to simulate

Simulating features and response

Linear model example

Assessing the OLS Estimator for a linear model

Neural network example

Sampling features from more realistic distributions

Simulating the response only

Conclusions

Footnotes

Share this:

The recommendations

Why learn the mathematics of transformers?

The benefits of learning the maths of transformers and the risks of not doing so

What to do?

Share this:

Benford’s Law

What is Benford’s Law?

What’s behind Benford’s Law

Deriving Benford’s Law

Ted Hill’s proof of Benford’s Law

Benford’s Law is base invariant

When should a dataset follow Benford’s Law?

Why computer file sizes follow Benford’s Law