Featured

Abstraction is your friend

Summary

  • When constructing a coding solution to an implementation problem, or an algorithm/theory for a mathematical problem, we can get bogged down in detail and lose our way.
  • Abstraction helps us to move up a level and leave the detail behind.
  • Once we temporarily leave the detail behind, what we need to do becomes clearer.
  • Now we know what we have to do or implement, we can successfully put the detail back in because we have a clear ‘north star’ of what we need to implement.

Introduction

The computer scientist Edsger Dijkstra once said,

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

This is one of my favourite computing-related quotes, not least because it comes from Dijkstra. Edsger Dijkstra was a prodigious and extremely impactful computer scientist. In 1972 he was awarded the ACM A.M. Turing Award, widely regarded as the “Nobel prize of computer science”. Dijkstra was also “the computer scientist’s computer scientist”, providing rigorous solutions to tough practical problems, such as finding the shortest path between two nodes in a network.

I also like the Dijkstra quote because I can relate to it. Whenever I want clarity on a problem, I abstract. Abstraction allows me to identify and focus on the “what” by taking the details of the “how” out of the discussion (for the time being). Once I know “what” it is I need to do I can put the details of the “how” back in and it becomes a straightforward technical implementation task (with debugging of course).

Yes, I hear you say, but what do you actually mean by all this high-level philosophical talk? I’ll give you two examples. Yes, they are concocted examples, but they help illustrate what I’m driving at. Once I’ve discussed the two examples, we can then distil a more general approach of, i) when to recognize that we need more abstraction, and ii) how to do it.

Example 1:

Imagine you have the following equation,

G + \Lambda g = \kappa T    Eq.1

In Eq.1, \Lambda and \kappa are constants. With just your high-school mathematics you can see that Eq.1 is a linear equation, and you can confidently say that as T increases, so does G.

I will now tell you that Eq.1 represents the field equations for Einstein’s theory of general relativity. With just high-school maths you’ve been able to understand general relativity. G is the curvature tensor – it describes how spacetime is curved, g is called the metric tensor, and T is the stress-energy tensor. The simple high-level form of Eq.1 allows us to understand that energy, or equivalently mass (remember E = mc^{2}), affects how curved spacetime is. The more mass we have, or the higher the energy density, the more spacetime will be curved. And the curvature of spacetime affects the dynamics of anything moving in that spacetime. This simple high-level picture is so easy to grasp that the famous physicist John Archibald Wheeler could summarize Einstein’s theory in one short sentence, “Spacetime tells matter how to move; matter tells spacetime how to curve”.

All of the physics concepts of Einstein’s theory are in that simple equation of Eq.1 and likewise in the simple sentence from Wheeler. At this abstract level, the “what”, i.e. the physics, is crystal clear. However, when we start getting into the technical detail we have to go deeper. To “implement” a calculation we have to start to get to grips with the “how” and learn about Christoffel symbols, Ricci tensors, covariant and contravariant tensors, and much more. It requires mathematical training.

The details are also a lot messier. In the way I have presented the example, we started with the abstraction and then introduced some of the detail. When we are actually doing research, it is usually the reverse. We start with some existing details and want to construct a simplifying, unifying theory, but we get lost in the details because of the deep technical nature of those details.

A similar thing can happen when we are coding a Data Science solution. When we are deep in the technical details we often fail to spot the high-level patterns and so we fail to spot the simple description. We end up producing lots of corner cases and edge cases. Our code starts to become one big series of “if-else if” statements or a big “switch” statement. Our second example illustrates that in a Data Science context.

Example 2:

In our first example we established that we are comfortable with linear models. In statistics a linear model is of the form,

\mathbb{E}\left ( y \right ) \;=\; \underline{\beta}^{\top}\underline{x}  Eq.2

The left-hand side of Eq.2 is the mean of our target variable y and the right-hand side is what we call the “linear predictor” because it is a linear combination of the various predictive features and it predicts the average value of y given the feature vector \underline{x}.

What about non-linear models? Imagine that we have a non-linear relationship between our target variable and a single explanatory feature x. One approach I’ve seen some Data Scientists take is to partition the values of x into a number of sub-ranges and build a linear model for each sub-range. Sticking with linear models seems comfortable. But we end up with a whole zoo of different linear models (yes, I really have seen someone do this).

Let’s move away from the details about how we do the model fitting and focus on what we want. Let’s abstract. We want to model the mean of the target variable. When we state the problem in these simple terms we realise we can just write the mean of the target variable as a non-linear transformation of the linear predictor in Eq.2. We write this as,

\mathbb{E}\left( y \right )\;=\;g^{-1}\left ( \underline{\beta}^{\top}\underline{x} \right ).  Eq.3

This is what statisticians call a Generalized Linear Model (GLM). Again, the left-hand side of Eq.3 is how we represent the mean of the target variable y, and the non-linear transformation function is traditionally represented as g^{-1}(\cdot). The function g is called the “link function” and is monotonic. So we can read Eq.3 as “mean of y = simple monotonic non-linear transformation of \underline{\beta}^{\top}\underline{x}”. You can see where the name Generalized Linear Model comes from. It is just a non-linear transformation of the same thing we focus on when we build a linear model.

The way we introduced Eq.3 seemed very natural and intuitive. And yet so many Data Scientists seem scared of GLMs. The jargon of link functions and the notation of expectation values \mathbb{E}(y) can be off-putting when you first encounter them. Consequently, many Data Scientists don’t persevere with GLMs. Yet when you understand what Eq.3 is saying at a high level, GLMs are very easy to understand. When we step away from the messy detail of “how” to code the non-linearity and instead focus on the “what” of what we want to achieve – introduce a non-linear relationship between the mean of our target variable and a linear combination of predictive features – then Eq.3 becomes the intuitive and obvious way to do it. Coding it up then becomes easy.
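To make Eq.3 concrete, here is a minimal sketch in Python of computing the mean of a binary target via a logistic inverse link; the coefficient values and feature vector are made-up illustrations, not from any real dataset,

```python
import numpy as np

def inverse_logit(eta):
    """Inverse link g^{-1} for a logistic (Bernoulli) GLM."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical coefficients beta and a feature vector x
beta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 0.8])  # first element acts as the intercept term

# Linear predictor beta^T x, then the non-linear transformation of Eq.3
eta = beta @ x
mean_y = inverse_logit(eta)
print(mean_y)  # a value in (0, 1): the modelled mean of a binary target
```

The linear predictor is exactly the object we already know from Eq.2; all the GLM adds is the single non-linear transformation applied to it.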

What can we learn more generally from those two examples?

How to use abstraction as a practical tool

What the two examples above will hopefully have convinced you about is:

  • Focusing at a high level on the ‘things’ in our problem and what we want them to do helps us to identify the main actors in the problem, the relationships between them, and what we need to do to them. We discover the “what”.
  • Once we have the “what”, the “how”, i.e. the implementation, is typically easy.

But how to use this in practice? Whenever I find myself trying to solve a coding problem and I’m getting bogged down in the details of structuring the problem, flipping between different choices of data-structures to use, then I know I’m too close to the problem. I know I need to step away from the keyboard and get the pen and paper out.

I start by representing the main objects in my problem by some sort of symbol, e.g. a circle or a square. I then sketch out what interactions happen between those objects. By this point I’m beginning to describe the interactions in high-level language such as, “I have a matrix of type X. It is processed by a transform of type Y that is parameterized by a mathematical object of type A”. You’ll spot that I’m already beginning to describe the interactions more in terms of interfaces and method signatures. That is, I’m focusing on the abstraction of the problem and not on the implementation details that would occur behind those interfaces and method signatures. This is because I have no other choice. I have only pen and paper. I can’t code, so I can’t get bogged down again in implementation details and discussions on what data structures to use.
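As an illustration, a pen-and-paper description like the one above might translate into interface sketches such as the following; the names Transform and ScaleTransform here are hypothetical placeholders I have invented for this example, not from any real library,

```python
from typing import Protocol
import numpy as np

class Transform(Protocol):
    """A transform of 'type Y': its signature says *what* it does, not *how*."""
    def apply(self, matrix: np.ndarray) -> np.ndarray:
        ...

class ScaleTransform:
    """One concrete implementation: here the parameterizing object is a scalar."""
    def __init__(self, scale: float):
        self.scale = scale

    def apply(self, matrix: np.ndarray) -> np.ndarray:
        return self.scale * matrix

# The calling code only needs the abstract signature, not the implementation
m = np.eye(2)
result = ScaleTransform(2.0).apply(m)
print(result)
```

The point of the sketch is that the `apply` signature pins down the “what”; any number of concrete implementations can later fill in the “how” behind it.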

Once I have finished a couple of iterations of pen-and-paper sketching I have the architecture of my solution nailed. And the best thing is the structure typically mimics the structure of the actual mathematical calculation I’m trying to code. At this point I have identified what interfaces, classes, and methods I need. I now return to the keyboard and the implementation is easy.

Conclusion

If you’re getting bogged down in working out implementation details, it is a good sign you don’t actually know “what” it is you need to implement.

Step back, move up a level of abstraction and start identifying what are the things/actors in your problem and what are the interactions between them. Doing this using pen-and-paper will aid you a lot, because it forces you away from the keyboard and the temptation to continue implementing.

Once you have sketched out the necessary objects and their interactions you will pretty much have the classes, interfaces, and methods you need. You have the “what”. Now you can implement. It will now be a lot faster.

© 2026 David Hoyle. All Rights Reserved


Data Science Notes: 2. Log-Sum-Exp

Summary

  • Summing up many probabilities that are on very different scales often involves calculation of quantities of the form \log\left ( \sum_{k} \exp\left( a_{k}\right )\right ). This calculation is called log-sum-exp.
  • Calculating log-sum-exp the naive way can lead to numerical instabilities. The solution to this numerical problem is the “log-sum-exp” trick.
  • The scipy.special.logsumexp function provides a very useful implementation of the log-sum-exp trick.
  • The log-sum-exp function also has uses in machine learning, as it is a smooth, differentiable approximation to the {\rm max} function.

Introduction

This is the second in my Data Science Notes series. The first, on Bland-Altman plots, can be found here. This post is on a very simple numerical trick that ensures accuracy when adding lots of probability contributions together. The trick is so simple that implementations of it exist in standard Python packages, so you only need to call the appropriate function. However, you still need to understand why you can’t just naively code up the calculation yourself, and why you need to use the numerical trick. As with the Bland-Altman plots, this is something I’ve had to explain to another Data Scientist in the last year.

The log-sum-exp trick

Sometimes you’ll need to calculate a sum of the form, \sum_{k} \exp\left ( a_{k}\right ), where you have values for the a_{k}. Really? Will you? Yes, most likely as part of calculating a log-likelihood, or a contribution to a log-likelihood, so the actual calculation you want to do is of the form,

\log\left ( \sum_{k} \exp \left ( a_{k}\right )\right )

These sorts of calculations arise where you have log-likelihood or log-probability values a_{k} for individual parts of an overall likelihood calculation. If you come from a physics background you’ll also recognise the expression above as the calculation of a log-partition function.

So what we need to do is exponentiate the a_{k} values, sum them, and then take the log at the end. Hence the expression “log-sum-exp”. But why a blogpost on “log-sum-exp”? Surely, it’s an easy calculation. It’s just np.log(np.sum(np.exp(a))), right? Not quite.

It depends on the relative values of the a_{k}. Do the naïve calculation np.log(np.sum(np.exp(a))) and it can be horribly inaccurate. Why? Because of overflow and underflow errors.
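Here is a small sketch of the failure mode, using made-up values of a_{k} that are large enough for \exp(a_{k}) to overflow a 64-bit float,

```python
import numpy as np

# Made-up log-scale values, large enough that exp() overflows a float64
a = np.array([1000.0, 999.0])

# The naive calculation: np.exp(1000.0) overflows to inf, so the log is inf too
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(a)))

print(naive)  # inf
# The true answer is 1000 + log(1 + exp(-1)), roughly 1000.3133
```

The inputs themselves are perfectly modest numbers; it is only the intermediate exponentiation that blows up.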

If we have two values a_{1} and a_{2} and a_{1} is much bigger than a_{2}, when we add \exp(a_{1}) to \exp(a_{2}) we are using floating-point arithmetic to try and add a very large number to a much smaller number. If a_{1} is large enough, computing \exp(a_{1}) will give an overflow error. It would be much better if we’d started with \exp(a_{1}) and tried to add \exp(a_{2}) to it. In fact, we could pre-compute a_{2} - a_{1}, which would be very negative, and from this we could easily infer that adding \exp(a_{2}) to \exp(a_{1}) would make very little difference. In fact, we could just approximate \exp(a_{1}) + \exp(a_{2}) by \exp(a_{1}).

But how negative does a_{2} - a_{1} have to be before we ignore the addition of \exp(a_{2})? We can set some pre-specified threshold, chosen to avoid overflow or underflow errors. From this, we can see how to construct a little Python function that takes an array of values a_{1}, a_{2},\ldots, a_{N} and computes an accurate approximation to  \sum_{k=1}^{N}\exp(a_{k}) without encountering underflow or overflow errors.

In fact we can go further and approximate the whole sum by first of all identifying the maximum value in an array a = [a_{1}, a_{2}, \ldots, a_{N}]. Let’s say, without loss of generality, the maximum value is a_{1}. We could ensure this by first sorting the array, but it isn’t actually necessary to do this to get the log-sum-exp trick to work. We can then subtract a_{1} from all the other values of the array, and we get the result,

\log \left ( \sum_{k=1}^{N} \exp \left ( a_{k} \right )\right )\;=\; a_{1} + \log \left ( 1\;+\;\sum_{k=2}^{N} \exp \left ( a_{k} - a_{1} \right )\right )

The values a_{k} - a_{1} are all negative for k \ge 2, so we can easily approximate the logarithm on the right-hand side of the equation by a suitable expansion of \log (1 + x). This is the “log-sum-exp” trick.
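The last equation translates directly into a few lines of NumPy. This is just a sketch of the trick to show how little is involved; in practice you would use a library implementation,

```python
import numpy as np

def log_sum_exp(a):
    """Numerically stable log(sum(exp(a))) via the shift-by-the-maximum trick."""
    a = np.asarray(a, dtype=float)
    a_max = np.max(a)
    # After subtracting the maximum, every exponent is <= 0, so exp() cannot overflow
    return a_max + np.log(np.sum(np.exp(a - a_max)))

# Values this large would overflow the naive calculation np.log(np.sum(np.exp(a)))
print(log_sum_exp([1000.0, 999.0]))  # 1000 + log(1 + exp(-1)), roughly 1000.3133
```

Note that we never need to know which element is the maximum in advance; `np.max` finds it, and no sorting is required.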

The great news is that this “log-sum-exp” calculation is so common in different scientific fields that there are already Python functions written to do this for us. There is a very convenient “log-sum-exp” function in the SciPy package, which I’ll demonstrate in a moment.

The log-sum-exp function

The sharp-eyed amongst you may have noticed that the last formula above gives us a way of providing upper and lower bounds for the {\rm max} function. We can simply re-arrange the last equation to get,

{\rm max} \left ( a_{1}, a_{2},\ldots, a_{N} \right ) \;\le\; \log \left ( \sum_{k=1}^{N}\exp \left ( a_{k}\right )\right )

The logarithm calculation on the right-hand side of the inequality above is what we call the log-sum-exp function (lse for short). So we have, 

{\rm max} \left ( a_{1}, a_{2},\ldots, a_{N} \right ) \;\le\; {\rm lse}\left ( a_{1}, a_{2}, \ldots, a_{N}\right )

This gives us an upper bound for the {\rm max} function. Since a_{k} \le {\rm max}\left ( a_{1}, a_{2},\ldots,a_{N}\right ), it is also relatively easy to show that,

{\rm lse} \left ( a_{1}, a_{2}, \ldots, a_{N}\right )\;\le\; {\rm max}\left ( a_{1}, a_{2},\ldots, a_{N} \right )\;+\;\log N

and so, after rearranging, {\rm lse} - \log N gives us a lower bound for the {\rm max} function. So the log-sum-exp function allows us to compute lower and upper bounds for the maximum of an array of real values, and it can provide an approximation to the maximum function. The advantage is that the log-sum-exp function is smooth and differentiable. In contrast, the maximum function itself is neither smooth nor differentiable everywhere, and so is less convenient to work with mathematically. For this reason the log-sum-exp function is often called the “real-soft-max” function because it is a “soft” version of the maximum function. It is often used in machine learning settings to replace a maximum calculation.
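We can check the two bounds numerically with an arbitrary made-up array,

```python
import numpy as np
from scipy.special import logsumexp

a = np.array([3.0, 1.0, 2.5, -0.5])  # arbitrary example values
lse = logsumexp(a)

# The sandwich: max(a) <= lse(a) <= max(a) + log(N)
print(np.max(a), lse, np.max(a) + np.log(len(a)))
```

The gap between lse and the true maximum shrinks as one element dominates the others, which is exactly why lse works as a “soft” maximum.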

Calculating log-sum-exp in Python

So how do we calculate the log-sum-exp function in Python? As I said, we can use the SciPy implementation, which is in scipy.special. All we need to do is pass an array-like set of values a_{k}. I’ve given a simple example below,

# import the packages and functions we need
import numpy as np
from scipy.special import logsumexp
# create the array of a_k values
a = np.array([70.0, 68.9, 20.3, 72.9, 40.0])
# Calculate log-sum-exp using the scipy function
lse = logsumexp(a)
# look at the result
print(lse)

This will give the result 72.9707742189605.

The example above and several more can be found in the Jupyter notebook DataScienceNotes2_LogSumExp.ipynb in the GitHub repository https://github.com/dchoyle/datascience_notes

The great thing about the SciPy implementation of log-sum-exp is that it allows us to include signed scale factors, i.e. we can compute,

\log \left ( \sum_{k=1}^{N} b_{k}\exp\left ( a_{k}\right ) \right )

where the values b_{k} are allowed to be negative. This means that when we are using the SciPy log-sum-exp function to perform the log-sum-exp trick, we can actually use it to calculate numerically stable estimates of sums of the form,

\log \left ( \exp\left ( a_{1}\right ) \;-\; \exp\left ( a_{2}\right ) \;-\;\exp\left ( a_{3}\right )\;+\;\exp\left ( a_{4}\right )\;+\ldots\; + \exp\left ( a_{N}\right )\right ).

Here’s a small code snippet illustrating the use of scipy.special.logsumexp with signed contributions,

# import the packages and functions we need
import numpy as np
from scipy.special import logsumexp
# Create the array of the a_k values
a = np.array([10.0, 9.99999, 1.2])
b = np.array([1.0, -1.0, 1.0])
# Use the scipy.special log-sum-exp function
lse = logsumexp(a=a, b=b)
# Look at the result
print(lse)

This will give the result 1.2642342014146895.

If you look at the output of the example above you’ll see that the final result is much closer to the value of the last array element a_{3} = 1.2. This is because the first two contributions, \exp(a_{1}) and \exp(a_{2}), almost cancel each other out, as the contribution \exp(a_{2}) is pre-fixed by a factor of -1. What we are left with is something close to \log(\exp(a_{3}))\;=\; a_{3}.

There is also a small subtlety in using the SciPy logsumexp function with signed contributions. If the subtraction of some terms leads to an overall negative result, scipy.special.logsumexp will return NaN as the result. In order to get it to always return a result for us, we have to tell it to return the sign of the final summation as well, by setting the return_sign argument of the function to True. Again, you can find the code example above and others in the notebook DataScienceNotes2_LogSumExp.ipynb in the GitHub repository https://github.com/dchoyle/datascience_notes.
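Here is a minimal sketch of the return_sign usage, with made-up values for which the signed sum is negative,

```python
import numpy as np
from scipy.special import logsumexp

# exp(1.0) - exp(2.0) is negative, so without return_sign the result would be NaN
a = np.array([1.0, 2.0])
b = np.array([1.0, -1.0])

lse, sign = logsumexp(a=a, b=b, return_sign=True)
# sign is -1.0 and lse is the log of the magnitude, i.e. log(exp(2) - exp(1))
print(lse, sign)
```

The pair (lse, sign) fully represents the signed sum: the original quantity is recovered as sign * exp(lse).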

When you have to combine lots of different probabilities that are on very different scales, and you need to subtract some of them and add others, the SciPy log-sum-exp function is very useful.

© 2026 David Hoyle. All Rights Reserved