Log-Normal Bias and the infinite capacity of mathematics to surprise

Summary

  • Always confirm your mathematical intuition with an explicit calculation, no matter how simple the calculation.
  • Always confirm your mathematical intuition with an explicit calculation, no matter how well you know the particular area of mathematics.
  • Don’t leave it 25 years to check your mathematical intuition with a calculation, no matter how confident you are in that intuition.

Introduction

One of the things about mathematics is that it’s not afraid to slap you in the face and stand there laughing at you. Metaphorically speaking, I mean. Mathematics has an infinite capacity to surprise even in areas where you think you can’t be surprised. And you end up looking or feeling like an idiot, with mathematics laughing at you.

This is a salutary lesson about how I thought I understood an area of statistics, but didn’t. The area in question is the log-normal distribution. This is a distribution I first encountered 25 years ago and have worked with almost continuously since.

The problem which slapped me in the face

The particular aspect of the log-normal distribution in question was what happens when you go from models or estimates on the log scale to models or estimates on the ‘normal’, or exponentiated, scale. For a log-normal random variable Y with \sigma^{2} being the variance of \log Y, we have the well-known result,

\mathbb{E}\left ( Y \right ) \;=\; \exp\left( \mathbb{E}\left (\log Y \right ) \right ) \times \exp\left ( \frac{1}{2}\sigma^{2}\right )

This is an example of Jensen’s inequality at work. If you apply a non-linear function f to a random variable Y, then the average of the function is not equal to the function of the average. That is, \mathbb{E}\left ( f(Y)\right )\;\neq\; f\left ( \mathbb{E}\left ( Y\right )\right ).
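As a quick numerical illustration of this (a minimal sketch of my own, assuming NumPy is available, and not part of the argument itself), we can simulate log-normal data and compare the naive back-transform \exp\left ( \mathbb{E}\left ( \log Y\right )\right ) with the true mean \mathbb{E}\left ( Y\right ):

import numpy as np

# For log Y ~ N(mu, sigma^2), exp(mean of log Y) underestimates the mean of Y
# by a factor of exp(sigma^2 / 2).
rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0
log_y = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(log_y)

print(np.exp(log_y.mean()))         # roughly exp(mu) = 1.0
print(y.mean())                     # roughly exp(mu + sigma^2/2), about 1.65
print(np.exp(mu + 0.5 * sigma**2))  # theoretical value of E(Y)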

In the case of the log-normal distribution, the first equation above tells us that we can relate \mathbb{E}\left ( \log Y\right ) to \mathbb{E}\left ( Y\right ) via a simple bias-correction factor \exp ( \frac{1}{2}\sigma^{2} ). So, if we had a sample of N observations of \log Y, we could estimate the expectation of Y by exponentiating our estimate of the expectation of \log Y and then multiplying by the bias correction factor \exp\left ( \frac{1}{2} \sigma^{2}\right ). We can estimate \sigma^{2} from the residuals of our estimate of the expectation of \log Y. In this case, I’m going to assume our model is just an estimate of the mean, \mathbb{E}\left ( \log Y \right ), which I’ll denote by \mu, and I’m not going to try to decompose \mu any further. If our estimate of \mu is just the sample mean of \log Y, then our estimate of \sigma^{2} would simply be the sample variance, s^{2}. If we denote our observations by y_{i}\;,\; i=1,\ldots,N, then s^{2} is given by the formula,

s^{2}\;=\; \frac{1}{N-1}\sum_{i} \log^{2} y_{i} \;-\;  \frac{1}{N(N-1)}\left( \sum_{i}\log y_{i}\right )^{2}

Note the usual Bessel correction that I’ve applied in the calculation of s^{2}.

We can see from this discussion that the bias correction in going from the log-scale to the ‘normal’ scale is multiplicative in this case. The bias correction factor B_{1} is given by,

B_{1} \;=\;\exp\left ( \frac{1}{2}  s^{2}\right )
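In code, computing B_{1} from a sample on the original scale is very short. The sketch below is illustrative only (the function name is mine, NumPy is assumed, and the simulation code itself lives in the notebook mentioned later):

import numpy as np

def parametric_bias_correction(y):
    # Parametric bias correction factor B1 = exp(s^2 / 2), where s^2 is the
    # Bessel-corrected sample variance of log(y).
    s2 = np.log(y).var(ddof=1)  # ddof=1 applies the Bessel correction
    return np.exp(0.5 * s2)

# Estimate E(Y) by exponentiating the log-scale sample mean and then applying
# the multiplicative correction factor B1:
# estimate_of_mean_Y = np.exp(np.log(y).mean()) * parametric_bias_correction(y)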

If we had only suspected that the bias correction needed to be multiplicative, which seems reasonable given we are going from a log scale to a normal scale, then we could easily have estimated a bias correction factor as simply the ratio of the mean of the observations on the normal scale to the exponential of the mean of the observations on the log scale. That is, we can construct a second estimate, B_{2}, of the bias correction factor needed, with B_{2} given by,

B_{2} \; =\; \frac{\frac{1}{N}\sum_{i}y_{i}}{\exp\left ( \frac{1}{N}\sum_{i}\log y_{i}\right )}

I tend to refer to the B_{2} method as a non-parametric method because it is how you would estimate \mathbb{E}\left ( Y \right ) from an estimate of \mathbb{E}\left ( \log Y \right ) if all you knew was that the bias correction needed to be multiplicative. You don’t need to make any explicit assumptions about the distribution of the data. In contrast, B_{1} is specific to the data coming from a log-normal distribution, so I call it a parametric estimate. For these reasons, I tend to prefer the B_{2} way of estimating any necessary bias correction factor: real data won’t be perfectly log-normal, so the B_{2} estimator should be more robust, whilst also being very simple to calculate.
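As a companion to the sketch above, B_{2} is equally easy to compute (again, the function name is mine and NumPy is assumed):

import numpy as np

def nonparametric_bias_correction(y):
    # Non-parametric bias correction factor B2: the ratio of the sample mean
    # of y to the exponential of the sample mean of log(y).
    return y.mean() / np.exp(np.log(y).mean())

For exactly log-normal data, both functions are estimating the same underlying quantity, \exp\left ( \frac{1}{2}\sigma^{2}\right ); the question addressed below is how differently they behave at finite sample sizes.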

A colleague asks an awkward question

So far everything I’ve described is well known and pretty standard. However, a few months ago a colleague asked me whether there would be any differences between the estimates B_{1} and B_{2}. “Of course”, I said, if the data is not log-normal. “No”, my colleague said. They meant: even if the data was log-normal, would there be any differences between the estimates B_{1} and B_{2}? Understanding the size of any differences when the parametric assumptions are met would help understand how big the differences could be when the parametric assumptions are violated. “Yeah, there will obviously still be differences between B_{1} and B_{2} even if we do have log-normal data, because of sampling variation affecting the two formulae differently. But I would expect those differences to be small, even for moderate sample sizes”, I confidently replied.

I’ve been using those formulae for B_{1} and B_{2} for over 25 years and never considered the question my colleague asked. I’d never confirmed my intuition. So 25 years later, I thought I’d do a quick calculation to confirm my intuition. My intuition was very very wrong! I thought it would be interesting to explain the calculations I did and what, to me, was the surprising behaviour of B_{1}. Note, with hindsight the behaviour I uncovered shouldn’t have been surprising. It was obvious once I’d begun to think about it. It was just that I’d left thinking about it till 25 years too late.

The mathematical detail

The question my colleague was asking was this. If we have N i.i.d. observations drawn from a log-normal distribution, i.e.,

x_{i}\;=\;\log y_{i} \sim {\cal{N}}\left (\mu, \sigma^{2}\right )\;\;\;   , i = 1,\ldots,N\;\;\;,

what are the expectation values of the bias correction factors B_{1} and B_{2} at any finite value of N? Do \mathbb{E}\left ( B_{1} \right ) and \mathbb{E}\left ( B_{2}\right ) converge to \exp(\frac{1}{2}\sigma^{2}) as N\rightarrow\infty, and if so, how fast?

Since the \log y_{i} are Gaussian random variables we can express these expectation values in terms of simple integrals and we get,

\mathbb{E}\left ( B_{1}\right )\;=\; \left( 2\pi\sigma^{2}\right)^{-\frac{N}{2}}\int_{{\mathbb{R}}^{N}} d{\underline{x}}\exp\left [-\frac{1}{2\sigma^{2}}\left ( \underline{x} - \mu\underline{1}_{N}\right )^{\top}\left ( \underline{x} - \mu\underline{1}_{N}\right )\right ] \exp\left[ \frac{1}{2} \underline{x}^{\top}\underline{\underline{M}}\underline{x}\right ]   Eq.1

where the matrix \underline{\underline{M}} is given by,

\underline{\underline{M}}\;=\;\frac{1}{N-1}\underline{\underline{I}}_{N}\;-\;\frac{1}{N(N-1)}\underline{1}_{N}\underline{1}_{N}^{\top}.

Here {\underline{1}}_{N} is an N-dimensional vector consisting of all ones, and \underline{\underline{I}}_{N} is the N\times N identity matrix.
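As a quick sanity check on this setup (my own check, assuming NumPy), the quadratic form \underline{x}^{\top}\underline{\underline{M}}\,\underline{x} should reproduce the Bessel-corrected sample variance s^{2}, so that the second exponential in Eq.1 is just \exp\left ( \frac{1}{2}s^{2}\right ) = B_{1}:

import numpy as np

# Check that x^T M x equals the Bessel-corrected sample variance s^2.
rng = np.random.default_rng(0)
N = 20
x = rng.normal(size=N)

ones = np.ones(N)
M = np.eye(N) / (N - 1) - np.outer(ones, ones) / (N * (N - 1))

print(x @ M @ x)      # quadratic form appearing in the exponent of Eq.1
print(x.var(ddof=1))  # sample variance s^2; the two values should agree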

We can similarly express the expectation of B_{2} and we get,

\mathbb{E}\left ( B_{2}\right )\;=\;\frac{1}{N}\sum_{i=1}^{N} \left( 2\pi\sigma^{2}\right)^{-\frac{N}{2}}\int_{{\mathbb{R}}^{N}} d{\underline{x}}\exp\left [-\frac{1}{2\sigma^{2}}\left ( \underline{x} - \mu\underline{1}_{N}\right )^{\top}\left ( \underline{x} - \mu\underline{1}_{N}\right )\right ]  \exp \left [ \sum_{j=1}^{N}\left ( \delta_{ij} - \frac{1}{N}\right )x_{j}\right ]   Eq.2

The matrices in the exponents of the integrands of Eq.1 and Eq.2 are easily diagonalized, and so the integrals are relatively straightforward to evaluate analytically. From this we find,

\mathbb{E}\left ( B_{1}\right )\;=\; \left ( 1\;\;-\;\frac{\sigma^{2}}{N-1}\right )^{-\frac{(N-1)}{2}}  Eq.3

\mathbb{E}\left ( B_{2}\right )\;=\; \exp \left [ \frac{\sigma^{2}}{2}\left ( 1\;-\;\frac{1}{N}\right )\right ]   Eq.4
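For readers who prefer to avoid the multivariate integrals, Eq.3 and Eq.4 can also be obtained as a cross-check from standard moment generating functions. Since (N-1)s^{2}/\sigma^{2}\sim\chi^{2}_{N-1}, the chi-squared moment generating function \mathbb{E}\left ( e^{tW}\right )\;=\;\left ( 1-2t\right )^{-\frac{N-1}{2}}, evaluated at t=\frac{\sigma^{2}}{2(N-1)}, gives Eq.3, and it only exists for t<\frac{1}{2}, i.e. \sigma^{2}<N-1. Similarly, x_{i}-\bar{x}\sim{\cal{N}}\left ( 0,\sigma^{2}\left ( 1-\frac{1}{N}\right )\right ), where \bar{x}=\frac{1}{N}\sum_{j}x_{j}, and the moment generating function of a zero-mean Gaussian, \mathbb{E}\left ( e^{Z}\right )=\exp\left ( \frac{1}{2}{\rm Var}(Z)\right ), gives Eq.4.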

We can see that as N\rightarrow\infty both expressions for the bias correction tend to the correct value of \exp(\sigma^2/2). However, what is noticeable is that at finite N the value of \mathbb{E}\left( B_{1}\right ) has a divergence. Namely, as \sigma^{2} \rightarrow N-1, we have \mathbb{E}\left( B_{1}\right ) \rightarrow \infty, and for \sigma^{2} \ge N-1, \mathbb{E}\left( B_{1}\right ) doesn’t exist. This means that for a finite sample size there are situations where the parametric form of the bias correction will be very, very wrong.

At first sight I was shocked and surprised. The divergence shouldn’t have been there. But it was. My intuition was wrong. I was expecting there to be differences between \mathbb{E}\left ( B_{1}\right) and \mathbb{E}\left ( B_{2}\right ), but very small ones. I’d always thought of any large differences that arise in practice between the two methods as being due to the distributional assumption made – because a real noise process won’t follow a log-normal distribution exactly, we’ll always get some differences. But the derivations above are analytic. They show that there can be a large difference between the two bias-correction estimation methods even when all assumptions about the data have been met.

With hindsight, the reason for the large difference between the two expectations is obvious. The second, \mathbb{E}\left ( B_{2}\right ), involves the expectation of the exponential of a sum of Gaussian random variables, whilst the first involves the expectation of the exponential of a sum of the squares of those same Gaussian random variables. The expectation in Eq.1 is a Gaussian integral and if \sigma^{2} is large enough the coefficient of the quadratic term in the exponent becomes positive, leading to the expectation value becoming undefined.

Even though in hindsight the result I’d just derived was obvious, I still decided to simulate the process to check it. As part of the simulation I also wanted to check the behaviour of the variances of B_{1} and B_{2}, so I needed to derive analytical results for {\rm Var}\left ( B_{1}\right ) and {\rm Var}\left ( B_{2}\right ). These calculations are similar to the evaluation of \mathbb{E}\left ( B_{1}\right ) and \mathbb{E}\left ( B_{2}\right ), so I’ll just state the results below,

{\rm Var}\left ( B_{1}\right)\;=\; \left ( 1 - \frac{2\sigma^{2}}{N-1}\right )^{-\frac{(N-1)}{2}}\;-\;\left ( 1 - \frac{\sigma^{2}}{N-1} \right )^{-(N-1)}  Eq.5

{\rm Var}\left ( B_{2}\right)\;=\; \left ( 1 - \frac{1}{N}\right )\exp \left [ \sigma^{2}\left ( 1 - \frac{2}{N}\right )\right ]\;+\;\frac{1}{N}\exp \left [ 2\sigma^{2}\left ( 1 - \frac{1}{N}\right )\right ]\;-\; \exp \left [ \sigma^{2}\left ( 1 - \frac{1}{N}\right )\right ]  Eq.6

Note that the variance of B_{1} also diverges, but at \frac{\sigma^{2}}{N-1} = \frac{1}{2}, i.e. well before \mathbb{E}\left ( B_{1}\right ) diverges.
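Before running simulations, it is straightforward to evaluate Eq.3 – Eq.6 numerically. The helper functions below are an illustrative sketch (names mine, NumPy assumed) and return NaN wherever the corresponding quantity doesn’t exist:

import numpy as np

# Theoretical values from Eq.3 - Eq.6, returning np.nan where the
# corresponding expectation or variance diverges / does not exist.

def mean_B1(N, sigma2):                                      # Eq.3
    phi = sigma2 / (N - 1)
    return (1.0 - phi) ** (-(N - 1) / 2) if phi < 1.0 else np.nan

def mean_B2(N, sigma2):                                      # Eq.4
    return np.exp(0.5 * sigma2 * (1.0 - 1.0 / N))

def var_B1(N, sigma2):                                       # Eq.5
    phi = sigma2 / (N - 1)
    if phi >= 0.5:
        return np.nan
    return (1.0 - 2.0 * phi) ** (-(N - 1) / 2) - (1.0 - phi) ** (-(N - 1))

def var_B2(N, sigma2):                                       # Eq.6
    return ((1.0 - 1.0 / N) * np.exp(sigma2 * (1.0 - 2.0 / N))
            + np.exp(2.0 * sigma2 * (1.0 - 1.0 / N)) / N
            - np.exp(sigma2 * (1.0 - 1.0 / N)))

# Example: relative over-estimate of the true correction factor by E(B1)
# at N = 20 and sigma^2 = 4.75 (i.e. phi = 0.25); roughly 0.43.
print(mean_B1(20, 4.75) / np.exp(0.5 * 4.75) - 1.0)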

To test Eq.3 – Eq.6 I ran some simulations. First, I introduced a scaled variance \phi = \sigma^{2}/(N-1). The divergence in \mathbb{E}\left ( B_{1}\right ) occurs at \phi = 1, so \phi gives us a way of measuring how problematic the value of \sigma^{2} will be for a given value of N. The divergence in {\rm Var}\left ( B_{1} \right ) occurs at \phi = 0.5. I chose to set N=20 in my simulations. This is quite a small sample, but there are good reasons for this choice that I’ll explain later. For each value of \phi I generated N_{sim} = 100000 sample datasets, each of N log-normal datapoints, and calculated B_{1} and B_{2}. From the simulation sample of B_{1} and B_{2} values I then calculated sample estimates of \mathbb{E}\left ( B_{1}\right ), \mathbb{E}\left ( B_{2}\right ), {\rm Var}\left ( B_{1}\right ), and {\rm Var}\left ( B_{2}\right ). The simulation code is in the Jupyter notebook LogNormalBiasCorrection.ipynb in the GitHub repository here.
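The core of that simulation is short enough to sketch inline. The snippet below is a condensed illustration rather than the notebook code itself (the seed and variable names are mine), shown for a single value of \phi:

import numpy as np

# For one value of phi: draw N_sim log-scale samples of size N, compute B1 and
# B2 for each sample, then compare their averages with Eq.3 and Eq.4.
rng = np.random.default_rng(1234)
N, N_sim, phi = 20, 100_000, 0.25
sigma2 = phi * (N - 1)

x = rng.normal(0.0, np.sqrt(sigma2), size=(N_sim, N))  # log-scale data, mu = 0
s2 = x.var(axis=1, ddof=1)                             # sample variances
B1 = np.exp(0.5 * s2)                                  # parametric factors
B2 = np.exp(x).mean(axis=1) / np.exp(x.mean(axis=1))   # non-parametric factors

true_factor = np.exp(0.5 * sigma2)
print("E(B1)/true:", B1.mean() / true_factor,
      "theory:", (1.0 - phi) ** (-(N - 1) / 2) / true_factor)
print("E(B2)/true:", B2.mean() / true_factor,
      "theory:", np.exp(0.5 * sigma2 * (1.0 - 1.0 / N)) / true_factor)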

Below in Figure 1 I’ve shown a plot of the expectation of B_{1}, relative to the true bias correction factor \exp\left( \frac{1}{2}\sigma^{2}\right ), as a function of \phi. The red line is based on the theoretical calculation in Eq.3, whilst the blue points are simulation estimates,

Figure 1: Plot of theory and simulation estimates of the average relative error made by the parametric bias correction factor, as we increase the variance of the log-normal data.

The agreement between simulation and theory is very good. But even for these small values of \phi we can see the error that B_{1} makes in estimating the correct bias correction factor is significant – a relative error of over 40% at \phi = 0.25. For N=20, the value of \sigma is approximately 2.2 at \phi=0.25. Whilst this is large for the additive log-scale noise, it is not ridiculously large. It shows that the presence of the divergence at \phi = 1 is strongly felt even a long way away from that divergence point.

The error bars on the simulation estimates correspond to \pm 2 standard errors of the mean. In calculating the standard errors I’ve used the theoretical value of {\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right ) (derived from Eq.5), rather than the simulation sample estimate of it. The reason is that once one goes above about N=20 and \phi = 0.25, the variance (over different sets of simulation runs) of the sample estimate of {\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right ) becomes large itself, and so obtaining a stable, accurate simulation estimate of it would require a prohibitively large number of simulations (far greater than 100000). So I used the theoretical result instead. This is also why the plot below is restricted to \phi \le 0.25.

For the range \phi\le 0.25, where the simulation sample size is large enough that we can trust our simulation-based estimates of {\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right ), we can compare them with the theoretical values. This comparison is plotted in Figure 2 below.

Figure 2: Plot of theory and simulation estimates of the variance of the relative error made by the parametric bias correction factor, as we increase the variance of the log-normal data.

We can see that the simulation estimates of {\rm Var}\left ( B_{1}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right ) confirm the accuracy of the theoretical result in Eq.5, and consequently confirm the validity of the standard error estimates in Figure 1.

Okay, so that’s confirmed the ‘surprising’ behaviour of the bias-correction estimate B_{1}, but what about B_{2}? How do simulation estimates of  \mathbb{E}\left ( B_{2}\right ) and {\rm Var}\left ( B_{2}\exp\left ( -\frac{1}{2}\sigma^{2}\right )\right ) compare to the theoretical results in Eq.4 and Eq.6 ? I’ve plotted these in Figures 3 and 4 below.

Figure 3: Plot of theory and simulation estimates of the average relative error made by the non-parametric bias correction factor, as we increase the variance of the log-normal data.

The accuracy of B_{2} is much better than that of B_{1}. By comparison, B_{2} only makes a roughly 10% relative error (an under-estimate) on average at \phi=0.25, compared to the over 40% over-estimate made by B_{1} on average. We can also see that the magnitude of the relative error increases slowly as \phi increases, again in contrast with the rapid increase in the relative error made by B_{1} (in Figure 1).

Figure 4: Plot of theory and simulation estimates of the variance of the relative error made by the non-parametric bias correction factor, as we increase the variance of the log-normal data.

Conclusion

So, the theoretical derivations and analysis I did of B_{1} and B_{2} turned out to be correct. What did I learn from this? Three things. Firstly, even areas of mathematics that you understand well, have worked with for a long time, and are relatively simple in mathematical terms (e.g. just a univariate distribution) have the capacity to surprise you. Secondly, given this capacity to surprise it is always a good idea to check your mathematical hunches, even with a simple calculation, no matter how confident you are in those hunches. Thirdly, don’t leave it 25 years to check your mathematical hunches. The evaluation of those integrals defining \mathbb{E}\left ( B_{1}\right ) and \mathbb{E}\left ( B_{2}\right ) only took me a couple of minutes. If I had calculated those integrals 25 years ago, I wouldn’t be feeling so stupid now.

© 2026 David Hoyle. All Rights Reserved
