Statistics – Hoyle Analytics

Data Science Notes: 5. Trimmed variances and getting MAD

Summary

Estimating the variance of a distribution when we have outliers can create a circular problem, particularly if we want to use the estimated variance in an algorithm to detect the outliers in the first place.
If we are prepared to assume that at most a fraction $\alpha$ of our data are outliers, then we can use a trimmed-variance to estimate the population variance without having to explicitly identify the outliers.
We incorporate a correction factor to make our variance estimate consistent with the parametric distribution we have assumed our data has been drawn from.
The median absolute deviation from the median (MAD) also uses a similar idea and tends to have better performance (efficiency) for outlier-contaminated Normal data.
Calculating the Normal-consistent MAD-based estimate of the standard deviation is ridiculously simple.

Introduction

This is one of those little hacks that I really like. So simple, so easy, and yet it solves a real problem: how do we estimate the standard deviation of a distribution from a sample of data, without being impacted by outliers, when we haven’t yet identified which datapoints are outliers? We may even want to use that estimate of the standard deviation to help detect those outliers – in an appropriate algorithm that takes the sample size into account, of course (see my previous Data Science Notes post What is an outlier?). It looks like we have a circular problem; the outliers affect the estimation of the standard deviation, but we can’t identify and deal with those outliers until we have calculated the standard deviation.

We can drop the data points from the tails of the sample to remove outliers. The trouble is the standard deviation relates to the spread of values in a distribution, i.e. it relates to the tails of the distribution. It looks like we are going to lose signal about the standard deviation if we do this.

If only there was a way to estimate a standard deviation from just the data values around the mean. That is, can we estimate the standard deviation from the data points that have come from the bulk of the distribution and not its tails? There is, if we are prepared to make a parametric assumption about the shape of the distribution. Once we make a parametric assumption about the shape, any datapoint gives us signal about the parameters of the distribution. And once we have signal about the parameters of the distribution we have signal about its standard deviation.

Theory

Let’s use the Normal distribution as an example. We know that the probability density is given by,

$\phi (x)\;=\; \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp \left ( -\frac{1}{2\sigma^{2}}\left ( x - \mu\right )^{2}\right )$ Eq.1

In Eq.1 $\mu$ is the mean of the probability distribution and $\sigma$ is its standard deviation. The central 90% of the probability distribution is given by the interval $\left [ \mu + \sigma x_{5} , \mu - \sigma x_{5} \right ]$, where $x_{5}$ is the 5th centile of the standard Normal distribution (a negative number) and so is given by,

$x_{5} \;=\; -\sqrt{2}\, {\rm erf}^{-1}\left ( 0.9 \right ) \;\simeq\; -1.645$ Eq.2

If we calculate the 2nd moment of $x$ about the mean $\mu$ over that central 90% portion of the probability distribution (measuring $x$ from the mean, i.e. setting $\mu = 0$ without loss of generality) we can write that,

$\frac{1}{\sqrt{2\pi\sigma^{2}}}\int_{\sigma x_{5} }^{-\sigma x_{5}} x^{2}\,\exp \left ( -\frac{x^{2}}{2\sigma^{2}}\right )\, dx\;=\; 0.9\sigma^{2}\left [ 1 + \frac{2}{0.9}\Phi^{-1}\left ( 0.05\right ) \phi \left ( \Phi^{-1}\left ( 0.05 \right )\right ) \right ]$ Eq.3

In Eq.3 $\Phi(z)$ is the CDF of the standard Normal distribution, so $\Phi(z) = \frac{1}{2}{\rm erfc}\left ( -z/\sqrt{2} \right )$, and $\phi$ on the right-hand side is the standard Normal density (Eq.1 with $\mu = 0$ and $\sigma = 1$). The right-hand side of Eq.3 is 0.9 times what we would get if we had an infinite sample of data from a Normal distribution with standard deviation of $\sigma$, removed the upper and lower 5% of the datapoints, and then calculated the sample variance $s^{2}$.
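
As a quick numerical sanity check on Eq.2 and Eq.3, the sketch below evaluates the 5th centile in two ways, using the identity $\Phi^{-1}(p) = \sqrt{2}\,{\rm erf}^{-1}(2p - 1)$, and compares the truncated integral with the closed form. The value `sigma = 2.0` is an arbitrary illustrative choice and the mean is taken to be zero; neither comes from the derivation itself.

```python
import numpy as np
from scipy import integrate, special, stats

# 5th centile of the standard Normal, computed two equivalent ways.
x5_ppf = stats.norm.ppf(0.05)
x5_erf = -np.sqrt(2.0) * special.erfinv(0.9)

sigma = 2.0  # illustrative value only


def integrand(x):
    """x^2 times the Normal(0, sigma^2) density."""
    return x**2 * stats.norm.pdf(x, loc=0.0, scale=sigma)


# Left-hand side of Eq.3: second moment over the central 90% of the density.
lhs, _ = integrate.quad(integrand, sigma * x5_ppf, -sigma * x5_ppf)

# Right-hand side of Eq.3, with phi the standard Normal density.
rhs = 0.9 * sigma**2 * (1.0 + (2.0 / 0.9) * x5_ppf * stats.norm.pdf(x5_ppf))

print(x5_ppf, x5_erf)  # the two values should agree
print(lhs, rhs)        # the two values should agree
```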

Let’s denote our sample of data by the vector $\underline{x}$. Eq.3 tells us that once we have removed the upper and lower 5% of datapoints from our sample $\underline{x}$ and then calculated $s^{2}$, we have,

$s^{2} \simeq \sigma^{2}\left [ 1 + \frac{2}{0.9}\Phi^{-1}\left ( 0.05\right ) \phi \left ( \Phi^{-1}\left ( 0.05 \right )\right ) \right ]$ Eq.4

A simple re-arrangement of Eq.4 gives us a means of estimating $\sigma^{2}$ from our trimmed data sample. We just use,

$\hat{\sigma}^{2}\simeq \frac{s^{2}}{\left [ 1 + \frac{2}{0.9}\Phi^{-1}\left ( 0.05\right ) \phi \left ( \Phi^{-1}\left ( 0.05 \right )\right ) \right ] }$ Eq.5

There you have it. A simple way to estimate the variance (and hence the standard deviation) of a Normal distribution whilst excluding the smallest and largest 5% of values. If we want to exclude $\alpha \times 100\%$ of values in total, i.e. a fraction $\alpha/2$ from each tail, we can easily generalize the formula in Eq.5. I’ll leave it to you as an exercise to do the derivation. I’m just going to quote the final result below,

$\hat{\sigma}^{2}\;=\; \frac{s^{2}}{\left [ 1 + \frac{2}{1 - \alpha}\Phi^{-1}\left ( \frac{1}{2}\alpha\right ) \phi \left ( \Phi^{-1}\left ( \frac{1}{2}\alpha \right )\right ) \right ] }$ Eq.6
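
For anyone who wants to check their working, here is one way the derivation can go. It is only a sketch, and the symbols $z$ and $a$ are introduced here for convenience. By symmetry the trimmed distribution still has mean $\mu$, so only the second moment needs calculating.

```latex
% Standardise: z = (x - \mu)/\sigma, and let a = -\Phi^{-1}(\alpha/2) > 0.
% \phi(z) is the standard Normal density, so \phi'(z) = -z\,\phi(z).
\begin{align*}
\int_{-a}^{a} z^{2}\,\phi(z)\,dz
  &= \Big[ -z\,\phi(z) \Big]_{-a}^{a} + \int_{-a}^{a}\phi(z)\,dz
     && \text{(integration by parts)} \\
  &= -2a\,\phi(a) + (1-\alpha) \\
  &= (1-\alpha)\left[ 1 + \frac{2}{1-\alpha}\,
       \Phi^{-1}\!\left(\tfrac{1}{2}\alpha\right)
       \phi\!\left(\Phi^{-1}\!\left(\tfrac{1}{2}\alpha\right)\right) \right]
     && \text{(using } a = -\Phi^{-1}(\alpha/2),\ \phi(-a) = \phi(a)\text{)}
\end{align*}
% Dividing by the retained probability mass (1 - \alpha) and restoring the
% factor \sigma^{2} gives the variance of the trimmed distribution,
%   s^{2} \simeq \sigma^{2}\left[ 1 + \frac{2}{1-\alpha}\,
%       \Phi^{-1}(\alpha/2)\,\phi(\Phi^{-1}(\alpha/2)) \right],
% and rearranging for \sigma^{2} yields Eq.6.
```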

Practice

That’s the theory. How do we use this simple idea in practice? Like all simple but great ideas someone has already coded it up for you. In Python the trimmed variance of an array a of numbers can be calculated using the trimmed_var function in scipy.stats.mstats, as follows,

			
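from scipy.stats.mstats import trimmed_var

# a is a 1-D array of numbers. Trim 5% from each tail (relative=True means
# the limits are fractions of the sample) and take the variance of the rest.
# The result gets its own name so that the trimmed_var function is not shadowed.
s2 = trimmed_var(a,
                 limits=(0.05, 0.05),
                 inclusive=(1, 1),
                 relative=True,
                 axis=None,
                 ddof=1)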
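
To turn the trimmed variance into an estimate of $\sigma^{2}$ we still need to divide by the correction factor in Eq.6. The sketch below shows one way of doing that. The function names `normal_trim_correction` and `trimmed_variance_estimate` are my own for illustration (they are not part of scipy), and the contaminated sample is made up purely to show the effect.

```python
import numpy as np
from scipy.stats import norm
from scipy.stats.mstats import trimmed_var


def normal_trim_correction(alpha):
    """Denominator of Eq.6 for a total trimmed fraction alpha (alpha/2 per tail)."""
    z = norm.ppf(0.5 * alpha)
    return 1.0 + (2.0 / (1.0 - alpha)) * z * norm.pdf(z)


def trimmed_variance_estimate(a, alpha=0.1):
    """Normal-consistent estimate of the population variance from a trimmed sample."""
    s2 = trimmed_var(np.asarray(a, dtype=float),
                     limits=(0.5 * alpha, 0.5 * alpha),
                     inclusive=(1, 1),
                     relative=True,
                     axis=None,
                     ddof=1)
    return float(s2) / normal_trim_correction(alpha)


# Illustrative data: Normal with true variance 9, plus 0.5% gross outliers.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=3.0, size=100_000)
sample[:500] = 1000.0

print(np.var(sample, ddof=1))             # badly inflated by the outliers
print(trimmed_variance_estimate(sample))  # should be close to 9
```

Because the outliers in this toy sample all sit in one tail, the trimming is slightly asymmetric with respect to the genuine data, so expect the estimate to be a percent or two away from 9 rather than exact.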

		

In the notebook DataScienceNotes5_TrimmedVariances.ipynb in the public GitHub repository github.com/dchoyle/datascience_notes, I have used the trimmed_var function to run a number of simulations. The plot below (Figure 1) shows the average estimated variance, averaged over 1000 simulation datasets, using the formula in Eq.6 and where I have used the scipy trimmed_var function to calculate $s^{2}$ for each simulation sample. The plot shows the average (over simulation datasets) estimated population variance at a number of different sample sizes $N$ and for three different values of $\alpha$ .

Figure 1: Plot of the simulation means for the trimmed-variance estimates against log N.

On the y-axis I’ve actually plotted the ratio of the average estimated variance to the true population variance (with which the data was generated) minus 1, so that it is easier to see how accurate the estimated variance is. A value of zero on the y-axis indicates a perfect estimate. You can see that as the starting sample size $N$ increases, each of the different $\alpha$ curves converges to zero, i.e. a perfect estimate, on average, of the true variance. However, the rate of convergence appears to be different for the different values of $\alpha$. In part, this is illusory, and is because we haven’t adjusted the x-axis for the effective sample size. Consider if our starting sample size was $N=1000$. At $\alpha=0.1$ we are actually estimating the variance from 900 data points. The effective sample size is 900. Whilst for $\alpha=0.4$ the variance is estimated from a trimmed sample consisting of 600 data points. In order to compare like effective sample sizes with like effective sample sizes I simply need to adjust the x-axis values in Fig.1 by $\log(1-\alpha)$. This I have done in Figure 2 below.

Figure 2: Plot of the simulation means of the trimmed-variance estimates plotted against the adjusted sample size.

You can see that in Fig.2 all three different $\alpha$ curves are nearly identical, particularly at the larger values of $N$ . In Fig.2 I have also included error bars on the $\alpha=0.1$ curve. These correspond to $\pm 2$ times the standard errors of the means in the $\alpha=0.1$ curve. I haven’t included the standard errors for the other two curves simply because they will be of similar scale and will make the plot too crowded. From Fig.2, with the standard errors visible we see that there is a bias in the variance estimates at smaller values of $N$ , i.e. the average estimate is systematically different from the true population level. Deriving the mathematical form of the bias is a lengthy and challenging mathematical calculation, so I’m not going to go into it here.
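
If you want to get a feel for this kind of experiment without opening the notebook, a much-reduced version might look like the sketch below. This is not the code from the notebook: the function name `simulate_bias`, the seed, the sample sizes and the two values of $\alpha$ are all my own illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.stats.mstats import trimmed_var


def correction(alpha):
    """Denominator of Eq.6 (total trimmed fraction alpha, alpha/2 per tail)."""
    z = norm.ppf(0.5 * alpha)
    return 1.0 + (2.0 / (1.0 - alpha)) * z * norm.pdf(z)


def simulate_bias(n, alpha, n_sims=1000, sigma=1.0, seed=0):
    """Mean of (estimated variance / true variance) - 1, and its standard error."""
    rng = np.random.default_rng(seed)
    ratios = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.normal(0.0, sigma, size=n)
        s2 = trimmed_var(x, limits=(0.5 * alpha, 0.5 * alpha),
                         inclusive=(1, 1), relative=True, axis=None, ddof=1)
        ratios[i] = float(s2) / correction(alpha) / sigma**2
    bias = ratios.mean() - 1.0
    std_err = ratios.std(ddof=1) / np.sqrt(n_sims)
    return bias, std_err


for n in (50, 200, 1000):
    for alpha in (0.1, 0.4):
        bias, std_err = simulate_bias(n, alpha)
        n_eff = (1.0 - alpha) * n  # effective sample size after trimming
        print(f"N={n:5d} alpha={alpha:.1f} N_eff={n_eff:6.0f} "
              f"bias={bias:+.4f} +/- {2.0 * std_err:.4f}")
```

Plotting `bias` against `np.log(n_eff)` rather than `np.log(n)` is the adjustment of the x-axis by $\log(1-\alpha)$ described above.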

Getting MAD

So far we have been calculating trimmed variances. However, the idea of using the central portion of a data sample to estimate a population quantity is more widely applicable. One of my favourite and most well known applications of this idea is the median absolute deviation, or MAD for short.

MAD is the median absolute deviation of $x$ from some measure of central tendency of $x$ , such as the median or mean. Usually, the default is to use the median as the measure of central tendency, so that the MAD is calculated as the median absolute deviation from the median. That is, we take our sample of data, calculate its median, calculate the absolute difference between each data point and that median, and calculate the median of all those absolute values.
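
Those three steps translate almost word for word into code. The sketch below uses a small made-up array (one gross outlier included) and checks the hand-rolled value against scipy’s `median_abs_deviation`, which by default returns the raw, unscaled MAD about the median.

```python
import numpy as np
from scipy.stats import median_abs_deviation

# Illustrative data with one gross outlier.
x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 50.0])

med = np.median(x)          # step 1: the median of the sample (2.85 here)
abs_dev = np.abs(x - med)   # step 2: absolute deviations from that median
mad = np.median(abs_dev)    # step 3: the median of those absolute deviations

print(med, mad)
print(median_abs_deviation(x))  # same raw MAD, computed by scipy
```

Note how the outlier at 50 has essentially no influence on the result: it only contributes one large absolute deviation, which the final median ignores.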

The great thing about MAD (when using the median central tendency measure) is that for a large sample of data drawn from a Normal distribution, there is a very simple relationship between the expectation value of MAD and the population standard deviation $\sigma$ . That relationship is,

$\mathbb{E}\left ( {\rm MAD} \right ) \;\rightarrow\; \sigma \Phi^{-1} \left ( 3/4 \right )\;\simeq\; 0.6745\sigma\;\;\;,\;\; {\rm as} \;N\rightarrow\infty$ Eq.7

Let’s unpack what Eq.7 is telling us. It says that, on average, the MAD calculated from a large sample of (Normally distributed) data is a simple constant times the true standard deviation. This means we can invert Eq.7 to get an extremely simple way of estimating the population standard deviation that is robust to the presence of outliers (provided they make up less than 50% of the sample). That simple way is,

$\hat{\sigma}\;=\; 1.4826 \times {\rm MAD} \left ( \underline{x} \right )$ Eq. 8
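
The constant 1.4826 is just the reciprocal of the 75th centile of the standard Normal distribution, $1/\Phi^{-1}(3/4)$. The sketch below computes it and checks the large-sample relationship on simulated Normal data; the seed, mean, sample size and `sigma = 2.0` are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import median_abs_deviation, norm

mad_constant = norm.ppf(0.75)       # Phi^{-1}(3/4)
scale_factor = 1.0 / mad_constant   # the factor appearing in Eq.8
print(round(mad_constant, 4), round(scale_factor, 4))  # 0.6745 1.4826

# Large-sample check on Normal data.
rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(5.0, sigma, size=200_000)

raw_mad = median_abs_deviation(x)   # unscaled MAD about the median
print(raw_mad / sigma)              # should be close to 0.6745
print(scale_factor * raw_mad)       # should be close to sigma
```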

Calculating the MAD is such a common task that once again you will find that most programming languages used for numerical analysis will contain a MAD function either as part of the base distribution, or in a commonly used package. In R we can use the mad function which is part of the base R distribution. In Python we can use scipy.stats.median_abs_deviation. In both cases the use of MAD to estimate the standard deviation is such a common use of MAD that both the R and Python functions have the correction factor of 1.4826 built in. In R, it is included by default, meaning that if you call mad, it will return the Normal adjusted estimate of $\sigma$ (also called the Normal consistent estimate). Be aware of this in case what you actually wanted was just the MAD value for a sample of data, rather than the Normal consistent estimate of $\sigma$ calculated from MAD. In Python, for the scipy.stats.median_abs_deviation function we have to explicitly tell it that we want the correction factor applied. We do that by setting the scale argument of the function equal to ‘normal’. The code-snippet below illustrates the scipy.stats.median_abs_deviation function in action.

			
from scipy.stats import median_abs_deviation
mad_sd_estimate = median_abs_deviation(a, scale='normal')
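
To see why this matters, here is a small made-up example in which 2% of an otherwise Normal sample has been replaced by points from a much wider distribution. The contamination model, seed and parameter values are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(1)

# Bulk of the data: Normal with mean 10 and standard deviation 2.
a = rng.normal(10.0, 2.0, size=10_000)
# Replace 2% of the points with draws from a far wider Normal (the "outliers").
a[:200] = rng.normal(10.0, 50.0, size=200)

naive_sd = np.std(a, ddof=1)
mad_sd_estimate = median_abs_deviation(a, scale='normal')

print(naive_sd)         # pulled well above 2 by the contamination
print(mad_sd_estimate)  # stays close to 2
```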

For the simulated data used to produce Fig.1 and Fig.2, I also calculated the MAD-based estimate of $\sigma$. The plot below (Figure 3) shows the simulation average of the Normal-consistent MAD-based estimates of the true standard deviation $\sigma$. The simulation averages are plotted against $\log N$. Again, I have actually plotted the ratio of the simulation average of $\hat{\sigma}$ to $\sigma$ then minus 1. The dashed line is at a value of zero on the y-axis and is used to indicate when we have a bias in our estimate. The points are the means of the simulation results, i.e. the mean of the estimates of $\sigma$ over all the simulation datasets for the particular value of $N$. The error bars correspond to $\pm$ 2 standard errors of the simulation means, with the standard error simply estimated from the variance over the simulation results.

Figure 3: Plot of the simulation means of the MAD-based estimates of the standard deviation, plotted against log N.

From Fig.3 we can see that the accuracy of the MAD-based estimate of $\sigma$ is very good. This would appear (from the plot) to be a consistent estimator. Again we can see evidence of the bias in the estimator at the smaller values of $N$. However, we can’t yet do a comparison with the trimmed-variance estimator, as we are estimating a different quantity. The trimmed-variance estimates the underlying population variance, whilst the MAD-based estimator estimates the underlying population standard deviation. Fortunately, when I computed the MAD-based estimates, I also stored the effective estimates of the variance, so we can do a like-for-like comparison with the trimmed-variance estimator. The MAD-based estimates of $\sigma^{2}$ are shown in the plot in Figure 4 below.

Figure 4: Plot of Normal-consistent MAD-based estimates of the population variance plotted against log N.

The accuracy of the MAD-based estimator appears to be, at least for these simulation results, superior to the trimmed-variance estimator, with the errors being around $\pm$ 1%. The number of simulation runs is not sufficient in this set of results to properly detect the bias even for the smallest values of $N$. However, given the plot in Fig.3, we would expect that if we increased the number of simulation runs appropriately, we would be able to resolve this, i.e. it is simply that the standard errors of the mean of $\hat{\sigma}^{2}/\sigma^{2}$ are larger than the standard errors of the mean of $\hat{\sigma}/\sigma$ (for this MAD-based estimator).

Other distributions

So far we have been discussing outlier-contaminated Normal data. But what happens if we are certain our data wasn’t drawn from a Normal distribution with an additional outlier process on top? What happens if we think our data is Gamma-distributed with outliers? We can still apply the same ideas, but it can get trickier and also more computationally intensive. Specifically, we need to distinguish two cases:

The distribution we assume is still symmetric about its mean – in this case calculating the trimmed variance with an appropriate correction factor, or calculating MAD with an appropriate correction factor, is still relatively straightforward. The correction factors may be expressible in terms of common special functions or, in the worst case, you may have to evaluate them numerically once you have reduced the calculation down to some canonical form.
The assumed distribution is not symmetric about its mean – in these circumstances it can become a lot trickier. Obviously, now things such as the skewness of the distribution affect the value of MAD or the trimmed-variance value, and so calculation of the correction factors is now affected by the shape of the distribution. This means we typically need to estimate a ‘shape’ parameter of the parametric distribution as well as estimating the ‘scale’ parameter of the distribution, which is the thing we are predominantly interested in. This will mean solving for shape and scale parameters simultaneously, and we are probably going to have to do that root finding numerically, as it is unlikely we can do it analytically. Calculating the appropriate adjustment factors is usually possible in these circumstances, but is typically more computationally intensive and/or we have to introduce additional approximations.
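
As a rough illustration of evaluating a MAD correction factor numerically, the sketch below finds the population MAD of a distribution by root finding and divides the population standard deviation by it. The helper names `population_mad` and `mad_scale_factor` are hypothetical, and the Gamma example assumes the shape parameter is already known, which sidesteps the harder problem described above of solving for shape and scale simultaneously.

```python
import numpy as np
from scipy import optimize, stats


def population_mad(dist):
    """Population MAD about the median for a frozen scipy.stats continuous distribution."""
    med = dist.median()

    # The MAD m solves P(|X - med| <= m) = 1/2.
    def excess(m):
        return dist.cdf(med + m) - dist.cdf(med - m) - 0.5

    # Upper bracket: wide enough that [med - upper, med + upper] holds >= 99.8% of the mass.
    upper = dist.ppf(0.999) - dist.ppf(0.001)
    return optimize.brentq(excess, 0.0, upper)


def mad_scale_factor(dist):
    """Factor k such that k * MAD equals the population standard deviation."""
    return dist.std() / population_mad(dist)


print(mad_scale_factor(stats.norm()))        # recovers the Normal factor, about 1.4826
print(mad_scale_factor(stats.laplace()))     # symmetric case: sqrt(2) / ln(2)
print(mad_scale_factor(stats.gamma(a=2.0)))  # asymmetric case, shape a=2 assumed known
```

For the Gamma case the factor changes with the shape parameter, which is exactly why an unknown shape has to be estimated alongside the scale.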

Conclusions

In many situations we want to estimate the standard deviation of a distribution from a sample of data, but we know there are outliers present in the sample.
We can’t use our usual estimate of the standard deviation to detect the outliers, because that estimate is affected by the outliers.
By making an assumption about the parametric form of the distribution whose standard deviation we are trying to estimate, we can estimate the standard deviation using data from any part of the distribution. This means we can use just the middle part of the sample data.
If we throw away a fraction $\alpha$ of the data, we can easily compute the correction factor needed, based upon our parametric assumption of the distribution shape.
In R and Python there are convenient functions for calculating the trimmed-variance of a sample and also the Normal-consistent MAD-based estimate of the population standard deviation.
Calculation of trimmed-variance and MAD-based estimates of variance is possible for distributions other than the Normal but will be more computationally intensive, particularly for asymmetric distributions.

The Royal Statistical Society Conference and Data Science

This year the UK’s Royal Statistical Society (RSS) held its annual international conference in Aberdeen between the 12th and 15th September 2022.

You may think that the society’s main conference doesn’t hold that much relevance for you as a Data Scientist. Yes, you have an interest in Data Science with a statistical flavour, but surely the main conference is all clinical trials analysis and the like, isn’t it? My job over the next 980 words is to persuade you otherwise.

Statistics is about the whole data life cycle

Go to the RSS website or look at an official email from the RSS and you’ll see that the RSS strapline is “Data | Evidence | Decisions”. This accurately reflects the breadth of topics covered at the conference – in the session talks, the posters, and the plenary lectures. Statistics is about data, and modern statistics now concerns itself with all aspects related to data – how it is collected, how it is analysed, how models are built from that data, how inferences are made from those models, and how decisions are made off the back of those inferences. A modern general statistics conference now has to reflect the full end-to-end lifecycle of data and also the computational and engineering workflows that go with it. This year’s RSS conference did just that.

A Strong Data Science focus

Over the three main days of the conference there were 7 specific sessions dedicated to Data Science, totalling 8hrs and 20mins of talks. You can see from the full list below the breadth covered in the Data Science sessions.

Novel applications and Data Sets
Introduction to MLOps
The secret sauce of Open Source
Data Science for Health Equity
The UK’s future data research infrastructure
Epidemiological applications of Data Science
Algorithmic bias and ethical considerations in Data Science

On top of this there were Data Science topics in the 8 rapid fire talk sessions and in the 110 accepted posters. Example Data Science related topics included MLOps, Decentralized finance, Genetic algorithms, Kernels for optimal compression of distributions, Changepoint detection, Quantifying the Shannon entropy of a histogram, Digital Twins, Joint node degree estimation in Erdos-Renyi networks, Car club usage prediction, and Deep hierarchical classification of crop types from satellite images.

A growing Data Science presence

I’ve been involved with the conference board this year and last (Manchester 2021) and my perception is that the size of the conference is increasing, in terms of number of submissions and attendees, the range of topics, and the amount of Data Science represented. However, I only have two datapoints here. One of those was just as the UK was coming out of its first Covid-19 lockdown, so will probably not provide a representative baseline. So I’m not going to stick my neck out too much here, but I do expect further increases in the amount of Data Science presence at next year’s conference.

Other relevant sessions

If like me you work primarily as a Data Scientist in a commercial environment, then there were also many talks from other Sections of the RSS that were highly relevant. The Business, Industry and Finance section had talks on Explainable AI, Novel Applications of Statistics in Business, and Democratisation of Statistics in GlaxoSmithKline, whilst the Professional Development section had talks on Linked Open Data, programming in R and Python, and the new Quarto scientific publishing system.

The Future of the Data Science Profession

Of particular relevance to Data Scientists was the Professional Development section’s talk on the new Alliance for Data Science Professionals accreditations of which the RSS is part. The session walked through the various paths to accreditation and the collaborative nature of the application process. This was backed up by a Data Science ‘Beer and Pizza’ event hosted by Brian Tarran (former Significance magazine editor and now RSS Head of Data Science Platform) and Ricky McGowan (RSS Head of Standards and Corporate Relations) who both explained some of the RSS long-term plans for Data Science.

Diversity of topics across the whole conference

Diversity of topics was a noticeable theme emerging from the conference as a whole, not just in the Data Science and commercial statistics streams. For me, this reflects the broader desire of the RSS to embrace Data Scientists and any practitioners who are involved with analysing and handling data. It reflects a healthy antidote to the ‘Two cultures of statistical modelling‘ divide identified and discussed by Leo Breiman many years ago.

For example, the range of plenary talks was as impressive as the diversity of topics in the various sessions. Like many Data Scientists my original background was a PhD in Theoretical Physics. So, a talk from Ewain Gwynne on Random Surfaces and Liouville Quantum Gravity – see picture below – took me back 30 years and also gave me an enjoyable update on what has happened in the field in those intervening years.

Ewain Gwynne talking about Random Surfaces and Liouville Quantum Gravity.

Other plenary highlights for me were Ruth King’s Barnett lecture on statistical ecology and Adrian Raftery’s talk on the challenges of forecasting world populations out to the year 2100 and as far as 2300 – see below.

Adrian Raftery talking about Bayesian Demography.

A friendly conference

The conference is not a mega-conference. We’re not talking NeurIPS or ICML. It was around 600 attendees – big enough not to be too insular and focused only on one or two topics, but still small enough to be welcoming, friendly and very sociable. There were social events on every evening of the conference. And to top it all, it was even sunny in Aberdeen for the whole week.

I also got to play pool against the person who led the UK’s COVID-19 dashboard work, reporting the UK government’s official daily COVID-19 stats to the general public. I lost 2-1. I now hold a grudge.

Next year – Harrogate 2023

Next year’s conference is in Harrogate, 4th – 7th September 2023. I will be going. Between now and then I will be practising my pool for a revenge match. I will also be involved with the conference board again, helping to shape the Data Science content. I can promise a wide range of Data Science contributions and talks on other statistical topics Data Scientists will find interesting. I can’t promise sunshine, but that’s Yorkshire for you.

Hoyle Analytics

Tag: Statistics

Data Science Notes: 5. Trimmed variances and getting MAD


The Royal Statistical Society Conference and Data Science

