Summary
- Estimating the variance of a distribution when we have outliers can create a circular problem, particularly if we want to use the estimated variance in an algorithm to detect the outliers in the first place.
- If we are prepared to assume that at most a fraction $\alpha$ of our data are outliers, then we can use a trimmed variance to estimate the population variance without having to explicitly identify the outliers.
- We incorporate a correction factor to make our variance estimate consistent with the parametric distribution we have assumed our data has been drawn from.
- The median absolute deviation from the median (MAD) also uses a similar idea and tends to have better performance (efficiency) for outlier-contaminated Normal data.
- Calculating the Normal-consistent MAD-based estimate of the standard deviation is ridiculously simple.
Introduction
This is one of those little hacks that I really like. So simple, so easy, and yet it solves a real problem: how do we estimate the standard deviation of a distribution from a sample of data, without being affected by outliers, when we haven’t yet identified which datapoints are outliers? We may even want to use that estimate of the standard deviation to help detect those outliers – in an appropriate algorithm that takes the sample size into account of course (see my previous Data Science Notes post What is an outlier?). The problem is circular: the outliers affect the estimation of the standard deviation, but we can’t identify and deal with those outliers until we have calculated the standard deviation.
We can drop the data points from the tails of the sample to remove outliers. The trouble is the standard deviation relates to the spread of values in a distribution, i.e. it relates to the tails of the distribution. It looks like we are going to lose signal about the standard deviation if we do this.
If only there was a way to estimate a standard deviation from just the data values around the mean. That is, can we estimate the standard deviation from the data points that have come from the bulk of the distribution and not its tails? There is, if we are prepared to make a parametric assumption about the shape of the distribution. Once we make a parametric assumption about the shape, any datapoint gives us signal about the parameters of the distribution. And once we have signal about the parameters of the distribution we have signal about its standard deviation.
Theory
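Let’s use the Normal distribution as an example. We know that the probability density is given by,
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$
Eq.1
In Eq.1 $\mu$ is the mean of the probability distribution and $\sigma$ is its standard deviation. The part of the probability distribution that corresponds to the central 90% of the probability is given by the interval $[\mu + z_{0.05}\sigma,\ \mu - z_{0.05}\sigma]$, where $z_{0.05}$ (a negative number) corresponds to the 5th centile of the standard Normal distribution and so is given by,
$$\int_{-\infty}^{z_{0.05}} \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}\, du = 0.05$$
Eq.2
If we calculate the 2nd moment of $x$ about the mean $\mu$ over that central 90% portion of the probability distribution we can write that,
$$\int_{\mu + z_{0.05}\sigma}^{\mu - z_{0.05}\sigma} (x-\mu)^{2}\, p(x)\, dx = \sigma^{2}\left[1 - 2\Phi(z_{0.05}) + \sqrt{\frac{2}{\pi}}\, z_{0.05}\, e^{-z_{0.05}^{2}/2}\right]$$
Eq. 3
In Eq.3 $\Phi$ is the CDF of the standard Normal distribution and so $\Phi(z_{0.05}) = 0.05$, which makes the first two terms in the square brackets equal to 0.9. The right-hand side of Eq.3 is 0.9 times what we would get if we had an infinite sample of data from a Normal distribution with standard deviation of $\sigma$, removed the upper and lower 5% of the datapoints, and then calculated the sample variance $s^{2}$ of what remained.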
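As a quick sanity check on this truncated second moment, the sketch below integrates $(x-\mu)^{2}p(x)$ numerically over the central 90% interval and compares it with the closed form $\sigma^{2}\left[0.9 + \sqrt{2/\pi}\, z\, e^{-z^{2}/2}\right]$, where $z$ is the 5th centile of the standard Normal distribution (the closed form follows from integration by parts). The values of `mu` and `sigma` are arbitrary choices of mine for illustration.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

mu, sigma = 2.0, 3.0       # arbitrary illustrative values (assumptions)
z = norm.ppf(0.05)         # 5th centile of the standard Normal (negative)

# Numerically integrate (x - mu)^2 * p(x) over the central 90% interval.
lower, upper = mu + z * sigma, mu - z * sigma
numeric, _ = integrate.quad(
    lambda x: (x - mu) ** 2 * norm.pdf(x, loc=mu, scale=sigma), lower, upper
)

# Closed form: sigma^2 * [0.9 + sqrt(2/pi) * z * exp(-z^2 / 2)].
closed_form = sigma**2 * (0.9 + np.sqrt(2.0 / np.pi) * z * np.exp(-0.5 * z**2))

print(numeric, closed_form)             # the two values should agree
print(closed_form / (0.9 * sigma**2))   # trimmed variance as a fraction of sigma^2
```

The last line shows how much smaller the 90% trimmed variance is than the full variance, which is exactly what a correction factor has to undo.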
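Let’s denote our sample of data by the vector $\mathbf{x}$. Eq.3 tells us that once we have removed the upper and lower 5% of datapoints from our sample $\mathbf{x}$ and then calculated the sample variance $s^{2}$ of what remains, we have, for a large sample,
$$0.9\, s^{2} \approx \sigma^{2}\left[0.9 + \sqrt{\frac{2}{\pi}}\, z_{0.05}\, e^{-z_{0.05}^{2}/2}\right]$$
Eq.4
Here $z_{0.05} \approx -1.645$ is the 5th centile of the standard Normal distribution. A simple re-arrangement of Eq.4 gives us a means of estimating $\sigma^{2}$, and hence $\sigma$, from our trimmed data sample. We just use,
$$\hat{\sigma}^{2} = \frac{0.9\, s^{2}}{0.9 + \sqrt{2/\pi}\, z_{0.05}\, e^{-z_{0.05}^{2}/2}}$$
Eq.5
There you have it. A simple way to estimate the standard deviation of a Normal distribution whilst excluding the smallest and largest 5% of values. If we want to exclude a fraction $\alpha$ of values ($\alpha/2$ from each tail), we can easily generalize the formula in Eq.5. I’ll leave it to you as an exercise to do the derivation. I’m just going to quote the final result below,
$$\hat{\sigma}^{2} = \frac{(1-\alpha)\, s^{2}}{(1-\alpha) + \sqrt{2/\pi}\, z_{\alpha/2}\, e^{-z_{\alpha/2}^{2}/2}}\ ,\qquad \Phi(z_{\alpha/2}) = \frac{\alpha}{2}$$
Eq.6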
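To make the general result concrete, here is a minimal sketch of the estimator, assuming a total fraction $\alpha$ is trimmed ($\alpha/2$ from each tail) and the data are Normal apart from the outliers. The helper name `normal_consistent_trimmed_var` and the simulated data are my own illustrative choices, not part of scipy. Only `trimmed_var` and `norm` come from scipy.

```python
import numpy as np
from scipy.stats import norm
from scipy.stats.mstats import trimmed_var


def normal_consistent_trimmed_var(a, alpha=0.1):
    """Estimate a Normal population variance from the central 1 - alpha of a sample.

    Hypothetical helper (not part of scipy): trims alpha / 2 from each tail,
    then rescales the trimmed variance by the Normal correction factor.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    z = norm.ppf(alpha / 2.0)  # lower trimming point in standard units (negative)
    kept = 1.0 - alpha
    correction = kept / (kept + np.sqrt(2.0 / np.pi) * z * np.exp(-0.5 * z**2))
    s2 = trimmed_var(a, limits=(alpha / 2.0, alpha / 2.0), inclusive=(1, 1),
                     relative=True, axis=None, ddof=1)
    return float(correction * s2)


rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=2.0, size=100_000)   # true variance is 4
contaminated = clean.copy()
contaminated[:1000] = 100.0       # 1% gross outliers in the upper tail
contaminated[1000:2000] = -100.0  # 1% gross outliers in the lower tail

est_clean = normal_consistent_trimmed_var(clean, alpha=0.1)
est_contaminated = normal_consistent_trimmed_var(contaminated, alpha=0.1)
naive = np.var(contaminated, ddof=1)
print(est_clean, est_contaminated, naive)
```

On the contaminated sample the corrected estimate should sit somewhat above the true variance, because the outliers use up part of the trimming allowance and fewer genuine tail points are removed than the correction factor assumes. It should still be far closer to the truth than the naive sample variance.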
Practice
That’s the theory. How do we use this simple idea in practice? Like all simple but great ideas someone has already coded it up for you. In Python the trimmed variance of an array a of numbers can be calculated using the trimmed_var function in scipy.stats.mstats, as follows,
from scipy.stats.mstats import trimmed_var
# Trim 5% from each tail, then take the sample variance of what remains
s2_trimmed = trimmed_var(a, limits=(0.05, 0.05), inclusive=(1, 1), relative=True, axis=None, ddof=1)
In the notebook DataScienceNotes5_TrimmedVariances.ipynb in the public GitHub repository github.com/dchoyle/datascience_notes, I have used the trimmed_var function to run a number of simulations. The plot below (Figure 1) shows the average estimated variance, averaged over 1000 simulation datasets, using the formula in Eq.6 and where I have used the scipy trimmed_var function to calculate the trimmed variance $s^{2}$ for each simulation sample. The plot shows the average (over simulation datasets) estimated population variance at a number of different sample sizes $N$ and for three different values of the trimmed fraction $\alpha$.

On the y-axis I’ve actually plotted the ratio of the average estimated variance to the true population variance (with which the data was generated) minus 1, so that it is easier to see how accurate the estimated variance is. A value of zero on the y-axis indicates a perfect estimate. You can see that as the starting sample size increases, each of the different $\alpha$ curves converges to zero, i.e. a perfect estimate, on average, of the true variance. However, the rate of convergence appears to be different for the different values of $\alpha$. In part, this is illusory, and is because we haven’t adjusted the x-axis for the effective sample size. Consider a starting sample size of $N = 1000$. At $\alpha = 0.1$ (5% trimmed from each tail) we are actually estimating the variance from 900 data points. The effective sample size is 900. Whilst for $\alpha = 0.4$ the variance is estimated from a trimmed sample consisting of 600 data points. In order to compare like effective sample sizes with like effective sample sizes I simply need to adjust the x-axis values in Fig.1 by a factor of $1 - \alpha$. This I have done in Figure 2 below.

You can see that in Fig.2 all three different curves are nearly identical, particularly at the larger sample sizes. In Fig.2 I have also included error bars on one of the curves. These are based on the standard errors of the means for that curve. I haven’t included the standard errors for the other two curves simply because they will be of similar scale and will make the plot too crowded. From Fig.2, with the standard errors visible, we see that there is a bias in the variance estimates at smaller sample sizes, i.e. the average estimate is systematically different from the true population level. Deriving the mathematical form of the bias is a lengthy and challenging mathematical calculation, so I’m not going to go into it here.
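If you want to see the finite-sample behaviour for yourself, the sketch below runs a much smaller simulation along the same lines. It is not the code from the notebook. The helper `corrected_trimmed_var`, the sample sizes, the two trimmed fractions and the 200 repetitions are all assumptions chosen to keep the run time short.

```python
import numpy as np
from scipy.stats import norm
from scipy.stats.mstats import trimmed_var


def corrected_trimmed_var(a, alpha):
    # Normal-consistent trimmed variance, trimming alpha / 2 from each tail.
    z = norm.ppf(alpha / 2.0)
    kept = 1.0 - alpha
    correction = kept / (kept + np.sqrt(2.0 / np.pi) * z * np.exp(-0.5 * z**2))
    s2 = trimmed_var(a, limits=(alpha / 2.0, alpha / 2.0), inclusive=(1, 1),
                     relative=True, axis=None, ddof=1)
    return float(correction * s2)


rng = np.random.default_rng(1)
true_var = 1.0
n_sims = 200                      # fewer than the 1000 used for the figures
sample_sizes = [50, 500, 2000]    # illustrative choices
alphas = [0.1, 0.4]               # total fraction trimmed

relative_error = {}
for alpha in alphas:
    for n in sample_sizes:
        estimates = [
            corrected_trimmed_var(rng.normal(0.0, np.sqrt(true_var), n), alpha)
            for _ in range(n_sims)
        ]
        # Same quantity as on the y-axis: mean estimate / true variance - 1.
        relative_error[(alpha, n)] = np.mean(estimates) / true_var - 1.0
        n_eff = int(round(n * (1.0 - alpha)))   # effective sample size
        print(f"alpha={alpha}, N={n}, N_eff={n_eff}, "
              f"relative error={relative_error[(alpha, n)]:+.4f}")
```

Comparing rows with similar `N_eff` rather than similar `N` is the adjustment made when going from Fig.1 to Fig.2.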
Getting MAD
So far we have been calculating trimmed variances. However, the idea of using the central portion of a data sample to estimate a population quantity is more widely applicable. One of my favourite and most well known applications of this idea is the median absolute deviation, or MAD for short.
MAD is the median absolute deviation of the data $\mathbf{x}$ from some measure of central tendency of $\mathbf{x}$, such as the median or mean. Usually, the default is to use the median as the measure of central tendency, so that the MAD is calculated as the median absolute deviation from the median. That is, we take our sample of data, calculate its median, calculate the absolute difference between each data point and that median, and calculate the median of all those absolute values.
The great thing about MAD (when using the median central tendency measure) is that for a large sample of data drawn from a Normal distribution, there is a very simple relationship between the expectation value of MAD and the population standard deviation $\sigma$. That relationship is,
$$\mathbb{E}\left[\mathrm{MAD}\right] \approx \Phi^{-1}\!\left(\tfrac{3}{4}\right)\sigma \approx 0.6745\,\sigma$$
Eq.7
Here $\Phi^{-1}$ is the inverse CDF (quantile function) of the standard Normal distribution. Let’s unpack what Eq.7 is telling us. It says that, on average, the MAD calculated from a sample of (Normally distributed) data is a simple constant times the true standard deviation. This means we can invert Eq. 7 to get an extremely simple way of estimating the population standard deviation that is robust to the presence of outliers (provided they make up less than 50% of the sample). That simple way is,
$$\hat{\sigma} = \frac{\mathrm{MAD}}{\Phi^{-1}(3/4)} \approx 1.4826 \times \mathrm{MAD}$$
Eq. 8
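The recipe is short enough to write out by hand. The sketch below computes the constant $1/\Phi^{-1}(3/4)$, follows the MAD steps described earlier, and checks the result against scipy's built-in scaling. The simulated sample (mean 10, standard deviation 3) is an arbitrary choice of mine for illustration.

```python
import numpy as np
from scipy.stats import median_abs_deviation, norm

# Normal-consistency constant: 1 / Phi^{-1}(3/4), roughly 1.4826.
k = 1.0 / norm.ppf(0.75)

rng = np.random.default_rng(2)
a = rng.normal(loc=10.0, scale=3.0, size=50_001)   # illustrative sample, true sd is 3

# Step by step: median, absolute deviations from it, then their median.
med = np.median(a)
mad = np.median(np.abs(a - med))
sigma_hat = k * mad

# scipy applies the same scaling when scale='normal' is requested.
sigma_hat_scipy = median_abs_deviation(a, scale='normal')
print(k, mad, sigma_hat, sigma_hat_scipy)
```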
Calculating the MAD is such a common task that once again you will find that most programming languages used for numerical analysis will contain a MAD function either as part of the base distribution, or in a commonly used package. In R we can use the mad function which is part of the base R distribution. In Python we can use scipy.stats.median_abs_deviation. Using MAD to estimate the standard deviation is so common that both the R and Python functions have the correction factor of 1.4826 built in. In R, it is included by default, meaning that if you call mad, it will return the Normal adjusted estimate of $\sigma$ (also called the Normal consistent estimate). Be aware of this in case what you actually wanted was just the MAD value for a sample of data, rather than the Normal consistent estimate of $\sigma$ calculated from MAD. In Python, for the scipy.stats.median_abs_deviation function we have to explicitly tell it that we want the correction factor applied. We do that by setting the scale argument of the function equal to 'normal'. The code-snippet below illustrates the scipy.stats.median_abs_deviation function in action.
from scipy.stats import median_abs_deviation
# scale='normal' returns the Normal-consistent estimate (raw MAD times roughly 1.4826)
mad_sd_estimate = median_abs_deviation(a, scale='normal')
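To see the robustness in action, here is a small sketch that contaminates a Normal sample with gross outliers and compares the ordinary sample standard deviation with the MAD-based estimate. The sample size, the 2% contamination rate and the outlier value of 50 are illustrative assumptions.

```python
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(3)
true_sd = 2.0
a = rng.normal(loc=0.0, scale=true_sd, size=10_000)
a[:200] = 50.0   # replace 2% of the sample with gross outliers (illustrative choice)

naive_sd = np.std(a, ddof=1)                       # inflated by the outliers
mad_sd = median_abs_deviation(a, scale='normal')   # only slightly affected
print(naive_sd, mad_sd)
```

The MAD-based estimate is not completely immune. One-sided contamination shifts the median a little and so nudges the estimate upwards, but the effect is small compared with what happens to the ordinary standard deviation.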
For the simulated data used to produce Fig.1 and Fig.2, I also calculated the MAD-based estimate of $\sigma$. The plot below (Figure 3) shows the simulation average of the Normal consistent MAD-based estimates $\hat{\sigma}$ of the true standard deviation $\sigma$. The simulation averages are plotted against the sample size $N$. Again, I have actually plotted the ratio of the simulation average of $\hat{\sigma}$ to $\sigma$, minus 1. The dashed line is at a value of zero on the y-axis and is used to indicate when we have a bias in our estimate. The points are the means of the simulation results, i.e. the mean of the estimates of $\sigma$ over all the simulation datasets for the particular value of $N$. The error bars correspond to $\pm 2$ standard errors of the simulation means, with the standard error simply estimated from the variance over the simulation results.

From Fig.3 we can see that the accuracy of the MAD-based estimate of $\sigma$ is very good. This would appear (from the plot) to be a consistent estimator. Again we can see evidence of the bias in the estimator at the smaller values of $N$. However, we can’t yet do a comparison with the trimmed-variance estimator, as we are estimating a different quantity. The trimmed-variance estimates the underlying population variance, whilst the MAD-based estimator estimates the underlying population standard deviation. Fortunately, when I computed the MAD-based estimates, I also stored the effective estimates of the variance, so we can do a like-for-like comparison with the trimmed-variance estimator. The MAD-based estimates of $\sigma^{2}$ are shown in the plot in Figure 4 below.

The accuracy of the MAD-based estimator appears to be, at least for these simulation results, superior to the trimmed-variance estimator, with the errors being around 1%. The number of simulation runs is not sufficient in this set of results to properly detect the bias even for the smallest values of $N$. However, given the plot in Fig.3, we would expect that if we increased the number of simulation runs appropriately, we would be able to resolve this, i.e. it is simply that the standard errors of the mean of $\hat{\sigma}^{2}$ are larger than the standard errors of the mean of $\hat{\sigma}$ (for this MAD-based estimator).
Other distributions
So far we have been discussing outlier-contaminated Normal data. But what happens if we are certain our data wasn’t drawn from a Normal distribution with an additional outlier process on top? What happens if we think our data is Gamma-distributed with outliers? We can still apply the same ideas, but it can get trickier and also more computationally intensive. Specifically, we need to distinguish two cases:
- The distribution we assume is still symmetric about its mean – in this case calculating the trimmed variance with an appropriate correction factor, or calculating MAD with an appropriate correction factor, is still relatively straightforward. The correction factors may not be expressible in terms of common special functions, and in the worst case you may have to evaluate them numerically, once you have reduced the calculation down to some canonical form.
- The assumed distribution is not symmetric about its mean – in these circumstances it can become a lot trickier. Obviously, now things such as the skewness of the distribution affect the value of MAD or the trimmed-variance value, and so calculation of the correction factors is now affected by the shape of the distribution. This means we typically need to estimate a ‘shape’ parameter of the parametric distribution as well as estimating the ‘scale’ parameter of the distribution, which is the thing we are predominantly interested in. This will mean solving for shape and scale parameters simultaneously, and we are probably going to have to do that root finding numerically, as it is unlikely we can do it analytically. Calculating the appropriate adjustment factors is usually possible in these circumstances, but is typically more computationally intensive and/or we have to introduce additional approximations.
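As a rough illustration of both cases, the sketch below evaluates the MAD correction factor (population standard deviation divided by population MAD) numerically for a given scipy.stats distribution. The helper `mad_correction_factor` is hypothetical, and the Laplace and Gamma examples are my own choices. It takes the shape parameter as known, so it does not address the harder problem of estimating shape and scale together from contaminated data.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq


def mad_correction_factor(dist):
    """Ratio sd / MAD for a continuous scipy.stats frozen distribution.

    Hypothetical helper: the population MAD d solves
    P(|X - median| <= d) = 1/2, which is found here by root finding.
    """
    med = dist.median()

    def coverage_minus_half(d):
        return dist.cdf(med + d) - dist.cdf(med - d) - 0.5

    # Any d at least this large covers more than half of the probability mass.
    d_max = dist.ppf(0.999) - dist.ppf(0.001)
    mad = brentq(coverage_minus_half, 0.0, d_max)
    return dist.std() / mad


print(mad_correction_factor(stats.norm()))       # the familiar 1.4826 for the Normal
print(mad_correction_factor(stats.laplace()))    # symmetric: sqrt(2) / ln(2) analytically
# Asymmetric case: the factor changes with the Gamma shape parameter.
for shape in (1.0, 2.0, 10.0):
    print(shape, mad_correction_factor(stats.gamma(a=shape)))
```

For the Gamma distribution the printed factor varies with the shape parameter, which is why the shape has to be pinned down before the scale can be corrected.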
Conclusions
- In many situations we want to estimate the standard deviation of a distribution from a sample of data, but we know there are outliers present in the sample.
- We can’t use our usual estimate of the standard deviation to detect the outliers, because that estimate is affected by the outliers.
- By making an assumption about the parametric form of the distribution whose standard deviation we are trying to estimate, we can estimate the standard deviation using data from any part of the distribution. This means we can use just the middle part of the sample data.
- If we throw away a fraction $\alpha$ of the data, we can easily compute the correction factor needed, based upon our parametric assumption of the distribution shape.
- In R and Python there are convenient functions for calculating the trimmed-variance of a sample and also the Normal-consistent MAD-based estimate of the population standard deviation.
- Calculation of trimmed-variance and MAD-based estimates of variance is possible for distributions other than the Normal but will be more computationally intensive, particularly for asymmetric distributions.
© 2026 David Hoyle. All Rights Reserved




