The $n-1$ sample variance estimation problem intuition
I’ve been recently brushing up on statistics and have rediscovered an interesting problem I never fully understood: why do we divide by $n - 1$ in the unbiased variance estimator?
If you have no idea about what I’m talking about, here’s a summary, based on the excellent video from Khan Academy https://www.youtube.com/watch?v=KkaU2ur3Ymw&vl=en.
The problem¶
Context : we have a sample of datapoints (size $n$) from a larger population (size $N$) and we want to estimate the population statistics from this sample.
The population has true mean $\mu$, which we estimate with the sample mean
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i. $$
Regarding dispersion, the population has variance $\sigma^2$, which we can estimate either with the biased estimator of the variance
$$ S_n^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2, $$
or with the unbiased estimator
$$ S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. $$
The interesting part of the video comes in after this introduction.
Suppose we have a population of $N=14$. Let’s then generate samples of size $n=3$, represent the sample mean and compare it with the population mean.
Generating samples from a population and analyzing the effect on distances¶
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
# let’s generate the population
N = 14
np.random.seed(12345)
population = np.random.uniform(low=0, high=10, size=N)
population
array([9.29616093, 3.16375555, 1.83918812, 2.04560279, 5.67725029, 5.95544703, 9.6451452 , 6.53177097, 7.48906638, 6.53569871, 7.47714809, 9.61306736, 0.08388298, 1.06444377])
# let’s make an animation that samples the population
n = 3
fig, ax = plt.subplots()
xdata, ydata = [], []
ax.plot(population, np.zeros_like(population) + 0.5, '-o', mfc='white')
ax.vlines(population.mean(), 0., 0.9, linestyles='dashed', label='population mean μ')
ln, = ax.plot([], [], '-ro')
vln = ax.vlines(0, -0.9, 0.0, label=r'sample mean $\bar{x}$', color='r', linestyles='dashed')
ax.legend()
def init():
"""Draw population."""
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)
return ln,
def update(frame):
xdata = np.random.choice(population, size=n, replace=False)
ydata = np.zeros_like(xdata) - 0.5
ln.set_data(xdata, ydata)
segments = vln.get_segments()
segments[0][:, 0] = xdata.mean()
vln.set_segments(segments)
return ln,
ani = FuncAnimation(fig, update, frames=np.arange(20),
init_func=init, blit=True, interval=2000)
plt.close(fig)
HTML(ani.to_html5_video())
#ani.save('variance_anim.mp4')
There’s a few interesting things that we can observe in the above graph:
- the range of the sample is almost always smaller than the population
- the distances between the sample mean and the samples are almost always smaller than the distances between the population points and the population mean
It turns out that this last observation is the reason for the underestimation of the population variance in the $S_n$ formula! Since $(x_i - \bar{x})^2$ is almost always smaller than $(x_i - \mu)^2$, it is quite logical that the $S_n$ yields something that is smaller on average than the population variance $\sigma^2$.
Another vizualization of distances¶
Let’s try to make a vizualization of the distances to the sample mean and population mean to get a better insight. To do this, we simulate a large number of samples by drawingthem from the original population. We then compute two things:
- the average distance of the sample points to the sample mean $\bar{x}$
- the average distance of the sample points to the true population mean $\mu$
import pandas as pd
average_dist_to_sample_mean = []
average_dist_to_pop_mean = []
for i in range(4000):
sample = np.random.choice(population, size=n, replace=False)
sample_mean = np.mean(sample)
average_dist_to_sample_mean.append(np.mean(np.abs(sample - sample_mean)))
average_dist_to_pop_mean.append(np.mean(np.abs(sample - np.mean(population))))
df = pd.DataFrame(data=np.array([average_dist_to_sample_mean, average_dist_to_pop_mean]).T, columns=['avg dist to sample mean', 'avg dist to pop mean'])
df.head()
avg dist to sample mean | avg dist to pop mean | |
---|---|---|
0 | 2.475660 | 3.355701 |
1 | 1.074484 | 3.762793 |
2 | 2.206585 | 2.172278 |
3 | 1.535100 | 2.136968 |
4 | 0.319711 | 0.597730 |
from matplotlib.gridspec import GridSpec
fig = plt.figure(layout="constrained", figsize=(10, 5))
gs = GridSpec(1, 4, figure=fig)
ax1 = fig.add_subplot(gs[0, :3])
ax2 = fig.add_subplot(gs[0, 3])
df.plot.line(lw=1, ax=ax1)
for label, data in df.items():
hist, bin_edges = np.histogram(data, bins=30)
hist = np.append(hist, hist[-1])
bottoms = np.zeros_like(hist)
ax2.plot(hist, bin_edges, label=label)
ax2.legend(fontsize=8)
ax2.set_title('histogram of distances')
ax2.set_xlabel('counts')
ax2.set_ylabel('distance')
Text(0, 0.5, 'distance')
As we can see in the above graph, the average distance is lower in the computation with the sample mean, as expected.
Since we have all these samples, we can also do a running mean (i.e. averaging the values as a function of the number of simulated samples). This will highlight whether the average distances converge to something.
running_mean_df = df.cumsum() / (df.index.values + 1)[:, None]
running_mean_df.plot()
<Axes: >
Clearly, this shows that indeed, the average distance to the sample mean is smaller than the average distance to the population mean: this is exactly the reason why we need to adapt the variance estimation for small sample sizes.
Visualizing the error as a function of $\bar{x}$¶
Finally, a last way to see why the bias happens is to plot the simulated distributions along the sample mean and computed sample variance $S^2_n$:
simulated_data = []
for i in range(10000):
sample = np.random.choice(population, size=n, replace=False)
sample_mean = np.mean(sample)
s2n = np.var(sample)
simulated_data.append([sample_mean, s2n])
fig = plt.figure(layout="constrained", figsize=(10, 5))
gs = GridSpec(1, 4, figure=fig)
ax1 = fig.add_subplot(gs[0, :3])
ax2 = fig.add_subplot(gs[0, 3])
pd.DataFrame(simulated_data, columns=['sample mean', '$S^2_n$']).plot.scatter(x='sample mean', y='$S^2_n$', ax=ax1, alpha=.5)
ax1.vlines(population.mean(), *ax1.get_ylim(), label='population mean', color='red')
ax1.hlines(np.var(population), *ax1.get_xlim(), label='population variance', color='gray')
ax1.legend()
hist, bin_edges = np.histogram(np.array(simulated_data)[:, 1], bins=10)
hist = np.append(hist, hist[-1])
bottoms = np.zeros_like(hist)
ax2.plot(hist, bin_edges, label=label)
ax2.hlines(np.var(population), *ax2.get_xlim(), label='population variance', color='gray')
ax2.set_title('histogram of $S^2_n$')
ax2.set_xlabel('counts')
ax2.set_ylabel('$S^2_n$')
Text(0, 0.5, '$S^2_n$')
What we do see in the above graph is that there are many more counts for variances below the population variance than above. And this is exactly why, on average, the value computed by $S^2_n$ is too low.
Of course, the derivation of the correction for the bias (replacing $n$ by $n-1$) is another matter that I will not discuss here. I think it is fascinating that the needed correction is just this simple formula: adjusting the denominator to $n-1$. I’m imagining how the correction could have been much more complicated, e.g. sample dependent or whatnot.
TL;DR¶
$S^2_n = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$ is a biased estimator of the population variance when computed on a sample. This is because it uses $\bar{x}$, computed using the sample points, which introduces a bias (seen here as average distances to mean systematically smaller than to $\mu$).
Use $S^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$ if your goal is to have a better estimate of the population variance based on the sample data.
This post was entirely written using the Jupyter Notebook. Its content is BSD-licensed. You can see a static view or download this notebook with the help of nbviewer at 20240130_variance_nminus1.ipynb.