This right here is a simulation that was created by Peter Collingridge,
using the Khan Academy Computer Science scratchpad, to better understand
why we divide by n-1 when we calculate an unbiased sample variance,
that is, when we are trying, in an unbiased way, to estimate
the true population variance.
So what this simulation does is, first, it constructs a random
population distribution, and every time you go to it
there'll be a different population distribution. This one has a population of
383, and then it calculates the parameters for that population directly
from it: the mean is 10.9, the variance is 25.5.
And then it takes samples from that population, samples of size
2, 3, 4, 5, all the way up to 10, and it keeps sampling, calculating the statistics for those samples,
so the sample mean and the sample variance, in particular the biased sample variance, and
it starts telling us some things that give us some intuition, and
you can actually click on each of these charts and zoom in
to really be able to study them in detail.
So I've already taken a screenshot of this and put it in my
doodle pad, so I can really delve into some of the math and the intuition behind what this is showing us.
So here I took a screenshot, and you see, for this case right over here,
the population was 529, the population mean was 10.6,
and down here in this chart he plots the population mean
right here at 10.6, and
over there you see that the population variance is 36.8, and
he plots that right over here:
36.8.
So this first chart in the bottom left tells us a couple of interesting
things, and just to be clear, this is the biased sample variance that he's calculating.
That is being calculated for each of our samples:
starting with the first data point in each sample and going to the nth data point,
we take each data point, subtract out the sample mean,
square that, and divide the whole sum not by n-1, but by lowercase n.
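That calculation can be sketched in a few lines of Python; the function name and data values here are mine, not from the simulation, and it matches NumPy's default variance, which also divides by n.

```python
import numpy as np

def biased_sample_variance(sample):
    """Biased sample variance: for each data point, subtract the sample
    mean, square it, sum those up, and divide by n (not n - 1)."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    sample_mean = sample.mean()
    return ((sample - sample_mean) ** 2).sum() / n

data = [10.0, 12.0, 9.0, 13.0]     # arbitrary illustrative values
print(biased_sample_variance(data))  # 2.5
print(np.var(data))                  # NumPy's default also divides by n: 2.5
```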
And this tells us several interesting things.
The first thing it shows us is that the cases where we are
significantly underestimating the sample variance, where we're getting sample variances
close to 0, are also, disproportionately, the cases
where the means for those samples are way far off
from the true population mean. Or you can view that the other way around:
in the cases where the sample mean is way far off from the population mean,
it seems you're much more likely to underestimate the sample variance.
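There's an exact algebraic reason for that pattern: the average squared deviation from the true population mean splits into the biased sample variance plus the squared error of the sample mean, so a sample mean that lands far from the population mean forces the biased variance down. A minimal check of that identity, with made-up population parameters (10.6 and a standard deviation of 6 are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean, population_sd = 10.6, 6.0   # illustrative, not from the video

sample = rng.normal(population_mean, population_sd, size=3)
sample_mean = sample.mean()

# Mean squared deviation measured from the TRUE population mean...
msd_from_mu = ((sample - population_mean) ** 2).mean()
# ...equals the biased sample variance plus the squared error of the mean:
biased_var = ((sample - sample_mean) ** 2).mean()
print(np.isclose(msd_from_mu, biased_var + (sample_mean - population_mean) ** 2))  # True
```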
The other thing that might pop out is the realisation that the pinker dots are the ones for
smaller sample sizes, while the bluer ones are for larger sample sizes.
And you see, here, at these two little tails, so to speak, of the hump,
that at these ends it's disproportionately a more reddish colour,
while most of the bluish or purplish dots are focused right in the middle,
right over here, where they're giving us a better estimate. There
are some red ones over here too, and that's
why it gives us that purplish colour, but here on these
tails it's almost purely red; every now and then
a little blue one happens to land there, but it's disproportionately far more red.
Which really makes sense: when you have a smaller sample size,
you're far more likely to get a sample mean that is a bad estimate
of the population mean, it's far from the population mean and
you're more likely to significantly underestimate the sample variance.
Now this next chart really gets to the meat of the issue, because
what it is telling us is that for each of these sample
sizes, so this right over here is for sample size 2, if we keep
taking samples of size 2, and we keep calculating the biased sample variance,
dividing that by the population variance, and finding the mean of all of those ratios,
you see
that over many, many, many trials, many, many samples of size 2,
the mean of the biased sample variance over the population variance approaches one half.
In other words, on average, the biased sample variance is only half of the true population variance.
With sample size 3 it's approaching two thirds, about 66.7%, of the true population variance.
With sample size 4 it's approaching three fourths of the true population variance.
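Those ratios can be reproduced with a short Monte Carlo sketch. This is not Peter's simulation: it uses a normal stand-in population (the parameters and the size 529 just echo the numbers in the video, and the real simulation builds a random population each time), but the one half, two thirds, three fourths pattern comes out the same.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in population; parameters are illustrative, not from the simulation.
population = rng.normal(10.6, 6.0, size=529)
population_var = population.var()          # divides by N: the true parameter

ratios = {}
for n in (2, 3, 4):
    # 200,000 samples of size n, drawn with replacement
    samples = rng.choice(population, size=(200_000, n))
    biased_vars = samples.var(axis=1)      # ddof=0, i.e. divide by n
    ratios[n] = biased_vars.mean() / population_var
    print(n, round(ratios[n], 3))          # approx 0.5, 0.667, 0.75
```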
So we can come up with a general theme of what's happening:
when we use the biased estimate, we're not approaching the population variance,
we're approaching, let me write this down,
n-1 over n times the population variance.
When n was 2 this approached one half; when n is 3 this is two thirds; when
n is 4 this is three fourths. So
this is giving us a biased estimate.
So how would we unbias this?
Well, if we really want to get our best estimate of the true population variance, not
n-1 over n times the population variance
we would want to multiply, let me do this in a colour I haven't used yet,
we would want to multiply by n over n-1.
We would want to multiply by n over n-1 to get an unbiased estimate.
Here, these cancel out and we're left with just the population variance, which is
what we want to estimate, and here, over here,
you are left with our unbiased estimate of the population variance.
Our unbiased sample variance, which is equal to the sum of the squared deviations from the sample mean, divided by n-1.
And this is what we saw in the last several videos, what you see
in statistics books, and it's confusing why. Hopefully Peter's simulation
gives you a good idea of why, or at least convinces you that this is the case.
So you would want to divide by n-1.
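Putting the pieces together, the unbiasing step is just a rescaling of the biased estimate by n over n-1, which is the same as dividing by n-1 in the first place. A minimal sketch (function name and data are mine); NumPy exposes the same correction through its ddof=1 option:

```python
import numpy as np

def unbiased_sample_variance(sample):
    """Biased sample variance rescaled by n / (n - 1),
    which is equivalent to dividing the squared deviations by n - 1."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    biased = ((sample - sample.mean()) ** 2).sum() / n
    return biased * n / (n - 1)

data = [10.0, 12.0, 9.0, 13.0]           # arbitrary illustrative values
print(unbiased_sample_variance(data))
print(np.var(data, ddof=1))               # NumPy's n-1 version gives the same value
```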