Here is a simulation created by Khan Academy user tetf
I can assume that's pronounced "tetf."
And what it does is give us an intuition as
to why we divide by n-1 when we calculate our
sample variance, and why that gives us an unbiased
estimate of the population variance.
So, the way this starts off (and I encourage you to go try
this out yourself) is that you can construct a distribution
that says, "build a population by clicking in the blue area."
So, here we're actually creating a population.
Every time I click, it increases the population size.
I'm just randomly doing this, and I encourage you to go
to this scratchpad (it's on Khan Academy Computer Science)
and try it yourself. I can stop at some point.
So, I've constructed a population.
I can throw out some random points up here.
So, this is our population and you saw while I was doing that
it was calculating parameters for the population.
It was calculating the population mean at 204.09 and also
the population standard deviation, which is derived from the population variance;
this is the square root of the population variance, and it's at 63.8.
It was also plotting the population variance down here.
You see it's 63.8, which is the standard deviation,
and, a little harder to see, it says squared.
These numbers are squared, so essentially
63.8 squared is the population variance.
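Sketched in code, those population calculations look like this (the data here is a hypothetical stand-in for the clicked points, not the actual scratchpad code):

```python
import random

# Hypothetical stand-in for the clicked population; the real one was
# built by hand in the simulation's blue area.
random.seed(0)
population = [random.uniform(0, 400) for _ in range(500)]

N = len(population)
pop_mean = sum(population) / N
# Population variance divides by N: we have every point, so no correction is needed.
pop_var = sum((x - pop_mean) ** 2 for x in population) / N
pop_sd = pop_var ** 0.5  # the standard deviation is the square root of the variance

print(round(pop_mean, 2), round(pop_sd, 2))
```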
So, that's interesting by itself but it really doesn't tell us
a lot, so far, about why we divide by n-1.
And this is the interesting part.
We can now start to take samples and we can decide what
sample size we want to do. I'll start with really small samples.
The smallest possible sample that makes any sense.
So, I'm going to start with really small samples,
and what the simulation is going to do is,
every time I take a sample, calculate the variance.
So, the numerator is going to be the sum, over each of the
data points in my sample, of that data point minus the
sample mean, squared. Then it's going to
divide that by n+a, and it's going to vary "a."
It's going to divide by anywhere from
n-3 (that's a = -3)
all the way up to n+3, and we're going to do it
many, many, many times. We're going to essentially
take the mean of those variances for each "a"
and figure out which one gives us the best estimate.
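That procedure can be sketched in Python (a hypothetical stand-in for the scratchpad; the population is made up, and sampling is done with replacement so the averages behave like the ideal case):

```python
import random

random.seed(1)
# Hypothetical population standing in for the clicked points.
population = [random.uniform(0, 400) for _ in range(500)]
N = len(population)
pop_mean = sum(population) / N
pop_var = sum((x - pop_mean) ** 2 for x in population) / N  # true variance

sample_size = 5        # small samples make the bias easiest to see
num_samples = 20000    # average over many, many samples
a_values = [a / 2 for a in range(-6, 7)]  # a from -3 to +3 in steps of 0.5

# For each a, accumulate the "variance" computed with divisor (n + a).
totals = {a: 0.0 for a in a_values}
for _ in range(num_samples):
    sample = random.choices(population, k=sample_size)
    m = sum(sample) / sample_size
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    for a in a_values:
        totals[a] += ss / (sample_size + a)

# The best a is the one whose average estimate is closest to the true variance.
best_a = min(a_values, key=lambda a: abs(totals[a] / num_samples - pop_var))
print(best_a)
```

With enough samples, the printed best value of a settles at -1: dividing by n makes the estimate too small, because the squared deviations are measured from the sample mean, which always sits in the middle of the sample.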
So, if I just generate one sample right over there, we see kind of this curve.
When we have high values of "a", we are underestimating.
When we have lower values of "a", we are overestimating the population variance.
But that was just for one sample, so it's not really that meaningful,
and it's just one sample size, too.
Let's generate a bunch of samples and average
over many of them. And when you look at many, many,
many samples, something interesting happens.
When you look at the mean across those samples,
when you average together those curves from all of those samples,
you see that our best estimate is when "a" is pretty close to -1,
when this is n+(-1), or n-1.
Anything less than -1 (if we did
n-1.05 or n-1.5), and we start overestimating the variance.
Anything greater than -1 (so if we have
n+0, dividing by n, or n+0.05, or
whatever it might be), and we start underestimating
the population variance.
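That over/underestimation pattern follows from a known fact: the expected sum of squared deviations from the sample mean is (n-1) times the true variance, so a divisor of (n+a) scales the answer by (n-1)/(n+a). A quick check, using a hypothetical sample size of 5:

```python
n = 5  # hypothetical small sample size
# E[sum of squared deviations from the sample mean] = (n - 1) * variance,
# so dividing by (n + a) multiplies the true variance by (n - 1) / (n + a).
for a in [-1.5, -1.05, -1.0, -0.5, 0.0]:
    factor = (n - 1) / (n + a)
    print(a, round(factor, 3))  # factor > 1 overestimates; < 1 underestimates
```

Only a = -1 makes the factor exactly 1, which is why n-1 gives the unbiased estimate.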
And you can do this for samples of different sizes.
Let me try sample size 6.
And here you go: once again,
I'm just keeping "Generate Sample" pressed down.
As we generate more and more samples,
for all of the a's, we essentially take the average across all of those samples
of the variance, depending on how we calculated it,
and you'll see that, once again, our best estimate is pretty darn close
to -1. And if you were to try this,
if you were to get to millions of samples generated,
you would see that your best estimate is when
a is -1, when you're dividing by n-1.
So, once again, thanks "tetf" for this
I think it's a really interesting way to think about
why we divide by n-1.