Let's say that you're a watermelon farmer,
and you want to study how dense the seeds are
in your watermelon.
Perhaps you want to do this because over time, you're
trying to breed watermelons that have fewer seeds,
and you want to see whether you are actually making progress.
And you don't want to cut open every watermelon
in your watermelon farm or patch or whatever
it might be called, because you want to sell most of them.
You just want to sample a few watermelons,
and then take samples of those watermelons
to figure out how dense the seeds are, and hope that you
can calculate statistics on those samples that
are decent estimates of the parameters for the population.
So let's start doing that.
So let's say that you take these little cubic inch chunks out
of a random sample of your watermelons.
And then you count the number of seeds in them.
And you have 8 samples like this.
So in one of them, you found 4 seeds.
In the next, you found 3, 5, 7, 2, 9, 11, and 7.
So this is a sample, just to make
sure we're visualizing it right.
If this is the population of all of the chunks--
I guess we could view this as a cubic inch--
the cubic inch chunks in my entire watermelon farm,
I'm sampling a very small sample of them.
Maybe I could have had a million over here.
A million chunks of watermelon could
have been produced from my farm, but I'm only
sampling-- so capital N would be 1 million,
lowercase n is equal to 8.
And once again, you might want to have more samples,
but this'll make our math easy.
Now, let's think about what statistics we can measure.
Well, the first one that we often do
is a measure of central tendency.
And that's the arithmetic mean.
But here, we're trying to estimate the population mean
by coming up with the sample mean.
So what is the sample mean going to be?
Well, all we have to do is add up these points,
add up these measurements, and then divide
by the number of measurements we have.
So let's get our calculator out for that.
Actually, maybe I don't need my calculator.
Let's see.
So 4 plus 3 is 7.
7 plus 5 is 12.
12 plus 7 is 19.
19 plus 2 is 21, plus 9 is 30, plus 11 is 41, plus 7 is 48.
So I'm going to get 48 over 8 data points.
So this worked out quite well.
48 divided by 8 is equal to 6.
So our sample mean is 6.
It's our estimate of what the population mean might be.
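Just to double-check that arithmetic, here is a minimal sketch in Python. The variable names (`seeds`, `sample_mean`) are only illustrative; the numbers are the 8 seed counts from above.

```python
# Seed counts from the 8 cubic-inch chunks in the example.
seeds = [4, 3, 5, 7, 2, 9, 11, 7]

# Sample mean: add up the measurements, divide by how many there are.
n = len(seeds)
total = sum(seeds)
sample_mean = total / n

print(total, n, sample_mean)  # 48 8 6.0
```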
But we also want to estimate how much spread
there is in our population,
or how much our measurements vary from this mean.
So there, we say, well, we can try to estimate the population
variance by calculating the sample variance.
And we're going to calculate the unbiased sample variance.
Hopefully, we're fairly convinced at this point
why we divide by n minus 1.
So we're going to calculate the unbiased sample variance.
And if we do that, what do we get?
I'll do this in a different color.
It's going to be 4 minus 6 squared plus 3 minus 6 squared
plus 5 minus 6 squared plus 7 minus 6 squared
plus 2 minus 6 squared plus 9 minus 6 squared
plus 11 minus 6 squared plus 7 minus 6 squared, all of that
divided by-- not by 8.
Remember, we want the unbiased sample variance.
We're going to divide it by 8 minus 1.
So we're going to divide by 7.
Let me give myself a little bit more real estate.
The unbiased sample variance-- and I could even
denote it by this to make it clear that we're
dividing by lowercase n minus 1-- is going
to be equal to-- let's see, 4 minus 6 is negative 2.
That squared is positive 4.
So I did that one.
3 minus 6 is negative 3.
That squared is going to be 9.
5 minus 6 squared is negative 1 squared, which is 1.
7 minus 6 is 1, and 1 squared is once again 1.
2 minus 6, negative 4 squared is 16.
9 minus 6 squared, well, that's going to be 9.
11 minus 6 squared, that is 25.
And then finally, 7 minus 6 squared, that's another 1.
And we're going to divide it by 7.
Let's see if we can add this up in our heads.
4 plus 9 is 13, plus 1 is 14, 15, 31, 40, 65, 66.
So this is going to be equal to 66 over 7.
And we could write that as a mixed number.
That's 9 and 3/7.
Or if we want to write that as a decimal,
I could just take 66 divided by 7
gives us 9 point-- I'll just round it.
So it's approximately 9.43.
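The same calculation can be sketched in Python as a check. This is just one way to write it, and the variable names are illustrative; it follows the steps above by squaring each deviation from the sample mean, adding them up, and dividing by n minus 1 rather than n.

```python
# Seed counts from the 8 cubic-inch chunks in the example.
seeds = [4, 3, 5, 7, 2, 9, 11, 7]

n = len(seeds)
sample_mean = sum(seeds) / n

# Squared distance of each measurement from the sample mean.
squared_deviations = [(x - sample_mean) ** 2 for x in seeds]
sum_of_squares = sum(squared_deviations)

# Unbiased sample variance: divide by n - 1 (here 7), not by n.
unbiased_variance = sum_of_squares / (n - 1)

print(sum_of_squares)               # 66.0
print(round(unbiased_variance, 2))  # 9.43
```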
Now, that gave us our unbiased sample variance.
Well, how could we calculate a sample standard deviation?
We want to somehow get an estimate of what the population
standard deviation might be.
Well, the logic, I guess, is reasonable to say, well,
this is our unbiased sample variance.
It's our best estimate of what the true population
variance is.
When we think about population parameters
to get the population standard deviation,
we just take the square root of the population variance.
So if we want to get an estimate of the population standard
deviation, why don't we just take
the square root of the unbiased sample variance?
So that's what we'll do.
So we'll define it that way.
We'll call it the sample standard deviation.
We're going to define it to be equal to the square root
of the unbiased sample variance.
It's going to be the square root of this quantity,
and we can take our calculator out.
It's going to be the square root of what I just typed in.
I can do 2nd answer.
It'll be the last entry here.
So the square root of that is-- and I'll just round.
It's approximately equal to 3.07.
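Instead of the calculator, the square root step could be sketched in Python like this. The standard library's `statistics.stdev` also divides by n minus 1, so it is used here only as a cross-check; the variable names are illustrative.

```python
import math
import statistics

# Seed counts from the 8 cubic-inch chunks in the example.
seeds = [4, 3, 5, 7, 2, 9, 11, 7]

n = len(seeds)
sample_mean = sum(seeds) / n
unbiased_variance = sum((x - sample_mean) ** 2 for x in seeds) / (n - 1)

# Sample standard deviation: square root of the unbiased sample variance.
sample_std_dev = math.sqrt(unbiased_variance)

print(round(sample_std_dev, 2))            # 3.07
print(round(statistics.stdev(seeds), 2))   # 3.07, same definition
```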
Now, I'm going to tell you something
very counterintuitive.
Or at least initially it's counterintuitive,
but hopefully you'll appreciate this over time.
This we've already talked about in some depth.
People have even created simulations
to show that this is an unbiased estimate of population variance
when we divide it by n minus 1.
And that's a good starting point if we're
going to take the square root of anything.
But it actually turns out that because the square root
function is nonlinear, this sample standard
deviation-- and this is how it tends
to be defined-- which is the square root of our unbiased
sample variance, so the square root of the sum from i equals 1 to n
of x sub i minus the sample mean, squared, divided by n minus 1--
this is literally how we define our sample standard deviation.
Because the square root function is nonlinear,
it turns out that this is not an unbiased estimate
of the true population standard deviation.
And I encourage people to make simulations of that
if they're interested.
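Here is a minimal sketch of one such simulation in Python. Everything about the setup is an assumption made for illustration: the population is taken to be normal with a true standard deviation of 1, the sample size is 8 to match the watermelon example, and the seed and number of trials are arbitrary. Under those assumptions, the average of the sample variances should land close to 1, while the average of the sample standard deviations should come out noticeably below 1.

```python
import math
import random

def sample_variance(xs):
    """Unbiased sample variance: squared deviations divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

rng = random.Random(0)  # fixed seed so reruns give the same numbers
sigma = 1.0             # assumed true population standard deviation
n = 8                   # same sample size as the watermelon example
trials = 20_000         # arbitrary number of repeated samples

variances = []
for _ in range(trials):
    # Draw one sample of size n from the assumed normal population.
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    variances.append(sample_variance(sample))

# Average each statistic over all of the repeated samples.
avg_variance = sum(variances) / trials
avg_std_dev = sum(math.sqrt(v) for v in variances) / trials

print("true variance:", sigma ** 2, " average sample variance:", avg_variance)
print("true std dev: ", sigma, " average sample std dev: ", avg_std_dev)
```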
But then you might say, well, we went through great pains
to divide by n minus 1 here in order
to get an unbiased estimate of the population variance.
Why don't we go through similar pains
and somehow figure out a formula for an unbiased estimate
of the population standard deviation?
And the reason why that's difficult
is to unbias the sample variance,
we just have to divide by n minus 1 instead of n.
And that'd work for any probability distribution
for our population.
It turns out that to do the same thing
for the standard deviation,
it's not that easy.
It's actually dependent on how that population is actually
distributed.
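One way to see that dependence is a rough sketch like the following, which runs the same experiment on two different populations. The choices here are assumptions for illustration only: a normal population and an exponential population, both set up to have a true standard deviation of 1, with samples of size 8. The helper name `average_sample_std_dev` is made up for this sketch. Both averages should fall short of 1, but by different amounts, which is why no single correction works for every population.

```python
import math
import random

def average_sample_std_dev(draw, n, trials):
    """Average the sample standard deviation over many samples of size n."""
    total = 0.0
    for _ in range(trials):
        xs = [draw() for _ in range(n)]
        m = sum(xs) / n
        # Square root of the unbiased (n - 1) sample variance.
        total += math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return total / trials

rng = random.Random(1)   # fixed seed so reruns give the same numbers
n, trials = 8, 20_000    # assumed sample size and number of repeats

# Two assumed populations, each with true standard deviation 1.
normal_avg = average_sample_std_dev(lambda: rng.gauss(0.0, 1.0), n, trials)
expo_avg = average_sample_std_dev(lambda: rng.expovariate(1.0), n, trials)

print("normal population, average sample std dev:     ", normal_avg)
print("exponential population, average sample std dev:", expo_avg)
```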
So in statistics, we just define the sample standard deviation.
And the one that we typically use
is based on the square root of the unbiased sample variance.
But when you take that square root,
it does give you a biased result when
you're trying to use this to estimate the population
standard deviation.
But it's the simplest, best tool we have.