Here is a simulation created by Khan Academy user tetf
I can assume that's pronounced "tetf."
And what it does is give us an intuition as
to why we divide by n-1 when we calculate our
sample variance, and why that gives us an unbiased
estimate of the population variance.
So, the way this starts off (and I encourage you to go try
this out yourself) is that you can construct a distribution
that says, "build a population by clicking in the blue area."
So, here we're actually creating a population.
Every time I click, it increases the population size.
I'm just randomly doing this, and I encourage you to go
to this scratchpad (it's on Khan Academy Computer Science)
and try it yourself. I can stop at some point.
So, I've constructed a population.
I can throw out some random points up here.
So, this is our population and you saw while I was doing that
it was calculating parameters for the population.
It was calculating the population mean at 204.09 and also
the population standard deviation, which is derived from the population variance;
this is the square root of the population variance, and it's at 63.8.
It was also plotting the population variance down here.
You see it's 63.8, which is the standard deviation,
and, a little harder to see, it says squared.
These numbers are squared, so essentially
63.8 squared is the population variance.
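Sketched in code, those population calculations look like this (the data here is a hypothetical stand-in for the clicked points, not the actual scratchpad code):

```python
import random

# Hypothetical stand-in for the clicked population; the real one was
# built by hand in the simulation's blue area.
random.seed(0)
population = [random.uniform(0, 400) for _ in range(500)]

N = len(population)
pop_mean = sum(population) / N
# Population variance divides by N: we have every point, so no correction is needed.
pop_var = sum((x - pop_mean) ** 2 for x in population) / N
pop_sd = pop_var ** 0.5  # the standard deviation is the square root of the variance

print(round(pop_mean, 2), round(pop_sd, 2))
```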
So, that's interesting by itself but it really doesn't tell us
a lot, so far, about why we divide by n-1.
And this is the interesting part.
We can now start to take samples and we can decide what
sample size we want to do. I'll start with really small samples.
The smallest possible sample that makes any sense.
So, I'm going to start with really small samples,
and what the simulation is going to do is,
every time I take a sample, calculate the variance.
So, the numerator is going to be the sum, over each of the
data points in my sample, of that data point minus the
sample mean, squared. Then it's going to
divide that by n+a, and it's going to vary "a."
It's going to divide by anywhere from
n-3 (that's a = -3)
all the way up to n+3, and we're going to do it
many, many, many times. We're going to essentially
take the mean of those variances for each "a"
and figure out which one gives us the best estimate.
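That procedure can be sketched in Python (a hypothetical stand-in for the scratchpad; the population is made up, and sampling is done with replacement so the averages behave like the ideal case):

```python
import random

random.seed(1)
# Hypothetical population standing in for the clicked points.
population = [random.uniform(0, 400) for _ in range(500)]
N = len(population)
pop_mean = sum(population) / N
pop_var = sum((x - pop_mean) ** 2 for x in population) / N  # true variance

sample_size = 5        # small samples make the bias easiest to see
num_samples = 20000    # average over many, many samples
a_values = [a / 2 for a in range(-6, 7)]  # a from -3 to +3 in steps of 0.5

# For each a, accumulate the "variance" computed with divisor (n + a).
totals = {a: 0.0 for a in a_values}
for _ in range(num_samples):
    sample = random.choices(population, k=sample_size)
    m = sum(sample) / sample_size
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    for a in a_values:
        totals[a] += ss / (sample_size + a)

# The best a is the one whose average estimate is closest to the true variance.
best_a = min(a_values, key=lambda a: abs(totals[a] / num_samples - pop_var))
print(best_a)
```

With enough samples, the printed best value of a settles at -1: dividing by n makes the estimate too small, because the squared deviations are measured from the sample mean, which always sits in the middle of the sample.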
So, if I just generate one sample right over there, we see kind of this curve.
When we have high values of "a", we are underestimating.
When we have lower values of "a", we are overestimating the population variance.
But that was just for one sample, so it's not really that meaningful,
and it's just one sample size, too.
Let's generate a bunch of samples and average
over many of them. And when you look at many, many,
many samples, something interesting happens.
When you look at the mean across those samples,
when you average together those curves from all of those samples,
you see that our best estimate is when "a" is pretty close to -1,
when this is n+(-1), or n-1.
Anything less than -1 (if we did
n-1.05 or n-1.5), and we start overestimating the variance.
Anything greater than -1 (so if we have
n+0, dividing by n, or n+0.05, or
whatever it might be), and we start underestimating
the population variance.
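That over/underestimation pattern follows from a known fact: the expected sum of squared deviations from the sample mean is (n-1) times the true variance, so a divisor of (n+a) scales the answer by (n-1)/(n+a). A quick check, using a hypothetical sample size of 5:

```python
n = 5  # hypothetical small sample size
# E[sum of squared deviations from the sample mean] = (n - 1) * variance,
# so dividing by (n + a) multiplies the true variance by (n - 1) / (n + a).
for a in [-1.5, -1.05, -1.0, -0.5, 0.0]:
    factor = (n - 1) / (n + a)
    print(a, round(factor, 3))  # factor > 1 overestimates; < 1 underestimates
```

Only a = -1 makes the factor exactly 1, which is why n-1 gives the unbiased estimate.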
And you can do this for samples of different sizes.
Let me try sample size 6.
And here you go: once again,
I'm just keeping "Generate Sample" pressed down.
As we generate more and more samples,
for all of the a's, we essentially take the average across all of those samples
of the variance, depending on how we calculated it,
and you'll see that, once again, our best estimate is pretty darn close
to -1. And if you were to try this,
if you were to get to millions of samples generated,
you would see that your best estimate is when
a is -1, when you're dividing by n-1.
So, once again, thanks "tetf" for this
I think it's a really interesting way to think about
why we divide by n-1.