Difference of Sample Means Distribution

I want to build on what we did in the last video a little bit. Let's say we have two random variables. So I have random variable x. And let me draw its probability distribution. And actually, it doesn't have to be normal. But I'll just draw it as a normal distribution. So this is the distribution of random variable x. This is the mean. The population mean of random variable x. And then it has some type of standard deviation. Actually, let me just focus on the variance. So it has some variance right here for random variable x. This is x, the distribution for x. Let's say we have another random variable. Random variable y. Let's do the same thing for it. Let's draw its distribution. And let me draw the parameters for that distribution. So it has some true mean, some population mean for the random variable y. And it has some variance right over here. And I've drawn it roughly normal. Once again, we don't have to assume that it's normal. Because we're going to assume, when we go to the next level, that when we take the samples, we're taking enough samples that the central limit theorem will actually apply. But with that said, let's think about the sampling distributions of each of these random variables. So let's think about the sampling distribution of the sample mean of x. Let's say the sample size over here is going to be equal to n. So what is that going to look like? Well it's going to be some distribution. And we're assuming that n is a fairly large number. So this is going to be a normal distribution. Or it can be approximated with a normal distribution. Let me shift it over a little bit. I'm going to draw it a little bit narrow. Let me draw the mean. So the population mean of the sampling distribution is going to be denoted with this x bar, that tells us the distribution of the means when the sample size is n. And we know that this is going to be the same thing as the population mean for that random variable. 
And we know that the variance of the sampling distribution is going to be equal to the population variance divided by this n right over here. And if you wanted the standard deviation of this, which is often called the standard error of the mean, you just take the square root of both sides. Let's do the same thing for random variable y. Let's take the sampling distribution of the sample mean. But here, we're talking about y, random variable y. And let's just say it has a different sample size. It doesn't have to be a different one. But it just shows you that it doesn't have to be the same. So it has a sample size of m. Let me draw its distribution right over here. Once again, it'll be a narrower distribution than the population distribution. And it will be approximately normal, assuming that we have a large enough sample size. And the mean of the sampling distribution of the sample mean is going to be the same thing as the population mean. We've seen that multiple times. And its variance for the sample means, or the standard error of the mean. Actually, this isn't the standard error. Standard error would be the square root of this. So if I called this standard error of the mean, that's wrong. The standard error of the mean is the square root of this. It's the standard deviation. This is the variance of the mean. Don't want to confuse you. So the variance of the mean here follows the exact same logic. It's going to be the variance of the population divided by our sample size, m. And everything we've done so far is complete review. It's a little different, because I'm actually doing it with two different random variables. And I'm doing it with two different random variables for a reason. Because now I'm going to define a new random variable. We could just call it z. But z is equal to the difference of our sample means. It's equal to the x sample mean minus the y sample mean. So what does that really mean?
Well, to get a sample mean, or at least for this distribution, you're taking n samples from this population over here. Maybe n is 10. You're taking 10 samples and finding their mean. That sample mean is a random variable. Let's say you take 10 samples from here and you get 9.2 when you find their mean. That 9.2 can be viewed as a sample from this distribution right over here. Same thing if this right here is m. Or if m right here is 12. You're taking 12 samples, taking their mean. And that sample mean, maybe it's 15.2, could be viewed as a sample from this distribution. As a sample from the sampling distribution. So what z is, z is a random variable where you're taking n samples from this distribution up here, this population distribution, taking their mean. Then you're taking m samples from this population distribution up here, taking their mean. And then finding the difference between that mean and that mean. So it's another random variable. But what is the distribution of z? So let's draw it. Well, there's a couple of things we immediately know about z. And we kind of came up with this in the last video. Instead of writing z, I'm just going to write the mean of x bar minus y bar, where x bar is a sample from the sampling distribution of x, or the sample mean of x, and y bar is the sample mean of y. We saw this in the last video. In fact, I think I still have the work up here. Yeah, I still have the work right up here. The mean of the difference is going to be the difference of the means. The mean of the difference is the same thing as the difference of the means. So the mean of this new distribution right over here is going to be the same thing as the mean of our sample mean of x minus the mean of our sample mean of y. And this might seem a little abstract in this video. In the next video, we're actually going to do this with concrete numbers. And hopefully it'll make a little bit more sense.
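Written out in symbols, the pieces so far look roughly like this. The subscript notation is an assumption made for this summary rather than something taken from the video: mu and sigma squared stand for a mean and a variance, a plain subscript x or y marks a population, and a subscript with a bar marks the sampling distribution of the sample mean.

```latex
\begin{align*}
\mu_{\bar{x}} &= \mu_x, & \sigma^2_{\bar{x}} &= \frac{\sigma_x^2}{n}, \\
\mu_{\bar{y}} &= \mu_y, & \sigma^2_{\bar{y}} &= \frac{\sigma_y^2}{m}, \\
z &= \bar{x} - \bar{y}, & \mu_{\bar{x} - \bar{y}} &= \mu_{\bar{x}} - \mu_{\bar{y}} = \mu_x - \mu_y.
\end{align*}
```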
And just so you know where we're going with this, the whole point of this is so that we can eventually do some inferential statistics about differences of means. How likely is it that a difference between the means of two samples is due to random chance? Or what is a confidence interval of the difference of means? That's what this is all building up to. So anyway, we know the mean of this distribution right over here. And what's the variance of this distribution? We came up with that result in the last video. If we're taking the difference of two independent random variables, the variance is going to be the sum of the variances of those two random variables. And the whole point of that video is to show that it's not the difference of the variances, it's the sum of the variances. The variance of this new distribution (and I haven't drawn the distribution yet), I'll just write x bar minus y bar, is going to be equal to the sum of the variances of each of these distributions. The variance of x bar plus the variance of y bar. Actually, let me just draw this here. Just so we can visualize another distribution. Although, all I'm going to draw is another normal distribution. Let me scroll down a little bit. So the mean over here, the mean of x bar minus y bar, is going to be equal to the difference of these means over here. I don't have to rewrite it. Let me draw the curve. And notice, I'm drawing a fatter curve than either one. And why am I doing that? Because the variance here is the sum of the variances here. So we're going to have a fatter curve. It's going to have a bigger variance, or a bigger standard deviation, than either of these. So then we have some variance here, variance of x bar minus y bar. Now what are these, in terms of the original population distribution? We came up with those results right over here. We know what each of these variances is.
We know that this thing is the same thing as the variance of the population distribution divided by n. We've done this multiple, multiple times. What's this going to be equal to? This right here is the same thing as the variance of our population distribution. And the x just means this is for random variable x. But there's no bar on top. This is the actual population distribution, not the sampling distribution of the sample mean. So that divided by n. And then if we want the variance of the sampling distribution for y, let me do that in a different color. I'll use blue, because that was what we were using for the y random variable. That's going to be equal to this thing over here. And we've done this multiple times. Same exact logic as this. The variance of the population distribution for y divided by m. And so once again, I'll just write this out front. This is the variance of the differences of the sample means. And now if you wanted the standard deviation of the differences of the sample means, you just have to take the square root of both sides of this. You take the square root of this, you get that the standard deviation of the difference of the sample means is equal to the square root of the variance of the population distribution of x divided by n plus the variance of the population distribution of y divided by m. And this is just neat. Because it kind of looks a little bit like a distance formula. I'll throw that out there as we get more sophisticated with our statistics and try to visualize what all of this kind of stuff means in more advanced topics. But the whole point of this is, now we can make inferences about a difference of means. If we have two samples, and we take the means of both of those samples and we find some difference, we can make some conclusions about how likely it is that the difference arose just by chance. And we're going to do that in the next video.
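One way to get a feel for that last formula is to simulate it. The sketch below is only an illustration, not something from the video: the function names are made up for this example, the populations are assumed to be normal and independent, and the population parameters (means of 10 and 15, standard deviations of 3 and 4) are arbitrary assumptions. Only the sample sizes, n = 10 and m = 12, echo the numbers mentioned earlier.

```python
import math
import random
import statistics


def theoretical_sd_of_difference(var_x, n, var_y, m):
    """Square root of (var_x / n + var_y / m), assuming independent samples."""
    return math.sqrt(var_x / n + var_y / m)


def simulate_differences(mu_x, sd_x, n, mu_y, sd_y, m, trials, seed=0):
    """Return `trials` simulated values of z = x bar - y bar.

    Assumption for this sketch: both populations are normal and independent.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        # Take n samples from population x and m samples from population y.
        xs = [rng.gauss(mu_x, sd_x) for _ in range(n)]
        ys = [rng.gauss(mu_y, sd_y) for _ in range(m)]
        # One draw of z is the difference between the two sample means.
        diffs.append(statistics.fmean(xs) - statistics.fmean(ys))
    return diffs


if __name__ == "__main__":
    # Hypothetical population parameters, chosen only for illustration.
    mu_x, sd_x, n = 10.0, 3.0, 10
    mu_y, sd_y, m = 15.0, 4.0, 12

    diffs = simulate_differences(mu_x, sd_x, n, mu_y, sd_y, m, trials=20000)

    print("simulated mean of z:  ", statistics.fmean(diffs))
    print("theoretical mean of z:", mu_x - mu_y)
    print("simulated sd of z:    ", statistics.pstdev(diffs))
    print("theoretical sd of z:  ",
          theoretical_sd_of_difference(sd_x ** 2, n, sd_y ** 2, m))
```

With enough trials, the simulated mean and standard deviation of z should land close to the theoretical values, and the spread of z should come out larger than the spread of either sample mean on its own.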