Okay, let's take a look at some of the ideas coming out of chapter 8. This is hypothesis testing.
So the idea is that we start with an assumption. We start with this assumption: if we had a particular population, and we take a sample, a subgroup of this population, that equally represents the whole group, in other words a random sample, and we apply a treatment to this sample, so we manipulate it in some way, we start with the assumption that the treatment has no effect. Okay, so we start with an assumption: the treatment has no effect.
What we want to do is find enough evidence that says that that assumption is not likely. In other words, we want to find enough evidence that we can reject that first assumption, and the evidence we would come up with would be the observation of an extremely unlikely event. Okay, so we want to find enough evidence to suggest our assumption is incorrect.
Okay, and like I said, that "enough evidence" idea works like this: when you actually have a sample and you've applied the treatment to it, you get a value, usually the sample mean, sometimes some other statistic. You find that value and you ask, okay, what's the likelihood of observing this value or something more extreme? If the likelihood of observing that value is very small, in other words it's a very unlikely event, then that is enough evidence (well, we set a standard beforehand, but typically we say okay, this is enough evidence) to say that our initial assumption is not correct.
Okay, so just to give some background on this: the idea in mathematics is to do a proof by contradiction, to say okay, let's assume this thing is true, and then show that it can't be true, and that's proof that something else is true. That works when the assumption is strictly yes or no. You're either inside or you're outside, so if I can establish that you are not outside, that means we're definitely inside. Okay, that's the basic premise, but with statistics we don't necessarily have such a clean, clearly defined yes/no boundary for a treatment working or not working, or for measuring the effect. So I want to go through an example in a moment and talk about those ideas and give you a good sense of what the statistics are telling us, but before then I want to outline the hypothesis test start to finish and give you a sense of some of the mechanics of it, okay.
So there are two ways to go about doing a hypothesis test. The first way would be to compare critical values with test values, so let me just write it and then I'll explain it: compare critical values with test values. The critical values are going to be based on a set significance level. In other words, we set how unlikely we're going to allow something to be before we call it really unlikely. Okay, so this significance level is the alpha,
and maybe it's five percent, let's say, as an example. If it's .05, we're saying that if we could show that the probability of observing this test value, the sample mean from the test I did, is less than five percent, then that is what we're going to call unlikely. In other words, it's significant. What we could do then is translate that alpha level .05 into a z-score, asking, in other words, what z-score corresponds to that tail area of 5 percent? Then we would take the x-bar that we get, our test value, compute a z-score for it, and compare that with the z-score of the critical value. If we find that our z-score from the test is less than or equal to the critical z, then we do nothing, but if we could show that our z-score from the test was more extreme than the critical value, then what we would say is we reject the initial assumption.
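As a quick sketch of this first method, here is how the comparison could look in Python; this isn't from the book, and the function name and defaults are my own:

```python
from statistics import NormalDist

def reject_by_critical_value(z_test, alpha=0.05, two_tailed=True):
    """Compare a test z-score against the critical z for a set alpha.

    Returns True when the test statistic is more extreme than the
    critical value, i.e. when we reject the null hypothesis.
    """
    nd = NormalDist()  # standard normal distribution
    if two_tailed:
        z_crit = nd.inv_cdf(1 - alpha / 2)  # e.g. 1.96 when alpha = .05
        return abs(z_test) > z_crit
    z_crit = nd.inv_cdf(1 - alpha)          # e.g. 1.645 when alpha = .05
    return z_test > z_crit
```

So a test z of 2.311 against a two-tailed critical value of 1.96 would lead to rejection, while a test z of 1.5 would not.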
I'm going to change up the language in just a moment after I highlight some ideas, okay, but that's the idea. Now, what I'm suggesting is that maybe this isn't so practical in terms of computation, maybe it's a little bit too much finagling, but I just want to highlight it because this is a viable way for a lot of fields to compare, if you rely on a significance level. I'm going to pitch the idea that maybe we don't worry so much about the significance level and just compute the probability of observing this test score, in which case we would do the second method.
Okay, so if we had an observation from a sample, the second way we could test it is to compare what we call the p-value and the alpha, the significance level. In other words, and I'll write this in a different color, if the p-value that we calculate is less than the alpha that we've set, in other words, if we can show that the probability of observing our test score or something more extreme (that's what the p-value is) is less than the significance level, then that is enough information for us to go ahead and reject our initial assumption.
Alright, what we're trying to say is that the probability of observing our test score is less than the minimum probability for something being, quote unquote, unlikely, alright. So let me back up a little bit. I've already mentioned the significance level, but the p-value is just the probability of observing our test score or something more extreme.
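Here is a minimal sketch of this second method in Python (again, my own function name, not from the book): convert the test z-score into a tail area, then compare that area with alpha.

```python
from statistics import NormalDist

def p_value(z_test, two_tailed=True):
    """Probability under H0 of a z-score at least as extreme as z_test."""
    upper_tail = 1 - NormalDist().cdf(abs(z_test))  # area beyond |z|
    return 2 * upper_tail if two_tailed else upper_tail

# Decision rule: reject H0 whenever p_value(z) < alpha.
```

Notice that a two-tailed p-value is just twice the one-tailed area, since the normal curve is symmetric.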
Okay, let me back up even further. When we have a hypothesis test, we always start with an initial assumption, and to give this a consistent reading, I'm going to call it the null hypothesis, "null" referencing zero, or the ground level. So the null hypothesis, and I'm going to write the words and then show the symbol, is expressed as H sub-zero, an H with a subscript 0. The null hypothesis is, in a nutshell, saying that we assume the treatment had no effect. Okay, so this is our assumption: the treatment had no effect. Typically we write this as the mean being equal to whatever that mean should be, and I'll highlight that with an example, okay.
We would have a different hypothesis if we weren't looking at means. Maybe we were looking at proportions, in which case this part of the hypothesis would change, but it's still the basic statement that whatever we've manipulated, that manipulation had no effect on our statistics. In other words, the treatment, the drug, did not influence anything in our population, okay. We always compare this against an alternative.
Okay, we always have to have a pre-planned alternative hypothesis. In other words, before we go and collect data, we have to have a sense of what we want to see, or what we expect the data to suggest we should see. For every hypothesis test, in order to be objective, you must collect new data or reprocess the data from a new standpoint, and so the alternative hypothesis is our pre-planned expectation.
We write this with an H with the subscript one. Some references, some books, use a subscript A to reference "alternative." I put a one here to be consistent, but it's also nice to recognize that you can have more than one alternative. In other words, the drug might actually influence different people in different ways, and we need to be able to run those kinds of analyses, but at this point we're just going to have two hypotheses: that the treatment had no effect, or that the treatment had an effect in a specified direction. Here's what I mean by pre-planned. The most generic form of the alternative hypothesis is that the mean value is not actually equal to the initially assumed value. Maybe I should highlight that: this is the assumed value for our mean,
okay. This is called a two-tail test, and I'll emphasize that again in just a moment. The two-tail test is the basic test; it's the most conservative test, meaning that the likelihood of rejecting something falsely is much, much lower, but what happens is that we fail to reject the null hypothesis more frequently with the two-tail test, and that's referencing the power. If we do a one-tail, or directional, test, we increase the power of our test and probably, I should say hopefully, get better results.
Okay, the alternative hypothesis is usually a two-tail test unless you have definite evidence to suggest you should do a directional, or one-tail, test. Sometimes we do: if our null hypothesis is that the mean value is equal to some specified assumed value (okay, this has to be known or given), we might actually say that we think the mean should be greater than the hypothesized value, or maybe we think the mean should be less than the hypothesized value. These two are one-tail, or sometimes we call them directional, tests. Okay, and they allow us to do a stronger comparison.
Okay, and I want to highlight example number fifteen coming off page 243, and I want to look at that in a moment, but the idea here is to say that we have choices, and we need to make the choice before we collect the data. Otherwise, and I'm going to use the word cheating, it's like having a dataset and analyzing it in a way that gives you the result that you think you can get from it, instead of saying, I need to establish this result, now let's go collect data and see if we can. I don't know if that's a distinction that everybody follows, but it's like knowing the end result and saying, yeah, yeah, that's what I meant, instead of saying, this is what I'm going to claim and this is what I want to show, and then actually establishing it. There's a lot more credibility if you have research suggesting one idea and then you find data that, as a second source, supports your idea, alright. So this is the alternative hypothesis, this pre-planned hypothesis. It is imperative that we remember that a one-tail test gives us different results than a two-tail test, okay, and that comes down to the computation.
Now I want to make a comment about error. There's always the potential for making a mistake in statistics, and in a hypothesis test directly, and the idea is that we have two types of error. I just want to focus on Type I error. This is the error that we want to minimize the most, and it goes with the idea of falsely rejecting the null hypothesis when it's true. So we want to minimize this false rejection when the null is really true, and the way we do that is by setting a significance level. We don't always have to, but the significance level alpha is precisely the amount of Type I error we're willing to accept at the end of the day, alright. So the idea is that if our significance level is 5 percent, that means we're allowing ourselves a 5 percent chance of falsely rejecting the null hypothesis.
Okay, so when we establish a p-value, let's say our value is 2 percent, what we're saying is that a value like the one we did see occurs 2 percent of the time or less, and that is a less risky thing than the set significance level of five percent. So what we're saying is that, based on our data, we're actually making a Type I error less than five percent of the time; therefore, we can say with more assurance that the null hypothesis is not true. Okay, in other words, we reject the null hypothesis. We don't know for a fact exactly what mu should be, but we know that it's not whatever it was given as.
Okay, so just to clarify a bit: with the significance level, we have z-scores that correspond to each significance level. For the two-tail test, if we have an alpha of 5 percent, we would have a z of plus or minus 1.96. In other words, the picture is: here's our center at 0, here's negative 1.96, here's positive 1.96. In the two tails beyond those boundaries, we have five percent of the observations. In other words, out here we have two and a half percent, and out here we have two and a half percent of the area. Okay, so the two tails add up to five percent. Now, if we had an alpha of .01, our z-scores would be plus or minus 2.58. If we had an alpha that was really small, .001, our z-scores would be plus or minus 3.30. By the way, I'm pulling this from figure 8.5, and I didn't write down the page number, sorry. Figure 8.5 has all this information listed.
So we're saying that these z-scores create the boundary lines such that the two tails collectively add up to the right significance level. Now, if we had a single tail, so for a one-tail test, the picture is going to be that single tail. If we had an alpha level of .05, we're asking, what z-score corresponds to that? I'm just going to do the upper tail, by the way; you could do the lower tail. We're saying all of that probability, five percent, is secured in that one tail, so the z-score that corresponds with that would be, depending on whether it's the upper or lower tail, plus or minus 1.645. By the way, I pulled these from the table in the back of the book, Appendix B, statistical tables, table B1, okay. Now, if we had an alpha of .01, the z-score corresponding to that would be plus or minus 2.33.
So the new picture, depending on the tail that you want, is: here's 0, here's my upper bound at 2.33, or the lower bound at negative 2.33. For a one-tail test, if, for example, my null hypothesis is that the mean is less than or equal to some hypothesized value, and my alternative is that the mean is actually greater than the hypothesized value, I'm looking at the upper one-tail test here, and I think it's nice to note that the corner of the inequality points at the tail. What we're saying is that in that tail we have five percent. Now, if we had a different alternative hypothesis, that the mean should be less than some value, then we're looking at the one tail that's below, and again, in there we should see 5 percent, okay.
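All of these table values can be reproduced from the standard normal inverse CDF. Here's a quick check in Python (not from the book; note that the raw values round slightly differently than some table entries):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution

# Two-tailed: split alpha evenly between the two tails.
two_tail = {a: round(nd.inv_cdf(1 - a / 2), 2) for a in (0.05, 0.01, 0.001)}
# 0.05 -> 1.96, 0.01 -> 2.58, 0.001 -> 3.29 (tables often list 3.30)

# One-tailed: put all of alpha in a single tail.
one_tail = {a: round(nd.inv_cdf(1 - a), 3) for a in (0.05, 0.01)}
# 0.05 -> 1.645, 0.01 -> 2.326 (the book rounds to 2.33)
```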
The idea is that, depending on the test, our p-value is either going to be the single localized tail or the sum of the two tails. So let's look at an example. I think the p-value is very straightforward to compute, so I'm going to do the two-tail test for number 15, and then I want to show the one-tail test.
Okay, so let's look on page 243, and this is number 15, alright. The idea here, and you can read through this problem, is that researchers have noted that as people age, their cognitive functioning ability decreases, so if we can introduce some ingredient into their food, or just give them a medicine, we can potentially stop that decline in their cognitive functioning. So what they did is they took a sample of sixteen elderly people and administered some antioxidant concoction, and they found that when they gave them a test of cognitive ability, the average score was 50.2. Okay, so this was after the treatment was given: a sample of sixteen was given this test, and after the treatment, the test average was 50.2.
Now, they knew from previous information that in the population, the average score on the same exact test was 45, with a standard deviation of 9. Okay, so this information was known. So what they said is, okay, well, if in our sample of size 16 we observed a mean of 50.2, how likely is that? In other words, they want to come up with the p-value, which tells you the likelihood of observing a mean score greater than or equal to 50.2 in a sample of 16 from a population that's distributed normally with a mean of 45 and a standard deviation of 9, okay.
So we want to find a probability: how likely is it that we would see this as our score? So let's just compute. They do give us a significance level, and I'm going to ignore that for a second. Let's look at the two-tail test. The reason I'm looking at the two-tail test is that I want to go in without an assumption that says the people should do better or worse, mainly because I want to know either way.
In other words, maybe our intervention made things worse. Maybe the people thought, oh wow, I'm going to take this drug and then take a test, I'd better do well, and then the stress made them do worse. So just to block that out and not have any sort of lurking variable influencing us, we're going to do a two-tail test. In other words, our assumption is that the mean is equal to 45, and this is the known value. We're going to run it against the alternative, and this is the two-tail test, that the mean is actually not 45 after the treatment, okay.
In other words, I'm not saying that it's greater than 45, I'm saying that it's just not 45. So it could be greater than, or it could be less than, okay. Of course our data suggests greater than, and so the temptation is to put in the directional tail and use mu greater than 45, but that's using the data after the fact. Before we went into this, if we wanted to, we could have said, you know what, we really think, based on some other evidence, that the antioxidant is going to increase the scores, and we could go in with that assumption already established and then test it with the data we get after we've already created the assumption. Okay, so for the two-tail test,
we would come up with this probability. We would ask: what is the probability of observing a mean greater than or equal to 50.2? Okay, we want to come up with that; that's going to be our p-value for the one tail. What about the other direction? I'm not going to list it out just yet; let me clear this. To find the two-tail p-value, we're going to double whatever we get here. So for the two-tail test, the p-value is going to be twice the p-value from the one tail, okay.
It's kind of an interesting thing, and it's not helpful to write it all out, but I just want you to know. In other words, the idea of "extreme" here is: how far away is the 50.2 from the mean, and then we also go below the mean by that same amount. What's the likelihood of being that far away in general? So let's compute the z-score, and then it'll be easier for me to convey the two-tail test. The z-score would be 50.2 minus the mean of 45, divided by the standard error; right, this is a sample mean drawn from a sample of size 16, so the standard error is the standard deviation divided by the square root of 16. You do the computation and you come up with a z-score of 2.311. What does that tell you? It tells you that 50.2 is 2.311 standard errors above 45.
So, to do the two-tail p-value, we want to look above and below by that distance. What's the probability that we have a z that is less than or equal to negative 2.311, plus the probability that z is greater than or equal to positive 2.311? That would be the p-value for the two-tail test, okay.
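The whole calculation for number 15 fits in a few lines; this is just my own sketch of the arithmetic above, using Python's standard library:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 45, 9     # known population mean and standard deviation
n, x_bar = 16, 50.2   # sample size and sample mean after treatment

se = sigma / sqrt(n)        # standard error: 9 / 4 = 2.25
z = (x_bar - mu) / se       # (50.2 - 45) / 2.25, about 2.311

one_tail_p = 1 - NormalDist().cdf(z)  # P(Z >= 2.311), about .0104
two_tail_p = 2 * one_tail_p           # about .0208
```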
It's nice that these two probabilities are actually the same, and that's what I'm referencing up here when I say we double the one-tail p-value. So what is that p-value, the probability of observing a 50.2 when the mean is 45? The probability of Z being greater than or equal to 2.311, if you look it up in Table B1, is .0104. Okay, notice that would be the p-value for the one-tail test; let's look at it for the two-tail. The p-value we actually want, the p-value for the two-tail test, would be twice that, so p is equal to .0208. In other words, just barely over 2 percent of the time we would observe a value as extreme as 50.2.
That is very unlikely; two percent of the time is a very low percent of the time. But in application, you need to establish what "unlikely" means for your field. In other words, sometimes in business 5 percent is fine, and sometimes in business 10 percent is fine; if our p-value is less than 10 percent, we're totally okay with that. But in manufacturing, think really detailed manufacturing, a 2 percent error rate is really a lot. You have to be very precise, so the margin of error we'd be willing to accept would be much, much lower, and our significance level might be .0001, one ten-thousandth. In other words, if it's a clean room or something, that's saying you would allow one particle of dust for every 10,000 particles of air, something like that. Okay, so that's a very low value. Depending on the context, you might just stick with the standard five percent, but I think it's better to just report the p-value and let somebody interpret your results.
Okay, so our interpretation, just based on the p-value: we would say, switching the page here, that having a p-value of .0208 is enough evidence to reject the assumption that H0 is true. So in other words, we are rejecting H0 in favor of the alternative: the mean is not 45. What we would need to do now, the next step in research, would be to repeat this. I don't mean repeat the test; I should say repeat the experiment. Repeat the experiment, and use the specified alternative, that the mean should be greater than 45.
Okay, so real quick, there's another idea coming out of chapter 8: Cohen's d, which measures effect size in absolute magnitude. In other words, we're not going to scale by the standard error; we're going to scale the difference between the means just using the standard deviation, okay, so we're disregarding sample size. It's a very straightforward computation: d is equal to the difference in the means divided by the standard deviation.
Okay, so for our example, number fifteen, that we're just finishing: the d value, the effect size, would be the difference in the means, 50.2 minus 45, divided by the standard deviation, 9. Okay, I didn't do the computation ahead of time, I should have, sorry: 50.2 minus 45 is 5.2, divided by 9, so the effect size is .57 with a bunch of sevens; let's just round it to .58, okay, alright.
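A tiny sketch of that computation (my own variable names, not from the book):

```python
mu, sigma = 45, 9  # known population mean and standard deviation
x_bar = 50.2       # sample mean after treatment

# Cohen's d scales by the standard deviation, not the standard error,
# so the sample size n = 16 never enters the formula.
d = (x_bar - mu) / sigma  # 5.2 / 9, about 0.58, a medium effect by Cohen's guidelines
```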
What I wanna do now is look at chapter 9.