What we’re going to do next is explore how to work with categorical variables. The discussion is no longer going to involve a quantitative variable, as it has so far — like an SAT score.
We’re going to develop what we call the chi-square test of independence. Let’s refresh our memories, to begin with, about categorical and quantitative variables. When we talk about quantitative variables, we’re talking about things that we can measure or count — distance, temperature, the number of students in a classroom. They can be continuous or discrete, but the idea is that we have something we can quantify. When we have categorical variables — and we can talk about nominal and ordinal categorical variables — we basically classify observations into different groups. If it is a nominal categorical variable, the groups can be the blue group, the green group and the purple group. They’re just labels.
It is very important for you to always understand the nature of the data that you’re working
with, the nature of the variables that you’re working with. Most of the work that you’ve had to do to this point actually relates to that, and most of the questions that you had to answer in the first exam were meant to make sure that we understand the difference between the two. The tools that you can use to explore relationships between variables will change depending on the nature of those variables. It makes no sense, for example, to try to use Analysis of Variance to explore the relationship between two categorical variables.
It is so important to always understand what it is that we have, and I think that asking these questions will help you: Is it something that we measure, or is it something that we record about the individual — some characteristic of the individual? Now, don’t make the mistake of thinking, “Oh, I actually asked the person whether or not she agrees with a particular statement, and what I recorded in my dataset is a number; therefore, this is a quantitative variable.” You recorded the data and it appears as a numeric variable in your dataset, but it’s still categorical. Just because you have assigned numbers to categories, that doesn’t change it to a numeric variable — a quantitative variable — as we’ve defined it here. It’s still categories. We’re going to talk about two-way tables. There are different names that you’re going to hear for them: two-way tables, crosstabs, contingency tables. All of these mean the same thing; you just take two factors and you try to see how they intersect. First
of all, I know that you have worked in the labs — most of you — with two-way tables.
Hopefully, the first part will go very quickly, but I want to make sure that we understand
exactly what we have. I’m going to work here with two categorical variables — age
group, education. Someone could say, “Well, age? Categorical variable?” I said, “age
group.” Age may be a quantitative variable, but I may create a categorical variable out
of it by classifying people as belonging to a certain group — younger vs. older people.
You could do that. Whenever you take a quantitative variable and convert it to a categorical one, there is a certain level of arbitrariness in how you define these groups. As a researcher, you always have to stand ready to defend your choices. Typically, you rely on the experience in your discipline and on the particular ways the groups have been defined for the variable that you’re interested in.
So I have one categorical variable, age group, with three levels — and another variable, level of education, with four levels — and in a two-way table one variable is represented by the rows of the table. I can add across a row to get the total number of people that did not complete high school across all age groups, and the columns represent the levels of the other factor. The intersection essentially tells me how many people in a given age group did not complete high school. Here it’s how many people in the 35 to 54 age group have one to three years of college, and so on and so forth. As we’re going to see, we refer to these as conditional distributions, but that’s
the basic idea. You don’t do this with quantitative variables. Let me repeat that it’s two factors,
two categorical variables that you have in the rows and columns. We can look at this
and calculate what we call the “marginal distributions.” If I look at the row totals, that basically tells me how many people are in the different levels of my factor, education. It’s the same thing as if you were producing a one-way frequency table for the variable education, something you’ve done already in past assignments. Down here, across all of the columns, it says how many people are in, quote unquote, my “young” group and how many people are in my “older” group. The marginal distributions tell
me how my observations are distributed across different levels of a factor independently.
In other words, without taking into account information about the other factor. Because
they appear at the end, at the margins, they’re called the marginal distributions. Here it’s essentially running simple frequencies for education; here, simple frequencies for age groups.
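As a small illustration, here is a minimal sketch in Python of computing marginal distributions from a two-way table; the counts and labels are made up for illustration and are not the table from the lecture.

```python
# Hypothetical counts: education levels in the rows, age groups in the columns.
import pandas as pd

counts = pd.DataFrame(
    {"18-34": [2000, 5000, 3000],
     "35-54": [9000, 12000, 7000],
     "55+":   [14000, 10000, 4000]},
    index=["No high school", "High school / some college", "College 4+ years"],
)

row_totals = counts.sum(axis=1)  # marginal distribution of education
col_totals = counts.sum(axis=0)  # marginal distribution of age group
print(row_totals)
print(col_totals)
```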
Once we have the marginal distributions, this is what you’ve been doing: you can express them graphically and say, okay, let me look at the frequency distribution for this categorical variable, where the bar height tells me something about the distribution. It could be the density; it could be percentages, the relative frequency distribution; it could be the counts, the absolute frequency distribution; and so on and so forth. These are things that you’ve done already. You can also take the information here in the second graph. I can express any aspect of this table in the way that allows me to make the point I’m trying to make as effectively as possible.
What is important for us to remember is that besides these marginal distributions, what we’re really interested in is the conditional distributions. In other words, we really want to look at what we see in the middle, in the main part of the table, because that presents
the intersection of the two factors. It says that there are 9,000 people in the age group
of 35 to 54 that did not complete high school. If you look at the people that did not complete
high school, it’s a large number for the older group and a smaller number for the younger
group. The crosstab, or two-way table, provides a very effective way to summarize the information between two categorical variables, and it’s a very effective tool for descriptive analyses. Instead of, or in addition to, the actual numbers — the counts — you are going to offer the percentages, because that allows you to do the comparison much faster. The question you’re asking, essentially, is:
“Well, is it the case that what I see here is the result of chance, or is it that something
is going on here that these two factors are related?” That is the basis behind the test
of independence. You say that I am looking at these two categorical variables, and I
see that they have this type of intersection. Is what I see here just a result of chance,
or is there something that is going on? That is what you're after with a chi-square test.
What we see in the middle part here is what we call the conditional distributions. Conditional, why? Because the value that I see — 14,226 — is the number of people 55 and older that did not complete high school. In other words, that information is conditional on the age group, versus the total for the row, which says how many people did not complete high school independent of age group. Now I’m conditioning that information on the age group, and I have the distribution that I see in the main part of the table — what we refer to as the conditional distributions. Just to avoid running into problems, especially if you have a very different number of folks in the different groups (you see that there are 37,000 in the younger group and 56,000 in the older group, 27,000 who completed high school and 44,000 with four or more years of college), you always want to express this in percentages.
To calculate the column percentage, you say, well, I see that there are 11,000 people that have four years or more of college out of a column total of 37,787. That number divided by the column total, times 100, gives you the percentage for that cell. In general, for the column percentage you take the number in that cell, divide it by the column total and multiply it by 100, and that will be the column percent for that cell. If you want the row percent, again it is the number in that cell divided by the row total and multiplied by 100. If you want, and it will make it easier for you, you can
take that information, of course, and display it nicely in simple bar charts that allow
for a visual element. Sometimes it’s much easier to actually spot the trend by looking
at these pictures.
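Here is a minimal sketch of those three kinds of percentages on a small hypothetical table; none of these numbers come from the lecture’s example.

```python
# Hypothetical 2x2 table of counts.
import numpy as np

counts = np.array([[11., 26.],
                   [14., 29.]])

grand_total = counts.sum()
cell_pct = counts / grand_total * 100                       # percent of all observations
col_pct = counts / counts.sum(axis=0, keepdims=True) * 100  # cell divided by column total
row_pct = counts / counts.sum(axis=1, keepdims=True) * 100  # cell divided by row total
print(cell_pct, col_pct, row_pct, sep="\n")
```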
Earlier on, when we were discussing the different tables, I said that if we look at this table here and want to say something about the relationship of these two variables, then essentially what we’re asking is, “Is what I see here just the result of chance, or is there something going on that allows me to conclude that the two variables — the two factors — may be related?” The question then is, okay, if that’s what you’re going to assess, how are you going to do it? Well,
the way that I’m going to do it is by calculating what we call expected counts. The count that I see here in the first cell, for example, is the actual number of observations that fall in that particular intersection. What the expected count will tell me is what I would expect to see in that intersection if the two factors are not related. If the two factors are not related, then I expect to see a certain count in each cell. What will that count be? It will depend on the marginal distributions; it will be consistent with the marginal distributions. That is the basic idea behind the chi-square test — chi is a Greek letter, so it must be something sophisticated and
exotic, right? Well, what the chi-square test does is that it says that I’m now going
to compare the expected and the actual counts, and I will see if these differences are big
enough. If they’re big enough, then yes, something is going on; if they’re not big
enough, nothing is going on. That’s the basic idea. One way to see how the expected count in any cell is calculated is through the row percent: under independence, the percent you see in a particular cell of a column should be consistent with the percent that you see in the marginal distribution. That allows me to calculate
the expected count in the two-way table. I have the actual number, and this is what I
expect if the two factors are not related. If the factors are not related, then essentially
the expected count tells me that this should be consistent with the marginal distributions.
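Here is a minimal sketch of the expected-count calculation under independence, using row total times column total divided by the grand total (the same formula that comes up again in the worked example below); the counts themselves are hypothetical.

```python
# Expected count for each cell = (row total * column total) / grand total.
import numpy as np

observed = np.array([[15., 10.],
                     [ 7., 18.],
                     [ 4., 20.]])   # hypothetical counts

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals @ col_totals / observed.sum()
print(expected)
```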
What the chi-square test does is that it takes the information from the differences
in expected and actual counts to calculate the appropriate test statistic — the chi-square
statistic. To calculate this very exotic statistic, in each cell you have the observed
count and you have now the expected count. You take the difference. Because some differences
may be positive and some may be negative, we square them so that all of them are positive.
In other words, I don’t care if the expected count is smaller than the actual count, or
the opposite; I care that there is a difference. I take the difference between observed count
and expected count and I square it, so that all of them are positive and I divide it by
the expected count. I do this for every cell, and then I add them up. That’s my chi-square.
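As a small illustration of that sum, here is a sketch on a hypothetical table; the counts are made up.

```python
# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
import numpy as np

observed = np.array([[20., 30.],
                     [30., 20.]])   # hypothetical counts
expected = (observed.sum(axis=1, keepdims=True)
            @ observed.sum(axis=0, keepdims=True)) / observed.sum()

chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)   # 4.0 for this made-up table
```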
After I calculate my chi-square, then based on the degrees of freedom that I have — degrees of freedom again for the chi-square — I know something about the shape of the chi-square distribution. Based on that information, I can calculate the p-value. Based on the p-value that I have and my selected level of significance, I will end up rejecting or failing to reject the null hypothesis. What is my null hypothesis here? What I set out to do is see whether these two categorical variables are related. My null hypothesis is that the two categorical variables are not related. And the alternate? If I reject the null hypothesis, then I rule in favor of the alternate, which is that the two factors are related. In other words, if the two factors are not related, the chi-square will be small. Why would the chi-square be small? Because the differences between the observed and expected counts will not be big enough to justify concluding that what we see is anything other than the result of chance.
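Here is a minimal sketch of that decision step, going from the chi-square statistic and the degrees of freedom to a p-value and a reject or fail-to-reject call; the numbers below are placeholders, not values from the lecture.

```python
# From a chi-square statistic and degrees of freedom to a p-value and a decision.
from scipy.stats import chi2

chi_sq = 5.0    # placeholder statistic
df = 2          # for an r-by-c table, df = (r - 1) * (c - 1)
alpha = 0.05    # selected level of significance

p_value = chi2.sf(chi_sq, df)   # upper-tail area beyond the statistic
if p_value < alpha:
    print("reject the null: the two factors appear to be related")
else:
    print("fail to reject the null: no evidence the factors are related")
```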
Here is an example where we’re dealing with
a particular addiction and different ways of treatment. Some people that are subject
to this treatment relapse after the treatment is over; some people do not. We would like
to assess the efficacy of the different methods of treatment. We have two different medications
and the control group — the placebo group. We see that at the top we have the simple
two-way table where we have the number of people that did not have a relapse with this
medication, and of course, the marginal distributions — the actual count we have here. The test
that we want to perform is to say, first of all, are these categorical or quantitative
variables? In order for me to be able to do this test, they must be categorical. It’s
a yes/no for the relapse factor and three different treatments — A, B, C — that
I have as row variables. I want to see whether or not these numbers that I observe here are
the result of chance, or is it because something is going on? In other words, is relapsing or not related to the different treatment methods? On the left here you will find these types of graphs, in varying forms, because they give you some idea right away. Look at what happens for your observed counts: for the first treatment it’s about 60 percent; for the second it’s 27; for the third it’s 17. That may give you some idea right away, but hopefully this will be confirmed with our
chi-square test. We need to calculate the expected count. If I take the row total multiplied
by column total and divide it by the total number of observations, I get the expected
count. Now that I have the expected counts and I have the actual counts, I can very quickly
calculate my chi-square. Just take the difference for every cell — 15 minus 8.78; square it;
divide it by 8.78. For every cell, take the difference and calculate what we call the components of the chi-square statistic. Again, you’re not going to do this by hand. Your software will do it for you, but hopefully all of us understand how we get to this chi-square statistic.
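Since the software does this for you, here is a sketch of what that single call can look like with scipy.stats.chi2_contingency; the table below is a stand-in with the same shape (three treatments by relapse yes/no), not the actual counts from the lecture.

```python
# Chi-square test of independence on a hypothetical treatment-by-relapse table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[15, 10],    # treatment A: no relapse, relapse
                     [ 7, 18],    # treatment B
                     [ 4, 21]])   # treatment C (placebo)

chi_sq, p_value, df, expected = chi2_contingency(observed)
print("expected counts:\n", expected)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}")
```

For a table with three rows and two columns, the degrees of freedom work out to (3 - 1)(2 - 1) = 2.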
The chi-square components that we have are listed here. These are essentially the values of the different ratios. The reason I brought them here is that by looking at them, we can
see where exactly the differences may lie and why that may be important — because
the chi-square test will tell you whether or not the two factors are independent.
In other words, what you’re doing is you’re looking to see if factor A and factor B are
related, so you rule in favor or against your null hypothesis. It doesn’t tell you anything
about how strong the relationship is or where exactly you see these big differences. You
have to engage in this type of analysis afterwards.
In this calculation I found that my chi-square is 10.74. Now I have to see what the p-value associated with this value of the chi-square is. That will depend on the shape of the chi-square distribution, and the chi-square distribution will depend on the degrees of freedom — one degree of freedom, two degrees of freedom, eight degrees of freedom and so on and so forth.
After you have your chi-square, of course, and you know the degrees of freedom, you can look at the p-value, and the same process that we described before applies again. If the p-value here is smaller than your selected level of significance, then you reject the null hypothesis. The chi-square test in this case established that these two factors — relapse or not, and treatment — are not independent of each other. We reject the null hypothesis. In the next few slides, what I have is essentially
the output from the different software. I’m using an example from the General Social Survey, where the respondents were asked whether or not a woman could have a legal abortion for any reason.
You have two categorical variables — the sex of the respondent, male or female, and how they respond to this question. Let’s suppose that the responses to the abortion question were yes and no.
Here we have that information tabulated. It tells me the actual count. Again, 350, 478 — that is the actual count. You see the legend: this is SAS output, and the first entry in each cell is the frequency, then the expected count, and then the percent. The percent is calculated by taking the number 350 and dividing it by the total number of observations. Then there is the row percentage: 350 divided by the row total of 828, and so on and so forth.
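As a quick check on those cell entries: the row percent uses the 350 and 828 mentioned above, while the grand total is not shown in this excerpt, so below it is only a placeholder.

```python
cell_count = 350
row_total = 828
grand_total = 2000          # placeholder; the overall total is not given in this excerpt

overall_pct = cell_count / grand_total * 100   # the "percent" entry in the cell
row_pct = cell_count / row_total * 100         # the "row percent" entry: about 42.3
print(round(overall_pct, 1), round(row_pct, 1))
```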
What was your null hypothesis? That these two factors — the sex of the respondent and how they responded to the question — are not related. That’s your null hypothesis. If you reject the null hypothesis, you rule in favor of the alternate, which is that the two factors are related. Several of you raised your hands to say that, yes, you expect men and women to respond differently, meaning that you would expect the chi-square here to be significant. Is that the case based on this p-value? No, and so in this case I fail to reject the null hypothesis.