What we’re going to do next is explore how to work with categorical variables. The discussion is no longer going to involve a quantitative variable, as it has so far — like an SAT score.
We’re going to develop what we call the chi-square test of independence. Let’s refresh our memories, to begin with, about categorical and quantitative variables. When we talk about quantitative variables, we’re talking about things that we can measure or count — distance, temperature, the number of students in a classroom. They can be continuous or discrete, but the idea is that we have something we can quantify. When we have categorical variables — and we can talk about nominal and ordinal categorical variables — we basically classify observations into different groups. If it is a nominal categorical variable, the groups can be the blue group, the green group and the purple group. They’re just labels.
It is very important for you to always understand the nature of the data that you’re working
with, the nature of the variables that you’re working with. Most of the work that you’ve had to do to this point actually relates to that, and most of the questions that you had to answer in the first exam were meant to make sure that we understand the difference between the two. The tools that you can use to explore relationships between variables will change depending on the nature of those variables. It makes no sense, for example, to try to use Analysis of Variance to explore the relationship between two categorical variables.
It is so important to always understand what it is that we have, and I think that asking these questions will help you: Is it something that we measure, or is it something that we record about the individual — some characteristic of the individual? Now, don’t make the mistake of thinking, “Oh, I actually asked the person whether or not she agrees with a particular statement, and what I recorded in my dataset is a number; therefore, this is a quantitative variable.” You recorded the data and it appears as a numeric variable in your dataset, but it’s still categorical. Just because you have assigned numbers to categories, that doesn’t change it to a numeric variable — a quantitative variable — as we’ve defined it here. It’s still categories. We’re going to talk about two-way tables. There are different names that you’re going to hear for them: two-way tables, crosstabs, contingency tables. All of these mean the same thing; you just take two factors and you try to see how they intersect. First
of all, I know that you have worked in the labs — most of you — with two-way tables.
Hopefully, the first part will go very quickly, but I want to make sure that we understand
exactly what we have. I’m going to work here with two categorical variables — age
group, education. Someone could say, “Well, age? Categorical variable?” I said, “age
group.” Age may be a quantitative variable, but I may create a categorical variable out
of it by classifying people as belonging to a certain group — younger vs. older people.
You could do that. Whenever you take a quantitative variable and convert it to a categorical one, there is a certain level of arbitrariness in how you define these groups. As a researcher, you always have to stand ready to defend your choices. Typically, you rely on the experience in your discipline and on the particular ways the groups have been defined for the variable that you’re interested in.
So I have one categorical variable, age group, with three levels — and another variable, level of education, with four levels — and in a two-way table one variable is represented by the rows of the table. I can add across a row to get the total number of people that did not complete high school across all age groups, and the columns represent the levels of the other factor. The intersection essentially tells me how many people in a given age group did not complete high school. Here it’s how many people in the 35 to 54 age group have one to three years of college, and so on and so forth. As we’re going to see, we refer to these as conditional distributions, but that’s
the basic idea. You don’t do this with quantitative variables. Let me repeat that it’s two factors,
two categorical variables that you have in the rows and columns. We can look at this
and calculate what we call the “marginal distributions.” If I look at the row totals, that basically tells me how many people are in the different levels of my factor, education. It’s the same thing as if you were producing a one-way frequency table for the variable education, something you’ve done already in past assignments. Down here, across all of the columns, it says how many people are in, quote unquote, my “young” group and how many people are in my “older” group. The marginal distributions tell
me how my observations are distributed across different levels of a factor independently.
In other words, without taking into account information about the other factor. Because
they appear at the end, at the margins, they’re called the marginal distributions. Here it’s essentially running simple frequencies for education; here, simple frequencies for age groups.
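As a small illustration, here is a minimal sketch in Python of computing marginal distributions from a two-way table; the counts and labels are made up for illustration and are not the table from the lecture.

```python
# Hypothetical counts: education levels in the rows, age groups in the columns.
import pandas as pd

counts = pd.DataFrame(
    {"18-34": [2000, 5000, 3000],
     "35-54": [9000, 12000, 7000],
     "55+":   [14000, 10000, 4000]},
    index=["No high school", "High school / some college", "College 4+ years"],
)

row_totals = counts.sum(axis=1)  # marginal distribution of education
col_totals = counts.sum(axis=0)  # marginal distribution of age group
print(row_totals)
print(col_totals)
```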
Once we have the marginal distributions, this is what you’ve been doing: you can express them graphically and say, okay, let me look at the frequency distribution for this categorical variable, where the bar height tells me something about the distribution. It could be the density; it could be percentages, the relative frequency distribution; it could be the counts, the absolute frequency distribution; and so on and so forth. These are things that you’ve done already. You can also take the information here in the second graph. I can express any aspect of this table in the way that allows me to make the point I’m trying to make as effectively as possible.
What is important for us to remember is that besides these marginal distributions, what we’re really interested in is the conditional distributions. In other words, we really want to look at what we see in the middle, in the main part of the table, because that presents
the intersection of the two factors. It says that there are 9,000 people in the age group
of 35 to 54 that did not complete high school. If you look at the people that did not complete
high school, it’s a large number for the older group and a smaller number for the younger
group. The crosstab, or two-way table, provides a very effective way to summarize the information between two categorical variables, and it’s a very effective tool for descriptive analyses. Instead of, or in addition to, the actual numbers — the counts — you are going to offer the percentages, because that allows you to do the comparison much faster. The question you’re asking, essentially, is:
“Well, is it the case that what I see here is the result of chance, or is it that something
is going on here that these two factors are related?” That is the basis behind the test
of independence. You say that I am looking at these two categorical variables, and I
see that they have this type of intersection. Is what I see here just a result of chance,
or is there something that is going on? That is what you're after with a chi-square test.
What we see in the middle part here is what we call the conditional distributions. Conditional, why? Because the value that I see — 14,226 — is the number of people 55 and older that did not complete high school. In other words, that information is conditional on the age group, versus the total for the row, which says how many people did not complete high school independent of age group. Now I’m conditioning that information on the age group, and I have the distribution that I see in the main part of the table — what we refer to as the conditional distributions. Just to avoid running into problems, especially if you have a very different number of folks in the different groups (you see that there are 37,000 in the younger group and 56,000 in the older group, 27,000 who completed high school and 44,000 with four or more years of college), you always want to express this in percentages.
To calculate the column percentage, you say, well, I see that there are 11,000 people that have four years or more of college out of a column total of 37,787. That number divided by the column total, times 100, gives you the percentage for that cell. In general, for the column percentage you take the number in that cell, divide it by the column total and multiply it by 100, and that will be the column percent for that cell. If you want the row percent, again it is the number in that cell divided by the row total and multiplied by 100. If you want, and it will make it easier for you, you can
take that information, of course, and display it nicely in simple bar charts that allow
for a visual element. Sometimes it’s much easier to actually spot the trend by looking
at these pictures.
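Here is a minimal sketch of those three kinds of percentages on a small hypothetical table; none of these numbers come from the lecture’s example.

```python
# Hypothetical 2x2 table of counts.
import numpy as np

counts = np.array([[11., 26.],
                   [14., 29.]])

grand_total = counts.sum()
cell_pct = counts / grand_total * 100                       # percent of all observations
col_pct = counts / counts.sum(axis=0, keepdims=True) * 100  # cell divided by column total
row_pct = counts / counts.sum(axis=1, keepdims=True) * 100  # cell divided by row total
print(cell_pct, col_pct, row_pct, sep="\n")
```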
Earlier on, when we were discussing the different tables, I said that if we look at this table here and want to say something about the relationship of these two variables, then essentially what we’re asking is, “Is what I see here just the result of chance, or is there something going on that allows me to conclude that the two variables — the two factors — may be related?” The question then is, okay, if that’s what you’re going to assess, how are you going to do it? Well,
the way that I’m going to do it is by calculating what we call expected counts. The count that I see here in the first cell, for example, is the actual number of observations that fall in that particular intersection. What the expected count will tell me is what I would expect to see in that intersection if the two factors are not related. If the two factors are not related, then I expect to see a certain count in each cell. What will that count be? It will depend on the marginal distributions; it will be consistent with the marginal distributions. That is the basic idea behind the chi-square test — chi is a Greek letter, so it must be something sophisticated and
exotic, right? Well, what the chi-square test does is that it says that I’m now going
to compare the expected and the actual counts, and I will see if these differences are big
enough. If they’re big enough, then yes, something is going on; if they’re not big
enough, nothing is going on. That’s the basic idea. One way to see how the expected count in any cell is calculated is through the row percent: under independence, the percent you see in a particular cell of a column should be consistent with the percent that you see in the marginal distribution. That allows me to calculate
the expected count in the two-way table. I have the actual number, and this is what I
expect if the two factors are not related. If the factors are not related, then essentially
the expected count tells me that this should be consistent with the marginal distributions.
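Here is a minimal sketch of the expected-count calculation under independence, using row total times column total divided by the grand total (the same formula that comes up again in the worked example below); the counts themselves are hypothetical.

```python
# Expected count for each cell = (row total * column total) / grand total.
import numpy as np

observed = np.array([[15., 10.],
                     [ 7., 18.],
                     [ 4., 20.]])   # hypothetical counts

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals @ col_totals / observed.sum()
print(expected)
```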
What the chi-square test does is that it takes the information from the differences
in expected and actual counts to calculate the appropriate test statistic — the chi-square
statistic. To calculate this very exotic statistic, in each cell you have the observed
count and you have now the expected count. You take the difference. Because some differences
may be positive and some may be negative, we square them so that all of them are positive.
In other words, I don’t care if the expected count is smaller than the actual count, or
the opposite; I care that there is a difference. I take the difference between observed count
and expected count and I square it, so that all of them are positive and I divide it by
the expected count. I do this for every cell, and then I add them up. That’s my chi-square.
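As a small illustration of that sum, here is a sketch on a hypothetical table; the counts are made up.

```python
# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
import numpy as np

observed = np.array([[20., 30.],
                     [30., 20.]])   # hypothetical counts
expected = (observed.sum(axis=1, keepdims=True)
            @ observed.sum(axis=0, keepdims=True)) / observed.sum()

chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)   # 4.0 for this made-up table
```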
After I calculate my chi-square, then based on the degrees of freedom that I have — degrees of freedom again for the chi-square — I know something about the shape of the chi-square distribution. Based on that information, I can calculate the p-value. Based on the p-value that I have and my selected level of significance, I will end up rejecting or failing to reject the null hypothesis. What is my null hypothesis here? What I set out to do is see whether these two categorical variables are related. My null hypothesis is that the two categorical variables are not related. And the alternate? If I reject the null hypothesis, then I rule in favor of the alternate, which is that the two factors are related. In other words, if the two factors are not related, the chi-square will be small. Why would the chi-square be small? Because the differences between the observed and expected counts will not be big enough to justify concluding that what we see is anything other than the result of chance.
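Here is a minimal sketch of that decision step, going from the chi-square statistic and the degrees of freedom to a p-value and a reject or fail-to-reject call; the numbers below are placeholders, not values from the lecture.

```python
# From a chi-square statistic and degrees of freedom to a p-value and a decision.
from scipy.stats import chi2

chi_sq = 5.0    # placeholder statistic
df = 2          # for an r-by-c table, df = (r - 1) * (c - 1)
alpha = 0.05    # selected level of significance

p_value = chi2.sf(chi_sq, df)   # upper-tail area beyond the statistic
if p_value < alpha:
    print("reject the null: the two factors appear to be related")
else:
    print("fail to reject the null: no evidence the factors are related")
```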
Here is an example where we’re dealing with
a particular addiction and different ways of treatment. Some people that are subject
to this treatment relapse after the treatment is over; some people do not. We would like
to assess the efficacy of the different methods of treatment. We have two different medications
and the control group — the placebo group. We see that at the top we have the simple
two-way table where we have the number of people that did not have a relapse with this
medication, and of course, the marginal distributions — the actual count we have here. The test
that we want to perform is to say, first of all, are these categorical or quantitative
variables? In order for me to be able to do this test, they must be categorical. It’s
a yes/no for the relapse factor and three different treatments — A, B, C — that
I have as row variables. I want to see whether or not these numbers that I observe here are
the result of chance, or is it because something is going on? In other words, is relapsing or not related to the different treatment methods? On the left here you will find these types of graphs, in varying forms, because they give you some idea right away. Look at what happens for your observed counts: for the first treatment it’s about 60 percent; for the second it’s 27; for the third it’s 17. That may give you some idea right away, but hopefully this will be confirmed with our
chi-square test. We need to calculate the expected count. If I take the row total multiplied
by column total and divide it by the total number of observations, I get the expected
count. Now that I have the expected counts and I have the actual counts, I can very quickly
calculate my chi-square. Just take the difference for every cell — 15 minus 8.78; square it;
divide it by 8.78. For every cell, take the difference and calculate what we call the components of the chi-square statistic. Again, you’re not going to do this by hand. Your software will do it for you, but hopefully all of us understand how we get to this chi-square statistic.
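Since the software does this for you, here is a sketch of what that single call can look like with scipy.stats.chi2_contingency; the table below is a stand-in with the same shape (three treatments by relapse yes/no), not the actual counts from the lecture.

```python
# Chi-square test of independence on a hypothetical treatment-by-relapse table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[15, 10],    # treatment A: no relapse, relapse
                     [ 7, 18],    # treatment B
                     [ 4, 21]])   # treatment C (placebo)

chi_sq, p_value, df, expected = chi2_contingency(observed)
print("expected counts:\n", expected)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}")
```

For a table with three rows and two columns, the degrees of freedom work out to (3 - 1)(2 - 1) = 2.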
The chi-square components that we have are listed here. These are essentially the values of the different ratios. The reason I brought them here is that by looking at them, we can
see where exactly the differences may lie and why that may be important — because
the chi-square test will tell you whether or not the two factors are independent.
In other words, what you’re doing is you’re looking to see if factor A and factor B are
related, so you rule in favor or against your null hypothesis. It doesn’t tell you anything
about how strong the relationship is or where exactly you see these big differences. You
have to engage in this type of analysis afterwards.
In this calculation I found that my chi-square is 10.74. Now I have to see what the p-value associated with this value of the chi-square is. That will depend on the shape of the chi-square distribution, and the chi-square distribution will depend on the degrees of freedom — one degree of freedom, two degrees of freedom, eight degrees of freedom and so on and so forth.
After you have your chi-square, of course, and you know the degrees of freedom, you can look at the p-value, and the same process that we described before applies again. If the p-value here is smaller than your selected level of significance, then you reject the null hypothesis. The chi-square test in this case established that these two factors — relapse or not, and treatment — are not independent of each other. We reject the null hypothesis. In the next few slides, what I have is essentially
the output from the different software. I’m using an example from the General Social Survey, where the respondents were asked whether or not a woman could have a legal abortion for any reason.
You have two categorical variables — the sex of the respondent, male or female, and how they respond to this question. Let’s suppose that the responses to the abortion question were yes and no.
Here we have that information tabulated. It tells me the actual count. Again, 350, 478 — that is the actual count. You see the legend: this is SAS output, and the first entry in each cell is the frequency, then the expected count, and then the percent. The percent is calculated by taking the number 350 and dividing it by the total number of observations. Then there is the row percentage: 350 divided by the row total of 828, and so on and so forth.
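As a quick check on those cell entries: the row percent uses the 350 and 828 mentioned above, while the grand total is not shown in this excerpt, so below it is only a placeholder.

```python
cell_count = 350
row_total = 828
grand_total = 2000          # placeholder; the overall total is not given in this excerpt

overall_pct = cell_count / grand_total * 100   # the "percent" entry in the cell
row_pct = cell_count / row_total * 100         # the "row percent" entry: about 42.3
print(round(overall_pct, 1), round(row_pct, 1))
```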
What was your null hypothesis? That these two factors — the sex of the respondent and how they responded to the question — are not related. That’s your null hypothesis. If you reject the null hypothesis, you rule in favor of the alternate, which is that the two factors are related. Several of you raised your hands to say that, yes, you expect men and women to respond differently, meaning that you would expect the chi-square here to be significant. Is that the case based on this p-value? No, and so in this case I fail to reject the null hypothesis.