Hi! I am Mike Marin and in this video
we'll introduce the idea of dummy variables, also called
indicator variables, and their use in regression models.
We can include categorical or qualitative variables, also known as factors
in a regression model using dummy or indicator variables.
We'll be working with the Lung Capacity data
that was introduced earlier in this series of videos.
I have already gone ahead and imported the data into R
and attached it. You can also notice here
that I have created a categorical representation of
the "Height" variable. Individuals are placed into "Height" categories,
where category A is less than 50 inches, category B 50 to 55,
category C 55 to 60,
and so on all the way up to the category F 70 or greater.
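
As a rough sketch of how a categorical Height variable like this could be built, here is one way to do it with `cut()`. The heights below are made-up toy values, not the Lung Capacity data, and treating each category as including its lower boundary (so exactly 50 inches falls in B) is an assumption.

```r
# Hypothetical heights in inches (toy values, not the Lung Capacity data)
Height <- c(47.5, 50, 54.9, 55, 62.3, 68, 70, 75.1)

# right = FALSE closes each interval on the left: [0, 50), [50, 55), ..., [70, Inf)
# (assumption: a boundary value belongs to the higher category)
CatHeight <- cut(Height,
                 breaks = c(0, 50, 55, 60, 65, 70, Inf),
                 labels = c("A", "B", "C", "D", "E", "F"),
                 right = FALSE)

# Show each height next to the category it was placed in
print(data.frame(Height, CatHeight))

# The six category labels, in order
print(levels(CatHeight))
```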
A categorical variable that has
K levels or categories requires K-1
dummy or indicator variables to represent it.
For example, the variable Smoke has two levels, No and Yes.
These two levels will require 1
indicator variable in order to represent Smoking status.
We can create a dummy or indicator variable,
we'll call it Xsmoke and we'll set this equal to 1,
if the individual Smokes or Smoking is Yes
and zero otherwise. Then for a non-Smoker the Xsmoke
indicator will equal 0. It's worth noting that
we could just as well create an indicator
for non-Smoking rather than an indicator for Smoking.
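
Here is a minimal sketch of building that indicator by hand. The smoking statuses are made-up toy values, and the lowercase labels "no" and "yes" are an assumption about how the levels are stored.

```r
# Hypothetical smoking status for five individuals (toy values)
Smoke <- factor(c("no", "yes", "no", "no", "yes"))

# Indicator for smoking: 1 if Smoke is "yes", 0 otherwise
Xsmoke <- as.numeric(Smoke == "yes")

# The alternative coding: an indicator for non-smoking
Xnonsmoke <- as.numeric(Smoke == "no")

# Each individual has exactly one of the two indicators equal to 1
print(data.frame(Smoke, Xsmoke, Xnonsmoke))
```

Only one of the two indicators would go into a model, since each one is just 1 minus the other.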
Now, let's go over the same idea but this time using the categorical Height variable.
You'll notice that categorical Height
has 6 levels, therefore we'll need 5 dummies or indicators
to represent this. We can create an indicator
or dummy, we'll call it XB and we'll set this
equal to 1 if the individual is in Height category B
and 0 otherwise. We can create another dummy or indicator,
we'll call this XC and we'll set this equal to 1
if the individual is in Height category C, and 0 otherwise;
we can also create indicators XD,
XE and XF for categories D, E and F.
You can notice that Height category A
is serving as a reference or baseline group.
An individual in Height category A will have XB equal to 0
XC equal to 0, XD,
XE, XF all equal to 0.
An individual in Height category B will have
XB equal to 1 and XC, XD,
XE and XF all equal to 0.
An individual in Height category C, will have XC equal to 1
and all other X indicators equal to 0, same for category D,
category E and category F. These 5 dummy
or indicator variables allow us to identify which of
the 6 Height categories an individual falls into.
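
The coding just described can be sketched by hand as follows. This uses one made-up individual per category purely for illustration; the variable names follow the ones used above.

```r
# One hypothetical individual in each Height category (toy values)
CatHeight <- factor(c("A", "B", "C", "D", "E", "F"))

# One indicator per non-reference category; category A gets no indicator
XB <- as.numeric(CatHeight == "B")
XC <- as.numeric(CatHeight == "C")
XD <- as.numeric(CatHeight == "D")
XE <- as.numeric(CatHeight == "E")
XF <- as.numeric(CatHeight == "F")

# Category A is the row where all five indicators are 0
print(data.frame(CatHeight, XB, XC, XD, XE, XF))
```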
In a moment we will look at the use of these in a regression model but first
let's take a look at the mean Lung Capacity for each of the
groups formed by categorical Height. I've already written some code to do this,
so I'll go ahead and submit that up here. We'll go ahead and calculate
the mean Lung Capacity
for each of the Height categories.
Here, we are asking R to calculate the mean Lung Capacity
only for those in Height category A and then the mean Lung Capacity
only for those in Height Category B and so on.
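
The exact code submitted in the video isn't shown here, but a calculation of this kind might look like the sketch below. The data are made-up toy values with only three categories, so the means will not match the ones from the Lung Capacity data.

```r
# Hypothetical lung capacities and height categories (toy values)
LungCap   <- c(2.0, 2.4, 3.5, 3.9, 5.2, 5.6)
CatHeight <- factor(c("A", "A", "B", "B", "C", "C"))

# Mean Lung Capacity only for those in category A, then only for category B
meanA <- mean(LungCap[CatHeight == "A"])
meanB <- mean(LungCap[CatHeight == "B"])
print(c(A = meanA, B = meanB))

# All group means in one call
group_means <- tapply(LungCap, CatHeight, mean)
print(group_means)
```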
Now, let's keep an eye on those means so that we can compare these
with the regression model that we're gonna fit.
Let's go ahead and fit a linear regression model.
We will relate Lung Capacity
to this variable categorical Height. We can ask for a summary of this model,
and here we can see the R output
as well as the fitted model. The intercept or constant term
b-naught (b0) of 2.15 is the estimated mean Y value
when all X's equal 0, that is, for our reference or baseline group.
In this particular model it is the mean Lung Capacity
for someone in Height Category A. The coefficient for category B
of 1.51 is the change in mean Lung Capacity
we would expect for someone in category B relative to category A.
For someone in category B, their
estimated mean Lung Capacity equals 2.15
plus 1.51 times 1 (the XB
indicator equals 1 because the individual is in category B),
plus 3.25 times 0 (the XC indicator
equals 0 because this individual is not in category C),
and so on, with all other X indicators equal to 0.
So the mean Lung Capacity for someone in category B is 2.15
plus 1.51, which equals 3.66.
This matches the category B mean we calculated earlier;
any slight difference you see is due to rounding error.
The coefficient for category C
of 3.25 is the change in mean Lung Capacity
we would expect for someone in category C relative to someone in category A;
for someone in category C the XC indicator will equal 1,
all other indicators will equal 0. In this case their estimated mean Lung Capacity
will be 2.15 plus 3.25, which equals 5.4.
You can repeat this process
to calculate the mean Lung Capacity for all other Height categories
and if you do this you'll see the mean for category D is 7.17,
the mean for category E is 8.69,
and the mean for category F is 10.8.
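
To see the link between the coefficients and the group means in code, here is a minimal sketch on made-up toy data with only three categories. The numbers are not the ones from the Lung Capacity data; the point is only that the intercept plus a category's coefficient reproduces that category's mean.

```r
# Hypothetical lung capacities and height categories (toy values)
LungCap   <- c(2.0, 2.4, 3.5, 3.9, 5.2, 5.6)
CatHeight <- factor(c("A", "A", "B", "B", "C", "C"))

# Fit the regression; R builds the indicators for B and C automatically
mod <- lm(LungCap ~ CatHeight)
b <- coef(mod)
print(b)

# Intercept is the mean for reference category A;
# intercept + coefficient gives the mean for each other category
fitted_means <- c(A = b[["(Intercept)"]],
                  B = b[["(Intercept)"]] + b[["CatHeightB"]],
                  C = b[["(Intercept)"]] + b[["CatHeightC"]])

# Compare with the group means calculated directly
group_means <- tapply(LungCap, CatHeight, mean)
print(rbind(fitted_means, group_means))
```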
Using dummy or indicator variables
is how we can include categorical or qualitative variables
in a regression model. When including a categorical variable
in a regression model, R will create the dummy or indicator variables automatically.
The category that R chooses as the reference or baseline category
will be the category that comes first alphabetically,
or first numerically if categories are coded using 0, 1, 2
and so on. In a separate video I'll show how you can change which category serves
as the reference.
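
A quick way to check which category R will use as the reference is to look at the first entry of `levels()`. The sketch below uses made-up toy factors; `Grp` and `Num` are hypothetical names. Note that a variable coded 0, 1, 2 needs to be stored as a factor for R to create indicators for it; otherwise it is treated as numeric.

```r
# Hypothetical factor whose values are not in alphabetical order (toy values)
Grp <- factor(c("C", "A", "B", "A"))

# Levels are sorted by default, so the first level (the reference) is "A"
print(levels(Grp))

# The design matrix has indicator columns for every level except the reference
dummy_cols <- colnames(model.matrix(~ Grp))
print(dummy_cols)

# Categories coded 0, 1, 2: stored as a factor, the reference is "0"
Num <- factor(c(2, 0, 1, 0))
print(levels(Num))
```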
Also in a later video I'll show how to fit and interpret a regression model
that uses both categorical and numeric variables.
Thanks for watching this video and make sure to check out my other instructional videos!