Hi! I am Mike Marin and in this video
we'll talk about how to change the reference category
or baseline category for a categorical variable
in a linear regression model. In a linear regression model
the intercept or constant term refers to the
estimated mean Y-value for the reference or baseline group,
and the model coefficients or parameters refer to expected changes
in the mean Y-value relative to the reference group.
If you're not familiar with the concept of a dummy
or indicator variable, you can watch my video where I explain this concept.
We will be working with the Lung Capacity data
that was introduced earlier in this series of videos.
I have already gone ahead and imported the data into R and attached it.
In this video, we will demonstrate the use of the "relevel" command.
To access the Help menu you can type "help" and in brackets the name of the command
you'd like help for,
or you can type the command name in the help search window.
To start, let's fit a regression model. We will call this
'mod1' and we will fit a linear regression
relating Lung Capacity to both
Age and Smoking. We can ask for summary
of this model; here
we can see the model output from R as well as the
fitted model. The intercept or constant term of
1.09 is the estimated mean Lung Capacity
for the reference or baseline group, that is,
individuals with Age equal to 0 who do not smoke.
The coefficient for Age of 0.56
is the expected change in the mean Y-value for 1 unit change in X.
We associate an increase of 1 year in Age
with an increase of 0.56 in the mean Lung Capacity,
adjusting or controlling for Smoking status.
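The model fit described above can be sketched roughly as follows. The Lung Capacity data is not included here, so this sketch simulates a small stand-in data set. The data frame name "lung", the column names LungCap, Age and Smoke, and all simulated values are assumptions for illustration, so the estimates will not match the 1.09 and 0.56 quoted in the video.

```r
# Simulated stand-in for the Lung Capacity data (hypothetical values).
set.seed(1)
n <- 200
lung <- data.frame(
  Age   = sample(3:19, n, replace = TRUE),
  Smoke = factor(sample(c("no", "yes"), n, replace = TRUE))
)
lung$LungCap <- 1 + 0.55 * lung$Age - 0.6 * (lung$Smoke == "yes") + rnorm(n)

# Linear regression relating Lung Capacity to Age and Smoking.
mod1 <- lm(LungCap ~ Age + Smoke, data = lung)
summary(mod1)

# "no" is the reference level, so the smoking coefficient is labelled "Smokeyes".
names(coef(mod1))
```

The label "Smokeyes" is how R reports the indicator for the "yes" level relative to the reference level "no".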
The indicator variable for smoking equals 1
if the individual smokes and 0 otherwise. The coefficient for smoking of
-0.65 is the expected change
in the mean Lung Capacity for a smoker relative to a nonsmoker
adjusting or controlling for Age. In other words,
for a smoker we expect the mean Lung Capacity
to be 0.65 lower than for a non-smoker,
holding Age constant. But what if we want smokers
to be the reference or baseline? It's worth noting
that by default the category R chooses as the reference or baseline
is the first category that appears alphabetically,
or numerically if categories are coded using 0, 1, 2.
You'll notice that if we ask for a frequency table
of the variable smoke, the No category appears first.
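As a small illustration of the default ordering and the 0/1 indicator coding described above, the following sketch uses a tiny made-up factor. The variable name Smoke and its five values are hypothetical, not taken from the Lung Capacity data.

```r
# A tiny made-up smoking variable (hypothetical values).
Smoke <- factor(c("no", "yes", "no", "no", "yes"))

# Levels are sorted alphabetically by default; the first one is the reference.
levels(Smoke)

# The frequency table lists the categories in the same order.
table(Smoke)

# Default treatment coding: the reference level gets 0, the other level gets 1.
contrasts(Smoke)
```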
We can change the reference category to Yes
using the "relevel" command. Here we would like to store
in the variable smoke, a re-leveled
version of the variable smoke and we would like our
reference category to be the Yes category.
We can now see, if we ask for the frequency table for smoke, that
Yes appears in the first column, as the reference.
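The re-leveling step just described might look like this in code. Again, the factor Smoke and its values are made up for illustration; only the "relevel" command itself comes from the video.

```r
# A tiny made-up smoking variable (hypothetical values).
Smoke <- factor(c("no", "yes", "no", "no", "yes"))

# Store a re-leveled version of Smoke, with "yes" as the reference category.
Smoke <- relevel(Smoke, ref = "yes")

# "yes" is now the first level, and appears first in the frequency table.
levels(Smoke)
table(Smoke)
```

Only the order of the levels changes; the data values and the counts per category stay the same.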
Now let's go ahead and fit the model using this
re-leveled version of the smoke variable. We will relate Lung Capacity to age
and smoking. We can ask for summary
of this model, and here
we can see the R model output as well as the fitted model.
The intercept or constant of 0.44
is the estimated mean Lung Capacity for a smoker
of age 0. The coefficient for non-smoking of
+0.65 is the expected change
in mean Lung Capacity for a non-smoker relative to a smoker
holding age constant. If you compare this model
to the one previously fit, you'll notice nothing important has changed.
The R-squared,
the residual standard error, all these other summaries remain the same.
All we've done here is change the reference group.
This is known as re-parameterizing a model.
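To see the re-parameterization in code, here is a sketch that fits the model with each reference group and compares the two fits. It uses simulated stand-in data, since the Lung Capacity data is not included here. The names lung, LungCap, Age, Smoke, mod1 and mod2, and all simulated values, are assumptions for illustration.

```r
# Simulated stand-in for the Lung Capacity data (hypothetical values).
set.seed(1)
n <- 200
lung <- data.frame(
  Age   = sample(3:19, n, replace = TRUE),
  Smoke = factor(sample(c("no", "yes"), n, replace = TRUE))
)
lung$LungCap <- 1 + 0.55 * lung$Age - 0.6 * (lung$Smoke == "yes") + rnorm(n)

# Model with non-smokers as the reference group.
mod1 <- lm(LungCap ~ Age + Smoke, data = lung)

# Re-level so that smokers are the reference group, then refit.
lung$Smoke <- relevel(lung$Smoke, ref = "yes")
mod2 <- lm(LungCap ~ Age + Smoke, data = lung)

# The overall fit is the same under both parameterizations.
c(summary(mod1)$r.squared, summary(mod2)$r.squared)
c(summary(mod1)$sigma, summary(mod2)$sigma)

# The new intercept is the old intercept plus the old smoking coefficient,
# and the smoking coefficient flips sign.
coef(mod1)
coef(mod2)
```

This mirrors the numbers in the video, where the new intercept of 0.44 is the old intercept of 1.09 plus the old smoking coefficient of -0.65.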
Thanks for watching this video and make sure to check out my other instructional videos.