Dummy Variables Or Indicator Variables in R - R tutorial 5.4

Hi! I am Mike Marin, and in this video we'll introduce the idea of a dummy variable, or indicator variable, and its use in regression models. We can include categorical or qualitative variables, also known as factors, in a regression model using dummy or indicator variables. We'll be working with the Lung Capacity data that was introduced earlier in this series of videos. I have already gone ahead and imported the data into R and attached it. You can also notice here that I have created a categorical representation of the "Height" variable. Individuals are placed into "Height" categories, where category A is less than 50 inches, category B is 50 to 55, category C is 55 to 60, and so on, all the way up to category F, which is 70 or greater.
A categorical variable that has K levels or categories requires K-1 dummy or indicator variables to represent it. For example, the variable Smoke has two levels, No and Yes, so it requires 1 indicator variable to represent Smoking status. We can create a dummy or indicator variable, we'll call it Xsmoke, and set it equal to 1 if the individual smokes (Smoke is Yes) and 0 otherwise. Then for a non-Smoker the Xsmoke indicator will equal 0. It's worth noting that we could instead create an indicator for non-Smoking rather than an indicator for Smoking.
Now, let's go over the same idea, but this time using the categorical Height variable. You'll notice that categorical Height has 6 levels, so we'll need 5 dummies or indicators to represent it. We can create an indicator or dummy, we'll call it XB, and set it equal to 1 if the individual is in Height category B and 0 otherwise. We can create another dummy or indicator, we'll call this XC, and set it equal to 1 if the individual is in Height category C and 0 otherwise. We can likewise create indicators XD, XE and XF for categories D, E and F. You can notice that Height category A is serving as the reference or baseline group.
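As a rough sketch of the indicators just described, here is one way they could be built by hand in R. The data values are made up for illustration and are not the Lung Capacity data. The name CatHeight and the choice of which category a boundary value (such as exactly 50 inches) falls into are assumptions, since the video does not state them.

```r
# Hypothetical toy values, NOT the real Lung Capacity data
Smoke  <- factor(c("no", "yes", "no", "yes", "no"))
Height <- c(48.2, 52.0, 57.5, 63.1, 71.4)   # inches

# Categorical Height with 6 levels, A to F.
# Assumption: intervals are closed on the left, e.g. B is [50, 55).
# The upper limit of 100 is an arbitrary cap for category F.
CatHeight <- cut(Height,
                 breaks = c(0, 50, 55, 60, 65, 70, 100),
                 labels = c("A", "B", "C", "D", "E", "F"),
                 right  = FALSE)

# Two levels -> 1 indicator: 1 for a smoker, 0 otherwise
Xsmoke <- as.numeric(Smoke == "yes")

# Six levels -> 5 indicators; category A is the reference (all zeros)
XB <- as.numeric(CatHeight == "B")
XC <- as.numeric(CatHeight == "C")
XD <- as.numeric(CatHeight == "D")
XE <- as.numeric(CatHeight == "E")
XF <- as.numeric(CatHeight == "F")

indicators <- cbind(XB, XC, XD, XE, XF)
data.frame(Height, CatHeight, indicators)
```

Each row of the printed table has at most one indicator equal to 1, and the row in category A has all five equal to 0.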
An individual in Height category A will have XB equal to 0, XC equal to 0, and XD, XE and XF all equal to 0. An individual in Height category B will have XB equal to 1 and XC, XD, XE and XF all equal to 0. An individual in Height category C will have XC equal to 1 and all other X indicators equal to 0, and the same goes for category D, category E and category F. These 5 dummy or indicator variables allow us to identify which of the 6 Height categories an individual falls into.
In a moment we will look at the use of these in a regression model, but first let's take a look at the mean Lung Capacity for each of the groups formed by categorical Height. I've already written some code to do this, so I'll go ahead and submit that up here. Here, we are asking R to calculate the mean Lung Capacity only for those in Height category A, then the mean Lung Capacity only for those in Height category B, and so on. Now, let's keep an eye on those means so that we can compare them with the regression model that we're gonna fit.
Let's go ahead and fit a linear regression model. We will relate Lung Capacity to this categorical Height variable. We can ask for a summary of this model, and here we can see the R output as well as the fitted model. The intercept or constant term b-naught (b0) of 2.15 is the estimated mean Y value when all X's equal 0, that is, for our reference or baseline group. In this particular model it is the mean Lung Capacity for someone in Height category A.
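The steps just described could look roughly like the sketch below. The code shown in the video is not part of this transcript, so the variable names LungCap and CatHeight are assumptions, and the six observations are made-up toy values rather than the Lung Capacity data.

```r
# Hypothetical toy data, NOT the real Lung Capacity data
LungCap   <- c(2.0, 2.4, 3.5, 3.9, 5.2, 5.6)
CatHeight <- factor(c("A", "A", "B", "B", "C", "C"))

# Mean Lung Capacity for one Height category at a time
mean(LungCap[CatHeight == "A"])
mean(LungCap[CatHeight == "B"])
mean(LungCap[CatHeight == "C"])

# The same group means in a single call
group_means <- tapply(LungCap, CatHeight, mean)

# Regress Lung Capacity on the factor; R creates the indicators itself
mod <- lm(LungCap ~ CatHeight)
summary(mod)

# The intercept is the fitted mean for the reference group (category A)
coefs <- coef(mod)
coefs["(Intercept)"]
group_means["A"]
```

With a single categorical predictor, the intercept matches the category A mean, which is the comparison made in the video.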
The coefficient for category B of 1.51 is the change in mean Lung Capacity we would expect for someone in category B relative to category A. For someone in category B, the estimated mean Lung Capacity is 2.15 plus 1.51 times 1 (the XB indicator equals 1 because the individual is in category B), plus 3.25 times 0 (the XC indicator equals 0 because this individual is not in category C), and so on, with all other X indicators equal to 0. The mean Lung Capacity for someone in category B is therefore 2.15 plus 1.51, which equals 3.66. This matches the mean Lung Capacity we calculated for category B; the slight difference you see is due to rounding error.
The coefficient for category C of 3.25 is the change in mean Lung Capacity we would expect for someone in category C relative to someone in category A. For someone in category C, the XC indicator will equal 1 and all other indicators will equal 0. In this case their estimated mean Lung Capacity will be 2.15 plus 3.25, which equals 5.4. You can repeat this process to calculate the mean Lung Capacity for all other Height categories, and if you do this you'll see the mean for category D is 7.17, the mean for category E is 8.69, and the mean for category F is 10.8.
Using dummy or indicator variables is how we can include categorical or qualitative variables in a regression model. When you include a categorical variable in a regression model, R will create the dummy or indicator variables automatically. The category that R chooses as the reference or baseline category will be the one that comes first alphabetically, or numerically if the categories are coded using 0, 1, 2 and so on. In a separate video I'll show how you can change which category serves as the reference. Also, in a later video I'll show how to fit and interpret a regression model that uses both categorical and numeric variables. Thanks for watching this video, and make sure to check out my other instructional videos!
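As a recap of the arithmetic above, here is a small sketch that recomputes the category B and C means from the rounded coefficients quoted in the video, then uses model.matrix() to display the indicator columns R builds for a factor. The factor CatHeight below is a made-up example and its name is an assumption.

```r
# Rounded coefficients quoted in the video
b0 <- 2.15   # intercept: mean for reference category A
bB <- 1.51   # category B relative to A
bC <- 3.25   # category C relative to A

# Estimated means: intercept plus coefficient times indicator
mean_B <- b0 + bB * 1 + bC * 0
mean_C <- b0 + bB * 0 + bC * 1

# Hypothetical factor with the six Height categories
CatHeight <- factor(c("A", "B", "C", "A", "F", "D", "E"))

# The first level is the reference under R's default treatment coding
levels(CatHeight)[1]

# Design matrix: an intercept column plus 5 indicator columns (B to F)
X <- model.matrix(~ CatHeight)
colnames(X)
X
```

Rows for category A have zeros in all five indicator columns, so only the intercept contributes to their fitted mean.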