Let's look at a few cautions regarding simple linear regression.
Here are plots of four separate and distinct data sets.
These data sets were created and published by a statistician
named Francis Anscombe back in the 70s,
and they've come to be known as Anscombe's quartet.
What's so interesting about these four data sets?
Well, it might be a little bit more obvious
if I fit the least squares regression line in each case.
The interesting thing is that these four data sets all have the same summary statistics.
They have the same mean of X, the same variance of X,
the same mean of Y, the same variance of Y, the same correlation,
the same slope of the line, and the same intercept of the line.
So the summary statistics for all of these different data sets are the same.
But the overall picture is quite different.
And the lesson here is simply that summary statistics don't tell the whole story.
If we were to just look at the computer output for these four data sets
for a simple linear regression, everything would look exactly the same.
But the reality of the situation is things are very very different.
So the lesson is always plot your data and have a look.
And it's a good thing to do right away before you do any analysis.
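That claim about identical summary statistics is easy to check directly. Here's a sketch in Python using the published values of Anscombe's quartet — the data below are the standard published figures:

```python
# Verify that Anscombe's four data sets share the same summary statistics.
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def summary(x, y):
    """Mean/variance of each variable plus least squares slope and intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return (round(mx, 2), round(variance(x), 2),
            round(my, 2), round(variance(y), 2),
            round(slope, 2), round(my - slope * mx, 2))

for x, y in quartet:
    # Each row prints (mean x, var x, mean y, var y, slope, intercept);
    # all four rows are essentially identical: 9.0, 11.0, 7.5, ~4.13, 0.5, 3.0.
    print(summary(x, y))
```

The printed rows match to two decimal places, even though scatterplots of the four data sets look nothing alike.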
In this next example, I've plotted fuel consumption versus speed for cars and light-duty trucks.
And the relationship looks pretty strong.
And we could fit a regression line through these points.
We might want to use that regression line for prediction.
We might use the regression line to predict fuel consumption
for a speed of 100 kilometers per hour
or for a speed of 110 kilometers per hour, or something like that.
And the regression line might work very well in that spot.
And that relationship looks so strong
that it might be tempting to use that relationship outside of the range of the observed data.
We might go out here, say, and try and use this relationship
to predict fuel consumption for a speed of 80,
or something like that, something way out here in the left tail.
It might be tempting to do that.
But the problem is we do not know what the relationship is out here.
The real relationship might be very different from the one we observe in our data.
And actually the full data set does contain a lot of other information
so let's look at the complete data set here.
Over here are the values we were looking at on the last page.
And I've left that blue line the same.
But we can see that the real relationship doesn't continue on down here;
the relationship changes a great deal.
It's a very very different type of relationship
out to the left of what we previously observed.
If we only had access to our original data that we looked at here,
we might be tempted to think that that same relationship would continue on
outside the range of our observed data.
And maybe it does, and maybe it doesn't, but we don't really have any evidence of that.
And when we go beyond our observed data, out here, or out here,
the real relationship might be changing dramatically and be very
different from the one we observe.
Using the regression line for prediction and estimation beyond the range of the
observed data is called extrapolation, and it really should be avoided.
So the lesson here is don't extrapolate,
because we don't have any real evidence of what the true relationship is
beyond the range of our observed data.
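A small sketch can show how badly this can go wrong. The "true" fuel-consumption curve below is entirely made up (a U-shaped curve with a minimum around 70 km/h, which is roughly the shape such curves take); we fit a line only to the high-speed range, where the curve looks nearly linear, and then extrapolate down to 80 km/h:

```python
# Hypothetical illustration of extrapolation going wrong.
from statistics import mean

def true_consumption(speed):
    # Invented U-shaped curve: L/100 km as a function of speed in km/h,
    # minimized around 70 km/h. Not real data.
    return 5.0 + 0.002 * (speed - 70) ** 2

# "Observe" only speeds from 100 to 140 km/h, where the curve is nearly linear.
observed_x = list(range(100, 145, 5))
observed_y = [true_consumption(s) for s in observed_x]

# Least squares fit on the observed range only.
mx, my = mean(observed_x), mean(observed_y)
slope = (sum((x - mx) * (y - my) for x, y in zip(observed_x, observed_y))
         / sum((x - mx) ** 2 for x in observed_x))
intercept = my - slope * mx

predicted_80 = intercept + slope * 80   # extrapolating below the data
actual_80 = true_consumption(80)

print(f"line predicts {predicted_80:.2f}, truth is {actual_80:.2f}")
# → line predicts 2.33, truth is 5.20
```

The fitted line describes the observed range well, yet at 80 km/h it predicts less than half the true value, because the relationship changes shape outside the data.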
In this next example I've plotted life expectancy for men
versus the number of households with a colour TV per 100 households
for a sample of 20 countries.
There does appear to be some sort of increasing trend here.
It certainly doesn't look perfectly linear, and a linear model might not be the
best way of modeling this type of data,
but if we were to force a line through those points,
we would see that the correlation coefficient is approximately 0.76,
and the square of the correlation, the coefficient of determination, is approximately 0.57.
And that's telling us that 57 percent of the variability in life expectancy in the sample
can be attributed to the linear relationship
with households with a colour TV.
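The arithmetic connecting those two numbers is simply that, for simple linear regression, the coefficient of determination is the square of the correlation coefficient. That's easy to verify on any small data set — the numbers below are made up for illustration, not the TV / life-expectancy figures:

```python
# Check that R-squared from a least squares fit equals the squared correlation.
from statistics import mean

# Made-up illustrative data.
x = [10, 25, 40, 55, 70, 85]
y = [58, 62, 61, 68, 71, 70]

mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sxy / (sxx * syy) ** 0.5               # correlation coefficient

# Fit the least squares line and compute R-squared from the residuals.
slope = sxy / sxx
intercept = my - slope * mx
fitted = [intercept + slope * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
r_squared = 1 - ss_res / syy               # coefficient of determination

print(round(r ** 2, 4) == round(r_squared, 4))  # → True
```

So a correlation of about 0.76 squares to the roughly 0.57 coefficient of determination quoted above.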
Does that mean we can increase life expectancy by shipping countries lots of televisions?
Well that's unlikely at best.
What's happening here is that there's an underlying variable,
the wealth of a country, which is strongly tied to
both the number of households with a colour TV and the life expectancy.
So it's unlikely that there is a strong direct causal link
between the number of televisions and the life expectancy for men.
Which leads us to the oft-repeated line that correlation does not imply causation.
Even if we find strong evidence of a relationship between
our explanatory variable and our response variable,
that doesn't necessarily mean that changes in the explanatory variable
cause changes in the response.
We've simply shown that there's strong evidence of a relationship.
Causation is a little bit harder to show.
One way we have of showing it is through well-designed experiments.
But we don't always have that luxury, so sometimes it's simply important to realize
that correlation does not imply causation.