Working with SPSS - Bivariate (or Simple) Regression

Hi, my name is Bart Poulson, and in this short tutorial I'm going to show how to calculate a bivariate, or simple, regression for two quantitative variables in SPSS, also known as PASW, for "Predictive Analytics Software." I'm using version 17, but it's basically identical to every previous version that I've used. Also, I'm doing this on my Mac, but it's pretty much identical on a Windows PC. So this will work for basically any version of SPSS on any platform.
The first thing I need to do is open up a data set. I'm going to use one that's already in SPSS so you can follow along if you want. I come up here to the corner and click on the file folder to open a data document, and it brings up a list of data sets that are already in SPSS. Now, you have to dig around a little bit to find these, but you'll get there. I'm using one that's at the very end, and it's called "world95.sav." The "sav" is the suffix for an SPSS data file. I'm going to open that up now. Just double-click. It does this little "whoosh" thing. All right, this is a data set with information for 109 different countries, from Afghanistan down to Zambia: population, density, predominant religion, infant mortality, daily calories, and so on.
I'm going to look at two variables that I also use in a demonstration on scatterplots: literacy rate, that's this one right here, the percent of the population who can read, and female life expectancy. The idea here is that countries that have a higher literacy rate are also typically countries that have higher female life expectancy. Now, I've already done a scatterplot in another tutorial, so I'm going to run through it very quickly; you can see the other tutorial if you want. It's always a good idea to do a scatterplot, or to do the graphics, before you do any statistical procedure. Let's see here: people who can read on the x and life expectancy on the y.
Now, SPSS has one window that has the data, and it looks like a spreadsheet, but it has a separate window that has the output. That's a little different from a regular spreadsheet like Excel. Actually, this is advantageous in a lot of ways, but it is a little different. There's also a third window called a syntax window, which is like a written program that you can use for these procedures. I use that as well; I'll show it in another tutorial.
Anyhow, here's the scatterplot. You've got people who can read, the literacy rate, across the bottom, and female life expectancy up the side. You can see it's a strong uphill pattern. We've got a bunch of countries right here with high literacy rates and also high average life expectancy for women, although life expectancy goes down to the 40s for several countries, and we have literacy rates that are below 20 in some countries. One thing I want to do is run a regression line through this. That's always a good idea, because that's what I'm going to be doing numerically. So I double-click on the chart to edit it, and I just click this thing right here to add a regression line. And I'll close that again. Again, I have a separate tutorial that shows how to do all the graphics.
But you see, it's a strong uphill line, and this thing right here gives what's called the R², which is a measure of how closely the data fit the line. In this case, an R² of .749, or about .75, is very, very high. It means that if you know the percentage of people in the country who can read, you can account for about 75% of the variance in female life expectancy, which is really high. But what I'm going to do right now is a procedure, bivariate regression, which gives us the slope and the intercept for this straight regression line that we ran through it. To do that, I come up here to "Analyze," down to "Regression," and there are a lot of different forms of regression.
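For readers who want to see what the R² on the chart measures, here is a minimal sketch in Python rather than SPSS. The six data points are invented for illustration and are not values from world95.sav, and the function names `fit_line` and `r_squared` are hypothetical helpers, not anything SPSS provides. It fits a least-squares line and computes R² as one minus the ratio of leftover (residual) variation to total variation in the outcome.

```python
# Hypothetical (literacy %, female life expectancy) pairs, invented for
# illustration only. These are NOT values from world95.sav.
literacy = [20, 35, 50, 65, 80, 95]
life_exp = [47, 52, 60, 63, 72, 77]


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y):
    """Proportion of variance in y accounted for by the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_residual / ss_total


intercept, slope = fit_line(literacy, life_exp)
r2 = r_squared(literacy, life_exp)
print(f"intercept = {intercept:.2f}, slope = {slope:.3f}, R squared = {r2:.3f}")
```

An R² near 1 means the points sit close to the line; an R² near 0 means the line explains almost none of the spread in the outcome.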
I'm going to do the simplest, which is just "Linear," which means a straight-line regression through the data. So I'm going to select that procedure, and the first thing I need to do is specify the dependent variable. Truthfully, this should be called the outcome variable, because this is a correlational study and not an experiment, but what I want to look at is average female life expectancy. That's my outcome, or dependent, variable, and I'm going to use the percentage of people who read, or the literacy rate, as the independent, or predictor, variable. So I've got those two in there; I press the arrow to get them in. And I'm leaving everything else the way that it is, because this is the simplest possible regression. Click "OK."
All right, we have a bunch of statistics that come out. This just says that I ran a regression command. This says which data set I used. This says that the dependent variable is life expectancy and that I used only one predictor variable, which is the percentage of people who can read. These two only matter when you have several predictor variables. We'll do that later.
This one right here, "Model Summary," says how well this particular regression predicts the outcome variable. It has a correlation, an R. It's a capital R because it could be what's called a multiple correlation, and here it's .865, and this tells you what the predictors are. If you square that number to get the proportion of variance accounted for in the outcome, you get just about 75%, which is enormous. Now, there's something called an adjusted R², which takes into consideration the number of observations, or the sample size, in your data as well as the number of predictor variables. It can be pretty different, but right here it's really close. Then there's the standard error, which goes into calculating the probability values for a null hypothesis test, but we'll ignore that for right now.
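The relationship between R, R², and adjusted R² in the Model Summary can be sketched in a few lines of Python. The R of .865 and the single predictor come from the tutorial; the sample size is an assumption, since the tutorial only says the file has 109 countries and does not say how many had values on both variables. The function name `adjusted_r_squared` is a hypothetical helper, and the formula used is the standard adjustment, 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
def adjusted_r_squared(r, n, p):
    """Standard adjusted R squared for correlation r, n cases, p predictors."""
    r2 = r ** 2
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r = 0.865  # multiple correlation R as read aloud in the tutorial
n = 109    # ASSUMPTION: every one of the 109 countries was used in the analysis
p = 1      # one predictor (literacy rate)

r2 = r ** 2
adj_r2 = adjusted_r_squared(r, n, p)
print(f"R squared = {r2:.3f}, adjusted R squared = {adj_r2:.3f}")
```

With one predictor and around a hundred cases the adjustment is tiny, which is consistent with the remark that the two values are really close here. The gap grows as predictors are added or the sample shrinks.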
This one right here, the analysis of variance, or ANOVA, table, is just an indication of how well the model, that is, the slope-and-intercept model, fits the data. And the regression fits it really well. You can see this F value, that's the inferential test, at 313, which is just gargantuan, and the significance value from the null hypothesis test is less than .001. I mean, it's probably less than one in a million.
But anyhow, the one that we really want to look at is this bottom one that says "Coefficients." There are two things we want to look at. This first one, the constant, is the intercept for the regression line, and it's this value right here: 38.5. One way to interpret this, which works for some data sets but not for others, is as the predicted outcome if the predictor variable were zero. That is, if we had a country that for some reason had a zero literacy rate, women would be expected to have an average life expectancy of 38½ years. Now, that's not an actual value in the data set, so that one's not very helpful. This one here is the standard error, which goes into the inferential test, the t-test, where we get a value of 20.67 and a significance value, a p-value, displayed as .000, that is, less than .001. Really, that just tells us the intercept is reliably different from zero.
Probably the most important one is this one right here, the second one, the people who read percent. Its regression coefficient here is the slope of the line, and it's .403. What this means is that for every percentage point increase in literacy, as you go, for instance, from 71 to 72 percent literacy, you can expect a four-tenths of a year increase in women's life expectancy, which is pretty huge considering that literacy can go all the way up to 100%. And you'll recall that we had about a 40-year range in women's life expectancy, so that fits pretty well.
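To make the intercept and slope concrete, here is a minimal Python sketch that plugs the two coefficients read aloud above (38.5 and .403, which are rounded) into the equation of a straight line. The function name `predict_life_expectancy` is hypothetical and used only for illustration; this is not SPSS output.

```python
INTERCEPT = 38.5  # the "constant" from the Coefficients table, as read aloud
SLOPE = 0.403     # coefficient for the literacy predictor, as read aloud


def predict_life_expectancy(literacy_percent):
    """Predicted female life expectancy (years) for a given literacy rate (%)."""
    return INTERCEPT + SLOPE * literacy_percent


# The intercept is the prediction at 0% literacy.
at_zero = predict_life_expectancy(0)

# Going from 71% to 72% literacy changes the prediction by exactly the slope.
one_point_gain = predict_life_expectancy(72) - predict_life_expectancy(71)

# Going from 0% to 100% literacy spans 100 times the slope.
full_range_gain = predict_life_expectancy(100) - predict_life_expectancy(0)

print(f"predicted at 0%: {at_zero:.1f} years")
print(f"gain per percentage point: {one_point_gain:.3f} years")
print(f"gain from 0% to 100%: {full_range_gain:.1f} years")
```

The gain across the full literacy range works out to 100 times the slope, about 40 years, which lines up with the roughly 40-year spread in life expectancy mentioned above.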
This right here is the inferential test, the t-test. 17.699 is a very large value, and the fact that the probability level is displayed as .000, that is, less than .001, lets us know that the slope is reliably different from 0. That's one way of interpreting it. The standardized coefficient says that if we turned all the variables into z-scores, which have a mean of 0 and a standard deviation of 1, this would be the slope. The intercept becomes 0 because everything got standardized, and the slope is .865, which, please notice, is the same number as right here. When you have only one predictor variable and one outcome, the standardized regression slope is the same thing as the correlation coefficient.
Anyhow, that is how you do a regression, a bivariate regression, in SPSS. And we'll try looking at some other statistics in other tutorials. Thank you.
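The point about the standardized coefficient can be checked with a short Python sketch. The data below are invented for illustration and are not from world95.sav, and the helper names (`z_scores`, `fit_line`, `pearson_r`) are hypothetical. The sketch converts both variables to z-scores, fits a least-squares line to the z-scores, and compares the resulting slope with the ordinary correlation coefficient.

```python
from math import sqrt

# Hypothetical (literacy %, female life expectancy) pairs, invented for
# illustration only. These are NOT values from world95.sav.
literacy = [18, 30, 45, 52, 70, 88, 99]
life_exp = [44, 50, 49, 61, 66, 75, 80]


def z_scores(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


std_intercept, std_slope = fit_line(z_scores(literacy), z_scores(life_exp))
r = pearson_r(literacy, life_exp)
print(f"standardized intercept = {std_intercept:.6f}")
print(f"standardized slope     = {std_slope:.6f}")
print(f"correlation r          = {r:.6f}")
```

With a single predictor the standardized slope and r agree up to floating-point rounding, and the standardized intercept is zero. With several predictors the standardized coefficients no longer equal the individual correlations in general.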