Hi, my name is Bart Poulson, and in this short tutorial I'm gonna show how to calculate
a bivariate, or simple, regression for two quantitative variables in SPSS, also known as PASW, for
"Predictive Analytics Software." I'm using version 17, but it's basically identical in
every previous version that I've used. Also, I'm doing this on my Mac, but it is pretty
much identical on a Windows PC. So this will work for basically any version of SPSS
on any platform. The first thing I need to do is open up a data set. I'm gonna
use one that's already in SPSS so you can follow along if you want. I come up here to
the corner and click on the file folder to open a data document. And it brings up a list
of data sets that are already in SPSS. Now you have to dig around a little bit to find
these, but you'll get there. I'm using one that's at the very end and it's called "world95.sav."
The "sav" is the suffix for an SPSS data file. And I'm gonna open that up now. Just double
click. It does this little "whoosh" thing. Alright, this is a data set with information
for 109 different countries, from Afghanistan down to Zambia: population, density, predominant
religion, infant mortality, daily calories, and so on. I'm gonna look at two variables that
I also use in a demonstration on scatterplots: literacy rate, that's this one right
here, the percent of the population who can read, and female life expectancy.
The idea here is that countries that have a higher literacy rate are also typically
countries that have higher female life expectancy. Now, I've already done a scatterplot
in another tutorial, so I'm gonna run through it very quickly here; you can see the other
tutorial if you want. It is always a good idea to do a scatterplot, or to do the graphics,
before you do any statistical procedure.
Let's see here, people who can read on the x and life expectancy on the y. Now SPSS
has one window that has the data, and it looks like a spreadsheet, but it has a separate
window that has the output. That's a little different from a regular spreadsheet like
Excel. Actually, this is advantageous in a lot of ways, but it is a little different.
There's also a third window, called a syntax window, which is like a written program
that you can use for these procedures. I use that as well; I'll show it in another
one. Anyhow, here's the scatterplot. You've got people who can read, the literacy rate
across the bottom. It has female life expectancy up the side. You can see it's a strong uphill
pattern. We've got a bunch of countries right here with high literacy rates and also
with high average life expectancy for women, although life expectancy goes down to
the 40s for several countries and we have literacy rates that are below 20 in some countries.
One thing I wanna do is run a regression line through this. Always a good
idea, because that's what I'm gonna be doing numerically. So I double click on it to edit
it and I just click this thing right here to run a regression line. And I'll close that
again. Again, I have a separate tutorial that shows how to do all the graphics.
But, you see, it's a strong uphill line, and this thing right here gives what's called
the R², which is a measure of how closely the data fit
the line. And in this case, an R² of .749, or about .75, is very, very high. It means
that if you know the percentage of people in the country who can read, you can
account for about 75% of the variance in female life expectancy, which is really high. But what
I'm gonna do right now is a procedure, bivariate regression, which gives us the slope
and the intercept for this straight regression line that we ran through it. To do that, I
come up here to "Analyze," down to "Regression," and there's a lot of different forms of regression.
I'm gonna do the simplest, which is just "Linear," which means a straight-line regression
through the data. So I'm gonna select that procedure, and the first thing I need to do is I need
to specify the dependent variable. Truthfully, this should be called the outcome variable
because this is a correlational study and not an experiment, but what I wanna look at
is average female life expectancy. That's my outcome, or dependent, variable, and I'm
going to use the percentage of people who read, or the literacy rate, as the independent
or predictor variable. So I've got those two in there; I press the arrow to get them in.
And I'm just leaving everything else the same way that it is because this is the simplest
possible regression. Click "OK." Alright and we have a bunch of statistics that come out.
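
For anyone who wants to see roughly what the "Linear" procedure is working out behind the dialog box, here is a minimal sketch in Python of a least-squares slope and intercept for one predictor. The numbers are made up for illustration and are not taken from world95.sav, and the name `fit_line` is just a hypothetical helper, not anything from SPSS.

```python
# Made-up illustrative data, NOT the world95.sav values.
literacy = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]   # predictor (x), percent who can read
life_exp = [46.0, 53.0, 58.0, 66.0, 70.0, 78.0]   # outcome (y), years

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(literacy, life_exp)
print(f"intercept = {intercept:.2f}, slope = {slope:.3f}")
```

The intercept and slope printed here play the same role as the two rows of the "Coefficients" table in the SPSS output, just for these toy numbers.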
This just says that I ran a regression command. This says which data set I used. This says
that the dependent variable is life expectancy, that I used only one predictor variable which
is the percentage of people who can read. These two only matter when you have
several predictor variables. We'll do that later. This one right here, "Model Summary,"
says how well this particular regression predicts the outcome variable. It has a correlation,
an R. It's a capital R because it could be what's called a multiple correlation, but
it's .865, and that tells you what the predictors are. If you square that number to get the
proportion of variance accounted for in the outcome, you get just about 75%, which is enormous.
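
The squaring step is easy to check by hand. This small Python sketch uses only the R value quoted above; because that R is already rounded to three decimals, the result can differ from the R² SPSS prints in the last digit.

```python
r = 0.865            # R as read from the Model Summary table (rounded)
r_squared = r ** 2   # proportion of variance in the outcome accounted for
print(f"R squared = {r_squared:.3f}")
```

That comes out at about .748, which matches the .749 on the scatterplot to within rounding.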
Now there's something called an adjusted R², which takes into consideration the number
of observations, or the sample size in your data, as well as the number of predictor
variables. It can be pretty different, but right here it's real close. The standard
error goes into calculating the probability values for a null hypothesis test,
but we'll ignore that for right now. This one right here, the analysis of variance,
or ANOVA, table, is just an indication of how well the model, that is the slope and
intercept model, fits the data. And the regression fits it really super well. You can see this
F value, that's the inferential test, at 313, which is just gargantuan, and the significance
value from the null hypothesis test is less than .001. I mean, it's probably less
than one in a million. But anyhow, the one that we really want to look at is this bottom
one that says "Coefficients." There's two things we wanna look at. This first one, right
here, the constant, is the intercept for the regression line. And the intercept is this
one right here. It's 38.5. One way to interpret this, which works for some data sets
but not for others, is as the predicted outcome if the predictor variable were zero. That is,
if we had a country that for some reason had a zero literacy rate, women there would be
expected to have an average life expectancy of 38½ years. Now, a literacy rate of zero is not
an actual value in the data set, so that one's not very helpful. This one here is the standard
error, which goes into the inferential test, called the t-test,
where we get a value of 20.67 and a p-value that shows as .000, so less than .001.
Really, that just tells us the intercept is reliably different from zero. Probably the
most important one is this one right here, the second one, the people who read percent,
and its regression coefficient here. This is the slope of the line, and it's .403. So,
what this means is that for every percentage point increase in literacy, as you go, for
instance, from 71 to 72 percent literacy, you can expect a four-tenths of a year increase
in women's life expectancy, which is pretty huge considering literacy can go
up to 100%, and you recall that we had about a 40-year range in women's life
expectancy, so that fits it pretty well. This right here is the inferential test,
the t-test. 17.699 is a very large value, and the fact that the probability
level shows as .000, so less than .001, lets us know that the slope is reliably different from 0. That's
one way of interpreting it. The standardized coefficient says that if we turned all the variables
into z-scores, which have a mean of 0 and a standard
deviation of 1, this would be the slope. The intercept becomes 0 because
it got standardized. And the standardized slope is .865, which
please notice is the same number as the R up here. When you have only one predictor variable
and one outcome, the standardized regression slope is the same thing as the correlation
coefficient. Anyhow, that is how you do a regression, bivariate regression, in SPSS.
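
To make the intercept and slope from the "Coefficients" table concrete, here is a short Python sketch that plugs the values quoted above (38.5 and .403) into the regression line. The function name `predict_life_expectancy` is hypothetical, and the numbers are the rounded ones read off the output, so treat the results as approximate. The last lines also check a known property of regression with a single predictor: the square of the slope's t value equals the ANOVA F value.

```python
INTERCEPT = 38.5   # "(Constant)" row, as quoted in the tutorial (rounded)
SLOPE = 0.403      # literacy coefficient, as quoted in the tutorial (rounded)

def predict_life_expectancy(literacy_pct):
    """Predicted average female life expectancy (years) for a literacy rate in percent."""
    return INTERCEPT + SLOPE * literacy_pct

for pct in (0, 71, 72, 100):
    print(f"literacy {pct:3d}% -> predicted life expectancy {predict_life_expectancy(pct):.2f}")

# With one predictor, t squared for the slope equals the ANOVA F statistic.
t_slope = 17.699
print(f"t squared = {t_slope ** 2:.1f}")
```

Going from 71 to 72 percent moves the prediction up by .403 years, and the squared t value lands right at the F of about 313 from the ANOVA table.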
And we'll try looking at some other statistics in other tutorials. Thank you.
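
As a footnote to the point about standardized coefficients, this Python sketch converts both variables to z-scores, refits the line, and compares the slope with the correlation. The data are made up for illustration and are not from world95.sav, and the helper names are hypothetical.

```python
import math

# Made-up illustrative data, NOT the world95.sav values.
literacy = [20.0, 35.0, 50.0, 65.0, 80.0, 95.0]
life_exp = [46.0, 53.0, 58.0, 66.0, 70.0, 78.0]

def mean(values):
    return sum(values) / len(values)

def z_scores(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

std_intercept, std_slope = fit_line(z_scores(literacy), z_scores(life_exp))
r = pearson_r(literacy, life_exp)
print(f"standardized slope = {std_slope:.3f}, r = {r:.3f}")
```

The two printed numbers agree, and the standardized intercept is zero apart from floating-point noise, which is the same pattern as the .865 showing up in both tables of the SPSS output.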