Hi this is Bart Poulson and I am going to demonstrate how to make a boxplot, uh, primarily
to identify outlying scores in the program StatCrunch. The first thing you need to do
is get to the website, statcrunch.com, and sign in. And then, I'm going to use a data
set that I've used in previous examples that already exists on StatCrunch. And because
it's a StatCrunch one, I'm gonna come up here to "Explore." Um, by the way, you can get
to it by either pressing "Explore" and then clicking "Data," but you can also go to this
drop down menu right here and "Explore," "Data," that way also. And this gives me a, uh, window
where I can browse. I'm gonna type in—Um, I'm gonna use the, uh, data set "surveyf08.xls."
Because I've used it already, it pops up the first time, the first word. But I press "Search,"
and it brings up this data set with the barber. Um, in the last one I made a histogram of
the hours studied per week. That's this variable right here. I'm gonna do that again, except
this time I'm gonna use a boxplot to identify potential outliers. That's important because
in the, uh, uh, histogram that I made, it looks like there might be some outliers on
the high end here—people who spend, uh, 20 hours or more, uh, studying per week.
Ok, the way I do this is I go to "Graphics," down to "Boxplot." From there, it brings up
a dialogue box and I can click on my variable. I'm gonna use "Study." I can use this if I
wanna restrict it to, for instance, non-outlying scores or by some other variable. This one
I can break it down by men and women or, uh, whatever. I'm not gonna do either one of those.
I'm just gonna go to "Next." Now this is one which strangely enough should be mandatory.
This should be the default, and that is "Use fences to identify outliers." And that's the
whole reason that I use boxplots. So make absolutely sure you check that one. Now by
default, boxplots go up and down. Um, however, I like to do them side to side because then
it puts the score scale, the hours of studying, in the same direction that it is on a histogram.
Makes it easier to compare the two. So I'm gonna do it horizontally, but you don't need
to. Then I press "Next" and I get to "Titles." And this is something that needs to always
be done. The title on the top, I'm going to put, um, [typing] "Hours of Studying," studying,
"per Week for Students in StatCrunch Survey." I'm also going to put an x label, because
otherwise it would just put "Study" on the, uh, bottom of the chart and then you could
put [typing] "Hours of Studying per Week." And I think that's it. If I wanted more than
one boxplot, sometimes I do, I could use this to put them, uh, but I'm just going to press
"Create Graph" now. And there is my boxplot. The blue box here represents the middle
50% of the scores, so the middle 50% of the scores go from about 3 hours of studying
per week up to about 9 hours per week. The line in the middle is the 50th percentile
or the median or the second quartile. Those all mean the same thing. And it says half
of the people in the survey do about 6 hours of studying per week or less. The other half
do, uh, more than that. These lines right here are sometimes called "whiskers," so
it can be called a box-and-whiskers plot. This one represents the lowest. It looks like
we have somebody, at least one person, who spends no time studying per week. And then
this one up here, these are the high scorers, but they are identified as outliers and the
whisker here goes to the highest non-outlying score which is about 16 or 17 hours per week.
These two at 20 and 21 or 22 hours are outliers. And the way it determines outliers is it goes
to this box, the blue box. From the bottom to the top of it is called the interquartile
range because it goes from the first quartile on the low end, to the third quartile up here
on the high end. And the difference between those two, which in this case is about 6
hours, you multiply that times 1½, which is about 9 hours, and you add that on to the
top end, which gets you to about 18. And that's where the cutoff is. And anything beyond
there would be an outlier. Now the nice thing about that is there may be situations in which
you want to ignore outliers and there are ways to restrict them so they're not included
in analyses if you're so inclined. You can also do what's called a transformation of
the data where you take, for instance, the logarithm, and that will bring them in. We
can deal with that another time. But for right now, this is the boxplot. It identifies 2
outliers and it can be helpful for us to, uh, keep those in mind in future analyses.
Because things like means, and standard deviations, and correlation, and regression can get distorted
by the presence of outliers. So the last thing I'm gonna do is I'm gonna export this to save
it. And I type in, [typing] "Boxplot of Hours Studying per Week." And I'll just hit "Export."
And now it's saved.
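
The fence rule described in the video (1½ times the interquartile range, measured out from the edges of the box) can be sketched in a few lines of Python. This is only an illustration, not StatCrunch's own code: the function name `iqr_fences` is made up for this example, the quartiles of 3 and 9 are the approximate values read off the plot, and the list of scores is a hypothetical handful of values chosen to match the ones mentioned, not the actual survey data.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs: k * IQR beyond the first and third quartiles."""
    iqr = q3 - q1                      # interquartile range: width of the box
    return q1 - k * iqr, q3 + k * iqr  # anything outside these cutoffs is flagged


# Approximate quartiles read off the boxplot in the video (assumed values).
q1, q3 = 3, 9
lower, upper = iqr_fences(q1, q3)

# Hypothetical scores for illustration only, not the real survey data.
scores = [0, 3, 6, 9, 16, 20, 21]
outliers = [s for s in scores if s < lower or s > upper]

print(lower, upper)  # -6.0 18.0
print(outliers)      # [20, 21]
```

With these assumed quartiles the upper cutoff lands at 18 hours, so 16 stays inside the whisker while 20 and 21 are flagged. The lower cutoff is below zero, which is why no low scores show up as outliers.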