Hi everyone. Lloyd Rieber here again with my next presentation in the course, "Statistics
in Education for Mere Mortals." In this presentation I'll introduce you to the topic of descriptive
statistics. Now, we have already explored one of the main types of descriptive statistics,
namely measures of central tendency, such as the mean, median, and mode. Of course, you already
knew a lot about that, given your familiarity with averages. In this presentation we'll begin
to explore the other main type of descriptive statistics: variability. We'll begin by exploring
something called the standard deviation. Again, you'll need a solid understanding of these
ideas in order to be ready for our next Excel spreadsheet example. OK, let's get started. Let's
start with a very interesting example that comes from the sport of Major League Baseball.
Ted Williams was the last major league baseball player to hit .400 in a season. He actually
batted .406 in 1941, so that means that for every 10 times he was at bat, he got a hit
four times out of 10, for actual at-bats, not counting walks and sacrifices and things like
that. Now, an interesting question is simply: why is that? Why has no other major league
baseball player since 1941 been able to break through the .400 mark? You might say that the players
now aren't as good as they were back then, but you see there an interesting statistic:
the overall major league batting average has always been about .260, at least since about
1900. So the answer to this question, or I should say a theory as to an answer to this question,
is one that we'll explore based on some of the ideas in this presentation. So before this
presentation is over, we will come back to this question.

We need to get a little more
formal now in our use of language in statistics, and in particular we need to explore the two
words "population" and "sample." Both of these words, I think, have everyday meanings. With "population,"
for example, people will usually think of a large group of people, like the population
of a country or city or state. You think of a large group like you see in the photographs
near the top of the screen. However, a population is simply going to be defined by us: what are
going to be the boundaries of a particular group? So, for example, maybe I would like to
do some work with the Boy Scouts or Girl Scouts. If I'm working with that troop of Girl
Scouts that you see in the middle, and all I care about is their performance or how well
they are improving in something I'm trying to teach them, that one troop can be the population.
But if I would like to make some statements about Girl Scouts in the state of Georgia
or the state of Pennsylvania, and what I am doing to help the Girl Scouts, then I would
be talking about a sample: the Girl Scouts I am working with. Now, this is not just semantics.
One of the reasons why we have to make this distinction is that the statistics we are
going to calculate may be different depending on whether we are defining
a group as a population or as a sample. Indeed, the word "statistic" really has two meanings.
The first is that it simply describes something, whether that be a person or a group or whatever.
The second meaning is that a statistic is an estimate of a population value based on a sample
of data that I happen to have. So this distinction between the two meanings is extremely important
from here on out. So I have an important decision to make. I have a choice to make. Am I interested
in collecting data on a population? If so, that means I have to have access to every single member
of the population. That's not very likely in most situations. More likely is that I'm interested
in knowing about a population, but I can only collect data on a sample. So I might be interested
in helping an organization like the Girl Scouts, and I would like to be able to design something
that helps all the Girl Scouts learn something, but there's no way that I can try that out
and have every single Girl Scout go through this and collect data on every single Girl
Scout. Instead, I would have to choose a sample and make some inferences. Or am I interested
in collecting data on a sample but don't care about the population? I think this third situation
is most typical in training situations, especially in business and industry. So while I might
accept the fact that the population might be all the employees of a particular company,
what I'm most interested in, or what I'm only interested in, is helping a small group of
employees to perform better in their jobs. I was hired to help this small group actually
perform better in their job. So here is an illustration of what we've been talking about.
We are interested in a population and we want to describe it in some way. If I have access
to every single member of the population and can take a measurement, that measurement is
called a parameter. Now, as I say, most of the time it's very impractical to have access
to every member of the population, so we are going to use a process of sampling to take
a sample. That sample, of course, will be incomplete. It will not include the entire
population; it'll probably include a very small number of members of the population. So again,
we are going to describe that sample, and we describe that sample with a statistic. And
so the statistic is then used to make an inference about the parameter. And I can't help but bring
up a classic example of trying to describe a population, namely the United States Census.
Every 10 years we try to count every single person who is living in the United States
in order to determine how many representatives each state will have in the U.S. House of Representatives,
one part of the Congress. Now, it's kind of interesting, because in the 2000 census there
actually was an attempt to use statistical sampling instead of actually trying to count
every single person, but that was ruled as being unconstitutional, and you can read an
excerpt there: it is in the U.S. Constitution that we have to count every single person,
so sampling was considered not allowed or not appropriate. This is obviously a very controversial
topic, but many people, especially those who know something about statistics, really think
this is a shame, because it would save a lot of money and they believe that the result
would be essentially equivalent.

Before we go any further, I want to talk a little bit
about what I think is one of the most challenging aspects of statistics for a lot of people,
and it's really just the fact that the mathematical symbols we're going to be using are going
to be very strange and unfamiliar. I don't think the concepts or the computations are
very difficult at all, but how they are represented mathematically can be a big obstacle for people.
I just want to remind you that we have to get used to mathematical symbols. For
example, here are a bunch of mathematical symbols that I'm sure you already know and are very
comfortable with: the plus sign, the equal sign, the divide sign, and three ways to show multiplication
with a symbol. We accept these, and we've grown accustomed to them, and we can see that
they allow us to describe, in very concise terms or symbols, some very interesting and
very complicated operations. I just want to remind you, as you get into some of these
symbols in our next set of computations, that the only mathematics you will need is
simple arithmetic. So don't let the symbols put you off, but you will have to take time
to master them and to feel as comfortable with the symbols we're about to explore
as you are with these symbols of basic arithmetic. So here we go!
We already know this statement, "compute the mean for a set of numbers." We're very comfortable
with it. But how do we express that in the mathematical symbols of statistics? Well, there actually
are going to be two ways, depending on whether I am talking about a population or
about a sample. So again, you look at these symbols and you're saying, "What is this?" But
remember, it is simply the idea of "compute the mean for a set of numbers." In the example
of the population, we have a Greek symbol, mu, that kind of looks like an italic "u," but it's
really the Greek letter mu. And that is equal to, well, we have another Greek symbol, which
is the summation sign. So we are going to sum all of the scores, and that X, capital X, stands
for a particular score. So we're going to sum, or add up, all of the scores and divide by,
well, capital N is the total number of scores. So that's exactly what we've been doing. Now,
for a sample, we're going to show the difference between a sample and a population by having
what looks to be almost exactly the same formula, except instead of the Greek symbol
mu we have something called X bar, which is again just the mean of the sample. So I would
read that as "X bar equals the sum of X," which is the sum of all the scores, "divided by the
total number of scores." The summation sign is actually the uppercase Greek letter sigma. Now,
the calculation we actually do is exactly the same whether it be for a population or
a sample, so in this case it really is more of a symbolic difference between the two. The
actual calculation is the same, but by using the symbols properly I am able to understand
whether I am talking about a population or a sample. So again, the ideas are nothing that we haven't
already explored; we're just expressing them now in the symbols of statistics. So there
are two important types of descriptive statistics. The first are measures of central tendency,
which we've already discussed and covered, the examples being the mean, the median, and
the mode. But now we're going to discuss and consider measures of variability, that is, how variable
or how varied the scores are. Here I've drawn two examples of the normal curve, one
superimposed upon the other. They're both symmetrical, but you notice how one is wider, or more spread
out, or fatter, than the other one. There is actually a word that statisticians use for
the shape of a normal curve like this, and it's called kurtosis, and that is not another
form of bad breath, just a word that talks about or means that curve's quality of "peakedness."
But I like to think about it more in terms of the spread of the scores, or the spread
of the curve being wider or more narrow. So let's consider some ways to measure variability.
I'm going to lay out three methods. Method one would be to compute the difference between
the highest and the lowest score. So if I gave a test and the highest score was a 90 and
the lowest score was a 50, the difference would be 40. That's also called the range, and that
sounds reasonable. You know, the highest and lowest score gives me some sense of how varied
the scores are, but note, it only takes into account two of the scores. So if
I gave that test to 100 people, I'm not even considering the scores of the other 98 people.
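
As a rough illustration of method one, the range can be sketched in a few lines of Python. This is only a sketch: the lists of scores are made up for the example (only the high of 90 and the low of 50 come from the presentation), and `score_range` is a hypothetical helper name.

```python
def score_range(scores):
    """Return the range: the highest score minus the lowest score."""
    return max(scores) - min(scores)


# Hypothetical test scores; only the 90 and the 50 come from the example above.
scores = [90, 72, 68, 85, 50, 77]
print(score_range(scores))  # 40

# Very different middle scores, same highest and lowest score.
other_scores = [90, 51, 51, 51, 50, 51]
print(score_range(other_scores))  # 40
```

Both lists give the same range even though the scores in between are very different, which is the weakness of method one: only two of the scores are taken into account.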
So what's method two? Well, method two starts off by computing the difference between each
score and the grand mean. Recall that the grand mean is simply the average of all of
the scores. Well, great! And in fact that is a very important statistic called a deviation
score. So if I do that, maybe what I should do is take the average of those deviation scores. Well, the problem
is, and we'll look into this when we get into our Excel spreadsheet example, the answer is
always zero. You might be wondering now, why is it always zero? Well, if you think about
what an average is, if the average was 75, for example, you're going to have a certain number
of people who scored above the average and a certain number of people who scored below
the average, so a person's score minus the average is either going to be a positive number
or a negative number. So it kind of balances out, you might say, when you add up that entire
set of deviation scores. So we need another method, and that method, method number three,
is actually going to be to compute this statistic called the standard deviation. What we do
is we square the difference scores first, those deviation scores, and later we're going to
take the square root of all of that. Now, when you square something, if it is a negative
number, a negative times a negative equals a positive. So we're going to wind up with
all positive numbers. Then, since we squared everything, we need to take the square root
of it. That's a pretty clever idea. As you will see in the formula, we will also be dividing
by the number of scores, so in a way it's more or less an average of how each score varies
from the group. So here is the formula for a standard deviation when describing a sample,
which is not the same as trying to estimate a parameter of a population. We'll talk about
that again on the next slide. And here's the symbolism that we were talking
about earlier, but it's exactly what I was describing to you a few minutes ago. So "S"
is the standard deviation of the sample, and that's going to be equal to the square root
of the sum of all those deviation scores squared, divided by N. So I'm going to take each person's score, subtract
from it the average or the mean of the group, and whatever that number is, positive or negative,
I'm going to square it. So it's going to turn out to be a positive. I will sum all
those up, and then I will divide that by the total number of scores in the data set, and
then I will take the square root of that value, and that will be the standard deviation.
I'm sure this might be as clear as mud right now in your heads, but I assure you, when
we take a look at it in Excel, it will seem very easy. It will seem, I would say, even simple.
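
Written out, the formula just described would look something like the following. This is a reconstruction from the spoken description rather than a copy of the slide, so the exact notation is an assumption: S is the standard deviation when describing the sample, X is an individual score, X bar is the mean of the scores, and N is the total number of scores.

```latex
S = \sqrt{\frac{\sum \left( X - \bar{X} \right)^{2}}{N}}
```

Reading from the inside out gives the same steps as the description: subtract the mean from each score, square each difference, sum the squares, divide by N, and take the square root.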
OK, let's look at the formula again for the standard deviation, this
time, though, where we have a sample but our intent is to estimate the population parameter,
which would be trying to estimate what the standard deviation is for the population. So
as you look at the formula, it looks almost identical, but what should jump out at you
is that instead of dividing by N, we're dividing by N minus one. You might also notice that
instead of the capital S we have a lowercase s with a little caret on top of it, which
would be called "s hat," but it's simply trying to show you that what we're now trying
to compute is an estimate: it's a standard deviation which is an estimate of the standard
deviation of the population. But why do we have N - 1? Well, the answer is a little abstract,
but let me give it a try. As you know, looking at the numerator, we have a person's score
minus the group average. Now, the group average is the average from the sample; we don't know
the mean of the population. And from statistics, the history of statistics, we have learned
that that number tends to be a little too small, that is, it tends to underestimate the
true value. So since it's a little too small, we make up for it by dividing by a denominator
that's also a little smaller. When the denominator gets a little smaller, the overall
value gets a little bit higher or larger. Now also notice, of course, with a very small
sample, let's say my sample was only 10, 10 minus 1 is 9. And that's a pretty big difference,
dividing by 10 or 9. Now, as my sample gets very, very large, if I had a sample of 10,000,
well, dividing by 10,000 as compared to 9,999 is not going to be as big a difference. So
the larger the sample, the closer the estimate. Now, I'm sure that was not a very satisfactory
explanation, and I'm sorry about that. This is one of those subtle points of statistics
that you need to be aware of and continue to study: when to use N versus N - 1 in the denominator.
But the important point to be made here is the point we were trying to make earlier, and
that is: OK, I have collected data on a sample. Is my interest only in that sample, or am I
trying to estimate the population parameter? Now, that was much too abstract for my taste,
so let's do an example, and I think you'll find this will help tremendously in making
sense out of all that. Here's a little example that I like. It again deals with the Girl Scouts.
The Girl Scouts like to sell cookies, and it's a great fundraiser. Let's say there was
a troop of six girls. You can see I have a table here, and in the left column,
with the capital X at the top, right under "boxes of cookies," I see what each girl in the
troop sold. So girl one sold 28, girl two, 11, girl three, 10, and so on, and below that
you see the sum of all those scores is 60. In the second column you see that we have
the deviation scores, which is each score minus the average. So if I want to see what the average was,
I take the sum of all those boxes of cookies, I get 60, and 60 divided by six is 10. So the
deviation scores are going to be, again, those individual sales numbers minus the average
of 10. So girl number one, her deviation score is going to be 28 minus 10, which is 18; girl
two, 11 minus 10 is one; girl three, right on the average, 10 minus 10 is zero. Then you
can see the next three girls sold numbers of boxes that were below the average,
so they have negative numbers. So those are deviation scores, and at the bottom here
I put in the other symbol for a deviation score. A deviation score, again, is the score
minus the average, or X minus X bar, but another symbol for the deviation score is a lowercase
x. And if I were to add those up, and you can do that, you will see that that sum comes to
zero, and that's a good check of our work, because mathematically it has to be zero. Now let's
look at the third column, which consists of all the deviation scores squared. As you
can see, 18 squared is 324, one squared is 1, 0 squared is zero, but now you see the negative
numbers: -5 is going to be squared and the answer is going to be 25, -6 is going to be
36, and -8 is 64. So all of the values in the third column are positive numbers. If I sum
all those up, I get 450. So let's now compute the standard deviation. In this particular
case I decided that we definitely have a sample and I am not interested in estimating the
population standard deviation, so I simply am going to be dividing by N. You see down
below in the blue box I have the formula for calculating the standard deviation, so you
can see I am squaring all those deviation scores, summing them up, dividing by
N, and then taking the square root of that entire value. So you kind of move over to the next
spot here. And it is just the same formula, but I'm using the symbol x instead of
X minus X bar, with the lowercase x being the deviation score symbol. So again,
I'm going to square that and sum all those up, dividing by how many Girl Scouts I have,
and then take the square root of all of that. So if I do that, you will see that I get a
value of 450 divided by six, and then you take the square root of that, so that would be the
square root of 75, which equals 8.66. The standard deviation is 8.66. So the important question
and concept that I have to wrestle with now is, what does that mean exactly, a standard
deviation of 8.66? Well, if all the girls sold exactly the same number of boxes of cookies,
if they all sold 10 boxes, for example, there would be no variation, there would be no deviation
from the mean, and so the standard deviation in that case would be zero. Now, if the girls
all sold a similar number of boxes, so that all of their scores were hovering very,
very closely around the group average, then you would have a very small standard deviation.
But if you have some of the girls selling a lot of cookies and a lot of the girls selling
very few cookies, then you have a very large variation, a large spread in that distribution
of scores, and therefore you would have a large standard deviation. So here is our old friend
the normal curve. This time I've drawn it, as you can see, in a fairly colorful way, and honestly
I didn't draw this one; I found it doing a Google search for copyright-free images. This
is actually a very interesting representation, because along the bottom you see we have these
little tick marks at equal distances; they are equidistant. At the very center I have
the mean. I'm using the population symbol mu to represent the mean. Of course, it's a
normal distribution, so the mean, median, and mode would all be at that particular point. Now
notice how we have these bands of blue that are going up like columns, and again,
I want to point out that these are at equal marks along the x-axis. What this really
represents are standard deviation units. So one standard deviation to the right of the
mean is the one sigma; again, sigma is the symbol used to represent the standard deviation
of a population. If I go to the left of mu, I have negative one sigma, and so again those
are equidistant. But notice the column, or the band of blue. If you can imagine that
being filled with water, that's how I like to think of it, the amount of water that is
contained to the right and to the left in that first band of dark blue is 34.1% on each side, and
that's a very important number. The next number, in the slightly less dark blue, is 13.6%, and
then finally you see 2.1% in the lightest blue, and then at the tails you see 0.1%. Now,
these numbers are very important because they will tell you, in standard deviation units,
what percentage of the scores will be found in that one
standard deviation unit. So if the normal distribution is very narrow, you're going to have those
bands be closer together, but those bands are still going to contain that percentage of
the scores. You'll see these numbers often in your statistical career, so these are actually
numbers worth memorizing. OK, let's end this presentation by coming back to the question
that we started with. This goes to the question of why there have not been any more .400 hitters
since 1941 and Ted Williams. Well, the answer may involve the principle of variance, and
this is a theory that was put forward by Stephen Jay Gould, the late paleontologist and biologist.
Basically it goes like this: he believed that the skill level of all players is more
similar now than before, or, stated another way, there is less variation between the best
and worst players, including hitters and pitchers, of course. So someone like Ted Williams, who
was an excellent hitter no matter what time frame in baseball we are talking about, Ted Williams
was going to be facing, sure, some of the best pitchers of all time, but he was also going
to be facing pitchers who were very much less skilled than we have today. So perhaps he had
more opportunities to take advantage of the overall lower skill level, in terms of the variation
of the pitchers that he was facing. I don't know if this is a theory that I believe or
not; it's, again, very controversial. But it's kind of a fun theory, and I love the way that
it brings in the idea of variance to try to explain something, because again, the major
league batting average has always been about .260. So I hope this presentation has at least
set the stage for getting you to think about some of the important ideas, and I hope it is now
going to have you be ready for our next Excel spreadsheet example. As I say, actually
doing the examples is much more concrete, and I believe once you do the example you will
see very clearly how to do the computation, and because of the power of Excel I do believe
it's a very simple operation to follow. Yes, there are many steps we're going to have to
go through, but Excel is going to do the hard work for us. So, until next time!
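
The Girl Scout cookie example from earlier in the presentation can also be checked in a few lines of Python. This is only a sketch, not the Excel method the course uses. One assumption is labeled in the code: the presentation states the first three sales figures (28, 11, 10), and the last three (5, 4, 2) are inferred here from the deviation scores of -5, -6, and -8 around the mean of 10. The variable names are made up for the example.

```python
import math

# Boxes sold by the six girls. 28, 11 and 10 are stated in the presentation;
# 5, 4 and 2 are inferred from the deviation scores -5, -6 and -8 (assumption).
boxes = [28, 11, 10, 5, 4, 2]

n = len(boxes)
mean = sum(boxes) / n                     # 60 / 6 = 10.0
deviations = [x - mean for x in boxes]    # [18.0, 1.0, 0.0, -5.0, -6.0, -8.0]
squared = [d ** 2 for d in deviations]    # [324.0, 1.0, 0.0, 25.0, 36.0, 64.0]
sum_of_squares = sum(squared)             # 450.0

# Describing this troop only: divide by N.
sd_describe = math.sqrt(sum_of_squares / n)        # square root of 75

# Estimating a larger population from this troop: divide by N - 1.
sd_estimate = math.sqrt(sum_of_squares / (n - 1))  # square root of 90

print(sum(deviations))        # 0.0
print(round(sd_describe, 2))  # 8.66
print(round(sd_estimate, 2))  # 9.49
```

With only six scores, the N - 1 version (about 9.49) is noticeably larger than the N version (8.66), which matches the point that the choice of denominator matters most for small samples. Python's standard library offers the same two calculations as `statistics.pstdev` (divide by N) and `statistics.stdev` (divide by N - 1).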