Statistics in Education for Mere Mortals - Descriptive statistics - The standard deviation

Hi everyone. Lloyd Rieber here again with my next presentation in the course, "Statistics in Education for Mere Mortals." In this presentation I'll introduce you to the topic of descriptive statistics. Now, we have already explored one of the main types of descriptive statistics, namely measures of central tendency such as the mean, median, and mode. Of course, you already knew a lot about that, given your familiarity with averages. In this presentation we'll begin to explore the other main type of descriptive statistics, variability, and we'll begin by exploring something called the standard deviation. Again, you'll need a solid understanding of these ideas in order to be ready for our next Excel spreadsheet example. OK, let's get started.

Let's start with a very interesting example that comes from the sport of major league baseball. Ted Williams was the last major league baseball player to hit .400 in a season. He actually batted .406 in 1941. That means that for every 10 times he was at bat, he got a hit about four times, counting actual at-bats and not walks, sacrifices, and things like that. Now, an interesting question is simply, why is that?
Why has no other major league baseball player since 1941 been able to break through the .400 mark? You might say that the players now aren't as good as they were back then, but there is an interesting statistic: the overall major league batting average has always been about .260, at least since about 1900. So the answer to this question, or I should say a theory as to an answer, is one that we'll explore based on some of the ideas in this presentation. Before this presentation is over, we will come back to this question.

We need to get a little more formal now in our use of language in statistics, and in particular we need to explore two words: population and sample. Both of these words, I think, have everyday meanings. With population, for example, people will usually think of a large group of people, like the population of a country or city or state. You think of a large group like you see in the photographs near the top of the screen. However, a population is simply going to be defined by us: we decide what the boundaries of a particular group are going to be. So, for example, maybe I would like to do some work with the Boy Scouts or Girl Scouts. If I'm working with that troop of Girl Scouts that you see in the middle, and all I care about is their performance or how well they are improving in something I'm trying to teach them, that one troop can be the population. But if I would like to make some statements about Girl Scouts in the state of Georgia or the state of Pennsylvania, and about what I am doing to help the Girl Scouts, then the troop I am working with would be a sample of those Girl Scouts.

Now, this is not just semantics. One of the reasons why we have to make this distinction is that the statistics we are going to calculate may be different depending on whether we define a group as a population or as a sample. Indeed, the word statistic really has two meanings. The first is that it simply describes something, whether that be a person, a group, or whatever;
the second meaning is that a statistic is an estimate of a population value based on a sample of data that I happen to have. So this distinction between the two meanings is extremely important from here on out.

So I have an important decision to make; I have a choice to make. Am I interested in collecting data on a population? If so, that means I have to have access to every single member of the population. That's not very likely in most situations. More likely is that I'm interested in knowing about a population but can only collect data on a sample. So I might be interested in helping an organization like the Girl Scouts, and I would like to be able to design something that helps all the Girl Scouts learn something, but there's no way that I can try that out, have every single Girl Scout go through it, and collect data on every single Girl Scout. Instead, I would have to choose a sample and make some inferences. Or am I interested in collecting data on a sample but don't care about the population? I think this third situation is most typical in training situations, especially in business and industry. So while I might accept the fact that the population might be all the employees of a particular company, what I'm most interested in, or what I'm only interested in, is helping a small group of employees to perform better in their jobs.
I was hired to help this small group actually perform better in their jobs.

So here is an illustration of what we've been talking about. We are interested in a population and we want to describe it in some way. If I have access to every single member of the population and can take a measurement, that measurement is called a parameter. Now, as I say, most of the time it's very impractical to have access to every member of the population, so we are going to use a process of sampling to take a sample. That sample, of course, will be incomplete. It will not include the entire population; it'll probably include a very small number of members of the population. So again, we are going to describe that sample, and we describe that sample with a statistic. The statistic is then used to make an inference about the parameter.

I can't help but bring up a classic example of trying to describe a population, namely the United States Census. Every 10 years we try to count every single person who is living in the United States in order to determine how many representatives we will have in the U.S. House of Representatives, one part of the Congress. Now, it's kind of interesting because in the 2000 census there actually was an attempt to use statistical sampling instead of trying to count every single person, but that was ruled as being unconstitutional, and you can read an excerpt there. It is in the U.S.
Constitution that we have to count every single person, so sampling was considered not allowed or not appropriate. This is obviously a very controversial topic, but many people, especially those who know something about statistics, really think this is a shame because it would save a lot of money, and they believe that the result would be essentially equivalent.

Before we go any further, I want to talk a little bit about what I think is one of the most challenging aspects of statistics for a lot of people, and it's really just the fact that the mathematical symbols we're going to be using are going to be very strange and unfamiliar. I don't think the concepts or the computations are very difficult at all, but how they are represented mathematically can be a big obstacle for people. I just want to remind you that we have to get used to mathematical symbols. For example, here are a bunch of mathematical symbols that I'm sure you already know and are very comfortable with: the plus sign, the equal sign, the divide sign, and three ways to show multiplication with a symbol. We accept these, we've grown accustomed to them, and we can see that they allow us to describe, in very concise terms or symbols, some very interesting and very complicated operations. I also want to remind you, as you get into some of these symbols in our next set of computations, that the only mathematics you will need is simple arithmetic. So don't let the symbols put you off, but you will have to take time to master them and to feel as comfortable with the symbols we're about to explore as you are with these symbols of basic arithmetic. So here we go!
We already know this statement, "compute the mean for a set of numbers," and we're very comfortable with it. But how do we express that in the mathematical symbols of statistics? Well, there actually are going to be two ways, depending on whether I am talking about a population or a sample. So again, you look at these symbols and you're saying, what is this? But remember, it is simply the idea of computing the mean for a set of numbers.

In the example of the population, we have a Greek symbol, mu. It kind of looks like an italic "u," but it really is the Greek letter mu. That is equal to, well, we have another Greek symbol, which is the summation sign, so we are going to sum all of the scores. The capital X stands for a particular score, so we're going to sum, or add up, all of the scores and divide by capital N, which is the total number of scores. So that's exactly what we've been doing.

Now, for a sample, we show the difference between a sample and a population with what looks to be almost exactly the same formula, except that instead of the Greek symbol mu we have something called X bar, which is again just the mean of the sample. I would read that as "X bar equals the sum of X," which is the sum of all the scores, "divided by the total number of scores." The summation sign is actually the uppercase Greek letter sigma. Now, the calculation we actually do is exactly the same whether it be for a population or a sample, so in this case it really is more of a symbolic difference between the two. The actual calculation is the same, but by using the symbols properly I am able to show whether I am talking about a population or a sample. So again, the ideas are nothing that we haven't already explored; we're just expressing them now in the symbols of statistics.

So there are two important types of descriptive statistics. The first are measures of central tendency, which we've already discussed and covered, the examples being the mean, the median, and the mode. But now we're going to discuss
and consider measures of variability, that is, how variable or how varied the scores are. Here I've drawn two examples of the normal curve, one superimposed upon the other. They're both symmetrical, but you notice how one is wider, or more spread out, or fatter than the other one. There is actually a word that statisticians use for the shape of a normal curve like this, and it's called kurtosis. That is not another form of bad breath, just a word that refers to a curve's quality of "peakedness." But I like to think about it more in terms of the spread of the scores, or the spread of the curve being wider or more narrow.

So let's consider some ways to measure variability. I'm going to lay out three methods.

Method one would be to compute the difference between the highest and the lowest score. So if I gave a test and the highest score was a 90 and the lowest score was a 50, the difference would be 40. That's also called the range, and that sounds reasonable. You know, the highest and lowest score gives me some sense of how varied the scores are. But note, it only takes into account two of the scores, so if I gave that test to 100 people, I'm not even considering the scores of the other 98 people.

So what's method two? Well, method two starts off by computing the difference between each score and the grand mean. Recall that the grand mean is simply the average of all of the scores. Well, great! In fact, that difference is a very important statistic called a deviation score. So if I do that, maybe what I should do is take the average of those deviation scores. Well, the problem is, and we'll look into this when we get into our Excel spreadsheet example, the answer is always zero. You might be wondering now, why is it always zero?
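Methods one and two can be tried out with a few lines of Python before turning to Excel. This is only a rough sketch: the six test scores are invented for illustration (chosen so the highest is 90, the lowest is 50, and the mean is 75, matching the numbers mentioned in the talk), and the variable names are made up as well.

```python
# Hypothetical test scores, invented purely for illustration.
scores = [90, 85, 80, 75, 70, 50]

# Method one: the range looks only at the highest and lowest scores.
score_range = max(scores) - min(scores)

# Method two: subtract the grand mean from every score (deviation scores).
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

# Adding up the deviation scores shows why averaging them is not useful.
deviation_sum = sum(deviations)

print("range:", score_range)
print("mean:", mean)
print("deviation scores:", deviations)
print("sum of deviation scores:", deviation_sum)
```

Changing the numbers in `scores` changes the range and the mean, but the sum of the deviation scores stays at zero (apart from tiny rounding error when the mean is not a round number).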
Well, if you think about what an average is, if the average was 75, for example, you're going to have a certain number of people who scored above the average and a certain number of people who scored below the average. So a person's score minus the average is going to be either a positive number or a negative number. So it kind of balances out, you might say, when you add up that entire set of deviation scores.

So we need another method, and that method, method number three, is actually going to be to compute this statistic called the standard deviation. What we do is square the deviation scores first, and then later take the square root of all of that. Now, when you square something, if it is a negative number, a negative times a negative equals a positive, so we're going to wind up with no negative numbers. Then, since we squared everything, we need to take the square root. That's a pretty clever idea. As you will see in the formula, we will also be dividing by the number of scores, so in a way it's more or less an average of how much each score varies from the group mean.

So here is the formula for a standard deviation when describing a sample, which is not the same as trying to estimate a parameter of a population. We'll talk about that again on the next slide. Here's the symbolism that we were talking about earlier, but it's exactly what I was describing to you a few minutes ago. Capital S is the standard deviation of the sample, and that's going to be equal to the square root of the sum of all those squared deviation scores. So I'm going to take each person's score and subtract from it the average, or the mean, of the group. Whatever that number is, positive or negative, I'm going to square it, so it's going to turn out to be positive.
I will sum all those up, and then I will divide that by the total number of scores in the data set, and then I will take the square root of that value, and that will be the standard deviation. I'm sure this might be as clear as mud right now in your heads, but I assure you, when we take a look at it in Excel it will seem very easy. It will seem, I would say, even simple.

OK, let's look at the formula again for the standard deviation, this time where we have a sample but our intent is to estimate the population parameter, that is, to estimate what the standard deviation is for the population. As you look at the formula, it looks almost identical, but what should jump out at you is that instead of dividing by N we're dividing by N minus one. You might also notice that instead of the capital S we have a lowercase s with a little caret on top of it, which would be called "s hat." It's simply showing you that what we're now trying to compute is an estimate: a standard deviation which is an estimate of the standard deviation of the population.

But why do we have N minus one? Well, the answer is a little abstract, but let me give it a try. As you know, looking at the numerator, we have a person's score minus the group average. Now, the group average is the average from the sample; we don't know the mean of the population. From the history of statistics we have learned that the numerator computed this way tends to be a little too small, that is, it tends to underestimate the true value. Since it's a little too small, we make up for it by dividing by a denominator that's also a little smaller. When the denominator gets a little smaller, the overall value gets a little bit larger. Now also notice, of course, with a very small sample, let's say my sample was only 10, 10 minus 1 is 9, and that's a pretty big difference, dividing by 10 or 9.
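For anyone following along without the slides, the formulas described in words up to this point might be written out roughly as follows. This is a reconstruction from the spoken description, not a copy of the slides. In particular, using N for the number of scores in both the population and the sample formulas is an assumption.

```latex
% Mean of a population and mean of a sample (same arithmetic, different symbol)
\mu = \frac{\Sigma X}{N}
\qquad
\bar{X} = \frac{\Sigma X}{N}

% Standard deviation when only describing the sample: divide by N
S = \sqrt{\frac{\Sigma (X - \bar{X})^{2}}{N}}

% Standard deviation as an estimate of the population value ("s hat"): divide by N - 1
\hat{s} = \sqrt{\frac{\Sigma (X - \bar{X})^{2}}{N - 1}}
```

The two standard deviation formulas share the same numerator, so the only difference between them is the divisor under the square root.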
Now, as my sample gets very, very large, if I had a sample of 10,000, well, dividing by 10,000 as compared to 9,999 is not going to be as big a difference. So the larger the sample, the smaller the difference between dividing by N and dividing by N minus one. Now, I'm sure that was not a very satisfactory explanation, and I'm sorry about that. This is one of those subtle points of statistics that you need to be aware of and continue to study: when to use N versus N minus one in the denominator. But the important point to be made here is the point we were trying to make earlier, and that is: OK, I have collected data on a sample. Is my interest only in that sample, or am I trying to estimate the population parameter?

Now, that was much too abstract for my taste, so let's do an example, and I think you'll find this will help tremendously in making sense out of all that. Here's a little example that I like. It again deals with the Girl Scouts. The Girl Scouts like to sell cookies, and it's a great fundraiser. Let's say there was a troop of six girls. You can see I have a table here, and in the left column, with the capital X at the top, right under "boxes of cookies," I see what each girl in the troop sold.
So girl one sold 28, girl two 11, girl three 10, and so on, and below that you see the sum of all those scores is 60. In the second column you see the deviation scores, which is each score minus the average. To see what the average was, I take the sum of all those boxes of cookies, which is 60, and 60 divided by six is 10. So the deviation scores are going to be those individual sales numbers minus the average of 10. For girl number one, her deviation score is going to be 28 minus 10, which is 18. Girl two, 11 minus 10 is one. Girl three, right on the average, 10 minus 10 is zero. Then you can see the next three girls sold numbers of boxes that were below the average, so they have negative numbers. Those are the deviation scores. At the bottom here I put in the other symbol for a deviation score. A deviation score, again, is the score minus the average, or X minus X bar, but another symbol for the deviation score is a lowercase x. If I were to add those up, and you can do that, you will see that the sum comes to zero, and that's a good check of our work because mathematically it has to be zero.

Now let's look at the third column, which consists of all the deviation scores squared. As you can see, 18 squared is 324, one squared is 1, and zero squared is zero. But now you see the negative numbers: -5 squared is 25, -6 squared is 36, and -8 squared is 64. So none of the values in the third column is negative. If I sum all those up, I get 450.

So let's now compute the standard deviation. In this particular case I decided that we definitely have a sample and I am not interested in estimating the population standard deviation, so I am simply going to divide by N. You see down below in the blue box I have the formula for calculating the standard deviation. I take the deviation scores, square them, sum them up, divide by N, and then take the square root of that entire value.
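The three columns of the cookie table can be rebuilt with a short Python sketch. The first three sales figures (28, 11, 10) are stated in the talk. The last three (5, 4, 2) are not read aloud and are inferred here from the stated deviation scores of -5, -6, and -8 and the mean of 10. The variable names are made up for this illustration, and the N minus one version is included only for comparison.

```python
import math
import statistics

# Boxes sold by each of the six girls (last three inferred from the deviations).
boxes = [28, 11, 10, 5, 4, 2]

n = len(boxes)
mean = sum(boxes) / n                      # 60 / 6
deviations = [x - mean for x in boxes]     # column 2: x = X - X bar
squared = [d ** 2 for d in deviations]     # column 3: x squared
sum_of_squares = sum(squared)

# Describing this troop only: divide by N.
sd_describe = math.sqrt(sum_of_squares / n)
# Estimating a larger population from this troop: divide by N - 1.
sd_estimate = math.sqrt(sum_of_squares / (n - 1))

print(f"{'X':>4} {'x = X - mean':>14} {'x squared':>10}")
for score, dev, sq in zip(boxes, deviations, squared):
    print(f"{score:>4} {dev:>14.0f} {sq:>10.0f}")
print(f"{sum(boxes):>4} {sum(deviations):>14.0f} {sum_of_squares:>10.0f}  (column sums)")

print("standard deviation, dividing by N:    ", round(sd_describe, 2))
print("standard deviation, dividing by N - 1:", round(sd_estimate, 2))

# Cross-check with the standard library: pstdev divides by N, stdev by N - 1.
print("statistics.pstdev:", round(statistics.pstdev(boxes), 2))
print("statistics.stdev: ", round(statistics.stdev(boxes), 2))
```

The column sums printed at the bottom should match the 60, 0, and 450 from the table, which is the same check of the work suggested in the talk.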
So you kind of move over to the next spot here. It is just the same formula, but I'm using the symbol lowercase x instead of X minus X bar, with lowercase x being the deviation score symbol. So again, I'm going to square that, sum all those up, divide by how many Girl Scouts I have, and then take the square root of all of that. If I do that, you will see that I get 450 divided by six, which is 75, and the square root of 75 is about 8.66. The standard deviation is 8.66.

So the important question and concept that I have to wrestle with now is, what does that mean exactly, a standard deviation of 8.66? Well, if all the girls sold exactly the same number of boxes of cookies, if they all sold 10 boxes for example, there would be no variation, there would be no deviation from the mean, and so the standard deviation in that case would be zero. Now, if the girls all sold a similar number of boxes, so that all of their sales were hovering very closely around the group average, then you would have a very small standard deviation. But if you have some of the girls selling a lot of cookies and a lot of the girls selling very few cookies, then you have a very large variation, a large spread in that distribution of scores, and therefore you would have a large standard deviation.

So here is our old friend the normal curve. This time I've drawn it, as you can see, in a fairly colorful way, and honestly, I didn't draw this one. I found it doing a Google search of copyright-free images. This is actually a very interesting representation, because along the bottom you see these little tick marks at equal distances; they are equidistant.
At the very center I have the mean. I'm using the population symbol mu to represent the mean, and of course it's a normal distribution, so the mean, median, and mode would all be at that particular point. Now notice how we have these bands of blue that are going up like columns, and again, I want to point out that these are at equal marks along the x-axis. What this really represents are standard deviation units. So one standard deviation to the right of the mean is one sigma; sigma is the symbol used to represent the standard deviation of a population. If I go to the left of mu I have negative one sigma, and again those are equidistant. But notice the column, or the band, of blue. If you can imagine that being filled with water, that's how I like to think of it: the amount of water that is contained in that first band of dark blue is 34.1% on the right of the mean and 34.1% on the left, and that's a very important number. The next number, in the slightly less dark blue, is 13.6%, then finally you see 2.1% in the lightest blue, and then at the tails you see 0.1%. Now, these numbers are very important because they tell you what percentage of the scores will be found within each standard deviation unit. So if the normal distribution is very narrow, those bands are going to be closer together, but they are still going to contain the same percentages of the scores. You'll see these numbers often in your statistical career, so these are actually numbers worth memorizing.

OK, let's end this presentation by coming back to the question that we started with: why have there not been any more .400 hitters since 1941 and Ted Williams? Well, the answer may involve the principle of variance. This is a theory that was put forward by Stephen Jay Gould, the late paleontologist and biologist, and basically it goes like this: he believed that the skill level of all players is more similar
now than before, or, maybe stated another way, there is less variation between the best and worst players, including hitters and pitchers, of course. So someone like Ted Williams, who was an excellent hitter no matter what time frame in baseball we are talking about, was going to be facing, sure, some of the best pitchers of all time, but he was also going to be facing pitchers who were very much less skilled than the ones we have today. So perhaps he had more opportunities to take advantage of the lower overall skill level, in terms of the variation among the pitchers that he was facing. I don't know if this is a theory that I believe or not; it's again very controversial. But it's kind of a fun theory, and I love the way that it brings in the idea of variance to try to explain something, because again, the major league batting average has always been about .260.

So I hope this presentation has at least set the stage for getting you to think about some of these important ideas, and I hope it has you ready for our next Excel spreadsheet example. As I say, actually doing the examples is much more concrete, and I believe once you do the example you will see very clearly how to do the computation. Because of the power of Excel, I do believe it's a very simple operation to follow. Yes, there are many steps we're going to have to go through, but Excel is going to do the hard work for us. So, until next time!
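As a follow-up to the normal curve discussed earlier, the band percentages quoted from the slide (34.1%, 13.6%, 2.1%, and 0.1%) can be checked with a short Python sketch. It assumes a standard normal curve, with a mean of zero and a standard deviation of one, and computes areas from the error function in Python's `math` module. The helper names `normal_cdf` and `band_percent` are made up for this illustration.

```python
import math

def normal_cdf(z):
    """Proportion of a normal distribution lying below z standard deviations."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def band_percent(low, high):
    """Percentage of scores between two points given in standard deviation units."""
    return 100.0 * (normal_cdf(high) - normal_cdf(low))

# The three bands on one side of the mean, plus the tail beyond three sigma.
first_band = band_percent(0, 1)    # mean to one sigma
second_band = band_percent(1, 2)   # one to two sigma
third_band = band_percent(2, 3)    # two to three sigma
tail = 100.0 * (1.0 - normal_cdf(3))

for label, value in [("0 to 1 sigma", first_band),
                     ("1 to 2 sigma", second_band),
                     ("2 to 3 sigma", third_band),
                     ("beyond 3 sigma", tail)]:
    print(f"{label:>15}: {value:.1f}%")
```

Because the curve is symmetrical, the same percentages apply on the left side of the mean, and all eight pieces together add up to 100%.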