Statistics in Education for Mere Mortals - Introduction to statistics

Hi everyone, Lloyd here with my first presentation of the course, the first of several presentations on key fundamental statistical ideas. Knowing these ideas is very important for making sense of the statistical computations we will be performing. In this presentation I'll introduce you to the key ideas to get you ready for the first Excel spreadsheet example. OK, let's get started.
OK, let me start with one of my favorite quotes: "There are lies, damn lies, and statistics." I'm not really sure who gets credit for this quote, but it's an interesting one, isn't it? I've always felt it didn't quite do justice to statistics, because I believe that only people can lie. Now, they can lie with statistics, but the statistics themselves really have nothing to do with it.
So let us explore this idea a little bit here. The first example is bowling, and I think I gave you the heads-up that our first Excel spreadsheet example was going to be in the context of bowling. So if somebody said to you that they have a bowling average of 175, you begin to make certain assumptions. For example, you assume that the average is based on a collection of scores from actually bowling at a bowling alley, and not perhaps on a Wii, a PlayStation, or some other video game version of bowling. You also expect, of course, that they followed the rules of bowling as they gathered that data. So if their foot crossed the line, that would be a foul, and you would expect their scores to reflect that. Now, you might have wondered, did they count all their scores? For example, somebody who has a bowling average of 175 usually only counts the scores when they're bowling as part of the games in a league, and doesn't count the games that are part of practice. So I think these are all examples of asking where the statistics come from and what the assumptions are under which that data was collected.
And you know, sometimes, especially in education research, we would like to collect data that frankly isn't available to us.
So sometimes we have to make some tough choices, and you will often hear people say that they had a convenience sample. You have to be kind of careful about just how convenient something is and whether that data really will be meaningful. It reminds me of an old joke about a husband who is looking for a shirt button in the kitchen. His wife says, "What are you doing?" and he says, "Well, I'm looking for a button that fell off my shirt in the bedroom." She says, "Well, why are you looking for it in the kitchen?" "Well," he says, "you know, the light is better in here." So sometimes it can be very convenient to gather certain kinds of data, but again, it won't necessarily be that meaningful, if at all.
And I apologize for this slide being so text heavy. I really try to avoid text-heavy slides. But I think this is a good one, and the slide actually comes from ideas in a book that we use in one of our classes at the University of Georgia. The book is Leedy and Ormrod's Practical Research: Planning and Design. It talks about the idea of measurement as a tool of research and the importance of measurement as limiting the data, because, you see, there is far more data than you could ever possibly deal with or collect. So you have to make some decisions on how you're going to limit the data. The data are going to be limited by the measurement construct, the thing you're trying to measure, such as learning, motivation, or engagement; by the instrument capability, since an online survey, for example, has very different capabilities compared to a paper survey; and finally by the amount of raw information that you are prepared to deal with as a result. So again, you have to make some very difficult decisions, but certainly in all research you're going to have to limit the data.
And to give you an idea or example of this, here is really more of a metaphor for the different kinds of data that you will be able to collect, depending on what options you choose.
Here's a satellite photo of a place in the United States, and I wonder if you can figure out where this place is. Now, at this high altitude I obviously can't make out very many details about the actual place, in terms of the buildings and so forth, but I do get some sense of the geography. I think you can see those rivers that are quite prominent, and it looks like there are three rivers there, two coming together to form a third as I go from right to left. And I see that triangle right where the rivers meet. Well, it turns out this is my hometown of Pittsburgh. Taken from this viewpoint, you can get a very interesting sense of the geography, but you don't get any sense of the communities that we have in Pittsburgh.
Now, let's just say I'm in a helicopter, and I come down about halfway closer to the earth. I might get this view. In this view I start to see much more information about neighborhoods. I see the street layouts, and I get some sense of the green areas. I see that baseball field down there. I get some sense, perhaps, of the groupings of people. Now if I come down close to street level (there we go) we get a much better idea of what that neighborhood looks like, the size of the houses, and the size of the yards. And indeed, this is actually my neighborhood. This is the street where I grew up in Pittsburgh. What is really interesting, by the way, is that as I go back up in my helicopter I stay centered on my street. This, again, is right above my street. And this imagery as well is taken with my street as the focal point of the image.
OK, so to kind of wrap that up, here is another little metaphor for you. We'll see if you can figure this out based on my imagery there of the firefighters on the left holding that fire hose and that teacup on the right. So, isolating meaningful data when conducting most research studies is like what?
Well, the metaphor is, you know, that the amount of data that's possible to collect might be that fire hose, but the data that I'm able to collect and make some sense of would be like the teacup's worth of water coming out of the fire hose. Very good.
So, let's now talk about a very important set of concepts in statistical circles called the four scales of measurement. This refers to the question: when I collect data, what kind of data is it? Depending on the scale of measurement of that data, it will tell me what kinds of manipulations or calculations I'm allowed to perform on it. There are four scales of measurement: nominal, ordinal, interval, and ratio. So let's briefly consider these in turn.
First is nominal. The name of the nominal scale actually kind of gives you an idea of what it means, because nominal means name: you're giving a number as a name for something in the data. So, for example, I could give somebody a survey that says, "What is your favorite color?" and in order to code that I might say blue is one, red is two, yellow is three, green is four, and purple is five. So you can imagine giving a survey like that, but again, the numbers don't really mean that green, at four, is twice as much as red, at two. It just means that, as a convenience, four is the name. But some people may do the exact wrong thing by saying, well, why not just add it up and say the average favorite color was 1.6? Lloyd says this result makes absolutely no sense, because the data are nominal, and therefore we can't average them. I'm not allowed to do that mathematical calculation on them.
And again, to use a Pittsburgh example, here are two of my favorite Steelers. Of course, Hines Ward, on the right, has retired, but there we have Hines and Troy Polamalu. You know, Hines's number when he was playing was 86, and Troy's was 43. That's their name on the field. It does not mean that Hines was twice as good as Troy.
OK, let's consider the ordinal scale of measurement.
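First, though, here is a minimal sketch of the favorite-color example in Python (not the Excel spreadsheet used in the course). The survey responses and variable names are made up for illustration; only the color coding comes from the example above. It suggests why counting nominal codes is meaningful while averaging them is not.

```python
from collections import Counter
from statistics import mean

# Nominal coding from the example: the numbers are only names for colors.
color_names = {1: "blue", 2: "red", 3: "yellow", 4: "green", 5: "purple"}

# Hypothetical survey responses (invented for illustration).
responses = [1, 1, 2, 1, 3, 1, 2, 1, 1, 3]

# Valid for nominal data: a frequency distribution and the most common category.
frequencies = Counter(responses)
most_common_code, most_common_count = frequencies.most_common(1)[0]
print("Most preferred color:", color_names[most_common_code])

# Not valid for nominal data: Python will compute it, but the result names no color.
meaningless_average = mean(responses)
print("'Average color':", meaningless_average)
```

The software will not stop you from averaging the codes. Knowing the scale of measurement is what tells you the result should be ignored.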
And really, ordinal means the order of things, or a ranking, and so we're going to compare various pieces of data in terms of one being greater or higher than another. So that is the idea of rank-ordered data. To give you a visual sense of this, here are some of the presidential contenders in the 2008 presidential race, again only on the Democratic side. Some of these people you probably know very well, along with what their futures held. With some you might be struggling to know who they actually were. So suppose I were to give you a survey of some sort that says, please rank the current candidates in your order of preference. Well, I could put them in order, but it doesn't mean that my first-place person is necessarily twice as preferred as the second-place person. The only thing that matters here is the order of the data.
So let's compare that to the interval scale of measurement. Again, this is somewhat of a text-heavy slide, because these are the key ideas, but let's go through them and then come up with a few memorable examples. An interval scale of measurement has equal units of measurement, so the step from point 1 to 2 or from 2 to 3 does actually have some meaning. It's considered an equal amount from point to point. But the zero point has been established arbitrarily, and that's going to be something I'll come back to later. So the zero itself doesn't really mean zero in the sense of nothing. The example that I will come to, and that I might as well give to you now, is the idea of temperature. Zero on one scale, like Fahrenheit, means one thing, but on the Celsius scale it means something else. By having an interval scale, though, I am allowed to determine the mean, the standard deviation, and things like the product-moment correlation. You are also allowed to conduct inferential statistical analyses on interval scales of measurement.
Now, finally, let's compare all that to the ratio scale of measurement.
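Before getting to the ratio scale, here is a short Python sketch of what an arbitrary zero point means in practice, using the standard Celsius-to-Fahrenheit conversion. The function name and sample temperatures are my own choices for illustration. Equal intervals stay equal when the scale changes, but the ratio between two temperatures does not, which is a sign that the ratio was never meaningful.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

temps_c = [20, 30, 40]
temps_f = [celsius_to_fahrenheit(t) for t in temps_c]

# Intervals: the two gaps are equal to each other on both scales.
gaps_c = [temps_c[1] - temps_c[0], temps_c[2] - temps_c[1]]
gaps_f = [temps_f[1] - temps_f[0], temps_f[2] - temps_f[1]]
print("Gaps in Celsius:", gaps_c)
print("Gaps in Fahrenheit:", gaps_f)

# Ratios: they depend on where zero happens to sit on each scale.
ratio_c = temps_c[2] / temps_c[0]
ratio_f = temps_f[2] / temps_f[0]
print("40/20 in Celsius:", ratio_c)
print("Same two temperatures in Fahrenheit:", round(ratio_f, 2))
```

The same two temperatures give a ratio of 2 on one scale and about 1.53 on the other, so "twice as hot" is a product of the scale and not of the heat.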
It is also important to note that as you go up from nominal to ordinal to interval and now to ratio, each of the later scales of measurement includes the characteristics of what came before. And in fact, the key idea of the ratio scale is that there is an absolute zero point. So it is very similar to interval, with the important difference that it does have an absolute zero point. And one is allowed to conduct virtually any inferential statistical analysis.
What I'd like to leave you with, though, because I think most people get confused as they struggle with the difference between the interval and ratio scales of measurement, is this example. To measure heat or temperature we use a thermometer. If you think about it, it does not make sense to say that 40 degrees is twice as hot as 20 degrees. No, that's an interval scale: I can look at 20 to 30 and 30 to 40, and that's an equal amount, but I can't make ratio comparisons that this is twice as much as that. For a ratio scale, on the other hand, I think the most common example would be length. If I measure a board and this board is 5 feet and that board is 10 feet, then I can make the ratio comparison that the second board is twice as long as the first board, because zero length has meaning. But zero does not have that meaning when it comes to temperature in degrees Fahrenheit or Celsius. So I hope you walk away from this with that important distinction. If not, please study this, whether that be in a textbook that you may have or by Googling it, so that you make sure that you understand the difference.
So here's a nice chart that lets you know when it's OK to compute these different kinds of calculations. As you can see, under nominal there really is only one type of computation I can do, and that would be, well, a frequency distribution, or counting something up, for example.
So that means being able to count your colors, which color was the most preferred, for example, so the most of something. With ordinal, I can do a little bit more. I can certainly count up as well, so the preferred candidates, but I can also compute things like medians and percentiles. And you can see that for interval I can do just about every kind of calculation except those that involve ratios, whereas with a ratio scale of measurement I can do all of these things.
As we'll see throughout this course, there are two important types of statistics: measures of central tendency and measures of variability. In this particular module we're only going to be concerned with measures of central tendency. In the next module, when we get into descriptive statistics, we're going to explore both of these in more detail.
So there are three measures of central tendency. We have the mean, which we are all familiar with. It is the average of a set of numbers. We have the median, which I think we are also somewhat familiar with, if only from the news, because many statistics are reported using the median, such as home sale prices. The median is simply this: if you take a set of numbers and arrange them in descending order, the median is the number at the midpoint, although sometimes you have to interpolate to find that number. The mode is simply the number in the set of scores that has the greatest frequency. You might say it is the most popular number.
The interesting thing is that, given a normal distribution, these are all the same number. So here you have a graphical representation of the normal distribution, the classic bell curve. If you look at the apex of the curve and trace it down to the x axis, that's going to identify the mean, the median, and the mode. So a normal distribution is symmetrical, and when you have a normal distribution the mean, median, and mode are all the same value.
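To make the three measures concrete, here is a minimal Python sketch using the standard `statistics` module. The scores are invented for illustration: a small, perfectly symmetrical set, which is only a stand-in for a true normal distribution. The second, even-sized set shows the interpolation mentioned above for the median.

```python
from statistics import mean, median, mode

# Hypothetical, perfectly symmetrical set of scores (invented for illustration).
symmetric_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5]

symmetric_mean = mean(symmetric_scores)
symmetric_median = median(symmetric_scores)
symmetric_mode = mode(symmetric_scores)
print("Mean:", symmetric_mean)
print("Median:", symmetric_median)
print("Mode:", symmetric_mode)

# With an even number of scores there is no single middle value,
# so the median is interpolated as halfway between the two middle scores.
even_scores = [10, 20, 30, 40]
interpolated_median = median(even_scores)
print("Interpolated median:", interpolated_median)
```

In the symmetrical set all three measures land on the same value, which is the pattern the bell curve shows at its apex.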
But that is not the case when you have a skewed distribution. In this graphic, and I apologize that it's a little fuzzy, we have the normal distribution again in the middle, but you see on the left and the right two skewed distributions. They are not symmetrical. The one on the left is negatively skewed: you can see the numbers are rather bunched up on the right-hand side, with the tail going toward the left. We call it negatively skewed because we follow the tail going in the negative direction. Positively skewed is the mirror opposite. You see the numbers bunched more toward the left and the tail going to the right. Now notice that in the two skewed distributions the mean, median, and mode do not represent the same value. The mode, of course, can again be traced down from the tallest part of the curve, given that it has the highest frequency. On the positively skewed distribution the mode is far to the left, and we see the mean is far to the right. Somewhere in the middle is going to be the median.
Now this is very important, and I actually have another video to share with you where we really demystify the normal distribution, to make sure we understand where it comes from and how it is generated. So I do recommend that you watch that video and make sure that you understand the normal distribution very, very well. The reason is that the statistics that we're going to be calculating are all based on the assumption that your data are normally distributed.
There is one classic example that I think helps to show why I would use the median instead of the mean and what a skewed distribution really means. This is an exaggeration, but it drives the point home pretty well. You often hear about median income. So imagine a neighborhood of 1,000 people, and in that neighborhood there are 999 extremely poor people. Let's just say they each have an income of maybe $1,000 a year. But also in that neighborhood you have one billionaire who earns $1 billion a year.
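Here is that neighborhood worked out in a few lines of Python, as a rough sketch and not part of the course's Excel materials. The variable names are mine, and every one of the 999 residents is assumed to earn exactly $1,000.

```python
from statistics import mean, median

# The exaggerated neighborhood: 999 people at $1,000 a year, one at $1 billion.
incomes = [1_000] * 999 + [1_000_000_000]

mean_income = mean(incomes)
median_income = median(incomes)

print(f"Mean income:   ${mean_income:,.0f}")
print(f"Median income: ${median_income:,.0f}")
```

The single extreme income pulls the mean above a million dollars, while the median stays at the $1,000 that describes almost everyone.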
Well, if I add up all 1,000 incomes and divide by 1,000, you're going to get a data point that says, "Wow, that's a pretty rich neighborhood! I need to go live there." Well, no, it is obviously a very poor neighborhood. The best measure of central tendency, the number that best captures or represents the entire group, is not going to be the mean. Instead, the median is going to do a much better job: if you take the middle income of all those incomes, that's going to be a much better representation of the income of that entire neighborhood. It's the same thing with the median price of a home. You often have neighborhoods where several homes might be quite exquisite, or just over the top in terms of what they're worth, and the vast majority of the other homes are not worth nearly as much. So the median price is a better measure of the value of the homes in that neighborhood.
OK, we're going to stop there. In the first Excel spreadsheet example all we are going to do is compute some averages to get some sense of how that works. I'm actually using it as a starting point because I know all of you understand what an average or mean is. It will be a good example of following me in Excel in a video tutorial and then seeing how we're going to submit that for evaluation. And this concludes this presentation.