Hi everyone!
One of the things that we try to do when we investigate
a new natural phenomenon
is to figure out what factors are correlated with the phenomenon.
For example, in this plot, we see that the number of brain cancer cases in
Europe increased over the period from 1975 to 2008.
One factor that has been mentioned frequently as the cause of this increase
in brain cancer rates is the increased use of cell phones among the population.
The theory is that cell phones emit radiation
which might, depending on the extent of the exposure, cause tumors to form
in the brain tissue. So our goal is to figure out
if there is a correlation between brain cancer incidence
and the frequency of cell phone usage.
Let’s say we can perform an experiment, where we
gradually increase the use of cell phones on a select group of individuals
and then we measure the rate of brain cancer on these individuals
after a number of years. We can then plot
the use of cell phones on the x-axis as
the number of hours per day and the rate of
brain cancer incidence on the y-axis. Now these variables
have special names. The X variable is the variable we control
during the experiment so it is called the independent variable.
The Y variable, on the other hand, is the variable we measure
in response to a certain value of the X variable. So we
call this the dependent variable. The type of plot you see
here, that we use to represent these two variables, is called
a scatter plot. We would expect
the following results. If there is a correlation
you would see some kind of a trend between
the two variables. We can either
see that as one variable goes up, the other one also goes up.
This is what we call a positive correlation.
However, we can also see that as one variable goes up,
the other variable goes down. This is what we call a negative
correlation. If there is no correlation,
then you should see no trend. When one variable goes up,
the other one might go up, but it also might go down. So
overall there is no specific trend.
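As a side note, the strength and sign of a trend can be quantified with the Pearson correlation coefficient. Here is a minimal Python sketch; the data values are made up for illustration and are not part of the course material:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient: +1 for a perfect positive
    correlation, -1 for a perfect negative one, near 0 for none."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# As one variable goes up, the other goes up: positive correlation
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# As one variable goes up, the other goes down: negative correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```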
When two variables show a correlation, the next step
is to try to fit the data. What does “data fitting” mean?
What it means is that we want to see if
the shape of that data can be modeled using an equation.
Why do we do this?
Fitting allows us to convert what we see as a
qualitative relationship between the two variables into a
quantitative one that is given by the equation. So here is
an example. Let’s say you have this data
that shows the number of homes being sold as a function of
their price. We logically expect
that as the houses get more expensive
you should be able to sell only a few of them. So you see a negative
correlation here between the two variables. But let’s say you are a homebuilder,
and you ask this question: how many
houses would I be able to sell if I were to price them at
$243,500? That is not
one of your data points, because your nearest data points are at $240,000
and $260,000. Since we do not have
data for this exact price, can we really answer
the question using this particular plot? The answer is YES!
And here is where data fitting is useful.
We see that this data has a linear, or line,
shape. So what we can do is we can draw a line through
these data points as shown here.
This line is called the best-fit line. It
is a line drawn in such a way that it minimizes the distance
between each data point and the line.
Once we have the best-fit line,
we can generate a line equation, which, as you should know,
is of the form y = mx + b. In this example,
the line equation is y = -0.7929x + 249.9.
Now the question is
how can we use this line equation to help us determine
how many houses we can sell at $243,500?
The only thing we need to do is just rewrite this equation in
terms of the actual variables we use.
So y, the number of homes sold, is equal to
-0.7929x, but x is the price of the house, so we just put
Price here, plus 249.9. And then,
to figure out how many we would sell at $243,500,
we would just plug in 243.5 for the Price,
and what we get is about 56.8 homes. So in other words,
what that tells us is that even though we did not do the experiment
in other words we did not know exactly how many houses we would
be able to sell at $243,500, we can predict
what it would be using the equation of the (best-fit) line.
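That plug-in step can be written out as a small Python sketch, using the slope and intercept from the example (note the units: the price goes in as thousands of dollars):

```python
# Best-fit line from the example: y = -0.7929 x + 249.9,
# where x is the house price in thousands of dollars
# and y is the number of homes sold.
slope = -0.7929
intercept = 249.9

def homes_sold(price_in_thousands):
    """Predict the number of homes sold at a given price
    using the best-fit line equation."""
    return slope * price_in_thousands + intercept

# Prediction at $243,500, i.e. x = 243.5
print(round(homes_sold(243.5), 1))  # about 56.8
```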
So in science we do this all the time. First
we collect data. Then we plot
usually a scatter plot,
the variables together, the x and y, and see if they display
any kind of correlation. When we see that they are correlated,
we can then try to fit the data with a geometric
shape, for example a line. We then generate an equation for
that shape, the line equation, and then we use that equation to
predict future experimental results
without actually having to perform those experiments.
Alright, now let’s talk about some other aspects of data fitting.
So even though we can draw the best-fit line using a pencil and
a ruler, most often we will not be doing this. Instead,
we will use a computer program to draw the line.
The program draws the line using a procedure called the least-squares method.
This is a mathematical procedure and we really do not have time to cover it
in this class. If you are interested, you should consult a freshman
statistics course.
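To give a feel for what the program is doing, here is a pure-Python sketch of the least-squares formulas for a line fit; the price data below is made up, but chosen to resemble the earlier example:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b.
    m = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b = ybar - m*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    m = sxy / sxx
    b = ybar - m * xbar
    return m, b

# Hypothetical data: price (in thousands of dollars) vs. homes sold
prices = [200.0, 220.0, 240.0, 260.0, 280.0]
homes = [92.0, 75.0, 61.0, 44.0, 29.0]
m, b = fit_line(prices, homes)
print(m, b)  # slope near -0.785, intercept near 248.6
```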
Next thing to find out is whether the line fits the data well or not.
The reason is that you can draw a line through pretty much any collection of data points.
So the line equation is only meaningful if the
two variables are correlated. To determine whether two
variables are correlated or not, we use a parameter called the
coefficient of determination, which is symbolized as R-squared.
The value of R-squared ranges from zero, where there is no correlation,
between x and y, to 1.0, where there is perfect correlation between
x and y. In Chem 11, we
would consider x and y to be well-correlated if the R-squared
is 0.95 or higher. Obviously
the higher the R-squared, the better the correlation.
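Concretely, R-squared compares the scatter around the best-fit line to the scatter around the mean. A small sketch with toy numbers (not the course's datasets):

```python
def r_squared(xs, ys, m, b):
    """R-squared for the line y = m*x + b: one minus the ratio of the
    residual sum of squares to the total sum of squares."""
    ybar = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Perfectly linear data: every point sits on y = 2x, so R-squared is 1.0
print(r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 2.0, 0.0))
```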
Here, as you can see, are three data sets, shown with their best-fit
lines and their R-squared values.
As I mentioned earlier, you can draw a line pretty much through any collection of data points,
so in this case, the data on the left has no
correlation, R-squared is equal to zero, but we see that we can still
draw a best-fit line. But the line here is meaningless.
It just has no predictive value, because
the variables themselves are not correlated.
The middle data set shows an R-squared around 0.5. Now you
might think that this is not a very high R-squared, but this is generally considered
reasonable in fields where it is impossible to
perform controlled experiments, for example in sociology, in education,
et cetera.
The plot on the right, is what we are trying to get to. It is a perfect
correlation, R-squared equals one. But this really does not happen in
reality, so you should not worry about it too much.
One other issue with data fitting is the
presence of outliers. An outlier is a data point that
lies far away from all the other data points in the trend.
So for example, here is data that shows the number of
hours per week that somebody exercises as a function of
how many months they have owned an exercise machine.
You can see that there is a negative correlation here which means that the
longer somebody owns a machine, the less likely they are to exercise on it.
However, you can see that there is a data point here
that lies far away from the trend. The trend goes this way, but
this data point is far away from the rest of the data points.
So this is what we consider an outlier.
So generally speaking, in chemistry, an outlier
is usually the result of an experimental error. So we generally
remove outliers from our dataset.
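One way a computer could flag such a point is by its residual from the best-fit line. This is just one possible rule of thumb, sketched with made-up numbers; the two-standard-deviation threshold is a hypothetical choice, not a course standard:

```python
def flag_outliers(xs, ys, m, b, threshold=2.0):
    """Flag points whose residual from the line y = m*x + b deviates
    from the mean residual by more than `threshold` standard deviations."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    n = len(residuals)
    mean_r = sum(residuals) / n
    sd = (sum((r - mean_r) ** 2 for r in residuals) / n) ** 0.5
    return [abs(r - mean_r) > threshold * sd for r in residuals]

# Five of these points sit exactly on the line y = -2x + 14;
# the fourth point (y = 20) does not.
flags = flag_outliers([1, 2, 3, 4, 5, 6], [12, 10, 8, 20, 4, 2], -2.0, 14.0)
print(flags)  # only the fourth point is flagged
```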
I mentioned earlier that we use equations
that we get from the data to make predictions on future experimental results.
There are two types of predictions that you can make.
Interpolation is predicting future
results that fall within the existing data range,
whereas extrapolation is predicting results outside
the existing data range. So let’s say you have this data right here,
given by the dots, and then you have your best-fit line,
shown by these dashed lines right here. Interpolation,
means you’re trying to predict what happens within the data range,
but you do not have the data point for it, for example this data right here.
So the exercise we did earlier,
predicting the number of homes we could sell at
$243,500, is interpolation. Extrapolation means that you’re trying to
predict what is going to happen outside the data range, which is somewhere around here.
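The distinction boils down to whether the new x value lies inside the range of the existing x data. A tiny sketch (the price list is hypothetical):

```python
def prediction_type(x_new, xs):
    """Interpolation if x_new falls inside the existing x data range,
    extrapolation if it falls outside."""
    return "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"

# Hypothetical house prices in thousands of dollars
prices = [200.0, 220.0, 240.0, 260.0, 280.0]
print(prediction_type(243.5, prices))  # interpolation
print(prediction_type(320.0, prices))  # extrapolation
```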
So, let’s go back to the example that we looked at earlier,
which is the number of hours somebody exercises
after they own a certain exercise machine. We can see that if
we extrapolate this data, then after someone owns a machine
for fifteen months or more, they would not
use it to exercise anymore (because the y-value is zero). Now that clearly
is a questionable interpretation of the data. It is more
likely that the person will still use the machine from time to time,
but not very frequently. So
instead of drawing it as an intercept (on the x-axis),
it probably makes more sense to draw it as a
curve, with a plateauing effect
somewhere around here, where the machine still gets
some hours of use, just not very frequently.
What this is telling you is that extrapolation is quite
dangerous because it is possible that the relationship between x and y
changes after a given value of x, as
in our example right here.
I hope you get a feel for how plotting and data fitting are
important in science. After watching this video, you should be able to
understand why scientists must plot the data they collect,
and you should understand the meaning of the various terms I used in this video.
What is a best-fit line?
What is R-squared?
What are outliers? And what do we mean by
interpolation and extrapolation?