Hi everyone!
One of the things that we try to do when we investigate
a new natural phenomenon
is to figure out what factors are correlated with the phenomenon.
For example, in this plot, we see that the number of brain cancer cases in
Europe increased over the period from 1975 to 2008.
One factor that has been mentioned frequently as the cause of this increase
in brain cancer rates is the increased use of cell phones among the population.
The theory is that cell phones emit radiation
which might, depending on the extent of the exposure, cause tumors to form
in the brain tissue. So our goal is to figure out
if there is a correlation between brain cancer incidence
and the frequency of cell phone usage.
Let’s say we can perform an experiment, where we
gradually increase the use of cell phones on a select group of individuals
and then we measure the rate of brain cancer on these individuals
after a number of years. We can then plot
the use of cell phones on the x-axis as
the number of hours per day and the rate of
brain cancer incidence on the y-axis. Now these variables
have special names. The X variable is the variable we control
during the experiment so it is called the independent variable.
The Y variable, on the other hand, is the variable we measure
in response to a certain value of the X variable. So we
call this the dependent variable. The type of plot you see
here, that we use to represent these two variables, is called
a scatter plot. We would expect
the following results. If there is a correlation
you would see some kind of a trend between
the two variables. We can either
see that as one variable goes up, the other one also goes up.
This is what we call a positive correlation.
However, we can also see that as one variable goes up,
the other variable goes down. This is what we call a negative
correlation. If there is no correlation,
then you should see no trend. When one variable goes up,
the other one might go up, but it also might go down. So
overall there is no specific trend.
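As a side note, the strength and sign of a trend can be quantified with the Pearson correlation coefficient. Here is a minimal Python sketch; the data values are made up for illustration and are not part of the course material:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient: +1 for a perfect positive
    correlation, -1 for a perfect negative one, near 0 for none."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# As one variable goes up, the other goes up: positive correlation
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# As one variable goes up, the other goes down: negative correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```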
When two variables show a correlation, the next step
is to try to fit the data. What does “data fitting” mean?
What it means is that we want to see if
the shape of that data can be modeled using an equation.
Why do we do this?
Fitting allows us to convert what we see as a
qualitative relationship between the two variables into a
quantitative one that is given by the equation. So here is
an example. Let’s say you have this data
that shows the number of homes being sold as a function of
their price. We logically expect
that as the houses get more expensive
you should be able to sell only a few of them. So you see a negative
correlation here between the two variables. But let’s say you are a homebuilder,
and you ask this question: how many
houses would I be able to sell if I were to price them at
$243,500? That is not
one of your data points, because your nearest data points are at $240,000
and $260,000. Since we do not have
data for this exact price, can we really answer
the question using this particular plot? The answer is YES!
And here is where data fitting is useful.
We see that this data has a linear, or line,
shape. So what we can do is we can draw a line through
these data points as shown here.
This line is called the best-fit line. It
is a line drawn in such a way that it minimizes the distance
between each data point and the line.
Once we have the best-fit line,
we can generate a line equation, which, as you should know,
is of the form y = mx + b. In this example,
the line equation is y = -0.7929x + 249.9.
Now the question is
how can we use this line equation to help us determine
how many houses we can sell at $243,500?
The only thing we need to do is just rewrite this equation in
terms of the actual variables we use.
So y, the number of homes sold, is equal to
-0.7929x, but x is the price of the house, so we just put
Price here, plus 249.9. And then,
to figure out how many we would sell at $243,500,
we would just plug in 243.5 for the Price,
and what we get is about 56.8 homes. So in other words,
what that tells us is that even though we did not do the experiment
in other words we did not know exactly how many houses we would
be able to sell at $243,500, we can predict
what it would be using the equation of the (best-fit) line.
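That plug-in step can be written out as a small Python sketch, using the slope and intercept from the example (note the units: the price goes in as thousands of dollars):

```python
# Best-fit line from the example: y = -0.7929 x + 249.9,
# where x is the house price in thousands of dollars
# and y is the number of homes sold.
slope = -0.7929
intercept = 249.9

def homes_sold(price_in_thousands):
    """Predict the number of homes sold at a given price
    using the best-fit line equation."""
    return slope * price_in_thousands + intercept

# Prediction at $243,500, i.e. x = 243.5
print(round(homes_sold(243.5), 1))  # about 56.8
```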
So in science we do this all the time. First
we collect data. Then we plot
usually a scatter plot,
the variables together, the x and y, and see if they display
any kind of correlation. When we see that they are correlated,
we can then try to fit the data with a geometric
shape, for example a line. We then generate an equation for
that shape, the line equation, and then we use that equation to
predict future experimental results
without actually having to perform those experiments.
Alright, now let’s talk about some other aspects of data fitting.
So even though we can draw the best-fit line using a pencil and
a ruler, most often we will not be doing this. Instead,
we will use a computer program to draw the line.
The program draws the line using a procedure called the least-squares method.
This is a mathematical procedure and we really do not have time to cover it
in this class. If you are interested, you should consult a freshman
statistics course.
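To give a feel for what the program is doing, here is a pure-Python sketch of the least-squares formulas for a line fit; the price data below is made up, but chosen to resemble the earlier example:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b.
    m = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b = ybar - m*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    m = sxy / sxx
    b = ybar - m * xbar
    return m, b

# Hypothetical data: price (in thousands of dollars) vs. homes sold
prices = [200.0, 220.0, 240.0, 260.0, 280.0]
homes = [92.0, 75.0, 61.0, 44.0, 29.0]
m, b = fit_line(prices, homes)
print(m, b)  # slope near -0.785, intercept near 248.6
```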
Next thing to find out is whether the line fits the data well or not.
The reason is that you can draw a line through pretty much any collection of data points.
So the line equation is only meaningful if the
two variables are correlated. To determine whether two
variables are correlated or not, we use a parameter called the
coefficient of determination, which is symbolized as R-squared.
The value of R-squared ranges from zero, where there is no correlation,
between x and y, to 1.0, where there is perfect correlation between
x and y. In Chem 11, we
would consider x and y to be well-correlated if the R-squared
is 0.95 or higher. Obviously
the higher the R-squared, the better the correlation.
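Concretely, R-squared compares the scatter around the best-fit line to the scatter around the mean. A small sketch with toy numbers (not the course's datasets):

```python
def r_squared(xs, ys, m, b):
    """R-squared for the line y = m*x + b: one minus the ratio of the
    residual sum of squares to the total sum of squares."""
    ybar = sum(ys) / len(ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Perfectly linear data: every point sits on y = 2x, so R-squared is 1.0
print(r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 2.0, 0.0))
```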
Here, as you can see, are three data sets, shown with their best-fit
lines and their R-squared values.
As I mentioned earlier, you can draw a line pretty much through any collection of data points,
so in this case, the data on the left has no
correlation, R-squared is equal to zero, but we see that we can still
draw a best-fit line. But the line here is meaningless.
It just has no predictive value, because
the variables themselves are not correlated.
The middle data set shows an R-squared around 0.5. Now you
might think that this is not a very high R-squared, but this is generally considered
reasonable in fields where it is impossible to
perform controlled experiments, for example in sociology, in education,
et cetera.
The plot on the right, is what we are trying to get to. It is a perfect
correlation, R-squared equals one. But this really does not happen in
reality, so you should not worry about it too much.
One other issue with data fitting is the
presence of outliers. An outlier is a data point that
lies far away from all the other data points in the trend.
So for example, here is data that shows the number of
hours per week that somebody exercises as a function of
how many months they have owned an exercise machine.
You can see that there is a negative correlation here which means that the
longer somebody owns a machine, the less likely they are to exercise on it.
However, you can see that there is a data point here
that lies far away from the trend. The trend goes this way, but
this data point is far away from the rest of the data points.
So this is what we consider an outlier.
So generally speaking, in chemistry, an outlier
is usually the result of an experimental error. So we generally
remove outliers from our dataset.
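One way a computer could flag such a point is by its residual from the best-fit line. This is just one possible rule of thumb, sketched with made-up numbers; the two-standard-deviation threshold is a hypothetical choice, not a course standard:

```python
def flag_outliers(xs, ys, m, b, threshold=2.0):
    """Flag points whose residual from the line y = m*x + b deviates
    from the mean residual by more than `threshold` standard deviations."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    n = len(residuals)
    mean_r = sum(residuals) / n
    sd = (sum((r - mean_r) ** 2 for r in residuals) / n) ** 0.5
    return [abs(r - mean_r) > threshold * sd for r in residuals]

# Five of these points sit exactly on the line y = -2x + 14;
# the fourth point (y = 20) does not.
flags = flag_outliers([1, 2, 3, 4, 5, 6], [12, 10, 8, 20, 4, 2], -2.0, 14.0)
print(flags)  # only the fourth point is flagged
```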
I mentioned earlier that we use equations
that we get from the data to make predictions on future experimental results.
There are two types of predictions that you can make.
Interpolation is predicting future
results that fall within the existing data range,
whereas extrapolation is predicting results outside
the existing data range. So let’s say you have this data right here,
given by the dots, and then you have your best-fit line,
shown by these dashed lines right here. Interpolation,
means you’re trying to predict what happens within the data range,
but you do not have the data point for it, for example this data right here.
So the exercise we did earlier,
predicting the number of homes we could sell at
$243,500, is interpolation. Extrapolation means that you’re trying to
predict what is going to happen outside the data range, which is somewhere around here.
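The distinction boils down to whether the new x value lies inside the range of the existing x data. A tiny sketch (the price list is hypothetical):

```python
def prediction_type(x_new, xs):
    """Interpolation if x_new falls inside the existing x data range,
    extrapolation if it falls outside."""
    return "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"

# Hypothetical house prices in thousands of dollars
prices = [200.0, 220.0, 240.0, 260.0, 280.0]
print(prediction_type(243.5, prices))  # interpolation
print(prediction_type(320.0, prices))  # extrapolation
```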
So, let’s go back to the example that we looked at earlier,
which is the number of hours somebody exercises
after they own a certain exercise machine. We can see that if
we extrapolate this data, then after someone owns a machine
for fifteen months or more, they would not
use it to exercise anymore (because the y-value is zero). Now that clearly
is a questionable interpretation of the data. It is more
likely that the person will still use the machine from time to time,
but not very frequently. So
instead of drawing it as an intercept (on the x-axis),
it probably makes more sense to draw it as a
curve, with a plateauing effect
somewhere around here, where the machine still gets
some hours of use, just not very frequently.
What this is telling you is that extrapolation is quite
dangerous because it is possible that the relationship between x and y
changes after a given value of x, as
in our example right here.
I hope you get a feel for how plotting and data fitting are
important in science. After watching this video, you should be able to
understand why scientists must plot the data they collect,
and you should understand the meaning of the various terms I used in this video.
What is a best-fit line?
What is R-squared?
What are outliers? And what do we mean by
interpolation and extrapolation?