[ Music ]
>> Our next speaker is Bradley Patterson.
He comes from the department of statistics
at George Mason University.
And the title of his presentation is ROC curves
For Methods of Evaluating Evidence,
a Common Performance Measure Based on Similarity Scores.
>> Good morning, everyone.
My name is Brad Patterson and I am a graduate student
at George Mason University.
First of all I would like to thank the conference organizers
very much for the invitation to present here today.
It is a great conference and I've been learning a lot.
I'd also like to thank everyone in the audience for attending.
In addition, I'd like to recognize my collaborators
with this work, Professor Miller and Professor Saunders,
both at George Mason University.
Thank you both very much.
So the title of this presentation is ROC curves
for methods of evaluating evidence,
common performance measures based on similarity scores.
And before I go any further I would also
like to express my gratitude to the NIJ
for partly supporting this research,
and I need to issue the standard disclaimer
that anything I present reflects my own views
and not necessarily those of the Department of Justice.
So here's an outline of what I'll present.
I'll begin with an introduction,
in which I will discuss the forensic setting
and give a first look at ROC curves.
Then I'll move on to a background where I'll go
into more detail about ROC curves, and in the analysis
and results section I'll discuss the particular problem
that we studied and share our findings,
and I'll end with a conclusion.
So let's move on now to the introduction.
First of all, here is the forensics context.
Let's suppose that we have two samples of observations.
We'll call one the control and the other the recovered.
We can form two hypotheses with these two samples.
The first hypothesis is that the two samples originate
from the same source, and the second hypothesis is
that they originate from different sources.
Now the methods of evaluating evidence whose performance
we would like to assess all follow the same general pattern.
They take as input both the control sample
and the recovered sample,
and as output they generate a similarity score.
So what is a similarity score?
Well, it's a numerical value that indicates the degree
of association between the two input samples.
And for our purposes we will assume
that higher values are more indicative of a common source
between the two input samples.
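To make the idea concrete, here is a toy Python sketch of
a similarity score (a made-up example for illustration, not
one of the methods discussed in this talk):

```python
import numpy as np

def toy_similarity(control, recovered):
    """Toy similarity score: negative absolute difference of sample means.

    Higher values indicate greater similarity, matching the convention
    that higher scores are more indicative of a common source.
    """
    return -abs(np.mean(control) - np.mean(recovered))

# Example with made-up measurements.
control = np.array([1.5180, 1.5182, 1.5181])
recovered = np.array([1.5179, 1.5183])
print(toy_similarity(control, recovered))  # closer means -> higher score
```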
Next I'd like to discuss thresholds a bit.
Some methods of evaluating evidence may have
nominal thresholds.
For example, if you're using a statistical test
of significance, you could consider the critical value
for a Type I error of 5%, or if you're working
with a likelihood ratio you could consider the number one
as a cutoff.
But in general, a threshold is a fixed cutoff
on the similarity scores.
And if the similarity score is above the threshold we will say
that the method indicates
that the two inputs have a common source.
And if the similarity score is below the threshold we would say
that the method suggests
that the two inputs have different sources.
Now to put this into the language
of ROC curves we'll label the case of common source
as positive and we'll label the case
of different sources as negative.
Now in order to actually evaluate the performance
of a method we need to work
with data whose true sources we actually know.
This would be called reference data or method evaluation data.
Then there are two types of errors that we may observe.
First a false positive.
This occurs when we have two samples that originated
from different sources and yet the method suggests
that they came from the same source.
Another type of error is the false negative, which occurs
when we have two samples from the same source
but the methods suggest that they come
from different sources.
And here are the four different outcomes that we might observe
with a single fixed threshold.
The columns here are labeled according to the truth:
the left column is positive, for pairs from the same source,
and the right column is negative for pairs
from different sources.
And then we have on the rows the indication from the method
of evaluating the evidence,
whether that indication is positive or negative.
And you can see here where the false positives
and the false negatives land.
Now it turns out that the column totals on the left
and the right are both fixed by your method evaluation data.
So they're constant, determined by that data.
So instead of working with absolute numbers
of false negatives and false positives we can convert those
to rates by dividing each by its column total.
So here are the formulas for that,
for both the false positive rate and the false negative rate,
and for reasons that we'll see in a minute I'm going
to introduce the true positive rate which is equal
to 1 minus the false negative rate.
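As a minimal sketch of those formulas (assuming NumPy arrays
of scores and known true labels, which are illustrative
assumptions, not part of the talk):

```python
import numpy as np

def error_rates(scores, labels, threshold):
    """FPR, FNR, and TPR at a fixed threshold.

    labels: 1 for same-source (positive) pairs, 0 for
    different-source (negative) pairs. Scores at or above the
    threshold indicate a common source.
    """
    pred_pos = scores >= threshold
    n_pos = np.sum(labels == 1)  # positive column total
    n_neg = np.sum(labels == 0)  # negative column total
    fpr = np.sum(pred_pos & (labels == 0)) / n_neg
    fnr = np.sum(~pred_pos & (labels == 1)) / n_pos
    tpr = 1 - fnr
    return fpr, fnr, tpr
```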
So working with a single fixed threshold you would have just a
pair of error rates.
But let's now imagine varying the threshold
and seeing what all possible error rates are.
If we do that, we would generate a receiver operating
characteristic plot, or ROC plot.
So here are the axes --
the vertical axis is the true positive rate
and the horizontal axis is the false positive rate.
And there are two curves that I've drawn on here,
the light gray one is the true theoretical ROC curve,
and for illustrative purposes I've drawn a random sample
from that.
So this is made up data.
And I've computed the empirical ROC curve, which is shown
by the blue step function.
So let's understand a little bit
about how varying the threshold leads
to the different points on the ROC curve.
First of all, the highest value of the threshold occurs
at the start of the ROC curve.
And then as we gradually lower the threshold we march
up along the blue curve.
And at the end the threshold is smallest.
So let's take a couple of examples to understand
where the points on the ROC curve come from.
Suppose with these simulated data
that we apply a threshold of 2.37.
Then the number of false positives is zero,
and the column total is 100.
So the false positive rate is zero and the number
of true positives is twenty.
And again, for the positives the column total is 100.
So that gives a true positive rate of 0.2.
And you can see the point there is
at the coordinates zero and 0.2.
Next, if we lower the threshold a little bit more we observe ten
false positives and ten divided
by 100 gives a false positive rate of 0.1,
and 60 true positives, and 60 divided by 100 is 0.6.
So that gives a true positive rate of 0.6 and we can see
that the coordinates of that point are 0.1 and 0.6.
And I won't do the last table, but you can quickly see
that the coordinates of that point are 0.35 and 0.9.
Now it turns out you don't have to actually go
through the laborious process of calculating these tables
for every point on an ROC curve.
Instead, the ROC curve depends on only the order
of the similarity scores from the positives and negatives.
And I'll explain that more in the background section.
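As a sketch of that rank-based computation (a simplified
version that ignores tied scores, not the speaker's code):

```python
import numpy as np

def empirical_roc(pos_scores, neg_scores):
    """Empirical ROC curve from similarity scores alone.

    Only the ordering of the pooled scores matters: sweeping the
    threshold from high to low, each positive score moves the
    curve up and each negative score moves it right. Tied scores
    are ignored for simplicity.
    """
    scores = np.concatenate([pos_scores, neg_scores])
    labels = np.concatenate([np.ones(len(pos_scores)),
                             np.zeros(len(neg_scores))])
    labels = labels[np.argsort(-scores)]  # sort descending by score
    tpr = np.concatenate([[0.0], np.cumsum(labels) / len(pos_scores)])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / len(neg_scores)])
    return fpr, tpr
```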
Here is the specific application to forensics we considered.
We wanted to assess the performance
of different statistical methods of evaluating evidence
in the form of glass fragments.
So let's now go on to look a little bit more
at the background of ROC curves.
First, they appeared in the 1940s
to assess performance in reading radar signals.
Then in the 1950s and 1960s they spread
to signal detection theory in general, and a seminal book
from that time is by Green and Swets.
And then Swets and Pickett in 1982 wrote
about diagnostic systems and featured ROC curves.
And that really popularized ROC curves in medicine,
especially in the field of radiology.
And today, ROC curves are used in many fields.
A couple of other examples are machine learning and astronomy.
So now I'd like to give a little deeper overview of ROC curves.
As I mentioned before,
the vertical axis is the true positive rate
and the horizontal axis is the false positive rate.
And they both range between 0 and 1.
Then let's look at a few samples of ROC curves.
A perfect ROC curve is demonstrated by this red line.
It starts at 0, 0, runs up to 0, 1,
and then runs straight across to 1, 1.
And this indicates that if you choose the threshold
at the top left corner,
where the coordinates are 0 and 1,
the method of evaluating evidence has not committed any errors.
On the other hand, you could have the opposite extreme
where the method does no better than randomly guessing.
And if that occurred, the ROC curve would follow the green
45-degree line.
And usually in practice,
an ROC curve will be somewhere between those two.
I hope that you can get a sense
that the closer the ROC curve comes
to the upper left corner the better the method is performing.
Here are some important properties of ROC curves.
First of all, they show the complete range of error rates
across all possible thresholds.
And another very valuable aspect of ROC curves is
that they are independent of the scale of similarity scores.
So it does not matter how the similarity scores are normalized
or calibrated.
All that matters is the order of the similarity scores
from the positives and negatives.
Another way of thinking
about this is the ROC curve is a rank-based method,
and a related property is that the ROC curve is invariant
under strictly increasing monotone transformations
of the underlying similarity scores.
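Here is a small sketch of that invariance, using scikit-learn's
roc_auc_score and made-up normal scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 100)  # same-source scores (made up)
neg = rng.normal(0.0, 1.0, 100)  # different-source scores (made up)
y = np.concatenate([np.ones(100), np.zeros(100)])
s = np.concatenate([pos, neg])

# exp is strictly increasing, so it preserves the ordering of the
# scores and leaves the ROC curve and AUC unchanged.
print(roc_auc_score(y, s))
print(roc_auc_score(y, np.exp(s)))  # identical value
```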
Next I'd like to discuss some
of the insightful performance measures that ROC curves offer.
As I've been saying, they show all attainable error rates
across all possible thresholds.
Those are the points themselves on the curve.
And usually the region in the upper left is of most interest.
That is where the false positive rate
and the false negative rate are both lowest.
If you're interested in finding the equal error rate
which is the point where the false positive rate equals the
false negative rate, you can do
that by drawing this anti-45-degree line and finding the point
where it intersects the ROC curve.
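On an empirical curve you can approximate that intersection
numerically; a minimal sketch, reusing the fpr and tpr arrays
from the empirical_roc sketch above:

```python
import numpy as np

def equal_error_rate(fpr, tpr):
    """Approximate equal error rate on an empirical ROC curve.

    Finds the curve point closest to where the false positive
    rate equals the false negative rate (1 - TPR), then averages
    the two rates there; a step-function curve may not hit the
    anti-45-degree line exactly.
    """
    fnr = 1 - tpr
    i = np.argmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2
```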
In addition, if you have a theoretical ROC curve
for which you can calculate the derivative or find the slope,
where that is equal to 1, you have the point of equal likelihood.
And probably the most common summary computed
from ROC curves is the area under the curve,
which is usually abbreviated as the AUC.
This is equal to the probability that a similarity score
from two random samples from the same source is greater
than the similarity score from two random samples
from different sources.
And it also runs between 0 and 1.
And again, hopefully you can get a sense
that if the curve moves more quickly
to the upper left corner the area
under the curve is going to be valued higher.
So larger values of the AUC indicate better performance.
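That probability interpretation gives a direct way to compute
the AUC without integrating the curve; here is a sketch of the
standard pairwise (Mann-Whitney) estimator, with ties counted
as one half:

```python
import numpy as np

def auc_pairwise(pos_scores, neg_scores):
    """AUC as the probability that a random same-source score
    exceeds a random different-source score."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos_scores) * len(neg_scores))
```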
I'll now move on to discuss the analysis that we did.
The data are publicly available on the web site of the Journal
of the Royal Statistical Society, and they were analyzed
by Dr. Aitken and Dr. Lucy in a paper published in the Journal
of the Royal Statistical Society.
We would also like to recognize the person
who originally collected the data.
The data consists of 62 windows
and from each window there are 5 fragments.
Then there are measurements of silicon, potassium, calcium,
and iron on each fragment.
And these can be converted to variables where we take the log
of the ratio of calcium to each of the remaining elements.
So this means we have three variables for each fragment,
and the data are therefore multivariate.
And we selected four methods discussed in the paper
by Aitken and Lucy
and studied their performance.
The first is called multiple T statistics,
and with this method a separate T statistic is calculated
for each variable, and then the one
with the maximum absolute value is used as a similarity score.
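A minimal sketch of that scoring idea, assuming SciPy and
samples stored as fragments-by-variables arrays (the sign is
flipped so that higher values mean more similar; this is an
illustration, not the code used in the study):

```python
import numpy as np
from scipy import stats

def multiple_t_score(control, recovered):
    """Similarity score from multiple t statistics.

    control, recovered: arrays of shape (n_fragments, n_variables).
    A two-sample t statistic is computed per variable, and the
    maximum absolute value is negated so that higher scores
    suggest a common source.
    """
    t, _ = stats.ttest_ind(control, recovered, axis=0)
    return -np.max(np.abs(t))
```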
The Hotelling T-squared statistic takes the multivariate data
as input and outputs a scalar value.
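And here is a sketch of the two-sample Hotelling T-squared as a
similarity score (a textbook version with pooled covariance,
again negated so higher means more similar):

```python
import numpy as np

def hotelling_t2_score(control, recovered):
    """Similarity score from the two-sample Hotelling T-squared.

    Larger T-squared means the mean vectors differ more, so the
    statistic is negated to act as a similarity score.
    """
    m, n = len(control), len(recovered)
    d = control.mean(axis=0) - recovered.mean(axis=0)
    pooled = ((m - 1) * np.cov(control, rowvar=False)
              + (n - 1) * np.cov(recovered, rowvar=False)) / (m + n - 2)
    t2 = (m * n) / (m + n) * d @ np.linalg.solve(pooled, d)
    return -t2
```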
There are also two likelihood ratio methods.
One assumes a normal distribution for the
within variation and a normal distribution
for the between variation.
The other assumes a normal distribution for the
within variation, but then does a density estimate
for the between distribution.
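As a rough illustration of the first likelihood ratio, here is
a univariate sketch with known variance components; the method
in the Aitken and Lucy paper is multivariate and estimates
these quantities from data, so this shows only the core idea:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def normal_normal_lr(xbar, ybar, m, n, mu, sigma2, tau2):
    """Univariate normal-normal likelihood ratio for two sample means.

    sigma2 is the within-source variance, tau2 the between-source
    variance, and mu the overall mean, all assumed known here.
    Under a common source the two means share one random source
    mean, which induces a covariance of tau2 between them.
    """
    vx = tau2 + sigma2 / m  # variance of the control mean
    vy = tau2 + sigma2 / n  # variance of the recovered mean
    num = multivariate_normal.pdf([xbar, ybar], mean=[mu, mu],
                                  cov=[[vx, tau2], [tau2, vy]])
    den = norm.pdf(xbar, mu, np.sqrt(vx)) * norm.pdf(ybar, mu, np.sqrt(vy))
    return num / den  # values above 1 favor a common source
```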
And we would like to thank Dr. Aitken and Dr. Lucy very much
for sharing some updated computer code
to calculate those values.
Now, the most important thing to take away
from these four methods right now is
that we can treat all methods as mappings from two samples
to a similarity score.
And this will then allow us to create ROC curves
for each method and compare them on a level playing field.
Let's move on now to the results.
First I'll show what the error rates are
at some nominal cutoffs.
In this analysis, we did comparison of two
versus three fragments.
But yesterday I heard a very insightful comment by somebody
in the audience, which is
that often times you may have only one observation
for one of your samples.
And I have done that analysis,
but unfortunately I left those results at home.
But I would be happy to share them with some of you.
So let's go across now the methods.
We have the multiple T statistic method,
and for that we use the 5% critical value adjusted
by the Bonferroni method.
For the T squared statistic we use the 5% critical value
and for the two likelihood ratio methods we use the number 1.
Now this is very interesting.
At these nominal cutoffs, the false negative rate
for the two methods on the left is higher
than for the two methods on the right.
However, just the opposite happens when we go to look
at the false positive rate.
The two methods on the left are lower and the two methods
on the right are higher.
So this leads to an ambiguous interpretation,
which is: how are the methods performing overall?
So instead of just using two points we created ROC curves
for all of these methods.
And to start I'd like to call your attention
to the large curves in the background here.
There are actually four curves drawn there,
one for each method.
But the curve essentially looks red because the last method
is drawn on top.
And that's because all
of the methods have ROC curves that nearly overlap.
And what that means, practically,
is that all of the methods achieve very similar error rates
by choosing the right threshold.
Another important aspect here is
that the ROC curves all rise very quickly in the beginning
and they reach a true positive rate of 1 very quickly.
So we should expect that the AUC value, the area
under the curve, will be very close to 1.
And finally, this inset here, in this area, is a zoom-in
on the upper left corner of the ROC curve.
And I've drawn the points on the ROC curve
for which the rates occur at those nominal thresholds,
and I think the thing to take away from this is
that they're scattered about the ROC curves,
and by choosing different thresholds you could achieve
very similar error rates with each of the methods.
So let's now move on to actually looking at the area
under the curve, the AUC values.
We've multiplied these numbers by 100 to convert them to percentages.
These are all very high.
They range from 98.98 to 99.09, which are all very good.
In another field such as medicine, if you had an AUC
of 88 that would be considered quite good.
So these are encouraging numbers.
Here's an example of calculating the equal error rate
with real data.
Remember that the equal error rate occurs
where the anti-45-degree line intersects the ROC curve,
and we find values ranging from 2.61 to 3.23.
And you can note the different thresholds
that would lead to those error rates for each of the methods.
Next I want to talk about a few other applications.
Suppose you have an application
where a 5% false negative rate is deemed acceptable
and you want to find the corresponding false positive
rate.
Well, remember that the false negative rate is equal
to 1 minus the true positive rate.
So the true positive rate here is going to be 0.95.
And we can draw a horizontal line at 0.95 and find
where that intersects the ROC curve
and from there read off the corresponding false
positive rates.
And when we do that, we see
that the false positive rates range from 2.30 to 2.61.
And I've given thresholds there on the right.
In addition, you could start from a different point.
You could say that you wanted to achieve false positive rate
of 5%, and if you did that you would draw a vertical line
that intersects the horizontal axis at 0.05,
and then you could easily read off
where that intersects the ROC curve.
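Both readings are easy to do numerically on an empirical
curve; a sketch, again using the fpr and tpr arrays from the
empirical_roc sketch above:

```python
import numpy as np

def fpr_at_tpr(fpr, tpr, tpr_target=0.95):
    """Smallest observed FPR at which the curve reaches the target
    TPR (i.e. a false negative rate of 1 - tpr_target or better)."""
    i = np.argmax(tpr >= tpr_target)  # first index reaching the target
    return fpr[i]

def tpr_at_fpr(fpr, tpr, fpr_target=0.05):
    """Largest observed TPR whose curve point keeps the FPR at or
    below the target."""
    i = np.searchsorted(fpr, fpr_target, side="right") - 1
    return tpr[i]
```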
And there are many more applications of ROC curves.
You can incorporate a cost-benefit analysis
into figuring out what error rate has some optimal property.
So now I'll wrap up here.
I would like to reiterate that ROC curves
from similarity scores are comprehensive.
And by that I mean that they depict the entire range
of error rates that you can achieve across all thresholds.
I'd also like to mention that ROC curves are comparable and by
that I mean that they are independent
of the scale of the scores.
Remember that they put everything
on the standardized axis of true positive rate
versus false positive rate.
So this allows you to compare different methods
with the same tool.
In addition, the ROC curves lead
to objective performance measures such as the error rates
on the curve itself or the area under the curve, the AUC value.
And it occurs to me, I've been listening to a lot of discussion
about analyzing glass.
I think one other neat potential application
for ROC curves would be to use them
to compare the different instrumentation that is applied
to measuring the elemental composition of glass.
And our application to trace evidence in the form
of glass fragments showed
that the statistical methods all have fairly high performance.
And before I end, I just want to mention
that if you're interested in reading more about ROC curves,
I would recommend starting with the article from 2000 by Swets,
and then also the article from 2006 by Fawcett and the article
from 2007 by Zhou, and if you really want to dig in,
there are some other good books that I recommend.
Thank you all very much.
[ Applause ]