[ Music ]
>> Our next speaker is Bradley Patterson.
He comes from the department of statistics
at George Mason University.
And the title of his presentation is ROC curves
For Methods of Evaluating Evidence,
a Common Performance Measure Based on Similarity Scores.
>> Good morning, everyone.
My name is Brad Patterson and I am a graduate student
at George Mason University.
First of all I would like to thank the conference organizers
very much for the invitation to present here today.
It is a great conference and I've been learning a lot.
I'd also like to thank everyone in the audience for attending.
In addition, I'd like to recognize my collaborators
with this work, Professor Miller and Professor Saunders,
both at George Mason University.
Thank you both very much.
So the title of this presentation is ROC curves
for methods of evaluating evidence,
common performance measures based on similarity scores.
And before I go any further I would also
like to express my gratitude to the NIJ
for partly supporting this research,
and I need to issue the standard disclaimer
that anything I present reflects my own views
and not necessarily those of the Department of Justice.
So here's an outline of what I'll present.
I'll begin with an introduction,
in which I will discuss the forensic setting
and give a first look at ROC curves.
Then I'll move on to a background where I'll go
into more detail about ROC curves, and in the analysis
and results section I'll discuss the particular problem
that we studied and share our findings,
and I'll end with a conclusion.
So let's move on now to the introduction.
First of all, here is the forensics context.
Let's suppose that we have two samples of observations.
We'll call one the control and the other the recovered.
We can form two hypotheses with these two samples.
The first hypothesis is that the two samples originate
from the same source, and the second hypothesis is
that they originate from different sources.
Now the methods of evaluating evidence whose performance
we would like to assess all follow the same general pattern.
They take as input both the control sample
and the recovered sample,
and as output they generate a similarity score.
So what is a similarity score?
Well, it's a numerical value that indicates the degree
of association between the two input samples.
And for our purposes we will assume
that higher values are more indicative of a common source
between the two input samples.
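To make the idea concrete, here is a toy Python sketch of
a similarity score (a made-up example for illustration, not
one of the methods discussed in this talk):

```python
import numpy as np

def toy_similarity(control, recovered):
    """Toy similarity score: negative absolute difference of sample means.

    Higher values indicate greater similarity, matching the convention
    that higher scores are more indicative of a common source.
    """
    return -abs(np.mean(control) - np.mean(recovered))

# Example with made-up measurements.
control = np.array([1.5180, 1.5182, 1.5181])
recovered = np.array([1.5179, 1.5183])
print(toy_similarity(control, recovered))  # closer means -> higher score
```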
Next I'd like to discuss thresholds a bit.
Some methods of evaluating evidence may have
nominal thresholds.
For example, if you're using a statistical test
of significance, you could consider the critical value
for a Type I error of 5%, or if you're working
with a likelihood ratio you could consider the number one
as a cutoff.
But in general, a threshold is a fixed cutoff
on the similarity scores.
And if the similarity score is above the threshold we will say
that the method indicates
that the two inputs have a common source.
And if the similarity score is below the threshold we would say
that the method suggests
that the two inputs have different sources.
Now to put this into the language
of ROC curves we'll label the case of common source
as positive and we'll label the case
of different sources as negative.
Now in order to actually evaluate the performance
of a method we need to work
with data whose true sources we actually know.
This would be called reference data or method evaluation data.
Then there are two types of errors that we may observe.
First a false positive.
This occurs when we have two samples that originated
from different sources and yet the method suggests
that they came from the same source.
Another type of error is the false negative, which occurs
when we have two samples from the same source
but the methods suggest that they come
from different sources.
And here are the four different outcomes that we might observe
with a single fixed threshold.
The columns here are labeled according to the truth:
the left column is positive, for pairs from the same source,
and the right column is negative for pairs
from different sources.
And then we have on the rows the indication from the method
of evaluating the evidence,
whether that indication is positive or negative.
And you can see here where the false positives
and the false negatives land.
Now it turns out that the column totals on the left
and the right are both fixed by your method evaluation data.
So they're constant, determined by that data.
So instead of working with absolute numbers
of false negatives and false positives we can convert those
to rates by dividing each by its column total.
So here are the formulas for that,
for both the false positive rate and the false negative rate,
and for reasons that we'll see in a minute I'm going
to introduce the true positive rate which is equal
to 1 minus the false negative rate.
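As a minimal sketch of those formulas (assuming NumPy arrays
of scores and known true labels, which are illustrative
assumptions, not part of the talk):

```python
import numpy as np

def error_rates(scores, labels, threshold):
    """FPR, FNR, and TPR at a fixed threshold.

    labels: 1 for same-source (positive) pairs, 0 for
    different-source (negative) pairs. Scores at or above the
    threshold indicate a common source.
    """
    pred_pos = scores >= threshold
    n_pos = np.sum(labels == 1)  # positive column total
    n_neg = np.sum(labels == 0)  # negative column total
    fpr = np.sum(pred_pos & (labels == 0)) / n_neg
    fnr = np.sum(~pred_pos & (labels == 1)) / n_pos
    tpr = 1 - fnr
    return fpr, fnr, tpr
```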
So working with a single fixed threshold you would have just a
pair of error rates.
But let's now imagine varying the threshold
and seeing what all possible error rates are.
If we do that, we would generate a receiver operating
characteristic plot, or ROC plot.
So here are the axes --
the vertical axis is the true positive rate
and the horizontal axis is the false positive rate.
And there are two curves that I've drawn on here,
the light gray one is the true theoretical ROC curve,
and for illustrative purposes I've drawn a random sample
from that.
So this is made up data.
And I've computed the empirical ROC curve, which is shown
by the blue step function.
So let's understand a little bit
about how varying the threshold leads
to the different points on the ROC curve.
First of all, the highest value of the threshold occurs
at the start of the ROC curve.
And then as we gradually lower the threshold we march
up along the blue curve.
And at the end the threshold is smallest.
So let's take a couple of examples to understand
where the points on the ROC curve come from.
Suppose with these simulated data
that we apply a threshold of 2.37.
Then the number of false positives is zero,
and the column total is 100.
So the false positive rate is zero and the number
of true positives is twenty.
And again, for the positives the column total is 100.
So that gives a true positive rate of 0.2.
And you can see the point there is
at the coordinates zero and 0.2.
Next, if we lower the threshold a little bit more we observe ten
false positives and ten divided
by 100 gives a false positive rate of 0.1,
and 60 true positives, and 60 divided by 100 is 0.6.
So that gives a true positive rate of 0.6 and we can see
that the coordinates of that point are 0.1 and 0.6.
And I won't do the last table, but you can quickly see
that the coordinates of that point are 0.35 and 0.9.
Now it turns out you don't have to actually go
through the laborious process of calculating these tables
for every point on an ROC curve.
Instead, the ROC curve depends on only the order
of the similarity scores from the positives and negatives.
And I'll explain that more in the background section.
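As a sketch of that rank-based computation (a simplified
version that ignores tied scores, not the speaker's code):

```python
import numpy as np

def empirical_roc(pos_scores, neg_scores):
    """Empirical ROC curve from similarity scores alone.

    Only the ordering of the pooled scores matters: sweeping the
    threshold from high to low, each positive score moves the
    curve up and each negative score moves it right. Tied scores
    are ignored for simplicity.
    """
    scores = np.concatenate([pos_scores, neg_scores])
    labels = np.concatenate([np.ones(len(pos_scores)),
                             np.zeros(len(neg_scores))])
    labels = labels[np.argsort(-scores)]  # sort descending by score
    tpr = np.concatenate([[0.0], np.cumsum(labels) / len(pos_scores)])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / len(neg_scores)])
    return fpr, tpr
```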
Here is the specific application to forensics we considered.
We wanted to assess the performance
of different statistical methods of evaluating evidence
in the form of glass fragments.
So let's now go on to look a little bit more
at the background of ROC curves.
First, they appeared in the 1940s
to assess performance in reading radar signals.
Then in the 1950s and 1960s they spread
to signal detection theory in general, and a seminal book
from that time is by Green and Swets.
And then Swets and Pickett in 1982 wrote
about diagnostic systems and featured ROC curves.
And that really popularized ROC curves in medicine,
especially in the field of radiology.
And today, ROC curves are used in many fields.
A couple of other examples are machine learning and astronomy.
So now I'd like to give a little deeper overview of ROC curves.
As I mentioned before,
the vertical axis is the true positive rate
and the horizontal axis is the false positive rate.
And they both range between 0 and 1.
Then let's look at a few samples of ROC curves.
A perfect ROC curve is demonstrated by this red line.
It starts at 0, 0, runs up to 0, 1,
and then runs straight across to 1, 1.
And this indicates that if you choose the threshold
at the top left corner,
where the coordinates are 0 and 1,
the method of evaluating evidence has not committed any errors.
On the other hand, you could have the opposite extreme
where the method does no better than randomly guessing.
And if that occurred, the ROC curve would follow the green
45-degree line.
And usually in practice,
an ROC curve will be somewhere between those two.
I hope that you can get a sense
that the closer the ROC curve comes
to the upper left corner the better the method is performing.
Here are some important properties of ROC curves.
First of all, they show the complete range of error rates
across all possible thresholds.
And another very valuable aspect of ROC curves is
that they are independent of the scale of similarity scores.
So it does not matter how the similarity scores are normalized
or calibrated.
All that matters is the order of the similarity scores
from the positives and negatives.
Another way of thinking
about this is the ROC curve is a rank-based method,
and a related property is that the ROC curve is invariant
under strictly increasing monotone transformations
of the underlying similarity scores.
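Here is a small sketch of that invariance, using scikit-learn's
roc_auc_score and made-up normal scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 100)  # same-source scores (made up)
neg = rng.normal(0.0, 1.0, 100)  # different-source scores (made up)
y = np.concatenate([np.ones(100), np.zeros(100)])
s = np.concatenate([pos, neg])

# exp is strictly increasing, so it preserves the ordering of the
# scores and leaves the ROC curve and AUC unchanged.
print(roc_auc_score(y, s))
print(roc_auc_score(y, np.exp(s)))  # identical value
```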
Next I'd like to discuss some
of the insightful performance measures that ROC curves offer.
As I've been saying, they show all attainable error rates
across all possible thresholds.
Those are the points themselves on the curve.
And usually the region in the upper left is of most interest.
That is where the false positive rate
and the false negative rate are both lowest.
If you're interested in finding the equal error rate
which is the point where the false positive rate equals the
false negative rate, you can do
that by drawing this anti-45-degree line and finding the point
where it intersects the ROC curve.
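On an empirical curve you can approximate that intersection
numerically; a minimal sketch, reusing the fpr and tpr arrays
from the empirical_roc sketch above:

```python
import numpy as np

def equal_error_rate(fpr, tpr):
    """Approximate equal error rate on an empirical ROC curve.

    Finds the curve point closest to where the false positive
    rate equals the false negative rate (1 - TPR), then averages
    the two rates there; a step-function curve may not hit the
    anti-45-degree line exactly.
    """
    fnr = 1 - tpr
    i = np.argmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2
```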
In addition, if you have a theoretical ROC curve
for which you can calculate the derivative or find the slope,
where that is equal to 1, you have the point of equal likelihood.
And probably the most common summary computed
from ROC curves is the area under the curve,
which is usually abbreviated as the AUC.
This is equal to the probability that a similarity score
from two random samples from the same source is greater
than the similarity score from two random samples
from different sources.
And it also runs between 0 and 1.
And again, hopefully you can get a sense
that if the curve moves more quickly
to the upper left corner the area
under the curve is going to be valued higher.
So larger values of the AUC indicate better performance.
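That probability interpretation gives a direct way to compute
the AUC without integrating the curve; here is a sketch of the
standard pairwise (Mann-Whitney) estimator, with ties counted
as one half:

```python
import numpy as np

def auc_pairwise(pos_scores, neg_scores):
    """AUC as the probability that a random same-source score
    exceeds a random different-source score."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos_scores) * len(neg_scores))
```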
I'll now move on to discuss the analysis that we did.
The data are publicly available on the web site of the Journal
of the Royal Statistical Society, and they were analyzed
by Dr. Aitken and Dr. Lucy in a paper published in the Journal
of the Royal Statistical Society.
We would also like to recognize the person
who originally collected the data.
The data consists of 62 windows
and from each window there are 5 fragments.
Then there are measurements of silicon, potassium, calcium,
and iron on each fragment.
And these can be converted to variables where we take the log
of the ratio of calcium to each of the remaining elements.
So this means we have three variables for each fragment,
and the data are therefore multivariate.
And we selected four methods discussed in the paper
by Aitken and Lucy
and studied their performance.
The first is called multiple T statistics,
and with this method a separate T statistic is calculated
for each variable, and then the one
with the maximum absolute value is used as a similarity score.
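A minimal sketch of that scoring idea, assuming SciPy and
samples stored as fragments-by-variables arrays (the sign is
flipped so that higher values mean more similar; this is an
illustration, not the code used in the study):

```python
import numpy as np
from scipy import stats

def multiple_t_score(control, recovered):
    """Similarity score from multiple t statistics.

    control, recovered: arrays of shape (n_fragments, n_variables).
    A two-sample t statistic is computed per variable, and the
    maximum absolute value is negated so that higher scores
    suggest a common source.
    """
    t, _ = stats.ttest_ind(control, recovered, axis=0)
    return -np.max(np.abs(t))
```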
The Hotelling T-squared statistic takes the multivariate data
as input and outputs a scalar value.
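And here is a sketch of the two-sample Hotelling T-squared as a
similarity score (a textbook version with pooled covariance,
again negated so higher means more similar):

```python
import numpy as np

def hotelling_t2_score(control, recovered):
    """Similarity score from the two-sample Hotelling T-squared.

    Larger T-squared means the mean vectors differ more, so the
    statistic is negated to act as a similarity score.
    """
    m, n = len(control), len(recovered)
    d = control.mean(axis=0) - recovered.mean(axis=0)
    pooled = ((m - 1) * np.cov(control, rowvar=False)
              + (n - 1) * np.cov(recovered, rowvar=False)) / (m + n - 2)
    t2 = (m * n) / (m + n) * d @ np.linalg.solve(pooled, d)
    return -t2
```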
There are also two likelihood ratio methods.
One assumes a normal distribution for the
within variation and a normal distribution
for the between variation.
The other assumes a normal distribution for the
within variation, but then does a density estimate
for the between distribution.
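As a rough illustration of the first likelihood ratio, here is
a univariate sketch with known variance components; the method
in the Aitken and Lucy paper is multivariate and estimates
these quantities from data, so this shows only the core idea:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def normal_normal_lr(xbar, ybar, m, n, mu, sigma2, tau2):
    """Univariate normal-normal likelihood ratio for two sample means.

    sigma2 is the within-source variance, tau2 the between-source
    variance, and mu the overall mean, all assumed known here.
    Under a common source the two means share one random source
    mean, which induces a covariance of tau2 between them.
    """
    vx = tau2 + sigma2 / m  # variance of the control mean
    vy = tau2 + sigma2 / n  # variance of the recovered mean
    num = multivariate_normal.pdf([xbar, ybar], mean=[mu, mu],
                                  cov=[[vx, tau2], [tau2, vy]])
    den = norm.pdf(xbar, mu, np.sqrt(vx)) * norm.pdf(ybar, mu, np.sqrt(vy))
    return num / den  # values above 1 favor a common source
```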
And we would like to thank Dr. Aitken and Dr. Lucy very much
for sharing some updated computer code
to calculate those values.
Now, the most important thing to take away
from these four methods right now is
that we can treat all methods as mappings from two samples
to a similarity score.
And this will then allow us to create ROC curves
for each method and compare them on a level playing field.
Let's move on now to the results.
First I'll show what the error rates are
at some nominal cutoffs.
In this analysis, we did comparison of two
versus three fragments.
But yesterday I heard a very insightful comment by somebody
in the audience, which is
that often times you may have only one observation
for one of your samples.
And I have done that analysis,
but unfortunately I left those results at home.
But I would be happy to share them with some of you.
So let's go across now the methods.
We have the multiple T statistic method,
and for that we use the 5% critical value adjusted
by the Bonferroni method.
For the T squared statistic we use the 5% critical value
and for the two likelihood ratio methods we use the number 1.
Now this is very interesting.
At these nominal cutoffs, the false negative rate
for the two methods on the left is higher
than for the two methods on the right.
However, just the opposite happens when we go to look
at the false positive rate.
The two methods on the left are lower and the two methods
on the right are higher.
So this leads to an ambiguous interpretation,
which is: how are the methods performing overall?
So instead of just using two points we created ROC curves
for all of these methods.
And to start I'd like to call your attention
to the large curves in the background here.
There are actually four curves drawn there,
one for each method.
But the curve essentially looks red because the last method
is drawn on top.
And that's because all
of the methods have ROC curves that nearly overlap.
And what that means, practically,
is that all of the methods achieve very similar error rates
by choosing the right threshold.
Another important aspect here is
that the ROC curves all rise very quickly in the beginning
and they reach a true positive rate of 1 very quickly.
So we should expect that the AUC value, the area
under the curve, will be very close to 1.
And finally, this inset here, in this area, is a zoom-in
on the upper left corner of the ROC curve.
And I've drawn the points on the ROC curve
for which the rates occur at those nominal thresholds,
and I think the thing to take away from this is
that they're scattered about the ROC curves,
and by choosing different thresholds you could achieve
very similar error rates with each of the methods.
So let's now move on to actually looking at the area
under the curve, the AUC values.
We've multiplied these numbers by 100 to convert them to percentages.
These are all very high.
They range from 98.98 to 99.09, which are all very good.
In another field such as medicine, if you had an AUC
of 88 that would be considered quite good.
So these are encouraging numbers.
Here's an example of calculating the equal error rate
with real data.
Remember that the equal error rate occurs
where the anti-45-degree line intersects the ROC curve,
and we find values ranging from 2.61 to 3.23.
And you can note the different thresholds
that would lead to those error rates for each of the methods.
Next I want to talk about a few other applications.
Suppose you have an application
where a 5% false negative rate is deemed acceptable
and you want to find the corresponding false positive
rate.
Well, remember that the false negative rate is equal
to 1 minus the true positive rate.
So the true positive rate here is going to be 0.95.
And we can draw a horizontal line at 0.95 and find
where that intersects the ROC curve
and from there read off the corresponding false
positive rates.
And when we do that, we see
that the false positive rates range from 2.30 to 2.61.
And I've given thresholds there on the right.
In addition, you could start from a different point.
You could say that you wanted to achieve false positive rate
of 5%, and if you did that you would draw a vertical line
that intersects the horizontal axis at 0.05,
and then you could easily read off
where that intersects the ROC curve.
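Both readings are easy to do numerically on an empirical
curve; a sketch, again using the fpr and tpr arrays from the
empirical_roc sketch above:

```python
import numpy as np

def fpr_at_tpr(fpr, tpr, tpr_target=0.95):
    """Smallest observed FPR at which the curve reaches the target
    TPR (i.e. a false negative rate of 1 - tpr_target or better)."""
    i = np.argmax(tpr >= tpr_target)  # first index reaching the target
    return fpr[i]

def tpr_at_fpr(fpr, tpr, fpr_target=0.05):
    """Largest observed TPR whose curve point keeps the FPR at or
    below the target."""
    i = np.searchsorted(fpr, fpr_target, side="right") - 1
    return tpr[i]
```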
And there are many more applications of ROC curves.
You can incorporate a cost-benefit analysis
into figuring out what error rate has some optimal property.
So now I'll wrap up here.
I would like to reiterate that ROC curves
from similarity scores are comprehensive.
And by that I mean that they depict the entire range
of error rates that you can achieve across all thresholds.
I'd also like to mention that ROC curves are comparable and by
that I mean that they are independent
of the scale of the scores.
Remember that they put everything
on the standardized axis of true positive rate
versus false positive rate.
So this allows you to compare different methods
with the same tool.
In addition, the ROC curves lead
to objective performance measures such as the error rates
on the curve itself or the area under the curve, the AUC value.
And it occurs to me, I've been listening to a lot of discussion
about analyzing glass.
I think one other neat potential application
for ROC curves would be to use them
to compare the different instrumentation that is applied
to measuring the elemental composition of glass.
And our application to trace evidence in the form
of glass fragments showed
that the statistical methods all have fairly high performance.
And before I end, I just want to mention
that if you're interested in reading more about ROC curves,
I would recommend starting with the article from 2000 by Swets,
and then also the article from 2006 by Fawcett and the article
from 2007 by Zhou, and if you really want to dig in,
there are some other good books that I recommend.
Thank you all very much.
[ Applause ]