[ Music ]
>> Our next speaker is Tatiana Trejos who is
from the International Forensic Research Institute
at Florida International University in Miami.
And she is going to talk about Evaluation of the Performance
of Different Match Criteria for the Comparison
of Elemental Composition of Glass by micro-XRF, ICP-MS,
Laser Ablation ICP-MS, and Laser Induced Breakdown Spectroscopy.
[ Pause ]
>> Good morning everyone.
First of all, I would like to thank all the organizers
for such a wonderful workshop and for the opportunity
to come here and present some of this data.
I'm gonna be talking about the evaluation of the performance
of match criteria for the comparison
of the elemental composition of glass by different techniques.
So in order to evaluate the effect of the match criteria
on glass comparisons, the Elemental Analysis Working Group
decided to conduct four inter-laboratory studies,
or Round Robins. And even though we met in advance to look
at the overall design of the Round Robins and the aims
for each of these tests, all the samples were submitted
to the participants as blind tests to avoid any bias
in the results and the reporting of the results.
So for the first Round Robin we focused on the evaluation
of the analytical performance of the methods
to see how we compared to the others and also to evaluate
and assess the match criteria that each laboratory was using
at that moment in their agencies.
For the second Round Robin and the succeeding Round Robins,
we designed them based on the discussions that we had
in the group and the experience that we gained
from the previous Round Robin studies.
So for the second round we decided to make a larger set
of standard reference materials
to further evaluate the analytical performance
of the methods and also include some samples for comparisons
to evaluate type 1 and type 2 errors.
The third Round Robin was more focused on the evaluation
of false inclusions while the fourth Round Robin was more
focused on the evaluation of false exclusions.
So there were two main questions that we wanted to answer
with these Round Robin studies: one dealing
with the analytical performance of the methods,
and the second one related to the match criteria.
So in terms of the analytical performance, we wanted
to know how each technique performs versus the others
in terms of precision, accuracy, sensitivity, limitations,
interferences, and discrimination capability,
and how consistently we can get results among the participants.
Also something of interest in our group was to work
towards the standardization of the methods,
and we are very close to submitting two methods
for consideration to ASTM as a product
of this group, one related to micro-XRF and the other one
for Laser Ablation ICP-MS analysis of glass.
In terms of match criteria, what we wanted to do is
to evaluate the effect of sampling strategies as well
as the selection of the match criteria on the error rates
for elemental comparison of glass.
And of course, one of the interests of the group was
to take a look at the interpretation
of the significance of an association when one is found.
So these graphs here represent the number
of participant laboratories that we got
for each of the Round Robins.
We got about 14 to 18 different laboratories
that participated in each of the Round Robins.
And as you can see, the majority of the data came
from ICP users and XRF analysts.
ICP included data from digestion followed by ICP-MS analysis,
Laser Ablation ICP-MS,
as well as Laser Ablation coupled to ICP-OES.
And one of the important things about the number of participants
that we got in these Round Robins is that we gathered enough data
from different techniques and different methods,
taken by different analysts at different locations,
with different instruments, brands, and configurations.
So we got enough data to do inter-
and intra-laboratory variation studies.
So this is an example of the results
for the second Round Robin,
where we were comparing the analytical performance.
This is an example of lithium present at about 5 ppm
in the standard reference material 1831.
And this is the comparison of the results obtained
by the participants among the ICP users.
As you can see, each laboratory was able
to compare their individual precision and accuracy
versus the study mean and the certified value.
We got excellent agreement between the participants
in terms of precision and accuracy,
better than 10 percent for the majority of the elements.
And one important thing is that these studies led
to the standardization of the methods, tweaking the methods,
improving them, finding some outliers, and that was important
as a validation process also for the members of the group.
The second Round Robin also included three samples
that we submitted for a comparison
to simulate casework.
Those samples were architectural float glass
that was manufactured in the Cardinal plant.
K1 and Q1 share a common origin and were manufactured
in 2001, while Q2 originated
from a different source, manufactured
in the same plant but years apart.
And before I present the results for the different match criteria
that we evaluated, I would
like to present really briefly a description of how
the data are displayed and how we call an association.
So for XRF or micro-XRF, usually the first thing
the participants or the examiners do is to look
at the spectra of the Ks and the Qs and compare to see
if they can find any significant differences in the spectra.
Then, once they have done the comparison
of the spectral overlay, they can also take the intensities
and compute ratios of the intensities to look
at the data in a numerical way.
In the protocol of the analysis that we submitted
with the inter-laboratory test, we asked
that they report at least 6 to 8 ratios for the samples,
and we requested that they take at least 9 to 10 measurements
of the K before comparing with the measurements of the Q samples.
The LIBS data look very similar to the micro-XRF,
so I didn't have a slide specifically for LIBS,
but they also have spectra and they are also gonna have ratios
of intensities of the elements that they can be looking at.
In the case of ICP-MS data we get quantitative information.
So what we have is a concentration
of elements present in very low concentrations, in ppm.
We look at about 16 to 18 different elements present
in the low ppm range for trace and minor elements.
And same thing, we requested at least 10 measurements
of the known samples and at least 3 measurements of each
of the questioned fragments.
So once we have the numerical data
and we have selected a match criterion, if the K
and the Q are significantly different in at least one
out of those 18 elements, or one
out of those 8 ratios, then we can exclude the samples
from having come from the same source.
If we fail to find any differences in any
of those elements, then we call that an association.
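The decision rule just described could be sketched in code as follows. This is a hypothetical illustration with invented concentrations, not any participant laboratory's actual software; the ±4 standard deviation interval used here is one of the criteria discussed later in the talk.

```python
import statistics

def compare_k_vs_q(k_measurements, q_measurements, n_sd=4):
    """Compare known (K) and questioned (Q) glass measurements element
    by element. k_measurements maps element -> list of replicate values
    (e.g. >= 9 replicates); q_measurements maps element -> replicates.
    Returns 'exclusion' if at least one element mean falls outside the
    K mean +/- n_sd * SD interval, otherwise 'association'."""
    for element, k_values in k_measurements.items():
        k_mean = statistics.mean(k_values)
        k_sd = statistics.stdev(k_values)
        lo, hi = k_mean - n_sd * k_sd, k_mean + n_sd * k_sd
        q_mean = statistics.mean(q_measurements[element])
        if not (lo <= q_mean <= hi):
            return "exclusion"   # one differing element is enough
    return "association"         # no significant differences found

# Toy data (invented concentrations in ppm, for illustration only)
k = {"Li": [5.1, 5.0, 5.2, 4.9, 5.1, 5.0, 5.2, 5.1, 5.0],
     "Sr": [45.0, 44.8, 45.3, 45.1, 44.9, 45.2, 45.0, 45.1, 44.9]}
q_same = {"Li": [5.0, 5.1, 5.1], "Sr": [45.1, 44.9, 45.0]}
q_diff = {"Li": [7.9, 8.0, 8.1], "Sr": [45.0, 45.1, 44.9]}

print(compare_k_vs_q(k, q_same))  # association
print(compare_k_vs_q(k, q_diff))  # exclusion
```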
So with that said, I'm gonna present here a table
of representative results for the comparison of those samples
for the second Round Robin, as reported
by each laboratory using their own match criteria.
Something that I want you to note here is
that the match criteria that were reported
for the second Round Robin varied a lot
between participants.
Everybody was using their own match criteria,
the different match criteria in use in their agency.
So we have t-tests at 95 percent confidence, range overlap,
2, 3, or 4 standard deviations, or a spectral overlay
as the match criterion of choice.
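Range overlap, one of the criteria just listed, simply asks whether the spread of the K replicates overlaps the spread of the Q replicates for every element or ratio. A minimal sketch with invented intensity ratios (a hypothetical helper, not any participant's code):

```python
def ranges_overlap(k_values, q_values):
    """True if the [min, max] range of the K replicates overlaps the
    [min, max] range of the Q replicates for one element or ratio."""
    return min(k_values) <= max(q_values) and min(q_values) <= max(k_values)

# Toy intensity ratios (invented numbers, illustration only)
k_ratio = [1.20, 1.24, 1.22, 1.19, 1.23, 1.21, 1.22, 1.20, 1.23]
q_ratio_same = [1.21, 1.22, 1.20]
q_ratio_diff = [1.40, 1.42, 1.41]

print(ranges_overlap(k_ratio, q_ratio_same))  # True  -> no exclusion here
print(ranges_overlap(k_ratio, q_ratio_diff))  # False -> exclusion on this ratio
```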
And as you can see here,
in the second Round Robin we got 100 percent correct association
of the samples that originated from the same source,
and as well, all the participants were able
to correctly discriminate the samples that came
from different sources.
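For the laboratories that used a t-test criterion, the per-element comparison could look roughly like this. The numbers are invented, and the critical value is passed in rather than computed, since it depends on the chosen confidence level and the degrees of freedom; 2.2 is used here as an approximate two-tailed 95 percent value for these sample sizes.

```python
import statistics

def welch_t(k_values, q_values):
    """Welch's t statistic comparing the K and Q means of one element,
    allowing unequal variances and unequal replicate counts."""
    mk, mq = statistics.mean(k_values), statistics.mean(q_values)
    vk, vq = statistics.variance(k_values), statistics.variance(q_values)
    se = (vk / len(k_values) + vq / len(q_values)) ** 0.5
    return (mk - mq) / se

# Invented concentrations (ppm); t_crit ~ 2.2 assumed for ~95% two-tailed
k = [5.1, 5.0, 5.2, 4.9, 5.1, 5.0, 5.2, 5.1, 5.0]
q = [5.0, 5.1, 5.1]
t_crit = 2.2
print(abs(welch_t(k, q)) > t_crit)  # False -> this element does not exclude
```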
>> So due to the encouraging results that we got
in the second Round Robin, we tried
to make the test more challenging to the participants each time.
So for the third Round Robin, what we did is
that we asked each participant
to compare the elemental analysis of the sample
that we submitted as K1 versus all the questioned items,
and this particular test had three questioned items,
and also a second K with all the questioned items.
And the samples that we submitted
for the analysis were manufactured
in the same plant, the Cardinal plant.
They were manufactured years apart,
months apart, and weeks apart.
So we wanted to know what the discrimination capability
of the methods was when those time intervals
in the same plant come closer.
So this is an example
of the data from the pre-distribution analysis.
This was taken with the Laser Ablation ICP-MS.
And as you can see, for example here,
highlighted in yellow, as the samples get closer
in time the elemental composition
looks very, very similar.
And I highlighted in red the elements that were responsible
for the differences, or the major differences, between the samples.
So you can see that some of them are present
in very low concentrations,
so not all the techniques may be able
to detect those differences.
That's what we expected in advance.
So this is a summary of the comparison results
for those samples that were manufactured
at least 2 years apart, as reported by each
of the participants, again using their own match criteria.
As you can see, all the participants were able
to discriminate the samples that originated
from the different sources, regardless of the method
that they were applying and the match criteria
that they selected.
The only two exceptions are here: one of the LIBS laboratories
that was using their own mathematical algorithm,
and after this they fine-tuned the method
and found some errors in the code.
And we also have an inconclusive result for one
of the acid digestion ICP-MS participants,
because they had a problem with one of the samples
and they didn't have enough sample to repeat the analysis,
so they called this inconclusive.
But other than that, we were able
to discriminate all the samples correctly.
When the samples were closer in time of manufacturing,
only the more sensitive methods
like ICP-MS and LIBS were able to discriminate the samples
that were manufactured a few weeks or months apart.
So the summary for this Round Robin
is that it allowed the study of type 2 errors
in samples that were very similar in composition.
We took this as the worst-case scenario:
we took from our database the samples that were closest
in composition and in time.
However, all the techniques were able to differentiate samples
that were manufactured in the same plant
2 or 3 months apart, regardless of the match criteria
that they selected for the analysis.
The samples that were very, very similar in composition
and were manufactured only two weeks apart were only
differentiated by the methods that were more sensitive,
like the ICP and some of the LIBS laboratories.
So for the fourth Round Robin,
we decided to study the type 1 errors a little bit more.
We collected samples from our database from the Pilkington plant.
Q1 was manufactured in February of 2010,
and all the other samples that we submitted
for comparison originated from the same source, manufactured
in the same plant just two weeks apart.
This is an example of the pre-distribution analysis
by Laser Ablation ICP-MS, and something that I want you
to note here is that in this particular plant,
for these particular samples,
even though these samples were manufactured only two weeks
apart, you can note that there are significant differences
in the elemental composition.
So in comparison to the previous one, we were expecting most
of the laboratories to be able to find these differences
in the samples even though they were manufactured only two
weeks apart.
And this is the summary of the results as reported
by the XRF participants using their own match criteria.
Something that I want you to note here is
that by this fourth Round Robin you can see how the participants
were having much better agreement in the match criteria
that they were selecting for doing their comparisons.
Using these match criteria, all the participants were able
to correctly discriminate the samples
that were manufactured two weeks apart, and all
of them were able to correctly associate the samples
that originated from the same source.
When the measurements were taken
with more sensitive methods, still all the participants were
able to discriminate the samples
that were manufactured two weeks apart.
However, we started seeing some type 1 errors
in some of the fragments.
And something that I want you to note here is
that still the ICP users were using a large variety
of match criteria for the fourth Round Robin.
In this particular case, this raised a flag
to the participants that we may need
to use a wider match criterion for ICP-MS data, due
to the nature of the analytical technique:
it is very sensitive and the precision is very tight
between measurements, and that could contribute
to the increased rate of type 1 errors.
For the LIBS participants we got similar results,
where we have some type 1 errors.
However, we also have some type 2 errors reported,
which we attributed to the fact that, in comparison
to the older methods, LIBS still lacked standardization
among the participant laboratories, so there was a lot of variation
in the variables and even in the ratios
that were used for comparisons.
Because the rate of type 1 errors that we got
in the fourth Round Robin for the ICP data was very atypical
of what we have observed over the years, based on our database
and studies that have been conducted in the past,
we decided to take a closer look
to see what could have been the cause of those errors.
So here's a little bit of history
about the source of the samples.
They were taken
from a manufacturing plant, the Pilkington plant,
that was making a transition in the iron content
at the time, due to customer requirements.
You can see that over time they were going
from low concentration of iron to high iron and so on.
They reported an important transition in this area,
on these dates in 2010, and the samples that were taken
to evaluate type 1 errors were sampled just four days before
this big transition in the plant.
So the group was interested
in looking more closely at the homogeneity of those samples
to see how that compares to other samples
that we have in the laboratory.
So we conducted a homogeneity study
where we took all those Pilkington samples
that were included in the fourth Round Robin and also a pane
of glass from the Cardinal plant that was used
in the third Round Robin.
We conducted the homogeneity study to evaluate the variation
within a pane, so we took five to seven fragments per set
and did a comparison of the elemental composition.
But we also looked at the spatial variation within the fragment,
so we did comparisons of the elemental composition
on the float side versus the non-float side and also
in different areas of the cross section.
And this is an example of what we observed
for the Cardinal plant.
This is just the iron concentration here,
and you can see the mean value and the standard deviation
for each of the measurements.
We didn't find significant differences across the section
of the glass in the Cardinal samples.
However, you can note here a significant difference
in the concentration of iron across the glass, not only
for the float versus the non-float side but also
within the cross section.
For this particular Round Robin, the participants requested
that the samples be as small as possible,
to be representative of what they have in real cases.
So we submitted samples that were very small
and irregular in shape, so chances are that the samples
that were submitted as Q samples originated
from different areas of the cross section,
and that may explain why we got such a high rate
of type 1 errors in the fourth Round Robin.
So after that, we requested each participant
to take their own data and apply all these match criteria,
to compare the error rates that we can get
under different circumstances.
>> So we requested range overlap, t-tests at different p-values,
t-tests with Bonferroni correction, Hotelling's T-square
for some of the sets, and then plus or minus 2 all the way
to 6 standard deviations with
and without a minimum 3 percent RSD.
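The standard-deviation criteria with and without the minimum RSD floor could be sketched like this. This is a hypothetical illustration with invented numbers; the 3 percent floor replaces the measured standard deviation only when the measured precision is tighter than 3 percent RSD, which is the situation the talk describes for LA-ICP-MS.

```python
import statistics

def sd_interval_match(k_values, q_mean, n_sd=4, min_rsd=None):
    """Return True (no exclusion on this element) if q_mean lies inside
    mean(K) +/- n_sd * SD(K). If min_rsd is given (e.g. 0.03 for 3%),
    the SD is floored at min_rsd * mean(K), so that very tight measured
    precision does not shrink the interval unrealistically."""
    k_mean = statistics.mean(k_values)
    k_sd = statistics.stdev(k_values)
    if min_rsd is not None:
        k_sd = max(k_sd, min_rsd * k_mean)
    return abs(q_mean - k_mean) <= n_sd * k_sd

# Toy example: LA-ICP-MS-like precision of roughly 0.25% RSD (invented)
k = [100.0, 100.5, 99.8, 100.2, 99.9, 100.1, 100.3, 99.7, 100.0]
q_mean = 102.5  # about 2.5% away from the K mean

print(sd_interval_match(k, q_mean, n_sd=4))                # False: 4s interval is very narrow
print(sd_interval_match(k, q_mean, n_sd=4, min_rsd=0.03))  # True: floored interval covers it
```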
So this is a summary of the results for the micro-XRF
for the three Round Robin studies.
As you can see, this is for type 2 errors, and they were able
to correctly discriminate all the samples submitted
for Round Robin 2 and Round Robin 4 regardless
of the match criteria that were employed.
However, you can see that there are more type 2 errors
in Round Robin 3, due to the nature of the test.
Some techniques perform better than others.
However, you have to notice that the samples
that produced these errors were samples
that were manufactured only 2 weeks or 3 months apart,
so they have very similar elemental composition.
In terms of type 1 errors, in most cases the failures
to associate the samples were obtained for criteria
like the t-test; range overlap, three standard deviations,
spectral overlay, and Hotelling's T-square performed much better
than the other comparison methods
in terms of type 1 error.
For the ICP methods, in terms
of type 2 errors, again very good ability to discriminate samples
in Round Robin 2 and Round Robin 3.
We got some percentage of type 2 errors in Round Robin 3;
however, all these came from samples
that were manufactured only two weeks apart,
for some laboratories that were not able
to discriminate the samples.
In terms of type 1 errors, if you look here
at Round Robin 4, you can see
that there is a higher type 1 error rate, associated mainly
with the heterogeneity of that sample,
as I previously described, particularly
for this transition in the plant.
Nonetheless, you can see here that still,
for the second Round Robin, which used samples
from that Cardinal plant,
we got some type 1 errors depending
on the match criteria that were selected.
Four standard deviations, and four standard deviations
with a minimum 3 percent RSD, provided better rates
for type 1 error.
But I want you to take a closer look at Round Robin 2:
using four standard deviations, we still have
about a 26 percent rate for type 1 error.
Five of 19 of the comparisons were responsible
for that 26 percent type 1 error in Round Robin 2.
However, these errors came from 2 out of 7 laboratories, and only
in one element per laboratory.
So these are the examples.
This is one laboratory's comparison of the K
versus all the Q fragments, and another laboratory's comparison
of the K with the Q fragments; only magnesium in one
and only potassium in the other were discriminated,
or excluded, in the sample.
And I want you to note here, when we use four
standard deviations, that the samples are very close
in composition; however, because of the excellent precision
that is typically observed
in Laser Ablation ICP-MS measurements, these tiny
or small differences were responsible
for excluding those samples on only one out of 18 elements.
So one of the things that was studied and presented
in the group is that, due to the very tight precision that we had
in ICP measurements, we decided
that we can use four standard deviations,
4s, as the criterion.
But instead of using the relative standard deviation
of the measurements, we can fix that value at 3 percent
when the precision is smaller than that.
And this is a method that has been in use by the CFS
in Canada and the BKA.
They recently reported, in 2011 in the Journal
of Analytical Atomic Spectrometry, all the
fundamentals behind that.
They performed a very nice study to evaluate the different type 1
and type 2 errors under different circumstances.
And they found that this criterion provided the lowest
number of type 1 and type 2 errors.
So we also evaluated this match criterion for our Round Robins
in the case of ICP data.
Even though this match criterion may look a little bit wide
compared to what we have typically used in the past, we noticed
that when we used this match criterion we reduced the type 1
errors without really sacrificing the type 2 errors.
And the reason for that is that, if you look here
at this graph, it represents the different elements,
and each data point represents the concentration
and standard deviation for those measurements.
When the samples originate from different sources,
like in the case of the orange trend versus the blue
and the green ones, they differ not only by one element
but by many elements, and by many standard deviations.
So if those two samples came from different sources, even
if we widen the match criteria a little bit,
we'd still be able to find those differences.
So in terms of recommendations learned from this study:
in terms of sampling, what we recommend is to take as many
measurements as practical.
A minimum of nine measurements
for the K samples is recommended to really get a representation
of the variation of the elemental composition
of the sample, taking
into account the heterogeneity of the samples.
In the case of XRF data, appropriate sampling should also
account for differences in size of the fragments
and different geometries.
In terms of quality assurance, and Kristine gave a talk
about that yesterday as well in much more detail,
what we recommend is to use an evaluation or a control standard
to evaluate precision and accuracy on a daily basis
in the laboratory, and one easy way of doing
that is measuring a standard reference material like 1831.
Also, conduct a study in your laboratories
to evaluate the method detection limits,
method quantification limits, precision, and accuracy,
so that you can know when to call a peak a peak and when
to use those for comparisons,
and Troy did an excellent job yesterday describing all the
ideas behind that.
In terms of ICP, what we learned is that we need
to open up the match criteria a little bit,
make them a little bit wider.
Four standard deviations, or four standard deviations
with a minimum three percent RSD, produced the lowest amount
of type 1 and type 2 errors.
For XRF data, a spectral overlay seems to perform well,
and that's one of the techniques preferred
by XRF users; also, intensity ratios can be used
for the comparisons with the match criteria of range overlap
or three standard deviations, which have been shown
to perform well for XRF data.
Hotelling's T-square, which is a multivariate t-test, also
performs well for the XRF data, so it can be considered as well
as an alternative match criterion
for elemental composition comparisons.
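Hotelling's T-square compares the K and Q mean vectors across several elements at once, accounting for correlations between them. A minimal two-element sketch in pure Python (a hypothetical illustration with invented numbers; real implementations would use more elements and compare the statistic against an F-distribution critical value, which is omitted here):

```python
def hotelling_t2_2d(k_pairs, q_pairs):
    """Two-sample Hotelling's T^2 statistic for 2-element measurements.
    k_pairs/q_pairs are lists of (x, y) tuples. Uses the pooled
    covariance matrix and its explicit 2x2 inverse."""
    def mean_vec(pairs):
        n = len(pairs)
        return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)

    def cov(pairs, m):
        n = len(pairs)
        sxx = sum((p[0] - m[0]) ** 2 for p in pairs) / (n - 1)
        syy = sum((p[1] - m[1]) ** 2 for p in pairs) / (n - 1)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pairs) / (n - 1)
        return sxx, syy, sxy

    nk, nq = len(k_pairs), len(q_pairs)
    mk, mq = mean_vec(k_pairs), mean_vec(q_pairs)
    ck, cq = cov(k_pairs, mk), cov(q_pairs, mq)
    # Pooled covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = ((nk - 1) * ck[0] + (nq - 1) * cq[0]) / (nk + nq - 2)
    syy = ((nk - 1) * ck[1] + (nq - 1) * cq[1]) / (nk + nq - 2)
    sxy = ((nk - 1) * ck[2] + (nq - 1) * cq[2]) / (nk + nq - 2)
    det = sxx * syy - sxy * sxy
    dx, dy = mk[0] - mq[0], mk[1] - mq[1]
    # d' * S^-1 * d using the explicit 2x2 inverse of S
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return (nk * nq) / (nk + nq) * quad

# Toy 2-element data (e.g. two element concentrations), invented numbers
k = [(45.0, 30.1), (45.2, 30.0), (44.9, 29.9), (45.1, 30.2), (45.0, 30.0)]
q_close = [(45.1, 30.0), (44.9, 30.1), (45.0, 29.9)]
q_far = [(47.5, 32.4), (47.6, 32.5), (47.4, 32.3)]

# A distant source gives a much larger T^2 than a same-source fragment
print(hotelling_t2_2d(k, q_close) < hotelling_t2_2d(k, q_far))  # True
```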
So finally, in terms of interpretation,
the take-home message of this study is that glass samples
that are manufactured in different plants, or even
in the same plant at very short time intervals,
weeks or months depending on the technique and the variation
of the plant, are clearly differentiated
by the methods that were evaluated.
And therefore we think that we can use this statement
as a starting point in adding significance to the evaluation
of the elemental composition match criteria
for the comparison of glasses.
I would like to thank the NIJ for funding this grant
and of course all the members
of the Elemental Analysis Working Group,
and particularly Dr. Robert Koons, who helped a lot
with the data analysis
and discussions relating to match criteria.
Thank you.
[ Applause ]
>> I think we really had three very interesting talks there.
We've run a little bit late, but I thought the importance
of these talks was worth giving them their allotted time.
Let me start by asking: are there any questions
from the audience? There are microphones here,
and I guess two microphones out there;
if you have a question you might step up.
[ Pause ]
>> This question is for Megan.
What elemental analyses--
what elements were selected
for the copper wires for differentiation?
>> I believe there were multiple elements, I think about 10.
I can't remember all of them, but I know that molybdenum,
vanadium, titanium, and iron
were included in that study.
>> Other questions?
>> It seemed to me that homogeneity and heterogeneity
of the samples was an important consideration for all
of your, you know, studies.
And particularly when you're aiming a laser at a small spot
on a larger sample, the homogeneity of the analysis
across the surface is of importance and of consideration.
Would you like to start by saying something more about,
you know, homogeneity in your work, and homogeneity of samples
in your work with respect to SERS?
>> Yeah, I can-- I can say something about as far as--
the tattoo was a very good example of heterogeneity sample.
And the Raman microscope itself, that example was done
with normal disperse of Raman not with SERS itself.
And so in that case because it was a solid sample we're able
to focusing on an area and get scattering
from essentially the diameter of the laser which is very small.
When it comes to SERS work however if we were to then take
that solid sample and add silver anything that would be soluble
that would be able to go on to the particle would--
could be on the particles.
So you could get mixtures and get multiple components.
So separation methods are important.
However, with that said, with SERS there are some compounds
that are much more SERS active than others.
Things with long chains don't give very good SERS spectra
things with-- with more ring groups are gonna give you a much
stronger signal.
So there is that, that aspect that can help separate some sort
of signal and be able to identify what's there.
>> Megan, let me ask: you're using a laser to focus
and do laser ablation inductively coupled plasma analysis,
and again, you're focusing on a small part
of the sample, and you've looked at variations
across some of your materials.
What kind of percent variations do you see?
Do you ever see a dramatic change in the signal
at some point on the sample?
>> Actually, yes, we do see some dramatic changes, especially
around the edges of the sample or parts
that have been intentionally contaminated.
As far as the rest of the sample,
we don't see much range-- sorry, I just blanked.
We don't see much range,
which is nice; it's an advantage of ICP.
>> Obviously we're interested in reproducibility of results.
But, you know, sometimes with some of these probe methods,
the identification of an additional material
on the sample might be of probative value too, right?
>> Yes, that's correct.
>> To Tania, okay, so at the end you said nine measurements
or more if possible, and sample three fragments
of glass if possible?
Now, we had some discussions in our statistics workshop
on Monday, and my joke was--
this is the question that statisticians hate the worst:
how many measurements should I take?
And their usual answer is "take more,"
because that's always better.
But I think that your comment about taking samples
from multiple fragments also speaks
to heterogeneity across fragments.
And in your data sets,
how do the results compare taking measurements
across different fragments, or did you look
at that aspect of the data?
>> We also conducted some heterogeneity studies
that we didn't discuss due to time constraints.
But we did some comparison of the heterogeneity of containers
versus float glass, for example.
And containers were typically more heterogeneous,
so what we recommend in that case of containers is to take
more measurements and multiple fragments. It's always better
to take a single measurement from multiple fragments than to take,
like, three fragments and then three measurements
on each of the fragments.
If you have more than three fragments,
which is usually the case,
what we recommend is to take as many measurements
as possible, minimum nine,
and if possible one measurement per fragment.
So that we can have an idea of the variation, a representation
of the variation of the heterogeneity
in the sample, before we do any comparison with the Q samples,
which are usually more limited in size.
>> Are there other questions from the audience?
Yes. Please step up.
I believe this might be a statistician?
>> Yes, but it's not a statistical question.
So, I'm interested in your study of the heterogeneity
of the plate glass: did you also sample sideways?
You know, did you have a big plate
and take different areas on the big plate?
>> We have done heterogeneity studies
in different manufacturing plants at FIU.
For these particular sets we had a limited size of samples.
The panes were about this big, so the heterogeneity was
not on a big scale.
But we have sampled big panes
of architectural glass collected from the plants
at different time intervals, from different parts of the ribbon,
and that has been reported.
>> Wait! Stay right there George.
[ Laughter ]
>> I have a question for Tatiana, but
perhaps you can help me answer it.
You know, so you're using some statistical criteria
to set match criteria--
to make match judgments about samples.
And particularly for the hypothesis tests,
the hypothesis tests are about the means of the measurements.
And so it's possible to say there is a difference
in the means even though the measurements
in the two samples overlap
with one another to some extent, so that it would be possible
for a measurement from one chemical sample to actually be closer
to the measurements of another chemical sample despite the fact
that the hypothesis test says they're
at least statistically different--
statistically different at the means.
I don't know, does anybody have any experience with talking
to lawyers and testifying in court as to the results
of such hypothesis tests? That's why I wanted you to stay
up there, George, and, you know, mention something about that.
But do you have any response to that?
It's a statistical issue.
[ Laughter ]
>> I know that's one of the main problems that we all deal
with when we have to present the data
to a jury: how to explain that in lay terms.
And that's why, when we evaluate the match criteria, there is no
perfect match criterion; there is always going to be a compromise
between type 1 and type 2 errors.
So we have to try to make the best decision based
on the data, and then try, as a community,
to work a little bit more on the language: what do we say
about it, and how do we present that in court
to make it understandable.
So I think that we as a community can work on that.
First select the match criterion, no matter if it is difficult
to explain, and then worry about how we are gonna present
that in an easy way for the jury to understand.
>> That was a good answer.
Have you testified before?
[ Laughter ]
>> Anyway, I think I'd like to thank all the speakers this
morning; they were all great,
and they'll be available for questions.
We need to move on to our break or we won't have one.
Thanks very much.
[ Applause ]