Eric Green: So, Rich.
Richard Nakamura: Thank you very much, Eric, and it's indeed
a pleasure to come and present to you an update on the Center for Scientific Review.
It isn't very often that the president speaks about peer review, so we had to feature that
in a talk to the National Academy of Sciences, and in response to some recent concern about
peer review expressed in Congress. I think President Obama very nicely defended the notion
of peer review as important for our ability to have the top science. That was very much
appreciated.
Let me just tell you a little bit about the CSR mission, which is to see that NIH grant
applications receive fair, independent, expert, and timely reviews, free from inappropriate
influences, so NIH can fund the most promising research.
In 2012, CSR reviewed almost 55 percent of NHGRI's grant applications, 286 in total, which is roughly the workload of one of our study sections, though they were distributed across somewhat more than one. We have 174 standing study sections.
I'll go quickly over the path of applications through CSR; some points are a little obscure, but most of it you, as council, will be familiar with. All NIH extramural grant applications come through CSR, and from there they are referred to the NIH institutes and centers and to scientific review groups, or SRGs. We review the majority, about 65 percent, of grant applications for scientific merit for NIH. The other 35 percent or so are reviewed within the institutes, and in the case of NHGRI, a slightly larger fraction.
We have a somewhat unusual peer review process in that applicants, PIs, send in applications
based either on their own initiative or on funding announcements. There is peer review
either at CSR or the institutes. Those applications go through study sections, where they are
ranked. And then they are percentiled. So, as you know, percentiling normalizes the output
of all of our study sections. And that's an important step, and is increasingly being
discussed at NIH.
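To make the percentiling step concrete, here is a minimal sketch in Python. The three-round base, the tie handling, and the rounding are illustrative assumptions, not CSR's exact procedure; the idea is simply that an application's overall impact score is ranked against the scores its study section has recently assigned.

```python
from bisect import bisect_left

def percentile(impact_score: float, base_scores: list[float]) -> int:
    """Percentile of one application's overall impact score against a
    base of scores (e.g., the same study section's current plus two
    prior rounds). Lower impact scores are better, so the percentile
    is the share of base scores that are at least as good (as low)."""
    ranked = sorted(base_scores)
    # Count base scores strictly better (lower) than this one.
    better = bisect_left(ranked, impact_score)
    # Midpoint convention for ties, then express as a percentage.
    ties = ranked.count(impact_score)
    return round(100.0 * (better + 0.5 * ties) / len(ranked))

# Example: a score of 20 in a toy base where most scores sit higher.
base = [14, 17, 20, 20, 23, 28, 31, 35, 40, 47, 52, 58]
print(percentile(20, base))  # 25, i.e., the 25th percentile here
```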
At the IC, and in front of council, strategic goals are applied to the decision to make awards and to decide about funding. Funding itself is a more difficult activity these days because of the low number of awards and the low proportion of dollars for the number of applications.
Finally, we expect research to be the outcome, and outcome progress to be represented by
publications. And we hope these ultimately will affect the public health.
Within the center, 85,000 applications are received each year, and of those, the center reviews 58,000. This involves 16,000 reviewers, over 230 scientific review officers, and almost 1,500 review meetings a year. So this is really a factory of review. We like to think we do it well, and we do it efficiently.
This graphic is an overwhelming fact of life, both for CSR and all the institutes these
days. As we all know, especially since the doubling of the budget, the success rate has
been dropping. It is now at historic lows. The last couple of points on this graph -- the
last one is wrong. It should be flat over the 2011-2012 period, at 18 percent. However,
an 18 percent overall success rate conceals an important issue: the odds of an award coming from any given application have fallen to approximately 10 percent. That essentially means that of our primary constituency, the scientific community, 90 percent are inherently unhappy with CSR at any given time, which gave me second thoughts about accepting this position. However, I think we, the institute directors
and myself, all believe that this is a critical role, and very important for the future of
science in the United States, and that we all must step up to the plate when it comes
to being invited to serve. This applies not only to the institute directors, but also to those who serve on review itself. They, too, suffer the consequences of this curve, not only in their personal activities and in their labs, but as friends of many PIs who are having a hard time.
Some review issues that we faced early in 2013 look like this. One critical thing was grade inflation. We were seeing quite a bit of inflation, especially since enhancing peer review started and the new scoring system came in. As you know, the scoring system went from the old 1.0-to-5.0 scale, or 100 to 500 overall, to 1 to 9.
And this is an examination of how scores worked out to percentiles. It's hard to see the pointer, but down at the bottom is October 2009. And you can see the distribution of scores representing percentile scores of 20, 25, and 30 at the top. As you may know, it's rare that scores above 20 can be considered for an award. So, essentially, a score of 7 represented a percentile score of 20; a score of 13 represented a percentile score of 25; and a 19, a percentile score of 30. That essentially means that two digits out of the 1-through-9 scale were available for indicating scores within the award range. This got compressed over time, so that towards the end, a 9 represented a percentile score of 20, and a 15 represented a percentile score of 25.
This year, we implemented some changes in order to decompress the scores, and this is an indication. We've gone back to better decompression than in 2009, and we're not finished yet, so that now we have a full two digits available to us, and in some cases a third digit. I'll show you a little bit of how that works.
In 2009, the scoring chart was developed. And we have changed the scoring chart for
the beginning of 2013 slightly. This is felt to be a tweak, but an important tweak, because
one thing that had become the practice in a number of review committees is that the
strengths and weaknesses were treated as inverses. As you can see here, a number 1 score is seen
as exceptionally strong, with essentially no weaknesses, whereas 9 is very few strengths
and numerous major weaknesses. We started to hear some reviewers say, "Well, this has
no weaknesses. Therefore, I'm giving it 1."
This was a severe distortion of the original concept of enhancing peer review, in which
the significance or the strengths of the application were supposed to be score-driving features,
not counting weaknesses. This wasn't universally true, but it was true enough that we felt we had to make a change, and so we created this scoring chart. It actually looks more
different from the other scoring chart than necessary, largely to impress upon our review
committees that the scoring chart had changed. There's an emphasis on overall impact on the
chart, and an emphasis that you cannot get into the high category, that is, a score of
1, 2, or 3, unless the application had major importance. And you could get low even if
you had moderate to high importance, if there were major weaknesses.
The other point on this chart is anchoring the scoring at 5, which is the section in dark gray at the bottom. It turns out that a number of reviewers were thinking that a 5, based on the old 100-to-500 system, was an extremely bad score, and were reluctant to use it. So they were essentially compressing the range from 1 to 5. Making this anchor explicit helps them spread scores more.
And here is the outcome of the first round of decompression. The pink line is the most recent one. It looks like a relatively small change, but it crosses the 50 percent line of overall scores at a score of 59, which is a significant change from the old curve, where it was crossing that line at 49. More important is what happens in the 10-to-30 range: there has been about a 30-percent shift in the scores from the yellow line to the pink line. It's important to note the jump that occurs at 20. That jump is an accumulation of scores at the merit score of 20. And that kind of jump is within the gray zone of many institutes, and therefore provides program staff with relatively little information. We're working on that particular issue.
This is something you rarely see; in fact, I don't know of any other presentation of the preliminary impact scores. These are the scores we get before the discussion. And you can see the shape of this curve: it's a skewed normal distribution, skewed to the positive end of the scale. But it does approximate a normal distribution, which suggests that our use of the percentile score, as a translation of these impact scores, is not very helpful or accurate. You can see that the peak is between 30 and 40, or preliminary impact scores of 3 and 4. This means that 50 percent of the applications are squeezed within this range, which is not very helpful to you or to program staff.
After discussion, it gets a little better: the non-discussed applications on the right are 46 percent of the applications, and the discussion respreads scores a bit. Here, the peak is still between 30 and 40, but at this point a quarter of the applications are below it, rather than half. This provides a little more scoring range for program staff to work with and interpret.
However, here's another look at the way scores from reviews are coming out. You can see there are huge jumps at 10, 20, 30, 40, et cetera. These scores are on their sides, so they're a little harder to see, and then they come down. So there really is massive agreement around certain round scores, and having that kind of agreement on review is not very helpful. Figuring out how to spread off of, particularly, 10, 20, and 30 would be helpful. Most review committees report that they can easily differentiate more finely than they are actually differentiating with their scores.
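A minimal sketch of the kind of check being described: measuring how much of a panel's output piles up exactly on the round scores. The scores here are hypothetical, purely to illustrate the spikes at 10, 20, and 30.

```python
from collections import Counter

def round_score_mass(final_scores: list[int]) -> float:
    """Fraction of final impact scores (10-90 scale) landing exactly on
    multiples of 10 -- a rough index of the spikes at 10, 20, 30."""
    hits = sum(1 for s in final_scores if s % 10 == 0)
    return hits / len(final_scores)

# Hypothetical panel outputs, with heavy agreement at 20 and 30.
scores = [18, 20, 20, 20, 22, 25, 30, 30, 30, 30, 33, 37, 40, 40, 51]
print(f"{round_score_mass(scores):.0%} on multiples of 10")  # 60%
print(Counter(s // 10 * 10 for s in scores))  # mass per decade
```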
There remains a problem with the scoring system, and we are having a discussion of ranking, of going to a more frank ranking system, to try and [inaudible] this issue. Another suggestion that has been made by a number of reviewers would be the opportunity to use half points to create more differentiation.
However, when you go back and look at the old 100-to-500 scale, you get a similar kind of peaking. So the great compression that we had under the old system, and the inclination to agree on an individual score that overlaps with other scores, seems to be a chronic problem, and we need another way of dealing with it.
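For illustration, here is one minimal way a frank ranking could force the differentiation that ratings conceal: each reviewer's raw scores are converted into a strict ordering, so tied 2s can no longer collapse onto a single value. This is a sketch of the general idea, not a proposal CSR has adopted.

```python
def to_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Convert one reviewer's raw scores into a strict ranking of the
    applications they reviewed (lower score = better, rank 1 = best)."""
    ordered = sorted(scores, key=scores.get)  # ties broken by order seen
    return {app: i + 1 for i, app in enumerate(ordered)}

print(to_ranks({"A": 2, "B": 2, "C": 3, "D": 2}))
# {'A': 1, 'B': 2, 'D': 3, 'C': 4}: every application is separated
```

A full system would still need a way to merge rankings across reviewers and panels, which is where the percentiling question above comes back in.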
Diversity and fairness in peer review. As you can imagine, fairness is the key goal of peer review, and needs to be its hallmark. In 2011, as I was coming on board, the Ginther et al. article in "Science" came out, which showed that though applications with strong priority scores were equally likely to be funded regardless of race, African-American applicants were 10 percentage points less likely to receive NIH research funding compared to whites.
Ten percentage points sounds fairly innocuous or small in some sense, though it was highly significant. But consider that, for the number of African-American scientists who applied to NIH, compared with an equal number of matched white control scientists, the African-American scientists were getting 55 percent of the awards expected for the white scientists. So that 10-percentage-point difference translated into a huge difference in likelihood of success. Some suggested explanations in that paper were the possibility of bias in peer review, and a cumulative disadvantage that could be experienced by African-American scientists based on differences in education and other forms of bias over the course of their careers.
NIH felt that this issue was extremely important, because NIH believes that it should represent fairness for all groups. And the level of this difference got the attention of Francis Collins and Larry Tabak, the director and deputy director of NIH, to the extent that they immediately formed a set of internal committees, raised the issue to the level of the advisory committee to the director of NIH, and asked that immediate action be taken, but also that both CSR and other groups at NIH get to the bottom of the cause of this disparity.
So a peer review subcommittee was set up. By the way, we accept that whatever is going on happens before award decisions are made; the impact score for applications entirely determines the award rate. So we accept that if there's any problem in the NIH system, it must be occurring within the peer review system. It remains possible that there is a difference in the applications coming in.
The advisory committee asked that a peer review subcommittee be set up. That's been done; I'm a co-chair of that committee. It asked that we provide more information for applicants whose outcome is non-discussion. African Americans have many more applications not discussed than other scientists. That's been done. They asked for text analysis of application summary statements and discussions, for an evaluation of anonymized applications, and for diversity awareness training of NIH staff. All of those things are being worked on, and we're hoping to do this as a true experimental science, so that we can know the causes of these disparities.
This is the initial membership of the working group on diversity's subcommittee on peer review; it's mainly social scientists, to try and look at this problem. We're also bringing on board other scientists with strong records on peer review, and more biologically oriented scientists.
We have also increased the representation of minority groups on our study sections.
Just since I've come on board, we've increased African Americans on CSR study sections by
42 percent, and Hispanic scientists by 22 percent. I show you those in highlight, but
if you look at the rest of the chart, you can see that from 2006 to 2011, numbers had
actually been declining across the board. And so I can't do more than say that we've
brought back the numbers to a proportion that's about 10 percent right now. This is about
double the representation of underrepresented minority scientists in the award pool of NIH.
We've also created an early career reviewer program to help early career scientists who
are just beginning life as an independent researcher to understand more about the Center
for Scientific Review and review. This is to train qualified scientists without significant
experience, to help emerging researchers advance their careers, and to enrich the existing
pool of NIH reviewers by including scientists from less research intensive institutions.
The requirements for being an early career reviewer are: not having reviewed for NIH before, beyond one mail review; holding a faculty appointment, where we are told by the university that the individual is expected to become an independent investigator; having established an active independent research program; and not having had an R01 or equivalent. There is no more than one of these individuals per review panel, and they are given a very light review load as a tertiary reviewer, between two and four applications. So their primary job is to look and learn.
They are under the wing of the SRO and the chair of the panel. So far, nearly 700 have served on study sections, and of those, 32 percent are underrepresented minority scientists. But the program includes scientists from all universities who are considered independent.
Feedback so far has been very positive. Ninety-eight percent found the ECR experience to be useful.
Ninety percent reported themselves to be in a better position to write their own grants.
Ninety-seven percent would recommend the ECR experience to a colleague. We have not been
getting negative comments from reviewers, and the experience of the SROs and the chairs
is very positive.
For your own information, this is how to apply to the ECR program: email csrearlyreviewer@mail.nih.gov, all one word. You'll be given copies of my presentation, if you don't want to note this down.
CSR is also looking at additional review platforms. As part of a general effort to try and ensure
that we have appropriate review platforms for different circumstances, we're trying
different kinds, such as telephone-assisted meetings, video-assisted discussions, internet-assisted
meetings, and telepresence meetings. These are various forms of either phone, video,
or completely asynchronous electronic reviews. We're also looking at editorial board style
meetings. We are looking at the strengths and weaknesses of these various forms; what
kinds of reviews they are best for; people's impressions of these forms of review compared
to face-to-face, our standard form of review. There's a lot of interest in editorial board
style meetings, since these have been used for the director's special application program.
And it is felt that that might be a style for the future, but it is more expensive than
regular face-to-face review.
Here's just an example of a video conference-based study section. Similar to telepresence, this
is where reviewers on both sides of the room -- the ones on the back are just on video;
the ones in the front are live. The ones in the back appear nearly full size, so it's
possible to see their expressions as they conduct their reviews, and to see whether
or not they're paying attention.
[laughter]
One of the nice things about this is that the microphones are set up so they're directional, and you can hear the voice appearing to come from the person who's speaking. It's a very nice feature of this setup.
I'm going to shift over to an issue which has been bothering many scientists, and has been a source of frequent complaints to CSR. And this is the issue of the A2 application.
This is the graphic that convinced NIH directors and NIH that we should address the A2 issue. Here we see that awards to A0s went from about 60 percent in 1998 to the lowest proportion of awards by 2008. The A2 awards had risen and passed the A0s. And you can see from the intersection that seemed to be looming, it was likely that A2s would pass the A1s in proportion of awards as well.
After the A2 was eliminated in 2008, some applications were grandfathered, but as the grandfathering ran out, those curves went down. The A0 rebounded strongly, as we had hoped, and passed the A1. The latest results here are from 2011; we're looking for 2012. I'm very concerned that the A1 may intersect with the A0.
What these lines suggested, and what we heard in the actual conversations on review panels, was that a queue was being set up: people were expected to wait until their A2 application before receiving an award. In many cases, it was very easy for the committee to pick out a few applications that they wanted to see get an award; these became more urgent at the A1, and more urgent still at the A2 level. It became a strong temptation to line up PIs in order of their applications. When the A2 was eliminated, the average time to funding dropped quite a bit, from a little over 90 weeks to a little over 50 weeks.
So given these two things, and the lack of difference between new investigators and established investigators, the Office of Extramural Research determined through a lot of statistics that there were no groups that were substantially more affected by the loss of the A2, and the time to award went down dramatically. And so it was felt that eliminating the A2 was the right decision. Also, the order of scoring seemed to hold across this period of time, so there were no dramatic changes occurring; the A2 queue was just delaying when the award would occur.
Now I want to talk a little bit about the future, and then I'll take some of your comments.
One of the ways in which we think we can improve the outputs of CSR scoring is if we look at
the distribution of applications across study sections.
The current distribution is quite non-random. First, we base distribution on the areas of science. Second, we base distribution on PI preference: PIs get to ask for the study section they'd like to have, we honor those requests 80 percent of the time, and 75 percent of scientists ask for a review committee.
This means, and many people have observed, that there may be a non-random distribution of the highest quality applications. If that's true, and we normalize the output of all study sections, then in council and in other places you do not necessarily get to look at the best applications across all the applications. You're just looking at the best applications from each study section.
We are looking at how that works. We are trying to establish statistics on whether or not that impression is real. We are going to be scoring or re-ranking applications from a broad set of study sections, to see what the relationship of that ranking is to the individual study sections' rankings. If there are consistent differences across study sections, that will cause us to ask the question, "How can we look more generally across study sections in a systematic way, and provide you with that information?"
We're trying to develop better tools for applicants for referral and review. This is part of a general process to make CSR more user-friendly for applicants, and to make the application intake system more helpful. We are also trying to increase diversity and reduce award disparities, obviously. In general, we'd like to provide better service to applicants and to the ICs. We are having more discussions with program staff to see what we can do to make our summary statements the most helpful.
Finally, we're trying to develop a science of peer review. We've established an office with money to do experiments in peer review; I've hinted at some of those ideas earlier in this talk. But I think if we are to keep the U.S. lead, especially in this time of difficult funding, we've got to figure out ways to make it easier to find the best applications and make awards to those, and to provide you, through CSR, with the best information for making those decisions.
Thank you very much.
[applause]
Eric Green: Thank you, Richard. Well, we have time for
questions. Let me start with the first one for this -- I forgot the exact three letters
-- the training one, where you take youngsters, and basically give --
Richard Nakamura: The early career reviewer program, ECR.
Eric Green: Do you -- what's the curriculum that you give
to them? Is it purely just by showing up, or do you have materials, or mock study sections,
or other things that one could imagine you could do?
Richard Nakamura: We don't do mock study sections. What we do is the ECR is given both a PowerPoint set plus training from the SRO, and then they're given guidance. They can't serve on more than two full peer review sessions. And then they're given guidance about how to work within it. So far, the reaction from the SROs and from the chairs is that this has been sufficient. In general, the ECRs take this very seriously. They work very hard on the reviews that they do. They have not caused embarrassment by lack of knowledge.
Female Speaker: Yes, so I just wanted to share some thoughts
and get your impression about this set of slides that you showed about the distribution
of review scores. Because, at some level, one is assuming that the quality of these
proposals and the science that is being proposed is normally distributed. But it's not a random
population of proposers, and so that assumption may be fallacious at some level.
But also, I think, in my own personal experience on study section, there is this tension between doing a review on absolute versus relative terms. On the particular study section on which I served for, say, four years, at the beginning of my tenure the overall quality of the proposals was just not very strong, on absolute terms. That's not to say there weren't some good proposals. But then we were sort of forced -- you know, our SRO kept saying, "Oh, you know, your mean is much lower than all the other study sections." And this refers to what you were talking about just at the end.
On the other hand, you know, does it make sense if, for some reason, a particular batch
of proposals in general is not strong, to rate them on this sort of relative scale?
On the other hand, if a particular batch of proposals are all very, very strong, you're
naturally going to get that compression, either at the top or the bottom.
So I'm just wondering, you know, sort of what recognition there is about this -- you know,
when you look at statistics, statistics can tell you one thing. But if you don't look
at the priors, you're sort of maybe being led to a slightly false direction or strategy.
So I'm just curious to hear what your thoughts are about it.
Richard Nakamura: Yes, we're not planning on doing any given thing. Right now, we observe that we have a problem, that we've long had this problem, that different efforts to address the problem have systematically failed, and that there's this broader issue of what's going on across study sections, which we've never addressed.
So we're planning on having some meetings with individuals who are experts in decision
theory. Obviously, if this were easy, both corporations and NIH would have solved this
a long time ago. This is a deep, difficult problem. And so far, we remain with the idea that strong scientists, excellent scientists, are the best judges of science. Beyond that, I think there are many questions about how to get the maximal information from them, either as a committee or as individuals, and one of the things we should be doing is exploring how we get to answers about that. I'm not proposing, and don't plan to propose, "We know the answer; here is what we're going to do." I think what we'd like to do is to say, "Here is a theoretically better way of approaching this. Let's try it with some study sections," and maybe within an IRG. I think we're going to have continued discussions with the scientific community about any changes that we're thinking about.
Male Speaker: I was interested in your actions related to the diversity issue in application success rates. There has been research showing that there's gender and racial bias in the review of grants. And given that knowledge, why isn't there something where you're trying to address that with the review panelists themselves? They bring it with them; they're steeped in societal stereotypes. It's nice to have diversity awareness training for the NIH staff, but they're not the ones making the reviews, the scores, and the decisions for funding.
Richard Nakamura: It is the intent of the advisory committee
to the director to address that issue with the reviewers if we determine that there is
bias in the peer review system. Right now, we're running some experiments which we hope
will show not only whether or not there are differences in the quality of applications,
but differences that could be attributable to bias and the proportions that that might
account for. I think we need both pieces of information to develop the right intervention.
And so -- but if it turns out that there is bias in the peer review system, it is the
intent to try and train reviewers.
The ACD asked us to take any validated system for countering bias and apply it, first to
NIH staff and then to the reviewers. However, there is no such validated system. And so
the first thing we're going to do is to see if there's any system that can make a difference
with NIH staff. And there's about 500 staff that we could apply this to. So there are
reasonably large numbers.
Male Speaker: So I was very taken by the figure you had with the spikes in the priority scores. Can you give us a sense of what causes that underlying phenomenon, the degree to which the group dynamics of the study section congeal scores on those values, and what that might be telling you about the underlying review process?
Richard Nakamura: There's no question that there is a strong temptation, once general agreement seems to exist on a review committee that an application is in the award domain. Remember, we initially group applications for discussion based on preliminary scores, so from the first few reviews the committee knows which are in the possible award range. It looks like there's just a strong group temptation to agree on a score, and to have everyone vote that, in many cases. And it's mainly those who vote outside the range that cause differentiation among the scores. That's a tough position for individuals to take on many committees. So we think it's a serious problem. The scientists themselves can differentiate better than the committee scores reflect, but we haven't figured out a way to get that expressed. And this is one of the reasons for the discussion of ranking systems.
Male Speaker: Sorry, just as a follow-up, is that because,
say, the high and the low are 2.3 and 2.5, and so everyone has to vote in that range,
or --
Richard Nakamura: No, remember that the primary reviewers give scores of 1, 2, or 3. They only have those three digits to work with. A 1 is heavily discouraged by the system, so you're talking about 2 and 3. And they know that if you want to stay out of the gray zone, it has to be below 2. So, given those constraints, committees easily get stuck on a consensus.
Male Speaker: Right. So it used to be you could give the
decimal. Now it's like 2 and 2 --
Richard Nakamura: You cannot give the decimal.
Male Speaker: -- so everyone votes 2. And how many scores
go into a priority score?
Richard Nakamura: The overall priority score is based, usually,
on 20 scores that are generated around --
Male Speaker: So even though you have 20 scores, you're
still getting these --
Richard Nakamura: If the -- if the three reviewers say --
Male Speaker: I mean, it's actually hard to gerrymander things that way. [laughs] You'd have to work really hard to come up with a system that ends up with those outcomes.
Richard Nakamura: We have a problem.
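A minimal sketch of the arithmetic behind this exchange. The final overall impact score is derived from the full panel's 1-to-9 scores, scaled to the 10-to-90 range; the mean-times-ten formula and the rounding convention here are assumptions for illustration. It shows how a 20-member panel anchored on the assigned reviewers' consensus collapses onto the same few final values.

```python
def final_impact_score(member_scores: list[int]) -> int:
    """Final overall impact score: mean of all voting members' 1-9
    scores, times 10, rounded to the nearest whole number (the exact
    rounding convention is an assumption here)."""
    mean = sum(member_scores) / len(member_scores)
    return int(mean * 10 + 0.5)

# If ~20 members converge on the assigned reviewers' consensus of 2,
# the panel's output collapses onto 20 -- the spike described above.
consensus = [2] * 18 + [3, 3]         # two mild dissenters barely move it
print(final_impact_score(consensus))  # 21
print(final_impact_score([2] * 20))   # 20: full agreement
```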
Female Speaker: I wanted to follow up a little bit on the
idea that Jill was introducing earlier, of what's the assumption and what might be the
implications of trying to standardize things across study sections. And, in particular,
you made a comment that it's rare for scores above 20 to be considered for an award, and I'm thinking of the Societal and Ethical Issues in Research study section, which was formed to bring research ethics and ELSI together. Traditionally, I think it's well-recognized that those scores have always been high, in the sense of bad, remarkably so compared to other study sections. That's just the way that study section scores.
So I guess my question is, how would you handle something like that? And I understand what
you said earlier that you don't have a solution yet, you're looking into it. But as a related
question to that, then, I guess when you've created -- and I don't know whether there
were other study sections that got merged; presumably, there were -- when you've looked
at the experience of the disciplines that came in, or the groups that came in, and were
merged into a new study section, have you traced, at all, how those applications have
fared, and is that something that you have done at the study section level? And how would
you handle that going forward?
Richard Nakamura: We do do some tracing. Most of the tracing
is based on concerns or complaints by subsections of committees. So whenever we hear a subgroup
complain about the outcomes of mergers, et cetera, we do look at that kind of issue.
But I'll remind you that today, because of the huge drop in awards, every group is saying,
"We've been selected out for loss of funding," and we very rarely statistically see that
difference when we look at it. Everyone is losing money these days, and we're not picking
on any one group. But everyone seems to have the impression that they must be picked on
because so many of their compatriots are losing funding.
Male Speaker: So I'm glad you're taking a science-based
approach and trying to use actual data. That's wonderful. One of the things with the Internet-assisted
review is there's at least some level of the scoring that's locked in prior to the actual
in-person visit. And I'm wondering if you've looked at that data, you know, the pre and
post, to see how well does the scoring in isolation reflect what happens after the group
dynamics. And really where I'm going is, can we get to the point where we don't have to
have in-person at all, even though I really enjoy it?
Richard Nakamura: [laughs] We are asking the question, "What
does in-person contribute? What does -- is there a socialization process that's important
in peer review?" I can't tell you what the answer is. There's a lot of myth in peer review,
obviously, as well as the possibilities of science. And we want to systematically explore
this. Some of the methods allow us to know -- to determine, in the long run, whether
or not electronic review is the equivalent of face-to-face review.
Male Speaker: Richard, do you actually have funding to carry
out controlled experiments?
Richard Nakamura: Yes, in two ways. I have been allowed to keep some money we've saved through electronic review, a small amount. And we've also been given extra money by NIH to carry out the studies on bias. So I think if we get results that help clarify what we need to do to improve the quality of peer review, the institutes and Francis will give us money for that. Everyone agrees that corporate America believes you need resources and experiments like that in order to improve quality, and so we hope we will be able to do that ourselves.
Female Speaker: Just want to make a comment that I do think the social aspects of it have big impacts. I was on a review pretty recently, and the scores before we all met together were significantly different than when we were together. And there were good reasons and bad reasons for that. The good ones were that maybe someone could clarify some technical issue that someone else didn't understand. But then it also comes down to who was better at persuading the rest of the group to go one way or the other, or to "if we don't score this well enough, it's not going to get funded." And there were so many statements in there that [laughs] could sway it one way or the other so easily. And so I do think it's very important to have this decision-making expert to advise you.
Richard Nakamura: Yeah, I agree that there are lots of different dynamics going on that we have to understand. We are doing some studies of the shift in scoring that occurs, and the differences in scoring that occur on different committees. Some committees basically stay with the original scores, and some committees shift quite a bit. And it's up to you, the staff of the institute and of council, to make decisions about what's important based on the mission of the institute. And so you can ignore the scores. It's difficult to ignore, let's say, a non-discussed application, but within the scored range, I think the general expectation is that the relevance to the institute's mission is very important.
Male Speaker: So have there been any experiments of the
system of the type where you might send some subset of applications to multiple study sections
to compare what the outcome would be of the exact same application going through the system
at the same time?
Richard Nakamura: Okay, we are discussing the possibility of
reviewing some subset of applications twice to get the reliability and validity estimates
that we need for many of our other studies. You might understand that that makes a lot
of people very nervous. And I have not --
[laughter]
-- gotten the go-ahead yet to do those studies. But we think that it is important to develop
some power estimates of our system.
Male Speaker: My guess would be that there's going to be, you know, high concordance on the things that are not discussed, right? There's a bunch
of things that ultimately, you know, no study section would fund. And then there are going
to be a couple of things that would score in the 10s, no matter what study section you
would get to. And then there's going to be a whole 20 to 30 percent that's going to be
all over the place, and depending on who you send it to, would or would not have gotten
support.
Richard Nakamura: Yeah, I completely agree. I think most of us believe that the peer review system, as it's currently designed, works well when success rates are 30 to 40 percent, and not so well --
Male Speaker: Exactly, yep.
Richard Nakamura: -- when success rates are where they are right
now.
Male Speaker: And I think that's the point, you know: if that's true, then that's important to know, so that we can communicate it to policymakers and say that, in fact, when you draw down the funding, you're drawing it into a percentile range where, sure, you're funding good science, but there's a ton of great science that gets left on the table. And the only way to really ensure that we're funding all the best science, and to keep U.S. competitiveness, is to figure out how to drive the funding rates back up into the 25 or 30 percent range.
Richard Nakamura: I agree that current funding is catastrophic.
Male Speaker: I wonder if -- and you probably have compared
notes across the mechanisms used across different agencies for their peer review, and whether
they're seeing the same kind of issues that you're seeing. And I've served on a number
of panels across DOE or NSF, and they do things quite differently, and not better or worse.
But I'm wondering if you look at their numbers, do you see the same trends?
Richard Nakamura: All the agencies that I'm aware of in the United States are having a harder time with funding, and are seeing losses in science. And, at the same time, we're watching our related agencies in other countries. Last year I was in South Korea, presenting on peer review. They were extremely interested. They were frankly feeling that we were giving them an opportunity to emerge as a first-class science country. A representative from their government came to report on the 2013 budget; he was deeply apologetic for the funding he was going to announce, because it was only an increase of 5 percent. They have concluded that American success is built on science and technology, and on federal funding of science and technology. And they're all delighted, in some sense, that we've given them the opportunity to compete.
Female Speaker: So just one quick comment, and then another question. You know, DeeDee [spelled phonetically] was talking about some issues around review and grants that are not discussed. I just want to point out that, the way we order grants for discussion now, if two out of three reviewers score a grant well, and there's one, like, whacko review -- or, you know, sometimes it works the other way -- that grant now, according to the ordering, will not be discussed. In the old days, these kinds of outlier-scored grants would also be discussed. So I just want to suggest that we might think about how to go back to that.
But the real thing I wanted to mention was that I agree with you, it's really important
we get the best scientists to review grants. One of the pressures on us, who have done
that service, is that the review panels often come very close to when we have to submit
our own grants. And while there is this quasi policy that you get some dispensation for
that, that your grant submissions can be late, that does not apply to all the grants you
would submit. So if you're responding to an RFA or a PAR, you don't get any dispensation
for that. And many of us who are experienced reviewers, and I like to think that those of us sitting around this table are reasonable scientists, therefore have this additional pressure: we may have 10 other people's grants to review, and we've got to meet a deadline to fund our own labs. So that's part of this idea of trying to get good people to review. I think it would be great to revisit that policy and how it's actually implemented.
Richard Nakamura: I can tell you simply that we are revisiting
those policies.
Female Speaker: Great, I'm glad to hear that.
Richard Nakamura: And any individual reviewer can ask that something be discussed. We've also asked our SROs, when there are strong discrepancies in the preliminary scores, to try and get some resolution even if there's not going to be a discussion -- to try and get an understanding among the reviewers of which is the more correct score, and to have a discussion around that. It's to try and reduce the problem this causes for PIs who receive highly discrepant scores.
Female Speaker: Well, and this is related to the question of transparency to the applicant, as well as to council when it has to decide. Is there a movement towards including the distribution -- not the exact distribution, but aggregates -- in the critique back to the applicant, such as the mean, mode, standard deviation, and range?
Richard Nakamura: Of the individual scores -- applicants are provided with some of the criterion scores that feed into that information. But I'm not quite sure what advantage that would have.
Female Speaker: I think that would point to outlier reviews, because you would have a large standard deviation or range, and you might have a mean and a mode that disagree very much.
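A minimal sketch of the aggregates being suggested, computed over a hypothetical set of individual reviewer scores. A wide range, a large standard deviation, and a mean that disagrees with the mode are exactly the outlier signals described.

```python
from statistics import mean, mode, stdev

def score_summary(reviewer_scores: list[int]) -> dict:
    """Aggregate statistics a critique could carry back to an applicant
    without exposing any individual reviewer's score."""
    return {
        "mean": round(mean(reviewer_scores), 1),
        "mode": mode(reviewer_scores),
        "stdev": round(stdev(reviewer_scores), 1),
        "range": (min(reviewer_scores), max(reviewer_scores)),
    }

# A bimodal panel: mean and mode disagree and the spread is wide,
# which is exactly the outlier signal being described.
print(score_summary([2, 2, 2, 3, 3, 7, 7, 8]))
# {'mean': 4.2, 'mode': 2, 'stdev': 2.6, 'range': (2, 8)}
```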
Richard Nakamura: Well, we are asking SROs to try and get some
resolution of highly disparate reviews so that it's less confusing.
Male Speaker: I wanted to ask you whether there's an ongoing program of quality assurance and quality improvement for the SROs. Is their presiding over these study sections observed by outside people -- people within CSR, but outside of the study section?
Richard Nakamura: There are multiple forms of observation. Obviously, reviewers are there, and there's a hierarchy over each SRO. We expect the IRG chief to attend many of the meetings run within their groups. We expect division directors to attend them. I attend maybe six to eight study sections myself. Each study section is also observed by program staff. We hope there is better communication these days with the program staff, and a better sense on the part of program that, through their hierarchy, I'm very receptive to information about problems on review committees.
Male Speaker: Does the program staff often express the concern that they have actually been told, you know, not to participate, that they have to stay as quiet, silent observers?
Richard Nakamura: During the reviews themselves, that's correct.
We want the SRO to be the federal lead on that. Between sessions, I expect there to
be good communication. I think we've encouraged communication between program and review staff,
both at the higher and at the individual SRG levels. I think that our SROs are feeling
more able to work with the program staff, and to hear about problems. We're also trying
to create systems so electronically it'll be easier for program staff to follow what's
going on at reviews of their applicants.
Male Speaker: I just wanted to second Lucilla's [spelled
phonetically] point because I think it's a really good one, which is that you're already
collecting all this data on the scores, so even just getting a histogram back with some
summary statistics of what the scores were for your grant, at least, for me, it would
be -- I'd love to see that, right? I mean, it's sort of the same as if you were running
a course, and you wanted to compare -- and this would be really useful data to also have
in terms of trying to figure out whether -- to normalize scores across study sections or
not, right? You're saying that sometimes you're not necessarily funding all the best grants overall, you're just funding the best grants in each study section, right?
So both in terms of giving that data back to individual PIs, but also keeping it for
broader staff analysis, it would be very useful data to have. And you already have it, so
you're already collecting it.
Richard Nakamura: Yeah, one of the problems that we have on
any individual level collection of data is that we really go out of our way to make sure
that things can't be traced back to individual reviewers. Even the statistics that we kept
here, we deleted all the original PI information. We collected the numbers. And each time we
do these collections, we do it for a specific purpose. So I can imagine that something like
this could be arranged. We'd have to think about it a bit to make sure that the underlying
confidentiality is kept. But I do -- anything that we can do relatively readily like that,
I'm willing to consider.
Male Speaker: If you're a PI and you're getting a histogram of your scores, right? If you got a histogram of scores and it's bimodal, that tells you something different than a histogram where everyone gives you a 5. Then you go, "Okay, clearly I really need to rethink what I'm doing; it's out of the question." But if you got that bimodal score, you might say, "Well, clearly, some people didn't understand it. Maybe I could have addressed the point better," or something like that.
I think it would also help in the discussions that the program officers have with individual PIs, right? Because I know I had called my program officer and said, "Well, is the study section thinking blah blah blah?" and they point and go, "Look, there's no variance here," right? Then, you know, you could take that to mean what it means.
Richard Nakamura: I hear you, okay.
Male Speaker: So, Carlos, even if you strip identifying
information, there are scenarios where an individual could be linked to a score. We
can talk about it at lunch.
Male Speaker: Okay.
Male Speaker: But I think that would be the underlying --
Female Speaker: You mean the -- an individual reviewer on the study section?
Male Speaker: I mean a review panel that has one person with a given expertise. You read a written review because that person's assigned, and then you see a score that's outlying over here, and you say, "That reviewer gave me that score." Whether it's accurate or not, it still opens the door.
Female Speaker: But you do -- the score is from the three
reviewers.
Male Speaker: Anywhere in back [spelled phonetically], yeah,
exactly.
Female Speaker: You get the -- you get three reviewers' scores,
and they're, you know, [inaudible] --
Male Speaker: And they don't necessarily tell you what that
person's final score was. But if you see the printed scores, and there's an outlier, you
might link it. Correct or not, you're creating social havoc in the follow-up.
We'll talk at lunch.
Male Speaker: Okay. All right.
Richard Nakamura: Thank you very much.
Eric Green: Well, thank you, Richard. It was terrific
that you would come talk to us. So, thank you, council, for a good discussion. So we
will break for lunch now. Shall we try for 1:15? We will reconvene at 1:15, so 55 minutes,
go get your lunch, and thank you for a good morning.