Eric Green: So, Rich.
Richard Nakamura: Thank you very much, Eric, and it's indeed
a pleasure to come and present to you an update on the Center for Scientific Review.
It isn't very often that the president speaks about peer review, so we had to feature that
in a talk to the National Academy of Sciences, and in response to some recent concern about
peer review expressed in Congress. I think President Obama very nicely defended the notion
of peer review as important for our ability to have the top science. That was very much
appreciated.
Let me just tell you a little bit about the CSR mission, which is to see that NIH grant
applications receive fair, independent, expert, and timely reviews, free from inappropriate
influences, so NIH can fund the most promising research.
In 2012, CSR reviewed almost 55 percent of NHGRI's grant applications, 286 in total, which is roughly the workload of one of our study sections, though they were distributed across somewhat more than one. We have 174 standing study sections.
I'll go quickly over the path of applications through CSR; some points are a little obscure, but most of it you, as council, will be familiar with. All NIH extramural grant applications come through CSR, and from there they are referred to the NIH institutes and centers and to scientific review groups, or SRGs. We review the majority, about 65 percent, of grant applications for scientific merit for NIH. The other 35 percent or so are reviewed within the institutes, and in the case of NHGRI, a slightly larger fraction.
We have a somewhat unusual peer review process in that applicants, PIs, send in applications
based either on their own initiative or on funding announcements. There is peer review
either at CSR or the institutes. Those applications go through study sections, where they are
ranked. And then they are percentiled. So, as you know, percentiling normalizes the output
of all of our study sections. And that's an important step, and is increasingly being
discussed at NIH.
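To make the percentiling step concrete, here is a minimal sketch in Python. The three-round base, the tie handling, and the rounding are illustrative assumptions, not CSR's exact procedure; the idea is simply that an application's overall impact score is ranked against the scores its study section has recently assigned.

```python
from bisect import bisect_left

def percentile(impact_score: float, base_scores: list[float]) -> int:
    """Percentile of one application's overall impact score against a
    base of scores (e.g., the same study section's current plus two
    prior rounds). Lower impact scores are better, so the percentile
    is the share of base scores that are at least as good (as low)."""
    ranked = sorted(base_scores)
    # Count base scores strictly better (lower) than this one.
    better = bisect_left(ranked, impact_score)
    # Midpoint convention for ties, then express as a percentage.
    ties = ranked.count(impact_score)
    return round(100.0 * (better + 0.5 * ties) / len(ranked))

# Example: a score of 20 in a toy base where most scores sit higher.
base = [14, 17, 20, 20, 23, 28, 31, 35, 40, 47, 52, 58]
print(percentile(20, base))  # 25, i.e., the 25th percentile here
```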
At the IC, and in front of council, strategic goals are applied to the decision to make awards and to decide about funding. Funding itself is a more difficult activity these days because of the low number of awards and the low proportion of dollars for the number of applications.
Finally, we expect research to be the outcome, and outcome progress to be represented by
publications. And we hope these ultimately will affect the public health.
Within the center, 85,000 applications are received each year, and of those, the center reviews 58,000. This involves 16,000 reviewers, over 230 scientific review officers, and almost 1,500 review meetings a year. So this is really a factory of review. We like to think we do it well, and we do it efficiently.
This graphic is an overwhelming fact of life, both for CSR and all the institutes these
days. As we all know, especially since the doubling of the budget, the success rate has
been dropping. It is now at historic lows. The last couple of points on this graph -- the
last one is wrong. It should be flat over the 2011-2012 period, at 18 percent. However,
an 18 percent overall success rate conceals an important issue: the odds of an award coming from any given application have fallen to approximately 10 percent. That essentially means that of our primary constituency, the scientific community, 90 percent are inherently unhappy with CSR at any given time, which gave me second thoughts about accepting this position. However, I think we, the institute directors
and myself, all believe that this is a critical role, and very important for the future of
science in the United States, and that we all must step up to the plate when it comes
to being invited to serve. This applies not only to the institute directors, but also to those who serve on review itself. They, too, suffer the consequences of this curve, not only in their personal activities and in their labs, but as friends of many PIs who are having a hard time.
Some review issues that we faced early in 2013 look like this. One critical thing was grade inflation. We were seeing quite a bit of inflation, especially since enhancing peer review started and the new scoring system came in. As you know, the scoring system went from the old 1.0-to-5.0 scale, or 100 to 500 overall, to 1 to 9.
And this is an examination of how scores worked out to percentiles. It's hard to see the pointer, but down at the bottom is October 2009. And you can see the distribution of scores representing percentile scores of 20, 25, and 30 at the top. As you may know, it's rare that scores above 20 can be considered for an award. So, essentially, a score of 7 represented a percentile score of 20; a score of 13 represented a percentile score of 25; and a 19, a percentile score of 30. That essentially means that two digits out of the 1-through-9 scale were available for indicating scores within the award range. This got compressed over time, so that towards the end, a 9 represented a percentile score of 20, and a 15 represented a percentile score of 25.
This year, we implemented some changes in order to decompress the scores, and this is an indication. We've gone back to better decompression than in 2009, and we're not finished yet, so that now we have a full two digits available to us, and in some cases a third digit. I'll show you a little bit of how that works.
In 2009, the scoring chart was developed. And we have changed the scoring chart for
the beginning of 2013 slightly. This is felt to be a tweak, but an important tweak, because
one thing that had become the practice in a number of review committees is that the
strengths and weaknesses were treated as inverses. As you can see here, a number 1 score is seen
as exceptionally strong, with essentially no weaknesses, whereas 9 is very few strengths
and numerous major weaknesses. We started to hear some reviewers say, "Well, this has
no weaknesses. Therefore, I'm giving it 1."
This was a severe distortion of the original concept of enhancing peer review, in which
the significance or the strengths of the application were supposed to be score-driving features,
not counting weaknesses. This wasn't universally true, but it was true enough that we felt we had to make a change, and so we created this scoring chart. It actually looks more
different from the other scoring chart than necessary, largely to impress upon our review
committees that the scoring chart had changed. There's an emphasis on overall impact on the
chart, and an emphasis that you cannot get into the high category, that is, a score of
1, 2, or 3, unless the application had major importance. And you could get low even if
you had moderate to high importance, if there were major weaknesses.
The other point on this chart is anchoring the scoring at 5, which is the section in dark gray at the bottom. It turns out that a number of reviewers were thinking that a 5, based on the old 100-to-500 system, was an extremely bad score, and were reluctant to use it. So they were essentially compressing the range from 1 to 5. Making this anchor explicit helps them spread scores more.
And here is the outcome of the first round of decompression. The pink line is the most recent one. It looks like a relatively small change, but it crosses the 50 percent line of overall scores at a score of 59, which is a significant change from the old curve, where it was crossing that line at 49. More important is what happens in the 10-to-30 range: there has been about a 30-percent shift in the scores from the yellow line to the pink line. It's important to note the jump that occurs at 20. That jump is an accumulation of scores at the merit score of 20. And that kind of jump is within the gray zone of many institutes, and therefore provides program staff with relatively little information. We're working on that particular issue.
This is something you rarely see; in fact, I don't know of any other presentation of the preliminary impact scores. These are the scores we get before the discussion. And you can see the shape of this curve: it's a skewed normal distribution, skewed to the positive end of the scale. But it does approximate a normal distribution, which suggests that our use of the percentile score, as a translation of these impact scores, is not very helpful or accurate. You can see that the peak is between 30 and 40, or preliminary impact scores of 3 and 4. This means that 50 percent of the applications are squeezed within this range, which is not very helpful to you or to program staff.
After discussion, it gets a little better: the non-discussed applications on the right are 46 percent of the applications, and the discussion respreads scores a bit. Here, the peak is still between 30 and 40, but at this point a quarter of the applications are below it, rather than half. This provides a little more scoring range for program staff to work with and interpret.
However, here's another look at the way scores from reviews are coming out. You can see there are huge jumps at 10, 20, 30, 40, et cetera. These scores are on their sides, so they're a little harder to see, and then they come down. So there really is massive agreement around certain round scores, and having that kind of agreement on review is not very helpful. Figuring out how to spread off of, particularly, 10, 20, and 30 would be helpful. Most review committees report that they can easily differentiate more finely than they are actually differentiating with their scores.
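A minimal sketch of the kind of check being described: measuring how much of a panel's output piles up exactly on the round scores. The scores here are hypothetical, purely to illustrate the spikes at 10, 20, and 30.

```python
from collections import Counter

def round_score_mass(final_scores: list[int]) -> float:
    """Fraction of final impact scores (10-90 scale) landing exactly on
    multiples of 10 -- a rough index of the spikes at 10, 20, 30."""
    hits = sum(1 for s in final_scores if s % 10 == 0)
    return hits / len(final_scores)

# Hypothetical panel outputs, with heavy agreement at 20 and 30.
scores = [18, 20, 20, 20, 22, 25, 30, 30, 30, 30, 33, 37, 40, 40, 51]
print(f"{round_score_mass(scores):.0%} on multiples of 10")  # 60%
print(Counter(s // 10 * 10 for s in scores))  # mass per decade
```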
There remains a problem with the scoring system, and we are having a discussion of ranking, of going to a more frank ranking system, to try and [inaudible] this issue. Another suggestion that has been made by a number of reviewers would be the opportunity to use half points to create more differentiation.
However, when you go back and look at the old 100-to-500 scale, you get a similar kind of peaking. So the great compression that we had under the old system, and the inclination to agree on an individual score that overlaps with other scores, seems to be a chronic problem, and we need another way of dealing with it.
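For illustration, here is one minimal way a frank ranking could force the differentiation that ratings conceal: each reviewer's raw scores are converted into a strict ordering, so tied 2s can no longer collapse onto a single value. This is a sketch of the general idea, not a proposal CSR has adopted.

```python
def to_ranks(scores: dict[str, float]) -> dict[str, int]:
    """Convert one reviewer's raw scores into a strict ranking of the
    applications they reviewed (lower score = better, rank 1 = best)."""
    ordered = sorted(scores, key=scores.get)  # ties broken by order seen
    return {app: i + 1 for i, app in enumerate(ordered)}

print(to_ranks({"A": 2, "B": 2, "C": 3, "D": 2}))
# {'A': 1, 'B': 2, 'D': 3, 'C': 4}: every application is separated
```

A full system would still need a way to merge rankings across reviewers and panels, which is where the percentiling question above comes back in.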
Diversity and fairness in peer review. As you can imagine, fairness is the key goal of peer review, and needs to be its hallmark. In 2011, as I was coming on board, the Ginther et al. article in "Science" came out, which showed that though applications with strong priority scores were equally likely to be funded regardless of race, African-American applicants were 10 percentage points less likely to receive NIH research funding compared to whites.
Ten percentage points sounds fairly innocuous or small in some sense, though it was highly significant. But consider that, for the number of African-American scientists who applied to NIH, compared with an equal number of matched white control scientists, the African-American scientists were getting 55 percent of the awards expected for the white scientists. So that 10-percentage-point difference translated into a huge difference in likelihood of success. Some suggested explanations in that paper were the possibility of bias in peer review, and a cumulative disadvantage that could be experienced by African-American scientists based on differences in education and other forms of bias over the course of their careers.
NIH felt that this issue was extremely important, because NIH believes that it should represent fairness for all groups. And the level of this difference got the attention of Francis Collins and Larry Tabak, the director and deputy director of NIH, to the extent that they immediately formed a set of internal committees, raised the issue to the level of the advisory committee to the director of NIH, and asked that immediate action be taken, but also that both CSR and other groups at NIH get to the bottom of the cause of this disparity.
So a peer review subcommittee was set up. By the way, we accept that whatever is going on happens before award decisions are made; the impact score for applications entirely determines the award rate. So we accept that if there's any problem in the NIH system, it must be occurring within the peer review system. It remains possible that there is a difference in the applications coming in.
The advisory committee asked that a peer review subcommittee be set up. That's been done; I'm a co-chair of that committee. It asked that we provide more information for applicants whose outcome is non-discussion. African Americans have many more applications not discussed than other scientists. That's been done. They asked for text analysis of application summary statements and discussions, for an evaluation of anonymized applications, and for diversity awareness training of NIH staff. All of those things are being worked on, and we're hoping to do this as a true experimental science, so that we can know the causes of these disparities.
This is the initial membership of the working group on diversity's subcommittee on peer review; it's mainly social scientists, to try and look at this problem. We're also bringing on board other scientists with strong records on peer review, and more biologically oriented scientists.
We have also increased the representation of minority groups on our study sections.
Just since I've come on board, we've increased African Americans on CSR study sections by
42 percent, and Hispanic scientists by 22 percent. I show you those in highlight, but
if you look at the rest of the chart, you can see that from 2006 to 2011, numbers had
actually been declining across the board. And so I can't do more than say that we've
brought back the numbers to a proportion that's about 10 percent right now. This is about
double the representation of underrepresented minority scientists in the award pool of NIH.
We've also created an early career reviewer program to help early career scientists who
are just beginning life as an independent researcher to understand more about the Center
for Scientific Review and review. This is to train qualified scientists without significant
experience, to help emerging researchers advance their careers, and to enrich the existing
pool of NIH reviewers by including scientists from less research intensive institutions.
The requirements for being an early career reviewer are: not having reviewed for NIH before, beyond one mail review; holding a faculty appointment, where we are told by the university that the individual is expected to become an independent investigator; having established an active independent research program; and not having had an R01 or equivalent. There is no more than one of these individuals per review panel, and they are given a very light review load as a tertiary reviewer, between two and four applications. So their primary job is to look and learn.
They are under the wing of the SRO and the chair of the panel. So far, nearly 700 have served on study sections, and of those, 32 percent are underrepresented minority scientists. But the program includes scientists from all universities who are considered independent.
Feedback so far has been very positive. Ninety-eight percent found the ECR experience to be useful.
Ninety percent reported themselves to be in a better position to write their own grants.
Ninety-seven percent would recommend the ECR experience to a colleague. We have not been
getting negative comments from reviewers, and the experience of the SROs and the chairs
is very positive.
For your own information, this is how to apply to the ECR program: email csrearlyreviewer@mail.nih.gov, all one word. You'll be given copies of my presentation, if you don't want to note this down.
CSR is also looking at additional review platforms. As part of a general effort to try and ensure
that we have appropriate review platforms for different circumstances, we're trying
different kinds, such as telephone-assisted meetings, video-assisted discussions, internet-assisted
meetings, and telepresence meetings. These are various forms of either phone, video,
or completely asynchronous electronic reviews. We're also looking at editorial board style
meetings. We are looking at the strengths and weaknesses of these various forms; what
kinds of reviews they are best for; people's impressions of these forms of review compared
to face-to-face, our standard form of review. There's a lot of interest in editorial board
style meetings, since these have been used for the director's special application program.
And it is felt that that might be a style for the future, but it is more expensive than
regular face-to-face review.
Here's just an example of a video conference-based study section. Similar to telepresence, this
is where reviewers on both sides of the room -- the ones on the back are just on video;
the ones in the front are live. The ones in the back appear nearly full size, so it's
possible to see their expressions as they conduct their reviews, and to see whether
or not they're paying attention.
[laughter]
One of the nice things about this is that the microphones are set up so they're directional, and you can hear the voice appearing to come from the person who's speaking. It's a very nice feature of this setup.
I'm going to shift over to an issue which has been bothering many scientists, and has been a source of frequent complaints to CSR. And this is the issue of the A2 application.
This is the graphic that convinced NIH directors and NIH that we should address the A2 issue. Here we see that awards to A0s went from about 60 percent in 1998 to the lowest proportion of awards by 2008. The A2 awards had risen and passed the A0s. And you can see from the intersection that seemed to be looming, it was likely that A2s would pass the A1s in proportion of awards as well.
After the A2 was eliminated in 2008, some applications were grandfathered, but as the grandfathering ran out, those curves went down. The A0 rebounded strongly, as we had hoped, and passed the A1. The latest results here are from 2011; we're looking for 2012. I'm very concerned that the A1 may intersect with the A0.
What these lines suggested, and what we heard in the actual conversations on review panels, was that a queue was being set up: people were expected to wait until their A2 application before receiving an award. In many cases, it was very easy for the committee to pick out a few applications that they wanted to see get an award; these became more urgent at the A1, and more urgent still at the A2 level. It became a strong temptation to line up PIs in order of their applications. When the A2 was eliminated, the average time to funding dropped quite a bit, from a little over 90 weeks to a little over 50 weeks.
So given these two things, and the lack of difference between new investigators and established investigators, the Office of Extramural Research determined through a lot of statistics that there were no groups that were substantially more affected by the loss of the A2, and the time to award went down dramatically. And so it was felt that eliminating the A2 was the right decision. Also, the order of scoring seemed to hold across this period of time, so there were no dramatic changes occurring; the A2 queue was just delaying when the award would occur.
Now I want to talk a little bit about the future, and then I'll take some of your comments.
One of the ways in which we think we can improve the outputs of CSR scoring is if we look at
the distribution of applications across study sections.
The current distribution is quite non-random. First, we base distribution on the areas of science. Second, we base distribution on PI preference: PIs get to ask for the study section they'd like to have, we honor those requests 80 percent of the time, and 75 percent of scientists ask for a review committee.
This means, and many people have observed, that there may be a non-random distribution of the highest quality applications. If that's true, and we normalize the output of all study sections, then in council and in other places you do not necessarily get to look at the best applications across all the applications. You're just looking at the best applications from each study section.
We are looking at how that works. We are trying to establish statistics on whether or not that impression is real. We are going to be scoring or re-ranking applications from a broad set of study sections, to see what the relationship of that ranking is to the individual study sections' rankings. If there are consistent differences across study sections, that will cause us to ask the question, "How can we look more generally across study sections in a systematic way, and provide you with that information?"
We're trying to develop better tools for applicants for referral and review. This is part of a general process to make CSR more user-friendly for applicants, and to make the application intake system more helpful. We are also trying to increase diversity and reduce award disparities, obviously. In general, we'd like to provide better service to applicants and to the ICs. We are having more discussions with program staff to see what we can do to make our summary statements the most helpful.
Finally, we're trying to develop a science of peer review. We've established an office with money to do experiments in peer review; I've hinted at some of those ideas earlier in this talk. But I think if we are to keep the U.S. lead, especially in this time of difficult funding, we've got to figure out ways to make it easier to find the best applications and make awards to those, and to provide you, through CSR, with the best information for making those decisions.
Thank you very much.
[applause]
Eric Green: Thank you, Richard. Well, we have time for
questions. Let me start with the first one for this -- I forgot the exact three letters
-- the training one, where you take youngsters, and basically give --
Richard Nakamura: The early career reviewer program, ECR.
Eric Green: Do you -- what's the curriculum that you give
to them? Is it purely just by showing up, or do you have materials, or mock study sections,
or other things that one could imagine you could do?
Richard Nakamura: We don't do mock study sections. What we do is the ECR is given both a PowerPoint set plus training from the SRO, and then they're given guidance. They can't serve on more than two full peer review sessions. And then they're given guidance about how to work within it. So far, the reaction from the SROs and from the chairs is that this has been sufficient. In general, the ECRs take this very seriously. They work very hard on the reviews that they do. They have not caused embarrassment by lack of knowledge.
Female Speaker: Yes, so I just wanted to share some thoughts
and get your impression about this set of slides that you showed about the distribution
of review scores. Because, at some level, one is assuming that the quality of these
proposals and the science that is being proposed is normally distributed. But it's not a random
population of proposers, and so that assumption may be fallacious at some level.
But also, I think, in my own personal experience on study section, there is this tension between doing a review on absolute versus relative terms. On the particular study section on which I served for, say, four years, at the beginning of my tenure the overall quality of the proposals was just not very strong, on absolute terms. That's not to say there weren't some good proposals. But then we were sort of forced -- you know, our SRO kept saying, "Oh, you know, your mean is much lower than all the other study sections." And this refers to what you were talking about just at the end.
On the other hand, you know, does it make sense if, for some reason, a particular batch
of proposals in general is not strong, to rate them on this sort of relative scale?
On the other hand, if a particular batch of proposals are all very, very strong, you're
naturally going to get that compression, either at the top or the bottom.
So I'm just wondering, you know, sort of what recognition there is about this -- you know,
when you look at statistics, statistics can tell you one thing. But if you don't look
at the priors, you're sort of maybe being led to a slightly false direction or strategy.
So I'm just curious to hear what your thoughts are about it.
Richard Nakamura: Yes, we're not planning on doing any given thing. Right now, we observe that we have a problem, that we've long had this problem, that different efforts to address the problem have systematically failed, and that there's this broader issue of what's going on across study sections, which we've never addressed.
So we're planning on having some meetings with individuals who are experts in decision
theory. Obviously, if this were easy, both corporations and NIH would have solved this
a long time ago. This is a deep, difficult problem. And so far, we remain with the idea that strong scientists, excellent scientists, are the best judges of science. Beyond that, I think there are many questions about how to get the maximal information from them, either as a committee or as individuals, and one of the things we should be doing is exploring how we get to answers about that. I'm not proposing, and don't plan to propose, "We know the answer; here is what we're going to do." I think what we'd like to do is to say, "Here is a theoretically better way of approaching this. Let's try it with some study sections," and maybe within an IRG. I think we're going to have continued discussions with the scientific community about any changes that we're thinking about.
Male Speaker: I was interested in your actions related to the diversity issue in application success rates. There has been research showing that there's gender and racial bias in the review of grants. And given that knowledge, why isn't there something where you're trying to address that with the review panelists themselves? They bring it with them; they're steeped in societal stereotypes. It's nice to have diversity awareness training for the NIH staff, but they're not the ones making the reviews, the scores, and the decisions for funding.
Richard Nakamura: It is the intent of the advisory committee
to the director to address that issue with the reviewers if we determine that there is
bias in the peer review system. Right now, we're running some experiments which we hope
will show not only whether or not there are differences in the quality of applications,
but differences that could be attributable to bias and the proportions that that might
account for. I think we need both pieces of information to develop the right intervention.
And so -- but if it turns out that there is bias in the peer review system, it is the
intent to try and train reviewers.
The ACD asked us to take any validated system for countering bias and apply it, first to
NIH staff and then to the reviewers. However, there is no such validated system. And so
the first thing we're going to do is to see if there's any system that can make a difference
with NIH staff. And there's about 500 staff that we could apply this to. So there are
reasonably large numbers.
Male Speaker: So I was very taken by the figure you had with the spikes in the priority scores. Can you give us a sense of what causes that underlying phenomenon, the degree to which the group dynamics of the study section congeal scores on those values, and what that might be telling you about the underlying review process?
Richard Nakamura: There's no question that there is a strong temptation, once general agreement seems to exist on a review committee that an application is in the award domain. Remember, we initially group applications for discussion based on preliminary scores, so from the first few reviews the committee knows which are in the possible award range. It looks like there's just a strong group temptation to agree on a score, and to have everyone vote that, in many cases. And it's mainly those who vote outside the range that cause differentiation among the scores. That's a tough position for individuals to take on many committees. So we think it's a serious problem. The scientists themselves can differentiate better than the committee scores reflect, but we haven't figured out a way to get that expressed. And this is one of the reasons for the discussion of ranking systems.
Male Speaker: Sorry, just as a follow-up, is that because,
say, the high and the low are 2.3 and 2.5, and so everyone has to vote in that range,
or --
Richard Nakamura: No, remember that the primary reviewers give scores of 1, 2, or 3. They only have those three digits to work with. A 1 is heavily discouraged by the system, so you're talking about 2 and 3. And they know that if you want to stay out of the gray zone, it has to be below 2. So, given those constraints, committees easily get stuck on a consensus.
Male Speaker: Right. So it used to be you could give the
decimal. Now it's like 2 and 2 --
Richard Nakamura: You cannot give the decimal.
Male Speaker: -- so everyone votes 2. And how many scores
go into a priority score?
Richard Nakamura: The overall priority score is based, usually,
on 20 scores that are generated around --
Male Speaker: So even though you have 20 scores, you're
still getting these --
Richard Nakamura: If the -- if the three reviewers say --
Male Speaker: I mean, it's actually hard to gerrymander things that way. [laughs] You'd have to work really hard to come up with a system that ends up with those outcomes.
Richard Nakamura: We have a problem.
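A minimal sketch of the arithmetic behind this exchange. The final overall impact score is derived from the full panel's 1-to-9 scores, scaled to the 10-to-90 range; the mean-times-ten formula and the rounding convention here are assumptions for illustration. It shows how a 20-member panel anchored on the assigned reviewers' consensus collapses onto the same few final values.

```python
def final_impact_score(member_scores: list[int]) -> int:
    """Final overall impact score: mean of all voting members' 1-9
    scores, times 10, rounded to the nearest whole number (the exact
    rounding convention is an assumption here)."""
    mean = sum(member_scores) / len(member_scores)
    return int(mean * 10 + 0.5)

# If ~20 members converge on the assigned reviewers' consensus of 2,
# the panel's output collapses onto 20 -- the spike described above.
consensus = [2] * 18 + [3, 3]         # two mild dissenters barely move it
print(final_impact_score(consensus))  # 21
print(final_impact_score([2] * 20))   # 20: full agreement
```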
Female Speaker: I wanted to follow up a little bit on the
idea that Jill was introducing earlier, of what's the assumption and what might be the
implications of trying to standardize things across study sections. And, in particular,
you made a comment that it's rare for scores above 20 to be considered for an award, and I'm thinking of the Societal and Ethical Issues in Research study section, which was formed to bring research ethics and ELSI together. Traditionally, I think it's well-recognized that those scores have always been high, in the sense of bad, remarkably so compared to other study sections. That's just the way that study section scores.
So I guess my question is, how would you handle something like that? And I understand what
you said earlier that you don't have a solution yet, you're looking into it. But as a related
question to that, then, I guess when you've created -- and I don't know whether there
were other study sections that got merged; presumably, there were -- when you've looked
at the experience of the disciplines that came in, or the groups that came in, and were
merged into a new study section, have you traced, at all, how those applications have
fared, and is that something that you have done at the study section level? And how would
you handle that going forward?
Richard Nakamura: We do do some tracing. Most of the tracing
is based on concerns or complaints by subsections of committees. So whenever we hear a subgroup
complain about the outcomes of mergers, et cetera, we do look at that kind of issue.
But I'll remind you that today, because of the huge drop in awards, every group is saying,
"We've been selected out for loss of funding," and we very rarely statistically see that
difference when we look at it. Everyone is losing money these days, and we're not picking
on any one group. But everyone seems to have the impression that they must be picked on
because so many of their compatriots are losing funding.
Male Speaker: So I'm glad you're taking a science-based
approach and trying to use actual data. That's wonderful. One of the things with the Internet-assisted
review is there's at least some level of the scoring that's locked in prior to the actual
in-person visit. And I'm wondering if you've looked at that data, you know, the pre and
post, to see how well does the scoring in isolation reflect what happens after the group
dynamics. And really where I'm going is, can we get to the point where we don't have to
have in-person at all, even though I really enjoy it?
Richard Nakamura: [laughs] We are asking the question, "What
does in-person contribute? What does -- is there a socialization process that's important
in peer review?" I can't tell you what the answer is. There's a lot of myth in peer review,
obviously, as well as the possibilities of science. And we want to systematically explore
this. Some of the methods allow us to know -- to determine, in the long run, whether
or not electronic review is the equivalent of face-to-face review.
Male Speaker: Richard, do you actually have funding to carry
out controlled experiments?
Richard Nakamura: Yes, in two ways. I have been allowed to keep some money we've saved through electronic review, a small amount. And we've also been given extra money by NIH to carry out the studies on bias. So I think if we get results that help clarify what we need to do to improve the quality of peer review, the institutes and Francis will give us money for that. Everyone agrees that corporate America believes you need resources and experiments like that in order to improve quality, and so we hope we will be able to do that ourselves.
Female Speaker: Just want to make a comment that I do think the social aspects of it have big impacts. I was on a review pretty recently, and the scores before we all met together were significantly different than when we were together. And there were good reasons and bad reasons for that. The good ones were that maybe someone could clarify some technical issue that someone else didn't understand. But then it also comes down to who was better at persuading the rest of the group to go one way or the other, or to "if we don't score this well enough, it's not going to get funded." And there were so many statements in there that [laughs] could sway it one way or the other so easily. And so I do think it's very important to have this decision-making expert to advise you.
Richard Nakamura: Yeah, I agree that there are lots of different dynamics going on that we have to understand. We are doing some studies of the shift in scoring that occurs, and the differences in scoring that occur on different committees. Some committees basically stay with the original scores, and some committees shift quite a bit. And it's up to you, the staff of the institute and of council, to make decisions about what's important based on the mission of the institute. And so you can ignore the scores. It's difficult to ignore, let's say, a non-discussed application, but within the scored range, I think the general expectation is that the relevance to the institute's mission is very important.
Male Speaker: So have there been any experiments of the
system of the type where you might send some subset of applications to multiple study sections
to compare what the outcome would be of the exact same application going through the system
at the same time?
Richard Nakamura: Okay, we are discussing the possibility of
reviewing some subset of applications twice to get the reliability and validity estimates
that we need for many of our other studies. You might understand that that makes a lot
of people very nervous. And I have not --
[laughter]
-- gotten the go-ahead yet to do those studies. But we think that it is important to develop
some power estimates of our system.
Male Speaker: My guess would be that there's going to be, you know, high concordance on the things that are not discussed, right? There's a bunch
of things that ultimately, you know, no study section would fund. And then there are going
to be a couple of things that would score in the 10s, no matter what study section you
would get to. And then there's going to be a whole 20 to 30 percent that's going to be
all over the place, and depending on who you send it to, would or would not have gotten
support.
Richard Nakamura: Yeah, I completely agree. I think most of us believe that the peer review system, as it's currently designed, works well when success rates are 30 to 40 percent, and not so well --
Male Speaker: Exactly, yep.
Richard Nakamura: -- when success rates are where they are right
now.
Male Speaker: And I think that's the point, you know: if that's true, then that's important to know, so that we can communicate it to policymakers and say that, in fact, when you draw down the funding, you're drawing it into a percentile range where, sure, you're funding good science, but there's a ton of great science that gets left on the table. And the only way to really ensure that we're funding all the best science, and to keep U.S. competitiveness, is to figure out how to drive the funding rates back up into the 25 or 30 percent range.
Richard Nakamura: I agree that current funding is catastrophic.
Male Speaker: I wonder if -- and you probably have compared
notes across the mechanisms used across different agencies for their peer review, and whether
they're seeing the same kind of issues that you're seeing. And I've served on a number
of panels across DOE or NSF, and they do things quite differently, and not better or worse.
But I'm wondering if you look at their numbers, do you see the same trends?
Richard Nakamura: All the agencies that I'm aware of in the United States are having a harder time with funding, and are seeing losses in science. And, at the same time, we're watching our related agencies in other countries. Last year I was in South Korea, presenting on peer review. They were extremely interested. They were frankly feeling that we were giving them an opportunity to emerge as a first-class science country. A representative from their government came to report on the 2013 budget; he was deeply apologetic for the funding he was going to announce, because it was only an increase of 5 percent. They have concluded that American success is built on science and technology, and on federal funding of science and technology. And they're all delighted, in some sense, that we've given them the opportunity to compete.
Female Speaker: So just one quick comment, and then another question. You know, DeeDee [spelled phonetically] was talking about some issues around review and grants that are not discussed. I just want to point out that, the way we order grants for discussion now, if two out of three reviewers score a grant well, and there's one, like, whacko review -- or, you know, sometimes it works the other way -- that grant now, according to the ordering, will not be discussed. In the old days, these kinds of outlier-scored grants would also be discussed. So I just want to suggest that we might think about how to go back to that.
But the real thing I wanted to mention was that I agree with you, it's really important
we get the best scientists to review grants. One of the pressures on us, who have done
that service, is that the review panels often come very close to when we have to submit
our own grants. And while there is this quasi policy that you get some dispensation for
that, that your grant submissions can be late, that does not apply to all the grants you
would submit. So if you're responding to an RFA or a PAR, you don't get any dispensation
for that. And many of us who are experienced reviewers, and I like to think that those of us sitting around this table are reasonable scientists, therefore have this additional pressure: we may have 10 other people's grants to review, and we've got to meet a deadline to fund our own labs. So that's part of this idea of trying to get good people to review. I think it would be great to revisit that policy and how it's actually implemented.
Richard Nakamura: I can tell you simply that we are revisiting
those policies.
Female Speaker: Great, I'm glad to hear that.
Richard Nakamura: And any individual reviewer can ask that something be discussed. We've also asked our SROs, when there are strong discrepancies in the preliminary scores, to try and get some resolution even if there's not going to be a discussion -- to try and get an understanding among the reviewers of which is the more correct score, and to have a discussion around that. It's to try and reduce the problem this causes for PIs who receive highly discrepant scores.
Female Speaker: Well, and this is related to the question of transparency to the applicant, as well as to council when it has to decide. Is there a movement towards including the distribution -- not the exact distribution, but aggregates -- in the critique back to the applicant, such as the mean, mode, standard deviation, and range?
Richard Nakamura: Of the individual scores -- applicants are provided with some of the criterion scores that feed into that information. But I'm not quite sure what advantage that would have.
Female Speaker: I think that would point to outlier reviews, because you would have a large standard deviation or range, and you might have a mean and a mode that disagree very much.
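A minimal sketch of the aggregates being suggested, computed over a hypothetical set of individual reviewer scores. A wide range, a large standard deviation, and a mean that disagrees with the mode are exactly the outlier signals described.

```python
from statistics import mean, mode, stdev

def score_summary(reviewer_scores: list[int]) -> dict:
    """Aggregate statistics a critique could carry back to an applicant
    without exposing any individual reviewer's score."""
    return {
        "mean": round(mean(reviewer_scores), 1),
        "mode": mode(reviewer_scores),
        "stdev": round(stdev(reviewer_scores), 1),
        "range": (min(reviewer_scores), max(reviewer_scores)),
    }

# A bimodal panel: mean and mode disagree and the spread is wide,
# which is exactly the outlier signal being described.
print(score_summary([2, 2, 2, 3, 3, 7, 7, 8]))
# {'mean': 4.2, 'mode': 2, 'stdev': 2.6, 'range': (2, 8)}
```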
Richard Nakamura: Well, we are asking SROs to try and get some
resolution of highly disparate reviews so that it's less confusing.
Male Speaker: I wanted to ask you whether there's an ongoing program of quality assurance and quality improvement for the SROs. Is their presiding over these study sections observed by outside people -- people within CSR, but outside of the study section?
Richard Nakamura: There are multiple forms of observation. Obviously, reviewers are there, and there's a hierarchy over each SRO. We expect the IRG chief to attend many of the meetings run within their groups. We expect division directors to attend them. I attend maybe six to eight study sections myself. Each study section is also observed by program staff. We hope there is better communication these days with the program staff, and a better sense on the part of program that, through their hierarchy, I'm very receptive to information about problems on review committees.
Male Speaker: Does the program staff often express the concern that they have actually been told, you know, not to participate, that they have to stay as quiet, silent observers?
Richard Nakamura: During the reviews themselves, that's correct.
We want the SRO to be the federal lead on that. Between sessions, I expect there to
be good communication. I think we've encouraged communication between program and review staff,
both at the higher and at the individual SRG levels. I think that our SROs are feeling
more able to work with the program staff, and to hear about problems. We're also trying
to create systems so electronically it'll be easier for program staff to follow what's
going on at reviews of their applicants.
Male Speaker: I just wanted to second Lucilla's [spelled
phonetically] point because I think it's a really good one, which is that you're already
collecting all this data on the scores, so even just getting a histogram back with some
summary statistics of what the scores were for your grant, at least, for me, it would
be -- I'd love to see that, right? I mean, it's sort of the same as if you were running
a course, and you wanted to compare -- and this would be really useful data to also have
in terms of trying to figure out whether -- to normalize scores across study sections or
not, right? You're saying that sometimes you're not necessarily funding all the best grants overall, you're just funding the best grants in each study section, right?
So both in terms of giving that data back to individual PIs, but also keeping it for
broader staff analysis, it would be very useful data to have. And you already have it, so
you're already collecting it.
Richard Nakamura: Yeah, one of the problems that we have on
any individual level collection of data is that we really go out of our way to make sure
that things can't be traced back to individual reviewers. Even the statistics that we kept
here, we deleted all the original PI information. We collected the numbers. And each time we
do these collections, we do it for a specific purpose. So I can imagine that something like
this could be arranged. We'd have to think about it a bit to make sure that the underlying
confidentiality is kept. But I do -- anything that we can do relatively readily like that,
I'm willing to consider.
Male Speaker: If you're a PI and you're getting a histogram of your scores, right? If you got a histogram of scores and it's bimodal, that tells you something different than a histogram where everyone gives you a 5. Then you go, "Okay, clearly I really need to rethink what I'm doing; it's out of the question." But if you got that bimodal score, you might say, "Well, clearly, some people didn't understand it. Maybe I could have addressed the point better," or something like that.
I think it would also help in the discussions that the program officers have with individual PIs, right? Because I know I had called my program officer and said, "Well, is the study section thinking blah blah blah?" and they point and go, "Look, there's no variance here," right? Then, you know, you could take that to mean what it means.
Richard Nakamura: I hear you, okay.
Male Speaker: So, Carlos, even if you strip identifying
information, there are scenarios where an individual could be linked to a score. We
can talk about it at lunch.
Male Speaker: Okay.
Male Speaker: But I think that would be the underlying --
Female Speaker: You mean the -- an individual reviewer on the study section?
Male Speaker: I mean a review panel that has one person with a given expertise. You read a written review because that person's assigned, and then you see a score that's outlying over here, and you say, "That reviewer gave me that score." Whether it's accurate or not, it still opens the door.
Female Speaker: But you do -- the score is from the three
reviewers.
Male Speaker: Anywhere in back [spelled phonetically], yeah,
exactly.
Female Speaker: You get the -- you get three reviewers' scores,
and they're, you know, [inaudible] --
Male Speaker: And they don't necessarily tell you what that
person's final score was. But if you see the printed scores, and there's an outlier, you
might link it. Correct or not, you're creating social havoc in the follow-up.
We'll talk at lunch.
Male Speaker: Okay. All right.
Richard Nakamura: Thank you very much.
Eric Green: Well, thank you, Richard. It was terrific
that you would come talk to us. So, thank you, council, for a good discussion. So we
will break for lunch now. Shall we try for 1:15? We will reconvene at 1:15, so 55 minutes,
go get your lunch, and thank you for a good morning.