MIKE: So today I'm happy to introduce Professor Russ
Altman, who is a professor of genetics and medicine at
Stanford University, and computer science, by courtesy.
Russ is very involved in many biomedical research projects
at Stanford, including being the director of the biomedical
informatics program.
The program, you are--
I forget your particular position--
heavily involved in the [UNINTELLIGIBLE]
program, and the principal investigator for the
pharmacogenomics knowledge base.
This video is going to be made publicly available through
Google video.
So if you guys have any questions that are Google
proprietary, make sure you save them to the very end when
we're not live anymore.
I'd like to introduce now, Professor Russ Altman.
RUSS B. ALTMAN: Thank you.
Thanks very much.
I will also add that one of the feathers in my cap is that
I was Mike's PhD advisor, which he forgot to mention.
But very proud of him, and I'm happy to be here.
And today I just want to tell you about what's colloquially
known as personalized medicine.
And one of the forms it's taking is in pharmacogenetics.
And I'll tell you a little bit about what we're doing, really
to get a conversation started and see if there's any
interest in what we're doing and potential further
interactions.
So I want to start out real basic, because I know that
this is a very eclectic group with different backgrounds.
The human genome is made out of DNA, which has four bases,
A, T, C and G. The famous four bases.
And three billion of those bases make a human, plus a few
other things.
But basically, the genetic plan for each human is in
three billion bases, 99.7% of which are the
same across all humans.
So obviously, that remaining 0.3% makes a huge difference,
and is responsible, in addition to the environmental
factors, with why we all don't look like clones of
each other right now.
Now, there are about a million positions--
so if you think of that literally as a long string of
As, Ts, Cs and Gs, out of those three billion, there are
about a million that change across all humans at a
significant frequency.
A mutation would be something that might be very rare.
You might be the only one who has a certain DNA change from like
A to G at a certain position.
But there are other changes that are quite common in the
population, and actually come from the fact that we share a
common heritage, kind of the out of Africa hypothesis.
And a few tens of thousands of humans, about 100,000
years ago, who are basically our ancestors.
Those humans generated some diversity.
And then there was a huge population explosion, similar
to the growth curves you see for Google, actually.
And that population explosion meant that a few of the
variations that were in the population when we were only
100,000, have now fixed themselves in 10%, 5%, 50% of
the population.
So T turning into G is one example.
And there's about a million such examples, depending on
how you define it, in the human genome.
One of the consequences of all this, by the way, is that
there are many possible human genomes.
Even if we only differed at a million positions, with four
choices at each position, that would basically be four to the
million possible humans.
So we haven't even begun to sample human diversity.
That's good news.
Because if we have a lot of problems to solve, you can be
sure that there's a lot of humans still left to make,
before we have to recycle and start making the same ones
over again.
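To get a feel for how large "four to the million" is, here is a minimal sketch that counts its decimal digits. The function name is made up for illustration, and the model (one million variable positions, four choices each) is the simplification used in the talk, not a population-genetics estimate.

```python
import math

def decimal_digits_of_power(base: int, exponent: int) -> int:
    """Number of decimal digits in base**exponent, computed via logarithms."""
    return math.floor(exponent * math.log10(base)) + 1

# Simplified model from the talk: 1,000,000 variable positions, 4 choices each.
digits = decimal_digits_of_power(4, 1_000_000)
print(f"4^1,000,000 has {digits:,} decimal digits")
```

For comparison, a number with only 10 decimal digits is already bigger than the current human population.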
So we sequenced the human genome.
It was announced a couple years ago that the human
genome, the average human genome, had been sequenced.
That was the 99.7% that was shared.
Then it became a great interest to understand for
humans, where don't we share the genome, and what are the
consequences of that?
So the last five years in particular, there's been a lot
of activity in what they call re-sequencing, or genotyping,
where, OK, we know the 99.7% that's all shared.
Let's look at that small percent that's not shared.
And let's characterize the ways in which it's not shared.
What are the choices, T and A, for example?
Which populations--
African, Asian, Caucasian--
which populations show what frequency of those variations?
And so that's the second bullet point, characterizing
the variation in the genotype.
But we're not just doing that because it's fun to do.
We're doing that because we believe that with that
genotypic information, we can, if not predict perfectly, we
can adjust probabilities of things like probability of
disease, or probability of responding well to a
medication, which is what I'm interested in.
And those are what I call phenotypes, which may be a
word that you're familiar with or not.
But basically, a phenotype is any measurable feature of an
organism that's not its DNA sequence.
So the DNA sequence is the genotype.
And how its molecules, cells, organs, and organisms,
how that responds to various stimuli
would be the phenotype.
And what we'd really like to do now is understand the
relationship between this very easily measured genotype and
phenotypes that we care about.
Like risk for disease, or likelihood of responding well
to a medication, and a variety of other
health related outcomes.
You could also have non-health related outcomes having to do
with, well-- it's the Olympics--
athletic performance.
And though there's a lot of interest in-- the phenotype
would be, I skate fast. And the question
is, what's the genotype?
That's not of particular interest to me.
There are people in the world who care about that deeply.
So I'm an informatician.
I do informatics.
I guess many of you, in some form or another, do
something like that.
And the challenges in this field of post-genome
biomedical research, first of all it's to exchange clear
information.
I'm not going to say much more about that.
You all know that that's hard and boring.
But very hard.
And so we work on that.
We also want to understand the statistical relationship
between genotype and phenotype.
If I give you a genotype, can you predict the phenotype,
even if you don't fully understand why that genotype
and phenotype are correlated.
Kind of a machine learning, data mining approach.
And for some people in the world, that's fine.
For example, the FDA approves drugs because they work
statistically.
It doesn't approve drugs because there's a really good
story about how the drug works.
I'm a physician, I practice medicine on Friday afternoons.
And there are a lot of drugs that I use that we don't have
a good story for how they work.
But we know that they do work, and that's all, really, that
physicians and patients care about.
So you don't want to dis the statistical story, even though
many scientists want to have a much more profound mechanistic
understanding.
Tell me the story about how this DNA variation leads to
changes in the cell that leads to a different phenotype.
A perfectly legitimate question of interest to a
different set of people.
We're interested in both.
And then, I think you can imagine that, as we begin to
be able to predict phenotypes, based on genotypes, plus some
environmental variables like, do you smoke, do you drink
benzene at night, stuff like that, that we will be able to
do a better job at making prognosis for disease, which
is basically the anticipated time course of the disease,
diagnosing the disease in the first place, and
treating the disease.
So let me just make this a little bit more concrete.
Here I have two fragments of two genomes.
So that's about 25 bases.
Imagine three billion of those.
It's very finite, right?
So three billion is a big number, but it's not a big
number here at Google.
It's not even a big number to my kids, because it fits on
all their iPods, right?
Your genome fits on your iPod and you still have plenty of
room for more music.
It's three gigabases, and there's only four bases.
So there's only a couple of bits you need to represent
that information.
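The "couple of bits" remark can be made concrete with a small sketch. It assumes a plain 2-bit code per base with no compression and no handling of unknown bases; the bit assignment and the function names are illustrative choices, not a standard the talk refers to.

```python
BITS_PER_BASE = 2  # four symbols (A, C, G, T) need log2(4) = 2 bits
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # arbitrary assignment

def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def genome_size_bytes(num_bases: int) -> int:
    """Bytes needed for num_bases at 2 bits per base, rounded up."""
    return (num_bases * BITS_PER_BASE + 7) // 8

print(pack_bases("ACGT").hex())
print(genome_size_bytes(3_000_000_000))
```

At two bits per base, three billion bases come to 750,000,000 bytes, which is why the whole sequence fits on a portable music player.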
But here we have two individuals, and they're
exactly the same except for that one position, where the
one guy at the top has a T, and the guy at the bottom has
a G. That's called a SNP--
single nucleotide polymorphism.
Single because it's one position, nucleotide because
that's what they call the As, Gs and Cs and Ts, and
polymorphism just is a fancy word for a difference in the
population.
So that's a SNP because there's a T in some people and
a G in other people.
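Spotting the single differing position between two aligned fragments can be sketched as below. The two 25-base strings are made up, since the actual slide sequences are not in the transcript, and the function name is hypothetical.

```python
def find_differences(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_in_a, base_in_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Hypothetical 25-base fragments that differ at exactly one position (T versus G).
person_1 = "ACGTTGCAATGCTAGGCTAACGTTA"
person_2 = "ACGTTGCAATGCGAGGCTAACGTTA"
print(find_differences(person_1, person_2))
```

A difference found this way only counts as a SNP if both versions are reasonably common in the population; a one-off change would be a rare mutation in the sense described earlier.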
What we'd like to do, at least from a machine learning point
of view, is relate that by some complicated, probably
nonlinear, function to observable phenotypes that we
care about.
Like whether you respond well to a cholesterol
medication or not.
That's essentially what pharmacogenomics is.
But I'm going to tell you a little bit more
detail about this.
So pharmacogenomics and pharmacogenetics are the study
of how genetic variation leads to variation in
the response to drugs.
If I were to give you all a drug right now, which, in the
state of California, I'm allowed to do, since I'm
licensed, you would all probably have a slightly
different, or maybe a very-- depending on the drug-- a very
different response.
Some of you might get a headache, some of you might get
nauseous, some of you might actually have the desired
effect of calming you down.
Others, it might make you anxious.
And we'd like to understand that, because as physicians,
we don't practice, particularly, personally
informed medicine.
Of course, we talk to the patient, we understand what's
important to them.
You know, doc it would be bad for me to be sedated tomorrow
because I'm giving a big talk at work, and I would like to
be awake for it.
So of course, doctors do personalized
medicine at that level.
We're talking about a personalized--
when you see personalized medicine in the press, it
often means genome informed.
And in fact, I'm starting to use that phrase more, because
I don't want to insult my fellow doctors, to imply that
they haven't been doing personalized medicine for the
last couple of hundred years.
But they have certainly not been doing
genome informed medicine.
Which is, if I have your genotypes, I
can make better decisions.
This is one of the promises of the genome project.
We got the Congressmen to pay for this by promising them
that we would address important health issues, like
heart disease and prostate disease, which is what
congressmen get.
And so, we want to deliver on that.
Let me give you a couple of examples.
So codeine is a pain medication.
Many of you probably have been prescribed codeine.
It's in Tylenol number three, if you've ever gotten Tylenol
number three, it's the thing that makes the number three
necessary, and why you have to get a prescription for it,
instead of just getting it over the counter.
And it's an opioid.
It's in the morphine family.
But when you take codeine, it is not active.
There is an enzyme in your liver that turns it into
morphine, actually.
It does a little chemical reaction that changes codeine
to morphine.
And then of course, you get all the benefits of morphine,
which are quite considerable, especially if you had a tooth
extracted or surgery recently.
The name of that protein is--
it's an ugly name.
I apologize, I didn't name it.
It's CYP2D6.
There's an elaborate gene naming system, which is a
whole 'nother topic.
But there's a gene called CYP2D6, and it makes a
protein, it encodes for a protein in the genome, that
performs this metabolism.
And as you can see in the third point, 7% of Caucasians
have a version that is a SNP--
and there's actually a number of different SNPs that can
cause this-- that renders CYP2D6 totally inactive in
their liver.
Which means that, for 7% of Caucasian patients,
codeine is like a placebo.
It does absolutely nothing.
Now, I don't know if there's anybody, probably
statistically looking at the crowd, there is probably
somebody here who has been given codeine and they said,
geez, it really didn't help my pain very much.
If that happened, this is one of the most likely reasons.
I assume you took the medication.
As a physician, whether the patient took the medication is
actually probably more important, in terms of
assessing whether the medication worked.
But once you know that they took it, the
genetics plays a role.
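The "statistically, there is probably somebody here" remark can be checked with a short sketch. It assumes every person in the room independently has the 7% chance quoted in the talk for Caucasian patients, which is a simplification; the group sizes are hypothetical.

```python
def prob_at_least_one(carrier_rate: float, group_size: int) -> float:
    """P(at least one carrier) in a group, assuming independent individuals."""
    return 1.0 - (1.0 - carrier_rate) ** group_size

# 0.07 is the figure quoted in the talk; the audience sizes are hypothetical.
for n in (10, 30, 100):
    print(n, round(prob_at_least_one(0.07, n), 3))
```

Under these assumptions, even a group of ten has roughly an even chance of including someone for whom codeine does nothing.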
So codeine is a very common-- question?
AUDIENCE: You said it's the cause of this, or
was it just a marker?
RUSS B. ALTMAN: CYP2D6 is the enzyme that does the
transformation.
So there is this idea of markers, where they correlate
with things that we either know or don't know.
In this case, this SNP--
and there's a bunch of them that can do the same thing--
causes the inactivity of CYP2D6.
So in this case we know a mechanistic story.
That's a good point.
Because I was talking about, it could be a correlative
story, where I just genotyped a bunch of people, saw that
there was this SNP that, whenever I saw it, codeine
didn't work.
And that would have been a perfectly useful thing for
clinical practice.
But we actually know a little bit more in this case.
One more example that's been in the press recently is a
medication called BiDil.
It's actually a combination pill.
Two different medications that we use separately, but some
company probably was running out of patent time and needed
something to patent, and created a pill that had the
two medications grouped together.
And the bad news was, they did a trial on the general
population, and it showed no overall
benefit for heart disease.
This is a pair of drugs that are used for heart failure.
And they did a big trial, you know, 3,000 people get it,
3,000 people don't get it.
And there was no statistical difference in the outcome.
And they were bummed because, wow, this could have been a
good drug, but it's not.
Their actuarial statistics guys looked at the data, and
they noticed that African descent patients, in the study
with 3,000 and 3,000, showed a little bit of benefit.
Now, you can't get FDA to approve a drug based on a
subgroup analysis that's not the primary goal of the study.
But it was enough for them to go back and do another study
focused on African descent patients.
And they showed that it actually improved outcome
quite well.
And so the FDA gave them approval--
the Food and Drug Administration in the United
States gave them approval for using BiDil in
African descent patients.
Well, why is this genetic?
I didn't mention any genetic test. Well, I have mixed
feelings about this.
The genetic test that they're using is looking at the color
of the patient's skin.
That's not a great genetic test, for a number of reasons
which I can't go into.
Some day, and I hope soon, we will do the work to find the
genetic variables that are actually responsible for
predicting the success of this drug.
And then what we'll do is we'll test all patients,
African or otherwise, and if they have this SNP--
then we'll know that they should benefit from BiDil.
And if they don't, they won't.
Right now, the genetic test is staring at the color of the
patient's skin.
And it's been approved, and it's a very interesting story.
You can imagine that bioethicists are having a
field day writing about this, and what its
implications might be.
But the fact is, it's on the market and it's
helping some people.
And so that's good.
So I think you can see the clinical promise of
pharmacogenomics.
Focused treatment by finding people who
are likely to respond.
Not giving it to people who are going to
have bad adverse reactions.
That's something I'd like to come back to.
And then, for the drug companies, as in the case of
BiDil, it's a way to save drugs that they were hoping
would be everybody, one size fits all, and instead, it's
one size fits some people.
And if you're a drug company trying to make some money,
it's not bad to get a drug that at least you can sell to
half the population, or whatever.
And then, from a scientific point of view, we'd like to
understand how drugs work better, and understand the
genetic basis for drug response.
Is it in routine use?
Not really.
There are a few cases.
There's a breast cancer medication that, before you
take it, you need to get a genetic test. I do not test
CYP2D6 before I give patients codeine.
What do I do?
I give them codeine.
I say, if it doesn't work, give me a call.
And if it doesn't, I give them like Vicodin, or
Percodan, or whatever.
But for whatever reason, it has not entered practice.
Even though, in principle, you could check CYP2D6 levels
once, early in life, and then tell the patient forever,
don't do codeine, it's not going to work for you.
There's a little downside for the drug companies, because
they don't really like splitting their markets if
they don't have to.
So they're a little bit schizophrenic about this.
Do we like it or not?
Can't quite decide.
There are lots of SNPs in the genome that make no
difference at all.
So we have still science to do to try to figure out which
ones really matter.
Genotyping, the cost of testing for DNA sequence is
not expensive, but it's still not cheap enough, and the QC,
quality control, hasn't been done for routine clinical
applications.
That is going to go away very fast. There are a lot of
companies working very hard to get accurate
genotypes very cheap.
I think you can look in the next five years for your
genotype being available for the same amount that you pay
for a head MRI, or a head CT.
On order of hundreds or thousands of dollars, not tens
of thousands of dollars.
There are companies--
Perlegen, right down the street, will do it for $5,000
if you do it in bulk.
But that's for research purposes, and there's a
certain challenge of QC for clinical purposes.
There are ethical issues in testing individual patients.
What do you do with the data?
Where do you put it?
Who gets to see it?
Do insurance companies get to see it?
Do employers get to see it?
And what are they allowed to act on?
What happens if you have genetic background that
increases the risk of certain diseases that will make you an
insurance liability to yourself or others?
It's also, in America today--
and this might be a challenge for Google--
totally unclear how to deliver this information to a
practitioner.
When I'm practicing medicine, 12 to 15 minutes per patient,
that includes saying hello, talking, examining, about one
to two minutes for making drug prescription decisions.
I have 30 medications in my head that I use a lot, that I
know really well, and I can make prescribing decisions.
If you tell me that the world is a much bigger place, and
every patient is going to have a genetically informed
prescribing decision, I need to have a terminal next to me,
telling me what the right answer is, and giving me the
information to both agree or disagree, and to explain to
the patient what the plan is.
We don't have a health information infrastructure in
the United States today, as you may be all aware.
And so how to deliver that, in the United States today, is
entirely unclear.
It's much clearer on how to do that in UK, Canada, Estonia,
Iceland, because they all have some sort of centralized
effort, where you can imagine putting in a new app for drug
prescriptions, or integrating it, really, with
the existing apps.
That's not happening.
So there's a big challenge.
And I know that Google is looking at stuff in that area.
We think about it a lot as well.
OK, so we're building the PharmGKB, the
pharmacogenetics knowledge base.
This is not meant for you to really read.
We tried not to innovate too much on user interface
because, you guys and Amazon and eBay are doing all the
user training for us.
And so, we try to--
tabs like Amazon, search boxes like you, the real meat of our
stuff is in the upper left.
External links, or I want to buy this information--
you don't really buy it-- but the download buttons are
always in the upper right, just where you have the add to
cart type buttons, et cetera.
And this is a database.
And what's our mission?
I'll tell you our mission in a second.
I'm going to skip that.
PharmGKB was funded by the National Institutes of Health,
because they wanted there to be a public
pharmacogenetics effort.
Drug companies are doing a lot of this work, but they are not
charged with sharing the data.
Perfectly understandable.
That means that the public, if they want to get data in the
public, so that universities and nonprofits can do research
in this area, they're going to have to pay for it.
And so NIH kind of ponied up a bunch of money, funded a bunch
of centers to do pharmacogenomics research,
about 12 national centers.
And they funded one database, that was us, to
kind of hold the data.
So produce a public repository with broad applicability.
This is the NIGMS at NIH, which is who funds this.
That's not really important right now.
This gives you a sense for the things we're looking at in our
network of researchers.
A lot of heart disease and lung disease along the left.
Cancer in blue.
And a lot of drug metabolism on the right.
And then we're in red in the middle,
because we're the database.
And this is really a national network.
Our goals are to create a national data resource with
high quality data linking genotype data with phenotype
data, both laboratory phenotypes, as well as
clinical phenotypes.
We have to figure out how to represent the
data, the data model.
Tough problem.
Provide analytic functionality so that our users can do stuff
with the data.
And link with all the complementary databases that
we don't want to do, but we need to integrate with.
You know, we're excited about this, because it's an
incredibly complex domain.
We have the core data that we have to worry about in green.
Genotype data, molecular phenotypes, clinical
phenotypes.
In order to make sense of that, our system has to know
about individuals, on the far right there, the environment
in which they operate, the drugs they take, and the
molecules in their body that deal with those drugs.
So from an informatics point of view, it's kind of a fun
representational challenge.
Our mission can be simply and graphically summarized as these
three little balls that intersect.
We have to curate data and knowledge.
So the data that comes in is raw data, but we also curate
it and kind of represent it for the
community as digested knowledge.
We have PhD-level curators who do that.
We have our user interface and functionality, the website,
the database.
And then we have outreach and dissemination, which we're
doing right now.
And both to users and to people who submit data.
So this is the site again.
Again, I don't want to really go through the details.
It's a site.
It has lots of information, you type in
words, you get hits.
I just want to point out kind of some of the key features.
This is a gene page.
A gene page would have genotype data.
What are all the variations in the human genome, in that
gene, that we know about?
I'll show you a little bit more detail if you
press on that link.
And we have phenotype data.
What are measured phenotypes known to be related to that
gene that we have in the database?
So that's data.
We also have pathway, I would call it knowledge.
Which is, we've put together pictures-- and I'll show you
one in a minute-- of how genes work together to metabolize a
drug, or to respond to a drug.
So it's more of a systems level view, not a
gene by gene view.
That's in our pathways.
And then finally, we have at the bottom, if you scroll
down, we have annotated literature data, where our
curators are constantly looking at the literature, and
hand annotating for important articles that are giving
pharmacogenomic information.
So two types of data, genotype and phenotype.
Two types of knowledge, pathways
and literature curations.
When you look at a gene page, this is kind of scientific
detail, which I don't think most of you care about.
But that's the genome in that big thick bar.
And all those little tick marks that look like the
skyline, those are all the locations in the gene where
there's known variations in humans.
And there's a table at the bottom that tells you the
percentages.
So for example, that first SNP is a TA SNP, you can see at
the bottom left, which 52% of the humans we've looked at
have a T, 58% have an A.
And then we gather that data for the entire
genome, provide browsing.
And links, which I'm not going to talk about.
I should say that it's very important, though, to have the
population breakdowns, because as I said before, most
populations share all these variations, but their
frequencies can be very different.
So the right average drug in Asia might be different from
the right average drug in Africa.
But of course, we don't do average drugs.
But if you're making decisions about formularies and what
drugs our country should buy, then the frequencies
are going to matter.
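A per-population frequency table of the kind described here could be computed along these lines. This is only a sketch: the population labels and counts are invented for illustration, and this is not PharmGKB's actual code or data.

```python
from collections import Counter, defaultdict

def allele_frequencies(observations):
    """Map population -> {allele: fraction} from (population, allele) pairs."""
    counts = defaultdict(Counter)
    for population, allele in observations:
        counts[population][allele] += 1
    return {
        pop: {allele: n / sum(c.values()) for allele, n in c.items()}
        for pop, c in counts.items()
    }

# Made-up observations for one T/A SNP; labels and counts are illustrative only.
observations = (
    [("population_1", "T")] * 3 + [("population_1", "A")] * 1
    + [("population_2", "T")] * 1 + [("population_2", "A")] * 3
)
print(allele_frequencies(observations))
```

In this toy input both populations carry both alleles, but the majority allele flips between them, which is the situation where a formulary decision would depend on the breakdown.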
The most exciting thing about PharmGKB is when we have a
genotype for which we also have one of these
little green phis.
And that's the third to last column.
That green phi means that we've measured phenotypes on
people for whom we've also measured that
genotype for that row.
And that's what it's all about.
If this whole effort is about relating genotypes to
phenotypes, then you need to have a big data set of
genotypes that you measured, and phenotypes that you've
measured on the same people.
When you see those green phis, it means if you press that,
you basically get a huge Excel or SAS spreadsheet, which is
going to have--
each row is going to be a human, which has a bunch of
genotypes and then a bunch of phenotypes, and then you can
go to town trying to figure out what's the functional
relationship between those.
I should add, environment is also important.
So if you want to put it very succinctly, phenotype is a
function of genotype plus environment.
And what that F is, is really what the name of the game is
for the next 10, 20, 50 years.
The functional form of that F, and then which genotypes
matter, which environmental variables matter.
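As a toy illustration of estimating that F from a table where each row is a person, the sketch below uses the crudest possible form: a lookup table of response rates per (genotype, environment) group. The rows, the column choices, and the function name are all hypothetical; real analyses would use far more variables and a proper statistical model.

```python
from collections import defaultdict

def response_rate_by_group(rows):
    """Estimate P(responder | genotype, environment) as a simple lookup-table F."""
    totals = defaultdict(int)
    responders = defaultdict(int)
    for genotype, environment, responded in rows:
        key = (genotype, environment)
        totals[key] += 1
        responders[key] += int(responded)
    return {key: responders[key] / totals[key] for key in totals}

# Hypothetical rows: (allele at one SNP, smoker?, responded to drug?)
rows = [
    ("T", False, True), ("T", False, True), ("T", True, True), ("T", True, False),
    ("G", False, False), ("G", False, True), ("G", True, False), ("G", True, False),
]
F = response_rate_by_group(rows)
print(F[("T", False)], F[("G", True)])
```

A table like this does not scale past a handful of variables, which is one reason the functional form of F is an open question rather than a bookkeeping exercise.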
So this is a phenotype file, which is not particularly--
it's a bunch of gobbledygook about pharmacology.
But this is the exciting table I was telling you about, where
each row is a human, de-identified of course.
To get to this data, I should say, you have to register and
log in, which I hate.
Because we want to disseminate as much information to the
world as possible.
But we have to protect the privacy of these patients.
Even though they're de-identified, we want to make
sure we know who sees this data.
So we have a lot of information on the site that
you can look at anonymously.
And that the Google--
in fact, the Googlebots are the number one hits on many of
these pages.
But then there's a certain level of data where we have to
ask people to log in.
And of course, we lose a lot of people at that point.
We lose most of the drug companies, for example.
Because they don't want anybody to know what they're
looking at.
So if a drug company does log in, they download the entire
database, so that they can look at what they're really
interested in, but nobody can tell what it was.
So this is just an example of a bunch of phenotypes.
Each column is a phenotype that's been measured, and each
row is a person.
And behind that hot link is the full genotype collection
for that person.
That's a person circled.
And that would be their genotypes, which I don't need
to show you.
Of course, the other side of pharma--
yes, question?
AUDIENCE: [INAUDIBLE]?
RUSS B. ALTMAN: We have some gene expression data, and we
have a road map to kind of integrate that more tightly.
In the pharmacogenomic field, there's been remarkably little
data created in gene expression.
And a lot of times we're driven by the data that's
submitted to us.
But we are seeing an increase in that.
And so we're building up for that.
So right now we have some.
I would say modest, but--
and the woman four ahead of you and to the left, my left,
your right, is in charge of the gene expression effort.
Tina.
So we have drug pages as well, because it's genes and drugs.
So kind of a very parallel universe.
We have the phenotype data sets, pathways,
very similar situation.
Links to outside, like the PubChem, which you may have
heard of, which is a national effort to give drug
information and chemical information.
Very valuable, because it saves us a lot of time of
having to download a lot of drug structures.
And then we have the annotations.
This is the converse list of genes that are known to be
involved in that drug.
Before, I showed you drugs that were known to be involved
with a gene.
This is a pathway.
It's highly curated, made by scientists who have conference
calls and argue about it until our curators say enough is
enough, let's stop, and let's put it up on the web.
It's a great user interface for us, because instead of
having to type things in, you can look at pictures and click
on genes or drugs, and go right into the database.
So we have a fairly big effort building pathways, curating
them, making sure they're high quality, and providing links
to the supporting data.
We have templated queries, and kind of all the things you
would expect so that users can fill in a blank, and we can
give them formatted output meant especially for them.
Since we now know what question they're asking.
As opposed to the Google-type text box, where we basically
give them a list of everything that was hit that is scored,
but we don't really know what they're thinking.
Whereas here, we kind of know what they're thinking, because
they've chosen one of the templated queries.
We do all kinds of search features, which nobody ever
uses, except like me and the project director.
Boosting and spelling, and people just type in words.
And I think you guys know that better than me.
We have almost 2,000 registered users.
And I'll show you the data for total users, but this is a
small fraction, and that's why I hate it.
But these are people who actually want to download the
de-identified individual data.
And that's why we have to keep an audit trail.
We have about 1,300--
yes?
AUDIENCE: [INAUDIBLE]?
RUSS B. ALTMAN: I'll show you, but we opened in about--
well, we started in 2000, and had no hits for about four
years because we had no data.
And then I'll show you kind of the hit rate.
We're getting about 100 per month, and
that's kind of growing.
So we're happy about that.
We have about 1,300 manually curated articles by
our curators.
We have primary data on about 613 out of 25,000 human genes.
So this is very curated, I would say, especially if
you're thinking from a Google-type perspective.
407 drugs that we have primary data on.
A total of 74 phenotype data sets, but that's growing
pretty significantly.
16 pathways, that's growing.
And we have about 21,000 people in the database for
whom we have genotype data.
And about 9,000 of those also have phenotype.
These are our unique IP visits per month.
You know, this is the most intimidating slide to give to
this crowd, I have to say.
So we're getting about 30,000 to 40,000 unique IP
addresses per month.
And you guys get what, 500 million a day, I don't know.
So, you know, give me a break.
But you can see that when we started getting data, in late
'03, we had a big rise in hits.
And then, there was a drop.
But then we released a new set of features, and we're hopeful
that we're now back on a growth curve, or at least up
around 30,000, 35,000.
If you look at other biological databases, the most
successful is GenBank, which holds the raw
DNA sequence data.
They are used by all biologists everywhere, every
day, and they get 400,000 unique visitors a day.
And so, I think that's an upper bound on what we could
ever do, because we're a subset of the biologists who
care about drugs and stuff.
So you can do your own calculation.
But an upper bound is at least about 400,000 a day.
And I would say, probably 40,000 a day, which means we
have at least an order of magnitude of growth that I
could expect to happen, and not be surprised.
I would be disappointed if we stayed constant here.
And I would be not surprised if we went up by an order of
magnitude in the next four or five
years, two years, whatever.
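The back-of-the-envelope comparison above can be sketched in
Python. This is only a rough illustration: the 30-day month and
the name growth_factor are assumptions, and dividing monthly
unique IPs by 30 is a crude stand-in for daily unique visitors,
since the same address can show up on several days.

```python
DAYS_PER_MONTH = 30  # assumption, used only for a rough conversion


def growth_factor(current_per_month, target_per_day):
    """How many times larger the daily target is than today's daily average."""
    current_per_day = current_per_month / DAYS_PER_MONTH
    return target_per_day / current_per_day


# Figures quoted in the talk: 30,000 to 40,000 unique IPs per month today,
# a guess of 40,000 per day, and GenBank's 400,000 per day as a ceiling.
for monthly in (30_000, 40_000):
    to_guess = growth_factor(monthly, 40_000)
    to_ceiling = growth_factor(monthly, 400_000)
    print(f"{monthly:,}/month: x{to_guess:.0f} to 40,000/day, "
          f"x{to_ceiling:.0f} to 400,000/day")
```

Under these assumptions the gap to 40,000 a day is a few tens of
times, which fits the "at least an order of magnitude" remark.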
This is our user profiles.
As you expect, since it's a public effort, it's mostly
educational.
The other is all educational from non-US
sites, so it's really--
50 plus 20-- it's really 70% educational.
Maybe 65%.
And then 23% dot com.
As you know, in fact I learned on my last-- my last visit to
Google was in that other place down on 101.
It was a cozy little shack.
And this is much bigger.
So congratulations.
But one of the things that we can do just like you do is, we
can look at what people actually type into the search
boxes to figure out what they care about.
And so it's marketing research for us.
And for one three month period, or whatever--
I don't even remember what the period was--
these are exact text matches.
And of course, you could get much better statistics if you
do spelling errors and stuff.
But this gives us a sense of what the
community of users wants.
And that's great for us, because then we can say, how
are we doing on searches for these drugs, these diseases,
these genes?
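As a rough illustration of the exact-text-match counting
described above, here is a minimal Python sketch. The function
name count_queries, the normalize option, and the sample log are
all hypothetical; the talk does not describe how the counts were
actually produced.

```python
from collections import Counter


def count_queries(queries, normalize=False):
    """Count search-box strings by exact text match.

    With normalize=True, trim whitespace and lowercase first. That is
    an assumed extra step, a small move toward the "spelling errors
    and stuff" handling that the exact-match counts leave out.
    """
    if normalize:
        queries = (q.strip().lower() for q in queries)
    return Counter(q for q in queries if q)


# Hypothetical search log, not real PharmGKB data.
sample_log = ["warfarin", "CYP2D6", "Warfarin ", "asthma", "cyp2d6", "warfarin"]

print(count_queries(sample_log).most_common(3))
print(count_queries(sample_log, normalize=True).most_common(3))
```

Exact matching treats "warfarin" and "Warfarin " as different
queries; the normalized count merges them.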
We are entering a new phase now where people can now
measure all the SNPs in an entire genome.
Whereas, for the last five years, it
was much more focused.
So we're having to build kind of chromosome level browsing
capabilities.
Question?
AUDIENCE: What are the people you do have the genomes for
now, the 21,000, how many SNPs have you [INAUDIBLE]?
RUSS B. ALTMAN: So for those 21,000 people, we probably
have an average of only 20 to 50 SNPs each.
But in the next three months, we're going to be starting to
get 300,000, 400,000 SNPs per individual.
So there's going to literally be this qualitative and
quantitative jump, because of the technologies that
Affymetrix, Illumina, Perlegen, which is right down
the street, there's this quantum leap.
And we hope to be the first, or if not, one of the first,
I'm hoping the first database, that can accept that data and
vend it out to the public.
And our team is working very hard writing code to parse
those binaries, display them, in fact, that's exactly what
I'm showing here.
You'll be able to look at a chromosome.
We will label all the regions of the chromosome for
which we have data.
Today it's only three, because we have low throughput data.
But then we'll show segments of the genome with little tick
marks showing all the SNPs, where they are, what the
frequencies are.
And so we're about to enter a very exciting time.
I have a little blog.
This is in beta.
But I made my own little blog, which is news items. And the
amazing thing about Google is, you guys found it and made it
my number one hit on my name in like two days, before I
even announced it.
So way to go.
So in return, we used Google groups as our key way for
technical support with our various users and stuff.
And so there's a beta--
Google groups, I know, is in beta.
And we're also using it for our own beta technical
discussion group.
Just set that up last week.
Now, I just want to finish up talking about some
opportunities in a vague way.
I don't have a plan, but I wanted this group to hear
about this.
So first of all, one of the things that drives
pharmacogenomics is two things, getting drugs that
will work better, and not using drugs that will cause
adverse events.
And there's been a lot of press about
adverse events recently.
There was a study out of a blue ribbon panel that said
60,000 preventable diseases in the United States a year
because of drug errors.
If you look at the most common drug--
ADRs is adverse drug reaction--
so it's like an adverse event due to taking a drug.
The three most common are heart arrhythmias, and QT
prolongation is a technical--
it's basically arrhythmias from drugs that mess with your
conduction system of your heart.
Liver responses that are fulminant and bad.
And severe dermatological rashes.
Those are the three most common.
I'll tell you how I know that in a second.
And there are many other minor ones.
Some of you may have had some of these when you took your
favorite drug, rash, stomach ache, dry
mouth, drowsiness, headache.
The United States Food and Drug Administration is charged
with tracking these, because they are charged with then
going back to the drug companies and saying, hey,
we've got a problem.
We're finding a lot of reports of your drugs causing trouble.
And they have the AERS, the adverse event reporting
system, where they get 400--
in 2004, they had 420,000 adverse events reported, which
they estimated, A, are totally biased by the people who
actually like making these reports.
And there's lots of people who don't do it.
And this is at no more than 4% or 5% of total adverse
reactions, because it's kind of a pain.
A health care worker, a pharmacist, a doctor, or
nurse, may say, oh, you have a bad effect.
I think it's from the drug.
Let me fill out this form and fax it to the FDA.
They don't even have an online system yet, OK?
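The arithmetic implied by those figures can be sketched as
follows. The function name estimate_total is made up for
illustration, and since the 4% or 5% capture rate is quoted as
"no more than", the results are best read as lower bounds.

```python
def estimate_total(reported, capture_rate):
    """Scale reported events up by the assumed fraction that gets reported."""
    return reported / capture_rate


reported_2004 = 420_000  # adverse event reports quoted for 2004

for rate in (0.04, 0.05):
    total = estimate_total(reported_2004, rate)
    print(f"capture rate {rate:.0%}: roughly {total:,.0f} adverse reactions")
```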
Based on this, FDA scientists look at these 420,000 adverse
event reports.
I have no idea how they do this.
They have a way to do it and I don't know what it is.
But they generate about 600 studies a year saying, we
looked at this drug and we think we better think about
modifying the label of this drug, because we're seeing
adverse events here that are not listed, and that the world
needs to know about.
And then these 600 reports are done by 40 scientists working
at the FDA doing these studies.
And you can even find it by using Google.
Now, the great thing is, this is all de-identified, and so
they make it absolutely public.
So you can go to their website and you can download all
420,000 reports from 2004, and for a number
of other years back.
And what you get is patient demography, basic like gender,
age, I don't know if they have ethnic background, some of the
drugs they were taking, what happened that was bad, why
they were taking the drug in the first place, whatever
happened to the patient, the outcome, some dates, and a few
other things.
So pretty minimal report, but very, very valuable.
I'm mentioning this because I'm just going to leave it out
there that I think there are ways for FDA, maybe PharmGKB,
and Google to partner to do a much better job at tracking
drug events, adverse drug events, in the population.
And potentially really accelerate the time to when we
see the event in a statistical fashion, come up with a
research plan for addressing it, and then hopefully
changing the world in terms of how that drug is used and the
events that it causes.
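One textbook way to "see the event in a statistical fashion" is
a disproportionality measure such as the proportional reporting
ratio (PRR). The talk does not say which method the FDA uses, so
this is a swapped-in standard technique rather than their
procedure, and the counts below are invented.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug and any other event
    c: reports with other drugs and the event
    d: reports with other drugs and any other event
    """
    rate_for_drug = a / (a + b)
    rate_for_others = c / (c + d)
    return rate_for_drug / rate_for_others


# Invented counts: the event shows up in 10% of the drug's reports
# and in 1% of the reports for everything else.
prr = proportional_reporting_ratio(a=20, b=180, c=100, d=9_900)
print(f"PRR = {prr:.1f}")
```

A ratio well above 1 only flags a drug and event pair for a
closer look; it does not show that the drug caused the event.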
So I'm just throwing that out there because I was thinking,
boy, they're not doing a great job at capturing these.
And think about how many-- you guys probably know-- how many
people type in drug names because they've just vomited,
they just took their first dose of a drug, and they say,
huh, let me type it into Google and see if this is a
known effect.
And there's a moment there when, in addition to giving
them information, we could potentially do stuff that
would help the population.
So there's the idea.
We have done some--
this is just to finish up--
we are interested in data mining ourselves.
We're an informatics lab.
And one of the things we recently published was a scan
of the medical literature for pharmacogenetics articles.
So which articles should I tell my curators, or ask--
I don't tell my curators anything--
should I ask my curators to annotate?
Right now, they have a very large palette of 15 million
articles to kind of consider.
So we wrote a machine learning algorithm that was actually
quite accurate at picking out articles based on word uses.
So it was a statistical, natural language processing.
Didn't do any language models or anything like that.
And we picked out about 5,000 pharmacogenomics articles,
that when we did a subset of them manually, kind of
checked, we had about a 92% agreement with our curators.
So pretty good, definitely as a screening
method for finding articles.
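To make "picking out articles based on word uses" concrete, here
is a minimal bag-of-words sketch. The talk does not name the
algorithm, so multinomial naive Bayes with add-one smoothing is
an assumed stand-in, and the training sentences, labels, and
function names are all invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs."""
    word_counts = {}       # label -> Counter of word occurrences
    doc_counts = Counter() # label -> number of training documents
    vocab = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # skip words never seen in training
                # Add-one smoothing keeps unseen pairs from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, not real curated articles.
training = [
    ("CYP2D6 polymorphism alters codeine response", "pgx"),
    ("genotype predicts warfarin dose", "pgx"),
    ("surgical technique for knee repair", "other"),
    ("hospital staffing survey results", "other"),
]
model = train(training)
print(classify("warfarin dose and genotype", model))
```

A screen like this would only rank candidates; curators would
still make the final call on each article.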
And we're just updating that now.
So we'll have an updated list. And that allowed us to kind of
do things like create a web resource, where you type in a
drug, and we tell you all the genes that have ever been
mentioned in the context of that drug, in an article that
we think is about pharmacogenomics.
And I'll let you read that article if
you find that useful.
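The drug-to-gene lookup described above amounts to a
co-mention index. The sketch below is an assumption about how
such an index could be built, not the PharmGKB implementation;
the function name and the toy sentences are made up.

```python
from collections import defaultdict


def build_drug_gene_index(articles, drugs, genes):
    """Map each drug to the genes mentioned in the same article text."""
    index = defaultdict(set)
    for text in articles:
        words = set(text.lower().split())
        genes_here = {g for g in genes if g.lower() in words}
        for drug in drugs:
            if drug.lower() in words:
                index[drug] |= genes_here
    return index


# Toy sentences standing in for articles already flagged as
# pharmacogenomics by the screening step.
articles = [
    "warfarin dosing depends on CYP2C9 variants",
    "VKORC1 haplotypes and warfarin sensitivity",
    "codeine metabolism by CYP2D6",
]
index = build_drug_gene_index(
    articles,
    drugs=["warfarin", "codeine"],
    genes=["CYP2C9", "VKORC1", "CYP2D6"],
)
print(sorted(index["warfarin"]))
```

Whitespace splitting is a simplification; real text would need
punctuation handling and synonym lists for drug and gene names.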
So I want to take some questions and stuff.
So I want to stop, thank the team.
Teri Klein is here today, and she's the director.
We have some of the curators here.
Some of the technical staff.
It's nicer to look at their pictures, so I'll stop and we
can take some questions.
Hope I gave you a flavor for what we're trying to do.
Thanks.
[APPLAUSE]
RUSS B. ALTMAN: Yes?
AUDIENCE: How clean are your data [INAUDIBLE]?
RUSS B. ALTMAN: Question is how clean is our data.
The genotype data, which is the variation position of the
DNA, is extremely clean, because we have written a lot
of quality control, both automatic code for like--
we take in an XML, and we do a lot of syntactic, as well
as semantic validation.
And our curators look at it.
AUDIENCE: [INAUDIBLE].
RUSS B. ALTMAN: Yeah.
AUDIENCE: The person who generated that data
[INAUDIBLE]?
RUSS B. ALTMAN: The people who generate the data are working
pretty hard to have high quality data.
Because, since we're relatively low throughput, or
have been, there's been a premium on becoming the site
known for good data.
And so, they have social mechanisms and social
pressures on them to do a good job, because they use this
data then to publish papers that are peer reviewed.
And the peers take a look at it very closely.
And so, the best source of data for quality control is
the fact that sometimes people measure the same thing in two
different labs, and then the concordance is quite good.
And that gives us confidence that these people are
collecting quality data.
AUDIENCE: Is it one SNP, is it 10,000 SNPs, a million SNPs,
how much is it?
Do people measure the same [INAUDIBLE]?
RUSS B. ALTMAN: I would say that 5% of our data has been
collected by two different groups.
So it's not the majority of it, but it's enough to get a
statistically reasonable sample for
doing concordance analysis.
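A concordance check of this kind can be sketched in a few lines
of Python. The data layout, a dict keyed by (subject, SNP) with
two-letter genotype strings, and all the identifiers below are
assumptions for illustration only.

```python
def normalize(genotype):
    """Treat "AG" and "GA" as the same unordered genotype."""
    return "".join(sorted(genotype))


def concordance(calls_lab_a, calls_lab_b):
    """Fraction of shared (subject, SNP) measurements where both labs agree."""
    shared = calls_lab_a.keys() & calls_lab_b.keys()
    if not shared:
        return None  # nothing was measured by both labs
    agree = sum(
        1 for key in shared
        if normalize(calls_lab_a[key]) == normalize(calls_lab_b[key])
    )
    return agree / len(shared)


# Hypothetical genotype calls from two labs.
lab_a = {("s1", "rs1"): "AG", ("s1", "rs2"): "CC",
         ("s2", "rs1"): "GG", ("s3", "rs1"): "AA", ("s4", "rs9"): "TT"}
lab_b = {("s1", "rs1"): "GA", ("s1", "rs2"): "CT",
         ("s2", "rs1"): "GG", ("s3", "rs1"): "AA"}

print(concordance(lab_a, lab_b))
```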
Now, phenotypes are much harder, because everybody
measures them differently.
And the only quality control there, again, is the peer
review process.
Getting the paper published, and having people say, I don't
like the way you measured blood pressure,
I reject this paper.
So that's the primary pressure there, again.
There's nothing we can do technically.
We're relying on these other sources for quality control.
AUDIENCE: So, this is kind of tongue in cheek, but how
many-- given all the data that already exists, do you have
any intuition for how many, like, PhD theses are out there
that would require no more work in the wet labs?
It's just a matter of integrating data from all
these different sources to make a new discovery that we
didn't know before?
RUSS B. ALTMAN: Well, there are probably on the order of
1,000 PhD candidates in biomedical informatics around
the country right now, most of whom don't generate primary
data themselves, but have collaborations or
use publicly available data.
So I think there's a sense that there's a lot of data and
a lot of discoveries to be made.
On the other hand, clever biologists, every day,
generate new experiments that are smaller and faster.
So it's even better than the scenario that you're painting,
because in addition to a lot of undiscovered stuff for
mining, I think that we're really on the beginning
exponential of clever biologists generating large,
large data sets.
There's now an ethic in biology that, if you can do
it, why not do it fast and small, and in high throughput.
And that puts people who do informatics
in a very good mood.
So I'm bullish on informatics.
AUDIENCE: You have people's genome, even though it's
depersonalized in some sense.
Is there any concern that that's like a fingerprint that you
could find the person, because it says whether they have
seizures and everything else?
RUSS B. ALTMAN: Yes.
So thank you very much.
This is one of those beautiful moments when your next slide--
So, you only need 60 to 100 SNPs.
I told you, we have about a million SNPs each.
You only need about 60 or 100 of them to be a unique
fingerprint.
So this DNA, even though I've removed your name and address
and social security number, if I get a piece of your DNA, I
can very cheaply figure out if you're in the PharmGKB or not.
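A rough calculation shows why a few dozen SNPs can act as a
fingerprint. The sketch below rests on simplifying assumptions
that are not from the talk: independent two-allele SNPs,
Hardy-Weinberg genotype frequencies, an allele frequency of 0.5,
and unrelated people. The function name is made up.

```python
def match_probability(num_snps, allele_freq=0.5):
    """Chance that two unrelated people share a genotype at every SNP."""
    p, q = allele_freq, 1 - allele_freq
    genotype_freqs = (p * p, 2 * p * q, q * q)  # Hardy-Weinberg proportions
    per_snp = sum(f * f for f in genotype_freqs)  # both people draw the same genotype
    return per_snp ** num_snps


WORLD_POPULATION = 6e9  # the figure used later in the talk

for n in (20, 60, 100):
    prob = match_probability(n)
    verdict = "rarer" if prob < 1 / WORLD_POPULATION else "more common"
    print(f"{n} SNPs: match probability {prob:.2e}, "
          f"{verdict} than one in six billion")
```

Under these assumptions 20 SNPs are not quite enough, while 60
or 100 push the chance of a coincidental match far below one in
the world population. Relatives and correlated SNPs would weaken
this.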
So there's 20,000 people who, when they signed a disclosure,
basically took that risk.
And we're very grateful for that.
But there is a worry about genetic databases in general.
And it's one of the reasons we have that password step for
getting to that identifiable-- it's not identified, and it
might not even be identifiable-- but it's risky,
personalized data.
There's definitely some risk there.
And I'll just tell you a story.
And I've written about this.
Basically, there's a tradeoff between what you need to do
research and what you need to guarantee
privacy, and there's a tension.
But you may have heard about a month ago, two months ago, a
teenager in the UK who--
so here's the story.
A father donates sperm anonymously 15 years ago.
Teen, 15 years later, is interested in his heritage.
Teen in UK uses a commercial kit to check variations in his
Y chromosome.
OK, Y chromosome, you get that from your father.
Mom doesn't have any say in the Y chromosome.
There's a company that has a website that if you say you'll
make your Y chromosome information available, they'll
let you see other people who've also made their Y
chromosome available.
So the kid is smart, and he says, you know what, there's
two things I inherited from my father.
His Y chromosome, and I would have inherited his last name
if I knew what it was.
So he goes into this database and he finds two men who have
a very similar Y chromosome.
And they both have the same last name.
He doesn't know who these guys are, but he says, I bet my
father had that last name, or a version of that last name.
He knew his father's age, and he knew his father's--
the part of the country where he came from.
So he combined this with some voter records, or some DMV
equivalent records, calls his father up.
He says, hi, Dad, I'm your sperm donor child.
15 years old, this happens in the news on November 3.
So I used to talk about this and have to make up these
hypotheticals.
And this kid saved me a lot of time.
So there is no doubt that you can link multiple databases,
if you're clever.
And if you have any access to genetic information.
And then you can do some re-identification.
Re-identification can be going down to a single person, but
it can also be narrowing it down to a relatively small
group of people, which could be just as damaging.
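The narrowing step in that story is just a filter over
ordinary records. The sketch below uses entirely fictional
records and a made-up function name; it is meant to show how
little is needed once a surname has leaked, not to describe any
real database.

```python
def narrow_candidates(records, surname, min_age, max_age, region):
    """Keep only records consistent with every known attribute."""
    return [
        r for r in records
        if r["surname"] == surname
        and min_age <= r["age"] <= max_age
        and r["region"] == region
    ]


# Fictional public records.
records = [
    {"name": "A. Example", "surname": "Example", "age": 41, "region": "north"},
    {"name": "B. Example", "surname": "Example", "age": 67, "region": "north"},
    {"name": "C. Example", "surname": "Example", "age": 40, "region": "south"},
    {"name": "D. Sample", "surname": "Sample", "age": 42, "region": "north"},
]

matches = narrow_candidates(records, "Example", 38, 45, "north")
print([r["name"] for r in matches])
```

Each attribute alone matches several people; combined, they
leave a single candidate, or in larger data a small group.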
So this is a concern.
And we've wound up writing a lot about this, because if we
ignore this problem, it could bring down the
entire genomic research.
All you need is one bad thing to happen, worse than this,
and then Congress says, no more genetic databases, and
then I'm out of business.
And I'm applying to Google for a job.
So thank you for the question.
AUDIENCE: I'm sorry, so this is a risk, but this is an
acceptable risk?
Is that sort of your answer?
RUSS B. ALTMAN: Yes.
So we do our best to de-identify.
When people register, we make sure that they have bona fide
research questions.
They can be industrial, industry
is no problem, academic.
But like hotmail accounts.
Uh-uh.
Like, why do you want to see this data?
Then we, of course, have the weblog.
So we know who has seen, at least at the IP level and, oh,
since they login, we actually know who they
are, who saw data.
And so, if we get into trouble, we could roll back
some of that data and figure out, you know, if the judge
asks us, we could figure out.
And then we could find ourselves in the exact same
position that you guys have found yourself in recently.
So we have written about it.
We've publicly declared it.
We have a trillion usage policies that, when you
register, you have to agree to.
And there's a sense that we've done due diligence.
But I would never claim that we're bulletproof.
Yes?
AUDIENCE: With the frequency of SNPs, is it higher in genes
than junk DNA, or is it similar across both?
RUSS B. ALTMAN: So you actually see SNPs throughout
both coding DNA and the so-called junk DNA.
In fact, the junk DNA has a little bit more, because it's
less sensitive in some sense.
If it's not coding for a protein, it can tolerate these
variations.
You know, a random cosmic ray hits your ovum or your sperm,
a DNA base flips, and nobody cares, because it's not a
critical part of the genome.
Whereas, if you hit a critical part, you
might be very sensitive.
So the rate of SNPs tends to go down in regions that are
very important, and that have been conserved over many
billions of years, or millions of years.
AUDIENCE: Do you ever find surprising correlations with
SNPs that you thought were in junk DNA?
RUSS B. ALTMAN: Yes.
There are SNPs that correlate very beautifully with
phenotypes, and those SNPs are hanging out in
the middle of nowhere.
We have no idea why that SNP is correlating with the
phenotype we care about.
It just shows you how much biology we still need to
figure out.
It probably isn't junk, is the answer.
Question in the back?
AUDIENCE: [INAUDIBLE]?
RUSS B. ALTMAN: One more, yeah.
AUDIENCE: So you maybe alluded to this with your adverse
event reporting, but can you think of ways, I mean, to
expand the client base from even 400,000 per week-- like,
what's the ultimate number of users who would--
RUSS B. ALTMAN: So the ultimate number of users is
all the people on the face of the earth are genotyped.
And that information is used responsibly by their various
health care systems to do optimal medical decision.
Right?
So six billion people on earth, six billion data files
with all of their genotypes, and enough phenotype
information to make predictions for all of them
and say, use these drugs, use those drugs.
And then, of course, you need a delivery mechanism.
I kind of alluded to that.
It's a challenge in America.
It would certainly be a challenge in Zimbabwe, or in
South Africa.
And so that's the big challenge.
AUDIENCE: [UNINTELLIGIBLE]
or just the health care providers?
RUSS B. ALTMAN: There's an issue of who carries around
the genotype, is it centralized?
Certainly the health care providers need
access to the data.
But does it get stuck on a chip on your tooth?
Do you carry it around on a card?
Is it in a central government database?
These are, as you can imagine, are very excit-- is it on
Google with a password?
As long as you don't give your Google password to anybody,
your genotype is safe.
So these are very interesting kind of sociological
discussions.
Society hasn't made a decision.
The societies that have centralized medical facilities
have an obvious default answer, which is, let's put it
in with the rest of the stuff.
We don't have that in the United States.
And so, I've written a little bit about a distributed,
patient controlled system, where the genotype is
measured, but then you are in control of it.
It's, you know, there's public private key
infrastructures in place.
And for any individual health care provider you can say, go
ahead and use my genotype.
And for a researcher like me, you can say, you know what, I
think PharmGKB is a good thing.
You may have my genotype and phenotype data for research.
Please use it responsibly.
And I trust you will, because I've read about
you and I trust you.
That kind of infrastructure, it really is technically not
that far-- you know, that's within reach.
And it's really a lot of sociological agreement that we
need to get that to happen.
Or somebody to do it and just do it well and say, look, this
is a good thing, any questions.
Yes.
AUDIENCE: When you get cancer, the cancer cell genotype is
different from yours.
Do you track that?
RUSS B. ALTMAN: So the question--
I'm sorry, I haven't been repeating the question.
The question was, the genotypes of cancer cells can
be different from the genotype of the host. That's
absolutely the case.
In fact, one of the causes of cancer is a rearrangement or a
messing up of the DNA, which creates multiple copies of
genes, missing copies of genes, or mutations or
polymorphisms, SNPs, in those genes.
And so, it is of great interest to sequence those.
And in fact, yes, the National Cancer Institute has a project
called the cancer genome project, where they're
sequencing, not humans, but cancers from humans.
We also are interested in that data.
You can use all the same measurement technology.
You just have to remember that it's not from a
normal blood cell.
It's from an abnormal cancer cell.
A big issue there is the number of copies of a gene.
Turns out that's one of the big things that happens in
cancer, is you get extra copies.
And so the fine balance that's established in a normal cell
is messed up, because you have 10 copies of one gene.
And so it's over-eager, and it just causes the cell problems.
So let me stop here since it's noon, and thank you very much.