We move to our next speaker. Sean Eddy of the Howard Hughes Medical Institute at Janelia
Farm is here to talk about reading genomes bit by bit. Sean?
Sean Eddy: It's an honor to be here, and it's been a
great ride throughout this project. I was entering graduate school when Renato Dulbecco
wrote that article about the human genome project and why we should do it. I've been
able to see this from the beginning. I want to convey to you some of the excitement that
I feel as we get to look at all this stuff and immerse ourselves in the A, C, Gs, and
Ts. I also want to talk a little bit about some of the biology besides medicine and besides
the human genome. I'm also
going to talk about some geeky things about computers and software. But in the biology parts, I'm
going to try to explain two things. I'm going to try to explain why we sequenced not only the
human genome but a bunch of other things, including lots of flies and lots of worms,
not just melanogaster but all around melanogaster, and also
why we sequenced this particular single-celled pond protozoan and other things like
it. And I'll have a little bit about that.
Where I'm going to start, though, is that at the end of the day, when we look at a DNA sequence,
at least if you're a computational geek, you look at the sequence of As, Cs, Gs, and Ts and you say
this is a symbolic sequence. We know a lot about cracking codes by both experiment and
statistical analysis. Figuring out the meaning of apparently impenetrable languages is something
humans have done for a long time, and it's a great intellectual thing to do. Puzzles
are based on this. It gets very addictive.
One of the great examples of this is actually described in a book called The Decipherment
of Linear B by John Chadwick, writing about his buddy Michael Ventris, a mathematician
at Cambridge. This is an apropos tale. It is 1953 in Cambridge. One code was cracked
in the pub and another was also cracked by Ventris. This is Linear B, a language
that had been found in Mycenaean caves and was only known in symbols. It's one of the few
examples where we didn't know anything about the language. It had to be cracked by direct
statistical attack.
Ventris' attack was comparative sequence analysis, and the idea was to look for statistical regularities
between different tablets. As far as I'm concerned, this is one of the early successes
in comparative sequence analysis. In this case Linear B turns out to be an alternative
script for Greek, and so the problem becomes somewhat easier than the problem that we're
faced with. But we have a lot of examples. And it's fairly mind-blowing to me, still,
to go to our disks. What you're seeing on the left, unreadably, is a listing of the
top level of the disk at Janelia Farm where we keep our genomes. And this is like some
pre-Victorian phylogenetic tree, because you can sort of see algae, amoeba, amphibia, archaea,
bacteria, in an order that's sort of percolating up and down depending on what my laboratory
and I and my wife are working on at any given time. And it's remarkable to me that
sitting in these directories is the actual source code for all these different
wonderful creatures that we see.
There is a lot of talk about how much data we're talking about. It is a fair amount of
data, but it's not, at least at the moment, completely mind-blowing. For my lab, that
disk that I just showed you is about 450 gigs. We have most of the genome sequences that are
available. If we were in the business of manipulating human data, if we were taking raw images off
an Illumina, that's a lot of data per human genome, and that's difficult to store. It's
difficult to ship down the internet. But once it gets to assembly, it's not so bad to store
it per year. It's not so bad to transmit it down the internet. And as Eric just mentioned,
there's proof of principle that we can store human genomes as differences. So
in principle we know that we can handle the genomes for the next
couple of years, but it's nontrivial. The 1000 Genomes Project has generated five
terabytes in its pilot project. That's not a big deal. We have a petabyte of spinning
disk at Janelia. We could store that if we needed to. The NCBI read archive is starting
to fall over. It's approaching a petabyte soon. But these are volumes of data right
now that are not too different from, you know, your iTunes collection. I've
got this, sort of like the Vikings used to drink their beer from the skulls of their enemies.
[laughter]
My coffee coaster is the Celera genome sequence.
[Laughter]
When we look at this sequence, this is one of the few genome sequences that fits on a
slide. This is phiX174, sequenced by Sanger and Coulson
back in the 70s. The sequence of this little bacterial virus is actually quite
interpretable, because, of course, there are statistical regularities. There's an ATG start.
There's one of the three stop codons. There's an open reading frame. And we
can walk through a genome like this, because the bacterial genome tends to be packed with
open reading frames, and we can do a pretty good job of interpreting this genome. There's
interesting stuff going on in the gaps, and sometimes an interesting little RNA. I'm quite
interested in little RNAs that have functions, as Eric was also talking about.
Once we get into the human genome and the big vertebrate genomes, we have a bit of a problem.
Only one, two, three percent of it is coding, depending on who you're talking to, and one
of the most important signals that we can try to use is sequence conservation. We line
up a bunch of genomes, and we say this set of bases is essentially the same across some
clade of evolution. Really key in our ability to do this kind of analysis with the human
genome was the development of genome browsers by Jim Kent and David Haussler, and the Ensembl
browser by Ewan Birney; having the ability to align large quantities of genome, which
has been done by Webb Miller's group among others; and then the ability to take those genome
alignments and calculate some simple statistic of how conserved I think the various bits
are, generating these plots you can sort of see in blue that spike up, of course, on the
exons, in this case of a single gene, the p53 gene of the human, and then also spike up
on some other areas of noncoding conservation. Adam Siepel's program
[unintelligible] underlies a lot of the calculations that people use for getting a quick look at
this conservation. Those little spikes outside exons represent unexpected wealth.
We expected to see regulatory sequence. We know there's transcriptional regulatory stuff.
We know there's enhancers, there's promoters, there's what have you. But now we can
use the conservation to really tell us where that stuff is likely to be. And so that's
really the answer to the first biological thing that I asked: "Why did we sequence so
many flies? Why did we sequence so many worms?" You can sit down, and you can do
a power calculation, and you can say, "OK. Here's a region
of DNA. I've got the human. I've got the mouse.
I'm expecting about 40 percent difference between those two sequences just by neutral
drift, by the evolutionary clock. So if it's a hundred nucleotides, I expect 40 differences.
I only see 10. Should I be surprised? Ten is a small number. Forty is a small number."
You can do a power calculation: for a given amount of conservation, how far do you have
to drive the two distributions apart before you can reliably tell that you've got a conserved
piece of DNA, given the size of the human genome, which is quite big, so you're going to have
false positives. Greg Cooper, Arend Sidow, and others,
and then followed by me, have done these kinds of simulations and power studies. And if you
want to drive the resolution down to single nucleotides or five or ten nucleotides, you
can pretty quickly convince yourself that for typical distances between vertebrate genomes
of, like, 0.2 to 0.4 substitutions per site, you're going to need tens to even hundreds
of genomes lined up. So it's not so much that we're interested in the platypus per se, but
it's one of the hundred genomes or the thousand genomes that we are going to line up against
the human to try to figure out what's functional in the human.
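[Editor's sketch] The "should I be surprised?" question above can be made concrete with a simple binomial tail. This is a minimal illustration only: it assumes every site changes independently with probability 0.40 (the neutral figure quoted in the talk) and a round 3 billion bases for the genome, and it is not the simulation method used in the power studies mentioned. The function and variable names are made up for the example.

```python
from math import comb

def binom_tail_le(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

NEUTRAL_RATE = 0.40          # expected fraction of differing sites (from the talk)
GENOME_SIZE = 3_000_000_000  # assumption: round figure for the human genome

def chance_of_conservation(window, observed_diffs):
    """P-value for seeing this few differences in one window, and the
    expected number of such windows arising by chance genome-wide."""
    p_value = binom_tail_le(observed_diffs, window, NEUTRAL_RATE)
    n_windows = GENOME_SIZE // window  # non-overlapping windows, a simplification
    return p_value, p_value * n_windows

# 100-nt window with 10 differences, versus a 10-nt window with 1 difference
p_100, fp_100 = chance_of_conservation(100, 10)
p_10, fp_10 = chance_of_conservation(10, 1)
print(f"100 nt: p = {p_100:.3g}, expected chance hits = {fp_100:.3g}")
print(f" 10 nt: p = {p_10:.3g}, expected chance hits = {fp_10:.3g}")
```

Under these assumptions the same proportion of conservation is convincing in a 100-nucleotide window but produces a very large number of chance hits at 10 nucleotides with only two genomes, which is the intuition behind lining up many more genomes.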
And that's a very crude calculation: it's conserved, so it must be doing something interesting.
What you can also do is look at the pattern of conservation, and now you
can do much more interesting things. For instance, if you're dealing with a coding region, obviously
there's a pattern of conservation that tends to respect the triplet periodicity of the
genetic code. So you can do something like take a region that seems to be conserved,
in this case a poster child for a long intergenic noncoding RNA, a gene called
SRA1, a gene that when it was first cloned had a truncated cDNA that went to this point,
and you'll notice the ATG start codon there, and then you color in where all the mutations
are. You color third-position changes red, first-position changes green, and second-position changes
blue, and you can see most of the changes here are in red. They're respecting the frame.
There's two insertions in the picture I pulled out, one of six, one of three, again respecting
frame, and when you do this over the entire aligned region of the SRA1 noncoding RNA,
it's pretty clear that this is a coding gene of 232 amino acids in the mouse and
236 amino acids in the human.
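[Editor's sketch] The coloring exercise described above amounts to tallying substitutions by codon position and checking that gap lengths are multiples of three. The following is a minimal sketch under stated assumptions: the alignment starts in frame (or the caller supplies `frame_offset`), and the two short sequences are invented toy data, not SRA1.

```python
import re

def substitutions_by_codon_position(seq_a, seq_b, frame_offset=0):
    """Count mismatches at codon positions 1, 2, 3 of a pairwise alignment.

    Columns containing a gap ('-') are skipped. The codon position is taken
    from the alignment column, which is only valid if gaps preserve the frame.
    """
    counts = [0, 0, 0]
    for column, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == "-" or b == "-":
            continue
        if a != b:
            counts[(column - frame_offset) % 3] += 1
    return counts

def gap_run_lengths(aligned_seq):
    """Lengths of each run of gap characters; multiples of 3 respect the frame."""
    return [len(match.group()) for match in re.finditer(r"-+", aligned_seq)]

# Hypothetical toy alignment: every difference falls on a third position
toy_human = "ATGGCTGAAACC"
toy_mouse = "ATGGCCGAGACT"
position_counts = substitutions_by_codon_position(toy_human, toy_mouse)
gap_lengths = gap_run_lengths("ATG---GCT------AA")
print(position_counts, gap_lengths)
```

A strong excess in the third slot of `position_counts`, together with gap lengths divisible by three, is the kind of pattern that suggests a coding region rather than generic conservation.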
Now when the first paper came out, this is 1999, they didn't have as much data as we've
got now. If we went deeper into the SRA1 story, it gets more and more murky,
because it does look like the RNA has some function independent of the coding
region, but that's a different story. This ability to recognize coding regions and
discriminate them from other conservation is something that's now at the heart of a
lot of computational gene-finding methods that are trying to harness the vertebrate
sequences we have available, or the [unintelligible] sequences we have available, or the Caenorhabditis
sequences, work that was really pioneered in bacteria by John Badger,
a graduate student in Gary Olsen's lab, but is now sort of fundamental to this field.
Now, it's not just coding regions that impose their own evolutionary constraints on sequence.
Once you have that idea, you can say, "Oh, I expect transcription factors, because
of the way they contact DNA, to show particular patterns of conserved bases, and the middle
would not be so conserved, because of the way they sort of reach out across and leave a little gap in
their conserved binding sites." Or, in the case of my laboratory, interested in RNAs,
you can say, "There should also be a constraint imposed by RNA secondary structure, at least
for structured RNAs." So I could imagine making a statistical test: you give me an aligned
pair of sequences, and I can test whether the pattern of changes here,
showing four changes, is randomly distributed, so every column is independent, or whether
the four positions are respecting frame, so they tend to be third position, or whether they tend to
be respecting Watson-Crick base pairs, preserving Watson-Crick base
pairs in a correlated fashion, which is a feature we see in lots of structural RNAs. You can
formalize this, and I'll talk in a few minutes about how we formalize this. There's now a
sort of set of Lego blocks of tools that we use computationally to build this kind of statistical test for sequence
analysis. And you can turn this kind of approach into an RNA gene finder, something that will
look for structural RNAs in conserved regions of whatever genome you're looking at. It's been
very successful in bacteria. Signal to noise makes it very problematic in the bigger genomes.
But there's been a lot of great work from Ivo Hofacker and
Jakob Pedersen and other people in taking a basic approach, which was developed by my
wife Elena, and actually getting it to scale in big genomes.
One of the things I love about this field is that you can find
little subtle effects, sort of the way a hacker will say, "Well, if I heard you typing
on a keyboard, the pattern of pauses between your keystrokes is informative about what you're
typing in, so I can figure out your password if I've got a microphone close to you." We
can do the same kind of thing for genome sequences. We can look for very subtle things that evolution
is putting on the sequence. We can detect those patterns. This is just one of many examples
I could have pulled. This trick only really works in bacterial genomes. I wish it worked
elsewhere. The graph is sweeping across four million bases of E. coli, and I'm going to
do a very simple thing. I'm going to count the number of Gs I see, and I'm going to count
the number of Cs I see. And as I go across the top strand, the Watson strand, I'm going
to plot the excess G. And what you get is that plot. That's very nonrandom. It's not
much. There's an excess of about 20,000 Gs as you go across, and then that excess goes
away and starts coming back. Turns out, that's the terminator, and that's the origin of replication.
And remember that E. coli replicates like this. It's not
actually understood why this is, but one of the models is that the lagging strand of replication
is more solvent exposed and more prone to deamination of the C, because that's a
water-driven deamination, and so you get a depletion of C on the lagging strand.
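[Editor's sketch] The plot described above is a running count of Gs minus Cs along one strand. A minimal version follows; the function name and the short toy sequence are illustrative assumptions, and a real analysis would read a multi-megabase genome from a file instead.

```python
from itertools import accumulate

def cumulative_g_excess(sequence):
    """Running total of (+1 for each G, -1 for each C) along the given strand."""
    steps = (1 if base == "G" else -1 if base == "C" else 0
             for base in sequence.upper())
    return list(accumulate(steps))

# Toy sequence: G-rich first half, C-rich second half
toy_strand = "GGGAGGTCCACC"
curve = cumulative_g_excess(toy_strand)
turning_point = curve.index(max(curve))  # where the excess stops growing
print(curve, turning_point)
```

On a circular bacterial chromosome, the positions where such a curve turns around are the candidates for the origin and terminus of replication.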
And that pattern shows up, and if you believed that, you'd expect it to also show up for
transcription. It happens that in E. coli the transcription direction also respects the
replication direction, which is probably why the signal is so clean in E. coli.
Phil Green, *** Smith, and others have tried to turn this idea into an RNA gene finder. It doesn't
work very well with single genomes or single pairs of genomes. But it might be that there's enough
signal in the mutational biases of transcribed regions that we can use this to find things that
have not yet shown up in RNA-seq or epigenome experiments: things
where evolution knows that the thing is being transcribed at some point in some single cell
that would be difficult to measure experimentally.
Now, I also showed you a little picture of a pond ciliate. So now let me get to that part.
Once you're looking for little patterns, you don't have to look for just little patterns
in all organisms, ones that you think E. coli shares with humans, shares with worms. You can now
say, "You know, I can exploit the fact that here's this weirdo creature in some weirdo evolutionary
niche that has weirdo evolutionary pressure. I'm interested in RNAs. I can take advantage
of this little weird creature to find RNAs and then use homology search to get out
of the little weird creature and find homologs in other organisms." So here's an example
of the kind of thing you can do once you have tons of genome sequences lying on your disk.
You can notice that a great observational computational biologist, Jean Lobry,
has published a paper where he says, "Well, I looked at a bunch of bacterial
genomes, and I noticed something interesting. If you plot the optimal growth
temperature of organisms versus the GC content of the genome, I sort of expected that the higher
the temperature, the more GC rich the genome would get. But it doesn't happen that way."
Bacteria that grow at high temperatures have other evolutionary adaptations to high temperature
other than just strengthening their hydrogen bonds. They hold their DNA together by making
reverse gyrase that overwinds the DNA, burns ATP, and puts positive
supercoils into the DNA, and other tricks to stabilize their DNA. But nobody
looked at the GC content of structural RNAs. Structural RNAs go off, they make a single
little strand, it folds up. Now, to stabilize the structural RNA at high temperature, evolution
does drive the GC content.
And so GC content of the genome is sort of randomly varying, but the structural RNAs
tend to be tightly correlated, so if you're an RNA guy you go, "OK. What I'm going to
do is reach into the database and find me the most AT-biased genome that still
grows at a superhigh temperature." The most extreme one you can find is Pyrococcus furiosus.
It was isolated from Vulcano Island in Italy. I actually know some of the people that have
the benefit of being able to go in their scuba gear and collect on the Italian
island. This thing grows at 98 degrees C, 16 percent genome. It is not something you
would think of as an experimental organism. It dies in trace oxygen. It's a strict anaerobe,
grows on elemental sulfur, generates hydrogen sulfide. Nobody likes you on the floor that
you're working on. It does have the advantage that because the normal growth temperature is
98 degrees C, you don't need minus 70 to store your samples. You just put it on the bench
and it thinks that's minus 70.
[laughter]
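[Editor's sketch] The screen this sets up, scanning an AT-rich genome for windows of unusually high GC content, can be sketched as follows. The window size, step, threshold, and the synthetic toy genome are all assumptions chosen for illustration; they are not the parameters of the lab's actual script.

```python
def gc_rich_windows(genome, window=50, step=10, threshold=0.6):
    """Return (start, gc_fraction) for each window at or above the GC threshold."""
    genome = genome.upper()
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        chunk = genome[start:start + window]
        gc_fraction = (chunk.count("G") + chunk.count("C")) / window
        if gc_fraction >= threshold:
            hits.append((start, gc_fraction))
    return hits

# Synthetic toy genome: a 60-base GC-rich island inside an AT-only background
toy_genome = "AT" * 100 + "GC" * 30 + "AT" * 100
hits = gc_rich_windows(toy_genome)
hit_starts = [start for start, _ in hits]
print(hit_starts)
```

In an AT-biased hyperthermophile, windows flagged this way are candidates for structural RNAs, which would then need checking against known RNA families.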
But here's what you can do with its genome. The easy part is looking, because the sequencing was done
by not us. We can just reach into the database and say, "This is a Perl script sweeping
across its genome just counting GCs in windows. There's two tRNAs." So finding structural
RNAs in this organism is completely trivial. When you do this, you find a bunch of little RNAs,
tens of structural RNAs that had not been discovered before, all of which were of known
classes. So this was unable to discover novel classes of structural RNA for bacteria or
archaea. Then you say, "OK. One of the problems in interpreting genomes is we don't know where
a gene stops and a gene starts. So when I'm looking at this regulatory stuff, I don't
know whether the enhancer is for this gene or for that gene, or indeed
whether it's transvecting over to some other chromosome. And if I'm
looking for structural RNAs, I have a problem of just finding them in the first place. The
statistical signals are pretty subtle." So this is some ongoing work from a student in
my lab, Sum Kil Jung [spelled phonetically], working in collaboration with Laura Landweber's
lab at Princeton, working with an organism that was pioneered by David Prescott at the
University of Colorado Boulder. I was a graduate student there, and I've always held this organism
in my head as: this organism is going to be useful for something someday.
[laughter]
Its adaptation is a bizarre one. It actually has two different
kinds of nuclei, and I won't go into all the biology. It has a somatic, transcribed
macronucleus. And in the macronucleus it has about 20,000 different chromosomes; there are
actually several million chromosomes. The average chromosome size is two kilobases, and
these are individual chromosomes: telomere, gene, telomere; telomere, gene, telomere;
telomere, gene, telomere; et cetera. For the most part, not completely, unfortunately, but
for the most part, this organism has identified all of its genes for us by putting telomeres at
the ends of them. So now we can just sequence the genome and say, "Well, that's a gene. That's
a gene. That's a gene."
And since protein genes are relatively obvious, not completely, but relatively obvious, we
can do a subtractive screen and say, "Throw away all the genes in pairs. Everything else
is either an interesting gene we didn't know about or structural RNA or something like
that." And that screen has gone and identified a small number, again, of new RNAs. So let me
close with a couple of words about geeky things, forgive me. Underlying
all of this is computer science, statistics, and mathematics being used to interpret genome sequences.
And at the end of the day, what we're trying to do is start with a sequence and draw
these cartoons where we say, "OK. The sequence is a line, and we're trying to attach labels."
I could show you a protein sequence, and these could be domains, in this case of the Dicer
protein, which was actually identified by sequence analysis. Brenda Bass wrote a review where
she said, "OK. Based on what we know about RNA interference, it's got to have an RNA helicase
activity. It's got to have this. It's got to have that." And it turns out there's
only one protein in C. elegans that has the right combination of domains,
its known function, and that protein turned out to be Dicer.
So that was an important computational clue in the early days of RNA interference.
I could be drawing a DNA sequence and enhancers. You know, this is how we draw things:
exons, introns. We're trying to attach labels to sequences. That is, you show me a piece
of sequence; what I'm really interested in is, what label do I attach to that piece of sequence?
The label is hidden and I'm trying to infer: should it be this label or that label? And
it turns out that there's a big field of mathematics in digital signal processing and speech recognition
that does this, whether it's a speech signal, or an encrypted signal, or
the telemetry off of your car engine. I could actually tell you a story about how
those guys are now using software developed in bioinformatics to do prototyping of engine
telemetry and astronomic telemetry, which is amazing; the field has sort of gone full circle.
Because we've put so much work now into adopting methods called hidden Markov models
and stochastic context-free grammars and other probability models for attaching
labels to sequences that are appropriate either for linear sequence analysis, where you just
say I'm going to do sequence alignment, or, in the case of RNA, where I'm going to align pairs in a sort
of nested set. We have models to do that. They were introduced in the nineties by Gary
Churchill, Gary Stormo, Anders Krogh, and David
Haussler. SCFG models for RNAs were introduced by Yasu Sakakibara
and myself at about the same time. Yasu was in David's lab at the time.
And this has given us a real tool kit to build models, which gives me an opportunity to say
that the most important tool in computational biology, without doubt, is the tool called BLAST.
It's probably familiar to all of you. BLAST does a sequence alignment between a
query that you give it and every sequence in the database, and it looks for things that
are significantly related. When we look at the BLAST algorithm as probabilistic modelers,
we say, "OK. This is an approximation. It's a two-level approximation. Well, it's
a multilevel approximation." It's doing sequence alignment, where what it's trying to do is
attach a label saying those two residues are aligned or those two residues are not aligned.
This is an insertion. This is a deletion. And so it has three states. It says, "I'm
either going to try to align residues X and Y, or I'm going to throw X out as an insertion,
or I'm going to throw Y out as a deletion, as an insertion on the other strand." Those
are three states that I can move between. That's a Markov model: where I go next is
dependent on where I just was. And I have hidden states that I'm trying to infer, connected
by arrows; that's a hidden Markov model. But the arrows in BLAST are associated with
mostly zeros, implicitly, and then there's a gap-open penalty every time I open an insertion
either way. And there's a gap-extend penalty every time I extend an insertion by another
residue on either strand. Then there's a score for aligning the two residues that's due to
the BLOSUM matrix from the Henikoffs.
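[Editor's sketch] The three-state picture above (align, insert in X, insert in Y, with gap-open and gap-extend costs on the transitions) corresponds to the standard affine-gap dynamic program. The sketch below is a plain global alignment score, not BLAST's heuristic local search, and it substitutes a simple match/mismatch score for a BLOSUM matrix; all parameter values are assumptions for illustration.

```python
NEG_INF = float("-inf")

def affine_align_score(x, y, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Best global alignment score with three states per cell.

    M[i][j]:  x[i-1] aligned to y[j-1]
    IX[i][j]: x[i-1] aligned to a gap (insertion in x)
    IY[i][j]: y[j-1] aligned to a gap (insertion in y)
    gap_open is charged for the first gapped residue, gap_extend for each more.
    """
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IX = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IY = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        IX[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        IY[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair_score = match if x[i - 1] == y[j - 1] else mismatch
            # Enter the match state from any of the three states
            M[i][j] = pair_score + max(M[i-1][j-1], IX[i-1][j-1], IY[i-1][j-1])
            # Open a gap from M or the other gap state, or extend the same gap
            IX[i][j] = max(M[i-1][j] + gap_open,
                           IX[i-1][j] + gap_extend,
                           IY[i-1][j] + gap_open)
            IY[i][j] = max(M[i][j-1] + gap_open,
                           IY[i][j-1] + gap_extend,
                           IX[i][j-1] + gap_open)
    return max(M[n][m], IX[n][m], IY[n][m])

print(affine_align_score("ACGT", "ACT"))
```

Replacing each `max` with a sum over probabilities turns the same recursion into the forward algorithm of a pair hidden Markov model, which is the sense in which the implicit zeros on the arrows can become real transition probabilities.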
The bottom line is that we now understand that those zeros really should be probabilities,
or at least, if you're thinking in a probabilistic inference context, we can represent BLAST's
internal model as a probability model, and now we can do things like, "Well, instead of just
giving me the optimal alignment, sum over all possible alignments and then tell me
the probability that this residue really aligns with this residue,
integrated over all possible alignments," and other things. The field has been trying to do that for a
while, and now I want to make a somewhat sociological point.
There's lots of research and a big literature on developing better methods for
sequence analysis. Those methods go into journals like BMC Bioinformatics and what have you,
and none of you read them. Then there's another field which takes very important algorithms
and reduces them down to their bare bones and speeds them up on particular hardware.
And there's a lot you can do on modern hardware. For instance, this paper is from Michael Farrar,
who was, until his untimely death a couple of months ago, unfortunately,
our chief software engineer. I recruited him from Boston. Notice "unaffiliated"; this is
all sociological comment. He was not in the university. He did this particular paper in
his spare time and then was recruited by both Bill Pearson and me to actually become a biologist.
So you can use SIMD, which is single instruction, multiple data. Driven by the graphics
industry, all modern chips are capable of vector-parallelized processing. They're being driven
to this by all the games everybody plays. You can take advantage of those instruction
sets to make bioinformatics software. But the question is: Who's going to do it?
The difference between writing a piece of software that worked for your BMC Bioinformatics
paper and a piece of software that runs fast and can be used by the rest of the community
is an enormous difference. And it's to BLAST's credit that underlying
it is not only terrific theory from Stephen Altschul, Sam Karlin,
and others, and great algorithmics from Gene Myers, Warren Gish,
and others, but terrific software engineering from the NCBI team. It's very
rare to get that kind of investment in a piece of software. And that's really one of the
things that makes BLAST fast.
Frankly, in our lab we are frustrated that BLAST was written 20 years ago, and it's now
difficult to adapt it to stuff that we think we know with probability modeling now. There's
an effort in my lab to take the hidden Markov model methods and speed them up to
BLAST speed. And we've been investing a lot of time in this to get the engineering up to snuff.
One of the goals is for you to be able to do an HMM-based search, like a BLAST search, on
a web server where you get the response, no matter what the size of the RNA database, in a
hundred milliseconds or less. That is faster than a Google search, so that you can do interactive
searching rather than waiting for a batch job and actually start exploring what sequence
space looks like in all these wonderful organisms. And we're within an order of magnitude of
being able to do that.
Again, reinforcing the point, and I don't want to belabor this, we have been working
to engineer tools that can do this kind of probability analysis. The point I want to
make on this slide is the difference between a tool that runs well enough for a publication
and one that's useful. I could probably write HMMER in a thousand lines of C code. The actual code, which is not as
good as we want, is 44 thousand lines [unintelligible], so we maintain a big code base trying to make
this stuff useful, and that requires engineering.
But it also means I have a two-faced view of what we're looking at, and it's sort of not great
for one's mental health at times. We are very much immersed in two levels of code: one
level of looking at all those wonderful genome sequences, but also the level of our C code
trying to interpret all that. And both of those are evolutionary artifacts that
are difficult to understand.
With that I'll stop. And in contrast, as a sort of counterpoint to Eric's world:
my little lab, which is a husband and wife team (we've been working together for longer
than we've been together as a couple), is a very small laboratory at Janelia Farm, where we're
pretty much dedicated to building these kinds of tools for the community to use. And I'll
stop. And I'd be happy to take questions.
[applause]
Male Speaker: We have time for a quick question if someone's
going to race to a microphone. Is that someone racing to a microphone or racing out the door?
Racing out the door. So in that case, Sean will be available at the break. We're going
to take a break now, and I'm sure he'll be available if you have any questions.