Reading Genomes Bit by Bit - Sean Eddy

Moving to our next speaker: Sean Eddy of the Howard Hughes Medical Institute's Janelia Farm is here to talk about reading genomes bit by bit. Sean?
Sean Eddy: It's an honor to be here, and it's been a great ride. I was entering graduate school when Renato Dulbecco wrote that article about the Human Genome Project and why we should do it, so I've been able to see this from the beginning. I want to convey to you some of the excitement that I feel as we get to look at all this stuff and immerse ourselves in the As, Cs, Gs, and Ts. I also want to talk a little bit about some of the biology besides medicine and besides the human genome. I'm going to talk about some geeky things about computers and software. But in the biology parts, I'm going to try to explain two things. I'm going to try to explain why we sequenced not only the human genome but a bunch of other things, including lots of flies and lots of worms, not just melanogaster but all around melanogaster, and also why we sequenced this particular single-celled pond protozoan and other things like it. I'll have a little bit about that. Where I'm going to start, though, is that at the end of the day, when we look at a DNA sequence, at least if you're a computational geek, you look at the sequence of As, Cs, Gs, and Ts and you say, "This is a symbolic sequence." We know a lot about cracking codes by both experiment and statistical analysis. Figuring out the meaning of apparently impenetrable languages is something humans have done for a long time, and it's a great intellectual thing to do. Puzzles are based on this. It gets very addictive. One of the great examples of this is described in a book called The Decipherment of Linear B by John Chadwick, writing about his buddy Michael Ventris, a mathematician at Cambridge. This is an apropos tale. It is 1953 in Cambridge.
One code was cracked in the pub, and another was cracked by Ventris. This is the Linear B language, a language that had been found in Mycenaean caves and was only known in symbols. It's one of the few examples where we didn't know anything about the language. It had to be cracked by direct statistical attack. Ventris' attack was comparative sequence analysis, and the idea was to look for statistical regularities between different tablets. As far as I'm concerned, this is one of the early successes in comparative sequence analysis. In this case Linear B turns out to be an alternative script for Greek, and so the problem becomes somewhat easier than the problem that we're faced with. But we have a lot of examples. And it's fairly mind-blowing to me, still, to go to our disks. What you're seeing on the left, unreadably, is a listing of the top level of the disk at Janelia Farm where we keep our genomes. This is like some pre-Victorian phylogenetic tree, because you can sort of see algae, amoeba, amphibia, archaea, bacteria, in an order that's sort of percolating up and down depending on what my laboratory and I and my wife are working on at any given time. And it's remarkable to me that sitting in these directories is the actual source code for all these different wonderful creatures that we see. There is a lot of talk about how much data we're talking about. It is a fair amount of data, but it's not, at least at the moment, completely mind-blowing. For my lab, that disk that I just showed you is about 450 gigs. We have most of the genome sequences that are available. If we were in the business of manipulating human data, if we were taking raw images off an Illumina, that's a lot of data per human genome, and that's difficult to store. It's difficult to ship down the internet. But once it gets to assembly, it's not so bad to store it per year. It's not so bad to transmit it down the internet.
And as Eric just mentioned, there's proof of principle that we can store human genomes as differences. So we know, sort of intellectually, in principle, that we can handle the genome for the next couple of years, but it's nontrivial. The 1000 Genomes Project has generated five terabytes in its pilot project. That's not a big deal. We have a petabyte of spinning disk at Janelia. We could store that if we needed to. The NCBI read archive is starting to fall over. It's approaching a petabyte soon. But these are volumes of data right now that are not too different from, you know, your iTunes collection. And I've got this: sort of like the way the Vikings used to drink beer from the skulls of their enemies, [laughter] my coffee coaster is the solara genome sequence. [laughter] When we look at this sequence, this is one of the few genome sequences that fits on a slide. This is phi X 174, sequenced by Sanger and Coulson back in the 70s. The sequence of this little bacterial virus is actually quite interpretable, because, of course, there are statistical regularities. There's an ATG start. There's one of the three stop codons. There's an open reading frame. And we can walk through a genome like this, because the bacterial genome tends to be packed with open reading frames, and we can do a pretty good job of interpreting this genome. There's interesting stuff going on in the gaps, and sometimes an interesting little RNA. I'm quite interested in little RNAs that have functions, as Eric was also talking about. Once we get into the human genome and the big vertebrate genomes, we have a bit of a problem. Only one, two, three percent of it is coding, depending on who you're talking to, and one of the most important signals that we can try to use is sequence conservation. We line up a bunch of genomes, and we say this set of bases is essentially the same across some clade of evolution.
Really key in our ability to do this kind of analysis with the human genome was the development of genome browsers by Jim Kent and David Haussler and the Ensembl browser by Ewan Birney; having the ability to align large quantities of genome, which has been done by Webb Miller's group among others; and then the ability to take those genome alignments and calculate some simple statistic of how conserved I think the various bits are, generating these plots where you can sort of see, in blue, spikes on the exons, of course, in this case of a single gene, the p53 gene of the human, and then also spikes on some other areas of noncoding conservation. Adam Siepel's program [unintelligible] underlies a lot of the calculations that people use for getting a quick look at this conservation. Those little spikes outside exons represent unexpected wealth. We expected to see regulatory sequence. We know there's transcriptional regulatory stuff. We know there are enhancers, there are promoters, there's what have you. But now we can use the conservation to really tell us where that stuff is likely to be. And so that's really the answer to the first sort of biological thing that I said: "Why did we sequence so many flies? Why did we sequence so many worms?" You can sit down, and you can do a power calculation, and you can say, "OK. Here's what I'm really trying to do. Here's a region of DNA. I've got the human. I've got the mouse. I'm expecting about 40 percent difference between those two sequences just by neutral drift, by the evolutionary clock. So if it's a hundred nucleotides, I expect 40 differences. I only see 10. Should I be surprised? Ten is a small number. Forty is a small number."
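The "should I be surprised?" question above can be made concrete with a small calculation. What follows is only a minimal sketch under a simplifying assumption that is not from the talk: every site differs independently with the same neutral probability, so the number of differences in a window is binomial. The function names, the 3-billion-base genome size, and the use of non-overlapping windows are illustrative choices, not the method used in the published power studies.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def expected_false_positives(k, window, neutral_rate, genome_size):
    """Expected number of neutral windows with k or fewer differences by chance."""
    n_windows = genome_size // window  # non-overlapping windows, a simplification
    return binomial_cdf(k, window, neutral_rate) * n_windows


if __name__ == "__main__":
    GENOME_SIZE = 3_000_000_000  # assumed round figure for the human genome
    NEUTRAL_RATE = 0.40          # human-mouse neutral divergence quoted in the talk
    # (window length, observed differences): the talk's example, then a 10-nt window
    for window, observed in [(100, 10), (10, 1)]:
        p = binomial_cdf(observed, window, NEUTRAL_RATE)
        fp = expected_false_positives(observed, window, NEUTRAL_RATE, GENOME_SIZE)
        print(f"window={window:4d} observed<={observed:2d} "
              f"p={p:.3g} expected false positives={fp:.3g}")
```

Under these assumptions, 10 differences in 100 sites is very unlikely to arise by chance even across a whole genome, while the same fourfold depletion in a 10-nucleotide window is expected by chance in millions of windows. That contrast is the reason more aligned genomes are needed for fine resolution.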
You can do a power calculation: for a given amount of conservation, how far do you have to drive the two distributions apart before you can reliably tell that you've got a conserved piece of DNA, given that the human genome is quite big, so you're going to have false positives? Greg Cooper, Arend Sidow, and others, followed by me, have done these kinds of simulations and power studies. And if you want to drive the resolution down to single nucleotides, or five or ten nucleotides, you can pretty quickly convince yourself that for typical distances between vertebrate genomes of, like, 0.2 to 0.4 substitutions per site, you're going to need tens to even hundreds of genomes lined up. So it's not so much that we're interested in the platypus per se, but it's one of the hundred genomes or the thousand genomes that we are going to line up against the human to try to figure out what's functional in the human. And that's a very crude calculation: it's conserved, so it must be doing something interesting. What you can also do is look at the pattern of conservation, and now you can do much more interesting things. For instance, if you're dealing with a coding region, obviously there's a pattern of conservation that tends to respect the triplet periodicity of the genetic code. So you can do something like take a region that seems to be conserved. In this case this is a poster child for a long intergenic noncoding RNA, a gene called SRA1, a gene that when it was first cloned had a truncated cDNA that went to this point, and you'll notice the ATG start codon there. Then you color in where all the mutations are. You color third-position changes red, first-position changes green, and second-position changes blue, and you can see most of the changes here are in red. They're respecting the frame.
There are two insertions in the picture I pulled out, one of six and one of three, again respecting frame, and when you do this over the entire aligned region of the SRA1 noncoding RNA, it's pretty clear that this is a coding gene of 232 amino acids in the mouse and 236 amino acids in the human. Now, when the first paper came out, in 1999, they didn't have as much data as we've got now. And if we went deeper into the SRA1 story, it gets more and more murky, because it does look like the RNA has some function independent of the coding region, but that's a different story. This ability to recognize coding regions and discriminate them from other conservation is something that's now at the heart of a lot of computational gene-finding methods that are trying to harness the vertebrate sequences we have available, or the [unintelligible] sequences we have available, or the Caenorhabditis sequences, work that was really pioneered in bacteria by John Badger, a graduate student in Gary Olsen's lab, but is now sort of fundamental to this field. Now, it's not just coding regions that impose their own evolutionary constraints on sequence. Once you have that idea, you can say, "Oh, I expect transcription factors, because of the way they contact DNA, to show particular patterns of conserved bases, and the middle would not be so conserved because of the way they sort of reach out across and leave a little gap in their conserved binding sites." Or, in the case of my laboratory, which is interested in RNAs, you can say, "There should also be a constraint imposed by RNA secondary structure, at least for structured RNAs." So I could imagine making a statistical test: you give me an aligned pair of sequences, and I can test whether the pattern of changes here, showing four changes, is randomly distributed, with sort of every column independent, or whether the four positions are respecting frame.
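A toy version of that frame test can be written in a few lines. This is a hedged sketch, not the method used by any published gene finder: it assumes an ungapped pairwise alignment, a known reading frame, and a null model in which each substitution is equally likely to fall on any codon position. The function names and the two example sequences are invented for illustration.

```python
from math import comb


def substitutions_by_codon_position(seq_a, seq_b, frame=0):
    """Count mismatches at codon positions 1, 2, 3 of an aligned sequence pair.

    `frame` is the index of the first base of a codon; gaps and ambiguous
    characters are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {1: 0, 2: 0, 3: 0}
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a in "ACGT" and b in "ACGT" and a != b:
            counts[(i - frame) % 3 + 1] += 1
    return counts


def third_position_p_value(counts):
    """P(at least this many third-position changes) if positions were random."""
    n = sum(counts.values())
    k = counts[3]
    return sum(comb(n, i) * (1 / 3) ** i * (2 / 3) ** (n - i)
               for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical aligned fragments; all four differences sit on third positions.
    human = "ATGGCTAAAGGTCTGGAA"
    mouse = "ATGGCCAAGGGACTGGAG"
    counts = substitutions_by_codon_position(human, mouse)
    print(counts, third_position_p_value(counts))
```

With only four changes the evidence is weak, which fits the point that such tests gain power as more aligned sequence becomes available. A test for structural RNA would instead ask whether changes occur in compensating pairs, which this sketch does not attempt.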
They tend to be third position, or they tend to be respecting Watson-Crick base pairs, preserving Watson-Crick base pairs in a correlated fashion, which is a feature we see in lots of structural RNAs. You can formalize this, and I'll talk in a few minutes about how we formalize this. There's now a sort of set of Lego blocks of tools that we use computationally to build this kind of statistical test for sequence analysis. And you can turn this kind of approach into an RNA gene finder, something that will look for structural RNAs in conserved regions of whatever genome you're looking at. It's been very successful in bacteria. Signal-to-noise makes it very problematic in the bigger genomes. But there's been a lot of great work from Ivo Hofacker and Jakob Pedersen and other people in taking a basic approach, which was developed by my wife Elena, and actually getting it to scale in big genomes. And one of the things I love about this field is that you can find little subtle effects, sort of the way a hacker will say, "Well, if I heard you typing on a keyboard, the pattern of pauses between your keystrokes is informative about what you're typing, so I can figure out your password if I've got a microphone close to you." We can do the same kind of thing for genome sequences. We can look for very subtle things that evolution is putting on the sequence. We can detect those patterns. This is just one of many examples I could have pulled. This trick only really works in bacterial genomes. I wish it worked elsewhere. The graph is sweeping across four million bases of E. coli, and I'm going to do a very simple thing. I'm going to count the number of Gs I see, and I'm going to count the number of Cs I see. And as I go across the top strand, the Watson strand, I'm going to plot the excess G. And what you get is that plot. That's very nonrandom. It's not much.
There's an excess of about 20,000 Gs as you go across, and then that excess goes away and starts coming back. It turns out that's the terminator, and that's the origin of replication. And remember that E. coli replicates like this. It's not actually understood why this is, but one of the models is that the lagging strand of replication is more solvent exposed and more prone to deamination of the C, because that's a water-driven deamination, and so you get a depletion of C on the lagging strand. And that pattern shows up, and if you believed that, you'd expect it to also show up for transcription. It happens that in E. coli the transcription direction also respects the replication direction, which is probably why the signal is so clean in E. coli. Phil Green, *** Smith, and others have tried to turn this idea into an RNA gene finder. It doesn't work very well with single genomes or single pairs of genomes. But it might be that there's enough signal in the mutational biases of transcribed regions that we can use this to find things that have not yet shown up in RNA-seq or epigenome experiments: things where evolution knows that the thing is being transcribed at some point, in some single cell, that would be difficult to measure experimentally. Now, I also showed you a little picture of a pond ciliate. So let me get to that part. Once you're looking for little patterns, you don't have to look just for little patterns that you think all organisms share, that E. coli shares with humans, shares with worms. You can now say, "You know, I can exploit the fact that here's this weirdo creature in some weirdo evolutionary niche that has weirdo evolutionary pressure. I'm interested in RNAs. I can take advantage of this little weird creature to find RNAs and then use homology search to get out of the little weird creature and find homologs in other organisms."
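The G-minus-C count described for E. coli is simple enough to sketch directly. The code below is a minimal illustration, assuming a plain string of one strand's sequence; the function names and the synthetic test sequence are invented here, and reading a real genome from a FASTA file is left out.

```python
import random


def cumulative_gc_excess(seq):
    """Running count of (number of Gs) minus (number of Cs) along one strand."""
    total = 0
    excess = []
    for base in seq.upper():
        if base == "G":
            total += 1
        elif base == "C":
            total -= 1
        excess.append(total)
    return excess


def skew_extremes(seq):
    """Return (index of minimum, index of maximum) of the cumulative excess."""
    excess = cumulative_gc_excess(seq)
    positions = range(len(excess))
    return (min(positions, key=excess.__getitem__),
            max(positions, key=excess.__getitem__))


if __name__ == "__main__":
    # Synthetic stand-in for a genome: a slight G excess in the first half
    # and a slight C excess in the second half (weights are A, C, G, T).
    rng = random.Random(0)
    first = rng.choices("ACGT", weights=[25, 23, 27, 25], k=20000)
    second = rng.choices("ACGT", weights=[25, 27, 23, 25], k=20000)
    i_min, i_max = skew_extremes("".join(first + second))
    print("turning points near", i_min, "and", i_max)
```

On a real circular bacterial chromosome the two turning points of this curve are the candidates for the origin and terminus of replication; which extreme corresponds to which depends on the strand and the starting coordinate, so the sketch only reports their positions.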
So here's an example of the kind of thing you can do once you have tons of genome sequences lying on your disk. You can notice that a great sort of observational computational biologist, Jean Lobry, has published a paper where he says, "Well, I looked at a bunch of bacterial genomes, and I noticed something interesting. If you plot the optimal growth temperature of organisms versus the GC content of the genome, I sort of expected that the higher the temperature, the more GC-rich the genome would get. But it doesn't happen that way." Bacteria that grow at high temperatures have other evolutionary adaptations to high temperature, other than just strengthening their hydrogen bonds. They hold their DNA together by making reverse gyrase, which overwinds the DNA, burns ATP, and puts positive supercoils into the DNA, and by other tricks to stabilize their DNA. But nobody looked at the GC content of structural RNAs. Structural RNAs go off, they make a single little strand, and it folds up. Now, to stabilize the structural RNA at high temperature, evolution does drive the GC content. And so the GC content of the genome is sort of randomly varying, but the structural RNAs tend to be tightly correlated. So if you're an RNA guy you go, "OK. What I'm going to do is I'm going to reach into the database and find me the most AT-biased genome that still grows at a superhigh temperature." The most extreme one you can find is Pyrococcus furiosus. It was isolated from Vulcano Island in Italy. I actually know some of the people that have the benefit of being able to go in their scuba gear and collect on the Italian island. This thing grows at 98 degrees C, 16 percent genome. It is not something you would think of as an experimental organism. It dies in trace oxygen. It's a strict anaerobe, grows on elemental sulfur, generates hydrogen sulfide. Nobody likes you on the floor that you're working on.
It does have the advantage that, because the normal growth temperature is 98 degrees C, you don't need minus 70 to store your samples. You just put it on the bench, and it thinks that's minus 70. [laughter] But here's what you can do with its genome. The easy part is looking, because the sequencing was done by not us. We can just reach into the database, and this is a Perl script sweeping across its genome, just counting GCs in windows. There are two tRNAs. So finding structural RNA in this organism is completely trivial. When you do this, you find a bunch of little RNAs, tens of structural RNAs that have not been discovered before, all of which were known classes. So this was unable to discover novel classes of structural RNA for bacteria or archaea. Then you say, "OK. One of the problems in interpreting genomes is we don't know where a gene stops and a gene starts. So when I'm looking at this regulatory stuff, I don't know whether the enhancer is for this gene or for that gene, or indeed whether it's transvecting [spelled phonetically] over to some other chromosome. And if I'm looking for structural RNAs, I have a problem of just finding them in the first place. The statistical signals are pretty subtle." So this is some ongoing work from a student in my lab, Seolkyoung Jung, working in collaboration with Laura Landweber's lab at Princeton, working with an organism that was pioneered by David Prescott at the University of Colorado Boulder. I was a graduate student there, and I've always held this organism in my head as, "This organism is going to be useful for something someday." [laughter] Its adaptation is a bizarre one. It actually has two different kinds of nuclei, and I won't go into all the biology, but it has a macronucleus. It's a somatic, transcribed macronucleus. And in the macronucleus it has about 20,000 different chromosomes; there are actually several million chromosomes.
The average chromosome size is two kilobases, and these are individual chromosomes: telomere, gene, telomere; telomere, gene, telomere; telomere, gene, telomere; et cetera. For the most part, not completely, unfortunately, but for the most part, this organism has identified all of its genes for us by putting telomeres at the ends of them. So now we can just sequence the genome and say, "Well, that's a gene. That's a gene. That's a gene." And since protein genes are relatively obvious, not completely, but relatively obvious, we can do a subtractive screen and say, "Throw away all the genes in pairs. Everything else is either an interesting gene we didn't know about or a structural RNA or something like that." And that screen has gone and identified, again, a small number of new RNAs. So let me close with a couple of words about geeky things, forgive me. Underlying all of this is computer science, statistics, and mathematics being used to interpret genome sequences. At the end of the day, what we're trying to do is we start with a sequence and we draw these cartoons where we say, "OK. The sequence is a line. And we're trying to attach labels." I could show you a protein sequence, and these could be domains, in this case of the Dicer protein, which was actually done by sequence analysis. Brenda Bass wrote a review where she said, "OK. Based on what we know about RNA interference, it's got to have an RNA helicase activity. It's got to have this. It's got to have that." And it turns out there's only one protein in C. elegans that has the right combination of domains, its known function, and that protein turned out to be Dicer. So that was an important computational clue in the early days of RNA interference. I could be drawing a DNA sequence and enhancers. You know, this is how we draw things: exons, introns. We're trying to attach labels to sequences.
That is, you show me a piece of sequence, and what I'm really interested in is, what label do I attach to that piece of sequence? The label is hidden, and I'm trying to infer, should it be this label or that label? And it turns out that there's a big field of mathematics in digital signal processing and speech recognition that does this, whether it's a speech signal, or an encrypted signal, or the telemetry off of your car engine. I could actually tell you a story about how those guys are now using software developed in bioinformatics to do prototyping of engine telemetry and astronomic telemetry, which is amazing; the field has sort of gone full circle. That's because we've put so much work now into adopting methods called hidden Markov models and stochastic context-free grammars and other methods, probability models for attaching labels to sequences, that are appropriate either for linear sequence analysis, where you just say I'm going to do sequence alignment, or, in the case of RNA, where I align pairs in sort of nested sets. We have models to do that. They were introduced in the nineties by Gary Churchill, Gary Stormo, Anders Krogh, and David Haussler. SCFG models for RNAs were introduced by [unintelligible], Sakakibara, and myself at about the same time; Yasu was in David's lab at the time. And this has given us a real toolkit to build models, which gives me an opportunity to say that the most important tool in computational biology, without doubt, is the tool called BLAST. It's probably familiar to all of you. BLAST does a sequence alignment between a query that you give it and every sequence in the database, and it looks for things that are significantly related. When we look at the BLAST algorithm as probabilistic modelers, we say, "OK. This is an approximation. It's a two-level approximation. Well, it's a multilevel approximation."
It's doing sequence alignment, where what it's trying to do is attach a label saying those two residues are aligned or those two residues are not aligned. This is an insertion. This is a deletion. And so it has three states. It says, "I'm either going to try to align residues X and Y, or I'm going to throw X out as an insertion, or I'm going to throw Y out as a deletion, as an insertion on the other strand." Those are three states that I can move between. That's a Markov model: where I go next is dependent on where I just was. And so I have hidden states that I'm trying to infer, connected by arrows. That's a hidden Markov model. But the arrows in BLAST are associated with mostly zeros, implicitly, and then there's a gap-open penalty every time I open an insertion either way, and there's a gap-extend penalty every time I extend an insertion by another residue on either strand. Then there's a score for aligning the two residues that's due to the BLOSUM matrix from the Henikoffs. The bottom line is that we now understand that those zeros really should be probabilities, or at least, if you're thinking in a probabilistic inference context, we can represent BLAST's internal model as a probability model, and now we can do things like, "Well, instead of just giving me the optimal alignment, sum over all possible alignments and then tell me the probability that this residue aligns with this residue, integrated over all possible alignments," and other things. The field has been trying to do that for a while, and now I want to make a somewhat sociological point. There's lots of research and a big literature on developing better methods for sequence analysis. Those methods go into journals like BMC Bioinformatics and what have you, and none of you read them. Then there's another field which takes very important algorithms and reduces them down to their bare bones and speeds them up on particular hardware.
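The three-state picture of alignment described in this passage (aligned pair, insertion in one sequence, insertion in the other, with gap-open and gap-extend penalties) can be sketched as a small dynamic program. This is an illustrative toy under stated assumptions, not BLAST's algorithm: it scores a global rather than local alignment, has no heuristic seeding, and uses a flat match/mismatch score where BLAST would use a substitution matrix such as BLOSUM. The function name and the default scores are invented for the example.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Best global alignment score using three states.

    M: x[i] aligned to y[j]; X: x[i] aligned to a gap; Y: y[j] aligned to a gap.
    gap_open is charged for the first residue of a gap, gap_extend for each
    additional residue.
    """
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):  # leading gap in y
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):  # leading gap in x
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Enter the match state from any of the three states.
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a gap from M or Y, or extend an existing gap in X.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            # Symmetric case for gaps in the other sequence.
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(affine_align_score("ACGT", "ACT"))
```

Each `max` picks the single best path into a state, which corresponds to finding one optimal alignment. Replacing the scores with log probabilities and each `max` with a sum over paths is the kind of change that turns this into the probabilistic calculation the talk describes, where confidence in each aligned pair is summed over all alignments.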
And there's a lot you can do on modern hardware. For instance, this paper is from Michael Farrar, who was, until his untimely death a couple of months ago, unfortunately, our chief software engineer. I recruited him from Boston. Notice "unaffiliated"; this is all sociological comment. He was not in the university. He did this particular paper in his spare time and then was recruited by both Bill Pearson and me to actually become a biologist. So you can use SIMD, which is single instruction, multiple data. Driven by the graphics industry, all modern chips are capable of vector-parallelized processing. They're being driven to this by all the games everybody plays. You can take advantage of those instruction sets to make bioinformatics software. But the question is, who's going to do it? The difference between writing a piece of software that worked for your BMC Bioinformatics paper and a piece of software that runs fast and can be used by the rest of the community is an enormous difference. And it's to BLAST's credit that underlying it is not only terrific theory from Stephen Altschul, Sam Karlin, and others, and great algorithmics from Gene Myers, Warren Gish, and others, but terrific software engineering from the NCBI team. It's very rare to get that kind of investment in a piece of software. And that's really one of the things that makes BLAST fast. Frankly, in our lab we are frustrated that BLAST was written 20 years ago, and it's now difficult to adapt it to stuff that we think we know with probability modeling now. There's an effort in my lab to now take the hidden Markov model methods and speed them up to BLAST speed. And we've been investing a lot of time in this to get the engineering up to snuff.
One of the goals is for you to be able to do an HMM-based search, like a BLAST search, on a web server where you get the response, no matter what size of the RNA database, in a hundred milliseconds or less. That is faster than a Google search, so that you can do interactive searching rather than waiting for a batch job and actually start exploring what sequence space looks like in all these wonderful organisms. And we're within an order of magnitude of being able to do that. Again, reinforcing the point, and I don't want to belabor this, we have been working to engineer tools that can do this kind of probability analysis. The point I want to make on this slide is the difference between a tool that runs well enough for a publication and a tool the community can use. I could probably write HMMER in a thousand lines of C code. The actual code, which is not as good as we want, is 44 thousand lines in Somoa, so we maintain a big code base trying to make this stuff useful, and that requires engineering. But it also means I have a two-faced view of what we're looking at, and it's sort of not great for one's mental health at times. We are very much immersed in two levels of code: one level of looking at all those wonderful genome sequences, but also the level of our C code trying to interpret all that. And both of those are evolutionary artifacts that are difficult to understand. With that I'll stop. And in contrast, as a sort of counterpoint to Eric's world, my little lab is a husband-and-wife team (we've been working together for longer than we've been together as a couple), a very small laboratory at Janelia Farm, where we're pretty much dedicated to building these kinds of tools for the community to use. And I'll stop. And I'd be happy to take questions. [applause]
Male Speaker: We have time for a quick question if someone's going to race to a microphone. Is that someone racing to a microphone or racing out the door? Racing out the door. So in that case, Sean will be available at the break.
We're going to take a break now then, and I'm sure he'll be available if you have any questions.