Melissa Moore - UMass/HHMI - Part 1 - Split Genes and RNA Splicing

Hi. My name is Melissa Moore and I'm from the Howard Hughes Medical Institute and the University of Massachusetts Medical School, and I'm here today to tell you about split genes and RNA splicing. Most of you are familiar with the central dogma of biology, which was elaborated by Sir Francis Crick in 1956. What the central dogma says is that DNA is copied into, or transcribed into, RNA, and then that RNA is translated into a different language, and that is the language of proteins. So one of the central tenets of the central dogma of biology is that one gene encodes one protein. And this is very much true in bacteria, where if one breaks open a bacterial cell and spreads out the DNA and RNA, you can see very clearly that here is the DNA, and as the DNA is being transcribed into RNA, the RNA is being joined by ribosomes. So these big black blobs here are polyribosomes, and those are making protein. And so the RNA is copied, or translated, directly into proteins. So here I'm showing a eukaryotic cell next to a prokaryotic cell. These are bacteria next to a white blood cell. And you can see that the white blood cell is much, much larger than the bacteria. And so not only are eukaryotic cells larger than bacteria, but they are also more complicated internally. So this is the bacterial structure down here, and this would be the eukaryotic internal structure, and you can see that in the eukaryote, the DNA is in the nucleus whereas the ribosomes are out here in the cytoplasm, as you can particularly see on the rough ER here. And so the big difference for a eukaryotic cell is that the DNA is not directly accessed by the ribosomes. Also, in the eukaryotic cell, this is a gene that is being transcribed from this direction to this direction. And so you can see that the RNA coming off of the DNA is getting longer and longer as it's being transcribed in this direction. But at some point there are these loops that form. And in fact, these loops are introns that get spliced out.
So unlike prokaryotic genes, eukaryotic genes are split in nature. They have segments that need to be spliced out before they can be used to make proteins. So this slide shows the current-day view of eukaryotic gene expression. And that is that eukaryotic genes are split, meaning that they contain sequences that are not contained within the final mRNA and that are not translated into protein sequence. Those sequences are called introns and they are represented by the white line here, and the exons are the little colored boxes. Those are the expressed regions. When a gene is turned on and transcribed, the entire gene is transcribed first into a pre-messenger RNA, or pre-mRNA, and that pre-mRNA undergoes several steps of processing. First, it is capped with a 7-methylguanosine cap. At the 3' end it is cleaved and then a poly-A tail is added. And then in the middle, these intron sequences are literally spliced out. So there is a molecular scissors and tape that does the job of pre-mRNA splicing. After all of the processing is done, the mRNA migrates to the nuclear envelope, where it is exported and then translated into proteins. This shows the structure of a typical human gene. So the typical human gene has 23,000 base pairs and 7 introns. Now I'm using a median here because if I used the average, it would be very much thrown off by some of the very large genes that we're going to talk about in a little bit. The other thing that you'll notice is that the typical human gene has a median intron length more than 10 times the median exon length. So that means that whenever a human gene is transcribed, 90 to 95% of the RNA is immediately spliced out and thrown away. And that seems rather wasteful. So we'll be talking about why it is that we have these intronic regions. What good are they if we're wasting that much RNA? But before we talk about that, I want to tell you about a particular gargantuan gene, and that is the dystrophin gene.
So dystrophin encodes a protein that is necessary for your muscles, and mutations in this gene are one of the causes of muscular dystrophy. The DMD gene is the second largest gene in the human genome. It's 2.2 million base pairs long. It has 79 exons and 78 introns, and one of those introns is 400,000 nucleotides long. Now it's really hard for you, with just me saying this, to really get a sense of the scale of this thing. So next I'm going to show you a little movie so that you can really see just how big this gene is. Here we are with our scaled model of the dystrophin gene, which is represented by this rope. And I want you to look at the end of this rope. So the little colored tape marks here are the exons. And the white rope is the intron. And you can see that here is the next exon. So the introns are much, much longer than the exons. Now, just to give you a sense of how big this gene is, I'm going to pretend I am RNA polymerase. And I start transcribing this gene when you get up in the morning. So this is your time point of breakfast here. So here I go. Here's RNA polymerase. I'm transcribing this gene. It's not evening. It's still morning. Here's our mid-morning snack. So we're still going. Polymerase is incredibly processive on this gene. We haven't even gone a million base pairs yet. Now we're about halfway through the gene, so this is lunchtime. The polymerase is still transcribing this gene. We're still going, still going. Here we are at mid-afternoon. We're getting into dinnertime now. And we've been going for about thirteen or fourteen hours now. Finally, when you're about ready to go to bed, after 16 hours, we get to the very end of the gene, the last exon. It took 16 hours for polymerase to transcribe this entire dystrophin gene. Alright, so you've just seen how long dystrophin RNA is. And I can tell you that was a 99.4-foot rope. Now once all of those introns are removed, to scale, this is the size of the messenger RNA. This messenger RNA is very long.
It is a 17,000-base messenger RNA. But it is less than 1% of the original RNA that was transcribed. You know, really pretty amazing. So now we want to talk about the question of why we have all of these introns. It's really quite amazing that we have so much apparent junk DNA that is copied into RNA and then immediately thrown away. What good are these introns? One thing that introns do for us is to allow eukaryotic cells to evolve new genes readily. And they can evolve new genes by exon duplication. So let's take, for example, the fibronectin protein. You can see here that the fibronectin protein is made up of many repeats of different domains. These domains are also found in other proteins. So for example, the yellow domain is found in cell surface receptors and other extracellular matrix proteins. The blue domain is found in blood coagulation proteins, and the red domain is also found in TPA, or tissue plasminogen activator. So the way that fibronectin was created was to take all these different domains, bring them in from other places, and hook them up. And the way that that can be done is, if you look now at this diagram of the fibronectin gene, you can see that each of these domains consists of either one or two exons. So the advantage of having introns in this case is that by non-homologous recombination, big chunks of the genome can be taken from one place to another, and they don't necessarily have to be hooked up perfectly because the splicing machinery hooks them up. Whereas in bacteria, where there are no introns, if there is a non-homologous recombination event to make a new protein, then it has to be perfectly in frame, and it is much harder to do because there's not the genetic space in which to do the recombination. So that is one real advantage of having introns. Now the second major advantage we will discuss by looking at the gene number versus complexity problem. So let's consider how many genes different kinds of organisms have.
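As a brief aside, the dystrophin figures quoted above (a 2.2 million base pair gene, a 17,000-base mRNA, and 16 hours of transcription) can be checked with a few lines of arithmetic. The sketch below is only an illustration using those rounded numbers; the function names are made up for this example, and the elongation rate it gives is an average implied by the quoted figures, not a measured value.

```python
# Figures quoted in the talk (rounded).
GENE_LENGTH_NT = 2_200_000    # length of the dystrophin gene
MRNA_LENGTH_NT = 17_000       # length of the spliced messenger RNA
TRANSCRIPTION_HOURS = 16      # time for polymerase to traverse the gene


def retained_fraction(mrna_nt: int, gene_nt: int) -> float:
    """Fraction of the primary transcript that survives splicing."""
    return mrna_nt / gene_nt


def nucleotides_per_second(gene_nt: int, hours: float) -> float:
    """Average elongation rate implied by the quoted length and time."""
    return gene_nt / (hours * 3600)


fraction = retained_fraction(MRNA_LENGTH_NT, GENE_LENGTH_NT)
rate = nucleotides_per_second(GENE_LENGTH_NT, TRANSCRIPTION_HOURS)

print(f"retained in mRNA: {fraction:.2%}")
print(f"implied average elongation rate: {rate:.0f} nt/s")
```

With these inputs the retained fraction comes out a little under 0.8%, consistent with "less than 1%", and the implied average rate is a few tens of nucleotides per second.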
So here, for example, is E. coli, a bacterium, and S. cerevisiae, budding yeast, or the yeast that is used to make bread or brew beer. E. coli is a bacterium in your gut. Now we've already shown that bacteria are much simpler cells than eukaryotes, so you might expect E. coli not to have as many genes as S. cerevisiae, and that's true. So E. coli has about 3,200 genes and S. cerevisiae has about 6,000 genes. And by genes here, I'm talking about protein-coding genes. Now if we go up the evolutionary ladder, let's consider two other model organisms: C. elegans, or the roundworm, and Drosophila melanogaster, the fruit fly. Now on the surface, it might seem that Drosophila is the more complicated organism, so it would have more genes than the roundworm, but in fact, it's the opposite way. So C. elegans has about 19,000 genes and Drosophila melanogaster only has about 13,000 genes. So now you have to ask yourself, "How much more complicated am I than a fruit fly or a roundworm?" So how many genes do we expect humans to have? Or humans, mice, and even a mustard plant, so something that might be a little more complicated than a roundworm. So, given that a roundworm has 19,000 genes, before the human genome was sequenced people were thinking that humans would have around 100,000 genes. The big surprise has been that in fact all of these organisms have about the same number of genes, and they're all around 25,000. So it's really not a very big number. And just by this count, we're no more complicated than twice a fruit fly or about 1.3 times a roundworm. That's sort of troubling. So how can we get away with that? Well, the way we get away with this apparent lack of complexity is alternative splicing. So eukaryotes, in addition to having split genes, defy the central dogma of biology, the original central dogma of biology. And that is that we may have one gene, but one gene can make many, many proteins.
And so for example, here is one gene, where if all of the exons are spliced in, then it makes one protein, but if the red exon is left out, it makes a different protein. And then here, the blue exon, exon 4, is left out and that makes yet a third protein. So on this particular gene, there are three proteins coming from one gene. And we now know from recent deep-sequencing efforts that about 95% of all human genes exhibit alternative splicing. So that means that our proteome, the number of proteins that we have, is much, much larger than our genome. There are many different kinds of alternative splicing. So, there can be alternative promoters, which is the beginning of the gene. There can be alternative poly-A sites. So that's the 3 prime end of the gene. Alternative 5 prime splice sites. We'll talk about what a 5 prime splice site is in a minute, but it's the beginning of an intron. Alternative 3 prime splice sites. And then there are exons called cassette exons that are either spliced in or spliced out. That's the example I was showing you a minute ago. And there are mutually exclusive exons, where either one exon is put in or the other but not both. And then of course the simplest form of alternative splicing is to not splice at all. So you can splice or not splice and that would be a retained intron. So there are many different types of alternative splicing. The most common type of alternative splicing is the cassette exon, where an exon is either spliced in or spliced out. Now how many different proteins can be made from one gene? So this is the alpha tropomyosin gene from rat and these are some of the different splice forms of the alpha tropomyosin gene. And so you can see that there are many different splice forms in fibroblasts. Those are essentially undifferentiated cells. Then here's other isoforms in the brain and the smooth muscle. And so one of the important things about splicing is that it can be developmentally and tissue-specifically controlled. 
And so one gene in one tissue might make one protein, but in another tissue it makes a very different protein. And so, again, that's how we add complexity. So just how complex can it get? Well, here's the current record-holder. So this is the Drosophila DSCAM gene, which is involved in axonal guidance in the brain. And Drosophila DSCAM has four regions of mutually exclusive exons. So there's one here, one here, one here, and one there. This region has 48, this one has 33, there's two over there, and there's 12 back here. So if you do the math, there are over 38,000 different possible spliced isoforms of the DSCAM gene. And to the best of our knowledge, all of these isoforms can be made. So that means that this one gene in Drosophila can make three times as many different proteins as there are genes in the Drosophila genome. So it is very likely that in higher eukaryotes, such as you and me, our proteome is well over hundreds of thousands to millions of different proteins. So with that thought in mind, let's look back at our complexity problem. But this time, instead of looking at genes, let's look at how many introns each organism has and then how that scales with complexity. So E. coli has no introns. Prokaryotes do not have these types of introns. S. cerevisiae, or budding yeast, has a few introns. It only has about 250 introns, and as we go up the evolutionary ladder, the roundworm has about 99,000 introns. Again, more than the fruit fly, simply because it has more genes. But as we go up now to humans and to mice, you can see that the number of introns is going up dramatically. And because the number of proteins you can make scales, let's say, exponentially with the number of introns, you can imagine that our proteomes can be much more complicated than those of the other organisms. Now we want to talk about the problem of how the cell knows where the introns are. In other words, how does it know where to splice?
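The "do the math" step for DSCAM above can be spelled out. Because exactly one exon is chosen from each mutually exclusive region, the number of possible isoforms is the product of the region sizes quoted in the talk (48, 33, 2 and 12). The sketch below is only an illustration; the variable and function names are made up for this example, and the Drosophila gene count is the approximate figure of 13,000 quoted earlier.

```python
from math import prod

# Number of mutually exclusive alternatives in each region, as quoted in the talk.
DSCAM_REGION_SIZES = [12, 48, 33, 2]

# Approximate number of protein-coding genes in Drosophila, as quoted earlier.
DROSOPHILA_GENE_COUNT = 13_000


def isoform_count(region_sizes: list[int]) -> int:
    """One exon is picked per region, so the choices multiply."""
    return prod(region_sizes)


total_isoforms = isoform_count(DSCAM_REGION_SIZES)
ratio_to_gene_count = total_isoforms / DROSOPHILA_GENE_COUNT

print(f"possible DSCAM isoforms: {total_isoforms}")
print(f"isoforms per Drosophila gene: {ratio_to_gene_count:.1f}")
```

The product is 38,016, which matches "over 38,000", and it is a little under three times the approximate Drosophila gene count.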
And in this sense, a cell faces very much the same problem as a film director who has shot thousands of hours of film, is now overwhelmed with film, and needs to make a final movie. And so traditionally, with film before the digital age, the film director would go through the film frame by frame, find exactly the place where he or she wanted to make a cut, and then there would be a splicing machine that would literally cut and splice the film back together to make the final version of the movie. And in cells, that's exactly what happens. Because of course introns are made up of individual nucleotides, and the nucleotides are very much like the individual frames in a film. So introns have the following structure. The 5 prime end, or the beginning of the intron, is many times called a splice donor, but splicers tend to call it the 5 prime splice site. And it, in 99% of genes, is signified by a GT as the first two nucleotides of the intron. And then much further downstream, actually close to the other end of the intron, is an adenosine called the branch site, and we'll be talking about what that branch site does in part II of my lecture. And then at the very end of the intron is the AG, which is the last two nucleotides of the intron at the 3 prime splice site, or the "splice acceptor." Now I've told you that some introns are up to 400,000 nucleotides long, and certainly this amount of information content is not enough to even signify a small intron. So there must be additional sequences. And so in addition to these universally conserved nucleotides, there are consensus sequences in all of the introns. So here are the consensus sequences for budding yeast, and these are called logos, motif logos, and the height of the letter tells you how conserved that site is. So for example, here you can see the GT at the 5 prime splice site, and so in addition to the GT, there are several other nucleotides that are highly conserved and a few that are less conserved.
At the branch site in yeast, there is a very highly conserved TACTAAC sequence, and then upstream of the 3 prime splice site there's not too much conservation until you get right to the 3 prime splice site. So there's additional information in these so-called consensus sequences. Surprisingly, in humans (and if you remember, yeast only have about 250 introns and they generally are short, all less than a couple hundred nucleotides, whereas human introns are much, much longer) the human consensus sequences are less conserved. You can already see that, because there are lots more letters down here than there are up here; particularly for the branch site, there is much less conservation. And then for the 3 prime splice site, there's a little more information because there's this so-called pyrimidine tract here, which is a bunch of C's and T's upstream of the AG. Now we can quantify the amount of information that's in these sequences by using the terminology that's used in computer science, and that is bits of information. So how much information content do these things have? And so Chris Burge and coworkers did a study on a number of very short introns, the shortest introns in humans, and also the introns in yeast. And they derived these consensus sequences, and these consensus sequences have this much information in them. So you can see already, just by looking at the numbers of bits, that the human introns have less information in their consensus sequences than the yeast introns, even though there are many more human introns and they're much longer.
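To make "bits of information" concrete, here is a minimal sketch of one common way to score a consensus sequence: at each position, take 2 bits (the maximum for four equally likely nucleotides) minus the Shannon entropy of the observed nucleotide frequencies, then add up the positions. This is an assumption for illustration: it uses a uniform background and no small-sample correction, and the study mentioned above may have calculated its numbers differently. The frequencies below are hypothetical, not data from that study.

```python
from math import log2


def position_information_bits(freqs: dict[str, float]) -> float:
    """Information at one position: 2 bits minus the Shannon entropy.

    Assumes a uniform background over A, C, G, T (an illustrative choice).
    """
    entropy = -sum(p * log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy


def motif_information_bits(columns: list[dict[str, float]]) -> float:
    """Total information of a motif, summing positions as if independent."""
    return sum(position_information_bits(column) for column in columns)


# Hypothetical nucleotide frequencies, for illustration only.
invariant_g = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}      # always G
invariant_t = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0}      # always T
purine_only = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}      # A or G equally
unconstrained = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

gt_bits = motif_information_bits([invariant_g, invariant_t])
print(f"invariant GT dinucleotide: {gt_bits:.1f} bits")
print(f"purine-only position: {position_information_bits(purine_only):.1f} bits")
print(f"unconstrained position: {position_information_bits(unconstrained):.1f} bits")
```

Under this scoring an invariant position contributes 2 bits and an unconstrained one contributes nothing, which is why a taller, more conserved logo corresponds to more bits.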
Now if we add up all of this information, the yeast intron sequences clearly have more information in their consensus sequences than the human ones. But I can tell you that what Chris Burge and his colleagues estimated in this paper was that for humans, in order to uniquely specify an intron, even of the short size that they were looking at, so below a thousand nucleotides, you need about 37 bits of information, and yet only 23 bits of information are contained in the splice site consensus sequences. So where is that other information coming from? So from the same paper, here is a graph showing the relative contributions of intron features to intron recognition in different organisms, and these are the same organisms that we looked at before in terms of gene count or intron count and complexity. And what you can see is that in S. cerevisiae all of the information is coming from the 5 prime splice site, the 3 prime splice site, and the branch site consensus sequences, but once we get up to humans, only about half of the information needed to specify an intron is coming from those three consensus sequences. Some of the other information is the length. Some is the composition of introns: they're somewhat different in their base composition than exons. But then there's this big grey quadrant, and one has to ask, "What is this other information that human introns require?" And a clue to that came very early, soon after introns were discovered. So this was a paper in 1983. Introns were discovered in 1977. And the Maniatis and Orkin labs were mapping out mutations in the beta-globin gene. So mutations in beta-globin cause a disease called beta-thalassemia. It's a very debilitating blood disease because you do not make enough hemoglobin. And what they found was that many of the mutations that cause beta-thalassemia were mutations in the 5 prime splice site consensus sequence of the globin gene.
But instead of completely killing the splicing of the beta-globin gene, what these mutations did was activate cryptic 5 prime splice sites in the nearby area. So that means something was telling the splicing machinery that there's supposed to be a 5 prime splice site in this region, and even though this consensus sequence was mutated, there were other sequences nearby that could serve as a viable 5 prime splice site. So what is that other information? To think about what that information might be, let's take a look back at that model of the dystrophin gene that we looked at. So if you'll remember, here's my long 99-foot rope, and when your eyes are looking at this rope, probably what they're drawn to are the exons, the little pieces of color, and not the rope itself. And so you can actually think about a human gene as islands of exons in the middle of oceans of introns. And so very much like a map of the Pacific Ocean, if you're going to identify the islands, you're going to look at the islands, not at the ends of the oceans. You're going to look at where the islands begin and end, not where the ocean begins and ends. So let's think about how that might be applied to a human gene. What we now understand is that the way that introns and exons are recognized in human genes is through what we call exon definition. And that is that even though the introns are of highly variable length, going from a few hundred nucleotides up to tens of thousands or hundreds of thousands of nucleotides, the exons tend to be very much uniform in their length. The average exon, as we said, was 123 nucleotides, but they vary in length from about 25 (there are a few that are shorter; there are exceptions to every rule), and in general 25-300 nucleotides is the size of an internal exon in a human gene.
And so what the human splicing machinery looks for is a good match to the consensus branch site and 3 prime splice site, followed within 25-300 nucleotides by a match to the 5 prime splice site. And so this then defines an exon. But even with that distance information between a 3 prime splice site and a downstream 5 prime splice site, that's still not enough information to uniquely identify an exon, and it's also not enough information to allow for alternative splicing of that exon. So in addition to the consensus sequences, there are also other sequences called exonic splicing enhancers or splicing silencers. So these enhancers or silencers can be either in the exon or they can be in introns. And if they're in introns, they're called intronic splicing enhancers or intronic splicing silencers. So that would be the ESE, ISE, ESS, and ISS sequences. Now these sequences tend to be clustered around exons, and they are recognized, in general, by two different kinds of proteins. And these proteins are called the SR and the hnRNP proteins. Now SR proteins are RNA-binding proteins that have a C-terminal domain that is rich in arginine/serine dipeptides. And hnRNP proteins are another set of RNA-binding proteins. They do not have this RS region. But in general, the SR proteins tend to recognize splicing enhancers, so they tend to increase splicing, and hnRNP proteins tend to recognize splicing silencers, so they tend to inhibit splicing. And it's the balance of the SR and hnRNP proteins that are in a cell that determines the splice pattern for any particular RNA. So you can think of splicing decisions as being made very much like a committee decision. In addition to the SR proteins and the hnRNP proteins, there's the core splicing machinery, and we're going to talk about that in part II of my talk. But the core splicing machinery is what's responsible for recognizing the consensus sequences at the splice sites.
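The distance rule behind exon definition described above (a 3 prime splice site followed within 25-300 nucleotides by a 5 prime splice site) can be sketched as a naive scan over a DNA string. This is a deliberately crude illustration, not how the splicing machinery or any real gene-finding program works: it uses only the invariant AG and GT dinucleotides, ignores the branch site, the rest of the consensus sequences, and the enhancers and silencers, and so it will over-predict on real sequence, which is exactly the point made above that distance alone is not enough. The function name and the toy sequences are made up for this example.

```python
MIN_EXON_NT = 25    # lower bound on internal exon length quoted in the talk
MAX_EXON_NT = 300   # upper bound on internal exon length quoted in the talk


def candidate_exons(seq: str,
                    min_len: int = MIN_EXON_NT,
                    max_len: int = MAX_EXON_NT) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 0-based and half-open, of candidate exons.

    A candidate starts right after an AG (end of the upstream intron) and
    ends right before a GT (start of the downstream intron), with a length
    inside the allowed window.
    """
    seq = seq.upper()
    # Exon starts: the position just past each AG dinucleotide.
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # Exon ends: the position of each GT dinucleotide.
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if min_len <= e - s <= max_len]


# Toy sequences (hypothetical): intron tail + exon + start of next intron.
toy_exon = "ACTC" * 10                              # 40 nt, no AG or GT inside
toy_gene = "TTTTTTAG" + toy_exon + "GTAAAA"
too_short_gene = "TTTTTTAG" + "ACTC" * 3 + "GTAAAA"  # 12-nt exon

hits = candidate_exons(toy_gene)
print(hits)
print([toy_gene[start:end] == toy_exon for start, end in hits])
print(candidate_exons(too_short_gene))
```

On the toy input the scan recovers the single 40-nucleotide exon, and it rejects the 12-nucleotide one because it falls below the length window.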
And then in addition, at the same time, there would be these SR proteins and hnRNP proteins, and it is the conjunction of all these different factors that then makes the decision as to whether or not an exon is going to be utilized and made into mRNA. So one thing I'd like to point out about these splicing enhancer and silencer sequences is that mutations in those sequences can lead to human genetic disease. So about 20% of human genetic disease is caused by mistakes in splicing. And these mistakes can occur if a mutation occurs at one of the consensus sequences (we've already seen that for the beta-thalassemia genes) or when a mutation occurs in one of the silencer or enhancer sequences. So a common mistake is made when scientists are hunting for the mutations that cause a particular disease: if a mutation is found in a coding exon and that mutation changes the amino acid that's coded by that exon, it's often assumed that that mutation is having its deleterious effects by changing an amino acid in the protein. But in fact, many times the mutation can actually change the splicing pattern. So for example, it can leave out a whole exon or change a splice site, and that itself would be much more deleterious. So it's important to think about that if you're trying to hunt for mutations that cause a particular disorder in either humans or some other model organism. And it's also important to note that we can't ignore mutations in introns. Not only are mutations at the consensus sequences going to be deleterious for splicing, but mutations in these intronic splicing silencers and enhancers can also change the splicing pattern. So in the future, we're going to have to be sequencing introns as well as exons when we're looking for mutations in human disease. So here are some examples of those kinds of mutations. At the top here, we have the SMN gene. Now all of us have two copies of this gene: there's SMN1 and SMN2.
If you have a mutation in SMN1 such that it is disrupted, this is a problem, because the SMN2 gene is identical to the SMN1 gene except that it has a T mutation in an exonic splicing enhancer, such that in the SMN1 gene exon 7 is always included but in SMN2 exon 7 is included only 20% of the time. So if you're lacking the SMN1 gene, then you do not make enough of the SMN protein, and children who do not make enough of this protein have problems with development of their motor neurons and have spinal muscular atrophy. Here's another example. In dystrophin, there is a mutation which causes a stop codon. So one might assume that the problem is that it makes a truncated protein. But in fact this stop mutation, this nonsense mutation, is in an exonic splicing silencer, or it becomes an exonic splicing silencer, such that that exon is left out. And it still makes the protein, but the protein is lacking that exon. And then this is a third example. Patients that have frontotemporal dementia often have a mutation in the tau gene. And that mutation causes an exonic splicing enhancer to be stronger, such that they're making more of the version of the protein that has exon 10 included and less of the version that has exon 10 excluded. And this 50/50 ratio is very important to have normal amounts of the tau protein. So in this case, the patients are overexpressing this truncated form of the protein and that leads to frontotemporal dementia. So finally, I'd like to leave you with my take-home messages. The first is that eukaryotic genes contain introns. Not every eukaryotic gene contains an intron, but certainly in humans the vast majority contain introns. These introns facilitate the evolution of new genes, and by allowing for alternative splicing, they allow us to greatly increase our proteome complexity. And higher eukaryotes utilize exon definition to define splice sites because the exons are like islands in oceans of intron.
And then finally, many human hereditary disorders are caused by mutations that affect exon inclusion or splice site choice. So this ends this part of the lecture. If you're interested in learning something about the machinery that removes the introns, load up my next lecture, which is on spliceosome structure and dynamics. Thank you.