Hi. My name is Melissa Moore
and I'm from the Howard Hughes Medical Institute
and the University of Massachusetts Medical School
and I'm here today to tell you about split genes
and RNA splicing
Most of you are familiar with the central dogma of biology
which was elaborated by Sir Francis Crick in 1956
and what the central dogma says is that DNA is copied into
or transcribed into RNA and then that RNA is translated
into a different language and that is the language of proteins.
So one of the central tenets of the central dogma of biology
is that one gene encodes one protein.
And this is very much true in bacteria where if one
breaks open a bacterial cell and spreads out the DNA and RNA
you can see very clearly that here is the DNA
and as the DNA is being transcribed into RNA, it's
being joined by ribosomes. So these big black blobs here
are polyribosomes and those are making protein.
And so the RNA is copied or translated directly into proteins.
So here I'm showing a eukaryotic cell next to a prokaryotic cell.
So these are bacteria next to a white blood cell.
And you can see that the white blood cell is much
much larger than the bacteria. And so not only
are eukaryotic cells larger than bacteria but they also
are more complicated internally.
So this is the bacterial structure down here and this would be
the eukaryotic internal structure and you can see
that in the eukaryote, the DNA is in the nucleus
whereas the ribosomes are out here in the cytoplasm
as you can see particularly on the rough ER here.
And so the big difference for a eukaryotic cell is that
the DNA is not directly accessed by the ribosomes.
Also in the eukaryotic cell, this is a gene that is being transcribed
from this direction to this direction. And so you can see
that the RNA coming off of the DNA is getting longer and longer
as it's being transcribed in this direction. But at some point
there are these loops that form. And in fact, these loops
are introns that get spliced out. So unlike prokaryotic genes
eukaryotic genes are split in nature. They have segments
of them that need to be spliced out before they can
be used to make proteins. So this slide shows the
current-day view of eukaryotic gene expression.
And that is that eukaryotic genes are split, meaning that
they contain sequences that are not contained within
the final mRNA, and which are not translated into
protein sequence. Those sequences are called introns
and they are represented by the white line here, and
the exons are the little colored boxes. Those are the
expressed regions. When a gene is turned on and
transcribed, the entire gene is transcribed first into
a pre-messenger RNA or pre-mRNA and that pre-mRNA
undergoes several steps of processing.
First it is capped with a 7-methylguanosine cap.
At the 3' end it is cleaved and then a poly-A tail is added.
And then in the middle, these intron sequences are literally
spliced out. So there is a molecular scissors and tape
that does the job of pre-mRNA splicing. After all of
the processing is done, the mRNA migrates to the nuclear
envelope where it is exported and then translated
into proteins. This shows the structure of a typical
human gene. So the typical human gene has 23,000
base pairs and 7 introns. Now I'm using a median here
because if I use the average, it would be very much
thrown off by some of the very large genes that
we're going to talk about in a little bit. The other thing
that you'll notice is that the typical human gene
has a median intron length of over 10 times that of
the exon lengths. So that means that whenever a
human gene is transcribed, 90 to 95% of the RNA
is immediately spliced out and thrown away. And that
seems rather wasteful. So we'll be talking about
why it is that we have these intronic regions.
What good are they if we're wasting that much RNA.
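The 90 to 95% figure can be sanity-checked with a rough sketch. It assumes a gene with 7 introns and therefore 8 exons, all exons the same length, and every intron a fixed multiple of that length. The function name `intron_fraction` and the ratios 10 and 20 are illustrative assumptions, not values taken from the slide.

```python
def intron_fraction(n_introns: int, intron_to_exon_ratio: float) -> float:
    """Fraction of a pre-mRNA that is intronic, assuming every exon has the
    same length and every intron is `intron_to_exon_ratio` times that length."""
    n_exons = n_introns + 1                       # introns sit between exons
    intron_total = n_introns * intron_to_exon_ratio
    return intron_total / (intron_total + n_exons)


# Ratios of 10x and 20x bracket "over 10 times" for a 7-intron gene.
for ratio in (10, 20):
    print(f"ratio {ratio}: {intron_fraction(7, ratio):.1%} intronic")
```

Under these simplifying assumptions the intronic share comes out at roughly 90% for a 10-fold ratio and roughly 95% for a 20-fold ratio, which is in line with the range quoted above.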
But before we talk about that, I want to tell you about
a particular gargantuan gene and that is the dystrophin gene.
So dystrophin encodes a protein that is necessary
for your muscles and mutations in this gene are one
of the causes of muscular dystrophy. The DMD gene
is the second largest gene in the human genome. It's
2.2 million base pairs long. It has 79 exons and 78 introns,
and one of those introns is 400,000 nucleotides long.
Now it's really hard for you with just me saying this
to really get a sense of the scale of this thing.
So next I'm going to show you a little movie so that
you can really see just how big this gene is.
Here we are with our scaled model of the dystrophin gene
that is represented by this rope. And I want you to
look at the end of this rope. So the little colored tape
marks here are the exons. And the white rope is the
intron. And you can see that here is the next exon.
So the introns are much, much longer than the exons.
Now, just to give you a sense of how big this gene is
I'm going to pretend I am RNA polymerase. And I
start transcribing this gene when you get up in the morning.
So this is your time point of breakfast here. So here I go.
Here's RNA polymerase. I'm transcribing this gene.
It's not evening. It's still morning. Here's our mid-morning snack.
So we're still going. Polymerase is incredibly processive on this gene.
We haven't even gone a million base pairs yet.
Now we're about halfway through the gene so this is
lunchtime. The polymerase is still transcribing this gene.
We're still going, still going. Here we are at mid-afternoon.
We're getting into dinnertime now. And we've been
going for about thirteen or fourteen hours now. Finally
when you're about ready to go to bed after 16 hours
we get to the very end of the gene, the last exon.
It took 16 hours for polymerase to transcribe this
entire gene of dystrophin.
Alright, so you've just seen how long dystrophin
RNA is. And I can tell you that was a 99.4-foot rope.
Now once all of those introns are removed, to scale
this is the size of the messenger RNA. This messenger
RNA is very long. It is a 17,000 base messenger RNA.
But it is less than 1% of the original RNA that was transcribed.
You know, really pretty amazing.
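The numbers in the rope demonstration can be put together in a short calculation. This is only a sketch using the figures quoted in the talk (2.2 million base pairs, 16 hours, a 17,000-base mRNA); the constant and function names are made up for illustration.

```python
GENE_LENGTH_BP = 2_200_000    # dystrophin gene length quoted in the talk
TRANSCRIPTION_HOURS = 16      # transcription time quoted in the talk
MRNA_LENGTH_NT = 17_000       # spliced mRNA length quoted in the talk


def elongation_rate_nt_per_sec(gene_bp: int, hours: float) -> float:
    """Average polymerase speed implied by gene length and transcription time."""
    return gene_bp / (hours * 3600)


def retained_fraction(mrna_nt: int, gene_bp: int) -> float:
    """Share of the primary transcript that survives splicing."""
    return mrna_nt / gene_bp


rate = elongation_rate_nt_per_sec(GENE_LENGTH_BP, TRANSCRIPTION_HOURS)
kept = retained_fraction(MRNA_LENGTH_NT, GENE_LENGTH_BP)
print(f"{rate:.0f} nt/s, {kept:.2%} of the transcript retained")
```

The implied average speed is a few dozen nucleotides per second, and the retained fraction is below 1%, consistent with the statement above.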
So now we want to talk about the question of why
do we have all of these introns. It's really quite
amazing that we have so much apparent junk DNA
that is copied into RNA and then immediately thrown away.
What good are these introns? One thing that introns
do for us, is to allow eukaryotic cells to evolve new
genes readily. And they can evolve new genes by
exon duplication. So let's take for example,
the fibronectin protein. You can see here, the fibronectin
protein is made up of many repeats of different domains.
These domains are also found in other proteins.
So for example, the yellow domain is found in cell
surface receptors and other extracellular matrix proteins.
The blue domain is found in blood coagulation proteins
and the red domain is also found in TPA, or tissue plasminogen activator.
So the way the fibronectin was created was to take
all these different domains and hook them up and
bring them in from other places. And the way that
that can be done is that you see if you look now at
this diagram of the fibronectin gene, you can see that
each of these domains consists of either one or two exons.
So the advantage of having introns in this case is that
by non-homologous recombination big chunks of the
genome can be taken from one place to another
and they don't necessarily have to be hooked up perfectly
because the splicing machinery hooks them up,
whereas in bacteria, where there are no introns,
if there is a non-homologous recombination event to
make a new protein, then it has to be perfectly in frame
and it is much harder to do because there's not the
genetic space in which to do the recombination.
So that is one real advantage of having introns.
Now the second major advantage we will discuss by
looking at the gene number versus complexity problem.
So let's consider how many genes different kinds of
organisms have. So here for example, is E. coli, a bacterium,
and S. cerevisiae--budding yeast, or the yeast that is
used to make bread or brew beer. E. coli is a bacterium
in your gut. Now we've already shown, that bacteria
are much simpler cells than eukaryotes, so you might
expect E. coli not to have as many genes as S. cerevisiae
and that's true. So E. coli has about 3,200 genes and
S. cerevisiae has about 6,000 genes. And by genes
here, I'm talking about protein-coding genes.
Now if we go up the evolutionary ladder, let's consider
two other model organisms. C. elegans or the roundworm
and Drosophila melanogaster, the fruit fly. Now on the
surface, it might seem that Drosophila is the more
complicated organism, so it would have more genes
than the roundworm, but in fact, it's the opposite way.
So C. elegans has about 19,000 genes and
Drosophila melanogaster only has about 13,000 genes.
So now you have to ask yourself, "How much more
complicated am I than a fruit fly or a roundworm?"
So how many genes do we expect humans to have?
Or humans, mice, and even a mustard plant, so
something that might be a little more complicated
than a roundworm. So, if a roundworm has 19,000 genes
before the human genome was sequenced, people
were thinking that humans would have around
100,000 genes. The big surprise has come that in fact
all of these organisms have about the same number
of genes and they're all around 25,000. So it's really
not a very big number. And just by this count,
we're no more complicated than twice a fruit fly or
about 1.3 times a roundworm. That's sort of troubling.
So how can we get away with that? Well the way we
get away with this apparent lack of complexity is
alternative splicing. So eukaryotes in addition to
having split genes, they defy the central dogma of
biology, the original central dogma of biology. And
that is that we may have one gene, but one gene can
make many, many proteins. And so for example, here
is one gene, where if all of the exons are spliced in,
then it makes one protein, but if the red exon is left
out, it makes a different protein. And then here, the blue
exon, exon 4, is left out and that makes yet a third protein.
So on this particular gene, there are three proteins
coming from one gene. And we now know from recent
deep-sequencing efforts that about 95% of all human
genes exhibit alternative splicing. So that means that
our proteome, the number of proteins that we have,
is much, much larger than our genome.
There are many different kinds of alternative splicing.
So, there can be alternative promoters, which is the
beginning of the gene. There can be alternative poly-A sites.
So that's the 3 prime end of the gene. Alternative 5 prime
splice sites. We'll talk about what a 5 prime splice site
is in a minute, but it's the beginning of an intron.
Alternative 3 prime splice sites. And then there are exons
called cassette exons that are either spliced in or
spliced out. That's the example I was showing you a
minute ago. And there are mutually exclusive exons,
where either one exon is put in or the other but not both.
And then of course the simplest form of alternative
splicing is to not splice at all. So you can splice or not
splice and that would be a retained intron. So there
are many different types of alternative splicing. The
most common type of alternative splicing is the
cassette exon, where an exon is either spliced in or spliced out.
Now how many different proteins can be made from
one gene? So this is the alpha tropomyosin gene from rat
and these are some of the different splice forms of
the alpha tropomyosin gene. And so you can see
that there are many different splice forms in fibroblasts.
Those are essentially undifferentiated cells.
Then here's other isoforms in the brain and the smooth muscle.
And so one of the important things about splicing
is that it can be developmentally and tissue-specifically controlled.
And so one gene in one tissue might make one protein
but in another tissue, it makes a very different protein.
And so, again that's how we add complexity.
So just how complex can it get?
Well, here's the current record-holder. So this is the
Drosophila DSCAM gene, which is involved in axonal
guidance in the brain. And Drosophila DSCAM has
four regions of mutually exclusive exons. So there's one here,
one here, one here, and one there. This region has 48, this one has
33. There are two over there, and there are 12 back here.
So if you do the math, there are over 38,000
different possible spliced isoforms of the DSCAM gene.
And to the best of our knowledge, all of these isoforms
can be made. So that means that this one gene in Drosophila
can make three times as many different proteins
as there are genes in the Drosophila genome.
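"Doing the math" for DSCAM amounts to multiplying the number of alternatives in each mutually exclusive cluster, because exactly one exon is chosen per cluster. The sketch below uses the cluster sizes quoted in the talk (12, 48, 33 and 2) and the approximate Drosophila gene count given earlier; the names are illustrative.

```python
from math import prod

# Alternatives in each mutually exclusive exon cluster, as quoted in the talk.
DSCAM_CLUSTERS = [12, 48, 33, 2]
DROSOPHILA_GENES = 13_000  # approximate protein-coding gene count from the talk


def isoform_count(cluster_sizes):
    """One exon is chosen per cluster, so the independent choices multiply."""
    return prod(cluster_sizes)


total = isoform_count(DSCAM_CLUSTERS)
print(total, round(total / DROSOPHILA_GENES, 1))
```

This gives 38,016 possible isoforms, a little under three times the number of genes in the fly genome.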
So it is very likely that in higher eukaryotes, such as you and me,
our proteome is well over hundreds of thousands to
millions of different proteins. So just with that thought in mind,
now let's look back at our complexity problem.
But this time, instead of looking at genes, let's look
at how many introns each organism has and then how that
scales with complexity. So in E. coli, it has no introns.
Prokaryotes do not have these types of introns.
S. cerevisiae, or budding yeast, has a few introns.
It only has about 250 introns and as we go up the
evolutionary ladder, the roundworm has about 99,000 introns.
Again, more than the fruit fly simply because it has more genes.
But as we go up now to humans and to mice, you can see
that the number of introns is going up dramatically.
And because the number of proteins you can make
scales let's say exponentially with the number of introns
you can imagine that our proteomes can be
much more complicated than those of the other organisms.
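One way to see why the count grows so quickly is to enumerate include/skip choices for cassette exons. This is a deliberately simplified sketch: it assumes every cassette exon is chosen independently of the others, which real genes need not obey, and the function name `cassette_isoforms` is hypothetical.

```python
from itertools import product


def cassette_isoforms(cassette_exons):
    """Enumerate every include/skip combination of the given cassette exons."""
    isoforms = []
    for choices in product((True, False), repeat=len(cassette_exons)):
        # Keep only the exons whose flag is True in this combination.
        isoforms.append([e for e, keep in zip(cassette_exons, choices) if keep])
    return isoforms


for n in (1, 2, 5, 10):
    exons = [f"E{i}" for i in range(1, n + 1)]
    print(n, "cassette exons ->", len(cassette_isoforms(exons)), "isoforms")
```

Each additional independent cassette exon doubles the number of possible mRNAs, so ten of them already allow 1,024 combinations.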
Now we want to talk about the problem of how does
the cell know where the introns are. In other words,
how does it know where to splice? And in this sense,
a cell faces very much the same problem as a film director,
who has shot thousands of hours of film and yet is
now overwhelmed with film and needs to then make
a final movie. And so traditionally with film before the
digital age, the film director would go through the film
frame by frame, find exactly the place where he or she
wanted to make a cut and then there would be a
splicing machine that would literally cut and splice the
film back together to make the final version of the movie.
And in cells, that's exactly what happens. Because of
course introns are made up of individual nucleotides
and the nucleotides are very much like the individual frames
in a film. So introns have the following structure.
The 5 prime end, or the beginning of the intron, is
many times called a splice donor but splicers tend to
call it the 5 prime splice site. And it, in 99% of genes,
is signified by a GT as the first two nucleotides of the intron.
And then much further downstream, actually close to
the other end of the intron is an adenosine called the
branch site, and we'll be talking about what that
branch site does in part II of my lecture. And then
at the very end of the intron is the AG, which is the
last two nucleotides of the intron at the 3 prime splice site
or the "splice acceptor." Now I've told you that some
introns are up to 400,000 nucleotides long and certainly
this amount of information content is not enough
to even signify a small intron. So there must be
additional sequences. And so in addition to these
universally conserved nucleotides, there are consensus
sequences in all of the introns. So here are the
consensus sequences for budding yeast and these
are called logos, motif logos, and the height of the letter
tells you how conserved that site is. So for example,
here you can see the GT at the 5 prime splice site
and so in addition to the GT, there are several other
nucleotides that are highly conserved and a few that
are less conserved. At the branch site in yeast, there
is a very highly conserved TACTAAC sequence
and then upstream of the 3 prime splice site there's
not too much conservation until you get right to the
3 prime splice site. So there's additional information
in these so-called consensus sequences. Surprisingly
in humans--so if you remember yeast only have about
250 introns and they generally are short--they're all
less than a couple hundred nucleotides whereas
human introns are much, much longer. And the
surprising thing is that the human consensus sequences
are less conserved. So you can already see that
because you can just see that there are lots more letters
down here than there are up here particularly for the
branch site much less conservation. And then for the
3 prime splice site, there's a little more information
because there's this so-called pyrimidine tract here
that is a bunch of C's and T's upstream of the AG.
Now we can quantify the amount of information that's
in these sequences by using the terminology that's
used in computer science and that is bits of information.
So how much information content do these things have.
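As a rough illustration of what "bits" means for a sequence motif, the sketch below computes the quantity a sequence logo displays: at each aligned position, 2 bits minus the Shannon entropy of the observed nucleotides. It assumes a uniform background of A, C, G and T and applies no small-sample correction, so it is not necessarily the exact procedure used in the study mentioned next. The alignment `toy_sites` is invented for the example.

```python
from collections import Counter
from math import log2


def position_information(column):
    """Bits at one aligned position: 2 minus the Shannon entropy of A/C/G/T."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return 2.0 - entropy


def motif_information(sequences):
    """Sum per-position information over equal-length aligned sequences."""
    return sum(position_information(col) for col in zip(*sequences))


# Made-up 5' splice site alignment, for illustration only.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTATGT"]
print(round(motif_information(toy_sites), 2), "bits")
```

A position that is always the same nucleotide contributes the full 2 bits, and a position where all four nucleotides are equally common contributes none, which is why tall letters in a logo mark highly conserved sites.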
And so Chris Burge and coworkers did a study on
a number of very short introns, the shortest introns
in humans and also the introns in yeast. And they
derived these consensus sequences, and these consensus sequences
have this much information in them. So you can see
already just by looking at the numbers of bits that
the human introns have less information in their consensus sequences
than the yeast introns, even though the human introns--
there are many more of them and they're much longer.
Now if we add up all of this information, the yeast intron
sequences clearly have more information in their consensus sequences
than the humans but I can tell you what Chris Burge
and his colleagues estimated in this paper was that
for humans, in order to uniquely specify an intron,
even of this short size that they were looking at
so below a thousand nucleotides, you need about
37 bits of information and yet only 23 bits of information
are contained in the splice site consensus sequences.
So where is that other information coming from?
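The size of that gap can be made concrete with a small calculation. It relies on the standard reading of a bit as a halving of the number of candidates, and it uses only the two figures quoted above; the function name `ambiguity_factor` is illustrative.

```python
BITS_NEEDED = 37        # to uniquely specify a short human intron, per the talk
BITS_IN_CONSENSUS = 23  # carried by the splice site consensus sequences


def ambiguity_factor(bits_needed: int, bits_available: int) -> int:
    """Each missing bit doubles the number of candidates left undistinguished."""
    return 2 ** max(bits_needed - bits_available, 0)


missing = BITS_NEEDED - BITS_IN_CONSENSUS
print(missing, "bits missing ->", ambiguity_factor(BITS_NEEDED, BITS_IN_CONSENSUS))
```

On this reading, a shortfall of 14 bits means the consensus sequences alone leave on the order of 16,000 equally plausible alternatives for every real intron.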
So from the same paper, here is a graph showing
the relative contributions of intron features to
intron recognition in different organisms and these are
the same organisms that we looked at before in terms of
gene count or intron count and complexity.
And what you can see is that in S. cerevisiae all of the
information is coming from the 5 prime splice site,
the 3 prime splice site and the branch site consensus sequences
but once we get up to humans, only about half of
the information needed to specify an intron in humans
is coming from those three consensus sequences.
Some of the other information is the length. The composition
of introns. They're somewhat different in their base composition
than exons. But then there's this big grey quadrant
and one has to ask, "What is this other information that
human introns require?" And a clue to that came very early
soon after introns were discovered. So this was a paper
in 1983. Introns were discovered in 1977.
And the Maniatis and Orkin labs were mapping out
mutations in the beta-globin gene. So mutations
in beta-globin cause a disease called beta-thalassemia.
It's a very debilitating blood disease because you
do not make enough hemoglobin. And what they found was
that many of the mutations that cause beta-thalassemia
were mutations in the 5 prime splice site consensus sequence
of the globin gene. But instead of completely killing
the splicing of the beta-globin gene, what these mutations did
was they activated cryptic 5 prime splice sites in the
nearby area. So that means something was telling the
splicing machinery that there's supposed to be a
5 prime splice site in this region and even though this consensus
sequence was mutated, there were other sequences
nearby that could serve as a viable 5 prime splice site.
So what is that other information?
So to think about what that information might be
let's take a look back at that model of the dystrophin gene
that we looked at. So if you'll remember, here's my long
99-foot rope and when your eyes are looking at this rope,
probably what they're drawn to are the exons, the
little pieces of color, and not the rope itself.
And so you can actually think about a human gene
as islands of exons in the middle of oceans of introns.
And so very much like a map of the Pacific Ocean
if you're going to identify the islands, you're going to
look at the islands, not at the ends of the oceans.
You're going to look at where the islands begin and end
not where the ocean begins and ends. So let's think
about how that might be applied to a human gene.
What we now understand is that the way that
introns and exons are recognized in human genes
is through what we call exon definition. And that is
that even though the introns are of highly variable length--
going from a few hundred nucleotides up to
tens or hundreds of thousands of nucleotides--
the exons tend to be very much uniform in their length.
The average exon as we said was 123 nucleotides
but they vary in length from about 25 (there are some
few shorter. There are exceptions to every rule.) but
in general 25-300 nucleotides is the size of an internal
exon in a human gene. And so what the human splicing machinery
looks for is a good match to the consensus branch site
and 3 prime splice site followed within 25-300 nucleotides
by a match to the 5 prime splice site. And so this
then defines an exon. But even with that distance information
between a 3 prime splice site and a downstream 5 prime splice site
that's still not enough information to uniquely identify
an exon and it's also not enough information to allow
for alternative splicing of that exon. So in addition
to the consensus sequences, there are also other sequences
called exonic splicing enhancers or splicing silencers.
So these enhancers or silencers can be either in the
exon or they can be in introns. And if they're in introns
they're called intronic splicing enhancers or intronic splicing silencers.
So that would be the ESE, ISE, ESS, and ISS sequences.
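The distance rule behind exon definition can be sketched as a toy scan: find an AG (the end of an upstream intron) followed 25 to 300 nucleotides later by a GT (the start of the next intron). This is a minimal illustration only. It ignores the branch site, the pyrimidine tract and how well each site matches the full consensus, and the function name `candidate_exons` and the test sequence are made up.

```python
def candidate_exons(seq, min_len=25, max_len=300):
    """Return (start, end) pairs for stretches lying between an upstream AG
    (3' splice site) and a downstream GT (5' splice site) whose length
    end - start falls within [min_len, max_len].
    Coordinates are 0-based and end-exclusive on the DNA sense strand."""
    seq = seq.upper()
    # An exon would start right after an AG and end right before a GT.
    ag_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    gt_starts = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    hits = []
    for start in ag_ends:
        for end in gt_starts:
            if min_len <= end - start <= max_len:
                hits.append((start, end))
    return hits


# Invented sequence: intron tail, a 30-nt "exon" of C's, then an intron head.
toy = "TTTTCAG" + "C" * 30 + "GTCCCC"
print(candidate_exons(toy))
```

On real genomic sequence, AG and GT dinucleotides are so common that a scan like this returns a very large number of candidates, which illustrates why the additional enhancer and silencer sequences are needed.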
Now these sequences tend to be clustered around exons
and they are recognized by two different kinds of proteins
in general. And these proteins are called the SR and the hnRNP proteins.
Now SR proteins are RNA-binding proteins that have
a C-terminal domain that is rich in arginine/serine dipeptides.
And hnRNP proteins are another set of RNA-binding proteins.
They do not have this RS region. But in general, the SR
proteins tend to recognize splicing enhancers. They tend
to increase splicing and hnRNP proteins tend to
recognize splicing silencers and so they tend to inhibit splicing.
And it's the balance of the SR and hnRNP proteins
that are in a cell that determines the splice pattern
for any particular RNA. So you can think very much about
splicing decisions being made as a committee decision
so in addition to the SR proteins and the hnRNP proteins
there's the core splicing machinery and we're going to talk
about that in part II of my talk. But the core splicing
machinery is what's responsible for recognizing the
consensus sequences at the splice sites. And then
in addition, at the same time there would be these
SR proteins and hnRNP proteins and then it is the
conjunction of all these different factors that then
makes the decision as to whether or not an exon is going to be
utilized and made into mRNA. So one thing
I'd like to point out about these splicing enhancer
and silencer sequences is that mutations in those sequences
can lead to human genetic disease. So about 20%
of human genetic disease is caused by mistakes in splicing.
And these mistakes can occur if a mutation occurs
at one of the consensus sequences--we've already
seen that for the beta-thalassemia genes--
or when a mutation occurs in one of the silencers or
enhancer sequences. So a common mistake that is made
when scientists are hunting for the mutations that
cause a particular disease, if a mutation is found
in a coding exon and that mutation changes the amino acid that's
coded by that exon, it's often assumed that that mutation
is having its deleterious effects by changing an amino acid
in the protein. But in fact many times the mutation can actually
change the splicing pattern. So for example, leave out
a whole exon or change a splice site and that itself would be
much more deleterious. So it's important to think about that if
you're trying to hunt for mutations that cause a particular disorder
in either humans or some other model organism.
And it's also important to note that we can't ignore
mutations in introns. Not only are mutations at
the consensus sequences going to be deleterious
for splicing but mutations in these intronic splicing
silencers and enhancers can also change the splicing pattern. So
in the future, we're going to have to be sequencing
introns as well as exons when we're looking for mutations
in human disease. So here are some examples of
those kinds of mutations. The top here, we have
the SMN gene. Now all of us have two copies of this gene:
there's SMN1 and SMN2. If you have a mutation
in SMN1, such that it is disrupted, this is a problem
because the SMN2 gene is identical to the SMN1 gene
except that it has a T mutation in an exonic splicing enhancer
such that in the SMN1 gene exon 7 is always included
but in SMN2 exon 7 is included only about 20% of the time. So if you're
lacking the SMN1 gene, then you do not make enough of the
SMN protein and children who do not make enough of this protein
have problems with development of their motor neurons
and have spinal muscular atrophy. Here's another
example. In dystrophin there is a mutation which causes a
stop codon. So one might assume that the problem is
that it makes a truncated protein. But in fact
this stop mutation, this nonsense mutation, is in an exonic splicing silencer,
or it becomes an exonic splicing silencer such that
that exon is left out. And it still makes the protein
but it's lacking that exon. And then this is a third example.
Patients that have frontotemporal dementia often have
a mutation in the tau gene. And that mutation
causes an exonic splicing enhancer to be stronger such that
they're making more of the version of the protein
that has exon 10 included and less that has exon 10 excluded.
And this 50/50 ratio is very important to have
normal amounts of the tau protein. So in this case,
the patients are overexpressing the exon 10-containing form of the protein
and that leads to frontotemporal dementia.
So finally I'd like to leave you with my take-home messages.
The first is that eukaryotic genes contain introns.
Not every eukaryotic gene contains an intron
but certainly in humans the vast majority contain
introns. These introns facilitate the evolution of new genes
and by allowing for alternative splicing, they allow us to
greatly increase our proteome complexity. And
higher eukaryotes utilize exon definition to define
splice sites because the exons are like islands in
oceans of intron. And then finally, many human
hereditary disorders are caused by mutations
that affect exon inclusion or splice site choice.
So this ends this part of the lecture. If you're interested
in learning something about the machinery that
removes the introns, load up my next lecture which is
spliceosome structure and dynamics. Thank you.