Tip:
Highlight text to annotate it
X
Andy Baxevanis: Okay. Good morning, everyone. Thanks again
for joining us for this third week of Current Topics in Genome Analysis. We're going to
pick up right where we left off last time around in looking at different ways to think
about sequence data. First, we have to get the requisite disclosure out of the way. Once
again, I don't have any relative financial relationships to disclose this morning.
Okay, so where we left off last week was we were talking quite a bit about homology searches
and in thinking about that, we spent a lot of time talking about blast. And you can think
about blast as essentially a one against one way of looking at sequence comparisons. You
start with a sequence of interest and you now compare that one by one to a collection
of sequences in some public database or perhaps in a list of sequences that you've compiled
yourself. So, really, again, one against one.
Another approach for doing this is to instead look at collections of protein families, and
look at those in what you could call a one against many, or many against one sort of
way. The reason for doing this is we can gain a lot of information by looking at the collective
characteristics that we can find by considering an entire protein family all as a single entity.
We can identify new members of a protein family that may not have been found using the standard
blast-type searches. We can also find some commonalities in domain architecture by doing
so. So, we're going to discuss four methods this morning. Two that are one against many:
Pfam and CDD; and two that are many against one, the other way around: side-blast and
delta-blast.
So, like we did last week, let's just go ahead and start with some definitions. So, the first
term we have to define is something called a "profile." And all a profile is, is a numerical
representation of a multiple sequence alignment. So, we're going to take a set of aligned sequences
and represent that as a numerical matrix. The reason we can do this and capitalize this
on this in our sequence comparisons is because it allows us very quickly to identify patterns
or motifs that are conserved across the entire protein family. The common characteristics
that define that protein family. What that, in turn, allows us to do is find similar between
sequences that have little or no sequence similarity, and I'll give you the same example
I gave you last week with respect to the homeodomain proteins. If you look at that protein family,
the family is characterized by a 60 across -- 60 amino acid across -- multiple sequence
alignment. But there are very few positions in that block of 60 that are absolutely conserved
across the entire 60 amino acids. And, so, by looking by traditional blast searches,
more than likely you wouldn't identify all of the members of that particular protein
family. So, when you're trying to look at distantly related proteins, these are now
the tools that we're going to use to try to tease those out. So, these are really good
techniques to have in your arsenal.
So how does this actually work? How do these profiles get constructed? So, let's consider
we have a multiple sequence alignment here. So, we've got all of these sequences. Ten
amino acids across. You'll notice some of the positions are more conserved than others.
So, we've got a G that we always see in the 10th position. In the next-to-last position,
we normally see a proline, but sometimes there's a threonine thrown in for good measure.
So, we look at this multiple sequence alignment and we're going to ask ourselves four questions:
which residues to we see at each position, so which ones can or cannot be at a particular
position in this protein family; what's the frequency of those observed residues; how
often do you see a given residue; is it an absolutely conserved position or is a little
bit of wiggle allowed, which positions are conserved, and where can the gaps be introduced.
Once you answer all four of those questions, there's an automated method that will take
these multiple sequence alignments and generate something called the position-specific scoring
table or a position-specific scoring matrix. And this is now what we're going to use as
the basis for comparison to an individual sequence of interest.
Let's just quickly talk about what all of these numbers mean, the same way we talked
about the blossom matrices last week. So, again, if we look at position ten here, we
see that there's always a G in that last position. I want you to imagine that this alignment
is now flipped onto its side, so we've got position one to ten going across and the alignment
position one to ten going down in the matrix. So, if we have a sequence of interest in the
10th position of that sequence if we see a G, we just find the G here. We find the corresponding
G there. Find the value at the intersection and, in this case, we're going to assess 150
points for that exact match.
In the next-to-last position, we've got, again, a proline in most places but the occasional
threonine. So, if we do this same thing and look for the intersection of the proline in
the consensus sequence versus the proline in the sequence of interest, here we've got,
in this case, 89 points. So, the same type of thought pattern that goes behind the blossom
matrices that you give the most points for an exact match. You still assess a positive
value for conservative substitutions, just not as many points as you would have for the
exact match.
Finally, if you have a residue in your sequence of interest that doesn't match the consensus
at all, you'll start to see negative values in the table in the positions where there's
really a lot of variability, you'll see the positive values go down. So, again, a way
to represent this multiple sequence alignment as a series of numbers. Again, you don't have
to calculate these by yourselves, but you can now use these to take a single sequence
and compare them to hundreds of these position-specific scoring tables.
The other definition for this morning is something called a "pattern," so let's go right to the
example. So, we've got this very interesting nomenclature up here on the screen, and we'll
just take it position by position. So, in the first position here, you see two letters
with square brackets around them. And that just, when you see the square brackets, it
means one or the other. So, one of the things in the square brackets. So, in this case,
we have a pattern that begins with a phenylalanine or a tyrosine in the first position. Where
you see the X that means any amino acid can occupy the second position. We just have a
C here all by itself, so that means you must have a cysteine in the third position. X followed
by a number, so that just means two of those. So, any two amino acids.
The curly brackets, as opposed to the square brackets that mean one of the things in the
brackets, the curly brackets mean anything but what's in the brackets. So, not valine
or alanine. Again, we have an open position that could be anything. And, finally, three
histidines at the end. So, the difference between the profiles and the patterns is that
the profiles give you information both on position and frequency where the patterns
only give you information about position. So, this just becomes a simple pattern-matching
exercise to find family members or assign a new protein of unknown function to a particular
protein family. Both of these are incredibly useful ways of approaching the problem of
assigning characteristics to an unknown protein.
So, let's go ahead and put these into practice by talking about the first method, the first
database, and that's called Pfam: protein families. So, what Pfam contains is a large
collection of multiple sequence alignments based on protein domains and conserved protein
regions, and because those regions are conserved, they more than likely have some sort of structural
or function importance? When you look at one of these Pfam entries, you'll see some very
good information. You'll obviously see the multiple sequence alignment that was used
to derive that matrix. Protein domain architectures. You'll get an idea of a species distribution
of where that particular protein family can be identified. So, is it in certain species
and not in other species?
Information on known protein structures and links to other relevant databases. As we go
through all of the things I'm going to show you this morning, you'll see references along
the bottom, and I would encourage you to pull those references to get a more in-depth view
of each of the things we're going to discuss this morning. Okay. So there are actually
two databases within Pfam: Pfam A and Pfam B. So, Pfam A is the better of the two because
they're based on a curated multiple sequence alignment, so someone has collected information
based on their biological knowledge, created a seedal alignment, and then has used that
alignment to gather in the rest of the members of the protein family. The methodology used
to do that relies on a program called HMMER that Sean Eddy at the Janelia Labs have developed
many years ago. And because of the way this construction is done, because of the curated
nature of the method, the hits are likely to be true positives.
There's also something called Pfam B. The difference between the two is that unlike
A that is based on curated alignments, B is based on automatically generated alignments
from database searches. So, those are deemed to be of lower quality, but you can certainly
take a look at those if you don't find anything matching your sequence of interest in the
Pfam A set. So, the old documentation just always compared Pfam A to Pfam B in this way.
They would say that Pfam A, you would have a handcrafted multiple sequence alignment,
so you should think of that as like a nicely handcrafted beer. You know that it has much
more care given to it and by virtue of that, it's of much higher quality.
Okay. As with last week, all of the examples I'm going to take you through this morning,
you can do them yourself when you get back to your labs. So, this is the URL for the
webpage that has all of the sequences. So, again, I would very much encourage to do this
when you get back because it's one thing to watch me do it. It's another thing to actually
have your own hands on the keyboards actually going through the exercises.
Okay. So, we'll go to the Pfam website. Here is the URL over at Sanger. And very simple
homepage. You can do a sequence search -- we're going to do that in a moment -- but you can
also look at members of a family. View a clan, which is just a group of related families.
Structures and so on. So, we're going to go ahead and just do a basic sequence search.
So, by clicking on those words pops up a box. Nothing else. Okay? As with the examples we
worked last week, there are always options we can set and default values you can change.
The way you do that here is there is a very easy-to-miss hyperlink right here that says
"here." Okay, so we're going to go there. And take a look at what we get. Not a lot
of options, but a couple more things you can do. So, we're going to go ahead and put our
query sequence in the box from the example webpage. You can either run the search using
an e-value cutoff. So, here the e-value is 1. The rules that I gave you last week for
BLAST govern here as well. Okay? So, we're really looking for things that are at 10 to
the minus-third or better at the protein level. But we're going to, just for the sake of the
example, leave that at one for now.
The gathering threshold really applies to a cut-off that was used by the curator to
develop this particular profile so you can safely ignore that. If you want, you can search
for the Pfam Bs by just click on the box, but we're going to just leave that the way
it is right now. Okay, so we'll go ahead. Submit the search. And this is what we get
back. So, just a quick overview. This represents your sequence of interest, so there's a very
light grey bar in the background. And you'll see that it identified, in this case, a P4-50
domain. But not all of one. You'll see that this side is rounded, this side is jagged.
So that means you've got part of it, but not all of it. If we look at the significant matches,
cytochrome [spelled phonetically] P4-50 was found from position 41 to 500 in our sequence
of interest. It matched the profile from 1 to 457, so the reason this is in red is that's
to communicate to you that it didn't match the entire length of the motif, just part
of it. Below, you'll see what's marked insignificant Pfam A matches. So, two hits come up. You
do want to take a look at those, even though they don't meet our statistical criteria,
the e-values are here. But you might have, again, biological information from research
you're doing in the lab, from things you've read in the literature that might deserve
giving these some additional consideration.
So now let's just concentrate on that P4-50 hit. So, if we now click on the view link
from the previous page, we get an expanded view here. And a lot of text. Okay. So, let's
just focus on this part of the slide. So, we start off and the first row is your sequence
of interest. It's just marked "seq." The second row is the consensus sequence that's based
on that position-specific scoring matrix. So, any place you see an exact match, you'll
see the letter repeated. Any place where there's no match to the consensus, you'll actually
see a plus-sign in this example. I'm sorry, that's indicating where you have a conservative
substitution. The PP line -- posterior probability -- you will see numbers here. The numbers
indicate to you how good the match is at each position and the key is up here above. So,
it tells you that if you see it's a zero, you've got 0 to 5 percent confidence. One
means 5 to 15 percent and so on. So, you really want to see as high a number as possible in
those positions. The star means that it's the best you can actually get and in most
of our positions here, we have asterisks going all the way across.
Okay. So, all right. So now we can go directly to some more information about this P4-50
domain just by clicking on that particular link. If we go to the Pfam tab at the top,
we get a nice executive summary. If you knew nothing about what side of Chrome P4-50s were,
right there you get a quick summary of what they are, some of the most important literature
references that you should be reading before you even start doing any work on this particular
protein family. You also get a representative structure on the side here. A different one
will come up every time. And you can just go through them either by pulling down the
pull-down menu or just click on "View a different structure." It will just take you through
some of them.
Now, if we look at the top, it tells us how many hits are actually found in this particular
protein family, how many members there actually are. So we find that there's 282 different
architectures comprising just shy of 40,000 sequences. So, let's go ahead and take a look
at some of those things. So, we're going to go down the left sidebar. We were at summary,
we're going to domain organization. And so we see here all of these schematics, all these
graphics, showing different combinations of protein domains that are found in subsets
of those 39,592 sequences. So, in the first case it says that 33,600-some odd of them
have this kind of orientation, this kind of domain structure where you've got a P4-50
domain and another domain that's in front of it that's represented by this little orange
box over here. If you wanted to see all of those sequences, you would just click on the
show link and it would give you all of those right away.
So, sometimes when you don't see absolute sequence similarity, it's good to look at
things at the domain level because you can start to infer function just by looking at
"Do I have a similar domain architecture as the sequence that I'm starting with?" From
a clinical point of view, it's important to think in these terms because you can look
at this and start to ask yourselves "Well, do I have a mutation, a clinically-relevant
mutation that may be knocks out one of these domains that this domain no longer forms?"
So, you either have a structural consequence or a functional consequence because you have
that disease mutation.
Okay, 39,592 sequences. Let's take a look at a subset of those. So, we now go to the
next link down. Alignments. And, on this page, you can take a look at the actual alignment
in a number of different ways. Every place where there's an alignment available, you'll
see a check mark. And if you were to take your cursor and mouse over that particular
check mark, it would actually switch over to reading as a view. So, it will just change
into a hyperlink. And so the one that I've moused over here is under the seed sequences.
So, again, the seed alignment is what that expert put together. The known members of
the family as the seed for bringing in the rest of the members in this particular family.
And if you go ahead and click on that, it will bring up a viewer called JalView. This
is within the context of the web browser. We're going to see it later on in a more interactive
fashion. The colors do actually mean something here. So, in the first set of colors, the
key is on the right-hand side. It gives you some sort of coloration that gives you an
indication of the physical, chemical characteristics of the residues in those particular positions.
Things that are more commonly going to be able to substitute for one another, just by
looking again at the entire constellation of proteins. What we know about substitution
patterns. So, things that are either small or hydrophobic. Things where you see charged
amino acids: histidines, tyrosines, and so on.
Occasionally you will see rows that are marked with an SS in parentheses. And the SS stands
for "secondary structure." So, these are places where there is known secondary structural
information. So, the position of the alpha helices or the position of the beta strands
are known. By virtue of that, that makes for a much better alignment because you can use
that information to better predict what is supposed to be where. So, it really goes a
long way in helping these alignments be much more robust. And we'll talk about that again
towards the tail-end of the lecture. So the key for that is also here. So, in the rows
where we have that SS designation, any place you see the red blocks, those are the alpha
helices. Any place you see the yellow blocks, those are the beta strands.
Okay. Let's go back to the Pfam page. And take a species-centric view of this information.
So, this very colorful representation here is supposed to represent all of the organisms
on the planet. You see the color assignments on this side of the slide here. If you look
very closely, you can see that the metazoa are here followed by the chordates followed
by the mammals going there. And you can just take your cursor and just click on those particular
regions to just focus in on those particular taxa. What I've highlighted here is an organism
called nematostella. This comes from the phylum Cnidaria. These are organisms that are unified
by the fact that they have stinging cells, cnidocytes that they use to capture their
prey. So, this includes hydra and jellyfish and corals and sea anemones. By highlighting
that box, it also shows me the lineage up here that gets you from the root to that particular
organism and you can very quickly just download all of the sequences from this protein family,
from that particular organism, with a single click of a button. So, it's a nice tool to
have if you're looking again at a particular organism and you just want to find all of
the -- or a particular class of organisms and phyla and whatever. To quickly pull all
of that information together, rather than having to generate those sequence files one
by one.
Okay. We'll pretend we've gone back to the summary page here and now we're going to just
scroll down that page to where we have a bunch of external database links. And the one I'm
going to take you to is the one right in the middle called ProSite and this is where we're
going to see the first example of a pattern. The designation that we talked about earlier
where we know what the positional information is for each member of a protein family. And
if we look at the big box at the bottom, the very first item in that box gives us the consensus
pattern for this particular sitochrome P4-50 family. You have links to other databases
as well, but the most important thing, I think, on this page is the part that says "Expert
to contact by email." Okay, so these are folks who keep these entries up-to-date and have
agreed to serve as a reference point if you have any questions about this particular protein
family. So, you would be somewhat foolish not to take advantage of this kind of resource
because this person is making themselves available to you. If you're studying sitochrome P4-50s,
you have a question that has befuddled you, you can't find the answer in the literature,
click on the link. Send them the email, and they'll get back to you. So, it's a good way
to actually get to a human resource rather than just looking at all of the database entries.
Okay.
So, the next thing we're going to talk about is related. The conserve domain database or
CDD. So, what we're going to use CDD for -- same sort of goal here is to look for conserve
domains in a protein sequence. The main difference between Pfam and CDD is that in the calculation
of the CDD entries, what they do is incorporate three dimensional structural information and
by doing that, again,, what it helps you do is to define domain boundaries and it allows
-- gives you a really powerful tool for refining the alignments. The data comes from a number
of sources, including Pfam, but not Pfam B, and a bunch of other databases that are focused
on collecting information on groups of orthologous proteins.
So, the way the CDD searches are conducted is through a variant of BLAST called RPS-BLAST
-- reversed position specific BLAST. So, basically, it's just a glorified way of saying we're
taking a single sequence, our query sequence, and we're going to search that against a bunch
of pre-calculated position-specific scoring matrices. The same way we did with Pfam A.
The important thing to keep in mind is the method that is used by CDD is not the same
method that's what is used by Pfam. Okay? So, similar to the advice I gave you last
week about using multiple scoring matrices when you do your BLAST searches. When you're
doing this kind of search, looking for protein domains, you should use both of these methods
when you're delving into these kinds of questions, just to hedge your bets. Okay.
So, here we are at the CDD homepage. URL, again, up in the corner. We're going to take
a sequence from our sequence webpage. In this case, this is the sequence of the deleted
colorectal carcinoma gene, DCC, the protein that's encoded by that gene from human. We're
going to leave all of the options exactly the way they are, but if you wanted to only
search one of the databases from the list that I showed you earlier, you could select
just a single one. But we're going to just throw everything at it. Go ahead and click
on the submit button. And off we go.
So, here we get a display. The top part is very reminiscent of what we saw last week
with our BLAST searches. So, your query sequence is represented by the grey bar going across
the top. Position one to position 1447. Every place that a domain or a motif has been found
is marked in relation to the coordinates on this line. So, there's one-to-one correspondence
between all of these shapes and each one of these lines in the table. We're going to focus
in on the very first one, this neo-geneine domain right here at the C-terminal end of
our DCC protein. In order to learn a little bit more about that, there's just a plus-sign
over there. We're going to go ahead and click on that. It'll expand that part of the page
and we get a very quick, again, executive summary of what this actually is. You find
out that it's an immuno-globuland-type domain. And how it is actually composed. The alignment
shows you the alignment of your sequence, the top line with the CDE domain matrix. The
second line, every place you see a blue letter, that means -- excuse me, every place you see
a red letter, that's an exact match. Every place you see a blue letter that is a mismatch.
Okay.
We can get some more information about this particular domain by looking at this link.
It's very hard to see here in the blue following the name of the domain. And so if we were
to click on that, it would take us to now the domain page. Again, the same information
we saw before. The most important part of this page, to me, is the stuff that's all
the way at the bottom. So, we're going to pretend that we've scrolled down and here
we have the sequence alignment. So, you can see which blocks of this particular protein
domain are conserved all of the members across this particular family. You can actually download
the sequences from here and this now puts you in a position of not even having to construct
a multiple sequence alignment. We're going to talk a lot about that today. But you now
have the tool in your hand to be able to start doing multiple sequence alignments and we're
going to talk about, again, a lot of that later in today's lecture. So, again, a good
way to just be able to pull homologous together without actually having to go through the
routines that we talked about last week. So, again, something else for your own material.
Okay.
So, we come back here. We've talked about the one-against-many techniques. Now, let's
flip it the other way and talk about many-against-one or many-against-many, depending how you want
to look at it. The first technique we're going to talk about is another variant of BLAST
called PSI-BLAST and the PSI stands for "Position-Specific Iterated" and what this allows you to do is
to start to identify distantly related members of a protein family. That, again, you might
not have found in the context of a regular BLAST search. The BLAST searches start to
fail around the 25 percent mark. So, this is the reason I gave you that cut-off criteria
last week for protein-protein searches that one of the criteria you're going to apply
is to look for at least 25 percent sequence identity. Because as you get below that, you
really can't be too sure that those things are actually proper, real matches. That they
might actually be false positives. So, this is another thing you can throw at it to maybe
push the boundaries a little bit. This is a really nice version of a profile-based search
and the way it works is like this. In the first step, you do a regular BLAST search.
So, we're going to start with a protein sequence and we're going to search that protein sequence
against a database of our choice. What will then happen is we'll get back our hit list,
the same way we did last week, but based on the results of that hit list, what's in there,
we're going to take all of those hits, take all of those sequences. Align them. Make a
position-specific scoring matrix. And then that matrix becomes the input to the next
round of searches. So, this is now an interactive technique that each time you go around, you're
going to gather in more and more sequences. You're going to keep adding them to your alignment.
You're going to keep recalculating the matrix. You're not, it is for you. And by doing that,
we're going to just keep increasing, increasing the resolving power of that position-specific
scoring matrix in finding new members of the protein family.
All right. So, in practice. Here we are at the BLAST homepage. This hopefully looks familiar
to you now, since this is a BLAST-based search, we're just going to go to the protein BLAST
section of the basic BLAST collection that we have. This is, again, the BLAST search
page. We put in a sequence here. In this case, this is the high-mobility group protein from
humans, so this is a DNA binding protein. As before, we can select which database we
wanted to search. So, we have a pull down here and this time around, I picked something
called Swiss-Prot. So, let's talk about that for a minute. So, what Swiss-Prot tries to
do is something very similar to Ref-Seq that we talked about last week. So, you'll recall
that Ref-Seq is an effort by NCBI to represent each and every sequence in the central dogma
once and only once. So, it would have a single entry for each DNA, RNA, and protein sequence.
Non-redundant. What Swiss-Prot does, it's a similar effort. A long-standing effort that
started many, many years ago by the folks at the Swiss Institute for Bioinfomatics.
Only on the protein side of the house. But, pretty similar in what its end goals are actually
are. So, you'll have by definition, the database is non-redundant like Ref-Seq. There's on-going
curation by the staff at EBI, the European Bioinfomatics Institute. But also by external
experts. People similar to the ones that you saw in the Pro-Site entry that take it upon
themselves to make sure that these entries are kept up to date, the folks who are actually
working in the field, doing experiments, having their hands in it every single day.
What you'll see in the entries of those Swiss-Prot entries is that you'll have some extra fields
there, including comment lines, and I always point this out to people because, again, that's
the nice executive summary that comes to you by the expert in the field. So, again, you'd
be foolish not to take use of that information, because the expert has put in there what they
think it's most important for you to know about that particular sequence. The way you
know that a sequence is a Swiss-Prot sequence goes back to the accession number. And, again,
the accession number is the unique identifier, the Social Security number for a particular
sequence. So, here is how the representation is. And you already know what the square brackets
mean, so it's one of these letters. So, it either starts with an O or a P or a Q, and
then has a five-digit number following. Okay. All right.
So we come back to our page here. What I've also done in this case is I've limited the
organism selection here to just the vertebrates, so remember you always have the ability to
search subsets of the database based on what organismal spread you're interested in. The
program selection is in the red box, so what we were doing all of last week was the BLAST-Ps,
this time we're going to pick the second one down PSI-BLAST. As usual, we're going to take
a look at the algorithm parameters to maybe tweak things to our best advantage here. The
e-value threshold, by default, is 10. We're going to change it to 0.001 per our guidelines
from last week. We can, again, filter out our low-complexity region. So, low-complexity
regions are those homopolameric run places of bias composition in our sequences that
could confound the BLAST algorithm. We're going to set a second e-value threshold, a
PSI-BLAST threshold, also to 0.001. You'll see the default here. Again, what I like to
do is always check this box here at the bottom so that my results are coming up in a new
window so if I ever have to come back to this search page, it's still there. I don't have
to keep hitting the back button to get here. And then we're just going to go ahead and
BLAST it.
All right. So, hopefully this looks more familiar to you now, given last week's lecture. Exact
same format. So, we have our query sequence. One to 215. Two HMG box domains have been
identified here, so two punitive DNA binding regions. And our overview of the BLAST results
by score. We're going to just scroll right down to the hit list. Again, we have our list
of alignments here. So the description, the definition line -- the very short one-line
identifier of what each one of these sequences actually are: our score values. Remember,
the score values are important but we're not going to look at those. Instead, we're going
to look at the e-values which are calculated from the score values. The great neutralizer
that allows us to search to compare any one BLAST search to any other BLAST search. And
these are now ranked from lowest to highest. You'll see this time around two extra columns,
and the one that's highlighted says "select for PSI-BLAST" and all of the check marks
are checked off. So, you'll remember how this works. We start with a sequence. We do a -- the
first round is just a regular BLAST-P search. We now have to get around the business of
specifying what is going to go into the next round. What is going to be used for calculating
that matrix which will now be the query for the next round of searching? So, we're going
to just go ahead and include everything. That's what all the check boxes are there for. They're
automatically checked by default.
Now, in this case when we look at the e-value column, we're still well above the rules that
I gave you last week, the 1x10 to the minus-30. But let's say we weren't. Right? You would
see everything above the line would be checked off, everything below the line would not be
checked off. But, again, this is where your biological knowledge really comes into play.
You may have information that says something below the line should be included and, perhaps,
that something above the line should not. So, this is where you have to look at this.
Check off some additional boxes, uncheck some of the other boxes. If you see something that
you know is not correct, that you know is actually a false positive. All right. To now
do the next round, all you have to do is click on that Go button. It will calculate the matrix
based on all of these sequences. And so here we now have the results page again. But now
we've got a new column here "Use to build PSSM." This just tells us all right, what
was used in the last round to get us here. If we scroll down, we're going to see a bunch
of yellow lines. Okay, there's nothing checked there because these are the ones that were
just found. Okay, so these are all the new sequences that we found by virtue of having
all of this information from the protein family collectively. So, we're able to tease out
additional sequences here. Most of these are for something called the sex-determining protein.
So, this is TDY or SRY. These are members of this family, so we're going to go ahead
and include these in the next round of searches. Okay.
And we're going to just go around and around as many times as it takes. Two, three, four.
And in this case, five. Okay. Where we just keep just hitting that Go button at the bottom.
You'll know you're done when it says "no new sequences were found above the 0.001 threshold."
So, we started in round one with our regular BLAST-P search. We found 77 sequences. By
going through this process, we ended up with an extra 100 sequences, so that's a pretty
substantial number. Okay? Things that we would not have found using just the traditional
BLAST search or even just by using the default values. So, this is why it's important to
be able to know what those defaults do and changing them when necessary. So, I think
this demonstrates very well how you can use the collective characteristics of a protein
family to really find these distantly related members. Again, things you wouldn't find otherwise.
Always, always, always please check the statistics, especially in the percent identity column,
because even though our criteria that the method applies, we specified e-values, but
we have not specified percent identity cut-offs, so you need to do that by eye. So, if you
see things that are too low, just take them out. Okay.
There's a newer variant of this. Got another member of the BLAST family called Delta-BLAST.
And so this method is slightly different from what's used by PSI-BLAST but not grossly different.
So, it's pretty easy to understand. Instead of doing a BLAST-P search in the first round.
Sequence of interest against in our example Swiss-Prot. Instead it's going to take the
sequence of interest and search it against all of the entries in CDD, the database that
we talked about earlier. Okay? So, it's going to take that profile information that it finds,
compute a new position-specific scoring matrix, and now use that as the input for round one.
It's an iterative process, just like PSI-BLAST. In round one, it's searching against CDD.
Starting in round two, it's just successive PSI-BLAST searches. So, from that point on,
it's exactly the same. It's just, how is the first round actually calculated?
What the authors set out to do with this particular method is really come up with a more robust
way to find homologs, and you end up getting some really nice, high-quality alignments,
even when you're starting to push the boundaries of sequence-similarity, where things are getting
down right around that 25 percent Twilight Zone. The only caution I have to give you
here is while the methodology is very robust, it really depends on having entries in CDD.
All of the protein families may not be necessarily represented there. That information on homologous
proteins just might not be there. I think currently there's about 75 percent of all
known protein families are cataloged within CDD, at least based on their documentation.
So you just need to keep that in mind. That you may not be looking at everything that's
available to you. So, this is yet another case of you've got multiple methods available
to you. Absolutely do both. I would encourage you to pull this paper because it shows you
some performance data information, the usual Venn diagram representations where you see
how many were found by one method, how many were found by the other, but then how many
are in the overlap and there's quite a few that fall on those other sides. It really
would behoove you to use both of the two methods, and the authors recommend that as well. Okay.
All right, so that's it as far as individual sequences go. We spent most of the last two
weeks talking about this, so now let's take this the next step and think about multiple
sequence alignments, which are just in essence a whole bunch of pare-wise sequence alignments.
Okay, so what can you learn by doing these multiple sequence alignments? As we've talked
about many times, you can start to pick out conserved regions, patterns, and domains.
Why is that of use to you? It can start to help inform you about how to design your experiments.
So, where are the regions of a particular protein that you should be looking at? Where
are the things that might be a little bit less important with respect to structure and
function? Again, as we've harped on quite a bit, it helps you pull in new members of
protein families as well. If you don't have these, it's very hard to do everything that's
in the bottom part of the slide, so when we start talking about predicting secondary structure,
yes you can do this on single sequences, but it's much more powerful when you look at a
family of sequences. You cannot perform any sort of phylogenetic analysis unless you have
one of these things, so even the idea of looking at evolutionary relationships becomes impossible
without having these multiple sequence alignments in hand. And, again, as we've talked about,
it gives us the input to many of these search methods. Okay.
So, what do we need to think about when we are constructing one of these multiple sequence
alignments? And these are themes that we've hit on several times already in the last two
weeks. We want to keep in mind positions in our multiple sequence alignments, so just
the columns of the alignment where we can have absolute sequence similarity. We're going
to try to align as many of the common characters as possible from sequence to sequence to sequence.
By doing so, hopefully we can start to think about conservation, so we'll start to see
patterns where maybe there's only one or two or three amino acids in a particular column.
That gives us an indication of where can you have conservative substitutions that don't
adversely affect the function or the structure of the protein.
Finally, it gives us some idea about structural similarities. So we can, again, as I've alluded
to several times now, if we have secondary structural information, or information from
either NMR structures or X-ray structures, we can use that information to fine tune the
alignment. So, I know many of you, probably all of you at some point, have tried to put
together a multiple sequence alignment, so we're going to go through some general guidelines
to help you bring up your game a couple of notches. All right. So, the first thing is
when we do these, we tend to concentrate on the protein level rather than on a nucleotide
level and hopefully that's become a little more obvious to you as to why, as to why as
we've gone through a lot of the considerations we've talked about. When we look at these
protein alignments, they tend to be more informative quite simply because you have 20 amino acids
that you're trying to align on the protein side versus the four nucleotides that you
are trying to align on the nucleotide side. And so it makes for more accurate alignments.
It goes back to the side-chain chemistries of each of the amino acids and all of the
physical chemical properties that come along with that.
Now, of course, there's going to be cases where you can't do that and so let's say you're
going to be looking at regulatory elements. Things like that by necessity have to work
with nucleotide-sequences and Laurel Nitzski, in two weeks, is going to spend much more
time talking about exactly that. All right.
So, this slide is just intended to give you the 30,000 foot overview of what you have
to think about as you do one of these. And so we're going to take each one of these off
in order to give you a better appreciation for how to approach this question in the future.
So, you want to put together a multiple sequence alignment. You need sequences. Okay, so how
do you actually pick them? First is how many? And so you want to keep that number relatively
reasonable and the reason for that is when the algorithms that compute these multiple
sequence alignments are global alignment methods. We've talked about the difference between
global and local last week. The global alignments, you're trying to align the entire sequence
along its entire length with every other sequence in the set. Because of that, this is computationally
intensive. It takes a long time. When you start to put too many sequences in your input
set, the methods just start to choke and you might start getting inaccurate alignments.
The methods are getting much better with time. They're able to handle more and more sequences
in the input sets, but it's something to be mindful of.
The other thing is sort of a more practical issue that if this is the first step to doing
a phylogenetic analysis, you have to keep in mind the next step in the process. That
-- okay, this is now going to produce the input to the program I'm going to use to generate
my trees. Same considerations apply there. The more and more that you have in that input
set, the harder and harder it is to actually generate reasonable trees that can actually
be interpreted. So, it becomes almost impossible when the data sets get too big. So, good starting
point, 10 to 15 sequences. I'm saying ballpark upper limit here 50 to 100. I know that's
a very wide range, and it really just depends on how either closely related or distantly
related each one of those proteins in the set actually are.
All right, the sequences should be about the same length and the way you get them there
sometimes is to just trim them down. So, we talked about domain architectures earlier
and we've seen results from our BLAST searches and we've seen things in CDD that show us
where the conserved blocks are. You can use that information to now trim back your sequences
-- BLAST, PSI-BLAST, whichever method you want to use -- to have a more sort of tidy
input set going into the multiple-sequence alignment program. So, normally what will
happen is the action tends to be in the middle. You can safely get rid of the end terminal
end, safely get rid of the C-terminal ends, and come up with the alignment that you want.
If you don't already know that there's some business going on in either the end terminal
or C-terminal ends. Okay.
All right. The degree of sequence similarity. So, depends on what question you want to ask.
If you're looking for required or highly conserved amino acids, you want to push towards having
much more closely related sequences. If you want to go the other way and study evolutionary
questions, you want to use more divergent sequences. So, a good place to start, I think,
is sequences that are about 30 to 70 percent similar because that's going to hedge your
bets on both of these things. But I think that the last bullet really says it best.
That the most informative of alignments you can actually get is when you have a sequence
set where you have sequences that are not too similar but not too dissimilar. Because
if they're too similar, well, you're not going to learn anything you didn't already know.
And if they're too dissimilar, you're not going to be able to align them. Okay. So that's
how you put the set together. We're going to put it into a program. We get back some
results. You have to inspect those. And this becomes an iterative process. What I normally
recommend people do is they start off their alignments using a small set of sequences,
go on the lower ends of the ranges I gave you earlier. And go ahead and just do a preliminary
alignment. And look at, all right, just by eye how many residues are conserved across
the alignment, do the alignments actually make sense, do you see a neat block-type structure,
and have you been able to get away with very few gaps being included. Remember that the
gaps represent biological events. Those are either insertions or deletions, so you can't
just keep inserting gaps without regard to biology. It's going to end up messing up the
alignment and messing up the interpretation.
If you've done this on a small number and the alignment's good, fine. Add a few more
sequences, do it again, and you can just keep going around like that. If you find out the
alignment's not so great, take out sequences where you see tons of gaps are being included,
then realign. So, this is not just a one-shot deal. To do it well, you have to go through
a few rounds of fiddling with the input set to get a good alignment. Once you have it,
you can use some visualization tools that we have to look for key residues and problem
regions. What I always think is a good idea is to cross-check against these expertly created
multiple-sequence alignments and we've seen examples of that through our travels over
the last two weeks. So, you know where to go to look for those. Again, if there's structural
information that you have, absolutely use it because it's going to tell you really definitively
where can you put in a gap and where can you not put tolerate a gap? So, if you have an
alpha helical region, no gaps. If you see a beta strand region, no gaps in the loops.
Fine, you can have them there. Okay.
All right. Interpretation. So, now you've got it. It's to your satisfaction. What does
the alignment actually mean? What can you glean from it? Any time you see an absolutely
conserved position, you know that those are required for either proper structure or function.
So they've been conserved for a reason. If you see a relatively well-conserved position,
you can tolerate limited amounts of change. You can have a little bit of wiggle room there
and not adversely affect the structure or the function of the protein. Now, I think
this is where most people stop considering things. They're looking for what is in common
but I think there's a lot to be learned for looking at what is not in common. The non-conserved
positions where you can actually mutate freely, because this is really where the evolutionary
innovation goes on. Where you can possibly give rise to proteins with new functions.
And finally we've already talked about where the gaps can actually be tolerated. What that
might indicate.
So, let's talk about a particular program, the one that's probably most commonly used
to do multiple-sequence alignments and that's called "clustal omega." And so as you would
expect from a multiple-sequence alignment program, you'll get out multiple-sequence
alignment whether you start with nucleotide or amino acid sequences. It's really fast
in the calculations and as you'll see it's really easy to use. You can provide it information
if you have structural information about where the gaps can go and where the gaps shouldn't
go in advance. So it knows not to put them in certain places. And we'll see this in the
context of a Java applet where we can actually manipulate the alignment a bit. Okay.
Before we actually get to the program, a little bit of theory. So, how this works is using
a method called a Progressive Alignment Method. So, let's say we have a sequence set. It's
got a whole bunch of sequences in it. Rather than trying to align all of them all with
each other all at the same time, we have to do it in a little bit more rational fashion.
And the way the methodology does it is by just taking two sequences at a time, starting
with the two most related sequences and then just building up the alignment from there.
So, it will start with two. It will keep adding and adding and adding. To bring in the ones
that are a little bit less related to the ones that are in the rest of the set. So we're
going to start in the places where the alignment is actually easy and move to the parts where
it's a little bit harder because the sequences are more divergent. I think an example helps
here.
So, we're going to start with four sequences: A, B, C, and D. First thing we're going to
do is calculate a similarity score of each one of those to each other sequence in the
set. So, that's the equation of how many alignments you get so here we have four sequences. It
means we have to calculate six alignments. This is just to illustrate why if your sequence
set is too large that the numbers of calculations start to go up really, really fast, so it
becomes very computationally difficult for the algorithm. So, here's our four sequences
again. And we're just going to put a little scoring matrix together to show us how identical
each one of these is to every other one in the set. So, for sequences, six alignments
from the previous slide. One, two, three, four, five, six. So, by looking off the diagonal
which are the exact matches, we see the two most related pairs are A with B, 80 percent,
and C with D at 92 percent.
So in the first step, we're going to just take A, and align it with B. We're going to
take C and align it with D. Okay. Once we do that, we're going to use this alignment,
fix it. Take this alignment, fix it. And now represent AB and CD as single sequences. So,
we've basically gone from four to two. Okay? We're going to then take those "new" - in
quotation marks -- single sequences, align those, and keep doing that as many times as
we have to, to keep adding the remaining members of the set. So, here we only had four. We're
already done. But you get the idea that you're starting at the tips, merging each new pair
of sequences as you go towards the root of this tree. Okay. So why is this good? Especially
for large sets, you do the easy ones first because, as you already know, those are the
simplest to align. But, more importantly, you're going to use the information from those
easy alignments to help you align the more difficult ones. The ones that are more distantly
related because as you start to pile up these alignments, you're going to get more information
the same way that we did by looking at our position-specific scoring matrices that will
help us later on in the process, get those tough ones into position.
Disadvantages. We're going, there's a directionality to this process and we are fixing these new
alignments as we go. If we make an error early on, that epic error is propagated all the
way through. So, if that alignment is incorrect in a particular position or several positions,
that's just going to stay fixed through the entire rest of the process. You're going to
end up with a possibly bad alignment. Now the good thing about clustal and a lot of
the other methods is that they've now introduced ways for iterations to take place. So, rather
than just doing the alignment process ones, it will do it multiple times. To make sure
that you don't get caught in essentially what's a local minimum. It's going to cost you some
increased compute time, but it's well worth doing. Okay.
Now, there's obviously there's other theoretical underpinnings for different types of multiple
sequence alignment methods and so this is really all you need to know about this one.
If you're going to use something different, please just take the time to understand how
the method works because you just, again, don't want to end up with a bad alignment
because you haven't selected the sequences properly or you haven't taken into account
some of the fine considerations of how it's done. The last thing you want to do is be
in a position where you're going to start to draw improper biological conclusions.
All right. So, we're back to the program. What did we actually get? So, as one would
expect, you get a multiple-sequence alignment and those pair-wise alignment scores that
you saw previously as we talked about the progressive alignment. You also get something
called a cladogram and something called a phylogram. So, the cladogram is a tree representing
your sequences that is an estimate of the phylogeny. So, it gives you a branch structure.
All of the branches are of equal length so you don't have a sense of evolutionary time
here. It gives you a first glance at common ancestry but, again, we don't have a sense
of how much time has elapsed that set for each one of the taxa on the tree. The phylogram's
a little bit better. Again, an estimate of the phylogeny, the branches are not of equal
length, so the branch lengths are each proportional to the amount of inferred evolutionary change.
When we look at the alignments, we already saw a first glimpse of this when we saw JalView
in the context of Pfam earlier. What it tries to align. So, it tries to keep aromatics together
basic and acidic side-chains, catalytic sites, and so on. You'll see that the interpretation
of these is empirical. There are no rules I can give you here like I gave you for BLAST
previously. We're going to just take a look at it and see how good it is. But we do at
least have one indicator that's going to appear in the alignment and that is these symbols.
If you have an entirely conserved column, you'll see an asterisk noting that particular
position. You want to, as a rule of thumb, see that in about at least rather 10 percent
of the positions. If we have a conserved position, places where the physical chemical properties
are strongly similar, you'll see a colon on that position. And if you see a dot, that
is what they call semi-conserved, so weekly similar properties probably encompassing one
or more of the classes from the previous slide. Okay.
So how do we do it? Here's the URL again, this takes us off to EBI. Here, I have put
in five sequences, in this case, all in FOS-A format. From our webpage that has the sequences.
For a set of FOS-B proteins. The input is very easy because it's really just the sequences.
We're going to take a look at more options just to see. You can see what's there. Here
are where the settings are for the iterations that I talked about. The things that will
make sure that your progressive alignment is less prone to error. We're going to leave
those as set. The only thing I've changed here is the output format to clustal with
numbers and that's just going to number our alignment that we're going to look at in a
minute. We're going to go ahead and submit that. And this is what we get back. So, here's
our alignment. Our five sequences are aligned. You'll see the numbers along the side, so
that's the clustal with numbers bit. So you know where you are in each sequence. The color
key matches what's here in the inset, so the reds are the small residues, the blues are
the acidic, and so on. Across the bottom, you will see where the conserved and semi-conserved
and absolutely conserved positions are. So, it's the asterisk, the colon, and the dot,
in that order, and places where that's not seen, you just don't have a mark on those
columns.
We're going to take a little bit of a different look at this by going up to the top where
it says phylogenetic tree and if we click on that tab, here we actually see the cladogram
that I was telling you about before, so sort of the first hint at the relatedness of these
proteins. You'll notice that the branch lengths are all pretty much the same so that we're
not getting any sort of sense of evolutionary time here. If we click on "real" here, that
will now convert this to the phylogram and you see how the map has actually changed.
Now, both of these look very different than what we're going to see at the very end of
the lecture when we actually construct a phylogenetic tree, but it gives you, again, sort of a first
glimpse at what we're going to be looking at.
To get there, we're going to go to this tab that says "result summary" and so there's
a bunch of links to the actual files if you want to download those to take them into another
phylogenetic analysis program. They're in the right format that you'll need to go right
into those programs, but what we're going to focus on instead here is this box here
where it says "JalView." And so you just click on the button that says "Start JalView." This
will launch a Java Applet that we're going to now use to actually take a look at our
alignment in a little bit more depth so we can use it to manually edit the alignments.
If we don't like how the alignment has been put together, we can actually shove lines
back and forth. So we can introduce gaps or we can take gaps out; we can color the residues
in a variety of ways, look at pair-wise alignments, do consensus sequence calculations. If we
see that we have an overrepresentation of a particular sequence in the alignment, we
can just go ahead and take it out. And the most important thing, at the end of the day,
calculating the tree.
All right, so we've clicked the JalView button. The viewer has been launched. You'll see a
window that looks like this. This is the default view. Our five sequences going across the
top and we have some indication of the quality of the alignment position by position as we
go across. So, conservation just tells you how well the alignment is at that particular
position. It's an indication of percent identity. So, the higher the bar, the better the alignment
at each position. The quality is another way of looking at that same metric based on blossom
scores. And finally you have a consensus sequence going across the bottom and this is, again,
based on the percent identity. Here is where the plus-sign actually represents that there
has been no consensus residue identified for that particular position.
We're going to go ahead and change the menu options. If you squint real hard, you'll see
that there's a bunch of very small type up here with a bunch of pull-down menus and that's
where we're going to change things. So, you'll see the upper part of the slide change. So,
we're going to go to the color menu. There is a sub-menu that's called "percent identity,"
so you pick "percent identity." And that changes the representation. So, everything is in now
different shades of blue. The darkest of the blues are the most conserved regions going
to the regions that are just still rendered in white. So, this is a very quick way to
identify new motifs, candidate regions that might have structural and functional significance.
So, here you want to look for blocks that have higher absolute sequence identity because
there's, obviously, going to be evolutionary pressure to keep these regions conserved.
So, going beyond position by position, are there entire blocks that have to stay together?
If we want, we could do just a pair-wise alignment for any of the sequences in the set. So, I've
just selected the first two for mouse and human, and so you go to the calculate menu
and pick pair-wise alignments. And you would get this alignment. It tells you the percent
identity -- 95 some odd percent -- and the alignment in the reminiscent BLAST format.
The final step, we're going to finally generate an evolutionary tree. So, if we go to this
calculate menu again, calculate tree from that menu. We go to neighbor joining using
blossom 62 and now we have what probably is the fastest way to generate an evolutionary
tree without having to go to another piece of software. And you can see this representation
looks very different than what we saw earlier for this same example on the webpages. You
can actually put evolutionary distances on these to give you an idea of the branch lengths.
But I would use this as a first approximation because obviously there are methods out there
that are much more robust at generating these kinds of trees. But, as your first glimpse,
this is a great place to start. Okay.
So, obviously what I want to reinforce here, this is the most commonly used method. We
use it a lot. It's not the only method. There's a lot of other methods that are out there.
They all have their strengths and weaknesses and in practice we use different methods to
make sure the alignments are as good as we can get.
A tool that's going to help you with that is another program that I want to tell you
about called Tcoffee. The coffee grinder comes from their website. And so what this actually
does is takes the multiple-sequence alignments a little bit further because you can combine
sequence information, profile information, and structural information all at once. Whether
you're looking at protein structures, RNA secondary structures, so this really is great
to RNA sequences. It has some specialized algorithms that are fine-tuned for specific
classes of proteins. So, if you're looking at transmembrane proteins or non-coding RNAs
and so on where the rules are a little bit different, this is a good place to look to
align those. What I particularly like about this method is it allows you to take output
from other methods, like clustal, and other things that are out there and combine them
into a single master alignment. And what that master alignment is going to quickly tell
you is where there are inconsistencies between the methods. So, if there are places from
method to method to method where you now have an inconsistency, give special attention to
those regions. You might have to hand curate those alignments a little bit to improve upon
them, so it's a quick way to zero in. Here is the URL for Tcoffee. They've just, the
team behind this in Cedric Notredame's lab has just published a tutorial and the reference
for that is on the slide. So, that will just take you through, step-by-step, how to actually
use this method.
Okay. So, with that, we end our tour of sequence analysis over these two weeks and I always
feel like it's important to end on this slide. The black box that I alluded to very early
on last week. Unfortunately, as we all know, you go to a webpage, you see a box, you stick
your sequence in, you hit the button. Okay? And it becomes a very mechanical approach
to sequence analysis where you really don't know what's going on inside the black box.
Code in your sequence, get out your results, and hopefully what you've started to get a
better appreciation for over the last two weeks is that the inspection is really the
most important part of the process. And the only way to be able to do that is to really
understand not at a gross technical method, but to have some understanding of what the
methods are actually doing in the background so that you can take the results from these
methods, couple them with your own biological knowledge, never leave your biological knowledge
at the door. It's an important part of this because remember these are predictions. You
have real data from the laboratory. Put them all together and that approach will always,
always, always serve you well. So, with that word of advice, I want to just give you one
additional resource which I think is very cool.
So, David Sorrels who was at Penn for many years and then went on to Smith-Clyne [spelled
phonetically] recently published this perspectives piece in Plass Computational Biology. The
reference is on the slide. And why I love this is because a lot of people don't have
access to courses like these or tutorials, hands-on workshops back in their individual
institutes. So, this is a great way to teach yourselves how to do a lot of the techniques
we're going to talk about in this course that we don't necessarily have time to cover. What's
very, very nice about this is what David's done is he's created five curriculum tracks
and the five are shown on the slide there. So, if you're interested in doing analyses,
there's a subset of courses he recommends that you should watch online. So, very much
in the Course Sarah [spelled phonetically] or the Kohn Academy [spelled phonetically]
model where you can just sit at your laptop, at your computer, download some handouts,
and learn about all of these techniques. If you're interested in building the tools, there's
a [unintelligible] Informatics Tool Track and so on. So, this is a must read. Consider
it assigned reading. Download this particular paper and I think you'll find some very, very
important and very, very useful things in this manuscript.
Okay, so with that, we're going to shift our focus next week back to the level of whole
genomes. As I look over to Dr. Wolfsburg [spelled phonetically]. Terra's going to take over
next week and we're going to start talking about how we're going to visualize and analyze
genome scale data. Identifying important sequence-based features and using that information to its
best advantage. So, with that, I thank you for your attention. I'll take questions at
the podium, and thanks for joining us this morning.