Biological Sequence Analysis Ii - Andy baxevanis - 2014

Andy Baxevanis: Okay. Good morning, everyone. Thanks again for joining us for this third week of Current Topics in Genome Analysis. We're going to pick up right where we left off last time around in looking at different ways to think about sequence data. First, we have to get the requisite disclosure out of the way. Once again, I don't have any relative financial relationships to disclose this morning. Okay, so where we left off last week was we were talking quite a bit about homology searches and in thinking about that, we spent a lot of time talking about blast. And you can think about blast as essentially a one against one way of looking at sequence comparisons. You start with a sequence of interest and you now compare that one by one to a collection of sequences in some public database or perhaps in a list of sequences that you've compiled yourself. So, really, again, one against one. Another approach for doing this is to instead look at collections of protein families, and look at those in what you could call a one against many, or many against one sort of way. The reason for doing this is we can gain a lot of information by looking at the collective characteristics that we can find by considering an entire protein family all as a single entity. We can identify new members of a protein family that may not have been found using the standard blast-type searches. We can also find some commonalities in domain architecture by doing so. So, we're going to discuss four methods this morning. Two that are one against many: Pfam and CDD; and two that are many against one, the other way around: side-blast and delta-blast. So, like we did last week, let's just go ahead and start with some definitions. So, the first term we have to define is something called a "profile." And all a profile is, is a numerical representation of a multiple sequence alignment. So, we're going to take a set of aligned sequences and represent that as a numerical matrix. The reason we can do this and capitalize this on this in our sequence comparisons is because it allows us very quickly to identify patterns or motifs that are conserved across the entire protein family. The common characteristics that define that protein family. What that, in turn, allows us to do is find similar between sequences that have little or no sequence similarity, and I'll give you the same example I gave you last week with respect to the homeodomain proteins. If you look at that protein family, the family is characterized by a 60 across -- 60 amino acid across -- multiple sequence alignment. But there are very few positions in that block of 60 that are absolutely conserved across the entire 60 amino acids. And, so, by looking by traditional blast searches, more than likely you wouldn't identify all of the members of that particular protein family. So, when you're trying to look at distantly related proteins, these are now the tools that we're going to use to try to tease those out. So, these are really good techniques to have in your arsenal. So how does this actually work? How do these profiles get constructed? So, let's consider we have a multiple sequence alignment here. So, we've got all of these sequences. Ten amino acids across. You'll notice some of the positions are more conserved than others. So, we've got a G that we always see in the 10th position. In the next-to-last position, we normally see a proline, but sometimes there's a threonine thrown in for good measure. So, we look at this multiple sequence alignment and we're going to ask ourselves four questions: which residues to we see at each position, so which ones can or cannot be at a particular position in this protein family; what's the frequency of those observed residues; how often do you see a given residue; is it an absolutely conserved position or is a little bit of wiggle allowed, which positions are conserved, and where can the gaps be introduced. Once you answer all four of those questions, there's an automated method that will take these multiple sequence alignments and generate something called the position-specific scoring table or a position-specific scoring matrix. And this is now what we're going to use as the basis for comparison to an individual sequence of interest. Let's just quickly talk about what all of these numbers mean, the same way we talked about the blossom matrices last week. So, again, if we look at position ten here, we see that there's always a G in that last position. I want you to imagine that this alignment is now flipped onto its side, so we've got position one to ten going across and the alignment position one to ten going down in the matrix. So, if we have a sequence of interest in the 10th position of that sequence if we see a G, we just find the G here. We find the corresponding G there. Find the value at the intersection and, in this case, we're going to assess 150 points for that exact match. In the next-to-last position, we've got, again, a proline in most places but the occasional threonine. So, if we do this same thing and look for the intersection of the proline in the consensus sequence versus the proline in the sequence of interest, here we've got, in this case, 89 points. So, the same type of thought pattern that goes behind the blossom matrices that you give the most points for an exact match. You still assess a positive value for conservative substitutions, just not as many points as you would have for the exact match. Finally, if you have a residue in your sequence of interest that doesn't match the consensus at all, you'll start to see negative values in the table in the positions where there's really a lot of variability, you'll see the positive values go down. So, again, a way to represent this multiple sequence alignment as a series of numbers. Again, you don't have to calculate these by yourselves, but you can now use these to take a single sequence and compare them to hundreds of these position-specific scoring tables. The other definition for this morning is something called a "pattern," so let's go right to the example. So, we've got this very interesting nomenclature up here on the screen, and we'll just take it position by position. So, in the first position here, you see two letters with square brackets around them. And that just, when you see the square brackets, it means one or the other. So, one of the things in the square brackets. So, in this case, we have a pattern that begins with a phenylalanine or a tyrosine in the first position. Where you see the X that means any amino acid can occupy the second position. We just have a C here all by itself, so that means you must have a cysteine in the third position. X followed by a number, so that just means two of those. So, any two amino acids. The curly brackets, as opposed to the square brackets that mean one of the things in the brackets, the curly brackets mean anything but what's in the brackets. So, not valine or alanine. Again, we have an open position that could be anything. And, finally, three histidines at the end. So, the difference between the profiles and the patterns is that the profiles give you information both on position and frequency where the patterns only give you information about position. So, this just becomes a simple pattern-matching exercise to find family members or assign a new protein of unknown function to a particular protein family. Both of these are incredibly useful ways of approaching the problem of assigning characteristics to an unknown protein. So, let's go ahead and put these into practice by talking about the first method, the first database, and that's called Pfam: protein families. So, what Pfam contains is a large collection of multiple sequence alignments based on protein domains and conserved protein regions, and because those regions are conserved, they more than likely have some sort of structural or function importance? When you look at one of these Pfam entries, you'll see some very good information. You'll obviously see the multiple sequence alignment that was used to derive that matrix. Protein domain architectures. You'll get an idea of a species distribution of where that particular protein family can be identified. So, is it in certain species and not in other species? Information on known protein structures and links to other relevant databases. As we go through all of the things I'm going to show you this morning, you'll see references along the bottom, and I would encourage you to pull those references to get a more in-depth view of each of the things we're going to discuss this morning. Okay. So there are actually two databases within Pfam: Pfam A and Pfam B. So, Pfam A is the better of the two because they're based on a curated multiple sequence alignment, so someone has collected information based on their biological knowledge, created a seedal alignment, and then has used that alignment to gather in the rest of the members of the protein family. The methodology used to do that relies on a program called HMMER that Sean Eddy at the Janelia Labs have developed many years ago. And because of the way this construction is done, because of the curated nature of the method, the hits are likely to be true positives. There's also something called Pfam B. The difference between the two is that unlike A that is based on curated alignments, B is based on automatically generated alignments from database searches. So, those are deemed to be of lower quality, but you can certainly take a look at those if you don't find anything matching your sequence of interest in the Pfam A set. So, the old documentation just always compared Pfam A to Pfam B in this way. They would say that Pfam A, you would have a handcrafted multiple sequence alignment, so you should think of that as like a nicely handcrafted beer. You know that it has much more care given to it and by virtue of that, it's of much higher quality. Okay. As with last week, all of the examples I'm going to take you through this morning, you can do them yourself when you get back to your labs. So, this is the URL for the webpage that has all of the sequences. So, again, I would very much encourage to do this when you get back because it's one thing to watch me do it. It's another thing to actually have your own hands on the keyboards actually going through the exercises. Okay. So, we'll go to the Pfam website. Here is the URL over at Sanger. And very simple homepage. You can do a sequence search -- we're going to do that in a moment -- but you can also look at members of a family. View a clan, which is just a group of related families. Structures and so on. So, we're going to go ahead and just do a basic sequence search. So, by clicking on those words pops up a box. Nothing else. Okay? As with the examples we worked last week, there are always options we can set and default values you can change. The way you do that here is there is a very easy-to-miss hyperlink right here that says "here." Okay, so we're going to go there. And take a look at what we get. Not a lot of options, but a couple more things you can do. So, we're going to go ahead and put our query sequence in the box from the example webpage. You can either run the search using an e-value cutoff. So, here the e-value is 1. The rules that I gave you last week for BLAST govern here as well. Okay? So, we're really looking for things that are at 10 to the minus-third or better at the protein level. But we're going to, just for the sake of the example, leave that at one for now. The gathering threshold really applies to a cut-off that was used by the curator to develop this particular profile so you can safely ignore that. If you want, you can search for the Pfam Bs by just click on the box, but we're going to just leave that the way it is right now. Okay, so we'll go ahead. Submit the search. And this is what we get back. So, just a quick overview. This represents your sequence of interest, so there's a very light grey bar in the background. And you'll see that it identified, in this case, a P4-50 domain. But not all of one. You'll see that this side is rounded, this side is jagged. So that means you've got part of it, but not all of it. If we look at the significant matches, cytochrome [spelled phonetically] P4-50 was found from position 41 to 500 in our sequence of interest. It matched the profile from 1 to 457, so the reason this is in red is that's to communicate to you that it didn't match the entire length of the motif, just part of it. Below, you'll see what's marked insignificant Pfam A matches. So, two hits come up. You do want to take a look at those, even though they don't meet our statistical criteria, the e-values are here. But you might have, again, biological information from research you're doing in the lab, from things you've read in the literature that might deserve giving these some additional consideration. So now let's just concentrate on that P4-50 hit. So, if we now click on the view link from the previous page, we get an expanded view here. And a lot of text. Okay. So, let's just focus on this part of the slide. So, we start off and the first row is your sequence of interest. It's just marked "seq." The second row is the consensus sequence that's based on that position-specific scoring matrix. So, any place you see an exact match, you'll see the letter repeated. Any place where there's no match to the consensus, you'll actually see a plus-sign in this example. I'm sorry, that's indicating where you have a conservative substitution. The PP line -- posterior probability -- you will see numbers here. The numbers indicate to you how good the match is at each position and the key is up here above. So, it tells you that if you see it's a zero, you've got 0 to 5 percent confidence. One means 5 to 15 percent and so on. So, you really want to see as high a number as possible in those positions. The star means that it's the best you can actually get and in most of our positions here, we have asterisks going all the way across. Okay. So, all right. So now we can go directly to some more information about this P4-50 domain just by clicking on that particular link. If we go to the Pfam tab at the top, we get a nice executive summary. If you knew nothing about what side of Chrome P4-50s were, right there you get a quick summary of what they are, some of the most important literature references that you should be reading before you even start doing any work on this particular protein family. You also get a representative structure on the side here. A different one will come up every time. And you can just go through them either by pulling down the pull-down menu or just click on "View a different structure." It will just take you through some of them. Now, if we look at the top, it tells us how many hits are actually found in this particular protein family, how many members there actually are. So we find that there's 282 different architectures comprising just shy of 40,000 sequences. So, let's go ahead and take a look at some of those things. So, we're going to go down the left sidebar. We were at summary, we're going to domain organization. And so we see here all of these schematics, all these graphics, showing different combinations of protein domains that are found in subsets of those 39,592 sequences. So, in the first case it says that 33,600-some odd of them have this kind of orientation, this kind of domain structure where you've got a P4-50 domain and another domain that's in front of it that's represented by this little orange box over here. If you wanted to see all of those sequences, you would just click on the show link and it would give you all of those right away. So, sometimes when you don't see absolute sequence similarity, it's good to look at things at the domain level because you can start to infer function just by looking at "Do I have a similar domain architecture as the sequence that I'm starting with?" From a clinical point of view, it's important to think in these terms because you can look at this and start to ask yourselves "Well, do I have a mutation, a clinically-relevant mutation that may be knocks out one of these domains that this domain no longer forms?" So, you either have a structural consequence or a functional consequence because you have that disease mutation. Okay, 39,592 sequences. Let's take a look at a subset of those. So, we now go to the next link down. Alignments. And, on this page, you can take a look at the actual alignment in a number of different ways. Every place where there's an alignment available, you'll see a check mark. And if you were to take your cursor and mouse over that particular check mark, it would actually switch over to reading as a view. So, it will just change into a hyperlink. And so the one that I've moused over here is under the seed sequences. So, again, the seed alignment is what that expert put together. The known members of the family as the seed for bringing in the rest of the members in this particular family. And if you go ahead and click on that, it will bring up a viewer called JalView. This is within the context of the web browser. We're going to see it later on in a more interactive fashion. The colors do actually mean something here. So, in the first set of colors, the key is on the right-hand side. It gives you some sort of coloration that gives you an indication of the physical, chemical characteristics of the residues in those particular positions. Things that are more commonly going to be able to substitute for one another, just by looking again at the entire constellation of proteins. What we know about substitution patterns. So, things that are either small or hydrophobic. Things where you see charged amino acids: histidines, tyrosines, and so on. Occasionally you will see rows that are marked with an SS in parentheses. And the SS stands for "secondary structure." So, these are places where there is known secondary structural information. So, the position of the alpha helices or the position of the beta strands are known. By virtue of that, that makes for a much better alignment because you can use that information to better predict what is supposed to be where. So, it really goes a long way in helping these alignments be much more robust. And we'll talk about that again towards the tail-end of the lecture. So the key for that is also here. So, in the rows where we have that SS designation, any place you see the red blocks, those are the alpha helices. Any place you see the yellow blocks, those are the beta strands. Okay. Let's go back to the Pfam page. And take a species-centric view of this information. So, this very colorful representation here is supposed to represent all of the organisms on the planet. You see the color assignments on this side of the slide here. If you look very closely, you can see that the metazoa are here followed by the chordates followed by the mammals going there. And you can just take your cursor and just click on those particular regions to just focus in on those particular taxa. What I've highlighted here is an organism called nematostella. This comes from the phylum Cnidaria. These are organisms that are unified by the fact that they have stinging cells, cnidocytes that they use to capture their prey. So, this includes hydra and jellyfish and corals and sea anemones. By highlighting that box, it also shows me the lineage up here that gets you from the root to that particular organism and you can very quickly just download all of the sequences from this protein family, from that particular organism, with a single click of a button. So, it's a nice tool to have if you're looking again at a particular organism and you just want to find all of the -- or a particular class of organisms and phyla and whatever. To quickly pull all of that information together, rather than having to generate those sequence files one by one. Okay. We'll pretend we've gone back to the summary page here and now we're going to just scroll down that page to where we have a bunch of external database links. And the one I'm going to take you to is the one right in the middle called ProSite and this is where we're going to see the first example of a pattern. The designation that we talked about earlier where we know what the positional information is for each member of a protein family. And if we look at the big box at the bottom, the very first item in that box gives us the consensus pattern for this particular sitochrome P4-50 family. You have links to other databases as well, but the most important thing, I think, on this page is the part that says "Expert to contact by email." Okay, so these are folks who keep these entries up-to-date and have agreed to serve as a reference point if you have any questions about this particular protein family. So, you would be somewhat foolish not to take advantage of this kind of resource because this person is making themselves available to you. If you're studying sitochrome P4-50s, you have a question that has befuddled you, you can't find the answer in the literature, click on the link. Send them the email, and they'll get back to you. So, it's a good way to actually get to a human resource rather than just looking at all of the database entries. Okay. So, the next thing we're going to talk about is related. The conserve domain database or CDD. So, what we're going to use CDD for -- same sort of goal here is to look for conserve domains in a protein sequence. The main difference between Pfam and CDD is that in the calculation of the CDD entries, what they do is incorporate three dimensional structural information and by doing that, again,, what it helps you do is to define domain boundaries and it allows -- gives you a really powerful tool for refining the alignments. The data comes from a number of sources, including Pfam, but not Pfam B, and a bunch of other databases that are focused on collecting information on groups of orthologous proteins. So, the way the CDD searches are conducted is through a variant of BLAST called RPS-BLAST -- reversed position specific BLAST. So, basically, it's just a glorified way of saying we're taking a single sequence, our query sequence, and we're going to search that against a bunch of pre-calculated position-specific scoring matrices. The same way we did with Pfam A. The important thing to keep in mind is the method that is used by CDD is not the same method that's what is used by Pfam. Okay? So, similar to the advice I gave you last week about using multiple scoring matrices when you do your BLAST searches. When you're doing this kind of search, looking for protein domains, you should use both of these methods when you're delving into these kinds of questions, just to hedge your bets. Okay. So, here we are at the CDD homepage. URL, again, up in the corner. We're going to take a sequence from our sequence webpage. In this case, this is the sequence of the deleted colorectal carcinoma gene, DCC, the protein that's encoded by that gene from human. We're going to leave all of the options exactly the way they are, but if you wanted to only search one of the databases from the list that I showed you earlier, you could select just a single one. But we're going to just throw everything at it. Go ahead and click on the submit button. And off we go. So, here we get a display. The top part is very reminiscent of what we saw last week with our BLAST searches. So, your query sequence is represented by the grey bar going across the top. Position one to position 1447. Every place that a domain or a motif has been found is marked in relation to the coordinates on this line. So, there's one-to-one correspondence between all of these shapes and each one of these lines in the table. We're going to focus in on the very first one, this neo-geneine domain right here at the C-terminal end of our DCC protein. In order to learn a little bit more about that, there's just a plus-sign over there. We're going to go ahead and click on that. It'll expand that part of the page and we get a very quick, again, executive summary of what this actually is. You find out that it's an immuno-globuland-type domain. And how it is actually composed. The alignment shows you the alignment of your sequence, the top line with the CDE domain matrix. The second line, every place you see a blue letter, that means -- excuse me, every place you see a red letter, that's an exact match. Every place you see a blue letter that is a mismatch. Okay. We can get some more information about this particular domain by looking at this link. It's very hard to see here in the blue following the name of the domain. And so if we were to click on that, it would take us to now the domain page. Again, the same information we saw before. The most important part of this page, to me, is the stuff that's all the way at the bottom. So, we're going to pretend that we've scrolled down and here we have the sequence alignment. So, you can see which blocks of this particular protein domain are conserved all of the members across this particular family. You can actually download the sequences from here and this now puts you in a position of not even having to construct a multiple sequence alignment. We're going to talk a lot about that today. But you now have the tool in your hand to be able to start doing multiple sequence alignments and we're going to talk about, again, a lot of that later in today's lecture. So, again, a good way to just be able to pull homologous together without actually having to go through the routines that we talked about last week. So, again, something else for your own material. Okay. So, we come back here. We've talked about the one-against-many techniques. Now, let's flip it the other way and talk about many-against-one or many-against-many, depending how you want to look at it. The first technique we're going to talk about is another variant of BLAST called PSI-BLAST and the PSI stands for "Position-Specific Iterated" and what this allows you to do is to start to identify distantly related members of a protein family. That, again, you might not have found in the context of a regular BLAST search. The BLAST searches start to fail around the 25 percent mark. So, this is the reason I gave you that cut-off criteria last week for protein-protein searches that one of the criteria you're going to apply is to look for at least 25 percent sequence identity. Because as you get below that, you really can't be too sure that those things are actually proper, real matches. That they might actually be false positives. So, this is another thing you can throw at it to maybe push the boundaries a little bit. This is a really nice version of a profile-based search and the way it works is like this. In the first step, you do a regular BLAST search. So, we're going to start with a protein sequence and we're going to search that protein sequence against a database of our choice. What will then happen is we'll get back our hit list, the same way we did last week, but based on the results of that hit list, what's in there, we're going to take all of those hits, take all of those sequences. Align them. Make a position-specific scoring matrix. And then that matrix becomes the input to the next round of searches. So, this is now an interactive technique that each time you go around, you're going to gather in more and more sequences. You're going to keep adding them to your alignment. You're going to keep recalculating the matrix. You're not, it is for you. And by doing that, we're going to just keep increasing, increasing the resolving power of that position-specific scoring matrix in finding new members of the protein family. All right. So, in practice. Here we are at the BLAST homepage. This hopefully looks familiar to you now, since this is a BLAST-based search, we're just going to go to the protein BLAST section of the basic BLAST collection that we have. This is, again, the BLAST search page. We put in a sequence here. In this case, this is the high-mobility group protein from humans, so this is a DNA binding protein. As before, we can select which database we wanted to search. So, we have a pull down here and this time around, I picked something called Swiss-Prot. So, let's talk about that for a minute. So, what Swiss-Prot tries to do is something very similar to Ref-Seq that we talked about last week. So, you'll recall that Ref-Seq is an effort by NCBI to represent each and every sequence in the central dogma once and only once. So, it would have a single entry for each DNA, RNA, and protein sequence. Non-redundant. What Swiss-Prot does, it's a similar effort. A long-standing effort that started many, many years ago by the folks at the Swiss Institute for Bioinfomatics. Only on the protein side of the house. But, pretty similar in what its end goals are actually are. So, you'll have by definition, the database is non-redundant like Ref-Seq. There's on-going curation by the staff at EBI, the European Bioinfomatics Institute. But also by external experts. People similar to the ones that you saw in the Pro-Site entry that take it upon themselves to make sure that these entries are kept up to date, the folks who are actually working in the field, doing experiments, having their hands in it every single day. What you'll see in the entries of those Swiss-Prot entries is that you'll have some extra fields there, including comment lines, and I always point this out to people because, again, that's the nice executive summary that comes to you by the expert in the field. So, again, you'd be foolish not to take use of that information, because the expert has put in there what they think it's most important for you to know about that particular sequence. The way you know that a sequence is a Swiss-Prot sequence goes back to the accession number. And, again, the accession number is the unique identifier, the Social Security number for a particular sequence. So, here is how the representation is. And you already know what the square brackets mean, so it's one of these letters. So, it either starts with an O or a P or a Q, and then has a five-digit number following. Okay. All right. So we come back to our page here. What I've also done in this case is I've limited the organism selection here to just the vertebrates, so remember you always have the ability to search subsets of the database based on what organismal spread you're interested in. The program selection is in the red box, so what we were doing all of last week was the BLAST-Ps, this time we're going to pick the second one down PSI-BLAST. As usual, we're going to take a look at the algorithm parameters to maybe tweak things to our best advantage here. The e-value threshold, by default, is 10. We're going to change it to 0.001 per our guidelines from last week. We can, again, filter out our low-complexity region. So, low-complexity regions are those homopolameric run places of bias composition in our sequences that could confound the BLAST algorithm. We're going to set a second e-value threshold, a PSI-BLAST threshold, also to 0.001. You'll see the default here. Again, what I like to do is always check this box here at the bottom so that my results are coming up in a new window so if I ever have to come back to this search page, it's still there. I don't have to keep hitting the back button to get here. And then we're just going to go ahead and BLAST it. All right. So, hopefully this looks more familiar to you now, given last week's lecture. Exact same format. So, we have our query sequence. One to 215. Two HMG box domains have been identified here, so two punitive DNA binding regions. And our overview of the BLAST results by score. We're going to just scroll right down to the hit list. Again, we have our list of alignments here. So the description, the definition line -- the very short one-line identifier of what each one of these sequences actually are: our score values. Remember, the score values are important but we're not going to look at those. Instead, we're going to look at the e-values which are calculated from the score values. The great neutralizer that allows us to search to compare any one BLAST search to any other BLAST search. And these are now ranked from lowest to highest. You'll see this time around two extra columns, and the one that's highlighted says "select for PSI-BLAST" and all of the check marks are checked off. So, you'll remember how this works. We start with a sequence. We do a -- the first round is just a regular BLAST-P search. We now have to get around the business of specifying what is going to go into the next round. What is going to be used for calculating that matrix which will now be the query for the next round of searching? So, we're going to just go ahead and include everything. That's what all the check boxes are there for. They're automatically checked by default. Now, in this case when we look at the e-value column, we're still well above the rules that I gave you last week, the 1x10 to the minus-30. But let's say we weren't. Right? You would see everything above the line would be checked off, everything below the line would not be checked off. But, again, this is where your biological knowledge really comes into play. You may have information that says something below the line should be included and, perhaps, that something above the line should not. So, this is where you have to look at this. Check off some additional boxes, uncheck some of the other boxes. If you see something that you know is not correct, that you know is actually a false positive. All right. To now do the next round, all you have to do is click on that Go button. It will calculate the matrix based on all of these sequences. And so here we now have the results page again. But now we've got a new column here "Use to build PSSM." This just tells us all right, what was used in the last round to get us here. If we scroll down, we're going to see a bunch of yellow lines. Okay, there's nothing checked there because these are the ones that were just found. Okay, so these are all the new sequences that we found by virtue of having all of this information from the protein family collectively. So, we're able to tease out additional sequences here. Most of these are for something called the sex-determining protein. So, this is TDY or SRY. These are members of this family, so we're going to go ahead and include these in the next round of searches. Okay. And we're going to just go around and around as many times as it takes. Two, three, four. And in this case, five. Okay. Where we just keep just hitting that Go button at the bottom. You'll know you're done when it says "no new sequences were found above the 0.001 threshold." So, we started in round one with our regular BLAST-P search. We found 77 sequences. By going through this process, we ended up with an extra 100 sequences, so that's a pretty substantial number. Okay? Things that we would not have found using just the traditional BLAST search or even just by using the default values. So, this is why it's important to be able to know what those defaults do and changing them when necessary. So, I think this demonstrates very well how you can use the collective characteristics of a protein family to really find these distantly related members. Again, things you wouldn't find otherwise. Always, always, always please check the statistics, especially in the percent identity column, because even though our criteria that the method applies, we specified e-values, but we have not specified percent identity cut-offs, so you need to do that by eye. So, if you see things that are too low, just take them out. Okay. There's a newer variant of this. Got another member of the BLAST family called Delta-BLAST. And so this method is slightly different from what's used by PSI-BLAST but not grossly different. So, it's pretty easy to understand. Instead of doing a BLAST-P search in the first round. Sequence of interest against in our example Swiss-Prot. Instead it's going to take the sequence of interest and search it against all of the entries in CDD, the database that we talked about earlier. Okay? So, it's going to take that profile information that it finds, compute a new position-specific scoring matrix, and now use that as the input for round one. It's an iterative process, just like PSI-BLAST. In round one, it's searching against CDD. Starting in round two, it's just successive PSI-BLAST searches. So, from that point on, it's exactly the same. It's just, how is the first round actually calculated? What the authors set out to do with this particular method is really come up with a more robust way to find homologs, and you end up getting some really nice, high-quality alignments, even when you're starting to push the boundaries of sequence-similarity, where things are getting down right around that 25 percent Twilight Zone. The only caution I have to give you here is while the methodology is very robust, it really depends on having entries in CDD. All of the protein families may not be necessarily represented there. That information on homologous proteins just might not be there. I think currently there's about 75 percent of all known protein families are cataloged within CDD, at least based on their documentation. So you just need to keep that in mind. That you may not be looking at everything that's available to you. So, this is yet another case of you've got multiple methods available to you. Absolutely do both. I would encourage you to pull this paper because it shows you some performance data information, the usual Venn diagram representations where you see how many were found by one method, how many were found by the other, but then how many are in the overlap and there's quite a few that fall on those other sides. It really would behoove you to use both of the two methods, and the authors recommend that as well. Okay. All right, so that's it as far as individual sequences go. We spent most of the last two weeks talking about this, so now let's take this the next step and think about multiple sequence alignments, which are just in essence a whole bunch of pare-wise sequence alignments. Okay, so what can you learn by doing these multiple sequence alignments? As we've talked about many times, you can start to pick out conserved regions, patterns, and domains. Why is that of use to you? It can start to help inform you about how to design your experiments. So, where are the regions of a particular protein that you should be looking at? Where are the things that might be a little bit less important with respect to structure and function? Again, as we've harped on quite a bit, it helps you pull in new members of protein families as well. If you don't have these, it's very hard to do everything that's in the bottom part of the slide, so when we start talking about predicting secondary structure, yes you can do this on single sequences, but it's much more powerful when you look at a family of sequences. You cannot perform any sort of phylogenetic analysis unless you have one of these things, so even the idea of looking at evolutionary relationships becomes impossible without having these multiple sequence alignments in hand. And, again, as we've talked about, it gives us the input to many of these search methods. Okay. So, what do we need to think about when we are constructing one of these multiple sequence alignments? And these are themes that we've hit on several times already in the last two weeks. We want to keep in mind positions in our multiple sequence alignments, so just the columns of the alignment where we can have absolute sequence similarity. We're going to try to align as many of the common characters as possible from sequence to sequence to sequence. By doing so, hopefully we can start to think about conservation, so we'll start to see patterns where maybe there's only one or two or three amino acids in a particular column. That gives us an indication of where can you have conservative substitutions that don't adversely affect the function or the structure of the protein. Finally, it gives us some idea about structural similarities. So we can, again, as I've alluded to several times now, if we have secondary structural information, or information from either NMR structures or X-ray structures, we can use that information to fine tune the alignment. So, I know many of you, probably all of you at some point, have tried to put together a multiple sequence alignment, so we're going to go through some general guidelines to help you bring up your game a couple of notches. All right. So, the first thing is when we do these, we tend to concentrate on the protein level rather than on a nucleotide level and hopefully that's become a little more obvious to you as to why, as to why as we've gone through a lot of the considerations we've talked about. When we look at these protein alignments, they tend to be more informative quite simply because you have 20 amino acids that you're trying to align on the protein side versus the four nucleotides that you are trying to align on the nucleotide side. And so it makes for more accurate alignments. It goes back to the side-chain chemistries of each of the amino acids and all of the physical chemical properties that come along with that. Now, of course, there's going to be cases where you can't do that and so let's say you're going to be looking at regulatory elements. Things like that by necessity have to work with nucleotide-sequences and Laurel Nitzski, in two weeks, is going to spend much more time talking about exactly that. All right. So, this slide is just intended to give you the 30,000 foot overview of what you have to think about as you do one of these. And so we're going to take each one of these off in order to give you a better appreciation for how to approach this question in the future. So, you want to put together a multiple sequence alignment. You need sequences. Okay, so how do you actually pick them? First is how many? And so you want to keep that number relatively reasonable and the reason for that is when the algorithms that compute these multiple sequence alignments are global alignment methods. We've talked about the difference between global and local last week. The global alignments, you're trying to align the entire sequence along its entire length with every other sequence in the set. Because of that, this is computationally intensive. It takes a long time. When you start to put too many sequences in your input set, the methods just start to choke and you might start getting inaccurate alignments. The methods are getting much better with time. They're able to handle more and more sequences in the input sets, but it's something to be mindful of. The other thing is sort of a more practical issue that if this is the first step to doing a phylogenetic analysis, you have to keep in mind the next step in the process. That -- okay, this is now going to produce the input to the program I'm going to use to generate my trees. Same considerations apply there. The more and more that you have in that input set, the harder and harder it is to actually generate reasonable trees that can actually be interpreted. So, it becomes almost impossible when the data sets get too big. So, good starting point, 10 to 15 sequences. I'm saying ballpark upper limit here 50 to 100. I know that's a very wide range, and it really just depends on how either closely related or distantly related each one of those proteins in the set actually are. All right, the sequences should be about the same length and the way you get them there sometimes is to just trim them down. So, we talked about domain architectures earlier and we've seen results from our BLAST searches and we've seen things in CDD that show us where the conserved blocks are. You can use that information to now trim back your sequences -- BLAST, PSI-BLAST, whichever method you want to use -- to have a more sort of tidy input set going into the multiple-sequence alignment program. So, normally what will happen is the action tends to be in the middle. You can safely get rid of the end terminal end, safely get rid of the C-terminal ends, and come up with the alignment that you want. If you don't already know that there's some business going on in either the end terminal or C-terminal ends. Okay. All right. The degree of sequence similarity. So, depends on what question you want to ask. If you're looking for required or highly conserved amino acids, you want to push towards having much more closely related sequences. If you want to go the other way and study evolutionary questions, you want to use more divergent sequences. So, a good place to start, I think, is sequences that are about 30 to 70 percent similar because that's going to hedge your bets on both of these things. But I think that the last bullet really says it best. That the most informative of alignments you can actually get is when you have a sequence set where you have sequences that are not too similar but not too dissimilar. Because if they're too similar, well, you're not going to learn anything you didn't already know. And if they're too dissimilar, you're not going to be able to align them. Okay. So that's how you put the set together. We're going to put it into a program. We get back some results. You have to inspect those. And this becomes an iterative process. What I normally recommend people do is they start off their alignments using a small set of sequences, go on the lower ends of the ranges I gave you earlier. And go ahead and just do a preliminary alignment. And look at, all right, just by eye how many residues are conserved across the alignment, do the alignments actually make sense, do you see a neat block-type structure, and have you been able to get away with very few gaps being included. Remember that the gaps represent biological events. Those are either insertions or deletions, so you can't just keep inserting gaps without regard to biology. It's going to end up messing up the alignment and messing up the interpretation. If you've done this on a small number and the alignment's good, fine. Add a few more sequences, do it again, and you can just keep going around like that. If you find out the alignment's not so great, take out sequences where you see tons of gaps are being included, then realign. So, this is not just a one-shot deal. To do it well, you have to go through a few rounds of fiddling with the input set to get a good alignment. Once you have it, you can use some visualization tools that we have to look for key residues and problem regions. What I always think is a good idea is to cross-check against these expertly created multiple-sequence alignments and we've seen examples of that through our travels over the last two weeks. So, you know where to go to look for those. Again, if there's structural information that you have, absolutely use it because it's going to tell you really definitively where can you put in a gap and where can you not put tolerate a gap? So, if you have an alpha helical region, no gaps. If you see a beta strand region, no gaps in the loops. Fine, you can have them there. Okay. All right. Interpretation. So, now you've got it. It's to your satisfaction. What does the alignment actually mean? What can you glean from it? Any time you see an absolutely conserved position, you know that those are required for either proper structure or function. So they've been conserved for a reason. If you see a relatively well-conserved position, you can tolerate limited amounts of change. You can have a little bit of wiggle room there and not adversely affect the structure or the function of the protein. Now, I think this is where most people stop considering things. They're looking for what is in common but I think there's a lot to be learned for looking at what is not in common. The non-conserved positions where you can actually mutate freely, because this is really where the evolutionary innovation goes on. Where you can possibly give rise to proteins with new functions. And finally we've already talked about where the gaps can actually be tolerated. What that might indicate. So, let's talk about a particular program, the one that's probably most commonly used to do multiple-sequence alignments and that's called "clustal omega." And so as you would expect from a multiple-sequence alignment program, you'll get out multiple-sequence alignment whether you start with nucleotide or amino acid sequences. It's really fast in the calculations and as you'll see it's really easy to use. You can provide it information if you have structural information about where the gaps can go and where the gaps shouldn't go in advance. So it knows not to put them in certain places. And we'll see this in the context of a Java applet where we can actually manipulate the alignment a bit. Okay. Before we actually get to the program, a little bit of theory. So, how this works is using a method called a Progressive Alignment Method. So, let's say we have a sequence set. It's got a whole bunch of sequences in it. Rather than trying to align all of them all with each other all at the same time, we have to do it in a little bit more rational fashion. And the way the methodology does it is by just taking two sequences at a time, starting with the two most related sequences and then just building up the alignment from there. So, it will start with two. It will keep adding and adding and adding. To bring in the ones that are a little bit less related to the ones that are in the rest of the set. So we're going to start in the places where the alignment is actually easy and move to the parts where it's a little bit harder because the sequences are more divergent. I think an example helps here. So, we're going to start with four sequences: A, B, C, and D. First thing we're going to do is calculate a similarity score of each one of those to each other sequence in the set. So, that's the equation of how many alignments you get so here we have four sequences. It means we have to calculate six alignments. This is just to illustrate why if your sequence set is too large that the numbers of calculations start to go up really, really fast, so it becomes very computationally difficult for the algorithm. So, here's our four sequences again. And we're just going to put a little scoring matrix together to show us how identical each one of these is to every other one in the set. So, for sequences, six alignments from the previous slide. One, two, three, four, five, six. So, by looking off the diagonal which are the exact matches, we see the two most related pairs are A with B, 80 percent, and C with D at 92 percent. So in the first step, we're going to just take A, and align it with B. We're going to take C and align it with D. Okay. Once we do that, we're going to use this alignment, fix it. Take this alignment, fix it. And now represent AB and CD as single sequences. So, we've basically gone from four to two. Okay? We're going to then take those "new" - in quotation marks -- single sequences, align those, and keep doing that as many times as we have to, to keep adding the remaining members of the set. So, here we only had four. We're already done. But you get the idea that you're starting at the tips, merging each new pair of sequences as you go towards the root of this tree. Okay. So why is this good? Especially for large sets, you do the easy ones first because, as you already know, those are the simplest to align. But, more importantly, you're going to use the information from those easy alignments to help you align the more difficult ones. The ones that are more distantly related because as you start to pile up these alignments, you're going to get more information the same way that we did by looking at our position-specific scoring matrices that will help us later on in the process, get those tough ones into position. Disadvantages. We're going, there's a directionality to this process and we are fixing these new alignments as we go. If we make an error early on, that epic error is propagated all the way through. So, if that alignment is incorrect in a particular position or several positions, that's just going to stay fixed through the entire rest of the process. You're going to end up with a possibly bad alignment. Now the good thing about clustal and a lot of the other methods is that they've now introduced ways for iterations to take place. So, rather than just doing the alignment process ones, it will do it multiple times. To make sure that you don't get caught in essentially what's a local minimum. It's going to cost you some increased compute time, but it's well worth doing. Okay. Now, there's obviously there's other theoretical underpinnings for different types of multiple sequence alignment methods and so this is really all you need to know about this one. If you're going to use something different, please just take the time to understand how the method works because you just, again, don't want to end up with a bad alignment because you haven't selected the sequences properly or you haven't taken into account some of the fine considerations of how it's done. The last thing you want to do is be in a position where you're going to start to draw improper biological conclusions. All right. So, we're back to the program. What did we actually get? So, as one would expect, you get a multiple-sequence alignment and those pair-wise alignment scores that you saw previously as we talked about the progressive alignment. You also get something called a cladogram and something called a phylogram. So, the cladogram is a tree representing your sequences that is an estimate of the phylogeny. So, it gives you a branch structure. All of the branches are of equal length so you don't have a sense of evolutionary time here. It gives you a first glance at common ancestry but, again, we don't have a sense of how much time has elapsed that set for each one of the taxa on the tree. The phylogram's a little bit better. Again, an estimate of the phylogeny, the branches are not of equal length, so the branch lengths are each proportional to the amount of inferred evolutionary change. When we look at the alignments, we already saw a first glimpse of this when we saw JalView in the context of Pfam earlier. What it tries to align. So, it tries to keep aromatics together basic and acidic side-chains, catalytic sites, and so on. You'll see that the interpretation of these is empirical. There are no rules I can give you here like I gave you for BLAST previously. We're going to just take a look at it and see how good it is. But we do at least have one indicator that's going to appear in the alignment and that is these symbols. If you have an entirely conserved column, you'll see an asterisk noting that particular position. You want to, as a rule of thumb, see that in about at least rather 10 percent of the positions. If we have a conserved position, places where the physical chemical properties are strongly similar, you'll see a colon on that position. And if you see a dot, that is what they call semi-conserved, so weekly similar properties probably encompassing one or more of the classes from the previous slide. Okay. So how do we do it? Here's the URL again, this takes us off to EBI. Here, I have put in five sequences, in this case, all in FOS-A format. From our webpage that has the sequences. For a set of FOS-B proteins. The input is very easy because it's really just the sequences. We're going to take a look at more options just to see. You can see what's there. Here are where the settings are for the iterations that I talked about. The things that will make sure that your progressive alignment is less prone to error. We're going to leave those as set. The only thing I've changed here is the output format to clustal with numbers and that's just going to number our alignment that we're going to look at in a minute. We're going to go ahead and submit that. And this is what we get back. So, here's our alignment. Our five sequences are aligned. You'll see the numbers along the side, so that's the clustal with numbers bit. So you know where you are in each sequence. The color key matches what's here in the inset, so the reds are the small residues, the blues are the acidic, and so on. Across the bottom, you will see where the conserved and semi-conserved and absolutely conserved positions are. So, it's the asterisk, the colon, and the dot, in that order, and places where that's not seen, you just don't have a mark on those columns. We're going to take a little bit of a different look at this by going up to the top where it says phylogenetic tree and if we click on that tab, here we actually see the cladogram that I was telling you about before, so sort of the first hint at the relatedness of these proteins. You'll notice that the branch lengths are all pretty much the same so that we're not getting any sort of sense of evolutionary time here. If we click on "real" here, that will now convert this to the phylogram and you see how the map has actually changed. Now, both of these look very different than what we're going to see at the very end of the lecture when we actually construct a phylogenetic tree, but it gives you, again, sort of a first glimpse at what we're going to be looking at. To get there, we're going to go to this tab that says "result summary" and so there's a bunch of links to the actual files if you want to download those to take them into another phylogenetic analysis program. They're in the right format that you'll need to go right into those programs, but what we're going to focus on instead here is this box here where it says "JalView." And so you just click on the button that says "Start JalView." This will launch a Java Applet that we're going to now use to actually take a look at our alignment in a little bit more depth so we can use it to manually edit the alignments. If we don't like how the alignment has been put together, we can actually shove lines back and forth. So we can introduce gaps or we can take gaps out; we can color the residues in a variety of ways, look at pair-wise alignments, do consensus sequence calculations. If we see that we have an overrepresentation of a particular sequence in the alignment, we can just go ahead and take it out. And the most important thing, at the end of the day, calculating the tree. All right, so we've clicked the JalView button. The viewer has been launched. You'll see a window that looks like this. This is the default view. Our five sequences going across the top and we have some indication of the quality of the alignment position by position as we go across. So, conservation just tells you how well the alignment is at that particular position. It's an indication of percent identity. So, the higher the bar, the better the alignment at each position. The quality is another way of looking at that same metric based on blossom scores. And finally you have a consensus sequence going across the bottom and this is, again, based on the percent identity. Here is where the plus-sign actually represents that there has been no consensus residue identified for that particular position. We're going to go ahead and change the menu options. If you squint real hard, you'll see that there's a bunch of very small type up here with a bunch of pull-down menus and that's where we're going to change things. So, you'll see the upper part of the slide change. So, we're going to go to the color menu. There is a sub-menu that's called "percent identity," so you pick "percent identity." And that changes the representation. So, everything is in now different shades of blue. The darkest of the blues are the most conserved regions going to the regions that are just still rendered in white. So, this is a very quick way to identify new motifs, candidate regions that might have structural and functional significance. So, here you want to look for blocks that have higher absolute sequence identity because there's, obviously, going to be evolutionary pressure to keep these regions conserved. So, going beyond position by position, are there entire blocks that have to stay together? If we want, we could do just a pair-wise alignment for any of the sequences in the set. So, I've just selected the first two for mouse and human, and so you go to the calculate menu and pick pair-wise alignments. And you would get this alignment. It tells you the percent identity -- 95 some odd percent -- and the alignment in the reminiscent BLAST format. The final step, we're going to finally generate an evolutionary tree. So, if we go to this calculate menu again, calculate tree from that menu. We go to neighbor joining using blossom 62 and now we have what probably is the fastest way to generate an evolutionary tree without having to go to another piece of software. And you can see this representation looks very different than what we saw earlier for this same example on the webpages. You can actually put evolutionary distances on these to give you an idea of the branch lengths. But I would use this as a first approximation because obviously there are methods out there that are much more robust at generating these kinds of trees. But, as your first glimpse, this is a great place to start. Okay. So, obviously what I want to reinforce here, this is the most commonly used method. We use it a lot. It's not the only method. There's a lot of other methods that are out there. They all have their strengths and weaknesses and in practice we use different methods to make sure the alignments are as good as we can get. A tool that's going to help you with that is another program that I want to tell you about called Tcoffee. The coffee grinder comes from their website. And so what this actually does is takes the multiple-sequence alignments a little bit further because you can combine sequence information, profile information, and structural information all at once. Whether you're looking at protein structures, RNA secondary structures, so this really is great to RNA sequences. It has some specialized algorithms that are fine-tuned for specific classes of proteins. So, if you're looking at transmembrane proteins or non-coding RNAs and so on where the rules are a little bit different, this is a good place to look to align those. What I particularly like about this method is it allows you to take output from other methods, like clustal, and other things that are out there and combine them into a single master alignment. And what that master alignment is going to quickly tell you is where there are inconsistencies between the methods. So, if there are places from method to method to method where you now have an inconsistency, give special attention to those regions. You might have to hand curate those alignments a little bit to improve upon them, so it's a quick way to zero in. Here is the URL for Tcoffee. They've just, the team behind this in Cedric Notredame's lab has just published a tutorial and the reference for that is on the slide. So, that will just take you through, step-by-step, how to actually use this method. Okay. So, with that, we end our tour of sequence analysis over these two weeks and I always feel like it's important to end on this slide. The black box that I alluded to very early on last week. Unfortunately, as we all know, you go to a webpage, you see a box, you stick your sequence in, you hit the button. Okay? And it becomes a very mechanical approach to sequence analysis where you really don't know what's going on inside the black box. Code in your sequence, get out your results, and hopefully what you've started to get a better appreciation for over the last two weeks is that the inspection is really the most important part of the process. And the only way to be able to do that is to really understand not at a gross technical method, but to have some understanding of what the methods are actually doing in the background so that you can take the results from these methods, couple them with your own biological knowledge, never leave your biological knowledge at the door. It's an important part of this because remember these are predictions. You have real data from the laboratory. Put them all together and that approach will always, always, always serve you well. So, with that word of advice, I want to just give you one additional resource which I think is very cool. So, David Sorrels who was at Penn for many years and then went on to Smith-Clyne [spelled phonetically] recently published this perspectives piece in Plass Computational Biology. The reference is on the slide. And why I love this is because a lot of people don't have access to courses like these or tutorials, hands-on workshops back in their individual institutes. So, this is a great way to teach yourselves how to do a lot of the techniques we're going to talk about in this course that we don't necessarily have time to cover. What's very, very nice about this is what David's done is he's created five curriculum tracks and the five are shown on the slide there. So, if you're interested in doing analyses, there's a subset of courses he recommends that you should watch online. So, very much in the Course Sarah [spelled phonetically] or the Kohn Academy [spelled phonetically] model where you can just sit at your laptop, at your computer, download some handouts, and learn about all of these techniques. If you're interested in building the tools, there's a [unintelligible] Informatics Tool Track and so on. So, this is a must read. Consider it assigned reading. Download this particular paper and I think you'll find some very, very important and very, very useful things in this manuscript. Okay, so with that, we're going to shift our focus next week back to the level of whole genomes. As I look over to Dr. Wolfsburg [spelled phonetically]. Terra's going to take over next week and we're going to start talking about how we're going to visualize and analyze genome scale data. Identifying important sequence-based features and using that information to its best advantage. So, with that, I thank you for your attention. I'll take questions at the podium, and thanks for joining us this morning.