Webinar - A Practical Guide to NCBI BLAST

Good afternoon, everyone! This is Peter Cooper from the NCBI. We've got two webinars today. One is A Practical Guide to NCBI BLAST. You're seeing sort of a cover slide for that, there. The other one begins about an hour and 15 minutes later and that's a webinar on human variation resources and medical genetics resources. These were originally requested by Des Moines University, so I just wanted to give a shout out to them. They should be listening from the classroom out there. We'll record both webinars. They will be available on our YouTube channel on the webinars playlist. It'll take a couple of weeks for them to get up there. Materials for these webinars are on the FTP site and there's a compressed link that will take you there. That's gonna have the slides in there as well as the demos that I'm gonna do after I give the slideshow. This is a shortened version of a larger workshop that we give that lasts a couple of hours. So, if you have any questions about this, my email address and contact information are on the front. What we're gonna do today is talk a little bit about the basics of using BLAST, the reasons people use it, some of the statistical information that BLAST gives you, the scoring systems, what the search programs are, and talk about some other alignment services that are not BLAST that we offer on our website. And then, I'm gonna go over to the web browser and do some live searches. BLAST is probably the most widely used sequence similarity search tool in the world. What it does is find high-scoring local alignments between two sequences. They can be protein sequences or DNA sequences. BLAST includes a model of score distributions for random local alignments and because of that, it can provide some statistical information about how different your alignments are from chance. So, BLAST tells you, then, about non-chance similarities between biological sequences.
That's interesting because if the similarities are not due to chance, then they must be due to something else. There are a couple of things that could be. The most interesting one from the point of view of the original purpose of BLAST is homology, that these two sequences are descended from a common ancestor. In most cases these days, people are using BLAST for simple identification, so, that includes things like annotating genomes or seeing what kinds of problems there are with the alignments of things through a genome. All BLAST searches begin with a sequence, either a protein or a nucleotide sequence. That can be either one that you determined or it can be one from the database. Let's talk for a minute about BLAST statistics. The most important statistic you get back from BLAST is something called the expect value or the expectation value. It ranges from essentially zero up to the size of the database. It's the number of alignments that you would expect by chance with a particular score or a greater score. So, for example, if we have a five-residue sequence like this one, ELVIS, that has an expect value of 48,000, we would expect to see 48,000 hits that good or better in a database search by chance. We don't know anything about that particular hit. The one below it, though, has an E-value that's read as seven times 10 to the minus 18; that means I wouldn't expect to see any hits that are that good by chance. That tells you whether your alignments are different from chance, and if they're different from chance, they're due to something else. A really important point from this slide is that the E-value depends directly on the size of the search space, the size of the database. You wanna search the smallest database that's likely to contain the sequence of interest. We'll come back to that point when we talk about limiting your database. BLAST uses two schools of thought in terms of scoring things.
The one that's in the classic kind of BLAST scoring system is called the Position Independent Scoring System. That means that the same substitution in your alignment gets the same score in any position in that alignment. The model that that really assumes is that all positions in a sequence are equally likely to change. That's not a realistic model for the way proteins or DNA sequences evolve. Nevertheless, this is used by ordinary BLAST. blastp uses BLOSUM62 and it includes a concept of conservative substitutions. Nucleotide searches are less sophisticated. They use an identity matrix. The other kind of BLAST, which we're not really gonna have time to talk about today, but we will see the results of, uses Position Dependent Scoring Matrices. In those kinds of scoring matrices, the substitution score depends on the position in the protein or in the alignment. This means, of course, that some positions are more important, less likely to change than others. And that's a realistic model for the way proteins and other biological sequences evolve. Programs that do this are PSI-BLAST and DELTA-BLAST. Those search a database with a position specific score matrix or a PSSM. Reverse PSI-BLAST searches a database of PSSMs and identifies conserved domains, and that's the search that we're gonna see today that uses these position specific scoring systems. All BLAST programs also include some kind of a penalty that allows them to incorporate gaps in the alignment. Okay, so, let's talk for a minute about what these BLAST search programs are, what they're called. The nucleotide search programs that you're gonna see on the web are blastn and megablast. Blastn is the traditional BLAST algorithm. It's the most sensitive kind of nucleotide search. Megablast, by the way, is the default algorithm and this is the best program for simple identification, things like species and annotation. You just have to remember that that's the default algorithm on the BLAST pages.
When you do searches you may sometimes need to change it to the more sensitive blastn. There's sort of an intermediate algorithm that's not used very much called discontiguous megablast; it's also more sensitive than megablast. And then, here are the protein search programs, and today we're really gonna only focus on the position independent scoring ones. There's blastp, which is a protein-protein search and alignment program. And then, there are these translating searches. These are useful for unannotated protein coding regions. And we'll do one of these today. There's blastx, which translates your query sequence and searches a protein database, tblastn, which translates the database and searches it with a protein, and tblastx, which translates both the query and the database. But all of these things are searches at the protein level. When you search BLAST at NCBI, how do you get here? Certainly, you can get there from our homepage and there's a link right there at the NCBI home that links to BLAST. The most common way to do it, for most people, is simply to type NCBI BLAST in Google and that will take you right to the BLAST homepage. This is the current look of the BLAST homepage and by the way, this is going to be changing fairly soon. The furniture is going to be rearranged, but there's no real difference in function. Basically, this is divided into several sections that are useful in terms of what they do. One is a way to get access to assembled genomes and I'll show you that in more detail in a few minutes; you can pick any genome that you want from here and find the most complete set of data. Then, the center part of the page is basically the core of the BLAST searches and we're gonna be focused on that today, what we're calling basic BLAST. Those links take you to pages that are preloaded to do those kinds of searches.
Then, finally, at the bottom, there's sort of a miscellaneous set of search programs or alignment tools that are related to BLAST, but are not necessarily BLAST. And we'll visit some of those today. When you're using BLAST, you need to have something to search with and that's your query sequence. I just want to point out a few aspects of this that still confuse some people. The BLAST search programs on the web will take FASTA formatted sequences, like those shown in the upper left of this slide. They will also take accession numbers, NCBI accession numbers; we'll pull those from a database and do a search with them. You can also run BLAST directly from the Entrez pages, nucleotide and protein. We'll do a couple of examples like that today. Another point to take away from this slide is you can use multiple queries in a single search. Most people know that, but occasionally, we'll run into somebody that thinks you can only do one sequence at a time. Each sequence will be searched separately as a BLAST search if you do that. I want to talk for a minute about something that's a little bit odd if you think about it. One thing you can do at NCBI which is useful is to compare your own sequences without doing a database search at all. I just want to point out the two options for doing that at this point in the talk. One is BLAST 2 sequences. On any BLAST page at NCBI, when you go to the BLAST form there's a little checkbox that says align two or more sequences, and if you check that box, another text field will open up in the form, and you can enter a sequence in there. You can enter many sequences in there. So, you can do your own little database search against your own customized database, if you want. You can also access this under that specialized BLAST section at the bottom of the page.
Something we know from talking to people at the help desk, here, is that many times when people are doing BLAST 2 sequences, what they really want to do is a global sequence alignment, and we do have that available in the specialized BLAST section. It's not BLAST at all, this is an algorithm called Needleman-Wunsch and it allows you to compare the entire lengths of two sequences. This is a global alignment tool. It doesn't provide any meaningful statistics about whether this is a chance alignment or not. It will just align anything to anything and it will include all the residues of your particular sequence. If you want to do global alignment, if you're interested in knowing what the percent identity is between two proteins, this is the tool that you would have to use to do that. Okay, so, now, those were searches independent of our databases. Let's talk a little bit about the BLAST databases. And this is sort of a complicated part of our system where you need to sort of understand what's goin' on. Some of it's a bit chaotic. The protein databases, which you're searching using either blastp or blastx, are fairly straightforward. You do have sort of a comprehensive database called nr. This is a non-redundant database. It contains the majority of the protein sequences that people are interested in at NCBI. There are also useful subsets available on the database pull-down list: RefSeq, Swiss-Prot, PDB. Just keep in mind there are some sequences that are not part of the protein nr. US, European, and Asian patent sequences that we get are not in there, they're in a separate database. Proteins that are coming from metagenomic samples, sort of ecological genome things, those are not in there. And, also, the proteins from Next-Gen assemblies. So, these are transcriptome shotgun assembly sequences. This is a growing set of data, in particular on the nucleotide side, but there are some TSA proteins, as well, and those are not part of nr.
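As a rough illustration of the global alignment idea mentioned above (comparing the entire lengths of two sequences to get a percent identity), here is a minimal Needleman-Wunsch sketch in Python. It is not NCBI's implementation: the match, mismatch, and linear gap scores are simple assumed values, whereas a real protein tool would typically use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; scores here are illustrative assumptions."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalized, so every residue of both sequences is included.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(aligned_a, aligned_b):
    """Identical columns divided by alignment length (one common convention)."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


aligned_a, aligned_b = needleman_wunsch("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(round(percent_identity(aligned_a, aligned_b), 1))
```

Note that percent identity depends on the denominator you choose (alignment length here); other conventions divide by the shorter or longer sequence length.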
This is what the nucleotide search page database pull-down list looks like. And it's quite a bit more complex. Any time you go to a BLAST page, you can click on one of those question marks and get information about what sequences are included. This is a slide that has more details about that. There are a couple of things to keep in mind about the nucleotide databases that make them different from the protein ones. The main one is the default database that we call nr. I like to refer to it as nt, which is what we call it on the FTP site. This is not a comprehensive database. It contains the traditional GenBank sequences, things that are not bulk sequences, and NCBI RefSeq RNA sequences. That's actually a very small set of data compared to everything else we have at NCBI. It's a useful set, but it's a smaller set. Some subsets of that which are cleaner are the RefSeq RNA database and a 16S RNA database that you can search. So, what's not in nr is the majority of the nucleotide data. That includes all the bulk sequences, the RefSeq Genomic Sequences, which include our chromosome records and our various sizes of assemblies there, patents are not in there, and some other large sets of data, including Whole Genome Shotgun sequences, Transcriptome Shotgun Assemblies, and SRA data, which is really the largest set of data at NCBI. It's so large that there's no way to actually search it as a single entity. We'll talk a little bit about that when we do a demo. Another set of databases that are really separate BLAST pages, if you will, are available through that device at the top of the BLAST homepage that lets you search Genome/Assembly databases. Basically, this is a way of getting you a BLAST page that has the most completely assembled genome for that particular species, so you can type an organism name in there and then you can link directly to it and that will take you to a BLAST page set up to run that search.
Now, I mentioned this earlier in the talk, that the most important thing you can do when you're using BLAST is to search the smallest database that's likely to contain the sequence of interest, and that's because the database gets larger and larger. As this gets larger and larger, it gets harder and harder to discern the signal from the noise that's in there. And that has to do, probably, with the way the expect value scales with the size of the database. There are some useful things that you can do, here. You can use one of the organism limits, you can type the name of an organism or group of organisms. You can even exclude groups of organisms that you don't want to see. So, here's an example: getting all the bacterial sequences without the order Enterobacteriales in there. You can get rid of things like model sequences or uncultured sequences if you're working with bacteria. You can even specify things like a molecular weight range. So, any Entrez query that works in the protein and nucleotide databases will also work on this page. Okay, a couple of things to finish up, here, and then we'll pause and see if there are any questions. One of the things that, as a person who manages BLAST help or sits on the BLAST help desk, here, is very important to me and to all the other people who sort of support BLAST is that you understand that there is an identifier for your search and that's called the request identifier. If you look at your BLAST results it's at the top. It says RID, and it might not be clear what that stands for. But that identifier is the unique identifier for your results, so if you have a problem with BLAST, you can write to us and give us that identifier. If you click on that link, it will give you a URL like the one in the middle of this slide and you can just paste that in a web browser and get your results back, or you can send it to somebody you know that you want to show your results to, or you can send it to us.
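To make the RID-to-URL idea concrete, here is a small Python sketch that builds a results link from a request identifier. The slide's actual URL is not reproduced in this transcript, so this assumes the `Blast.cgi` endpoint and the `CMD=Get`, `RID`, and `FORMAT_TYPE` parameters described in NCBI's BLAST URL API documentation; the RID value used below is a made-up placeholder, not a real search.

```python
from urllib.parse import urlencode

# Assumed endpoint, taken from NCBI's public BLAST URL API documentation.
BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def results_url(rid: str, format_type: str = "HTML") -> str:
    """Build a link that retrieves stored results for a request identifier."""
    query = urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})
    return f"{BLAST_CGI}?{query}"


# "EXAMPLE0RID1" is a hypothetical placeholder RID.
print(results_url("EXAMPLE0RID1"))
```

A link like this only works while the results are still stored on the server, so it is best treated as a short-lived pointer rather than an archive.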
We will keep your results on the servers at NCBI for about 36 hours. You can see them through the recent results link that's on the BLAST page. They will also show up in your My NCBI. It doesn't make them last any longer, they still last for 36 hours. So, keep this in mind and we'll show you how to manage that a little bit later on today when we do a search. So, be sure to send us an RID if you have a question about a particular search. We can look up your results and see exactly what the settings were and we can figure out if there's a bug or if there's something we can help you with to make the search work more efficiently. Another thing I want to mention is that BLAST offers a number of download options. This is actually an older screenshot. We've added a couple more structured formats, here. Just be aware that they're here. These are the kinds of things that you'll want if you need to save BLAST results or to save huge sets of results to try to parse out information from them, because they're structured formats that you can parse with a script. Or you can use some of the utilities that come with BLAST to re-display them. And the hit table is particularly popular, even with people who don't script, because that can be loaded into Excel. The .CSV version of it in particular. Okay, so let me talk about some of the specialized BLAST services, then we'll stop for questions. Bonnie's lookin' at me 'cause I said we'd stop (laughter) for questions next. And these are ones we're gonna demonstrate in a few minutes. Primer-BLAST is our primer designer and specificity checker. It takes advantage of free software, Primer3, to design the primers, it uses annotations on our sequences if you want to design primers that do things like span exon boundaries, and it uses BLAST to make sure your primers are specific. MOLE-BLAST is a tool that's very specialized and I won't demonstrate that today. We have done webinars on this particular topic.
This is a way of clustering sequences and finding the taxonomic placement of things like 16S sequences. We use it internally in our taxonomy group to help identify things. Two special protein services that I will demonstrate today are COBALT, which is our multiple alignment tool. COBALT stands for Constraint Based Alignment Tool. It does a protein global multiple alignment. And just like Needleman-Wunsch, a global alignment tool like this requires that you input sequences that you know are related to each other, 'cause otherwise, you'll just get a mess. The beauty part about COBALT is it lets you take the output from a BLAST search and feed it into COBALT, so you already know those sequences are related and you can run it as an extension to your BLAST search. And then, I'll give you a quick demo of a new tool that's something we're sort of trying out. That's a rapid protein identification tool. It's called SmartBLAST. It uses a very rapid approach to searching that uses k-mer content of sequences to find matches. It's very quick and it might replace some of our internal mechanisms for neighboring things like proteins to give you a live search, say, if you wanted to find a related protein on the web. And it also uses COBALT to produce a multiple alignment and a protein tree. Okay. So, now. I'll just mention that there is a help link on the BLAST pages and this has lots of good information, including links to the handbook chapters, the help documents, and the YouTube channel, which has a lot of BLAST tutorials. In fact, in addition to the webinars playlist, this will go on the BLAST tutorials playlist on our YouTube channel. So, Bonnie says we have three questions.
- [Voiceover] The first question is, what are the limits for multiple sequences in the query data set?
- [Voiceover] Well, that's a question that we get often at BLAST help in particular.
The way this works is you're allowed one hour of CPU time, so, that's processing time. It's not real time. Now, that could be just a few minutes of real time, depending upon how many processors your search runs on. So, there is no fixed limit based on number of sequences or number of residues, but you will run up against it fairly quickly if you use large numbers of sequences that have a lot of hits in the database. I'm afraid I can't give you a concrete answer. If people write in, for proteins I would say no more than 100 at a time; for nucleotide sequences, it depends on the length because that can vary a lot. But if you're trying to do something like search with chromosome one against the nt database, Bonnie's laughing, but this happens all the time. Don't do that, it's not going to happen for you.
- [Voiceover] Because chromosome one is how long, Peter?
- [Voiceover] I don't remember, it's big.
- [Voiceover] Okay. (laughter) The largest human chromosome, so.
- [Voiceover] There are lots of things that you can do to sort of ameliorate that problem, but if you have a need to BLAST hundreds of thousands of sequences, you might need to think about some other options and I can point some of those out to you, and they're available on the help desk and in the developer options. Okay, do we have another one?
- [Voiceover] The question was, is the protein reference database not included in nr? And I wanted to make sure that this person meant the BLAST protein nr database and I didn't get a clarification.
- [Voiceover] Well, the answer's really the same on--
- [Voiceover] Okay.
- [Voiceover] Both.
- [Voiceover] Okay.
- [Voiceover] The RefSeq, well, as long as we're talking, let's address them separately. On the protein side, RefSeq proteins are included in nr. That's easy. On the nucleotide side, the messenger RNA, the transcript sequences, are included in the nt nr database, the nucleotide default database.
The larger reference sequences, the assemblies, like chromosomes and contigs and things like that, those genomic sequences are not included in nt nr. And there was a third question?
- [Voiceover] The person says that the BLAST results are missing right away after they did the BLAST and they can't find them.
- [Voiceover] Not sure I understand the question.
- [Voiceover] I know, I didn't either, completely. But I hoped that you on the BLAST help might have seen that before.
- [Voiceover] Yeah, no, I don't know what that person means by they can't find them.
- [Voiceover] Okay.
- [Voiceover] Maybe you can rephrase that and we'll come back to it later. I'd like to stop here and go on to do a few live demos. What I wanted to do is to do a few searches. We've got, actually, a document on the FTP site that goes over what we're planning to do in the live searches. I'm gonna do a couple of things with a mammalian protein called creatine kinase B, the brain-type kinase. We're gonna do some BLAST searches with that. Then, we're gonna do a translating search against a fish TSA database, to find the corresponding nucleotide sequence for a protein. We'll use a different protein for that. We'll use glycine dehydrogenase. Then, actually, I will give a quick demo of SmartBLAST, using an open reading frame that I got from that fish sequence. And then, we're gonna do two other searches. One is to show you some things about the nucleotide system by searching the human genome with a transcript from Macaca fascicularis, and we're gonna design some primers using Primer-BLAST. There's another example in here using SRA, but we won't have time to do that one today, I'm pretty sure, 'cause I wanna spend some time on those first several. Okay. So, what I wanna do is we're gonna start, actually, not in BLAST, but we're gonna start in the protein database. 'Cause I'm gonna set some searches up for you.
Notice that I've mentioned those BLAST RIDs and I have those in this document and they're stable, for a while, anyway. We have the ability to preserve these at NCBI. At some point they will go stale because the underlying things change and they don't work anymore. But I can use those to retrieve my results and save us some time, in particular for the first search result setup, and I'll show you by just retrieving our results why it's important to limit your database. I'm gonna go over here to the NCBI homepage and I'm gonna change my database to protein. I'm just gonna retrieve the sequence that I happen to know the accession number for. This is a creatine kinase from a human. I know the accession number, this is a reference sequence accession number. I'm gonna retrieve a human reference sequence. My goal, ultimately, with this search is to try to collect a set of sequences to do multiple sequence alignment with them, and I'm gonna try to collect mammalian creatine kinases. So, notice that for any protein sequence like this I can click the Run BLAST link, here. And, really, all it did for me was it just loaded the BLAST page for me with the accession number in the query box, there. And I'll talk about some of the settings on the page. I normally have slides for that, but I thought I'd just do it live because I think we can do things a little bit more expediently that way. One of the first things you need to do when you come here is to figure out what database you're searching. And here's the pull-down list. Now, I'm gonna leave it set to nr for a moment. I'm actually not gonna run that search for you, I'm gonna retrieve the results because that's gonna take a while. And this is a good example just to illustrate for you the problem that you're gonna run into, now, with the size of the protein database being so large and so heavily weighted in sort of two areas.
One of those areas is vertebrate proteins and the other area that it's heavily weighted in is bacterial proteins. And that has to do with the efforts to sequence and annotate things. So, we'll come back to limiting this in a few minutes. There are also some settings below this fold down here, called Algorithm parameters, so, I wanna just show you that briefly because there are some things down there that I think you may need to change sometimes. One of the important things, here, is there are two parameters that govern your output. One of them is the maximum target sequences. With this setting, no matter what else I do, BLAST will not show me more than 100 sequences. It's a little bit worse than that, though. What it means is that BLAST will not collect more than 100 sequences. In BLAST, there's a two-stage algorithm, so, if you have this set too low, you can wind up missing some important things. So, you can increase this. And, in fact, for the search that I'm gonna show you, I've set it to 5,000. And, in fact, that wasn't enough, as you'll see in a minute. You, also, will probably want to adjust your expect threshold, ordinarily. Notice that it's set to 10. That means that for the worst score that I'm gonna show you, I would expect to see 10 hits that are that good or better by chance. That's not something that's terribly interesting to you. So, you can set this to some other value. There are lots of good arbitrary values to set, here. One that's quite common for protein searches is this one. Sometimes, 10 to the minus six, so, I think that's a useful one to use, there. You could set it to one times 10 to the minus three. But that means that at least there's some kind of, you know, possibility that that's not due to chance.
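To see how the expect threshold interacts with database size, here is a short Python sketch based on the commonly cited relation E = m * n * 2^(-S'), where m is the query length, n is the total database length, and S' is the bit score. This is a simplification: it ignores the effective-length corrections that BLAST applies, and the lengths and scores used below are made-up illustrative numbers, not values from the searches in this webinar.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a bit score in a search space of size m * n."""
    return query_len * db_len * 2.0 ** (-bit_score)


def bit_score_needed(threshold: float, query_len: int, db_len: int) -> float:
    """Bit score at which the approximate E-value equals the threshold."""
    return math.log2(query_len * db_len / threshold)


query_len = 400            # hypothetical protein query length
big_db = 10 ** 10          # hypothetical residues in a large database
small_db = 10 ** 8         # hypothetical residues after an organism limit

for name, db_len in (("large", big_db), ("limited", small_db)):
    e = expect_value(60.0, query_len, db_len)
    s = bit_score_needed(1e-6, query_len, db_len)
    print(f"{name}: E at 60 bits = {e:.2e}, bits needed for 1e-6 = {s:.1f}")
```

Under this approximation the same alignment gets an E-value 100 times smaller in a database 100 times smaller, which is the arithmetic behind searching the smallest database likely to contain your sequence.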
The other thing that I want to point out, here, is a recent change in BLAST for protein. One of the shortcuts that BLAST uses is it doesn't find or try to extend every match; it finds the matches of a certain size and then starts to extend them. The default once was three, for protein searches. It's now six. That makes the searches faster, but just be aware if you come here and you find that you don't get exactly the same results that you got, say, last year at this time, it might be because of the word size. It's gonna affect sort of marginal hits in many cases. Those are the things I wanted to show you, here, for the protein pages. And I could run this. It may take a while to run. So, what I'm gonna do is go back over here and retrieve my RID. And so, I've got one, here, that I ran that has this RID, here. There's a complete URL there. I can also just show you that I can get this from my recent results, so I'll just copy that. I'll go over here to this link that says recent results and that's available on any of the BLAST pages. And I'll paste this in here. Now, I actually did do some filtering on this. So, I eliminated some of the model organism searches and I'll come back to that in a minute when we go back and show you with this applied. If we resubmit it, we can see all those settings. The main point in showing you this is that, first of all, we did run a conserved domain search on this. It has the phosphagen kinase conserved domain, the creatine kinase, so, that's what I would expect. If I didn't know what this protein was, it would give me some ideas of what the function of this protein is. I've maxed out my display, here, for the graphical overview. It holds 100. I can change that if I want to. And then, down here, I've got what we call the BLAST descriptions. These are the hits, essentially. And they're sorted for me by E value. Notice I have all these E values of zero. They're not really zero, they're just a very small number.
And I'm trying to reach my cutoff. The number was set to, I think when I ran the search I probably left it at 10. I might need to reformat this, because I don't have everything, so, let's make sure that I get all my results back. What I can do, here, is go back to the formatting options; we'll use this more than once today. It's still set to 100 descriptions, but notice that I can get up to 5,000, which is what I originally requested. And so, you can see what happens very quickly is that I get a tremendous number of hits that are from all kinds of lifeforms. I should be able to take this all the way back to the bacteria because it has a conserved domain. So, here, I'm back into some things that are insects and have arginine kinases in them. But this is an overwhelming amount of output and the main point that I wanted to make to you with this search is that there's no reason for you to do this because you probably don't need all this. You probably want to do something much simpler and restrict to a particular set of organisms. So, this just goes on and on and on. 5,000 hits. And by the time I get to the bottom of my descriptions, I still haven't reached my E value cutoff. And that means that I'm missing significant hits. Maybe that significant hit is one that I'm interested in. The main take-home from this is to make sure you're limiting your database to something that you're interested in. And, actually, this one I already did limit a little bit. Let me show you what I did. If I have a result like this and I want to resubmit it, I can click this link that says edit and resubmit. This will take me back to the page that shows me exactly what I did. So, I did a search with creatine kinase. I did this yesterday. I searched nr. And I did get rid of some kinds of sequences. I got rid of models and I got rid of uncultured environmental sample sequences. I did ask for 5,000 hits. I didn't restrict my E value and I never reached it.
I had an E value set here to 10, but I never reached that at all. I never even got to an E value that wasn't significant because even with 5,000 hits, I didn't find all those significant matches. This, down here, was on by default. This is a filter for the kinds of sequences that violate BLAST statistics, which is low complexity regions. So, if I wanna fix this to make it a little bit more manageable as a search, probably the best thing that I can do is to run it against a particular group of organisms. So, if I'm interested in making a little phylogenetic tree or a protein tree, then I would want to collect sequences for my group of organisms. I'm gonna collect mammals, which is a much smaller set of data. And let's (sighs) look for a better-controlled set of data and let's do the reference protein database. I've now made the database smaller and I've made this RefSeq protein database even smaller by doing two things: restricting by organism and getting rid of the model sequences that our pipeline is producing. And the other thing I might want to do is to go down here and change my expect threshold like we did earlier. So, I could set it to something fairly significant. And these are just arbitrary cutoffs. You can go back and change them if you find that you're not getting sequences that you want or you're getting things that you don't want. I will try to run this one live. Let's see how long it takes. Immediately, I have my conserved domain results. That's partly because for the sequences in the database we already know what conserved domains are on them. Remember, that's a position specific kind of search. And if I think this is taking too long, I can go back over here because I have run this, and get my request ID out of my old document. And if you want to come back and retrieve these, they will work for a while. Probably several months. They won't work next year. Let's see where we are here. I didn't need to do that because this is done.
So, now, instead of maxing out everything, I have 32 BLAST hits, and if you'll look at my graphical overview here, you can see there are sort of two kinds of hits. There are these longer ones. This is the one that we started with. These are the cytosolic isoforms or genes for creatine kinase. And notice that there are some that are going to the mitochondria, and you can probably guess that the reason that they don't align is because there is a leader peptide at the beginning here, a signal peptide that tells the cell to send that to the mitochondria. Looking at my output here, you can see the organisms in the list. My E value was set pretty stringently, but these are all very similar proteins, so I got them here. That looks pretty good. I wanted to look at an alignment just to see how BLAST shows an alignment. When we look at this one from (mumbles). If I go here, for the pig, a U type, it jumps me down to the alignment, and you can see the way BLAST shows the protein alignment. Notice this is a local alignment. So the first 11 residues of my query didn't align, and the first 44 of my subject sequence didn't align with the query. And then you can see how the center line is sort of reflecting the scoring system. These plus signs represent positive scores for substitutions in the underlying BLOSUM62 matrix. The blank spaces mean that that's a negative or a zero score in that matrix. The identities, of course, are given a letter there. I don't see any gaps in this particular alignment, but if there had been some, they'd be rendered as, oh yes, there's one right here. There's a gap right there. BLAST inserted a gap there; that was cheaper than aligning residues incorrectly in that position. So I have an ad here for SmartBLAST, and we can do that next with a different protein, after we do another kind of protein search. Let's pause here for a minute, Bonnie, and see if there are any particular questions about this one.
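As a rough illustration of how that center line is built, here is a small Python sketch. The score table is only a hand-copied excerpt of BLOSUM62 covering the pairs used in the example, and the two aligned strings are made up; neither comes from the creatine kinase alignment shown in the demo.

```python
# Excerpt of BLOSUM62 substitution scores; only the pairs used below are listed.
BLOSUM62_EXCERPT = {
    ("K", "R"): 2,
    ("L", "I"): 2,
    ("D", "E"): 2,
    ("S", "T"): 1,
    ("A", "G"): 0,
    ("G", "L"): -4,
}


def pair_score(a, b):
    """Look up a substitution score in either order."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    if (b, a) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this excerpt")


def midline(query, subject):
    """Build the center line for two aligned sequences of equal length.

    Identity -> the residue letter, positive score -> '+',
    zero or negative score, or a gap ('-') -> a blank.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")
        elif q == s:
            symbols.append(q)
        elif pair_score(q, s) > 0:
            symbols.append("+")
        else:
            symbols.append(" ")
    return "".join(symbols)


# Made-up aligned fragments, with one gap in the query.
QUERY = "MKLDAG-S"
SUBJECT = "MRIEGLWT"
print(QUERY)
print(midline(QUERY, SUBJECT))
print(SUBJECT)
```

Printing the three lines stacked gives the same layout as the BLAST alignment view: query on top, center line in the middle, subject below.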
- [Voiceover] Not about the clarification from the missing BLAST cells, but what I wanted you to address was that sometimes the network crashes during the search. I just wanted to see if you knew, or could address it. When you make the BLAST request, it gets to the NCBI servers. The NCBI servers can run it, and the results should be in the recent results window. But if there is a network crash, at what part of the process would that prevent the BLAST search? - [Voiceover] I don't know. - [Voiceover] Okay. - [Voiceover] If the person has a particular issue, they should write to us. We can help them solve the problem, but I can't completely understand what the problem is from the question. Write to BLAST help or write to me, with enough details about exactly what happened and what your search was. If you got to the point of getting an RID, you know, send that to us, too. - [Voiceover] There is another question: is 5,000 the optimal value for protein, or does it depend on the size of the query sequence? - [Voiceover] No. You mean 5,000; it depends on what you're trying to do. The main thing is to make sure that that isn't limiting your results, that the expect value cutoff is what's limiting your results. So I want to do a different kind of example. We're gonna use another protein search just to show you what a translating search is useful for. And I'm gonna go back up here to the BLAST homepage. Here I'm gonna take a protein sequence. I'm gonna take a highly conserved protein, and I'm gonna try to find it in one of these transcriptome shotgun assemblies. These are assemblies from next-gen RNA-seq data, for organisms that we basically have no other kinds of data for. Many of them have no protein data at NCBI. But in those assemblies of their transcripts, you can identify the corresponding regions. So let me just show you that real quickly. I'm gonna go to tblastn.
And we're gonna use as a query sequence one of the ones that's in my output over here. This is a glycine dehydrogenase. So again, the query sequence, that's not that hard to do. But notice that my databases here are different. So I'm using a protein query, but my databases are nucleotide databases. This is a growing set of data called transcriptome shotgun assembly. In many cases those assemblies are also represented in SRA, so you could potentially search SRA, as long as you knew what the experiments were that you needed to search. The transcriptome shotgun assembly database is set up the same way that WGS is on BLAST. When I choose that database, notice that I have to do something else. So choosing an organism, for example, is not optional; I have to choose one. So the organism that I'm gonna choose here is striped bass, okay? So that's an organism that's near and dear to a lot of people's hearts around this part of the world. Chesapeake Bay. So there's a transcriptome shotgun assembly for that organism. If I want to, I can change my expect threshold, just as a matter of course, to be something a little bit more significant. I'll leave those settings the way they are, and let's see if this is fast. These searches are a little more burdensome than doing an ordinary blastp search, because what BLAST is doing is translating the database in all six reading frames on the fly, to give you a protein. So I have a very nice hit here. Some other sort of minor hits. If I click down here, here's my match, and this is a translation of my subject sequence. It's a pretty decent match, to give me a pretty good idea that this is a (mumbles) protein. When I go here, this doesn't take me to the nucleotide database; these are stored in a separate system. This is going to our WGS browser, which also contains the transcriptome shotgun assemblies. I'll go ahead and open that on a new tab.
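For anyone who wants to see what "translating in all six reading frames" means in practice, here is a minimal Python sketch. It uses the standard genetic code and a made-up 12-base sequence; it is not how tblastn is implemented internally, only an illustration of the six translations it has to consider for each database sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the translations for frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Made-up example sequence.
for frame, protein in six_frame_translations("ATGGCCAAATAG").items():
    print(frame, protein)
```

Each database sequence yields six candidate proteins this way, which is why a translated search is more work than comparing protein to protein.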
So what I can do there is to get the FASTA sequence, but this exists basically only here; I can't get this out of the normal nucleotide database. There's a master record for this here. I could download the entire set if I wanted to. Now what I did, actually, was I translated that, and I got a little protein sequence, and this would be a common use case for SmartBLAST. I have a protein that I generated from some kind of project like this and I want to identify it. Let's see what happens when I do SmartBLAST. I'm gonna go back over here; this is my open reading frame that I got from that sequence. I'll show you a couple of things about SmartBLAST that are kind of useful. One is the rapid identification. The other is that it does give me a little bit increased look-back time, because the database is smaller, and I don't run into that problem of being limited by the number of proteins that I can collect. I'm gonna go back here to the BLAST homepage. I'm gonna retrieve the SmartBLAST link here. I'm gonna paste that protein sequence in. So I searched with a bony fish sequence, which is labeled as unknown, and notice that it places it in this nice little protein tree for me. If I wanted to know what this protein was, then I have no question that this is a glycine dehydrogenase, decarboxylating, mitochondrial form, so I have other fishes there. This yellow croaker, the damselfish, guppies, the zebrafish. Notice the hits are in two different colors. There is a reference database, or what we call a landmark database, of proteomes from well-studied organisms. The house mouse and zebrafish are two of those. If you go to the help tab, it defines what all the other organisms are that are in that database. In addition, it gives me the best hits from the nr database, so that's where this large yellow croaker, the damselfish and the guppy sequences come from. So that's very good, and it was very fast at identifying that.
The other thing that's kinda useful about this, so here are my best hits. Top five, but these additional hits are interesting because they let me look back. Because the number of organisms in that reference database is limited, I can see much further back than I could in a search against nr. So here are some bacteria, Thermotoga maritima; in fact, if I wanted to find Escherichia coli, I could just do a find in page, make it easy. So here's my match to E. coli, which I defy you to find that easily in a search against nr, because what happens is you're gonna have to get many, many thousands of hits to see that one. It's still a significant match, even though it's only 33 percent identical to the protein that I started with. So I'm gonna change gears and do one thing with BLAST, and just show you some of the formatting options that are useful. Alright, so let me go over here to the BLAST page again. Actually, what I think I'll do is I'll start with a nucleotide database, which is one way of doing this search that I'm gonna do now. I'm gonna go back here and I'm gonna do a search using a sequence that's actually got a problem with it. We can see that problem very easily by using one of the formatting options in BLAST. So I'm gonna go ahead and copy that. That's a nucleotide sequence. So this one has sort of a funny definition line. It was some kind of a high-throughput cDNA sequencing project. It's similar to CDC20, but there actually is a problem with this sequence. It is from a monkey. I'm gonna click the run BLAST button here, and I'm gonna throw this into the standard nucleotide BLAST page, 'cause there's a shortcut here that's pretty handy. So let me search the human genome. I'm gonna do that. Now this is my first nucleotide search today. Notice that this is set to megaBLAST. That means that it's not very sensitive, and I won't have the kinds of problems that I have with protein, where I'm looking back very far and seeing lots and lots of hits.
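One way to picture why megaBLAST is less sensitive is its word size. megaBLAST needs an exact word match of 28 bases to start an alignment, where blastn's default is 11. The Python sketch below is a simplification under stated assumptions: it only looks at one fixed, ungapped alignment and asks whether any run of identical bases is long enough to seed a hit. The 40-base sequence and the mismatch positions are made up, and the helper names are hypothetical; real BLAST seeding and extension do considerably more than this.

```python
# Swap each base for a different one to create a mismatch.
MISMATCH = {"A": "C", "C": "A", "G": "T", "T": "G"}


def with_mismatches(seq, positions):
    """Return seq with a substitution at each listed 0-based position."""
    bases = list(seq)
    for pos in positions:
        bases[pos] = MISMATCH[bases[pos]]
    return "".join(bases)


def longest_identical_run(query, subject):
    """Longest stretch of identical bases in an ungapped, same-length alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned and of equal length")
    longest = current = 0
    for q, s in zip(query, subject):
        current = current + 1 if q == s else 0
        longest = max(longest, current)
    return longest


def can_seed(query, subject, word_size):
    """True if the alignment contains an exact match at least word_size long."""
    return longest_identical_run(query, subject) >= word_size


# Made-up 40-base subject, and a query that differs at two positions.
SUBJECT = "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGACCTGA"
QUERY = with_mismatches(SUBJECT, [13, 27])

print("longest exact run:", longest_identical_run(QUERY, SUBJECT))
print("word size 28 can seed:", can_seed(QUERY, SUBJECT, 28))
print("word size 11 can seed:", can_seed(QUERY, SUBJECT, 11))
```

Here the two sequences are 95 percent identical over 40 bases, yet no exact 28-base word exists, so a search seeded with that word size would find nothing in this region, while a word size of 11 would.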
This is really an example of an identification search and a search that looks at annotation problems. We'll go ahead and run that. That was very fast, because we have an indexed search of the human genome. We actually have two sets of data, with two genomes. We have the transcripts, where I hit the corresponding CDC20 transcript, and then we have hits to two different genome assemblies. We'll focus on the primary assembly. So if we look at the alignment to CDC20, I can go here and take a look at that. Here are some mismatches at the beginning. Mismatches, but it's a very close alignment. One of the things I want you to notice is there's a little gap right here. That's near the three-prime end of the alignment. That might not be a big problem if it's in the UTR, but what if it's in the coding region? It could cause a frameshift. So if we can add the coding regions onto this, then we can see what's happened there. So that's the main thing I came here to show you. If I go to the formatting options, I can add the CDS feature, which will pull the coding region features and their translations from the nucleotide sequence database. I could also render this in a way that's going to give me some indication of where there are differences. It's kinda hard to see where the mismatches are and where the gaps are, and things like that. Let me reformat that. So here's my N-terminal methionine. There are some changes in the coding region here; they're kind of subtle things. (mumbles) for alanine. Some of them are silent substitutions here. But here I have the problem. So it looks like the gap that was inserted, here it looks like there must have been a sequencing error, probably, in the sequence. So it threw a frameshift in here; beginning here, there's a different reading frame translation of the sequence. So these kinds of formatting options are very useful for seeing those kinds of things.
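To make the frameshift idea concrete, here is a minimal Python sketch using the standard genetic code. The coding sequence is made up, not the CDC20 record from the demo, and the single deleted base stands in for the kind of sequencing error that shows up as a gap in the alignment.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate complete codons in frame 1; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]]
        for i in range(0, len(seq) - 2, 3)
    )


def delete_base(seq, position):
    """Return seq with the base at the 0-based position removed."""
    return seq[:position] + seq[position + 1:]


# Made-up coding sequence and the same sequence with one base missing.
CDS = "ATGGCCAAAGGTCTGTTCGAA"
CDS_WITH_ERROR = delete_base(CDS, 7)

print("intact     :", translate(CDS))
print("one deleted:", translate(CDS_WITH_ERROR))
```

Everything upstream of the deletion translates the same way, and everything downstream is read in a different frame, which is the pattern the CDS feature display makes visible in the alignment.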
The other thing I want you to see here is the way that you can look at hits to the genome. There is a hit to a pseudogene; I won't go into that right now. You can look at that later if you want to. I'm just gonna look at the main hit, which is on chromosome one. We're down in the alignments here, and so we're sorting this by E value. We can sort this not by E value, but by query start position. That will give us basically the exons in order. Now they didn't line up exactly right, because BLAST doesn't know about splice junctions. The other thing that I can do here, which is very useful, is to just display this in the graphical sequence viewer, so we can see what's going on. I'm gonna display this here. My BLAST hits are gonna be loaded in the graphical sequence viewer. In this case, it sort of inverted my BLAST hits, so it's showing me the subject sequence with the query sequence aligned to it this way. So, you can quite clearly see the exon-intron structure here, with the mismatches highlighted. We can use the graphical sequence viewer just to zoom in on the five-prime end. You'll notice that actually I missed the first exon. One of the things you can do as an exercise, or to convince yourself that it's true: if I use blastn, which is a more sensitive kind of search, I will match that first exon. I will be able to align to it. MegaBLAST, which is what we used, uses a very large word size; it uses a word size of 28. So if there were any mismatches in that 28-nucleotide word, it won't find a hit at all. For this first untranslated exon, it doesn't find the match with megaBLAST, but it will with blastn. Okay, so I think we need to wrap up pretty soon. Why don't we pause here, Bonnie, and see if there are any questions. If not, I can do the primer-BLAST example fairly quickly. Do you have anything? - [Voiceover] Well, you can set up the primer-BLAST example while I ask this question, because it feeds right into it.
That is, is there no RID for primer-BLAST? - [Voiceover] There is an RID for primer-BLAST. I'll show you that when we do it. - [Voiceover] Okay. - [Voiceover] So what I'm gonna do is... We're gonna design primers for a particular exon of a gene, so I'm gonna work with BRCA1. I'm gonna cheat and use a shortcut that's built into a lot of places. It's a gene sensor. I'm actually gonna search PubMed, and notice that there's this ad for Gene. Um, I wanna do this in Nucleotide. It also has one, but the advantage of the one in Nucleotide is that it gives me access to the genomic sequence. This is a RefSeq gene record; I can highlight sequence features here. When I click that link, I'm able to sort of browse the features. Let me go ahead and get the exons here. So there are 23 exons of BRCA1. I'm gonna pick exon 15 as the thing that I want to amplify. This is a common kind of task that people have; they need to get primers that will amplify an exon of a gene. But notice that I can now display this exon in FASTA format. I want to design primers that will amplify this. What I can do is send this directly to primer-BLAST, with a template sequence. So this will design primers that will amplify within that exon, but if I want to, I can make it so that it starts before the beginning of the sequence and ends after it, and binds outside of the exon. I can go ahead and copy that, move it over. Likewise, I can change the endpoint over here. Then I need to pick a background database. It's already set to the tax ID for humans. I want to amplify this out of genomic DNA, so RefSeq reference assembly from selected organisms is a good one to pick. RefSeq representative genomes also works well for humans, because this is the representative genome for humans. I'll click the Get Primers button. It recognizes that my sequence matches chromosome 17, that's the gene BRCA1, so that's okay. I want to say that that's right, that's what I want to find.
Primer-BLAST can sometimes get a little slower than everything else, 'cause it's run on fewer machines. It also has a very wide open BLAST search at the end. So I got three primer pairs that are outside of exon 15 in BRCA1. So these are a decent set of primers that will amplify that exon, and not really bind within it too much. You could actually add something to this to see that this is a region that has a lot of disease-causing mutations, so this is a common kind of task that people who are screening things for these mutations would do. Now, somebody asked me about the RID for primer-BLAST. You can use this Job ID here to go back and get previous primer-BLAST results. Yeah, so it's over here. So if you have your primer-BLAST Job ID, which is that long string that we had a minute ago, you can enter that in here and retrieve the results. They don't persist, you know, any longer than ordinary BLAST jobs. But you can do that. Okay, so that is a wrap on this one. So those are the things that I wanted to show you. We ran over by a minute or two. We can stay open for a couple more minutes, if anybody has any questions. Okay, thanks everybody for coming, and that concludes this webinar. We will have another one beginning in about 15 minutes.