Good afternoon everyone!
This is Peter Cooper from the NCBI.
We've got two webinars today.
One is A Practical Guide to the NCBI Blast.
You're seeing sort of a cover slide for that, there.
The other one begins about an hour and 15 minutes later
and that's a webinar on human variation resources
and medical genetics resources.
These were originally requested by Des Moines University,
so I just wanted to give a shout out to them.
They should be listening.
The classroom out there.
We'll record both webinars.
They will be available on our YouTube channel
on the webinars playlist.
Take a couple of weeks for them to get up there.
Materials for these webinars are on the FTP site and there's
a compressed link that will take you there.
That's gonna have the slides in there as well as the demos
that I'm gonna do after I give the slideshow.
This is a shortened version of a larger workshop that
we give that lasts a couple of hours.
So, if you have any questions about this, there's my email
address and contact information is on the front.
What we're gonna do today is talk a little bit about
the basics of using BLAST, the reasons people use it,
some of the statistical information that BLAST gives you,
the scoring systems, what the search programs are, and talk
about some other alignment services that are not BLAST.
We offer them on our website.
And then, I'm gonna go over to the web browser and do
some live searches.
BLAST is probably the most widely used sequence
similarity search tool in the world.
What it does is it finds high-scoring local alignments
between two sequences.
They can be protein sequences or DNA sequences.
BLAST includes a model of score distributions for random
local alignments and because of that, it can provide some
statistical information about how different your alignments
are from chance.
So, BLAST tells you, then, about non-chance similarities
between biological sequences.
That's interesting because if the similarities are not due
to chance, then they must be due to something else.
A couple things that could be.
The most interesting one from the point of view of
the original purpose of BLAST is homology that these two
sequences are descended from a common ancestor.
In most cases these days, people are using BLAST for simple
identification, so, that includes things like annotating
things like genomes or seeing what kinds of problems there
are, now, with the alignments of things through a genome.
All BLAST searches begin with
a sequence, either a protein or a nucleotide sequence.
That can be either one that you determined or it can be one
from the database.
Let's talk for a minute about BLAST statistics.
The most important statistics you get back from BLAST
is something called the expect value or the expectation
value, it ranges from essentially zero up to the size of
the database.
It's the number of alignments that you would expect by
chance with a particular score or a greater score.
So, for example, a five-residue sequence like
this one, ELVIS, has an expect value of 48,000.
We would expect to see 48,000 hits that good or better
in a database search by chance.
We don't know anything about that particular hit.
The one below it, though, has an e-value of, that's read
as seven times 10 to the minus 18, that means I wouldn't
expect to see any hits that are that good by chance.
That tells you how your alignments are different than
chance and if they're different than chance, they're due
to something else.
A real important point from this slide is that the e-value
depends directly on the size of the search space, the size
of the database.
You wanna search the smallest database that's likely
to contain the sequence of interest.
We'll come back to that point when we talk about limiting
your database.
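To make that scaling concrete, here is a minimal Python sketch. It assumes the textbook bit-score relation E = m * n * 2^(-S'), where m is the query length, n is the total length of the database, and S' is the bit score. It leaves out the length corrections the real programs apply, and the function name and all the numbers are made up for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score and the search space size.

    Assumes E = m * n * 2**(-S'). Real BLAST also adjusts the
    lengths for edge effects, which this sketch leaves out.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same alignment (same bit score), two hypothetical database sizes.
small_db = expect_value(bit_score=40.0, query_len=300, db_len=1_000_000)
large_db = expect_value(bit_score=40.0, query_len=300, db_len=100_000_000)

print(f"small database: E = {small_db:.2e}")
print(f"large database: E = {large_db:.2e}")
print(f"ratio: {large_db / small_db:.0f}x")
```

Under this assumption the same hit looks 100 times less significant in the database that is 100 times larger, which is the reason for searching the smallest database that could contain the sequence.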
BLAST uses two schools of thought in terms of
scoring things.
The one that's in a classic kind of BLAST scoring system
is called the Position Independent Scoring System.
That means that the same substitution in your alignment
gets the same score in any position in that alignment.
The model that that really assumes is that all positions
in a sequence are equally likely to change.
That's not a realistic model for the way proteins or DNA
sequences evolve.
Nevertheless, this is used by ordinary BLAST; blastp uses
BLOSUM62 and it includes a concept
of conservative substitutions.
Nucleotide searches are less sophisticated.
They use an identity matrix.
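A rough Python sketch of position-independent scoring for nucleotides follows. The reward of +1 and penalty of -2 are assumptions chosen for illustration, and the function only scores an ungapped alignment that has already been lined up.

```python
def ungapped_score(query, subject, match=1, mismatch=-2):
    """Score two equal-length, already aligned nucleotide strings.

    Every match gets the same reward and every mismatch the same
    penalty, no matter where in the alignment it falls.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return sum(match if q == s else mismatch for q, s in zip(query, subject))


perfect = ungapped_score("ACGTACGT", "ACGTACGT")
early_mismatch = ungapped_score("ACGTACGT", "ACCTACGT")
late_mismatch = ungapped_score("ACGTACGT", "ACGTACCT")
print(perfect, early_mismatch, late_mismatch)
```

The two mismatches cost the same even though they are at different positions, which is what position independent means here.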
The other kind of BLAST, which we're not really gonna
have time to talk about today, but we will see the results
of, uses Position Dependent Scoring Matrices.
In those kinds of scoring matrices, the substitution score
depends on the position in the protein or in the alignment.
This means, of course, that some positions are more
important, less likely to change than others.
And that's a realistic model for the way proteins and other
biological sequences evolve.
Programs that do this are PSI-BLAST, DELTA-BLAST.
Those search a database with a position specific score
matrix or a PSSM.
Reverse PSI-BLAST searches a database of PSSMs and identifies
conserved domains and that's the search that we're gonna
see today that uses these position specific scoring systems.
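For contrast with the position-independent case, here is a toy Python sketch of scoring with a PSSM. The matrix values, the default score, and the function names are all invented. A real PSSM carries a score for each of the 20 amino acids at every position.

```python
# Toy PSSM for a 3-residue motif: one dict of scores per position.
# All numbers are made up for illustration.
TOY_PSSM = [
    {"G": 6, "A": 1},          # position 1: glycine strongly preferred
    {"K": 2, "R": 2, "A": 0},  # position 2: more tolerant
    {"S": 5, "T": 4},          # position 3
]
DEFAULT_SCORE = -3  # assumed score for residues not listed above


def pssm_score(pssm, segment):
    """Sum the position-specific scores for one segment."""
    if len(segment) != len(pssm):
        raise ValueError("segment must be as long as the PSSM")
    return sum(col.get(res, DEFAULT_SCORE) for col, res in zip(pssm, segment))


def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring window in sequence."""
    width = len(pssm)
    scored = [
        (pssm_score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


print(pssm_score(TOY_PSSM, "GKS"), pssm_score(TOY_PSSM, "AKS"))
print(best_window(TOY_PSSM, "MAGKSL"))
```

The same residue, alanine, scores 1 at position 1 and 0 at position 2, so the score depends on where the substitution happens.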
All BLAST programs also include some kind of a penalty
that allows them to incorporate gaps in the alignment.
Okay, so, let's talk for a minute about what these BLAST
search programs are, what they're called.
The nucleotide search programs that you're gonna...
See on the web are blastn and megablast.
Blastn is a traditional BLAST algorithm.
It's the most sensitive kind of nucleotide search.
Megablast, by the way, is the default algorithm and this is
the best program for simple identification, things, species,
annotation, you just have to remember that that's
the default algorithm on the BLAST pages.
When you do searches you may sometimes need to change it
to the more sensitive blastn.
There's sort of an intermediate algorithm that's not used
very much called discontiguous megablast, it's also more
sensitive than megablast.
And then, here are the protein search programs and today
we're really gonna only focus on the position independent
scoring one.
There's blastp which is a protein, protein search
and alignment program.
And then, there are these translating searches.
These are useful for unannotated protein coding regions.
And we'll do one of these today.
There's blastx, which translates your query sequence
and searches a protein database, tblastn, which translates
the database, searches it with a protein, and tblastx, which
translates both the query and the database.
But all of these things are searches at the protein level.
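As a small illustration of what translating a nucleotide query means, here is a Python sketch that produces the six reading frames using the standard genetic code. It is only the translation step, not a search, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate all three frames on both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCATTGTAATGGGCCGC").items():
    print(name, protein)
```

Each of those six protein strings is what would then be compared at the protein level.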
When you search BLAST at NCBI how do you get here?
Certainly, you can get there from our homepage and there's
a link right there at the NCBI home, links to BLAST.
There's also the most common way to do it, for most people,
is simply to type NCBI BLAST in Google and that will take
you right to the BLAST homepage.
This is the current look of the BLAST homepage and by
the way, this is going to be changing fairly soon.
The furniture is going to be rearranged, there's no real
difference in function.
Basically, this is divided into several sections that are
useful in terms of what they do.
One is a way to get access to assembled genomes and I'll
show you that in more detail in a few minutes that you can
pick any genome that you want from here and find the most
complete set of data.
Then, the center part of the page is basically the core
of the blast searches and we're gonna be focused on that
today, what we're calling basic BLAST.
Those links take you to pages that are preloaded to do
those kinds of searches.
Then, finally, at the bottom, there's sort of
a miscellaneous set of search programs or alignment tools
that are related to BLAST, but are not necessarily BLAST.
And we'll visit some of those today.
When you're using BLAST, you need to have something
to search with and that's your query sequence.
I just want to point out a few aspects of this that still
confuse some people.
The BLAST search programs on the web will take FASTA
formatted sequences, like those shown in the upper left
of this slide.
They will also take accession numbers, NCBI accession
numbers, we'll pull those from a database and do a search
with them.
You can also run BLAST directly from the Entrez pages,
nucleotide and protein.
We'll do a couple of examples like that today.
Another point to take away from this slide is you can use
multiple queries in a single search.
Most people know that, but occasionally, we'll run into
somebody that thinks you can only do one sequence at a time.
Each sequence will be searched separately as a BLAST search
if you do that.
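For anyone preparing a multi-sequence query with a script, here is a minimal Python sketch that splits FASTA text into separate records. The two peptides and their names are made up, and the parser is deliberately simple.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two made-up queries in one block of FASTA text.
QUERIES = """\
>query_1 made-up peptide
MKTAYIAKQR
QISFVKSHFS
>query_2 another made-up peptide
MADEEKLPPG
"""

for header, sequence in parse_fasta(QUERIES):
    print(header, len(sequence))
```

Each record that comes out corresponds to one separate search when the whole block is pasted into the query box.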
I want to talk for a minute about something that's a little
bit odd if you think about it.
One thing you can do at NCBI which is useful is to compare
your own sequences without doing a database search at all.
I just want to point out the two options for doing that
at this point in the talk.
One is BLAST 2 sequences.
And so, any BLAST page at NCBI, when you go to the BLAST
form there's a little checkbox that says align two or more
sequences and if you check that box, another box will
open up, another form, another text field will open up
in the form, and you can enter sequence in there.
You can enter many sequences in there.
So, you can do your own little database search against
your own customized database, if you want.
You can also access this under that specialized BLAST
section at the bottom of the page.
Something we know from talking to people at the help desk,
here, is that many times when people are doing BLAST 2
sequences, what they really want to do is a global sequence
alignment, we do have that available in the specialized
BLAST section.
It's not BLAST at all, this is an algorithm called
Needleman-Wunsch and it allows you to compare the entire
lengths of two sequences.
This is a global alignment tool.
It doesn't provide any meaningful statistics about whether
this is a chance alignment or not.
It will just align anything to anything and it will include
all the residues of your particular sequence.
If you want to do global alignment, if you're interested
in knowing what the percent identity is, between two
proteins, this is the tool that you would have to use
to do that.
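To show what a global alignment does, here is a compact Python sketch of Needleman-Wunsch. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap -1) rather than whatever the NCBI tool uses, and it defines percent identity as identical columns divided by alignment length, which is only one of several conventions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(top, bottom):
    """Identical columns as a percentage of all alignment columns."""
    if not top:
        return 0.0
    same = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * same / len(top)


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total, top, bottom, percent_identity(top, bottom))
```

Every residue of both sequences ends up in the alignment, which is the difference from the local alignments that BLAST reports.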
Okay, so, now, those were searches independent of
our databases.
Let's talk a little bit about the BLAST databases.
And this is sort of a complicated part of our system
that you need to sort of understand...
What's goin' on there.
Some of it's a bit chaotic.
The protein databases, which you're searching using
either blastp or blastx, are fairly straightforward.
You do have sort of a comprehensive database called nr.
This is a non-redundant database.
It contains the majority of the protein sequences that
people are interested in in NCBI.
It also has available useful subsets on the database
pull-down list.
RefSeq, Swiss-Prot, PDB.
Just keep in mind there are some sequences that are not
part of the protein nr.
US, and European, and Asian patent sequences that we get
are not in there, they're in a separate database.
Proteins that are coming from metagenomic samples.
Sort of ecological genome thing, those are not in there.
And, also, the proteins from Next-Gen assemblies.
So, these are transcriptome shotgun assembly sequences.
This is a growing set of data, in particular for
the nucleotide side, but there are some TSA proteins,
as well, those are not part of nr.
This is what the nucleotide search page database pull-down
list looks like.
And it's quite a bit more complex.
And I've got a, you can certainly, any time you go to
a BLAST page, you can click on one of those question marks,
get information about what sequences are included.
This is a slide that has more details about that.
A couple things to keep in mind about the nucleotide
databases that make them different than the protein.
The main one is just the default database that we call nr.
I like to refer to it as nt, which is what we call it
on FTP site.
This is not a comprehensive database.
It contains the traditional GenBank sequences, things that
are not bulk sequences, and NCBI RefSeq RNA sequences.
That's actually a very small set of data compared to
everything else we have at NCBI.
It's a useful set, but it's a smaller set.
Some subsets of that which are cleaner are the RefSeq,
RNA database, there's also a 16S RNA database, as well,
that you can search.
So, what's not in nr is the majority of the nucleotide data.
That includes all the bulk sequences,
the RefSeq Genomic Sequences, which include our chromosome
records and our various sizes of assemblies there, patents
are not in there, and some other large sets of data,
including Whole Genome Shotgun sequences, Transcriptome
Shotgun Assemblies, and SRA data, which is really
the largest set of data at NCBI.
It's so large that there's no way to actually search it
as a single entity.
We'll talk a little bit about that when we do a demo.
Another set of databases that are really separate BLAST
pages, if you will, are available through that device.
At the top of the BLAST homepage it lets you search
Genome/Assembly Databases.
Basically, this is a way of getting you a BLAST page that
has the most completely assembled genome for that particular
species, so you can type an organism name in there and then
you can link directly to it and that will take you to
a BLAST page set up to run that search.
Now, I mentioned this earlier in the talk, that the most
important thing you can do when you're using BLAST is to
search the smallest database that's likely to contain
the sequence of interest and that's because the database
gets larger and larger.
As this gets larger and larger, it gets harder and harder
to discern the signal from the noise that's in there.
And that has to do, probably, with the way the expect value
scales with the size of the database.
There's some useful things that you can do, here.
You can use one of the organism limits, you can type
the name of an organism or group of organisms.
You can even exclude groups of organisms that you don't
want to see.
So, here's an example: getting all the bacterial sequences
without the order Enterobacteriales in there.
You can get rid of things like model sequences or uncultured
sequences if you're working with bacteria.
You can even specify things like a molecular weight range.
So, any Entrez query that works in the protein
and nucleotide database will also work on this page.
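As a sketch of what such a limit looks like when written out as an Entrez query, here is a small Python helper that composes the string. The helper name is hypothetical, and it only uses the [Organism] field tag with the Boolean operators OR and NOT.

```python
def organism_limit(include, exclude=()):
    """Build an Entrez query limiting to some organisms, minus others."""
    query = " OR ".join(f"{name}[Organism]" for name in include)
    if len(include) > 1:
        query = f"({query})"
    for name in exclude:
        query += f" NOT {name}[Organism]"
    return query


# All bacteria except the order Enterobacteriales.
print(organism_limit(["Bacteria"], exclude=["Enterobacteriales"]))
```

The resulting text is the kind of thing that can be typed into the Entrez query field on the search page.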
Okay, a couple of things to finish up, here, and then we'll
pause and see if there are any questions.
One of the things that, as a person who manages BLAST help
or sits on the BLAST help desk, here, it's very important
to me and to all the other people who sort of support
BLAST is that you understand that there is an identifier
for your search and that's called the request identifier.
If you look at your BLAST results it's at the top.
There's RID, which might not be clear what that stands for.
But that identifier is the unique identifier for your
results, so if you have a problem with BLAST, you can write
to us and give us that identifier.
If you click on that link, it will give you a URL like
the one in the middle of this slide and you can just
paste that in a web browser and get your results back
or you can send it to somebody you know that you want to
show them your results or you can send it to us.
We will keep your results on the servers at NCBI for about
36 hours.
You can see them through the recent results link that's on
the BLAST page.
They will also show up in your My NCBI.
It doesn't make them last any longer, they still last
for 36 hours.
So, keep this in mind and we'll show you that managing that
a little bit later on today when we do a search.
So, be sure to send us an RID if you have a question about
a particular search.
We can look up your results and see exactly what
the settings were and we can figure out if there's a bug
or if there's something we can help you with to make
the search work more efficiently.
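Here is a minimal Python sketch of building a retrieval link from an RID. It assumes the parameter names of the public BLAST URL API (CMD=Get, RID, FORMAT_TYPE), which may not match the exact URL shown on the slide, and the RID below is a placeholder rather than a real search. The sketch only builds the string and does not contact the server.

```python
from urllib.parse import urlencode

BLAST_CGI = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def results_url(rid, format_type="HTML"):
    """Build a link that asks the BLAST server for stored results."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return f"{BLAST_CGI}?{urlencode(params)}"


# Placeholder RID; substitute the one shown at the top of your results.
print(results_url("EXAMPLERID01"))
```

A link like this only works while the results are still on the server, so about 36 hours for an ordinary search.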
Another thing I want to mention is that BLAST offers
a number of download options.
This is actually an older screenshot.
We've added a couple of more structured formats, here.
Just be aware that they're here.
These are the kinds of things that you'll want if you need
to save BLAST results or to save huge sets of results
to try to parse out information from them because there's
structured formats that you can parse with a script.
Or you can use some of the utilities that come with BLAST
to re-display them.
And the hit table is particularly popular, even with people
who don't script, because that can be loaded into Excel.
The .CSV version of it in particular.
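For those who do script, here is a short Python sketch of reading a hit table. It assumes the standard 12-column tabular layout in comma-separated form with no header row. The three rows are invented sample data, and the column names are my own labels.

```python
import csv
import io

# Assumed column order of the standard 12-column hit table.
COLUMNS = [
    "query", "subject", "pct_identity", "length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "evalue", "bit_score",
]

# Invented rows standing in for a downloaded CSV file.
SAMPLE = """\
query1,subjA,98.5,380,5,0,1,380,1,380,0.0,750
query1,subjB,45.2,120,60,3,10,129,5,121,2e-20,95.1
query1,subjC,30.0,40,28,0,50,89,200,239,4.5,28.9
"""


def read_hits(handle, max_evalue=1e-6):
    """Return hits whose E-value is at or below max_evalue."""
    hits = []
    for row in csv.reader(handle):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["pct_identity"] = float(hit["pct_identity"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


for hit in read_hits(io.StringIO(SAMPLE)):
    print(hit["subject"], hit["pct_identity"], hit["evalue"])
```

With a real download, `open("hits.csv", newline="")` would take the place of the `io.StringIO` wrapper; the file name is just an example.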
Okay, so let me talk about some of the specialized
BLAST services, then we'll stop for questions.
Bonnie's lookin' at me 'cause I said we'd stop
(laughter) for questions next.
And these are ones we're gonna demonstrate in a few minutes.
Primer-BLAST is our primer designer and specificity checker.
It takes advantage of free software, Primer3, to design
the primers and it uses annotations on our sequences
if you want to design primers that do things like span exon
boundaries and things like that, and it uses BLAST to make
sure your primers are specific.
MOLE-BLAST is a tool, it's very specialized and I won't
demonstrate that today.
We have done webinars on this particular topic.
This is a way of clustering sequences and finding the taxonomic
placement of things like 16S sequences.
We use it internally in our taxonomy group to help
identify things.
Two special protein services that I will demonstrate today
are COBALT, which is our multiple alignment tool.
COBALT stands for Constraint Based Alignment Tool.
It does a Protein Global Multiple Alignment.
And just like Needleman-Wunsch, a global alignment tool
like this requires that you input sequences that you know
are related to each other, 'cause otherwise, you'll just
get a mess.
The beauty part about COBALT is it lets you take the output
from a BLAST search and feed it into COBALT, so you already
know those sequences are related and you can run it as
an extension to your BLAST search.
And then, I'll give you a quick demo of a new tool that's
kind of a, something we're sort of trying out.
That's a rapid protein identification tool.
It's called SmartBLAST.
It uses a very rapid approach to searching that uses k-mer
content of sequences to find matches.
It's very quick and it might replace some of our internal
mechanisms for neighboring things like proteins to give you
a live search, say, if you wanted to find a related protein
on the web.
And it produces, also uses COBALT to produce a multiple
alignment and a protein tree.
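To give a feel for what comparing k-mer content means, here is a toy Python sketch that measures how many short words two sequences share. This is not SmartBLAST's actual method, only an illustration of the general idea. The choice of k=3, the Jaccard measure, and the peptides are all assumptions.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping words of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=3):
    """Jaccard similarity of the two k-mer sets (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# Two made-up peptides that differ at one position.
print(kmer_similarity("MKTAYIAK", "MKTAYLAK"))
print(kmer_similarity("MKTAYIAK", "GGGGGGGG"))
```

No alignment is computed at all here, which is why this kind of comparison can be very fast.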
Okay.
So, now.
I'll just mention that there is a help link on the BLAST
pages and this has lots of good information.
Including links to the handbook chapters, the help
documents, and the YouTube channel, which has a lot of
BLAST tutorials.
In fact, this will go on that, in addition to the webinars
playlist, it'll go on the BLAST tutorials playlist on our
YouTube channel.
So, Bonnie says we have three questions.
- [Voiceover] The first question...
Is whether, how many, what's the limits for multiple
sequences in the query data set?
- [Voiceover] Well, that's a question that we get often
at BLAST help in particular.
You're allowed, the way this works is you're allowed one
hour of CPU time, so, that's processing time.
It's not real time.
Now, that could be just a few minutes of real time,
depending upon how many processors your search runs on.
So, there is no fixed limit based on number of sequences,
number of residues, but you will run up against it fairly
quickly if you use large numbers of sequences that have
a lot of hits in the database.
I'm afraid I can't give you a concrete answer.
If people write in with proteins, I would say no more than
100 at a time for nucleotide sequences, depends on
the length because that can vary a lot.
But if you're trying to do something like search with
chromosome one against the nt database, Bonnie's laughing,
but this happens all the time.
Don't do that, it's not going to happen for you.
- [Voiceover] Because chromosome one is how long, Peter?
- [Voiceover] I don't remember, it's big.
- [Voiceover] Okay.
(laughter)
The largest human chromosome, so.
- [Voiceover] There are lots of things that you can do
to sort of ameliorate that problem, but if you have a need
to BLAST hundreds of thousands of sequences, you might need
to think about some other options and I can point some
of those out to you and they're available on the help desk
and in the developer options.
Okay, do we have another one?
- [Voiceover] The question was, is the reference...
Is the protein reference database not included on nr?
And I wanted to make sure that this person meant the BLAST
protein nr database and I didn't get a clarification.
- [Voiceover] Well, the answer's really the same on--
- [Voiceover] Okay. - [Voiceover] Both.
- [Voiceover] Okay.
- [Voiceover] The RefSeq, well, as long as we're talking,
let's address them separately.
On the protein side, RefSeq proteins are included in nr.
That's easy.
The nucleotide side, the messenger RNA, the transcript
sequences are included in the nt nr database, the nucleotide
default database.
The larger reference sequences, the assemblies,
like chromosomes and contigs
and things like that, those genomic sequences are not
included in nt nr.
And there was a third question?
- [Voiceover] The person says that the BLAST results
are missing right away after they did the BLAST
and they can't find them.
- [Voiceover] Not sure, not sure I understand the question.
- [Voiceover] I know, I didn't either, completely.
But I hoped that you on the BLAST help might have seen
that before.
- [Voiceover] Yeah, no, I don't know what, what that person
means by they can't find them.
- [Voiceover] Okay.
- [Voiceover] Maybe you can rephrase that and we'll come
back to it later.
I'd like to stop here and go on to do a few live demos.
What I wanted to do is to do a few searches.
And we're going to...
We've got, actually, a document on the FTP site that goes
over what we're planning to do in the live searches.
I'm gonna do a couple of things with a mammalian protein
called creatine kinase, B, the brain-type kinase.
We're gonna do some BLAST searches with that.
Then, we're gonna do a translating search against a fish,
TSA database, to find the corresponding nucleotide sequence
for a protein.
We'll use a different protein for that.
We'll use glycine dehydrogenase.
Then, actually, I will give a quick demo of SmartBLAST, using
an open reading frame that I got from that fish sequence.
And then, we're gonna do two other searches.
One is to show you some things about the nucleotide system
by searching the human genome with a transcript from
Macaca fascicularis and we're gonna design some primers
using Primer-BLAST.
There's another example in here, I'm using SRA, but we won't
have time to do that one today, I'm pretty sure, 'cause
I wanna spend some time on those first several.
Okay.
So, what I wanna do is we're gonna start, actually,
not in BLAST, but we're gonna start in the protein database.
'Cause I'm gonna set some searches up for you.
Notice that I also have, I've mentioned those BLAST RIDs
and I have those in this document and they're stable,
for a while, anyway.
We have the ability to preserve these at NCBI.
At some point they will go stale because the underlying
things change and they don't work anymore.
But I can use those to retrieve my results, save us some
time, in particular for the first search result setup,
and I'll show you by just retrieving our results why
it's important to limit your database.
I'm gonna go over here to the NCBI homepage and I'm gonna
change my database to protein.
I'm just gonna retrieve the sequence that I happen to know
the accession number for.
This is a creatine kinase from a human.
I know the accession number, this is a reference sequence
accession number.
I'm gonna retrieve a human reference sequence.
My goal, ultimately, with this search is to try to find
the collective set of sequences to do multiple sequence
alignment with them and I'm gonna try to collect mammalian
creatine kinase.
So, notice that for any protein sequence like this I can
click the Run BLAST link, here.
And, really, all it did for me was it just loaded the BLAST
page for me with the accession number in the query
box, there.
And I'll talk about some of the settings on the page.
I normally have slides for that, but I thought I'd just do
it live because I think we can do things a little bit more
expediently that way.
One of the first things you need to do when you come here
is to figure out what database you're searching.
And here's the pull-down list.
Now, I'm gonna leave it set to nr for a moment.
I'm actually not gonna run that search for you, I'm gonna
retrieve the results because that's gonna take awhile.
And this is a good example just to illustrate for you
the problem that you're gonna run into, now, with the size
of the protein database being so large and so heavily
weighted in sort of two areas.
One of those areas is the vertebrate protein and the other
area that it's heavily weighted in is bacterial protein.
And that has to do with the efforts to sequence
and annotate things.
So, we'll come back to limiting this in a few minutes.
There are also some settings below the fold down here,
called Algorithm parameters, so, I wanna just show you
that briefly because there are some things down there that
I think you may need to change sometimes.
One of the important things, here, is there are two parameters
that govern your output.
One of them is the maximum target sequences.
But no matter what else I do, BLAST will not show me more
than 100 sequences.
It's a little bit worse than that, though.
What it means is that BLAST will not collect more than
100 sequences.
On BLAST, there's a two-stage algorithm, so, if you have
this set too low, you can wind up missing some
important things.
So, you can increase this.
And, in fact, the search that I'm gonna show you, I've set
it to 5,000.
And, in fact, that wasn't enough, as you'll see in a minute.
You, also, will probably want to adjust your expect
threshold, ordinarily.
Notice that it's set to 10.
That means that the worst score that I'm gonna show you,
I would expect to see 10 hits that are that good or better
by chance.
That's not something that's terribly interesting to you.
So, you can set this to some other value.
There are lots of good arbitrary values to set, here.
One that's quite common for protein searches is this one.
Sometimes, 10 to the minus six, so, I think that's a useful
one to use, there.
You could set it to one times 10 to the minus three.
But that means that at least there's some kind of, you know,
possibility that that's not due to chance.
The other thing to notice that I want to point out, here,
now, that's a recent change in BLAST for protein, one of
the shortcuts that BLAST uses is it doesn't find or try
to extend every match, it finds the matches of certain
size and then starts to extend them.
The default once was three, for protein searches.
It's now six.
That makes the searches faster, but just be aware if you
come here and you find that you don't get exactly the same
results that you got, say, last year at this time, it might
be because of the word size.
It's gonna affect sort of marginal hits in many cases.
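Here is a simplified Python sketch of why word size matters. It only looks for exact shared words between a query and a subject, whereas real protein BLAST also accepts similar (neighborhood) words above a score threshold, so treat it as an illustration of the seeding idea only. The peptides are made up.

```python
def seed_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word matches."""
    # Index every word in the subject by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


# Made-up peptides that differ by one substitution in the middle.
QUERY = "MKTAYIAKQRQ"
SUBJECT = "MKTAYLAKQRQ"

print(len(seed_hits(QUERY, SUBJECT, word_size=3)))
print(len(seed_hits(QUERY, SUBJECT, word_size=6)))
```

With the longer word there are fewer places to start an extension, so the search is faster but a weak match can be missed entirely.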
Those are the things I wanted to show you, here,
for the protein pages.
And I could run this.
It may take awhile to run.
So, what I'm gonna do is go back over here and retrieve
my RID.
And so, I've got one, here, that I ran that has this
RID, here.
There's a complete URL there.
I can also just show you that I can get this from my recent
results, I'll just copy that.
I'll go over here to this link that says recent results
and that's available on any of the BLAST pages.
And I'll paste this in here.
Now, I actually did do some filtering on this.
So, I eliminated some of the model organism searches
and I'll come back to that in a minute when we go back
and show you with this applied.
Resubmit it, we can see all those settings.
The main point in showing you this is that, first of all,
we did run a conserved domain search on this.
It has the phosphagen kinase conserved domain, the creatine
kinase, so, that's what I would expect.
If I didn't know what this protein was it would give me some
ideas of what the function of this protein is.
I've maxed out my display, here, for the graphical overview.
It holds 100.
I can change that if I want to.
And then, down here...
I've got my, what we call the BLAST descriptions.
These are the hits, essentially.
And they're sorted for me by E value.
Notice I have all these E values of zero.
They're not really zero, they're just a very small number.
And I'm trying to reach my cutoff.
The number was set to, I think when I ran the search
I probably left it at 10.
I might need to reformat this, because I don't have
everything, so, let's make sure that I get all
my results back.
What I can do, here, is go back through to the formatting
options, we'll use this more than once today.
It's still set to 100 descriptions, but notice that I can
get up to 5,000, which that's what I originally requested.
And so, you can see what happens very quickly is that I
get a tremendous number of hits that are from all kinds
of lifeforms.
I should be able to take this all the way back to
the bacteria because it has a conserved domain.
So, here, I'm back into some things that are insects
and have arginine kinases in them.
But this is an overwhelming amount of output and the main
point that I wanted to make to you with this search is that
there's no reason for you to do this because you probably
don't need all this.
You probably want to do something much simpler and restrict
to a particular set of organisms.
So, this just goes on and on and on.
5,000 hits.
And by the time I get to the bottom of my descriptions,
I still haven't reached my E value cutoff.
And that means that I'm missing significant hits.
Maybe that significant hit is one that I'm interested in.
The main take-home from this is to make sure you're limiting
your database to something that you're interested in.
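The reasoning here can be written down as a small check. This Python sketch flags a result as possibly cut off when the number of hits equals the maximum target sequences and the worst E-value is still below the expect threshold. The function name and the sample numbers are hypothetical.

```python
def possibly_truncated(evalues, max_target_seqs, expect_threshold):
    """Return True when the hit list was probably cut off by the hit cap.

    If the list is full and even the worst hit is still better than
    the threshold, more significant hits may exist beyond the cap.
    """
    if len(evalues) < max_target_seqs:
        return False
    return max(evalues) < expect_threshold


# A full list of 5,000 hits whose worst E-value is still tiny.
full_list = [1e-80] * 4999 + [1e-30]
print(possibly_truncated(full_list, max_target_seqs=5000, expect_threshold=10))

# A short list that ran out of hits before reaching the cap.
short_list = [1e-120, 3e-40, 2.5]
print(possibly_truncated(short_list, max_target_seqs=5000, expect_threshold=10))
```

When the check comes back true, the usual fix is a smaller database or an organism limit rather than a bigger cap.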
And, I actually, this one I already did limit a little bit.
Let me show you what I did.
If I have a result like this and I want to resubmit it,
I can click this link that says edit and resubmit.
This will take me back to the page that shows me exactly
what I did.
So, I did a search with creatine kinase.
I did this yesterday.
I searched nr.
And I did get rid of some kind of sequences.
I got rid of models and I got rid of uncultured
environmental sample sequences.
I did ask for 5,000 hits.
I didn't restrict my E value and I never reached it.
I had an E value set here for 10, but I never reached
that at all.
I never even got to an E value that wasn't significant
because even with 5,000 hits, I didn't find all those
significant matches.
This, down here, was on by default.
This is a filter for the kinds of sequences that violate
BLAST statistics, which is low complexity regions.
So, if I wanna fix this to make it a little bit more
manageable as a search...
Probably the best thing that I can do is to run it against
a particular group of organisms.
So, if I'm interested in making a little phylogenetic tree
or a protein tree, then I would want to collect
sequences for my group of organisms.
I'm gonna collect mammals, which is a much smaller set
of data.
And let's (sighs)
look for a better control set of data and let's do
the reference protein database.
I've now made the database smaller and I've made this
RefSeq protein database even smaller
by doing two things: restricting by organism
and getting rid of the model sequences that
our pipeline is producing.
And the other thing I might want to do is to go down here
and to change my expect threshold like we did earlier.
So, I could make it to something fairly significant.
And these are just arbitrary cutoffs.
You can go back and change them if you find that you're not
getting sequences that you want or you're getting things
that you don't want.
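The same restrictions can be written down as a scripted submission. This is a minimal sketch, assuming the parameter names of the NCBI BLAST URL API (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, EXPECT, HITLIST_SIZE). The query fragment, the E value and the hit count are placeholders, not the values used in the demo, and the web form's option to exclude model sequences is not represented here. The code only builds the request string; it does not contact the server.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_params(query, organism="Mammalia", expect=1e-6, max_hits=500):
    """Return URL-encoded parameters for a blastp search restricted by organism."""
    params = {
        "CMD": "Put",                              # submit a new search
        "PROGRAM": "blastp",
        "DATABASE": "refseq_protein",              # smaller, curated database
        "QUERY": query,
        "ENTREZ_QUERY": f"{organism}[Organism]",   # limit to a group of organisms
        "EXPECT": expect,                          # arbitrary cutoff, adjust as needed
        "HITLIST_SIZE": max_hits,
    }
    return urlencode(params)


# Placeholder protein fragment, for illustration only.
encoded = build_put_params("MPFGNTHNKFKLNYK")
print(f"{BLAST_URL}?{encoded}")
```

If the cutoff turns out to be too strict or too loose, only the expect and max_hits arguments need to change.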
I will try to run this one live.
Let's see how long it takes.
Immediately, I have my conserved domain results.
That's partly because, for the sequences in the database,
we already know what conserved domains are on them.
Remember, that's a position specific kind of search.
And if I think this is taking too long, I can go back
over here because I have run this.
And get my request ID out of my old document.
And if you want to come back and retrieve these, they will
work for a while.
Probably several months.
They won't work next year.
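Retrieving a finished search by its request ID can also be scripted. A minimal sketch, assuming the BLAST URL API's CMD=Get, RID and FORMAT_TYPE parameters; the request ID shown is a made-up placeholder, and the code only builds the URL.

```python
from urllib.parse import urlencode


def results_url(rid, fmt="Text"):
    """Build the URL that asks the BLAST server for the results of one request ID."""
    base = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"
    return base + "?" + urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": fmt})


# "ABC123" is a hypothetical request ID, not a real one.
print(results_url("ABC123"))
```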
Let's see where we are here.
I didn't need to do that because this is done.
So, now, instead of maxing out everything, I have 32 BLAST
hits and if you'll look at my graphical overview, here,
you can see there's sort of two kinds of hits.
There are these longer ones.
This is the one that we started with.
These are the cytosolic isoforms or genes
for creatine kinase.
And notice that there are some that are going
to the mitochondria and you can probably guess that
the reason that they don't align is because there is
a leader peptide at the beginning, here, a signal peptide
that tells the cell to send that to the mitochondria.
Looking at my output, here.
You can see the organisms in the list.
My E value was set pretty stringently, but these are all
very similar proteins, so, I got them here.
That looks pretty good.
I wanted to look at an alignment just to see how BLAST does
an alignment.
When we look at this one from (mumbles).
If I go here, for the pig, a U type,
it jumps me down to the alignment,
and you can see the way BLAST shows the protein alignment.
Notice this is a local alignment.
So the first 11 residues of my query didn't align,
and the first 44 of my subject sequence
didn't align with the query.
And then you can see how the center line is
sort of reflecting the scoring system.
These plus signs represent positive scores
for substitutions in the underlying BLOSUM62 matrix.
The blank spaces mean that that's a negative
or a zero score in that matrix.
The identities, of course, are given a letter there.
I don't see any gaps in this particular alignment,
but if there had been some of these rendered at
oh yes, there's one right here.
There's a gap right there.
BLAST inserted a gap there,
that was cheaper than aligning residues incorrectly
in that position.
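The center line described here can be rebuilt from a scoring table. This is a minimal sketch: the table holds only a handful of BLOSUM62 entries copied by hand for illustration, any pair missing from the table is treated as a blank, and the two sequences are invented.

```python
# Hand-copied subset of BLOSUM62, for illustration only.
SCORES = {
    ("K", "R"): 2, ("D", "E"): 2, ("I", "V"): 3,
    ("S", "T"): 1, ("L", "D"): -4, ("A", "G"): 0,
}


def pair_score(a, b):
    """Look up a substitution score in either orientation; None if not in the subset."""
    return SCORES.get((a, b), SCORES.get((b, a)))


def midline(query, subject):
    """Letter for an identity, '+' for a positive score, blank otherwise (gaps included)."""
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")
        elif q == s:
            out.append(q)
        else:
            score = pair_score(q, s)
            out.append("+" if score is not None and score > 0 else " ")
    return "".join(out)


query = "MKIS-LA"    # invented aligned query, '-' marks a gap
subject = "MRVTALG"  # invented aligned subject
print(query)
print(midline(query, subject))
print(subject)
```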
So I have an ad here for SmartBLAST,
and we can do that next with a different protein.
After we do another kind of protein search.
Does anybody, let's pause here for a minute, Bonnie,
and see if there are any particular questions
about this one.
- [Voiceover] Not about the clarification
from the missing BLAST cells,
but I wanted you to address was
sometimes the network crashes during the,
I just wanted to see if you knew,
or could address.
When you make the BLAST request,
it gets to the NCBI servers.
The NCBI servers can run it,
and the results should be in the results,
recent results window.
But if there is a network crash,
at what part of the process would that
prevent the BLAST?
- [Voiceover] I don't know.
- [Voiceover] Okay.
- [Voiceover] If the person has a particular issue,
they should write to us.
We can help them solve the problem,
but I can't completely understand what the problem is
in the question.
Write to BLAST help or write to me,
with enough details about exactly what happened.
What your search was.
If you got to the point of getting an RID,
you know, send that to us, too.
- [Voiceover] There is another question about is
5,000 the optimal value for protein,
or does it depend on the size of the query sequence?
- [Voiceover] No. You mean the 5,000?
It depends on what you're trying to do.
The main thing is to make sure that the hit limit isn't
what's limiting your results,
that the expect value cutoff is what's limiting your results.
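One way to check which of the two limits was actually hit is to compare the worst E value in the list against the cutoff. A minimal sketch with hypothetical function and argument names; the E values below are invented.

```python
def limited_by_hit_count(evalues, max_hits, expect_threshold):
    """True when the hit list filled up before the E-value cutoff was reached."""
    if not evalues or len(evalues) < max_hits:
        return False  # the list did not fill up, so the cutoff did the limiting
    # The list is full; if even the worst hit is below the cutoff,
    # more significant hits were probably left out.
    return max(evalues) < expect_threshold


# Invented numbers: a full list whose worst hit is still far below the cutoff of 10.
print(limited_by_hit_count([1e-100, 1e-80, 1e-50], max_hits=3, expect_threshold=10))
```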
So I want to do a different kind of example.
We're gonna use another protein search just to show you
what a translating search is useful for.
And I'm gonna go back up here to the BLAST homepage.
Notice that I can choose from,
here I'm gonna take a protein sequence.
I'm gonna take a highly conserved protein,
and I'm gonna try to find it in one of these
transcriptome shotgun assemblies.
These are assemblies from next-gen
RNA-seq data, for organisms
that we basically have no other kinds of data for.
Many of them have no protein data
at NCBI.
But in those assemblies of their transcripts,
you can identify the corresponding regions.
So let me just show you that real quickly.
I'm gonna go to tblastn.
And we're gonna use as a query sequence,
one of the ones that's in my output over here.
This is a glycine dehydrogenase.
So again, the query sequence, that's not that hard to do.
But notice that my databases here are different.
So I'm using a protein query but my databases
are nucleotide databases.
This is a growing set of data called
transcriptome shotgun assembly.
In many cases those assemblies are also represented in SRA,
so you could potentially search SRA,
as long as you knew what the experiments were
that you needed to search.
Transcriptome shotgun assembly database
is set up the same way that WGS is on BLAST.
When I choose that database,
notice that I have to do something else.
So choosing an organism for example,
is not optional, I have to choose one.
So the organism that I'm gonna choose here,
striped bass, okay?
So that's an organism that's near and dear
to a lot of people's hearts around this part of the world.
Chesapeake Bay.
So there's a transcriptome shotgun assembly
for that organism.
If I want to, I can change my expect threshold,
just as a matter of course
to be something a little bit significant.
I'll leave those settings the way they are,
and let's see if this is fast.
These searches are a little more burdensome
than doing an ordinary blastp search,
because what BLAST is doing, is it's translating
the database in all six reading frames on the fly,
to give you a protein.
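The six-frame translation itself is simple to sketch. This is a minimal illustration using the standard genetic code; it is not how BLAST implements it internally, and the short DNA string is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frames(dna):
    """Translate three frames on each strand, keyed as +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each database sequence yields six protein strings to compare against the query, which is why a translating search costs more than a plain protein search.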
So I have a very nice hit here.
Some other sort of minor hits.
If I click down here,
here's my match and this is a translation
of my subject sequence.
It's a pretty decent match,
to give me the pretty good idea that
this is a (mumbles) protein.
When I go here, this doesn't take me to the
nucleotide database, these are stored
in a separate system.
This is going to our WGS browser,
which also contains the transcriptome shotgun assemblies.
I'll go ahead and open that on a new tab.
So what I can do there is to get the FASTA sequence,
but this exists basically only here,
I can't get this out of the normal nucleotide database.
There's a master record for this here.
I could download the entire set if I wanted to.
Now what I did, actually, was I translated that,
and I got a little protein sequence,
and this would be a common use case for smartBLAST.
I have a protein that I generated
from some kind of project like this
and I want to identify it.
Let's see what happens when I do smartBLAST.
I'm gonna go back over here,
this is my open reading frame that I got from that sequence.
I'll show you a couple of things about smartBLAST
that are kind of useful.
One is the rapid identification.
The other is that it does give me a little bit
more look-back time, because the database is smaller,
and I don't run into that problem of being limited
by the number of proteins that I can collect.
I'm gonna go back here to the BLAST homepage.
I'm gonna retrieve the smartBLAST link here.
I'm gonna paste that protein sequence in.
So I searched with a bony fish sequence,
which is labeled as unknown, and notice that
it places it in this nice little protein tree for me.
If I wanted to know what this protein was,
then I have no question that this is
a glycine dehydrogenase, decarboxylating,
mitochondrial form. So I have other fishes there.
This yellow croaker, the damselfish, guppies, the zebrafish.
Notice the hits are in two different colors.
There is a reference database
or what we call landmark database of proteomes
from well-studied organisms.
The house mouse and zebrafish are two of those.
If you go to the help tab,
it defines what all the other organisms are
that are in that database.
In addition, it gives me the best hits
from the NR database,
so that's where this large yellow croaker,
the damselfish and the guppy sequence come from.
So that's very good and it was very fast
at identifying that.
The other thing that's kinda useful about this,
so here are my best hits.
Top five, but these additional hits are interesting
because they let me look back.
Because the number of organisms in that reference database
is limited, I can see much further back
than I could in a search against NR.
So here are some bacteria, Thermotoga maritima.
In fact, if I wanted to find Escherichia coli,
I could just do a find in page, make it easy.
So here's my match to E. coli,
which I defy you to find that easily
on a search against NR,
because what happens is you're gonna have to
get many, many thousands of hits to see that one.
It's still a significant match,
it's only 33 percent identical
to the protein that I started with.
So I'm gonna change gears and do one thing with BLAST,
and just show you some of the formatting options
that are useful.
Alright, so let me go over here to the
BLAST page again.
Actually, what I think I'll do is I'll
start with a nucleotide database,
which is one way of doing this search
that I'm gonna do now.
I'm gonna go back here
and I'm gonna do a search
using a sequence that's actually got a problem with it.
We can see that problem very easily by using
one of the formatting options in BLAST.
So I'm gonna go ahead and copy that.
That's a nucleotide sequence.
So this one has sort of a funny definition line.
It was some kind of a high throughput
cDNA sequencing project.
Because it's similar to CDC20,
but there actually is a problem with this sequence.
It is from a monkey.
I'm gonna click the run BLAST button here,
and I'm gonna throw this into
the standard nucleotide BLAST page,
'cause there's a shortcut here that's pretty handy.
So let me search the human genome.
I'm gonna do that.
Now this is my first nucleotide search today.
Notice that this is set to megaBLAST.
That means that its not very sensitive,
and I won't have the kinds of problems
that I have with protein where I'm looking back very far
and seeing lots and lots of hits.
This is an example really of an identification search
and a search that looks at annotation problems.
We'll go ahead and run that.
That was very fast, because we have an indexed search
of the human genome.
We actually have two human,
two sets of data with two genomes.
We have the transcripts,
I hit the corresponding CDC20 transcript,
and then we have hits to two different genome assemblies.
We'll focus on the primary assembly.
So if we look at the alignment to CDC20,
I can go here and take a look at that.
Here's some mismatches at the beginning.
Mismatches, but it's a very close alignment.
One of the things I want you to notice
is there's a little gap right here.
That's near the three-prime end of the alignment.
That might not be a big problem if it's in the UTR,
but what if it's in the coding region?
It could cause a frame shift.
So we could see, if we can add the coding regions onto this
then we can see what's happened there.
So that's the main thing I came here to show you.
If I go to the formatting options,
I can add the CDS feature,
which will pull the coding region features
of the mRNA translations
from the nucleotide sequence database.
I could also render this in a way
that's going to give me some indication
of where there are differences.
It's kinda hard to see where the mismatches are
and where the gaps are, and things like that.
Let me reformat that.
So here's my N-terminal methionine,
there are some changes in the coding region here,
they're kind of subtle things.
(mumbles) for Alanine.
Some of them are silent substitutions here.
But here I have the problem.
So it looks like the gap that was inserted,
here it looks like there must have been a sequencing error,
probably, in the sequence.
So it threw a frame shift in here,
beginning here, there's a different
reading frame translation of the sequence.
So these kinds of formatting options
are very useful for seeing those kinds of things.
The other thing I want you to see here is
the way that you can look at hits to the genome.
There is a hit to a pseudogene,
I won't go into that right now.
You can look at that later if you want to.
I'm just gonna look at the main hit,
which is on chromosome one.
We're down in the alignments here,
and so we're sorting this by E value.
We can sort this not by E value,
but by query start position.
That will give us basically the exons in order.
Now they didn't line up exactly right,
because BLAST doesn't know about splice junctions.
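Re-sorting the hits by query start position is easy to sketch on tabular results. The coordinates and E values below are invented, and the Hsp record is a hypothetical stand-in for one row of BLAST output.

```python
from collections import namedtuple

# Hypothetical record for one aligned segment (HSP) on the genome.
Hsp = namedtuple("Hsp", "query_start query_end subject_start evalue")

hsps = [
    Hsp(410, 620, 52_300, 1e-95),   # invented values
    Hsp(1, 180, 50_100, 1e-60),
    Hsp(181, 409, 51_200, 1e-80),
]

by_evalue = sorted(hsps, key=lambda h: h.evalue)      # default ordering of the report
by_query = sorted(hsps, key=lambda h: h.query_start)  # exon order along the transcript

for h in by_query:
    print(h.query_start, h.query_end, h.subject_start)
```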
The other thing that I can do here,
which is very useful,
is to just display this in the graphical sequence viewer,
so we can see what's going on.
I'm gonna display this here.
My BLAST hits are gonna be loaded
in the graphical sequence viewer.
In this case, it sort of inverted my BLAST hits,
so it's showing me the subject sequence
with the query sequence aligned to it this way.
So, you can quite clearly see
the exon intron structure here
with the mismatches highlighted.
We're gonna use the graphical sequence viewer
just to zoom in on the five-prime end.
You'll notice that actually I missed the first exon.
One of the things you can do as an exercise,
or to convince yourself that it's true:
if I use blastn, which is a more sensitive kind of search,
I will match that first exon.
I will be able to align to it.
MegaBLAST, which is what we used,
takes a shortcut with a very large word.
It uses a word size of 28.
So if there are any mismatches in that 28-nucleotide word,
it won't find a hit at all.
For this first untranslated exon,
it doesn't find the match with megaBLAST,
but it will with blastn.
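The effect of word size can be sketched by looking for the longest run of exact matches in an alignment. This is a simplified model of seeding, not BLAST's actual algorithm. The word size of 28 comes from the talk; 11 is assumed here as the blastn default, and the sequences are invented.

```python
def longest_identical_run(query_aln, subject_aln):
    """Length of the longest stretch of consecutive identical, ungapped positions."""
    best = run = 0
    for q, s in zip(query_aln, subject_aln):
        if q == s and q != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def can_seed(query_aln, subject_aln, word_size):
    """A search can only start an alignment from an exact word of at least word_size."""
    return longest_identical_run(query_aln, subject_aln) >= word_size


# Invented 40-nucleotide exon with a single mismatch in the middle.
query = "ACGT" * 10
subject = query[:20] + "T" + query[21:]

print(can_seed(query, subject, word_size=28))  # megaBLAST-sized word
print(can_seed(query, subject, word_size=11))  # assumed blastn default
```

A short exon with even one mismatch in the wrong place leaves no 28-nucleotide exact word, so the larger word size misses it while the smaller one still finds a seed.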
Okay, so I think we need to wrap up pretty soon.
Why don't we pause here, Bonnie,
and see if there's any questions.
If not, I can do the primer-BLAST example fairly quickly.
Do you have anything?
- [Voiceover] Well, you can set up the primer-BLAST
example while I ask this question,
because it feeds right into it.
That is, is there no RID for primer-BLAST?
- [Voiceover] There is an RID for primer-BLAST.
I'll show you that when we do it.
- [Voiceover] Okay.
- [Voiceover] So what I'm gonna do is...
We're gonna design primers for a particular exon of a gene,
so I'm gonna work with BRCA1.
I'm gonna cheat and use a shortcut that's built into
a lot of places.
It's the Gene sensor.
I'm actually gonna search PubMed,
and notice that there's this ad for Gene.
Um, I wanna do this in nucleotide.
It also has one, but the advantage of the one in nucleotide
is that it gives me access to the genomic sequence.
This is a RefSeq gene record,
I can highlight sequence features here.
When I click that link, I'm able to sort of
browse the features.
Let me go ahead and get the exons here.
So there are 23 exons of BRCA1.
I'm gonna pick exon 15,
as the thing that I want to amplify.
This is a common kind of task that people have,
they need to get primers that will amplify
an exon of a gene.
But notice that I can now display this exon
in FASTA format.
I want to design primers that will amplify this.
What I can do is send this directly to primer-BLAST,
with a template sequence.
So this will design primers that will amplify
within that exon, but if I want to,
I can make it so that it starts before the beginning
of the sequence and ends after it,
so the primers bind outside of the exon.
I can go ahead and copy that, move it over.
Likewise, I can change the endpoint over here.
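Working out where the primers are allowed to bind is simple arithmetic on the exon coordinates. A minimal sketch with hypothetical names: the exon positions and the 200-base flank are invented, and the two ranges correspond to the forward and reverse primer range fields on the form.

```python
def flanking_primer_ranges(exon_start, exon_end, flank=200):
    """Return (forward, reverse) 1-based ranges that lie just outside the exon."""
    if exon_start <= 1:
        raise ValueError("no room upstream of the exon on this template")
    if exon_end < exon_start:
        raise ValueError("exon_end must not be smaller than exon_start")
    forward = (max(1, exon_start - flank), exon_start - 1)  # upstream of the exon
    reverse = (exon_end + 1, exon_end + flank)               # downstream of the exon
    return forward, reverse


# Invented coordinates for an exon on a genomic template.
forward, reverse = flanking_primer_ranges(1000, 1190, flank=200)
print("forward primer range:", forward)
print("reverse primer range:", reverse)
```

The downstream range is not checked against the template length, so that limit has to be applied by whoever knows the record.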
Then I need to pick a background database.
It's already set to the tax ID for humans.
I want to amplify this out of genomic DNA,
so I'm gonna pick RefSeq reference assembly
from selected organisms, which is a good one to pick.
RefSeq representative genomes also works well for humans,
because this is the representative genome for humans.
I'll click the Get Primers button.
It recognizes that
there's a sequence that matches on chromosome 17,
that's the gene BRCA1, so that's okay.
I want to say that that's right,
that's what I want to find.
Primer-BLAST can sometimes get a little slower
than everything else, cause it's run on fewer machines.
It also had a very wide open BLAST search at the end.
So I got three primer pairs
that are outside of exon 15 in BRCA1.
So these are a decent set of primers
that will amplify that exon,
and not really bind within it too much.
You could actually add something to this
to see that this is a region that has a
lot of disease causing mutations,
so this is a common kind of task
that people who are screening things
for these mutations would do.
Now, somebody asked me about the RID for primer-BLAST.
You can use this Job ID here,
to go back and
get the primer-BLAST results, previous one.
Yeah, so it's over here.
So if you have your primer-BLAST job id,
which is that long string that we had a minute ago,
you can enter that in here,
and retrieve the results.
They don't persist, you know, any longer than
ordinary BLAST jobs.
But you can do that.
Okay, so that is a wrap on this one.
So those are the things that I wanted to show you.
We ran over by a minute or two.
We can stay open for a couple more minutes,
if anybody has any questions.
Okay, thanks everybody for coming,
and that concludes this webinar.
We will have another one beginning in about 15 minutes.