Identifying Novel Proteins across Genomes - Brucella part 2

(female narrator) Let's look particularly at some features of the alignment. You can see that, well, this is pretty clear. This is a mis-called start site. Maybe all of Brucella could be extended up here, but probably it should have been called here. You can see that this little group is uniquely different. If you scroll down the alignment, you can see that although they have some similarities that link them with everybody else, that is a unique little group of proteins. And when I look at the members, they're all members of a particular clade of Brucella that infects marine mammals. So they have a kind of different protein family. When I scroll along, you can see that this one, suis biovar 5, uh-oh, it starts picking up here, so something's wrong with it. But when you look here, suis biovar 5, you see that this copy ends here. It's the one with that extended start site and it ends prematurely and it picks up here. This is a good point to bring up that RAST annotation does not call pseudogenes. RefSeq requires, when you submit a genome, that you call pseudogenes, and you are not allowed to show coding sequence for any of the protein fragments in what they determine to be a pseudogene. Some biologists don't like this because they think that even these two protein parts may very well be useful, so RAST doesn't call those things. But when I see something like this, it indicates to me that it might be something that's identified at NCBI as a pseudogene. So how would I determine that? All I have is the genome name. Well, if you go up here, you can choose locus tags. And generally, when things are pseudogenes, they're located syntenically next to each other. So for this biovar 3 for suis, it's 0806. And this one is 0742. Well, I don't know about you, but I find it hard to believe that there are two genes located distantly across this particular genome where, just coincidentally, one ends where the other one picks up. So I want to go explore this.
So I'm going to go back to the page before and I want to look at that particular genome. So I'm going to sort by the genome name. And because it's Brucella suis, it's towards the end of the alphabet when you compare all the Brucellas, so I'm going to sort descending. And here are the two biovars. There's three and there's three. And there's 742 and 806. Well, when I look here, see, this is where this functionality starts to become important. I can see that it starts here, but most importantly, it ends at 860823. And the next one starts at 860824. So they are syntenically located. Now let's say you have the question: How do I find that on PATRIC to look at it, to see that they really are syntenically located? Well, of course they are. They're on the same contig and they're right next to each other, but at PATRIC you can visualize that too. This is called the feature page. You have a feature page for each gene and protein. And it gives you a bunch of information, the accession number, the annotation source. Oh, this is a good one. It tells you what the RefSeq locus tag is. PATRIC serves as sort of a Rosetta Stone between different annotation sources, and so you can see what links the different annotation sources put to the protein. And it's all provided here at PATRIC. You can click on this little icon here for the genome browser and it will take you to the particular place in this contig where this protein resides. So here it is. And remember, our numbers were 742 and 806. And look, they're right next to each other. And moreover, there's 807. And when I look over here, it's 741. So something funny is going on between those two proteins. Generally, RAST annotates them syntenically and consecutively. You can zoom in. Let's see what the problem is. You can zoom in and see the actual sequence. And there's a series of Ns there. And generally, when you see a series of Ns in a sequence, remember that some of these sequences are not closed genomes.
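The adjacency check described above (one fragment ends at 860823, the next starts at 860824, same contig, same protein family) can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The `Feature` record, the `candidate_split_genes` helper, the contig and family names, and every coordinate other than 860823 and 860824 are hypothetical and do not come from PATRIC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    locus_tag: str  # illustrative label, not a real PATRIC or RefSeq identifier
    contig: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    family: str     # protein family ID (placeholder values below)


def candidate_split_genes(features, max_gap=0):
    """Return pairs of same-family features on one contig whose coordinates abut."""
    pairs = []
    ordered = sorted(features, key=lambda f: (f.contig, f.start))
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # bases between the two features
        if (left.contig == right.contig
                and left.family == right.family
                and 0 <= gap <= max_gap):
            pairs.append((left.locus_tag, right.locus_tag))
    return pairs


# Only 860823 and 860824 come from the walkthrough; the rest is made up.
features = [
    Feature("frag_1", "contig_1", 860001, 860823, "FAM_X"),
    Feature("frag_2", "contig_1", 860824, 861500, "FAM_X"),
    Feature("other", "contig_1", 861600, 862400, "FAM_Y"),
]
print(candidate_split_genes(features))
```

A pair flagged this way is only a candidate. As in the walkthrough, you would still want to look at the region in the genome browser before concluding that the two fragments are pieces of one gene.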
And this particular genome is not a closed genome. So when they have problems calling a particular nucleotide, they call it an N. Or sometimes, between contigs where they can't figure things out, they put in a series of Ns. RAST breaks apart the assembly and annotates the genomes and then puts it back together. That's one of the reasons why these things are not numbered consecutively. I wanted to show you some other functionality while we're here. I'm reassigning the area that I'm looking at. And I can get the sequence for that. This is a new functionality at PATRIC that I really, really like. It gives me the whole sequence, and I can take that sequence for the coordinates that I chose and BLAST it. All you have to do is copy-paste it. And at PATRIC, you can BLAST it by going to search and tools, there you go, and you can BLAST it. PATRIC has its own BLAST. Let me show you some of the functionality of that while we're here. You can do a BLAST against a variety of different flavors. But if I were looking at that, generally I would want to BLAST against the genome sequences. And let me just point out that at PATRIC, you know, when you have the unclosed genomes, it's always important to do a BLASTN to be sure something is really missing. 'Cause often, because the genomes aren't closed, it could actually be there. And the annotation mechanism can't call it unless the genome is closed. Okay, there's one more functionality that I'm really excited about, so I wanted to take you back there. So I'm going back to the protein family sorter again. And I know that nobody's going to mind seeing this 'cause this is our new functionality, something we're really excited about. Let me tell you that when I first looked at Brucella, I had to do this all by myself in an Excel sheet. But next to the table, we have what we call our heat map. And this is a visualization of all of these 5,537 FIGfams that are found in PATRIC.
So you notice that I clicked on heat map and it brought up this big, blue box with these coordinates on the X and Y axes. And it says genomes and protein families, but that's not very descriptive. In this little box, there's a slider. And you can grab it and extend it and you can suddenly see the genome names. All 40 will be there. You can also grab that and along the X-axis you can see the protein family names. Before I go into describing what these different colors mean, let me show you a different functionality that is deployed here. All of these things are moveable. And I always like to put everything in the correct phylogenetic context. So I'm going to be a little bit tedious and boring right here and show you just how you can do that. Okay. Because it's going to show you the power of this particular functionality. I'm going to have to go down a little bit more to grab the particular genome that I want. And then I want to bring these up. And what I'm assembling are two particular clades in the Brucella that are closely related. But you'll see how powerful this functionality is when you get to see them all next to each other. In a second, I'll show you where you can find a tree. PATRIC provides trees for all of our pathogenic organisms, so if you don't know the phylogeny of the organism, you can find it at PATRIC. Let's talk a little bit about the color here. When you see a black cell, it means that this protein here is not annotated in this genome. It's totally missing. When you see a blue cell, it means that one protein is annotated in this genome. And when you see particular other colors, it means that more than one protein is annotated. The more intense the color gets, the higher the number of proteins that have been annotated there. And these can be pseudogenes like we've just seen before, when you see broken parts of the same protein that are in the same FIGfam. Or it could be paralogs within the genome. We now have this functionality.
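The color scheme described above comes down to counting how many proteins each genome has in each protein family. Here is a minimal sketch of that idea, not PATRIC's actual implementation. The function names, genome names, and family IDs are all hypothetical, and the category labels simply restate the narrator's description (black for none, blue for one, other colors for more than one).

```python
from collections import Counter


def family_counts(annotations):
    """Count annotated proteins per (genome, family) pair."""
    return Counter((genome, family) for genome, family in annotations)


def cell_category(count):
    """Map a protein count to the heat map category described in the talk."""
    if count == 0:
        return "black (not annotated)"
    if count == 1:
        return "blue (one protein)"
    return "other color (more than one protein)"


# Hypothetical annotations: one (genome, family) tuple per annotated protein.
annotations = [
    ("genome_A", "FAM_1"),
    ("genome_A", "FAM_2"),
    ("genome_A", "FAM_2"),  # second member: a fragment or a paralog
    ("genome_B", "FAM_1"),
]

counts = family_counts(annotations)
genomes = ["genome_A", "genome_B"]
families = ["FAM_1", "FAM_2"]

# A Counter returns 0 for pairs it has never seen, which gives the black cells.
grid = {g: {f: cell_category(counts[(g, f)]) for f in families} for g in genomes}
for genome, row in grid.items():
    print(genome, row)
```

A count above one does not say whether the extra members are paralogs or broken pieces of one gene. That distinction still has to come from looking at the coordinates and the alignment.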
You can cut out an area that you're interested in, and then you can show the details in a new window. That will show you everything that you just chose. And so I could sort by the amino acid length and I could choose a few proteins. Having chosen just a few members, I can generate a multiple sequence alignment and a gene tree. And there it is. Let's go back to this heat map viewer. This is one of the reasons I went to the trouble of arranging these things into a phylogenetic context, at least the first 16 of the members. And this is why I like it. When you're scrolling along, you can see things that are lost across a particular clade. Now if a protein is missing in one genome, especially considering that the majority of the genomes in Brucella are not closed, most of the time I think, you know, it's probably a sequencing error because these things are so highly conserved. But when you see the same protein missing across 11 different members and they are all in the same clade, that tells you something. And it tells you that this particular clade made a particular decision about this protein. And they decided not to maintain it anymore, whereas the others keep it. So that's what I really like about this functionality. In our next release, we're going to have the ability for you to choose a reference genome and order the protein families syntenically based on that reference genome. And then, as you scroll along this axis, you'll be able to see things conserved across different clades, things that are lost. And it's extremely powerful for comparative genomics. One more point. I'm going to go over to this other window that I have open and go to Brucella. One of the things I pointed out was that maybe you don't know the phylogeny of your organism. At PATRIC, the phylogeny of every one of our target organisms is available. And it's in this tab here. You can look at the phylogram. You can also look at the cladogram.
And you can order your genomes along those phylogenetic guidelines. If you have any questions, you can always go to the FAQs page and they'll tell you anything you need to know about that.
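The clade-wide loss pattern discussed earlier (a protein family missing from every member of one clade while all the other genomes keep it) can also be checked outside the heat map. The sketch below is only an illustration under stated assumptions. The `families_lost_in_clade` helper, the genome names, and the family IDs are hypothetical, and the presence data would have to come from your own export of the protein family table.

```python
def families_lost_in_clade(presence, clade):
    """Families absent from every clade genome but present in every other genome.

    presence: dict mapping genome name -> set of family IDs annotated in it
    clade:    set of genome names that make up the clade of interest
    """
    inside = [fams for genome, fams in presence.items() if genome in clade]
    outside = [fams for genome, fams in presence.items() if genome not in clade]
    if not inside or not outside:
        return set()
    kept_outside = set.intersection(*outside)  # kept by all non-clade genomes
    seen_inside = set.union(*inside)           # seen in any clade genome
    return kept_outside - seen_inside


# Hypothetical presence data for four genomes and three families.
presence = {
    "clade_member_1": {"FAM_1", "FAM_3"},
    "clade_member_2": {"FAM_1"},
    "outgroup_1": {"FAM_1", "FAM_2", "FAM_3"},
    "outgroup_2": {"FAM_1", "FAM_2"},
}
clade = {"clade_member_1", "clade_member_2"}
print(families_lost_in_clade(presence, clade))
```

Requiring absence in every clade member is deliberately strict. As noted in the talk, a family missing from a single unclosed genome is more likely a sequencing gap than a real loss, so a BLASTN check is still worthwhile before drawing conclusions.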