Aligning NGS Reads to a Genome for Transcriptome Analysis with TopHat

In this video, I'll demonstrate how to run TopHat on HPC Web. First, go to the HPC website. As you can see, I'm logged in. Click on Applications and scroll down to TopHat. When you first go to TopHat, the simple form is shown. If you run it with this, it will use all the default options: it will align your reads to the genome that you choose to get genomic alignments, and look for exon-exon junctions de novo. For this demonstration, we'll go to the Advanced Form so that we can discuss some of the options that you might want to modify, depending on your experiment.
We'll use the FASTQ format. If you click on Choose a Genome, you'll see all the genomes that we have preloaded for you. If the genome that you would like to align to is not on this list, you can let us know by clicking on Support, which will send a support request to us. In your comments, you can mention which genome you would like, as well as the version. This will bump up the priority for adding this genome to the interface. Click Submit. OK, back to the form. We are going to align to the yeast (Saccharomyces cerevisiae) genome.
We have a paired-end library, and now we need to choose the quality value format for it. The default is Sanger, or Phred+33. How can you tell what your quality value format is? If your data is from Illumina sequencing and you know the CASAVA version used for the pipeline, then you can select it from the dropdown box. Otherwise, you can get this information by running FastQC or by looking at the file manually, which is what we'll do. Open up another tab with HPC Web and click on My Home. If you go to your file and click on this wheel button, then select Preview, it will give you the first few lines of your file. If you look at the fourth line here, you can see which characters are being used for quality scores. Now, what characters do we have here? Capital B, C, A, the at symbol, question mark, dash, six, and semicolon are some examples.
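The manual check of the quality line can also be sketched in a few lines of Python. This is only a rough heuristic, not part of HPC Web or TopHat: the function name and the returned labels are made up for illustration, and the thresholds assume the ASCII ranges usually quoted for Illumina-style data (offset-64 encodings do not go below character 59, and offset-33 Illumina scores rarely go above character 74).

```python
def guess_quality_encoding(quality_chars):
    """Guess the FASTQ quality offset from a sample of quality characters.

    Hypothetical helper for illustration. Returns "offset33" (Sanger /
    Phred+33), "offset64" (older Illumina / Solexa style), or "ambiguous".
    """
    if not quality_chars:
        raise ValueError("no quality characters supplied")

    codes = [ord(ch) for ch in quality_chars]
    lowest, highest = min(codes), max(codes)

    if lowest < 33:
        raise ValueError("character below '!' is not a valid FASTQ quality")
    if lowest < 59:
        # Offset-64 encodings never use characters this low.
        return "offset33"
    if highest > 74:
        # Assumed cutoff: typical offset-33 Illumina data stops near 'J'.
        return "offset64"
    return "ambiguous"


# The characters read off the fourth line in the preview.
observed = "BCA@?-6;"
print(guess_quality_encoding(observed))
```

Here the dash (ASCII 45) and the six (ASCII 54) are enough to rule out an offset-64 encoding. A short sample can easily come back as ambiguous, so it is safer to look at many reads, or to rely on FastQC.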
Now if you go to the FASTQ format Wikipedia entry, we can see what range this matches. We see the dash, six, semicolon, at symbol, and capital A, B, C, so this matches the Sanger range. Because of that, we will keep the default, which is Sanger. Now we'll upload our files from the HPC server. Here's the first file, and the second file.
Now we'll look at some of the options to consider. Initial read mismatches: this is the number of mismatches allowed between your read and your reference genome. The default is two, which is probably good for 36 base pair or 50 base pair reads. You can increase it for longer reads. Since our reads are 75 bases, we'll use three. Maximum number of alignments to be allowed: if you want to get only unique reads, you can set this to one, although that may limit the ability of the software to find junctions properly. So we will keep it at 20. Minimum isoform fraction: this controls how common a junction must be before it is reported in the output. TopHat compares the depth of coverage of an exon to the number of reads supporting the junctions coming from that exon. To be completely sensitive, you can set this to 0, and that will report all junctions it finds, but it will also report a lot of false positive junctions. So we'll leave this at the default.
Junction options. By default, TopHat will look for junctions de novo, without any prior knowledge. You can instruct TopHat to also look for particular junctions by giving it an annotation file. If you click here, you see all of the annotation files that we have preloaded for you for different versions of different genomes. Or, if you have an annotation file for your genome, you can upload it as a GTF file. We will select yeast.
Now we'll look at some paired-end options. The paired-end options refer to your original library preparation. I have an illustration to describe this. This is what your library might look like.
For example, you have a 5' adaptor here, a 3' adaptor here, and the insert is green. For this library, we have an approximate library insert size of 300 bases. Now if you are doing paired-end 75 base pair sequencing, as we are here, with 75 bases read from each end, the inner distance can be calculated by subtracting 75 twice from the total library insert size, which comes to 150 bases. Because this is approximately what we have for our library, we'll use 150 for this field. The standard deviation should be set to the approximate standard deviation of your library insert size. The default is 20 bases, and you should probably set this somewhere between 20 and 50.
Now we'll give this a name and click Submit. It's submitted. Now it's pending, and now it's running. You can monitor your job here, under My Recent Jobs. Click on this and it'll give you the status; by clicking Refresh, you can get an update on the status. You can just wait for it to run, and the output will be here under this link to the output folder, if you click on that. So that has been my demonstration of running TopHat with HPC Web. Hope you've enjoyed the video.
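As a footnote to the paired-end options, the inner distance arithmetic from the library illustration can be written as a small Python sketch. The function name is hypothetical, and it assumes the insert size excludes the adaptors and that both reads have the same length, as in the example from the video.

```python
def mate_inner_distance(insert_size, read_length):
    """Inner distance between mates: insert size minus both read lengths.

    Hypothetical helper for illustration. A negative result means the
    two reads of a pair overlap each other.
    """
    return insert_size - 2 * read_length


# Values from the demonstration: 300-base insert, 75-base paired-end reads.
print(mate_inner_distance(300, 75))  # prints 150
```

With a shorter insert, say 200 bases and 100-base reads, the same formula gives 0, meaning the two reads meet in the middle of the fragment.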