In this video, I'll demonstrate how to run TopHat
on HPC Web.
First, go to the HPC website.
And as you can see, I'm logged in.
Click on Applications, scroll down to TopHat.
And when you first go to TopHat, the
simple form is shown.
If you run it with this, it will use all the default
options and basically align your reads to the genome that
you choose to get genomic alignments and look for
exon-exon junctions de novo.
For this demonstration, we'll go to the Advanced Form so
that we can discuss some of the options that you might
want to modify, depending on your experience.
We'll use the FASTQ format.
If you click on Choose a Genome, you'll see all the
genomes that we have preloaded for you.
And if the genome that you would like to align to is not
on this list, you can let us know by clicking on Support,
and this will send a support request to us.
And in your comments, you can mention which genome you would
like, as well as the version.
And this will bump up the priority for adding this
genome to the interface.
Click Submit.
OK, back to the form.
We are going to align to the yeast, Saccharomyces
cerevisiae, genome.
We have a paired-end library, and now we need to choose the
quality value for that.
The default is Sanger or Phred plus 33.
How can you tell what your quality value format is?
If your data is from Illumina sequencing and you know the
CASAVA version used for the pipeline, then you can select
it from the dropdown box.
Otherwise, you can get this information by running FastQC
or by looking at the file manually, which
is what we'll do.
Open up another tab with HPC Web, click on My Home, and
if you go to your file and click on this wheel button,
select Preview and it will give you the first few lines
of your file.
If you look at the fourth line here, you can see which
characters are being used for quality scores.
Now, what characters do we have here?
Capital B, C, A, the at symbol, question mark, dash,
six, and semicolon are some examples.
Now if you go to the FASTQ format Wikipedia entry, we can
see what range this matches.
Now we see the dash, six, semicolon, at, capital A, B,
C. So this matches the Sanger range.
So because of that, we will keep the
default, which is Sanger.
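If you would rather not compare characters by eye, the same check can be sketched in a few lines of Python. This is only a rough heuristic based on the ASCII ranges in the FASTQ format Wikipedia entry, and the function name is made up for illustration. A sample of only high-quality reads can be ambiguous, so look at more than a handful of lines.

```python
def guess_quality_encoding(quality_lines):
    """Guess the FASTQ quality encoding from a sample of quality strings.

    Heuristic only: it looks at the lowest ASCII code seen in the sample.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) only occur with an offset of 33.
        return "phred33"
    if lowest < 64:
        # ';' through '?' fit the older Solexa+64 range, although a
        # high-quality Phred+33 sample could also land here.
        return "solexa64"
    # '@' (ASCII 64) and above is consistent with Phred+64, but a small
    # high-quality Phred+33 sample cannot be ruled out.
    return "phred64"


# The characters read out in the video include '-' and '6',
# which fall below ASCII 59.
print(guess_quality_encoding(["BCA@?-6;"]))
```

For the characters we saw in the preview, this points to the Sanger (Phred plus 33) range, the same answer we got from the table.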
Now we'll upload our files from the HPC server.
Here's the first file, the second file.
Now we'll look at some of the options to consider.
Initial read mismatches.
This is basically the number of mismatches allowed between
your read and your reference genome.
The default is two, which is probably good for 36 base pair
or 50 base pair reads.
You can increase it for longer reads.
Since our reads are 75 bases, we'll use three.
Maximum number of alignments to be allowed.
If you want to get only unique reads, you
can set this to one,
although it may limit the ability of the software
to find junctions properly.
So we will keep it at 20.
Minimum isoform fraction.
This controls how common the junctions are before reporting
them in the output.
Basically, TopHat will compare the depth of coverage of an
exon to the number of reads supporting the junctions
coming from that exon.
To be completely sensitive, you can set this to 0.
And that will report all junctions it finds.
But it will also report a lot of false positive junctions.
So we'll leave this at the default.
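To make the isoform fraction comparison concrete, here is a rough sketch of the idea as I understand it from the TopHat manual: a junction is kept when the reads supporting it, divided by the depth of coverage of the exon, reach the chosen fraction. The function name and the numbers are made up for illustration, and this is not TopHat's actual code.

```python
def junction_passes(supporting_reads, exon_depth, min_isoform_fraction):
    """Return True if a junction would survive the isoform fraction filter.

    supporting_reads: reads spanning the junction (illustrative input)
    exon_depth: average depth of coverage of the exon (illustrative input)
    min_isoform_fraction: the threshold set on the form, from 0 to 1
    """
    if exon_depth <= 0:
        # Nothing to compare against, so do not filter the junction out.
        return True
    return supporting_reads / exon_depth >= min_isoform_fraction


# With a threshold of 0.15, 3 supporting reads on an exon covered 100x fail,
# while 20 supporting reads pass.
print(junction_passes(3, 100, 0.15))
print(junction_passes(20, 100, 0.15))
# With the threshold at 0, every junction is reported.
print(junction_passes(1, 100, 0.0))
```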
Junction options.
By default, TopHat will look for junctions de novo without
any prior knowledge.
You can instruct TopHat to also look for particular
junctions by giving it an annotation file.
If you click here, you see all of the annotation files that
we have preloaded for you for different versions of
different genomes.
Or if you have an annotation file for your genome, you can
upload it as a GTF file.
So, we will select Yeast.
Now we'll look at some paired-end options.
The paired-end options refer to your original library
preparation.
I have an illustration to describe this.
This is what your library might look like.
For example, you have a 5' adaptor here, 3' adaptor
here, and the insert is green.
For this library, we have an approximate library insert
size of 300 bases.
Now if you are doing paired-end 75 base pair
sequencing, as we are here, 75 bases from each end, the inner
distance size can be calculated by subtracting 75
twice from the total library insert size, which
comes to 150 bases.
Because this is approximately what we have for our library,
we'll use 150 for this field.
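The same arithmetic as a small Python sketch, in case you want to repeat it for your own library. The function name is just for illustration.

```python
def mate_inner_distance(insert_size, read_length):
    """Inner distance between mates: insert size minus both read lengths.

    A negative result would mean the two reads overlap.
    """
    return insert_size - 2 * read_length


# Our library: roughly 300-base inserts, 75 bases read from each end.
print(mate_inner_distance(300, 75))  # 150
```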
The standard deviation should be set to the approximate
standard deviation of your library insert size.
The default is 20 bases.
And you should probably set this somewhere
between 20 and 50.
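For reference, here is a hedged sketch of how the choices on this form might look as a TopHat command line. The mapping from form fields to flags is my assumption, the flag names follow the TopHat 1.x manual as I remember it (newer releases rename some of them, for example the read mismatch option), and the index and file names are placeholders. The sketch only builds and prints the command; it does not run anything.

```python
import shlex


def build_tophat_command(index, reads1, reads2, gtf=None,
                         mismatches=3, max_multihits=20,
                         inner_dist=150, std_dev=20):
    """Assemble a TopHat argument list from the options discussed here.

    Flag names are assumed from the TopHat 1.x manual; check them against
    the version you actually run.
    """
    cmd = [
        "tophat",
        "--initial-read-mismatches", str(mismatches),
        "--max-multihits", str(max_multihits),
        "--mate-inner-dist", str(inner_dist),
        "--mate-std-dev", str(std_dev),
    ]
    if gtf is not None:
        # Optional annotation file, so known junctions are also checked.
        cmd += ["--GTF", gtf]
    # Positional arguments: Bowtie index prefix, then the two read files.
    cmd += [index, reads1, reads2]
    return cmd


# Placeholder names, not real files on the server.
command = build_tophat_command("yeast_index", "reads_1.fastq", "reads_2.fastq",
                               gtf="yeast.gtf")
print(shlex.join(command))
```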
Now we'll give this a name.
And we'll click Submit.
It's submitted.
And now it's pending, and now it's running.
And you can monitor your job here, under My Recent Jobs.
Click on this.
It'll give you the status and by clicking refresh, you can
get an update on the status.
And you can just wait for it to run and the output will be
here under this link to output folder, if you click on that.
So that has been my demonstration of running
TopHat with HPC Web.
Hope you've enjoyed the video.