The next type of parallel computation is one that's loosely coupled,
meaning there are phases of independence
followed by phases where the workers have to work together.
So the schematic diagram is: we're gonna have four input files,
there's gonna be some initial computation on those input files,
and then those workers are gonna communicate with each other.
That's gonna create a set of four different independent tasks
on the back end that will then have their different output files.
So this fits really nicely into that MapReduce paradigm
where up front we're gonna have our mappers,
they're gonna shuffle to communicate with each other,
and then a series of reducers will produce our final result.
So that's really the special sauce, if you will, of MapReduce,
and why it's more powerful than, say, Condor or Grid Engine:
it supports the shuffle over very, very large datasets.
So again, if you think about this sort of conceptual model,
it's sort of like we're gonna choreograph this dance where everybody
has a dance partner. Then we're gonna blow a whistle,
everybody's gonna find a new dance partner, and then continue on.
So this is extremely important for Genomics, and it fits really well with
many many applications that we may be interested in.
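To make those phases concrete, here's a minimal sketch of the map, shuffle, and reduce steps in Python; the records and keys are made up for illustration and aren't real sequence data.

```python
# Toy illustration of the three MapReduce phases (made-up records, not
# the real pipeline): mappers work independently, the shuffle groups
# everything by key, and reducers work independently again.
from collections import defaultdict

records = ["chr1 readA", "chr2 readB", "chr1 readC", "chr2 readD"]

# Map phase: each record becomes a (key, value) pair, no coordination needed
mapped = [tuple(r.split()) for r in records]

# Shuffle phase: the framework gathers all values that share a key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: one independent task per key
for key, values in sorted(groups.items()):
    print(key, "->", values)
```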
So in particular there's this computation that I alluded to before called
Short Read Mapping, and the idea is we're gonna have our genome,
and here I'm just showing a shot of a couple dozen nucleotides,
whereas in the real genome we're gonna have about 3 billion nucleotides.
We also have many, many billions of very short reads; here I'm just showing
ten bases, but in reality each read is about a hundred bases.
We're gonna take all of these billions of hundred-base-pair short reads and
try to line them up to our genome to identify variations.
So here the reference genome has a 'C', but all the reads that line up there have an 'A'.
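As a toy example of what "identifying a variation" means here (just a sketch, not a real SNP caller), you can picture tallying the bases that the aligned reads show at one position and comparing them to the reference base:

```python
# Toy variant check (illustrative only): if most reads covering a position
# disagree with the reference base, report a candidate variation.
from collections import Counter

def candidate_variant(ref_base, observed_bases, min_fraction=0.8):
    counts = Counter(observed_bases)
    base, count = counts.most_common(1)[0]
    if base != ref_base and count / len(observed_bases) >= min_fraction:
        return base
    return None

# The reference has a 'C' here, but the aligned reads all show an 'A'
print(candidate_variant("C", ["A", "A", "A", "A", "A"]))  # prints: A
```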
There are some really nice string-matching algorithms
that are very efficient for taking a single read and mapping it onto the reference genome.
But as efficient as it is, it still takes on the order of a thousand CPU hours
just to analyze one genome.
We currently have projects underway to do thousands and thousands of genomes,
so that translates into millions of CPU hours
to be able to do this type of analysis for one project.
So it's just going to take way too long to try to do this on one computer.
We're gonna want to spread this out over many computers.
In addition to this computation, trying to identify where the variants are,
this sort of read mapping and coverage analysis
is really fundamental for many different assays.
Studying complex mutations in the genome; studying
the biologically active regions, that's called RNA-Seq;
studying where proteins are binding to the genome, that's called ChIP-Seq;
studying different modifications to the genome, that's called Methyl-Seq;
and then studying how the genome actually folds up on itself, that's called Hi-C.
So here I'm just going to be talking about finding variations.
But this is really really fundamental to Computational Biology,
and extremely important for a number of applications.
So the nice part is that this fits really well into a MapReduce framework,
and in particular we're able to use a feature of Hadoop called Hadoop Streaming
that lets us tackle this problem in a very efficient way.
So the goal is to align billions of short reads to the reference genome
to identify where the single nucleotide polymorphisms, or SNPs, are,
and we're going to be able to reuse as much software as possible
using this Hadoop Streaming.
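The essence of Hadoop Streaming, and why it lets us reuse existing software, is that a mapper or reducer is just a program that reads lines on standard input and writes tab-separated key/value lines on standard output. Here's a minimal sketch of that contract (a placeholder, not the actual Crossbow wrapper) that could be plugged into a streaming job as its mapper:

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch (placeholder, not Crossbow's):
# read lines on stdin, write "key<TAB>value" lines on stdout. Any existing
# tool can be reused by wrapping it with a few lines of glue like this.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    # A real wrapper would hand the line to an existing program here;
    # this placeholder just emits the line length as the value.
    sys.stdout.write("placeholder_key\t%d\n" % len(line))
```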
So the input to this new pipeline that we call Crossbow
is gonna be some set of those sequence files.
Each of these files is going to have millions and millions of reads in it.
Here I'm showing four files, but it could be thousands of these files
with tons and tons of biological sequence data in them.
The first step is gonna be to run this sequence aligner called Bowtie
that was developed by my collaborator, Ben Langmead.
And the idea is that Bowtie knows how to take a single one of these reads
and then quickly do string matching to find
where it best aligns to the reference genome.
So normally Bowtie is run with one copy per computer,
and what I'm trying to show here is that we're gonna launch
four copies of Bowtie on four different computers.
This could be four hundred different copies of Bowtie on four hundred different computers.
And then we're gonna take the output of Bowtie and emit it as a key-value pair,
where the key is gonna be the chromosome region that the read aligns to,
and the value is gonna be the information about that alignment.
So your genome is organized into 23 chromosomes,
so that'll allow us to have some nice natural partitioning.
The idea is that the reads are gonna align more or less randomly
to these different chromosomes, but then we're gonna tag where those alignments are
so that we can tag all the alignments to Chromosome #1,
all the alignments to Chromosome #2, all the alignments to Chromosome #3,
and so forth. Actually, in reality, we're gonna do this a little bit better,
because we're gonna tag whether we're on the first ten million bases
of Chromosome #1, the second ten million bases,
the third ten million, the fourth ten million, and so on.
That'll give us better opportunity for load balancing down the road.
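Here's a rough sketch of what that map step could look like as a Hadoop Streaming script; the alignment field layout is an assumption for illustration, not Bowtie's actual output format:

```python
#!/usr/bin/env python
# Sketch of the Crossbow-style map step (simplified): for each alignment,
# work out which chromosome and which ten-million-base bin it falls in,
# and emit that bin as the key so the shuffle can group alignments by region.
import sys

BIN_SIZE = 10_000_000  # ten million bases per bin, for load balancing

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue
    chrom, offset = fields[0], int(fields[1])    # assumed layout: chromosome, position, ...
    key = "%s:%d" % (chrom, offset // BIN_SIZE)  # e.g. "chr1:0" = first 10 Mb of chr1
    sys.stdout.write(key + "\t" + line.rstrip("\n") + "\n")
```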
Given those key value pairs of where the alignments occur,
we're gonna use that Hadoop Shuffle functionality
to collect all the alignments to those different chromosome regions.
So here are all the alignments being collected:
the first few megabases of Chromosome #1, the second few megabases,
the third, and so forth, all the way across all the chromosomes.
Those are going to get sorted and shuffled into place across the different machines.
And then for our reducer we're going to use another program called SOAPsnp
that knows how to scan through those alignments
and identify where those polymorphisms are. It has a really nice model
for identifying and discarding sequencing errors.
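On the reduce side, Hadoop Streaming delivers the shuffled lines sorted by key, so all the alignments for one chromosome region arrive together. A simplified stand-in for the reducer (the real one hands each group to SOAPsnp) might look like this:

```python
#!/usr/bin/env python
# Simplified reducer sketch (a stand-in, not SOAPsnp): lines arrive sorted
# by key, so alignments for one chromosome region are contiguous and can
# be processed as a group.
import sys
from itertools import groupby

def keyed(stream):
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

for region, group in groupby(keyed(sys.stdin), key=lambda kv: kv[0]):
    alignments = [value for _, value in group]
    # The real reducer would scan these alignments for polymorphisms here.
    sys.stdout.write("%s\t%d alignments\n" % (region, len(alignments)))
```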
So we built this pipeline using a small cluster at the University of Maryland,
but then in our main publication we scaled this up to run
on the Amazon cloud using EC2. Now, it does take time
and also costs money to transfer data in and out of the Amazon cloud.
But in our case it wasn't so bad. Our input data was 3.3 billion short reads,
35 base pairs each. It took about an hour and fifteen minutes
to do the transfer; it was about 106 gigabytes of data.
The trick for doing this was, rather than just FTPing one file at a time,
we were able to FTP forty different files at a time.
I should have said that the data were housed in a repository
in England and we were transferring it to Amazon someplace in Northern Virginia.
So we were able to do that transfer very very efficiently
by executing forty parallel FTPs at once. There are actually better
transfer mechanisms for doing large data transfers.
There's something called GridFTP, and there's a commercial client called Aspera,
that can just make better use of the available bandwidth than a regular FTP.
Nevertheless, it only took us about an hour and fifteen minutes.
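To give a sense of what that forty-way parallel transfer looks like in practice, here's a rough sketch; the host name and file names are placeholders, not the actual repository:

```python
# Sketch of forty FTP transfers running at once (placeholder host and
# file names): the aggregate throughput is far higher than one connection.
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP

HOST = "ftp.example.org"                                 # placeholder repository
FILES = ["reads_%03d.fastq.gz" % i for i in range(160)]  # placeholder file names

def fetch(name):
    ftp = FTP(HOST)
    ftp.login()  # anonymous login, as many public archives allow
    with open(name, "wb") as out:
        ftp.retrbinary("RETR " + name, out.write)
    ftp.quit()
    return name

# Keep forty transfers in flight at a time, mirroring the approach above
with ThreadPoolExecutor(max_workers=40) as pool:
    for done in pool.map(fetch, FILES):
        print("finished", done)
```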
At the time that we did the transfer, it cost us about $14 to do the transfer,
which, yes, is money, but relative to the thousands of dollars it took to
generate this data, a $14 transfer is just a tiny fraction of the cost.
And then in our main experiment we ran our Crossbow pipeline on a cluster
rented from Amazon EC2 with 320 cores in it;
this was forty computers with 8 cores each. It takes a little bit of time
just to set up a cluster of that size, and it takes some time to run many copies of Bowtie,
in this case three hundred twenty copies.
And then it takes about an hour to do all the
variation calling.
And including the time to do the data transfer, it took just around four hours.
At the time that we did the experiment,
it cost about a hundred dollars to run it.
If you were to reproduce this exact experiment today,
Amazon has gotten quite a bit cheaper,
so it would be about fifty or sixty dollars to do this analysis today.
So we think this is a really compelling example
of how Cloud computing and MapReduce can be used in the Life Sciences
because we started with 3.3 billion reads, and we were able to
very rapidly transfer the data and do all the computations
to identify about three million of those single nucleotide polymorphisms,
in an afternoon and for less than $100.
So it's a really compelling example of how these technologies may be used.
Since that publication, we've recognized that this sort of framework
of mapping reads to a reference genome, shuffling them into chromosome regions,
and then scanning those alignments comes up all the time in many different contexts.
And we've been developing a new system that we call Jnomics
that, rather than focusing on just Bowtie and just SOAPsnp,
is going to support many, many different genomic
sequence analysis tools at a very, very large scale.
So far we've been using these technologies to study human diseases like cancer,
to see how the genome is mutated, and then also different
plant systems to try to figure out why different strains of rice
and different strains of corn and different strains of wheat
are able to grow with different properties.
Like how fast they grow, how tall they get,
how big the fruits are, and so forth.