Well, thank you very much for the kind introductions. It's nice to be back on the NIH campus again,
and I was only briefly frisked this time coming through security - I didn't tell them I knew
David Lipman, so they let me through. Anyway, congratulations to NCBI, but particularly to David and
Jim Ostell and the team. These 25 years seem to have gone by rather quickly, at least looking back,
and we've enjoyed all the associations we've had with NCBI over the years and trying to change things.
So the work I'm going to be talking about emanated from basically four institutions starting with my lab
here at the NIH, then forming TIGR in 1992. This is the picture of the Venter Institute in Rockville, Maryland,
and here's a picture of one of the buildings in La Jolla, California, so it's now a bicoastal institute,
and also from some of the work at Celera Genomics. This was an article that Karen Remington
helped me write for the 10th anniversary of Nature Genetics looking at some of the things that led
to the advances in genomics. It wasn't just technology. A lot of basic ideas had a big impact such as
massive parallelism which is pretty obvious with massively parallel computers. It wasn't so obvious
to the group early on in terms of DNA sequencing as everybody wanted one machine that would
sequence the human genome. When we wanted to double the output, we just bought a second machine.
It seemed kind of logical at the time, but that expanded up to hundreds of machines. More important,
though, is something that's a mathematical concept as well as, at times, a social one, and that's randomness.
We used randomness throughout all the different approaches to make what have turned out to be some nice advances.
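How much random sequencing buys you can be estimated with the classic Lander-Waterman model. Here is a back-of-the-envelope sketch in Python; the Haemophilus numbers are illustrative round figures, not the exact project statistics:

```python
import math

def lander_waterman(genome_bp, read_bp, num_reads):
    """Expected shotgun statistics under the Lander-Waterman model."""
    coverage = num_reads * read_bp / genome_bp      # mean depth c = NL/G
    frac_unsequenced = math.exp(-coverage)          # P(base never sampled) = e^-c
    expected_contigs = num_reads * math.exp(-coverage)
    return coverage, frac_unsequenced, expected_contigs

# Illustrative round numbers for the 1.8 Mb Haemophilus influenzae project:
c, unseen, contigs = lander_waterman(1.8e6, 460, 24_000)
print(f"coverage {c:.1f}x; {unseen:.3%} of bases unsequenced; ~{contigs:.0f} contigs")
```

The point the model makes is the one in the talk: purely random reads, at modest coverage, leave only a tiny fraction of the genome untouched.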
We were actually sequencing some test regions of the human genome here at NIH in the late '80s.
We found that we could not interpret that sequence without having cDNAs. So we developed the idea
of just randomly sequencing cDNA clones from the library. This was after I spent ten years trying to get
a single gene, the human brain beta-adrenergic receptor, which we were finally able to clone in the mid-'80s.
It was the first and last gene that I sequenced by manual DNA sequencing. It was a strong motivation
to go into automation. This paper seems small by comparison to modern papers, but it was a big deal in 1991
to have 337 genes described in a single paper looking at random sequencing of brain libraries.
NCBI and GenBank, particularly driven by David, had the foresight to start a new database
at the time, dbEST. It started slowly, basically until our collaboration with Bert Vogelstein discovered
some new DNA repair enzymes associated with colon cancer. After that ESTs caught on. Our teams
have continued to submit them over the years with a variety of species, but when you look at how it
took off after the work on colon cancer, our contribution percentage-wise to the total was decreasing,
but these numbers have become pretty staggering. With the first paper in '91, we only had 337 ESTs in GenBank.
A decade later that number had jumped to 10,000,000. Two years later it doubled again, and downloading
things last night, in the latest release there are now 50,000,000 ESTs, amazingly over 8,000,000 from human alone.
Now there hasn't been a genome completed in which ESTs didn't play a major role in the annotation. In fact,
I think the annotation would be questionable without them. The cDNAs were very key, and it's amazing
that they continue to be generated at this rate. This was just a list, right off the dbEST website,
of the species with 300,000 or more ESTs. The list goes on in a very dramatic fashion. In fact, Aedes aegypti was
our latest contribution. Now it wasn't just the ESTs but the randomness, and more importantly some new algorithms
developed primarily by Granger Sutton. Granger Sutton came in with Chris Fields. Many of you will remember him.
He was sort of my impetus for leaving NIH, that and $70,000,000 to start an institute. I tried to get Chris
promoted here at NIH, and at the time biology was not part of the computing world and certainly
biologists thought computing was not part of the biological world, so that was sort of the final straw
in leaving to form The Institute for Genomic Research. But the team led by Granger Sutton
developed new algorithms for assembling ESTs, and they were extremely effective with hundreds of thousands,
so we decided we could go back and apply these same tools to genomics, and that's what led
Ham Smith, my other colleagues, and me to just try this experiment. The E. coli genome project was roughly
in its tenth year of funding without tremendous progress. The yeast genome was making better progress;
it took 1,000 scientists ten years to do, and it ended up being the fourth genome completed. Using just
mathematics and computing took such projects from decades down to months. This has progressed, so these are now down
to a single day in a single cell. Completed genomes started to come out exponentially. Looking back
at this literature, despite all the fuss and complaints, more than 90 percent of all the genomes completed
have now been done with the method that we published with the Haemophilus genome. This is just
looking at prokaryotic genomes. Again, like ESTs, our contributions were 100 percent early on,
and these have really caught on, particularly with the JGI contributing a substantial amount.
If we look at the submissions to the sequence databases - this is for finished prokaryotic genomes.
The JGI and the DOE have submitted 25 percent of all the genomes, and the Venter Institute in its
various incarnations including TIGR has done 23 percent. And it drops off pretty dramatically
after that, showing that some of these things happen faster in centers. Again these are NCBI websites looking at the clone-by-clone approaches versus the whole genome shotgun approach.
It's clear that the whole genome shotgun totally phase-shifted how things were done over
a very short period of time, so now if we look in the databases, the majority of what's there
has been done by whole-genome assembly, or at least the majority of the data comes from some random approach.
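The overlap idea at the heart of these random approaches can be sketched in a few lines of Python. This toy greedy assembler is only an illustration of the principle, nothing like the real TIGR or Celera assemblers:

```python
def overlap_len(a, b, min_olap=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_olap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                # no overlaps left: disjoint contigs remain
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Random fragments of the toy "genome" ACGTACGGTAGCTA reassemble into one contig.
print(greedy_assemble(["ACGTACG", "TACGGTAG", "GTAGCTA"]))
```

Real assemblers add quality values, repeat handling, and mate-pair constraints on top of this basic overlap-merge loop.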
The big change came only five years after Haemophilus, and that was the scale-up with
new sequencing technology but, more importantly, much improved algorithms. These were driven
primarily again by Granger Sutton but also by Gene Myers, whose team came up with the
Celera assembler. Despite its substantially larger size, Drosophila was sequenced in the same
four months that Haemophilus took five years earlier. The size of the compute went up substantially.
Nine months later the first draft of the human genome was assembled, this time from 27,000,000 sequences.
The results from our team were published in Science in 2001 and from the public effort in Nature.
It turns out both these efforts were dramatically flawed. The conclusion in 2000 and 2001 was that we
all shared the same set of genes, and that we all differed from each other in only 1 out of 1,000 letters
of the genetic code. In their wisdom, the early committees that I was part of with Lee Hood
threw out a number: it probably cost a dollar a base pair for sequencing. Nobody wanted to ask
for six billion dollars to do a diploid genome. They thought we could just get by with a haploid genome
and infer the rest. It turns out those inferences were wrong. Celera, ironically, sequenced from five individuals
and did a majority-rule assembly, again subtracting out a lot of the variation. The public
sequence had essentially no variation, because it came from BAC clones from a collection of people.
I say ironically because had we sequenced from just one individual in '99 through 2000, we would
have gotten the right answer eight years ago instead of this last September. But we went on to continue
to sequence, building on the sequences from a single individual - in this case me - from Celera,
adding to the data until we had a substantial assembly of both haploid genomes. In fact, we got tremendous
haplotype phasing based on the genetic variation, more than ten times what came out of the HapMap.
And we're adding to this even more substantially. Basically every major new sequencing manufacturer
is now testing their instruments against this reference diploid human genome, so we hope in the near future
we will have complete haplotype phasing going down the sets of chromosomes. The biggest surprise
is that my two haploid genomes differ from each other on the order of 0.5 percent. When we compare
any two people, it looks like we differ from each other on the order of one to three percent.
To put that number in context: I and others, based on the work of Svante Paabo, who worked
with Celera early on on the chimpanzee, told the world that we differ from chimps by 1.27 percent.
So audiences start to get very nervous around this point: if we differ from each other by 1 to 3 percent
and from chimps by 1.27, you're probably hoping that number has changed too, and many of you know that
it has - we're more like 5 to 6 percent different from chimps when we count INDELs and other types of variation.
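A quick back-of-the-envelope calculation with the talk's own figures shows how much of that variation a SNP-only view misses (the genome size is approximate, and the indel share is implied by subtraction, not measured here):

```python
genome_bp = 3.2e9                  # approximate haploid human genome size

snp_diffs = genome_bp / 1_000      # the "1 in 1,000 letters" view from 2001
print(f"SNPs alone: {snp_diffs / genome_bp:.2%} of bases differ")      # 0.10%

total_diffs = 0.005 * genome_bp    # ~0.5% between the two HuRef haploid genomes
indel_and_other = total_diffs - snp_diffs
print(f"implied indel/structural share: {indel_and_other / 1e6:.0f} Mb "
      f"({indel_and_other / total_diffs:.0%} of differing bases)")
```

On these rough numbers, most of the differing bases sit in indels and other structural variation, which is exactly the point about a SNP-centric industry.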
We have substantially more base pairs in INDELs than we do in SNPs, but we're in a SNP-centric world
right now. We have an industry that only knows how to measure SNPs, and they think these are
what account for human variation, and I think there are increasing surprises. I think Jim Watson's genome
is going to be published in the next week or two, so we'll have two genomes out there. His genome,
I think, is a full diploid genome, but the sequence is only a haploid genome, because it was not an
independently assembled genome; it was layered on the NCBI sequence. I argue that we need thousands,
perhaps ten thousand complete diploid genomes to understand human genetic variation. We can't
get that off a reference genome, and you can't get that looking at SNPs. We're actually looking at
a much more complex problem. The sequencing is pretty straightforward. Some of the new technologies
that are coming along in the next 12 to 18 months are really going to change what can be done
very dramatically, possibly down into the few-thousand-dollar range per genome. But this data is basically useless.
It's a different version of what David Botstein was just telling you, in this case without phenotypic information.
And so we're working with groups at UC San Diego and other places to try to define the human phenotype
in a way that can be digitized, so we can compare the sequence information across the genome
to the human phenotypes, and with that, using complex analysis, we have a chance of answering
most nature-versus-nurture questions. If we just collect sequence - and NIH has proposed collecting partial
sequence from individuals with no phenotypic information - it's going to be mostly a wasted set of information.
You can go to the PLoS Biology website and download the file that has the map showing the various types
of variation, and there's a new browser just released this last month from the JVI.org website as well.
But the challenge is going to be comparing all these genomes. It certainly helped when
NCBI started the Trace database; that was a huge leap forward, certainly for scientists and teams
like ours that use this data. Instead of relying on other people's assemblies and interpretations,
we're able to download the raw data, reanalyze it, reassemble it, and do other things with it.
The trouble is that the traces from the public human genome dataset are still not available.
We've been told they probably never will be. We have continued making additions: just as of last month,
we've now put over 103,000,000 sequences into the Trace Archive, showing how our total is
growing over time and contributing to the efforts. Now, after we finished this rough draft of human,
we looked around for other things we could apply these same techniques to. Knowing that the
environment was important, we decided to look at it broadly by applying
shotgun sequencing to the environment, and that led to this paper - Karen Remington was a
key contributor - on shotgun sequencing what we found in the Sargasso Sea by simply
filtering organisms out of seawater. In fact, we stopped sequencing after about a billion base pairs:
over 1.2 million previously unknown genes and as many as 40,000 new species. This started
getting applied at the institute in multiple directions, including the start of the Human Microbiome Project,
which is now expanding greatly. We're not just a hundred trillion cells with 23,000 genes. We have literally
thousands and thousands of organisms associated with our bodies, so the genes from our associated organisms
add up to maybe 10,000,000 or more versus our meager 23,000, and efforts
are now starting to characterize these in much broader populations. Groups like Metabolon use
high-throughput mass spec to look at chemicals in the bloodstream. They've catalogued that our
genetic code can lead to around 2,400 unique chemical compounds. If we were to take a blood sample
from any of you after a meal, there are around 500 pretty substantial compounds circulating in your bloodstream.
Interestingly, about 60 percent are from human metabolism, and about 30 percent are just from what you consumed -
chemicals absorbed out of the various species that you ate. Interestingly, about 10 percent of what is
circulating in your blood at any one time is from bacterial metabolites of what you just ate. So not only
are you what you eat; you are what you feed your bacteria. So the role they play in our physiology
is now just being looked at for the first time. We went back and started the Sorcerer II expedition,
trying to ask questions on a global scale with these techniques. The results of the first third of this
were published in a special issue of PLoS Biology last spring. This is the route that Sorcerer II followed,
similar to the route of the HMS Challenger in the 1870s, and as with the Challenger, we took samples
every 200 miles, only they sent a dredge to the bottom and counted organisms they could see.
We collected filters, put the filters in a freezer, brought them back to the lab in Washington and
shotgun sequenced everything off of these filters. This is a picture of rapid obsolescence. Three years ago
this was a state-of-the-art facility. We're now selling these machines for less than $50,000 each, if
you're looking for a DNA sequencer, because the output of 100 ABI machines can now be replaced
with a single 454 machine or other technologies. We switched to 454 because we think the length
of sequence reads and doing parallel sequencing are essential for accurate human and other sequencing,
but there's even more exciting single-molecule technologies on the near horizon. To everybody's
surprise, we found that every 200 miles in the ocean, 85 percent of the sequences were totally unique,
to the point we can actually tell where a water sample came from by the sequence information contained
in that water sample. This is from the original Sargasso Sea paper, and one of the questions people had,
after we found so much unique biology, was how could it be there? The ocean, particularly the Sargasso Sea,
was supposed to be a desert with very low nutrients. It turns out almost every organism in the top layers
of the ocean that's not a photosynthetic organism has photoreceptors - bacterial rhodopsins - very similar
to our own visual pigments. Looking at GenBank, the blue at the bottom was our knowledge of
photoreceptor biology in terms of distribution, including humans, before these findings. This was
just from the first sample. You can see a deep branching. These have been discovered in a linear fashion.
They're in the tens of thousands now. There are reasons for lining them up: we can get down to a single
amino acid residue that can predict the wavelength of light that these receptors will see, and so we can
ask unique questions because now we have a huge metadata set, not just sequence information.
We have, for example, GPS coordinates for every single sequence, and we can ask questions about
unique distributions. In fact, we found them. In the middle of the Sargasso Sea, the ocean is a deep indigo blue.
The receptors on the organisms that thrive there see primarily blue light. You get into coastal waters
where there's a lot of chlorophyll, they see primarily green light. Places like the Panama Canal that are
pure fresh water, they see entirely green light. We now have about eight new derivatives
for which we've yet to determine the wavelength of light, but last year a Swedish group published this
nice letter in Nature showing that these are actually phototrophic organisms. This is not photosynthesis.
This is basically a seven-transmembrane receptor, very similar to the adrenaline receptor.
Light hits it, and it moves ions across the membrane, leading to some key chemical reactions
that generate energy for these cells. The other surprise is that species were not
nice, neat punctate entities, and this is true whether it's your gut, your oral cavity,
the ocean, the air, or the soil outside this auditorium. This is a tool that Doug Rusch built that you
can download. It puts a reference genome across the top. Each of these little bars is 900 base pairs
of sequence that match that reference genome. We see very little at the 99 to 100 percent level
with any organism in the ocean, particularly ones that were claimed to be the most abundant species.
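Conceptually, that viewer is a fragment-recruitment plot, and the binning underneath can be sketched as follows (the hits are hypothetical toy data; the real tool works from large-scale alignments):

```python
from collections import Counter

def recruitment_histogram(hits, bin_pct=1):
    """Bin read-vs-reference alignments by percent identity.

    hits: iterable of (reference_position, percent_identity) pairs,
    e.g. parsed from BLAST tabular output. Returns identity-bin counts.
    """
    counts = Counter(int(pct // bin_pct) * bin_pct for _, pct in hits)
    return dict(sorted(counts.items(), reverse=True))

# Hypothetical hits against a SAR11 reference: almost nothing at 99-100%.
hits = [(1200, 86.3), (5400, 99.1), (7700, 84.9), (9100, 78.2), (12000, 85.5)]
for ident, n in recruitment_histogram(hits).items():
    print(f"{ident:>3}-{ident + 1}% identity: {n} read(s)")
```

It is exactly this kind of histogram that reveals the cloud of related-but-diverged organisms sitting below the 99 percent identity line.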
SAR11 was supposed to be a single species; it's basically all under a single 16S rRNA sequence.
So the people who were telling you they were measuring diversity with 16S rRNA were measuring
diversity, but they were missing two to three orders of magnitude of it. We can take
a slice of any of this data, create trees, and ask different questions. These are color-coded by site.
You can compare the Atlantic versus the Pacific. We see things like major changes in phosphate metabolism
in different oceans. I think the bottom is the most interesting one. We've seen a switch between
blue and green light four times in recent evolution. A single base pair of the genetic code changes
that single amino acid. It changes the wavelength of light. This is classical Darwinian evolution.
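That switch is concrete enough to sketch in code. The snippet below assumes the commonly cited proteorhodopsin tuning site at residue 105, with leucine marking green-absorbing (~525 nm) and glutamine blue-absorbing (~490 nm) variants; treat the exact position and wavelengths as illustrative assumptions:

```python
# Toy spectral-tuning call for proteorhodopsin-like sequences.
# Assumption: residue 105 (1-based, proteorhodopsin numbering) carries the
# blue/green switch - Leu ~ green-absorbing, Gln ~ blue-absorbing.
TUNING_POSITION = 105

def predicted_color(aligned_protein_seq):
    residue = aligned_protein_seq[TUNING_POSITION - 1]
    return {"L": "green (~525 nm)", "Q": "blue (~490 nm)"}.get(
        residue, f"unknown variant ({residue})")

# Hypothetical aligned sequences differing only at the tuning site:
coastal = "X" * 104 + "L" + "X" * 40
open_ocean = "X" * 104 + "Q" + "X" * 40
print(predicted_color(coastal), "|", predicted_color(open_ocean))
```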
You get a mutation, you get increased or decreased survivability depending on that environment.
What we see with this huge distribution of related organisms with as much as 40 percent
sequence variation is not predicted by classical Darwinian evolution. But you can ask many
different questions of this type of data, and a variety of questions started to come out of it.
What was the novelty in this data? Were these just new members of known families?
If we're discovering things, what was the rate of discovery, and this is where Shibu Yooseph
at the institute led an effort to compare the GOS dataset to everything we could download
out of the public databases at the time. This was just from Halifax to the Galapagos, and it contained roughly
twice as many proteins as were in all the public databases. They proceeded to do a compute using
over 1,000,000 CPU hours to assemble all this data together. We were truly surprised by this,
and this is a Venn diagram of the totals at the time: the smaller, gray circle was GenBank.
Looking at microbial sequences, there were only 115 gene families in GenBank that our dataset
did not hit, and yet we had about 4,000 new gene families that were not seen in GenBank.
Now we see a huge tail on this data. As of a couple of years ago, these computes gave about 50,000
gene families with ten or more sequences in the public databases, but every time
somebody takes a new environmental sample, they add new gene families in a linear fashion.
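At its heart, building such families is greedy similarity clustering. The toy sketch below is a stand-in for that million-CPU-hour compute; real pipelines use proper alignments and BLAST-style searches rather than this pre-aligned shortcut:

```python
def identity(a, b):
    """Crude percent identity of two equal-length (pre-aligned) sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.7):
    """Assign each sequence to the first cluster whose representative it
    matches above the threshold; otherwise it founds a new family."""
    families = []                     # each family: [representative, members...]
    for s in sorted(seqs, key=len, reverse=True):
        for fam in families:
            if identity(fam[0][: len(s)], s) >= threshold:
                fam.append(s)
                break
        else:
            families.append([s])
    return families

seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "MNSERGVWTQ", "MKTAYIGKQR"]
print(len(greedy_cluster(seqs)), "families")   # expect 2
```

The linear discovery curve in the talk falls straight out of this picture: each new environmental sample contributes sequences that found new clusters rather than joining existing ones.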
If we look at subparts of the evolutionary tree, we get a different answer. For example, if we look at
mammalian genomes, my contention is there's no point in sequencing another mammalian genome
if you just want to discover new genes. It's totally saturated. You can discover new combinations
of those genes that lead to that species or its variants. We're going to do an awful lot of humans
as a community in the near future, not for gene discovery but for variant analysis. So metagenomics
is now a new part of NCBI and GenBank. We've more than doubled the number of peptide sequences
in the public databases with a single publication last spring, and we hope to do so again this year
with around 35,000,000 new gene sequences. But this is catching on; people around the world
are picking up on this, particularly as the new sequencing technologies come online. One major funder
has been providing 454 sequencers to a variety of labs. Fortunately, the sequence read length is improving
with 454 later this year, and I think that data will be good as well. We tried to put all this information
together to ask some unique questions. Obviously one that we've all asked ourselves at various times
is the definition of life. Can we do classical reductionist biology and pare life down to its most basic components?
The third question we've obviously answered: we can digitize it. That's what we've all been doing
for the last 20 years. When we sequence a genome, we're digitizing that biological information.
And lately for the last decade or so, we've been trying to answer the last question, can we start
with that digital information in the computer and go the other direction and regenerate life.
So working with Ham Smith and Clyde Hutchison and a team of about 30 scientists, we've
been trying to work on designing and synthesizing life starting with the genetic code.
Again, a couple of simple questions that are difficult to answer: Will the chemistry permit making large
accurate pieces of DNA, entire chromosomes? And if we can make these pieces of DNA, can
we boot them up? When we first tried this, the answer was no even for things on the order of 5 kb,
because DNA synthesis is a degenerate process. The longer you make the sequence, the more
errors there are, but we've developed new mechanisms for correcting those errors as part of the
assembly, and we started with Phi X 174, the first genome - the first phage - sequenced by Fred Sanger.
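Why even 5 kb was hard is simple probability. Assuming, purely for illustration, one error per 300 incorporated bases - a made-up but ballpark raw synthesis rate - the chance of an error-free molecule collapses with length:

```python
def perfect_fraction(length_bp, per_base_error):
    """P(a synthesized molecule of this length contains zero errors)."""
    return (1.0 - per_base_error) ** length_bp

ERROR_RATE = 1 / 300   # assumed raw synthesis error rate, illustration only

# 5,386 bp is Phi X 174; 582,970 bp is the synthetic M. genitalium chromosome.
for length in (100, 1_000, 5_386, 582_970):
    print(f"{length:>7} bp: {perfect_fraction(length, ERROR_RATE):.2e} error-free")
# The last value is effectively zero - hence error correction during assembly.
```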
Clyde Hutchison was in his lab at the time and still had one of the isolates stuck in his pocket.
One of the tenets of this new field of synthetic genomics is that you can only build
something as accurate as the quality of the data in its digitized form. So we went back to resequence
Phi X 174 and actually found, after all these years, only three differences. Sanger and colleagues
certainly set a standard that, unfortunately, very few have followed since. It took two weeks to go from
the design with the new sequence through to having a 5 kb piece of DNA, which we then stuck
into E. coli, and E. coli recognized this as normal phage DNA. It started making the viral particles.
As is classical in all biology, there's no gratitude: once you make something like a virus, it turns around
and kills the parent cell, and that's how you can detect it, with these clear plaques. So here's a cartoon
of Phi X 174. So we call this a case where the software is actually building the hardware, and we're now
working on taking this to the next stages. Our goal is actually to build a synthetic bacterial chromosome
as the next stage, based on work we're doing with the second genome ever done,
Mycoplasma genitalium. That's only about 500 genes, and we asked questions like: did it need all 500 of those?
Was there a smaller genome capable of self-replication, etc.? We thought we would just build Phi-X-size cassettes
and put those together. Again we went through this design phase, only first we resequenced the
Mycoplasma genitalium genome. The sequencing standard in 1995, when it was sequenced, was one error
in 10,000 base pairs, and that's about what we found: 30 errors in the genome. Some of those errors,
on building the molecules, would not have resulted in an accurate molecule, so it was good that we
went back and re-digitized the information. I think this could be a major limitation of the field going forward.
Our big concern was generating artifacts, so we went to intergenic regions where we knew
that we could insert DNA into the genome. We built watermarks into the genome. This can actually be fun.
We have a four-letter genetic code, and sets of triplets, as you know, code for roughly 20 amino acids.
There's a single-letter designation for each of those, so we can write words, names, sentences, etc.
And we watermarked it - you can see in the second line where those vertical bars are.
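The watermarking trick is just the amino-acid single-letter code run in reverse. A minimal encoder, with one arbitrarily chosen codon per amino acid (the actual watermark codons were chosen to suit the genome), might look like this:

```python
# One representative codon per single-letter amino acid (arbitrary choices;
# any synonymous codon would do for a watermark).
CODON = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT", "G": "GGT",
    "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG", "M": "ATG", "N": "AAT",
    "P": "CCT", "Q": "CAA", "R": "CGT", "S": "TCT", "T": "ACT", "V": "GTT",
    "W": "TGG", "Y": "TAT",
}

def watermark(text):
    """Encode a word as DNA via the amino-acid single-letter code.
    Letters with no amino acid (B, J, O, U, X, Z) are skipped."""
    return "".join(CODON[ch] for ch in text.upper() if ch in CODON)

print(watermark("VENTER"))   # GTTGAAAATACTGAACGT
```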
Basically the team signed the genome, because even a single molecule of the native chromosome
contaminating things could have led to a giant artifact. We started by making Phi-X-size pieces
and assembling four of those together to make 24 kb pieces. This took a very long time because
after each stage, we cloned them back into E. coli and sequenced them to make sure that artifacts
were not being generated. Many of you remember people trying to use Maynard Olson's YAC technology
for humans. It probably set back the human genome project by at least four years because of
the massive artifacts that were generated from shuffling and deleting the human DNA, and we wanted
to make sure that would not happen here. We think one of the reasons is the different codon
usage in mycoplasma versus E. coli. At a certain stage - the largest piece that had ever been made before
was 31 kilobases - when we got into the hundreds of kilobases, E. coli didn't like growing these pieces.
We shifted - as David probably would have told us to do in the first place - to yeast, and tried to use
homologous recombination for assembly. We'd been characterizing the Deinococcus radiodurans genome,
because it can take about three million rads of radiation and not be killed. Its chromosome gets blown apart -
that's after 1.75 million rads - and twenty-four hours later it stitches its genome back together exactly
as it was before. So we had a team working over a decade on isolating all the repair systems out of
Deinococcus, trying to make an in vitro recombination system, when we found that yeast does this extremely nicely,
using a technique developed here at NIH, TAR cloning. Not only did yeast grow these pieces
very nicely, but if you design them properly and put them in, yeast will actually assemble them for us,
and we were able to assemble the entire genome. It's a circular molecule, and it's so big you only need
a light microscope to see it, not an electron microscope. This is looking at the synthetic chromosome over a six-second time period.
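The design rule that lets yeast do the assembly is simply that neighboring pieces share terminal sequence. Here is a toy fragment-design sketch; the overlap length is an arbitrary illustrative choice, and real designs also check that each overlap is unique to its intended junction:

```python
def design_fragments(target, n_pieces, overlap=80):
    """Split a target sequence into pieces whose ends share `overlap` bp,
    so homologous recombination in yeast can stitch them back together."""
    core = len(target) // n_pieces
    pieces = []
    for i in range(n_pieces):
        start = max(0, i * core - overlap // 2)
        end = min(len(target), (i + 1) * core + overlap // 2)
        pieces.append(target[start:end])
    return pieces

target = "ACGT" * 500                       # 2 kb stand-in for a cassette
pieces = design_fragments(target, 4, overlap=40)
print([len(p) for p in pieces])             # [520, 540, 540, 520]
```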
And this was just recently published in Science. This is the largest human-made molecule of a defined structure.
It's over 300,000,000 in molecular weight; it's 582,000 base pairs. If you were to print it out in
ten-point font with no spacing, it would take 147 pages. It was designed and executed, and on
sequencing it was exactly what we had designed and made. We're in the process of trying to boot this up
right now. We have not done that, but we published the key prototype experiment last year on how
to boot up a chromosome. We did this by transferring a chromosome from one species to another,
completely transforming one species into another. We started with two related mycoplasmas,
roughly the same distance apart as humans and mice. We isolated the chromosome from mycoides
and treated it extensively with digestive enzymes to make sure there were no proteins associated with it.
We wanted this to be a model of what we could do with the naked DNA being built by the team led by Ham Smith.
We added a couple of gene sets to it, the lacZ operon and a selectable marker. We had thought about
these experiments for about a decade, and we actually thought we would have to remove the
chromosome from the recipient cell before the transplant, much in the way that's done with
eukaryotic nuclear transplantation, which is very simple to do - just pop the nucleus out of one cell
and pop another one in, a relatively simple mechanical step. With bacteria and Archaea, as you know,
the chromosome is integrated into the cytoplasm of the cell. There's no simple way to do this.
We've tried radiation. We've tried chemical damage. We have some unique methods with restriction enzymes
that we're trying now, but we decided we might be able to just leave the chromosome intact
and use a different set of processes. The capricolum genome does not have a restriction enzyme system;
mycoides does, and you'll appreciate this very sophisticated graphic that shows how this works.
We inserted the chromosome from mycoides into capricolum. The DNA started being read immediately,
and the restriction systems got expressed and recognized the resident chromosome as foreign DNA, chewing it up.
We were left, therefore, only with the transplanted chromosome. In a short while we had big
beautiful, blue cells, and when we did 2-D gels and sequenced the proteins off the gels, we found
that all the proteins had shifted to those encoded just by the transplanted mycoides chromosome.
All the characteristics of the parent cell were gone. Now Ham Smith has been my best friend and
colleague in science for a very long time. He got the Nobel prize in 1978 for discovering restriction enzymes.
I always appreciated them as a phenomenal tool in molecular biology, but until we did this set of experiments,
I never really appreciated their evolutionary role. Imagine if you had sex - and I'm sure some of
you are imagining that right now instead of paying attention - and from that act of having sex, you completely
converted the person you were having sex with into a new species, or into a clone of you.
Some people do argue that happens anyway, just mentally, but cells, evolutionarily, obviously
wouldn't want to have their identities hijacked. Restriction enzymes are a wonderful defense against this,
but at the same time, instead of this being a rare phenomenon that we think we created in the lab,
all of a sudden a lot of what we've seen from sequencing microbial genomes pops into clarity.
For example, many people argue there was no point in sequencing the cholera genome, because
it had basically the same 16S rRNA as E. coli; but we were all surprised when we found cholera
had two chromosomes, one that was indeed similar to E. coli and one that was very different.
So it's clear that cholera arose as a unique species either by absorbing a chromosome from another species
or by a cell fusion - nobody knows how that happens. We see this over and over and over again in the microbial world.
Deinococcus radiodurans has four genetic elements - two chromosomes, a megaplasmid, and a smaller plasmid -
all of different GC content, clearly indicating different origins. If you have restriction enzyme systems
that recognize the DNA as foreign, you can protect yourself, but we think this is a basic mechanism
of evolution, one that explains a lot, instead of depending, like the photoreceptors, on a point mutation
for a selective advantage. Think of it: in one genetic event, you can gain 2,000 or 3,000 new elements,
2,000 or 3,000 new characteristics, instantaneously - new speciation. So where do we go from here?
We're trying to do the same experiments now with the Mycoplasma genitalium chromosome.
Obviously the cells that we're using have different restriction systems. Had we made the
mycoides genome, we would be done by now. We thought the synthesis was going to be the complicated
part of this, and so we chose to make the smaller genome, but we think we're pretty close to overcoming
a number of barriers. In a short while, the dataset is going to be 20,000,000 different genes or
families of genes, clearly, and so I like to think about these as the design components of the future.
Yes, we need to know function, but this has so exceeded, you know, the number of scientists
and the person-years it would take to sort out the function of 20,000,000 genes that it's an impossible
task with current technologies and one-at-a-time protein studies. But by combining empirical science
and computational design, we think we can come out in a different place. This is part of the
software that's being worked on at Synthetic Genomics and in the institute, where we're trying to design
some unique pathways and species. The exciting part is that, using the homologous recombination
systems in yeast and other species, what took us four years to do in making this first large chromosome
we can now do in a very short period of time. We have a robot - the first stage of it is working quite well -
for making a large number of chromosomes, certainly not in the hundreds of thousands to millions yet,
but definitely in the hundreds, and we think this is going to be readily scalable, and we're
calling this new field combinatorial genomics, because we can take basically any sets - these
different cassettes that we built for Mycoplasma genitalium - and shuffle them.
We don't know simple things like whether gene order in a genome is important outside of operons.
If we take the same genes and just shuffle them, do we get the same answer? Obviously in
some parts we know order is important, but we can even shuffle the cassettes themselves, and we can just
screen these for simple questions: Is it viable? Does it produce the chemical we want, etc.?
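The combinatorics grow quickly, which is why high-throughput screening matters. Here is a hedged sketch, with hypothetical cassette names, that shuffles cassette order while keeping each operon intact:

```python
import math
import random

# Hypothetical cassettes; genes inside a tuple (an operon) stay together.
cassettes = [("dnaA",), ("rpoB", "rpoC"), ("ftsZ",), ("atpA", "atpB"), ("gyrA",)]
print(math.factorial(len(cassettes)), "possible cassette orders")   # 120

def shuffled_designs(cassettes, k, seed=0):
    """Sample k distinct genome layouts without ever splitting an operon."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < k:
        seen.add(tuple(rng.sample(cassettes, len(cassettes))))
    return [[gene for cassette in order for gene in cassette] for order in seen]

for design in shuffled_designs(cassettes, 3):
    print(design)
```

Even five cassettes give 120 orders; with dozens of cassettes the space can only be explored by building and screening many designs in parallel.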
We're working on trying to make designer fuels, viewing the biggest problem as not human disease
but what we're doing to the environment, particularly trying to capture new metabolic routes to
energy. We're not short of energy on this planet - we get 120,000 terawatts hitting us from the sun -
and so we're working actually on designing new species to capture CO2 as their source of carbon
and produce a number of compounds from that. The third genome that we did was the first archaeon.
This is a complete autotroph that grows at 85 degrees centigrade: CO2 is its complete source of carbon
and molecular hydrogen is its energy source, so we now have organisms that can work in a day-and-night cycle
capturing CO2. A lot of people have been working on feedstocks. We're doing a pretty silly experiment
in this country of taking corn and converting it into ethanol. In the process, in a year, we've doubled
the feedstock prices for all food, and it's shifting farmland into producing fuel. That clearly
won't work on a global basis. CO2 is a great feedstock. There are other combinatorial genomics applications:
working first with Chiron, which was then purchased by Novartis, we're very close now to having
the first genomic vaccine hit the clinics. This started shortly after we did the first few genomes:
working with Rino Rappuoli, we sequenced the Neisseria meningitidis genome and used predictions, including
phase variation, to understand which antigens would be stable. The teams at Chiron built a number of these.
We're now into Phase III clinical trials using this combination of antigens derived from the genome,
and the new vaccine is successful against a wide range of meningitis strains. Also there is this
incredible work by Bert Vogelstein using bacteria to treat cancer. He's actually been injecting
Clostridium, an anaerobically growing organism, into patients. The Clostridium homes in on the anaerobic
part of the tumors, releases a variety of substrates. It kills the tumors. It's incredibly effective.
It's in clinical trials. We're now designing synthetic organisms to do this without all the issues
of creating infectivity. So I think as we go into this new phase starting in the digital world,
hopefully it will greatly increase our understanding of basic life as we start to build these minimal genomes,
showing that we can understand things well enough at first principles to redesign and build something.
Hopefully it will impact the petrochemical industry, but I think it's certainly going to drive antibiotic
and vaccine discovery, and with programs like Vogelstein's, new therapeutics. Now many of you
remember that some of this work was started at NIH. We underwent an ethical review before even doing
the first experiments. The Sloan Foundation funded the Venter Institute, along with MIT, to do
a much more in-depth study, and that was published last fall. The executive branch of the government
has put together the NSABB, based on the Phi X 174 work, to review work in this area. They approved
making the 1918 flu virus at the CDC - the first true Jurassic Park scenario, isolating RNA
from patients buried in 1918 who died from the flu and reconstructing what was thought
to be an extinct virus. So, a point to leave on for the future of GenBank: because we
can watermark this DNA, it will be very trackable and, hopefully, anybody that makes synthetic
constructs in their labs will appropriately label them. I've already talked to David about starting
a synthetic genome section of the databases. We're starting with, and on top of, three and a half billion years
of evolution - not going back and recreating it, but starting with that - knowing all we have to do is
design the software of life, boot it up in cells, and start whole new speciation, perhaps a new
Cambrian explosion. David, congratulations. Thank you. [Applause] Landsman: Do we have questions for Dr. Venter?
Question: There was a panel at my college, UMBC up in Baltimore, called 'Ethical Implications of Synthetic Life,'
and your name was brought up several times because of these experiments you were just talking about.
There were differences of opinion on patenting of DNA sequences. Some of the panelists thought that
you should only be able to patent the non-natural sequence, some the specific strain that the
sequence you created was put into, so my question to you is you insert a gene that you've
created novely, a truly novel gene sequence, into the M. genetalium genome, and then you
synthesize the entire genome. What should you be able to patent - either the strain, the entire
organism, the gene itself, and why? Venter: Those are all good questions that a lot of people
are looking at right now, but as with the patenting of any DNA, you can't patent naturally-occurring DNA,
but all the genes that all the biotech and pharmaceutical companies have patented - they were able
to do that, because these were manmade constructs in the laboratory. So most of the patent
attorneys that we've talked to think that designing and synthesizing an entire genome from scratch
is clearly a manmade construct if there ever was one, and I think in the cases where these organisms -
for example, DuPont spent ten years and $100,000,000 making a new derivative of E. coli
by changing a couple dozen genes, so they could go from a six-carbon sugar to the three-carbon
propanediol. They patented that organism just changing 12 genes, not synthesizing the organism.
You know, you can find patent law going back to Chakrabarty on even patenting life forms that
have been discovered in the environment. I don't think that's a great idea, but I think synthetic
organisms that have novel economic value will certainly be subject to patents around the world.
Question: How long in the future do you think it would be before you can do multi-chromosomal
organisms? Thinking more in terms of trying to save some of the endangered species.
Venter: I think trying to reconstruct species other than simple bacteria by trying to save the
genetic code is certainly a recognition of the sad state of society that we're in, to have to do that.
Multi-cellular organisms are probably not far away in design. I think as you've heard many
times today, yeast has a very small genome. It's very easy to synthesize different chromosomes.
Yeast artificial chromosomes are in fact how we grow the synthetic bacterial genome. It really
depends on the purpose and the application. But I think, you know, the true Jurassic Park scenarios
are what capture a lot of people's imagination. I hope there are better ways to start to save species
than just hoping we can reconstruct them from the genetic code. Because people probably made
errors in sequencing the genetic code anyway, and so if we don't have the species to check accuracy,
we're likely to come up with some pretty bizarre constructs. I'd hate to see the human that came out of
the first two versions of the human genome sequences. [Laughter] Question: Yeah, that's a good point.
I work with Dr. Laurie Marker at the Cheetah Conservation Fund, and that's a case where it's an
endangered species because of issues with its genetics so - Venter: Yeah, you might be able
to induce genetic heterogeneity into species like that in an artificial way, but who knows?
This field is changing so rapidly, I think every time somebody says something is impossible,
I or somebody else proves them wrong. So I'm -- Question: Thank you very much.
Venter: -- trying to be cautious about saying things are impossible. Landsman: Any other questions?
Question: Referring to the ocean sampling experiment, how reproducible were the specific
fingerprints of a region when sampled at different times? Venter: That's a great question, one that's
come up a lot. A lot of people don't realize, you know, unless you're out sailing or swimming
in the ocean that the ocean moves. For example, there's roughly a one knot current moving
continuously across the Pacific Ocean from the Galapagos to the Marquesas, so just repeating
the same geographic measurement has some complications to it. We're doing repeated
sampling off of Scripps Pier in La Jolla, trying to answer that, where you can track the upwelling
changes, the seasonal variations, the storms, etc. We don't even know what the minimal distance
of change is. We're measuring every 200 miles. We saw some initial differences a mile apart in
the Sargasso Sea. All you have to do is look at a satellite image, and you see a totally heterogeneous
mixture of chlorophyll, for example. So there's no reason to assume the underlying microorganisms
aren't equally unevenly distributed. It's going to take a lot of work, and I think that's
going to be one of the major beneficiaries of greatly decreasing sequencing costs, that we can
go out and get repeated samples on a grid. We're now sampling deeper: the husband of one of our
faculty members, Shannon Williamson, is one of the Alvin pilots, and they've designed a rig
for doing continuous filtering a couple of miles down now, and so instead of just getting single isolates
off of high-temperature vents, we're going to have the whole milieu there, but it's going to raise
similar questions. How reproducible are these - you get these undersea eruptions and trillions of
organisms spew out. I didn't have a chance to mention that we've also now sampled a mile deep in the earth
and found more diversity than we find in the oceans. In a single microscopic field - and this was the
amazing thing to me; maybe for the microbiologists here it wouldn't be - it was loaded with
spirochetes, maybe because they're so mobile and can really swim rapidly down there. We're
getting even deeper samples now, basically going down to the temperature maximum for organisms,
around 140 degrees. Our sampling size is so small, and we're so early on that linear curve of discovery,
that we're probably going to find massive heterogeneity in every sample until we start to get some
[indiscernible] and some datasets that are probably five orders of magnitude larger than our current ones,
so that's something for the GenBanks of the future to ponder. Landsman: Any other questions?
Landsman: Well, thank you very much. Venter: Thank you. [Applause]