Well, thank you very much for the kind introductions. It's nice to be back on the NIH campus again,
and I was only briefly frisked this time coming through security - I didn't tell them I knew
David Lipman, so they let me through. Anyway, congratulations to NCBI, but particularly to David and
Jim Ostell and the team. These 25 years seem to have gone by rather quickly, at least looking back,
and we've enjoyed all the associations we've had with NCBI over the years and trying to change things.
So the work I'm going to be talking about emanated from basically four institutions starting with my lab
here at the NIH, then forming TIGR in 1992. This is the picture of the Venter Institute in Rockville, Maryland,
and here's a picture of one of the buildings in La Jolla, California, so it's now a bicoastal institute,
and also from some of the work at Celera Genomics. This was an article that Karen Remington
helped me write for the 10th anniversary of Nature Genetics looking at some of the things that led
to the advances in genomics. It wasn't just technology. A lot of basic ideas had a big impact such as
massive parallelism which is pretty obvious with massively parallel computers. It wasn't so obvious
to the group early on in terms of DNA sequencing as everybody wanted one machine that would
sequence the human genome. When we wanted to double the output, we just bought a second machine.
It seemed kind of logical at the time, but that expanded up to hundreds of machines. More important,
though, is something that's a mathematical concept as well as, at times, a social one, and that's randomness.
We used randomness throughout all the different approaches to make what have turned out to be some nice advances.
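How much random sequencing buys you can be estimated with the classic Lander-Waterman model. Here is a back-of-the-envelope sketch in Python; the Haemophilus numbers are illustrative round figures, not the exact project statistics:

```python
import math

def lander_waterman(genome_bp, read_bp, num_reads):
    """Expected shotgun statistics under the Lander-Waterman model."""
    coverage = num_reads * read_bp / genome_bp      # mean depth c = NL/G
    frac_unsequenced = math.exp(-coverage)          # P(base never sampled) = e^-c
    expected_contigs = num_reads * math.exp(-coverage)
    return coverage, frac_unsequenced, expected_contigs

# Illustrative round numbers for the 1.8 Mb Haemophilus influenzae project:
c, unseen, contigs = lander_waterman(1.8e6, 460, 24_000)
print(f"coverage {c:.1f}x; {unseen:.3%} of bases unsequenced; ~{contigs:.0f} contigs")
```

The point the model makes is the one in the talk: purely random reads, at modest coverage, leave only a tiny fraction of the genome untouched.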
We were actually sequencing some test regions of the human genome here at NIH in the late '80s.
We found that we could not interpret that sequence without having cDNAs. So we developed the idea
of just randomly sequencing cDNA clones from the library. This was after I spent ten years trying to get
a single gene, the human brain beta-adrenergic receptor, which we were finally able to clone in the mid-'80s.
It was the first and last gene that I sequenced by manual DNA sequencing. It was a strong motivation
to go into automation. This paper seems small by comparison to modern papers, but it was a big deal in 1991
to have 337 genes described in a single paper looking at random sequencing of brain libraries.
NCBI and GenBank, particularly driven by David, had the foresight to start a new database
at the time, dbEST. It started slowly, basically until our collaboration with Bert Vogelstein discovered
some new DNA repair enzymes associated with colon cancer. After that ESTs caught on. Our teams
have continued to submit them over the years with a variety of species, but when you look at how it
took off after the work on colon cancer, our contribution percentage-wise to the total was decreasing,
but these numbers have become pretty staggering. With the first paper in '91, we only had 337 ESTs in GenBank.
A decade later that number had jumped to 10,000,000. Two years later it doubled again, and downloading
things last night, in the latest release there are now 50,000,000 ESTs, amazingly over 8,000,000 from human alone.
Now there hasn't been a genome completed in which ESTs didn't play a major role in the annotation. In fact,
I think the annotation would be questionable without them. The cDNAs were very key, and it's amazing
that they continue to be generated at this rate. This was just a list, right off the dbEST website,
of the species with 300,000 or more ESTs. The list goes on in a very dramatic fashion. In fact, Aedes aegypti was
our latest contribution. Now it wasn't just the ESTs but the randomness, and more importantly some new algorithms
developed primarily by Granger Sutton. Granger Sutton came in with Chris Fields. Many of you will remember him.
He was sort of my impetus for leaving NIH, that and $70,000,000 to start an institute. I tried to get Chris
promoted here at NIH, and at the time biology was not part of the computing world and certainly
biologists thought computing was not part of the biological world, so that was sort of the final straw
in leaving to form The Institute for Genomic Research. But the team led by Granger Sutton
developed new algorithms for assembling ESTs, and they were extremely effective with hundreds of thousands,
so we decided we could go back and apply these same tools to genomics, and that's what led
Ham Smith, my other colleagues, and me to just try this experiment. The E. coli genome project was roughly
in its tenth year of funding without tremendous progress. The yeast genome was making better progress;
it took 1,000 scientists ten years to do, and it ended up being the fourth genome completed. Using just
mathematics and computing took such projects from decades down to months. This has progressed, so these are now down
to a single day in a single cell. Completed genomes started to come out exponentially. Looking back
at this literature, despite all the fuss and complaints, more than 90 percent of all the genomes completed
have now been done with the method that we published with the Haemophilus genome. This is just
looking at prokaryotic genomes. Again, like ESTs, our contributions were 100 percent early on,
and these have really caught on, particularly with the JGI contributing a substantial amount.
If we look at the submissions to the sequence databases - this is for finished prokaryotic genomes.
The JGI and the DOE have submitted 25 percent of all the genomes, and the Venter Institute in its
various incarnations including TIGR has done 23 percent. And it drops off pretty dramatically
after that, showing that some of these things happen faster in centers. Again these are NCBI websites looking at the clone-by-clone approaches versus the whole genome shotgun approach.
It's clear that the whole genome shotgun totally phase-shifted how things were done over
a very short period of time, so now if we look in the databases, the majority of what's there
has been done by whole-genome assembly, or at least the majority of the data comes from some random approach.
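The overlap idea at the heart of these random approaches can be sketched in a few lines of Python. This toy greedy assembler is only an illustration of the principle, nothing like the real TIGR or Celera assemblers:

```python
def overlap_len(a, b, min_olap=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_olap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                # no overlaps left: disjoint contigs remain
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Random fragments of the toy "genome" ACGTACGGTAGCTA reassemble into one contig.
print(greedy_assemble(["ACGTACG", "TACGGTAG", "GTAGCTA"]))
```

Real assemblers add quality values, repeat handling, and mate-pair constraints on top of this basic overlap-merge loop.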
The big change came only five years after Haemophilus, and that was the scale-up with
new sequencing technology but, more importantly, much improved algorithms. These were driven
primarily again by Granger Sutton but also by Gene Myers, whose team came up with the
Celera assembler. Despite its substantially larger size, Drosophila was sequenced in the same
four months that Haemophilus took five years earlier. The size of the compute went up substantially.
Nine months later the first draft of the human genome was assembled, this time from 27,000,000 sequences.
The results from our team were published in Science in 2001 and from the public effort in Nature.
It turns out both these efforts were dramatically flawed. The conclusion in 2000 and 2001 was that we
all shared the same set of genes, and that we all differed from each other in only 1 out of 1,000 letters
of the genetic code. In their wisdom, the early committees that I was part of with Lee Hood
threw out a number: it probably cost a dollar a base pair for sequencing. Nobody wanted to ask
for six billion dollars to do a diploid genome. They thought we could just get by with a haploid genome
and infer the rest. It turns out those inferences were wrong. Celera, ironically, sequenced from five individuals
and did a majority-rule assembly, again subtracting out a lot of the variation. The public
sequence had essentially no variation, because it came from BAC clones from a collection of people.
I say ironically because had we sequenced from just one individual in '99 through 2000, we would
have gotten the right answer eight years ago instead of this last September. But we went on to continue
to sequence, building on the sequences from a single individual - in this case me - from Celera,
adding to the data until we had a substantial assembly of both haploid genomes. In fact, we got tremendous
haplotype phasing based on the genetic variation, more than ten times what came out of the HapMap.
And we're adding to this even more substantially. Basically every major new sequencing manufacturer
is now testing their instruments against this reference diploid human genome, so we hope in the near future
we will have complete haplotype phasing going down the sets of chromosomes. The biggest surprise
is that my two haploid genomes differ from each other on the order of 0.5 percent. When we compare
any two people, it looks like we differ from each other on the order of one to three percent.
To put that number in context: I and others, based on the work of Svante Paabo, who worked
with Celera early on on the chimpanzee, told the world that we differ from chimps by 1.27 percent.
So audiences start to get very nervous around this point: if we differ from each other by 1 to 3 percent
and from chimps by 1.27, you're probably hoping that number has changed too, and many of you know that
it has - we're more like 5 to 6 percent different from chimps when we count INDELs and other types of variation.
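A quick back-of-the-envelope calculation with the talk's own figures shows how much of that variation a SNP-only view misses (the genome size is approximate, and the indel share is implied by subtraction, not measured here):

```python
genome_bp = 3.2e9                  # approximate haploid human genome size

snp_diffs = genome_bp / 1_000      # the "1 in 1,000 letters" view from 2001
print(f"SNPs alone: {snp_diffs / genome_bp:.2%} of bases differ")      # 0.10%

total_diffs = 0.005 * genome_bp    # ~0.5% between the two HuRef haploid genomes
indel_and_other = total_diffs - snp_diffs
print(f"implied indel/structural share: {indel_and_other / 1e6:.0f} Mb "
      f"({indel_and_other / total_diffs:.0%} of differing bases)")
```

On these rough numbers, most of the differing bases sit in indels and other structural variation, which is exactly the point about a SNP-centric industry.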
We have substantially more base pairs in INDELs than we do in SNPs, but we're in a SNP-centric world
right now. We have an industry that only knows how to measure SNPs, and they think these are
what account for human variation, and I think there are increasing surprises. I think Jim Watson's genome
is going to be published in the next week or two, so we'll have two genomes out there. His genome,
I think, is a full diploid genome, but the sequence is only a haploid genome, because it was not an
independently assembled genome; it was layered on the NCBI sequence. I argue that we need thousands,
perhaps ten thousand complete diploid genomes to understand human genetic variation. We can't
get that off a reference genome, and you can't get that looking at SNPs. We're actually looking at
a much more complex problem. The sequencing is pretty straightforward. Some of the new technologies
that are coming along in the next 12 to 18 months are really going to change what can be done
very dramatically, possibly down into the few-thousand-dollar range per genome. But this data is basically useless.
It's a different version of what David Botstein was just telling you, in this case without phenotypic information.
And so we're working with groups at UC San Diego and other places to try to define the human phenotype
in a way that can be digitized, so we can compare the sequence information across the genome
to the human phenotypes, and with that, using complex analysis, we have a chance of answering
most nature-versus-nurture questions. If we just collect sequence - and NIH has proposed collecting partial
sequence from individuals with no phenotypic information - it's going to be mostly a wasted set of information.
You can go to the PLoS Biology website and download the file that has the map showing the various types
of variation, and there's a new browser just released this last month from the JVI.org website as well.
But the challenge is going to be comparing all these genomes. It certainly helped when
NCBI started the Trace database; that was a huge leap forward, certainly for scientists and teams
like ours that use this data. Instead of relying on other people's assemblies and interpretations,
we're able to download the raw data, reanalyze it, reassemble it, and do other things with it.
The trouble is that the traces from the public human genome dataset are still not available.
We've been told they probably never will be. We have continued making additions: just as of last month,
we've now put over 103,000,000 sequences into the Trace Archive, showing how our total is
growing over time and contributing to the efforts. Now, after we finished this rough draft of human,
we looked around for other things we could apply these same techniques to. Knowing that the
environment was important, we decided to look at it broadly by applying
shotgun sequencing to the environment, and that led to this paper - Karen Remington was a
key contributor - on shotgun sequencing what we found in the Sargasso Sea by simply
filtering organisms out of seawater. In fact, we stopped sequencing after about a billion base pairs:
over 1.2 million previously unknown genes and as many as 40,000 new species. This started
getting applied at the institute in multiple directions, including the start of the Human Microbiome Project,
which is now expanding greatly. We're not just a hundred trillion cells with 23,000 genes. We have literally
thousands and thousands of organisms associated with our bodies, so the genes from our associated organisms
add up to maybe 10,000,000 or more versus our meager 23,000, and efforts
are now starting to characterize these in much broader populations. Groups like Metabolon use
high-throughput mass spec to look at chemicals in the bloodstream. They've catalogued that our
genetic code can lead to around 2,400 unique chemical compounds. If we were to take a blood sample
from any of you after a meal, there are around 500 pretty substantial compounds circulating in your bloodstream.
Interestingly, about 60 percent are from human metabolism, and about 30 percent are just from what you consumed -
chemicals absorbed out of the various species that you ate. Interestingly, about 10 percent of what is
circulating in your blood at any one time is from bacterial metabolites of what you just ate. So not only
are you what you eat; you are what you feed your bacteria. So the role they play in our physiology
is now just being looked at for the first time. We went back and started the Sorcerer II expedition,
trying to ask questions on a global scale with these techniques. The results of the first third of this
were published in a special issue of PLoS Biology last spring. This is the route that Sorcerer II followed,
similar to the route of the HMS Challenger in the 1870s, and as with the Challenger, we took samples
every 200 miles, only they sent a dredge to the bottom and counted organisms they could see.
We collected filters, put the filters in a freezer, brought them back to the lab in Washington and
shotgun sequenced everything off of these filters. This is a picture of rapid obsolescence. Three years ago
this was a state-of-the-art facility. We're now selling these machines for less than $50,000 each, if
you're looking for a DNA sequencer, because the output of 100 ABI machines can now be replaced
with a single 454 machine or other technologies. We switched to 454 because we think the length
of sequence reads and doing parallel sequencing are essential for accurate human and other sequencing,
but there's even more exciting single-molecule technologies on the near horizon. To everybody's
surprise, we found that every 200 miles in the ocean, 85 percent of the sequences were totally unique,
to the point we can actually tell where a water sample came from by the sequence information contained
in that water sample. This is from the original Sargasso Sea paper, and one of the questions people had,
after we found so much unique biology, was how could it be there? The ocean, particularly the Sargasso Sea,
was supposed to be a desert with very low nutrients. It turns out almost every organism in the top layers
of the ocean that's not a photosynthetic organism has photoreceptors - bacterial rhodopsins - very similar
to our own visual pigments. Looking at GenBank, the blue at the bottom was our knowledge of
photoreceptor biology in terms of distribution, including humans, before these findings. This was
just from the first sample. You can see a deep branching. These have been discovered in a linear fashion.
They're in the tens of thousands now. There are reasons for lining them up: we can get down to a single
amino acid residue that can predict the wavelength of light that these receptors will see, and so we can
ask unique questions because now we have a huge metadata set, not just sequence information.
We have, for example, GPS coordinates for every single sequence, and we can ask questions about
unique distributions. In fact, we found them. In the middle of the Sargasso Sea, the ocean is a deep indigo blue.
The receptors on the organisms that thrive there see primarily blue light. You get into coastal waters
where there's a lot of chlorophyll, they see primarily green light. Places like the Panama Canal that are
pure fresh water, they see entirely green light. We now have about eight new derivatives
for which we've yet to determine the wavelength of light, but last year a Swedish group published this
nice letter in Nature showing that these are actually phototrophic organisms. This is not photosynthesis.
This is basically a seven-transmembrane receptor, very similar to the adrenaline receptor.
Light hits it, and it moves ions across the membrane, leading to some key chemical reactions
that generate energy for these cells. The other surprise is that species were not
nice, neat punctate entities, and this is true whether it's your gut, your oral cavity,
the ocean, the air, or the soil outside this auditorium. This is a tool that Doug Rusch built that you
can download. It puts a reference genome across the top. Each of these little bars is 900 base pairs
of sequence that match that reference genome. We see very little at the 99 to 100 percent level
with any organism in the ocean, particularly ones that were claimed to be the most abundant species.
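Conceptually, that viewer is a fragment-recruitment plot, and the binning underneath can be sketched as follows (the hits are hypothetical toy data; the real tool works from large-scale alignments):

```python
from collections import Counter

def recruitment_histogram(hits, bin_pct=1):
    """Bin read-vs-reference alignments by percent identity.

    hits: iterable of (reference_position, percent_identity) pairs,
    e.g. parsed from BLAST tabular output. Returns identity-bin counts.
    """
    counts = Counter(int(pct // bin_pct) * bin_pct for _, pct in hits)
    return dict(sorted(counts.items(), reverse=True))

# Hypothetical hits against a SAR11 reference: almost nothing at 99-100%.
hits = [(1200, 86.3), (5400, 99.1), (7700, 84.9), (9100, 78.2), (12000, 85.5)]
for ident, n in recruitment_histogram(hits).items():
    print(f"{ident:>3}-{ident + 1}% identity: {n} read(s)")
```

It is exactly this kind of histogram that reveals the cloud of related-but-diverged organisms sitting below the 99 percent identity line.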
SAR11 was supposed to be a single species; it's basically all under a single 16S rRNA sequence.
So the people who were telling you they were measuring diversity with 16S rRNA were measuring
diversity, but they were missing two to three orders of magnitude of it. We can take
a slice of any of this data, create trees, and ask different questions. These are color-coded by site.
You can compare the Atlantic versus the Pacific. We see things like major changes in phosphate metabolism
in different oceans. I think the bottom is the most interesting one. We've seen a switch between
blue and green light four times in recent evolution. A single base pair of the genetic code changes
that single amino acid. It changes the wavelength of light. This is classical Darwinian evolution.
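That switch is concrete enough to sketch in code. The snippet below assumes the commonly cited proteorhodopsin tuning site at residue 105, with leucine marking green-absorbing (~525 nm) and glutamine blue-absorbing (~490 nm) variants; treat the exact position and wavelengths as illustrative assumptions:

```python
# Toy spectral-tuning call for proteorhodopsin-like sequences.
# Assumption: residue 105 (1-based, proteorhodopsin numbering) carries the
# blue/green switch - Leu ~ green-absorbing, Gln ~ blue-absorbing.
TUNING_POSITION = 105

def predicted_color(aligned_protein_seq):
    residue = aligned_protein_seq[TUNING_POSITION - 1]
    return {"L": "green (~525 nm)", "Q": "blue (~490 nm)"}.get(
        residue, f"unknown variant ({residue})")

# Hypothetical aligned sequences differing only at the tuning site:
coastal = "X" * 104 + "L" + "X" * 40
open_ocean = "X" * 104 + "Q" + "X" * 40
print(predicted_color(coastal), "|", predicted_color(open_ocean))
```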
You get a mutation, you get increased or decreased survivability depending on that environment.
What we see with this huge distribution of related organisms with as much as 40 percent
sequence variation is not predicted by classical Darwinian evolution. But you can ask many
different questions of this type of data, and a variety of questions started to come out of it.
What was the novelty in this data? Were these just new members of known families?
If we're discovering things, what was the rate of discovery, and this is where Shibu Yooseph
at the institute led an effort to compare the GOS dataset to everything we could download
out of the public databases at the time. This was just from Halifax to the Galapagos, and it contained roughly
twice as many proteins as were in all the public databases. They proceeded to do a compute using
over 1,000,000 CPU hours to assemble all this data together. We were truly surprised by this,
and this is a Venn diagram of the totals at the time: the smaller, gray circle was GenBank.
Looking at microbial sequences, there were only 115 gene families in GenBank that our dataset
did not hit, and yet we had about 4,000 new gene families that were not seen in GenBank.
Now we see a huge tail on this data. As of a couple of years ago, these computes gave about 50,000
gene families with ten or more sequences in the public databases, but every time
somebody takes a new environmental sample, they add new gene families in a linear fashion.
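At its heart, building such families is greedy similarity clustering. The toy sketch below is a stand-in for that million-CPU-hour compute; real pipelines use proper alignments and BLAST-style searches rather than this pre-aligned shortcut:

```python
def identity(a, b):
    """Crude percent identity of two equal-length (pre-aligned) sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.7):
    """Assign each sequence to the first cluster whose representative it
    matches above the threshold; otherwise it founds a new family."""
    families = []                     # each family: [representative, members...]
    for s in sorted(seqs, key=len, reverse=True):
        for fam in families:
            if identity(fam[0][: len(s)], s) >= threshold:
                fam.append(s)
                break
        else:
            families.append([s])
    return families

seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "MNSERGVWTQ", "MKTAYIGKQR"]
print(len(greedy_cluster(seqs)), "families")   # expect 2
```

The linear discovery curve in the talk falls straight out of this picture: each new environmental sample contributes sequences that found new clusters rather than joining existing ones.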
If we look at subparts of the evolutionary tree, we get a different answer. For example, if we look at
mammalian genomes, my contention is there's no point in sequencing another mammalian genome
if you just want to discover new genes. It's totally saturated. You can discover new combinations
of those genes that lead to that species or its variants. We're going to do an awful lot of humans
as a community in the near future, not for gene discovery but for variant analysis. So metagenomics
is now a new part of NCBI and GenBank. We've more than doubled the number of peptide sequences
in the public databases with a single publication last spring, and we hope to do so again this year
with around 35,000,000 new gene sequences. But this is catching on; people around the world
are picking up on this, particularly as the new sequencing technologies come online. One major funder
has been providing 454 sequencers to a variety of labs. Fortunately, the sequence read length is improving
with 454 later this year, and I think that data will be good as well. We tried to put all this information
together to ask some unique questions. Obviously one that we've all asked ourselves at various times
is the definition of life. Can we do classical reductionist biology and pare life down to its most basic components?
The third question we've obviously answered: we can digitize it. That's what we've all been doing
for the last 20 years. When we sequence a genome, we're digitizing that biological information.
And lately for the last decade or so, we've been trying to answer the last question, can we start
with that digital information in the computer and go the other direction and regenerate life.
So working with Ham Smith and Clyde Hutchison and a team of about 30 scientists, we've
been trying to work on designing and synthesizing life starting with the genetic code.
Again, a couple of simple questions that are difficult to answer: Will the chemistry permit making large
accurate pieces of DNA, entire chromosomes? And if we can make these pieces of DNA, can
we boot them up? When we first tried this, the answer was no even for things on the order of 5 kb,
because DNA synthesis is a degenerate process. The longer you make the sequence, the more
errors there are, but we've developed new mechanisms for correcting those errors as part of the
assembly, and we started with Phi X 174, the first genome - the first phage - sequenced by Fred Sanger.
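Why even 5 kb was hard is simple probability. Assuming, purely for illustration, one error per 300 incorporated bases - a made-up but ballpark raw synthesis rate - the chance of an error-free molecule collapses with length:

```python
def perfect_fraction(length_bp, per_base_error):
    """P(a synthesized molecule of this length contains zero errors)."""
    return (1.0 - per_base_error) ** length_bp

ERROR_RATE = 1 / 300   # assumed raw synthesis error rate, illustration only

# 5,386 bp is Phi X 174; 582,970 bp is the synthetic M. genitalium chromosome.
for length in (100, 1_000, 5_386, 582_970):
    print(f"{length:>7} bp: {perfect_fraction(length, ERROR_RATE):.2e} error-free")
# The last value is effectively zero - hence error correction during assembly.
```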
Clyde Hutchison was in his lab at the time and still had one of the isolates stuck in his pocket.
One of the tenets of this new field of synthetic genomics is that you can only build
something as accurate as the quality of the data in its digitized form. So we went back to resequence
Phi X 174 and actually found, after all these years, only three differences. Sanger and colleagues
certainly set a standard that, unfortunately, very few have followed since. It took two weeks to go from
the design with the new sequence through to having a 5 kb piece of DNA, which we then stuck
into E. coli, and E. coli recognized this as normal phage DNA. It started making the viral particles.
As is classical in all biology, there's no gratitude: once you make something like a virus, it turns around
and kills the parent cell, and that's how you can detect it, with these clear plaques. So here's a cartoon
of Phi X 174. So we call this a case where the software is actually building the hardware, and we're now
working on taking this to the next stages. Our goal is actually to build a synthetic bacterial chromosome
as the next stage, based on work we're doing with the second genome ever done,
Mycoplasma genitalium. That's only about 500 genes, and we asked questions like: did it need all 500 of those?
Was there a smaller genome capable of self-replication, etc.? We thought we would just build Phi-X-size cassettes
and put those together. Again we went through this design phase, only first we resequenced the
Mycoplasma genitalium genome. The sequencing standard in 1995, when it was sequenced, was one error
in 10,000 base pairs, and that's about what we found: 30 errors in the genome. Some of those errors,
on building the molecules, would not have resulted in an accurate molecule, so it was good that we
went back and re-digitized the information. I think this could be a major limitation of the field going forward.
Our big concern was generating artifacts, so we went to intergenic regions where we knew
that we could insert DNA into the genome. We built watermarks into the genome. This can actually be fun.
We have a four-letter genetic code, and sets of triplets, as you know, code for roughly 20 amino acids.
There's a single-letter designation for each of those, so we can write words, names, sentences, etc.
And we watermarked it - you can see in the second line where those vertical bars are.
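The watermarking trick is just the amino-acid single-letter code run in reverse. A minimal encoder, with one arbitrarily chosen codon per amino acid (the actual watermark codons were chosen to suit the genome), might look like this:

```python
# One representative codon per single-letter amino acid (arbitrary choices;
# any synonymous codon would do for a watermark).
CODON = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT", "G": "GGT",
    "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG", "M": "ATG", "N": "AAT",
    "P": "CCT", "Q": "CAA", "R": "CGT", "S": "TCT", "T": "ACT", "V": "GTT",
    "W": "TGG", "Y": "TAT",
}

def watermark(text):
    """Encode a word as DNA via the amino-acid single-letter code.
    Letters with no amino acid (B, J, O, U, X, Z) are skipped."""
    return "".join(CODON[ch] for ch in text.upper() if ch in CODON)

print(watermark("VENTER"))   # GTTGAAAATACTGAACGT
```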
Basically the team signed the genome, because even a single molecule of the native chromosome
contaminating things could have led to a giant artifact. We started by making Phi-X-size pieces
and assembling four of those together to make 24 kb pieces. This took a very long time because
after each stage, we cloned them back into E. coli and sequenced them to make sure that artifacts
were not being generated. Many of you remember people trying to use Maynard Olson's YAC technology
for humans. It probably set back the human genome project by at least four years because of
the massive artifacts that were generated from shuffling and deleting the human DNA, and we wanted
to make sure that would not happen here. We think one of the reasons is the different codon
usage in mycoplasma versus E. coli. At a certain stage - the largest piece that had ever been made before
was 31 kilobases - when we got into the hundreds of kilobases, E. coli didn't like growing these pieces.
We shifted - as David probably would have told us to do in the first place - to yeast, and tried to use
homologous recombination for assembly. We'd been characterizing the Deinococcus radiodurans genome,
because it can take about three million rads of radiation and not be killed. Its chromosome gets blown apart -
that's after 1.75 million rads - and twenty-four hours later it stitches its genome back together exactly
as it was before. So we had a team working over a decade on isolating all the repair systems out of
Deinococcus, trying to make an in vitro recombination system, when we found that yeast does this extremely nicely,
using a technique developed here at NIH, TAR cloning. Not only did yeast grow these pieces
very nicely, but if you design them properly and put them in, yeast will actually assemble them for us,
and we were able to assemble the entire genome. It's a circular molecule, and it's so big you only need
a light microscope to see it, not an electron microscope. This is looking at the synthetic chromosome over a six-second time period.
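The design rule that lets yeast do the assembly is simply that neighboring pieces share terminal sequence. Here is a toy fragment-design sketch; the overlap length is an arbitrary illustrative choice, and real designs also check that each overlap is unique to its intended junction:

```python
def design_fragments(target, n_pieces, overlap=80):
    """Split a target sequence into pieces whose ends share `overlap` bp,
    so homologous recombination in yeast can stitch them back together."""
    core = len(target) // n_pieces
    pieces = []
    for i in range(n_pieces):
        start = max(0, i * core - overlap // 2)
        end = min(len(target), (i + 1) * core + overlap // 2)
        pieces.append(target[start:end])
    return pieces

target = "ACGT" * 500                       # 2 kb stand-in for a cassette
pieces = design_fragments(target, 4, overlap=40)
print([len(p) for p in pieces])             # [520, 540, 540, 520]
```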
And this was just recently published in Science. This is the largest human-made molecule of a defined structure.
It's over 300,000,000 in molecular weight; it's 582,000 base pairs. If you were to print it out in
ten-point font with no spacing, it would take 147 pages. It was designed and executed, and on
sequencing it was exactly what we had designed and made. We're in the process of trying to boot this up
right now. We have not done that, but we published the key prototype experiment last year on how
to boot up a chromosome. We did this by transferring a chromosome from one species to another,
completely transforming one species into another. We started with two related mycoplasmas,
roughly the same distance apart as humans and mice. We isolated the chromosome from mycoides
and treated it extensively with digestive enzymes to make sure there were no proteins associated with it.
We wanted this to be a model of what we could do with the naked DNA being built by the team led by Ham Smith.
We added a couple of gene sets to it, the lacZ operon and a selectable marker. We had thought about
these experiments for about a decade, and we actually thought we would have to remove the
chromosome from the recipient cell before the transplant, much in the way that's done with
eukaryotic nuclear transplantation, which is very simple to do - just pop the nucleus out of one cell
and pop another one in, a relatively simple mechanical step. With bacteria and Archaea, as you know,
the chromosome is integrated into the cytoplasm of the cell. There's no simple way to do this.
We've tried radiation. We've tried chemical damage. We have some unique methods with restriction enzymes
that we're trying now, but we decided we might be able to just leave the chromosome intact
and use a different set of processes. The capricolum genome does not have a restriction enzyme system;
mycoides does, and you'll appreciate this very sophisticated graphic that shows how this works.
We inserted the chromosome from mycoides into capricolum. The DNA started being read immediately,
and the restriction systems got expressed and recognized the resident chromosome as foreign DNA, chewing it up.
We were left, therefore, only with the transplanted chromosome. In a short while we had big
beautiful, blue cells, and when we did 2-D gels and sequenced the proteins off the gels, we found
that all the proteins had shifted to those encoded just by the transplanted mycoides chromosome.
All the characteristics of the parent cell were gone. Now Ham Smith has been my best friend and
colleague in science for a very long time. He got the Nobel prize in 1978 for discovering restriction enzymes.
I always appreciated them as a phenomenal tool in molecular biology, but until we did this set of experiments,
I never really appreciated their evolutionary role. Imagine if you had sex - and I'm sure some of
you are imagining that right now instead of paying attention - and from that act of having sex, you completely
converted the person you were having sex with into a new species, or into a clone of you.
Some people do argue that happens anyway, just mentally, but cells, evolutionarily, obviously
wouldn't want to have their identities hijacked. Restriction enzymes are a wonderful defense against this,
but at the same time, instead of this being a rare phenomenon that we think we created in the lab,
all of a sudden a lot of what we've seen from sequencing microbial genomes pops into clarity.
For example, many people argue there was no point in sequencing the cholera genome, because
it had basically the same 16S rRNA as E. coli; but we were all surprised when we found cholera
had two chromosomes, one that was indeed similar to E. coli and one that was very different.
So it's clear that cholera arose as a unique species either by absorbing a chromosome from another species
or by a cell fusion - nobody knows how that happens. We see this over and over and over again in the microbial world.
Deinococcus radiodurans has four genetic elements - two chromosomes, a megaplasmid, and a smaller plasmid -
all of different GC content, clearly indicating different origins. If you have restriction enzyme systems
that recognize the DNA as foreign, you can protect yourself, but we think this is a basic mechanism
of evolution, one that explains a lot, instead of depending, like the photoreceptors, on a point mutation
for a selective advantage. Think of it: in one genetic event, you can gain 2,000 or 3,000 new elements,
2,000 or 3,000 new characteristics, instantaneously - new speciation. So where do we go from here?
We're trying to do the same experiments now with the Mycoplasma genitalium chromosome.
Obviously the cells that we're using have different restriction systems. Had we made the
mycoides genome, we would be done by now. We thought the synthesis was going to be the complicated
part of this, and so we chose to make the smaller genome, but we think we're pretty close to overcoming
a number of barriers. In a short while, the dataset is going to be 20,000,000 different genes or
families of genes, clearly, and so I like to think about these as the design components of the future.
Yes, we need to know function, but this has so exceeded, you know, the number of scientists
and the person-years it would take to sort out the function of 20,000,000 genes that it's an impossible
task with current technologies and one-at-a-time protein studies. But by combining empirical science
and computational design, we think we can come out in a different place. This is part of the
software that's being worked on at Synthetic Genomics and in the institute, where we're trying to design
some unique pathways and species. The exciting part is that, using the homologous recombination
systems in yeast and other species, what took us four years to do in making this first large chromosome
we can now do in a very short period of time. We have a robot - the first stage of it is working quite well -
for making a large number of chromosomes, certainly not in the hundreds of thousands to millions yet,
but definitely in the hundreds, and we think this is going to be readily scalable, and we're
calling this new field combinatorial genomics, because we can take basically any sets - these
different cassettes that we built for Mycoplasma genitalium - and shuffle them.
We don't know simple things like whether gene order in a genome is important outside of operons.
If we take the same genes and just shuffle them, do we get the same answer? Obviously in
some parts we know order is important, but we can even shuffle the cassettes themselves, and we can just
screen these for simple questions: Is it viable? Does it produce the chemical we want, etc.?
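The combinatorics grow quickly, which is why high-throughput screening matters. Here is a hedged sketch, with hypothetical cassette names, that shuffles cassette order while keeping each operon intact:

```python
import math
import random

# Hypothetical cassettes; genes inside a tuple (an operon) stay together.
cassettes = [("dnaA",), ("rpoB", "rpoC"), ("ftsZ",), ("atpA", "atpB"), ("gyrA",)]
print(math.factorial(len(cassettes)), "possible cassette orders")   # 120

def shuffled_designs(cassettes, k, seed=0):
    """Sample k distinct genome layouts without ever splitting an operon."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < k:
        seen.add(tuple(rng.sample(cassettes, len(cassettes))))
    return [[gene for cassette in order for gene in cassette] for order in seen]

for design in shuffled_designs(cassettes, 3):
    print(design)
```

Even five cassettes give 120 orders; with dozens of cassettes the space can only be explored by building and screening many designs in parallel.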
We're working on trying to make designer fuels, viewing the biggest problem as not human disease
but what we're doing to the environment, particularly trying to capture new metabolic routes to
energy. We're not short of energy on this planet - we get 120,000 terawatts hitting us from the sun -
and so we're working actually on designing new species to capture CO2 as their source of carbon
and produce a number of compounds from that. The third genome that we did was the first archaeon.
This is a complete autotroph that grows at 85 degrees centigrade: CO2 is its complete source of carbon
and molecular hydrogen is its energy source, so we now have organisms that can work in a day-and-night cycle
capturing CO2. A lot of people have been working on feedstocks. We're doing a pretty silly experiment
in this country of taking corn and converting it into ethanol. In the process, in a year, we've doubled
the feedstock prices for all food, and it's shifting farmland into producing fuel. That clearly
won't work on a global basis. CO2 is a great feedstock. There are other combinatorial genomics applications:
working first with Chiron, which was then purchased by Novartis, we're very close now to having
the first genomic vaccine hit the clinics. This started shortly after we did the first few genomes:
working with Rino Rappuoli, we sequenced the Neisseria meningitidis genome and used predictions, including
phase variation, to understand which antigens would be stable. The teams at Chiron built a number of these.
We're now into Phase III clinical trials using this combination of antigens derived from the genome,
and the new vaccine is successful against a wide range of meningitis strains. Also there is this
incredible work by Bert Vogelstein using bacteria to treat cancer. He's actually been injecting
Clostridium, an anaerobically growing organism, into patients. The Clostridium homes in on the anaerobic
part of the tumors, releases a variety of substrates. It kills the tumors. It's incredibly effective.
It's in clinical trials. We're now designing synthetic organisms to do this without all the issues
of creating infectivity. So I think as we go into this new phase starting in the digital world,
hopefully it will greatly increase our understanding of basic life as we start to build these minimal genomes,
showing that we can understand things well enough at first principles to redesign and build something.
Hopefully it will impact the petrochemical industry, but I think it's certainly going to drive antibiotic
and vaccine discovery, and with programs like Vogelstein's, new therapeutics. Now many of you
remember that some of this work was started at NIH. We underwent an ethical review before even doing
the first experiments. The Sloan Foundation funded the Venter Institute, along with MIT, to do
a much more in-depth study, and that was published last fall. The executive branch of the government
has put together the NSABB, based on the Phi X 174 work, to review work in this area. They approved
making the 1918 flu virus at the CDC - the first true Jurassic Park scenario, isolating RNA
from patients buried in 1918 who died from the flu and reconstructing what was thought
to be an extinct virus. So, a point to leave on for the future of GenBank: because we
can watermark this DNA, it will be very trackable and, hopefully, anybody that makes synthetic
constructs in their labs will appropriately label them. I've already talked to David about starting
a synthetic genome section of the databases. We're starting with, and on top of, three and a half billion years
of evolution - not going back and recreating it, but starting with that - knowing all we have to do is
design the software of life, boot it up in cells, and start whole new speciation, perhaps a new
Cambrian explosion. David, congratulations. Thank you. [Applause] Landsman: Do we have questions for Dr. Venter?
Question: There was a panel at my college, UMBC up in Baltimore, called 'Ethical Implications of Synthetic Life,'
and your name was brought up several times because of these experiments you were just talking about.
There were differences of opinion on patenting of DNA sequences. Some of the panelists thought that
you should only be able to patent the non-natural sequence, some the specific strain that the
sequence you created was put into, so my question to you is you insert a gene that you've
created novely, a truly novel gene sequence, into the M. genetalium genome, and then you
synthesize the entire genome. What should you be able to patent - either the strain, the entire
organism, the gene itself, and why? Venter: Those are all good questions that a lot of people
are looking at right now, but as with the patenting of any DNA, you can't patent naturally-occurring DNA,
but all the genes that all the biotech and pharmaceutical companies have patented - they were able
to do that, because these were manmade constructs in the laboratory. So most of the patent
attorneys that we've talked to think that designing and synthesizing an entire genome from scratch
is clearly a manmade construct if there ever was one, and I think in the cases where these organisms -
for example, DuPont spent ten years and $100,000,000 making a new derivative of E. coli
by changing a couple dozen genes, so they could go from a six-carbon sugar to the three-carbon
propanediol. They patented that organism just changing 12 genes, not synthesizing the organism.
You know, you can find patent law going back to Chakrabarty on even patenting life forms that
have been discovered in the environment. I don't think that's a great idea, but I think synthetic
organisms that have novel economic value will certainly be subject to patents around the world.
Question: How long in the future do you think it would be before you can do multi-chromosomal
organisms? Thinking more in terms of trying to save some of the endangered species.
Venter: I think trying to reconstruct species other than simple bacteria by trying to save the
genetic code is certainly a recognition of the sad state of society that we're in, to have to do that.
Multi-cellular organisms are probably not far away in design. I think as you've heard many
times today, yeast has a very small genome. It's very easy to synthesize different chromosomes.
Yeast artificial chromosomes are in fact how we grow the synthetic bacterial genome. It really
depends on the purpose and the application. But I think, you know, the true Jurassic Park scenarios
are what capture a lot of people's imagination. I hope there are better ways to start to save species
than just hoping we can reconstruct them from the genetic code. Because people probably made
errors in sequencing the genetic code anyway, and so if we don't have the species to check accuracy,
we're likely to come up with some pretty bizarre constructs. I'd hate to see the human that came out of
the first two versions of the human genome sequences. [Laughter] Question: Yeah, that's a good point.
I work with Dr. Laurie Marker at the Cheetah Conservation Fund, and that's a case where it's an
endangered species because of issues with its genetics so - Venter: Yeah, you might be able
to induce genetic heterogeneity into species like that in an artificial way, but who knows?
This field is changing so rapidly, I think every time somebody says something is impossible,
I or somebody else proves them wrong. So I'm -- Question: Thank you very much.
Venter: -- trying to be cautious about saying things are impossible. Landsman: Any other questions?
Question: Referring to the ocean sampling experiment, how reproducible were the specific
fingerprints of a region when sampled at different times? Venter: That's a great question, one that's
come up a lot. A lot of people don't realize, you know, unless you're out sailing or swimming
in the ocean that the ocean moves. For example, there's roughly a one knot current moving
continuously across the Pacific Ocean from the Galapagos to the Marquesas, so just repeating
the same geographic measurement has some complications to it. We're doing repeated
sampling off of Scripps Pier in La Jolla, trying to answer that, where you can track the upwelling
changes, the seasonal variations, the storms, etc. We don't even know what the minimal distance
of change is. We're measuring every 200 miles. We saw some initial differences a mile apart in
the Sargasso Sea. All you have to do is look at a satellite image, and you see a totally heterogeneous
mixture of chlorophyll, for example. So there's no reason to assume the underlying microorganisms
aren't equally unevenly distributed. It's going to take a lot of work, and I think that's
going to be one of the major beneficiaries of greatly decreasing sequencing costs, that we can
go out and get repeated samples on a grid. We're now sampling deeper: the husband of one of our
faculty members, Shannon Williamson, is one of the Alvin pilots, and they've designed a rig
for doing continuous filtering a couple of miles down now, and so instead of just getting single isolates
off of high-temperature vents, we're going to have the whole milieu there, but it's going to raise
similar questions. How reproducible are these - you get these undersea eruptions and trillions of
organisms spew out. I didn't have a chance to mention that we've also now sampled a mile deep in the earth
and found more diversity than we find in the oceans. In a single microscopic field - and this was the
amazing thing to me; maybe for the microbiologists here it wouldn't be - it was loaded with
spirochetes, maybe because they're so mobile and can really swim rapidly down there. We're
getting even deeper samples now, basically going down to the temperature maximum for organisms,
around 140 degrees. Our sampling size is so small, and we're so early on that linear curve of discovery,
that we're probably going to find massive heterogeneity in every sample until we start to get some
[indiscernible] and some datasets that are probably five orders of magnitude larger than our current ones,
so that's something for the GenBanks of the future to ponder. Landsman: Any other questions?
Landsman: Well, thank you very much. Venter: Thank you. [Applause]