Ensembl - Intro to genome browsing

Hello, my name’s Emily and I’m an Outreach Officer for the Ensembl project. In this video, I’d like to give you a basic introduction to what Ensembl is, and I’d also like to define some basic bioinformatics terms.

So what is Ensembl? Ensembl is a genome browser. Genome browsers make genomic data accessible. To do this, we provide a database, which you can access directly. We also present this data in a web browser.

What do I mean by genomic data? There are many different kinds of data that we have about genomes. Firstly, we have the genome sequence itself. On top of the sequence, we can plot features such as genes and transcripts, variation between individuals, the binding patterns of proteins on the DNA which regulate genes, and comparisons between species. All of these features can also be accessed from our database.

We often talk about a genome assembly. What do we mean by that? This is to do with the way that we sequence genomes. We can’t take a 250 million base pair chromosome, start at one end, then sequence along the length of it in one go. Current technology only allows us to sequence short stretches at a time. So the way that we sequence genomes is that we cut them up into short stretches and sequence each of these. We can then put these stretches back together to form (or assemble) the full genome sequence. We call this an assembly. There are various methods of doing this; however, they all boil down to the same thing: we cut the genome into bits, sequence the bits and put them back together.

The problem with doing it this way is that it can be difficult to put the pieces back together. Some genomes are easier to assemble, because we have a template to start from. But even with a template, we don’t know where each of our short stretches came from. To put them back together we have to rely on overlaps between the sequences. Complicated algorithms are used for this. Repeats in the sequence can make this difficult to resolve.
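
To make the idea of relying on overlaps a little more concrete, here is a minimal sketch in Python. It is a toy greedy approach and not the method used for any real assembly; the function names, the example reads and the minimum overlap of 3 bases are all assumptions made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    whatever sequences are left (one sequence if everything joined up).
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads
        # Join the two reads, writing the shared bases only once.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


# Hypothetical short reads cut from one made-up 15-base sequence.
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

When a repeat appears in several places, more than one read can overlap equally well, and a greedy merge like this one has no way of knowing which join is the true one.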
This means that some regions will be incorrectly assembled. Gaps in the assembly can occur where regions were not sequenced. For popular species, such as human, mouse and zebrafish, new assemblies are released periodically. This means that the problem areas, such as gaps or poorly assembled regions, can be repaired.

Sometimes you might see scaffolds instead of, or as well as, chromosomes. This means a long stretch of genomic sequence has been assembled, but we have not been able to work out which chromosome it belongs to. For some genomes, particularly those that were sequenced a long time ago, parts of the genome were cloned into bacteria. Each of these was then sequenced and assembled, then put back together. We call these contigs.

Genome assemblies are not produced by Ensembl. We import them from other sources, such as the Genome Reference Consortium. This means that we are using the same genome assembly as other databases and projects, such as NCBI, UCSC, ENCODE and 1000 Genomes, amongst others. Ensembl provides details of the assembly used for each of its species.

Now that we have our genome assembly, we need to plot data onto it, such as genes. To do this we start with biological data. We get protein and cDNA sequences from a variety of databases, such as NCBI and UniProt. We plot these onto the genome to build our transcript models. A transcript model consists of exons and introns. Some transcripts will encode a protein, so within those exons there will be a coding sequence, plus 5’ and 3’ untranslated regions. Some transcripts are non-coding, so will only have non-coding sequence. When transcripts share exons with each other, they belong to the same gene. Genes can have many transcripts, both coding and non-coding, and single transcript genes are very rare in vertebrates. Many transcripts may exist in nature that science doesn’t know about yet. Species which have not been as extensively studied will not have as many known transcripts as others.
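
The transcript model described above can be sketched as a small data structure. This is only an illustration and not how Ensembl stores or builds its annotation: the `Transcript` class, the helper functions, the coordinates and the forward-strand assumption are all hypothetical, and "sharing an exon" is simplified here to having an exon with identical start and end positions.

```python
from dataclasses import dataclass
from typing import Optional

Exon = tuple[int, int]  # (start, end), 1-based inclusive, forward strand assumed


@dataclass(frozen=True)
class Transcript:
    name: str
    exons: tuple[Exon, ...]
    cds: Optional[Exon] = None  # genomic (start, end) of the coding sequence; None if non-coding


def region_lengths(t: Transcript) -> Optional[tuple[int, int, int]]:
    """Return (5' UTR, coding, 3' UTR) lengths in bases, or None if non-coding."""
    if t.cds is None:
        return None
    cds_start, cds_end = t.cds
    five = sum(max(0, min(e, cds_start - 1) - s + 1) for s, e in t.exons)
    coding = sum(max(0, min(e, cds_end) - max(s, cds_start) + 1) for s, e in t.exons)
    three = sum(max(0, e - max(s, cds_end + 1) + 1) for s, e in t.exons)
    return five, coding, three


def group_into_genes(transcripts: list[Transcript]) -> list[list[str]]:
    """Group transcripts that share an exon, directly or through a chain of others."""
    parent = {t.name: t.name for t in transcripts}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    first_owner: dict[Exon, str] = {}
    for t in transcripts:
        for exon in t.exons:
            if exon in first_owner:
                parent[find(t.name)] = find(first_owner[exon])
            else:
                first_owner[exon] = t.name

    genes: dict[str, list[str]] = {}
    for t in transcripts:
        genes.setdefault(find(t.name), []).append(t.name)
    return sorted(sorted(names) for names in genes.values())


# Made-up transcripts: T1 is coding, T2 is non-coding and shares exons with T1.
t1 = Transcript("T1", ((100, 200), (300, 400), (500, 600)), cds=(150, 550))
t2 = Transcript("T2", ((100, 200), (500, 600)))
t3 = Transcript("T3", ((900, 1000),))
print(region_lengths(t1))
print(group_into_genes([t1, t2, t3]))
```

In this sketch T1 and T2 end up in the same gene because they share exons, while T3 forms a single transcript gene of its own.
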
On top of the genes, we plot other kinds of data. Variation between individuals, which all comes from biological data, is plotted onto the genome. For example, we have single nucleotide polymorphisms, or SNPs, insertions, deletions and copy number variations. We can then predict which genes and transcripts the variants affect, and any changes in the protein sequence that might result.

We also plot regulation data, such as the data produced by the ENCODE project. Raw ChIP-seq data, indicating where transcription factors bind and how histones are modified, is plotted onto the genome. This is analysed in our regulatory build and used to predict the locations of regions that regulate gene expression, such as promoters, enhancers and insulators.

Analysis of the genes and sequences is also carried out. We compare the sequences of genes between and within species to produce gene trees and infer homology. We also carry out whole genome alignments to identify regions of high and low conservation. This analysis is done for all the species in our database, which covers over 60 vertebrates. Our sister site, Ensembl Genomes, carries out this analysis for other species, namely bacteria, plants, fungi, invertebrates and protists.

All of the data we produce is freely available on the internet. You can access it via our user-friendly browser, or directly from our databases via our APIs. In later tutorials, we’ll show you how you can access this data.
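
As a rough illustration of what predicting a change in the protein sequence from a SNP involves, here is a minimal Python sketch. It is not Ensembl's own prediction code: the function name, the example coding sequence and the 0-based positions are assumptions for illustration, and it only handles a single-base substitution inside a coding sequence read in frame, using the standard genetic code.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def snp_consequence(cds: str, pos: int, alt: str) -> str:
    """Classify a single-base substitution in a coding sequence.

    `cds` is the coding sequence starting at the first base of the start
    codon, `pos` is the 0-based position of the changed base and `alt` is
    the new base.
    """
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous_variant"  # same amino acid, protein unchanged
    if alt_aa == "*":
        return "stop_gained"  # new stop codon, protein truncated
    if ref_aa == "*":
        return "stop_lost"  # stop codon removed, protein extended
    return "missense_variant"  # different amino acid


# Made-up coding sequence: ATG GAA GGT TGA (Met, Glu, Gly, stop).
cds = "ATGGAAGGTTGA"
print(snp_consequence(cds, 5, "G"))  # GAA -> GAG
print(snp_consequence(cds, 3, "A"))  # GAA -> AAA
print(snp_consequence(cds, 3, "T"))  # GAA -> TAA
```

The same change in the genome can give different answers for different transcripts of a gene, because each transcript may place the base in a different codon, in a UTR, or in an intron.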