Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. Emerging long-read, single-molecule sequencing technologies can now produce reads tens of kilobases in length, which promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. However, the higher error rate and special characteristics of this new, single-molecule data type have required new bioinformatic approaches. Developing computational methods that can assemble this data into complete genomes is a primary focus of this project.
A highlight of the past year was the successful completion and publication of the first human genome assembly derived entirely from nanopore sequencing data (ref 1). Four members of the Genome Informatics Section (GIS) contributed to this international collaboration. The GIS led the assembly, phasing, and MHC analyses for this paper, which were key results demonstrating the promise of this new technology. Importantly, this paper introduced a protocol for generating ultra-long nanopore reads exceeding 100 kb in length. This unique data type allows, for the first time, the confident assembly and phasing of highly repetitive regions of the genome. The assembly of these ultra-long reads into a highly contiguous assembly was facilitated by the Canu software previously developed by the GIS and described in the 2017 annual report. In the year since its publication, Canu has been cited over 300 times and has become the de facto standard for nanopore sequence assembly.
An increasing focus of the GIS is the assembly of haplotype-resolved diploid genomes, and this year's work led to a major breakthrough. Rather than assemble a single, mosaic representation of a diploid genome, we sought to completely assemble both parental haplotypes. To accomplish this, we developed a new method called trio binning that simplifies haplotype assembly by resolving allelic variation prior to assembly.
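The idea of resolving allelic variation before assembly can be illustrated with a toy sketch. This is not the GIS implementation. It assumes that k-mers (substrings of length k) seen in only one parent's short reads can act as haplotype markers, and it ignores sequencing errors, reverse complements, and any normalization for read length. The function names (`kmers`, `parent_kmers`, `bin_read`, `trio_bin`) and the very small default k are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_kmers(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer observed in one parent's short reads."""
    seen: Set[str] = set()
    for read in reads:
        seen |= kmers(read, k)
    return seen


def bin_read(read: str, only_mat: Set[str], only_pat: Set[str], k: int) -> str:
    """Assign one offspring read to a bin by counting parent-specific k-mers."""
    read_kmers = kmers(read, k)
    mat_hits = len(read_kmers & only_mat)
    pat_hits = len(read_kmers & only_pat)
    if mat_hits > pat_hits:
        return "maternal"
    if pat_hits > mat_hits:
        return "paternal"
    # No marker k-mers, or a tie: the read cannot be assigned with confidence.
    return "unassigned"


def trio_bin(
    offspring_reads: Iterable[str],
    maternal_reads: Iterable[str],
    paternal_reads: Iterable[str],
    k: int = 5,  # assumption: toy value; real data would need a much larger k
) -> Dict[str, List[str]]:
    """Partition offspring reads into maternal, paternal, and unassigned bins."""
    mat = parent_kmers(maternal_reads, k)
    pat = parent_kmers(paternal_reads, k)
    # Keep only k-mers that distinguish one parent from the other.
    only_mat, only_pat = mat - pat, pat - mat
    bins: Dict[str, List[str]] = {"maternal": [], "paternal": [], "unassigned": []}
    for read in offspring_reads:
        bins[bin_read(read, only_mat, only_pat, k)].append(read)
    return bins
```

In this sketch, each of the `maternal` and `paternal` bins would then be passed to an assembler separately, while reads in `unassigned` carry no distinguishing k-mers and would need a separate policy.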
Trio binning uses short reads from the two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. Application of this approach to both a human and a bovine genome identified complex structural variants missed by alternative approaches and completely assembled the parental haplotypes at 99.998% accuracy. A manuscript describing this work is currently in press, and in the coming years we plan to apply this approach to multiple human genomes in order to construct a reference database of complete human haplotypes that accurately captures complex structural variation.
In addition to developing new methods for genome assembly, we seek to directly apply these methods to solve problems of significant public health impact. One particular focus is finishing the remaining gaps in the human genome. A primary source of unfinished sequence in the human genome is the arrays of hundreds of ribosomal RNA genes, whose sequence has been poorly defined to date. This year the GIS published a collection of high-quality rDNA units from a single human chromosome 22 using sequencing and assembly approaches previously developed by the section (ref 2). These sequences revealed substantial variation among the rDNA units, enabling future exploration of its functional significance. This year the GIS also took delivery of a GridION nanopore sequencing device and began generating high-coverage, ultra-long sequencing reads for a human reference cell line. This data will be used for the assembly of a gapless telomere-to-telomere human genome in the coming years, which will include rDNA and centromeric repeats.
Beyond human, the GIS is in the process of generating high-quality reference genomes for all (approximately 250) extant vertebrate orders in collaboration with the Vertebrate Genomes Project.
This year marked the successful assembly, led by the GIS, of the first 16 vertebrate genomes for the Vertebrate Genomes Project. In addition to vertebrate genomes, the GIS completed assemblies for three mosquito genomes: Aedes albopictus (ref 3), Aedes aegypti, and Anopheles funestus. The first has been published, the manuscript for the second is in press, and the third is being drafted. These mosquitoes are important disease vectors (e.g., malaria, Zika, West Nile, yellow fever, dengue, and chikungunya), and their highly repetitive genomes present a difficult assembly challenge that will drive further improvements to our methods.
Lastly, the GIS continues to develop new methods for the alignment and real-time analysis of long-read sequencing data. This year we published an approximate sequence aligner called MashMap (ref 4), and its successor, MashMap2, is in press. These algorithms allow both the rapid alignment of long sequence reads to reference databases and the fast construction of homology maps for comparing whole genomes or identifying segmental duplications within genomes. We also published an update to the popular Nucmer software (ref 5), which has been cited over 3,000 times since I initially developed it in the early 2000s.
In addition to the five publications referenced above, this year the section submitted seven pre-prints to bioRxiv that are currently working their way through peer review. These pre-prints include some of the genomes mentioned above (mosquito, bovine) as well as new methods for sequence alignment, comparative genomics, assembly, and metagenomics.