Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. Emerging long-read, single-molecule sequencing can now produce reads tens of kilobases in length, and this promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. However, the increased error rate and special characteristics of this new, single-molecule data type have required new bioinformatic approaches. Developing new computational methods that can assemble these data into complete genomes is a primary focus of this project. This year we released the first version of our Canu assembly software, which is specifically designed for long, single-molecule sequencing reads. Using Canu, we have now assembled five publicly available human genomes and made these assemblies freely available. These assemblies are the most continuous human assemblies generated to date, and were evaluated and compared to the current human reference genome in collaboration with the Genome Reference Consortium (GRC). Additionally, with the USDA, we evaluated the integration of multiple technologies for improved assembly, including single-molecule sequencing, Hi-C chromatin interaction mapping, and optical mapping. Using the goat genome as a demonstration, the combination of these technologies produced a new goat reference assembly that is the most complete mammalian de novo assembly ever produced, with scaffolds rivaling the human reference in terms of completeness and continuity. Preprints have been posted for all three of these studies (Canu, GRC, and goat). In addition to developing new methods for genome assembly, this project seeks to directly apply these methods to solve problems of significant public health impact. We are currently involved in over 20 genome projects of various species.
However, our involvement in many of these projects is limited, and we are most focused on applications of assembly with a direct link to human health. This includes finishing the remaining gaps in the human reference genome (specifically, the rDNA gene clusters) and generating high-quality reference genomes for the Anopheles and Aedes mosquitoes, which are important disease vectors (e.g. malaria, Zika, West Nile, yellow fever, dengue, and chikungunya). Due to their highly repetitive nature, these sequences present a difficult assembly challenge that will drive improvements to our methods. Additionally, finishing the human genome has obvious benefit to the study of genetic disorders, and finishing the mosquito genomes will benefit the study of host-pathogen-vector interactions in infectious disease. We are tackling these problems with both PacBio and Oxford Nanopore sequencing. For the human genome, the largest uncharted regions of the current reference are the telomeric, centromeric, and ribosomal DNA (rDNA) regions. While each region is a challenging assembly problem, our current focus is on completion of the rDNA gene clusters, due to their importance to cell function: the ribosome plays a central role in protein biosynthesis. Hundreds of these rDNA genes are arranged in tandem arrays, also known as nucleolus organizer regions, across the short arms of the acrocentric autosomes. Despite their importance, little is known about intra-genome and population polymorphism of the rDNAs because they are too repetitive to be sequenced and assembled with past technologies. In a pilot study this year, we applied a new approach to clone, sequence, and assemble six rDNA BACs, including a reference rDNA BAC (CH507-528H12), which we were able to reconstruct de novo with >99% concordance to the reference. Further validation suggests that our assembly is indeed more accurate than the manually constructed reference BAC.
In the five additional BACs, we have identified multiple structural and site variations, which we are in the process of validating. For mosquito genomes, we published a multi-year effort to assemble the entirely heterochromatic Y chromosome of Anopheles gambiae. Although current read lengths are insufficient to completely reconstruct the full chromosome, we successfully corrected and assembled a collection of Y-linked BACs and whole-genome shotgun sequencing reads, which revealed the rapid evolution and vast population diversity of the An. gambiae Y chromosome. We discovered that the entire chromosome is composed of a few massively amplified repeats, arranged in large tandem arrays that undergo rapid sequence turnover within and between species. Characterization of these chromosomes is important not only for the study of mosquito biology and host-vector-pathogen relationships, but also for informing potential mosquito control strategies based on sex ratio distortion. We also performed trial assemblies of An. funestus, Ae. aegypti, and Ae. albopictus genomes using partially collected PacBio data, and are continuing these projects using a combination of technologies. In the area of real-time diagnostics, we have developed and published Mash, a new approximation algorithm for computing mutation distances between many whole genomes or metagenomes. Mash achieves unparalleled scalability via an adaptation of the MinHash technique, originally developed for clustering the World Wide Web and widely used for document similarity tests. MinHash is a dimensionality reduction technique that reduces a large document to a representative sketch that can be thousands of times smaller than the original. Our key innovation was to adapt this technique to biological sequences and derive an estimate of the sequence mutation rate directly from the compressed sketches. This enables massive all-pairs comparisons, which are typically prohibitive due to the number of comparisons required.
To demonstrate, we used Mash to successfully sketch all genomes in NCBI RefSeq, and rapidly searched this database using both assembled genomes and streaming nanopore data to identify unknown pathogens.
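The sketch-and-compare idea described above can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the Mash implementation: the function names, the BLAKE2b hash, and the default parameters are chosen here for convenience, and details such as reverse-complement handling are omitted. It builds a bottom-s MinHash sketch from the k-mers of each sequence, estimates the Jaccard index j from two sketches, and converts it to a mutation distance with the published Mash formula D = -(1/k) ln(2j / (1 + j)).

```python
import hashlib
import math


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    hashes = {hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the union form a random sample of the
    combined k-mer set; the fraction present in both sketches
    estimates the Jaccard index.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate j into an estimated mutation rate."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    if j >= 1.0:
        return 0.0  # identical sketches
    return -math.log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(5000))
    # Hypothetical test data: substitute every 100th base (about 1% divergence).
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    mutated = "".join(
        swap[base] if i % 100 == 0 else base for i, base in enumerate(genome)
    )
    j = jaccard_estimate(sketch(genome, s=200), sketch(mutated, s=200), s=200)
    print("Jaccard estimate:", j)
    print("Estimated mutation distance:", mash_distance(j))
```

Because each genome is reduced to at most s integers, an all-pairs comparison only touches the sketches rather than the full sequences, which is what makes comparisons at database scale tractable.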

Human Genome Research
Jain, Chirag; Dilthey, Alexander; Koren, Sergey et al. (2018) A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. J Comput Biol 25:766-779
Kim, Jung-Hyun; Dilthey, Alexander T; Nagaraja, Ramaiah et al. (2018) Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing. Nucleic Acids Res 46:6712-6725
Miller, Jason R; Koren, Sergey; Dilley, Kari A et al. (2018) Analysis of the Aedes albopictus C6/36 genome provides insight into cell line utility for viral propagation. Gigascience 7:1-13
Marçais, Guillaume; Delcher, Arthur L; Phillippy, Adam M et al. (2018) MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14:e1005944
Jain, Miten; Koren, Sergey; Miga, Karen H et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36:338-345
Bickhart, Derek M; Rosen, Benjamin D; Koren, Sergey et al. (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49:643-650
Phillippy, Adam M (2017) New advances in sequence assembly. Genome Res 27:xi-xiii
Schneider, Valerie A; Graves-Lindsay, Tina; Howe, Kerstin et al. (2017) Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 27:849-864
Koren, Sergey; Walenz, Brian P; Berlin, Konstantin et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27:722-736
Venkateswaran, Kasthuri; Checinska Sielaff, Aleksandra; Ratnayake, Shashikala et al. (2017) Draft Genome Sequences from a Novel Clade of Bacillus cereus Sensu Lato Strains, Isolated from the International Space Station. Genome Announc 5:

Showing the most recent 10 out of 18 publications