Single-molecule sequence assembly and analysis

Phillippy, Adam

Abstract

Like a jigsaw puzzle with large pieces, a genome sequenced with long reads is easier to assemble. Emerging long-read, single-molecule sequencing can now produce reads tens of kilobases in length, which promises to improve the quality of reference databases and facilitate new studies of chromosomal structure and variation. However, the higher error rate and special characteristics of this new data type have required new bioinformatic approaches. Developing computational methods that can assemble these data into complete genomes is a primary focus of this project.

This year we published the Canu assembly software, which is specifically designed for long, single-molecule sequencing reads. Using Canu, we assembled several human genomes directly from both single-molecule and nanopore sequencing reads. Some of these assemblies were evaluated against the current human reference genome (GRCh38) and published in collaboration with the Genome Reference Consortium. In addition, we assembled the first human genomes reconstructed solely from nanopore sequencing data, and we are evaluating the potential of ultra-long nanopore sequencing to finish the human genome. To further improve long-read assemblies, we evaluated the integration of multiple technologies for scaffolding and chromosome binning, such as Hi-C chromatin interaction mapping and optical mapping. Together with the USDA, we combined these technologies to publish a new goat reference assembly that is the most complete mammalian de novo assembly ever produced, with scaffolds rivaling the human reference in completeness and continuity.

In addition to developing new methods for genome assembly, this project seeks to apply these methods directly to problems of significant public health impact. In particular, our focus is on finishing the remaining gaps in the human genome.
Beyond human, we aim to complete a high-quality reference genome for every vertebrate order, of which there are 200. This work is in collaboration with the Vertebrate Genomes Project, which is providing the sequencing data, while our group leads the genome assembly effort. In addition to vertebrate genomes, which are useful comparative models for the human genome, we are sequencing and assembling genomes for the Anopheles and Aedes mosquitoes, which are important vectors of diseases such as malaria, Zika, West Nile, yellow fever, dengue, and chikungunya. Because these genomes are highly repetitive, they present a difficult assembly challenge that will drive improvements to our methods. This year we finished the genome of the C6/36 Aedes cell line, which is commonly used for viral culture. Work continues on assembly of the wild mosquitoes.

For the human genome, the largest uncharted regions of the current reference are the telomeric, centromeric, and ribosomal DNA (rDNA) regions. While each region is a challenging assembly problem, our current focus is on completing the rDNA gene clusters because of their importance to cell function: the ribosome plays a central role in protein biosynthesis. Hundreds of these rDNA genes are arranged in tandem arrays, also known as nucleolus organizer regions, across the short arms of the acrocentric autosomes. Despite this importance, little is known about intra-genome and population polymorphism of the rDNAs because they are too repetitive to be sequenced and assembled with past technologies. We have now finished a study and prepared a manuscript, in collaboration with NCI and NIA investigators, that describes single-molecule and nanopore-based sequencing of these regions. This analysis has revealed previously undiscovered variation between rDNA gene copies.

In the area of real-time diagnostics, we have developed and published MashMap, a new approximate algorithm for mapping long reads to large reference databases.
Alignment-based seed-and-extend methods demonstrate good accuracy but scale poorly, while faster alignment-free methods typically trade precision for efficiency. MashMap combines a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we developed a mathematical framework that defines the types of mapping targets uncovered, and we established probabilistic estimates of p-value and sensitivity. We further demonstrated the scalability of our method by mapping noisy single-molecule reads to the complete NCBI RefSeq database.

Other minor contributions for the year include genome sequences for Bacillus strains isolated from the International Space Station, a review article on the genome assembly problem, and an analysis of the natural killer complex in mammals.
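
The two ideas the abstract names for MashMap, minimizer sampling and MinHash-based identity estimation, can be illustrated with a minimal Python sketch. This is not MashMap's implementation: the function names, the BLAKE2 hash, the example sequence, and the parameter values are all assumptions chosen for illustration, and reverse-complement (canonical k-mer) handling is omitted. The Jaccard-to-identity conversion uses the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), with identity taken as roughly 1 - D.

```python
import hashlib
import math


def kmer_hash(kmer: str) -> int:
    """Stable 64-bit hash of a k-mer (illustrative choice, not MashMap's hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> set[int]:
    """Hashes of the (w, k)-minimizers: the smallest k-mer hash in each
    window of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        # Sequence too short for a full window: keep the single smallest hash.
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def minhash_jaccard(a: set[int], b: set[int], s: int) -> float:
    """Estimate Jaccard similarity from the s smallest hashes of the union."""
    union_sketch = sorted(a | b)[:s]
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in a and h in b)
    return shared / len(union_sketch)


def mash_identity(jaccard: float, k: int) -> float:
    """Convert a Jaccard estimate to an approximate identity via Mash distance."""
    if jaccard <= 0.0:
        return 0.0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    # Hypothetical toy data: a short reference and a read copied from it.
    reference = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATTACA"
    read = reference[5:30]
    k, w, s = 5, 4, 16  # illustrative parameters only
    j = minhash_jaccard(minimizers(read, k, w), minimizers(reference, k, w), s)
    print(f"Jaccard estimate: {j:.3f}, approximate identity: {mash_identity(j, k):.3f}")
```

Comparing a read against a whole reference, as in the usage example, understates similarity because the reference contains many minimizers the read lacks; a mapper would instead compare the read against read-length windows of the reference. The sketch only shows how a sampled set of k-mer hashes can yield an identity estimate without computing an alignment.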

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIAHG200398-02
Application #: 9567425

Support Year: 2
Fiscal Year: 2017

Institution

Name: Human Genome Research

Related projects


NIH 2019 ZIA HG	Single-molecule sequence assembly and analysis Phillippy, Adam / National Human Genome Research Institute
NIH 2018 ZIA HG	Single-molecule sequence assembly and analysis Phillippy, Adam / Human Genome Research
NIH 2017 ZIA HG	Single-molecule sequence assembly and analysis Phillippy, Adam / Human Genome Research
NIH 2016 ZIA HG	Single-molecule sequence assembly and analysis Phillippy, Adam / Human Genome Research

Publications

Jain, Chirag; Dilthey, Alexander; Koren, Sergey et al. (2018) A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. J Comput Biol 25:766-779

Kim, Jung-Hyun; Dilthey, Alexander T; Nagaraja, Ramaiah et al. (2018) Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing. Nucleic Acids Res 46:6712-6725

Miller, Jason R; Koren, Sergey; Dilley, Kari A et al. (2018) Analysis of the Aedes albopictus C6/36 genome provides insight into cell line utility for viral propagation. Gigascience 7:1-13

Marçais, Guillaume; Delcher, Arthur L; Phillippy, Adam M et al. (2018) MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14:e1005944

Jain, Miten; Koren, Sergey; Miga, Karen H et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36:338-345

Bickhart, Derek M; Rosen, Benjamin D; Koren, Sergey et al. (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49:643-650

Phillippy, Adam M (2017) New advances in sequence assembly. Genome Res 27:xi-xiii

Schneider, Valerie A; Graves-Lindsay, Tina; Howe, Kerstin et al. (2017) Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 27:849-864

Koren, Sergey; Walenz, Brian P; Berlin, Konstantin et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27:722-736

Venkateswaran, Kasthuri; Checinska Sielaff, Aleksandra; Ratnayake, Shashikala et al. (2017) Draft Genome Sequences from a Novel Clade of Bacillus cereus Sensu Lato Strains, Isolated from the International Space Station. Genome Announc 5:

Showing the most recent 10 out of 18 publications
