A major highlight of the past year was the successful completion of the first human chromosome ever to be sequenced and assembled from end to end without gaps. The project built on nanopore sequencing data that were generated and described in the 2018 annual report. This year, in collaboration primarily with Karen Miga at UC Santa Cruz, we led the complete assembly of a human X chromosome using the ultra-long read sequencing technique reported in 2018 and the Canu software reported in 2017. Achieving a complete X chromosome required further refinement of our assembly approach and the development of novel tools for improving the accuracy of highly repetitive regions of the genome, such as the centromeric satellite arrays. One of these new tools is a highly accurate single-molecule sequencing strategy, to whose development we contributed (ref 12). Successful reconstruction of the first complete human chromosome is a milestone achievement, and in the coming year we will work toward publishing this result and completing additional chromosomes. Both the data and the methods developed for this project put us on a path to closing all remaining gaps in the human reference genome in the coming years.
Moving beyond a single reference genome and toward a pan-genome for all humans, we began a new project with collaborators at UC Santa Cruz to sequence multiple diploid human genomes using the trio-binning approach first described in the 2018 annual report and published in this reporting year (ref 9). This year we selected our initial 10 human samples and began sequencing on both PacBio and Nanopore platforms. A preprint describing the initial nanopore sequencing and assembly of these samples was released, and work continues to integrate the additional data types and bring these genomes up to a reference-grade quality standard. In the coming years we plan to use these data to construct a reference database of complete human haplotypes that accurately captures complex structural variation from across the human population.
Beyond the human genome, the Genome Informatics Section (GIS) is also generating high-quality reference genomes for all (approximately 250) extant vertebrate orders in collaboration with the Vertebrate Genomes Project. Building upon the 16 vertebrate genomes announced last year, we have now completed nearly 100 vertebrate genomes via this project and related efforts. These genomes include those of the Canada lynx, platypus, kakapo, yak, cow, whale shark, goldfish (ref 2), pig, and many others. They will enable powerful comparative genomics studies that will help reveal the function of vertebrate genomes and enhance our understanding of the human genome. In addition to vertebrates, we have completed the genomes of several invertebrates with significant public health impact, including the mosquitoes Aedes aegypti (ref 10), Aedes albopictus (in draft), and Anopheles funestus (ref 5), which are vectors of important diseases such as malaria, Zika, West Nile, yellow fever, dengue, and chikungunya. Because they are highly repetitive, mosquito genomes present a difficult assembly challenge that will drive further improvements to our methods.
Lastly, the GIS continues to develop new methods for the alignment and real-time analysis of long-read sequencing data. This year we published an analysis of public microbial genome databases (ref 11), a new tool for accurate HLA typing from long reads (ref 4), and tools for aligning and assigning metagenomic reads to their source genomes (refs 3 and 6), and we assisted the USDA in the analysis of the complete cow rumen metagenome (ref 1). We also applied our tools to the study of microbial speciation and produced evidence that a bacterial species boundary does exist and can be detected and measured using genomic tools we developed (ref 7).
In addition to the 12 papers we formally published this year, the section has submitted 12 preprints to bioRxiv that are currently undergoing peer review. These preprints include some of the genomes mentioned above as well as new methods for comparative genomics, genome assembly, and metagenomics.

Support Year: 4
Fiscal Year: 2019
Name: National Human Genome Research Institute
Jain, Chirag; Dilthey, Alexander; Koren, Sergey et al. (2018) A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases. J Comput Biol 25:766-779
Kim, Jung-Hyun; Dilthey, Alexander T; Nagaraja, Ramaiah et al. (2018) Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing. Nucleic Acids Res 46:6712-6725
Miller, Jason R; Koren, Sergey; Dilley, Kari A et al. (2018) Analysis of the Aedes albopictus C6/36 genome provides insight into cell line utility for viral propagation. Gigascience 7:1-13
Marçais, Guillaume; Delcher, Arthur L; Phillippy, Adam M et al. (2018) MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 14:e1005944
Jain, Miten; Koren, Sergey; Miga, Karen H et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36:338-345
Bickhart, Derek M; Rosen, Benjamin D; Koren, Sergey et al. (2017) Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet 49:643-650
Phillippy, Adam M (2017) New advances in sequence assembly. Genome Res 27:xi-xiii
Schneider, Valerie A; Graves-Lindsay, Tina; Howe, Kerstin et al. (2017) Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 27:849-864
Koren, Sergey; Walenz, Brian P; Berlin, Konstantin et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27:722-736
Venkateswaran, Kasthuri; Checinska Sielaff, Aleksandra; Ratnayake, Shashikala et al. (2017) Draft Genome Sequences from a Novel Clade of Bacillus cereus Sensu Lato Strains, Isolated from the International Space Station. Genome Announc 5:

Showing the most recent 10 out of 18 publications