A major highlight of the past year was the completion of the first human chromosome ever to be sequenced and assembled from end to end without gaps. The project built on nanopore sequencing data generated and described in the 2018 annual report. This year, in collaboration primarily with Karen Miga at UC Santa Cruz, we led the complete assembly of a human X chromosome using the ultra-long read sequencing technique reported in 2018 and the Canu software reported in 2017. Achieving a complete X chromosome required further refinement of our assembly approach and the development of novel tools for improving the accuracy of highly repetitive regions of the genome, such as the centromeric satellite arrays. One of these new tools is a highly accurate single-molecule sequencing strategy, to whose development we contributed (ref 12). Reconstruction of the first complete human chromosome is a milestone achievement, and in the coming year we will work toward publishing this result and completing additional chromosomes. Both the data and the methods developed for this project put us on a path to close all remaining gaps in the human reference genome in the coming years. Moving beyond a single reference genome and toward a pan-genome for all humans, we began a new project with collaborators at UC Santa Cruz to sequence multiple diploid human genomes using the trio-binning approach first described in the 2018 annual report and published in this reporting year (ref 9). This year we selected our initial 10 human samples and began sequencing on both PacBio and Nanopore platforms. A preprint describing the initial nanopore sequencing and assembly of these samples was released, and work continues to integrate the additional data types and bring these genomes up to a reference-grade quality standard.
In the coming years we plan to use these data to construct a reference database of complete human haplotypes that accurately captures complex structural variation from across the human population. Beyond the human genome, the Genome Informatics Section (GIS) is also generating high-quality reference genomes for all (approximately 250) extant vertebrate orders in collaboration with the Vertebrate Genomes Project. Building upon the 16 vertebrate genomes announced last year, we have now completed nearly 100 vertebrate genomes via this project and related efforts. These genomes include those of the Canada lynx, platypus, kakapo, yak, cow, whale shark, goldfish (ref 2), pig, and many others. They will enable powerful comparative genomics studies that will help reveal the function of vertebrate genomes and enhance our understanding of the human genome. In addition to vertebrates, we have completed the genomes of several invertebrates with significant public health impact, including the mosquitoes Aedes aegypti (ref 10), Aedes albopictus (in draft), and Anopheles funestus (ref 5), which are vectors of important diseases such as malaria, Zika, West Nile, yellow fever, dengue, and chikungunya. Because they are highly repetitive, mosquito genomes present a difficult assembly challenge that will drive further improvements to our methods. Lastly, the GIS continues to develop new methods for the alignment and real-time analysis of long-read sequencing data. This year we published an analysis of public microbial genome databases (ref 11), a new tool for accurate HLA typing from long reads (ref 4), and tools for aligning and assigning metagenomic reads to their source genomes (refs 3 and 6), and we assisted the USDA in the analysis of the complete cow rumen metagenome (ref 1).
We also applied our tools to the study of microbial speciation and produced evidence that a bacterial species boundary does exist and can be detected and measured using genomic tools we developed (ref 7). In addition to the 12 papers we formally published this year, the section has submitted 12 preprints to bioRxiv that are currently undergoing peer review. These preprints include some of the genomes mentioned above as well as new methods for comparative genomics, genome assembly, and metagenomics.