Over the past fifteen years, the field of genomics has generated thousands of plant and animal genomes, each of which is an important resource for a broad range of scientific studies. During the same period, DNA sequencing has become thousands of times faster and cheaper, which has been the main driver of this flood of new genomes. Despite its efficiency, the new technology has significant limitations, one of which is that only a short stretch of DNA, a few hundred bases in length, can be sequenced at a time. These "reads" then need to be assembled to reconstruct a genome, which might contain billions of bases spread across dozens of chromosomes. One consequence is that most genomes today, except for a very small number of intensively studied model organisms, exist in highly fragmented form, often comprising tens of thousands of fragments. The vast majority of plant and animal species have remained in this "draft" format ever since their initial publication. This project will create new genome assembly software that will make it possible for scientists to use new sequencing technology to fix these draft genomes, not only correcting errors but also stitching together many of the small fragments to create much better assemblies for a very broad range of species. These improved genomes will, in turn, provide the foundation for more accurate gene catalogs, better analyses of genome structure and evolution, and a deeper understanding of the biology of genomes.

Genome assembly has long been an extremely challenging computational task, due to the complex repetitive nature of many genomes and to the large scale of the data generated in a sequencing effort. Next-generation sequencing (NGS) has dramatically expanded the scale of the assembly problem, with raw data sets increasing from millions of reads in the mid-2000s to billions of reads in recent years. The recent introduction of very-long-read technologies with high error rates has made assembly even more challenging, but the great length of these sequences offers the possibility of much more contiguous assemblies. This project aims to develop new technology for genome assembly and to improve the genomes of a broad range of plant species that are used by scientists across many fields of research. The investigators will pursue these improvements through two related aims. First, they will develop new and improved assembly algorithms that combine Illumina sequences with third-generation sequencing technologies from Oxford Nanopore, Pacific Biosciences, and others. The new algorithms will allow investigators to combine low-cost short reads with higher-cost, high-error long reads to produce dramatically better genome assemblies. Second, they will develop a new algorithm to construct an assembly using the existing genome of a closely related species, without the need to generate new data. The investigators will demonstrate their methods by re-assembling multiple plant genomes from publicly available data. All assemblies will be released rapidly to the community, and all software will be free and open source.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Integrative Organismal Systems (IOS)
Application #: 1744309
Program Officer: Gerald Schoenknecht
Budget Start: 2018-12-15
Budget End: 2021-11-30
Fiscal Year: 2017
Total Cost: $774,675

Institution Name: Johns Hopkins University
City: Baltimore
State: MD
Country: United States
Zip Code: 21218