The pace of genome sequencing promises to keep accelerating thanks to next-generation sequencing (NGS) technologies, which have dramatically reduced the cost of generating large numbers of sequence reads. These technologies are also in a state of almost continuous change, and predicting the state of the art one year from now, let alone five, is challenging. As capabilities change, the number of strategies for de novo sequencing of plant genomes expands. At the same time, access to NGS platforms is ever expanding, and generating a genome sequence is no longer solely the domain of the large genome center. Consequently, more scientists face a greater breadth of possible strategies for sequencing their genomes of interest, with very little in the way of tools or data to evaluate the appropriateness of a given strategy. The result is either expensive data collection to evaluate strategies or, more often, the selection of a strategy with very little data to support the choice. This proposal seeks to leverage existing plant genome sequences to develop a web-based plant genome assembly simulation platform. For a given genome, a range of sequencing strategies will be simulated and the resulting data will be assembled with a number of assembly algorithms. The resulting assemblies will be evaluated with newly developed metrics that seek to capture the heterogeneity of uses for genome sequences and to place the results in a context that maximizes their value to the user. In addition, the platform's relevance to a larger number of plant species will be improved by introducing the concept of a virtual genome. Prototype tools will be developed that enable the construction of such a virtual genome from key genome characteristics (e.g., size, GC content, nature of repeats) that affect genome assembly. The simulation platform will provide the user community with:
- Pre-computed results from common sequencing strategies on a range of plant genomes;
- Customizable cost models to compare the costs of different strategies;
- Standardized datasets to compare existing assemblers, and to benchmark new ones;
- Ability to create artificial genomes through a "recipe" of defined changes to a template genome, and then test sequencing strategies on them; and,
- Ability to use the system to develop and test the feasibility of new sequencing (and physical mapping) strategies prior to testing in the lab.
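As an illustration of what a customizable cost model might look like, the following is a minimal sketch and not the platform's actual implementation. It assumes that sequencing cost is a fixed per-project cost plus a per-read cost, and that the number of reads needed is the target coverage times the genome size divided by the read length. The names `Strategy`, `reads_needed`, `strategy_cost`, and `rank_strategies`, and all prices, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A hypothetical sequencing strategy; all fields are illustrative."""
    name: str
    read_length_bp: int
    cost_per_million_reads: float  # assumed pricing unit
    fixed_cost: float              # e.g. library preparation (assumed)


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Reads required so that total sequenced bases = coverage * genome size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def strategy_cost(strategy: Strategy, genome_size_bp: int, coverage: float) -> float:
    """Fixed cost plus per-read cost for the requested coverage."""
    n_reads = reads_needed(genome_size_bp, coverage, strategy.read_length_bp)
    return strategy.fixed_cost + n_reads / 1_000_000 * strategy.cost_per_million_reads


def rank_strategies(strategies, genome_size_bp: int, coverage: float):
    """Return (name, cost) pairs sorted from cheapest to most expensive."""
    costs = [(s.name, strategy_cost(s, genome_size_bp, coverage)) for s in strategies]
    return sorted(costs, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up prices, for illustration only.
    candidates = [
        Strategy("short-reads", read_length_bp=100,
                 cost_per_million_reads=50.0, fixed_cost=200.0),
        Strategy("long-reads", read_length_bp=1000,
                 cost_per_million_reads=900.0, fixed_cost=500.0),
    ]
    for name, cost in rank_strategies(candidates, genome_size_bp=120_000_000, coverage=30):
        print(f"{name}: {cost:.2f}")
```

A user would substitute current prices and the size of the genome of interest. Cost alone says nothing about assembly quality, so a ranking like this is only meaningful alongside the simulated assembly results.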
Broader Impacts: The web platform leverages existing investments in plant genomes in order to improve the quality and cost-effectiveness of future plant genome sequences. Although no explicit outreach activities are included in this one-year pilot project, the web platform has the potential to educate visitors about the range of sequencing strategies and to be a forum in which new strategies can be discussed. A continued reduction in the cost, and increase in the quality, of plant genome sequences will have a significant, albeit indirect, impact on some of society's most pressing problems, such as food security, climate change, and biofuel development. Resources and tools generated by this project will be made freely available to the public through a web site (www.plantagora.org).