The Cold Spring Harbor Laboratory is awarded a CAREER grant for the PI Michael Schatz to develop new computational methods for processing DNA sequencing data from the latest high-throughput sequencing technologies. DNA sequencing costs and throughput have improved by orders of magnitude over the last three decades, yet many questions remain unanswered, especially because of the short sequence lengths currently available. Emerging "third generation" sequencing technologies from Pacific Biosciences, Moleculo, Oxford Nanopore, and other companies are poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. The sequence lengths with these technologies can reach tens of thousands of nucleotides; however, few or no analysis packages are capable of handling these types of sequence data. This project will overcome these limitations by developing several novel analysis algorithms designed specifically for long-read, single-molecule sequencing and its associated complex error models. The outcomes will help answer biological questions of profound significance to all of society, such as: What were the genetic implications of the domestication of rice? What genes and regulatory elements give rise to the incredible regenerative properties of the flatworm? And what can be learned from assembling reference genomes of sugarcane and pineapple to support breeding more robust plant crops and biofuels?

Specific objectives of the research include working towards assembling entire plant and animal chromosomes into complete, haplotype-phased sequences; identifying fusion genes and complex alternative splicing patterns responsible for diseases or adaptability; and searching for structural variations associated with improved crop yield or human diseases such as cancer or autism. Even if some future technology is capable of directly reading entire transcripts or entire genomes, this research will remain necessary to examine the higher-level relationships across populations of genomes and to measure the dynamics of gene expression and splicing.

This project will tightly integrate research and education, promoting opportunities from the high school through postdoctoral levels with the development of new course materials, hands-on research opportunities, and one-on-one mentoring experiences. This effort will specifically target the intersection of computer science and biology, promoting interdisciplinary education and ensuring that the next generation of scientists is ready for the complexities of quantitative and digital biology. To engage the widest possible audience, Dr. Schatz will also develop novel online teaching materials made available through a yearly bioinformatics contest. The first round of the contest reached nearly 1000 students around the world and at all levels of education, engaging students far beyond our physical limits. The products of the research will be made available as open-source software and installed into the graphical iPlant Discovery Environment, making them easily accessible to the large community of plant researchers around the world.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 1627442
Program Officer: Peter McCartney
Project Start:
Project End:
Budget Start: 2016-01-01
Budget End: 2021-05-31
Support Year:
Fiscal Year: 2016
Total Cost: $1,196,855
Indirect Cost:
Name: Johns Hopkins University
Department:
Type:
DUNS #:
City: Baltimore
State: MD
Country: United States
Zip Code: 21218