New developments in DNA sequencing technology have spurred a tremendous increase in the use of sequencing to answer fundamental questions in biology and medicine. Whole-genome sequencing (WGS) is being used to study cancer, to discover disease-causing gene variants in patient genomes, and to study human genetic diversity. Numerous WGS projects are being launched for species whose genomes have not yet been sequenced. Sequencing of messenger RNA through RNA-seq has led to an explosion of projects to characterize transcribed genes in multiple cell types and in many species, and simultaneously to discover new genes and new splice variants of known genes. These sequencing-based studies generate enormous amounts of data, which in turn require sophisticated, efficient, and innovative new algorithms that will make it possible to assemble these genomes and identify their gene content.

We propose to develop new cloud-computing-based assembly algorithms to assemble genomes from short reads generated by the latest sequencing technologies. In parallel, we will continue to improve our existing assemblers, extending them to handle new and diverse data types, including "3rd-generation" sequences. We will also reach out to outside groups to help them assemble novel species, modifying our software as needed and continuing to push the limits of assembly technology.

One of the most exciting recent technology developments in the gene finding arena is RNA-seq, a new protocol for capturing and sequencing the mRNA in a cell. This technique is well on its way to replacing both conventional EST sequencing as a method for capturing transcribed protein-coding genes, and microarray hybridization experiments for measuring transcript levels. We propose to develop new algorithms to take advantage of the flood of new RNA-seq data that has begun to appear. We have already developed two new algorithms, TopHat and Cufflinks, for RNA-seq analysis, which are the first to be able to discover previously unknown splice sites and isoforms. These tools, enhanced with new features to handle a wider variety of sequence data, form the basis of our plans to develop integrated gene finders that can identify novel genes, novel isoforms of known genes, and fusion genes, and to include these methods in a genome annotation pipeline.

Public Health Relevance

Many biomedical researchers are now using large-scale DNA sequencing to study human disease and to understand human biology. The analysis of these new types of sequence data requires highly sophisticated software that can assemble millions or billions of DNA fragments to reconstruct a genome, and that can then identify genes in the assembled sequence. This project will develop new algorithms and software that will help researchers use the latest DNA sequencing technology to sequence, assemble, and find genes in human genomes as well as the genomes of many other species.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG006677-14
Application #
8530261
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Bonazzi, Vivien
Project Start
1999-09-01
Project End
2014-08-31
Budget Start
2013-09-01
Budget End
2014-08-31
Support Year
14
Fiscal Year
2013
Total Cost
$575,512
Indirect Cost
$192,844
Name
Johns Hopkins University
Department
Genetics
Type
Schools of Medicine
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21218
Li, Gang; Hillier, LaDeana W; Grahn, Robert A et al. (2016) A High-Resolution SNP Array-Based Linkage Map Anchors a New Domestic Cat Draft Genome Assembly and Provides Detailed Patterns of Recombination. G3 (Bethesda) 6:1607-16
Canzar, Stefan; Andreotti, Sandro; Weese, David et al. (2016) CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biol 17:16
Sork, Victoria L; Fitz-Gibbon, Sorel T; Puiu, Daniela et al. (2016) First Draft Assembly and Annotation of the Genome of a California Endemic Oak Quercus lobata Née (Fagaceae). G3 (Bethesda) :
Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11:1650-67
Kim, Daehwan; Song, Li; Breitwieser, Florian P et al. (2016) Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 26:1721-1729
Vij, Shubha; Kuhl, Heiner; Kuznetsova, Inna S et al. (2016) Chromosomal-Level Assembly of the Asian Seabass Genome Using Long Sequence Reads and Multi-layered Scaffolding. PLoS Genet 12:e1005954
Stephens, Zachary D; Lee, Skylar Y; Faghri, Faraz et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13:e1002195
Smolka, Moritz; Rescheneder, Philipp; Schatz, Michael C et al. (2015) Teaser: Individualized benchmarking and optimization of read mapping results for NGS data. Genome Biol 16:235
Kim, Daehwan; Langmead, Ben; Salzberg, Steven L (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357-60
Wasik, Kaja; Gurtowski, James; Zhou, Xin et al. (2015) Genome and transcriptome of the regeneration-competent flatworm, Macrostomum lignano. Proc Natl Acad Sci U S A 112:12462-7

Showing the most recent 10 out of 64 publications