Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Numerous DNA sequencing projects are being launched for species whose genomes have not yet been sequenced. Sequencing of messenger RNA has led to an explosion of RNA-seq projects to characterize gene expression in multiple cell types and conditions, and simultaneously to discover new genes and new splice variants of known genes. These sequencing-based studies generate enormous amounts of data, which in turn require sophisticated, efficient, and innovative new algorithms to assemble these genomes and identify their gene content. We propose to develop new computational methods for three specific problems.
First, we will develop new assembly algorithms, building on existing methods wherever possible, to assemble genomes from reads generated by the latest sequencing technologies, including emerging single-molecule technology. In parallel, we will continue to improve our existing assemblers, extending them to handle new and diverse data types, and will evaluate multiple other assembly systems to determine which methods work best for different whole-genome shotgun (WGS) projects. We will also continue to collaborate with outside groups to help them assemble particularly challenging genomes.
Second, we will develop new methods for discovering sequence variants, using a combination of alignment and assembly-based algorithms. These include a new method that finds variants without using alignment to the reference genome, dramatically reducing false positive rates. The method uses very fast alignment algorithms to achieve significant gains in computational speed. We also propose a method that uses localized assembly to detect insertions and deletions, which most current methods detect poorly.
Third, RNA-seq, a protocol for sequencing the RNA in a cell, is one of the most exciting technology developments in genome analysis of the past five years. Our group has previously developed two widely used programs for RNA-seq analysis, TopHat and Cufflinks, which were the first able to discover previously unknown splice sites and isoforms. Here we propose a new transcript assembly algorithm, StringTie, which combines a novel network flow algorithm, a method imported from mathematical optimization theory, with de novo assembly to assemble and quantify transcripts. StringTie is the first transcript assembler to use both assembly and reference-based alignment together. One key advantage of StringTie's algorithm is that it assembles and quantifies gene transcripts simultaneously. Compared to Cufflinks and all other competing methods, StringTie produces more complete reconstructions of genes and splice variants, and more accurate estimates of expression levels, on both real and simulated data.
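To make the network flow idea concrete, the following is a minimal sketch of a maximum-flow computation on a toy splice graph. It is an illustration under stated assumptions, not StringTie's implementation: the graph, the exon names, and the read-support numbers are hypothetical, and the textbook Edmonds-Karp algorithm stands in for whatever flow formulation the proposed method uses. Nodes represent exons, edge capacities represent the number of reads supporting each exon-to-exon connection, and the maximum flow from source to sink gives an upper bound on the total read support that can be routed through complete transcripts.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow.

    capacity: dict mapping (u, v) edge tuples to non-negative capacities.
    Returns the value of the maximum flow from source to sink.
    """
    residual = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), cap in capacity.items():
        residual[(u, v)] += cap
        neighbors[u].add(v)
        neighbors[v].add(u)  # reverse edge, needed for the residual graph

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[(u, v)])
            v = u

        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy splice graph: exon2 is skipped in one isoform.
# Capacities are made-up read counts supporting each connection.
toy_splice_graph = {
    ("source", "exon1"): 30,
    ("exon1", "exon2"): 20,
    ("exon2", "exon3"): 20,
    ("exon1", "exon3"): 10,  # exon-skipping junction
    ("exon3", "sink"): 30,
}

print(max_flow(toy_splice_graph, "source", "sink"))  # prints 30
```

In this toy graph the flow splits into 20 units through the full three-exon path and 10 units through the exon-skipping path, which is one way to see how a flow formulation can tie path reconstruction and abundance estimation together in a single computation.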

Public Health Relevance

Many biomedical researchers now use high-throughput DNA sequencing to study human disease and biology. The analysis of very large, complex sequence data sets requires highly sophisticated, efficient software that can assemble DNA fragments to reconstruct a genome, assemble RNA sequences to identify genes and gene isoforms, or identify genetic variants associated with disease. This project will develop new algorithms and software that will help researchers address these problems in sequence data from humans and a wide range of other species.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006677-16
Application #: 9120904
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sofia, Heidi J
Project Start: 1999-09-01
Project End: 2019-06-30
Budget Start: 2016-07-01
Budget End: 2017-06-30
Support Year: 16
Fiscal Year: 2016
Total Cost:
Indirect Cost:
Name: Johns Hopkins University
Department: Genetics
Type: Schools of Medicine
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21205
Simner, Patricia J; Antar, Annukka A R; Hao, Stephanie et al. (2018) Antibiotic pressure on the acquisition and loss of antibiotic resistance genes in Klebsiella pneumoniae. J Antimicrob Chemother :
Gómez-Romero, Laura; Palacios-Flores, Kim; Reyes, José et al. (2018) Precise detection of de novo single nucleotide variants in human genomes. Proc Natl Acad Sci U S A 115:5516-5521
Li, Zhigang; Breitwieser, Florian P; Lu, Jennifer et al. (2018) Identifying Corneal Infections in Formalin-Fixed Specimens Using Next Generation Sequencing. Invest Ophthalmol Vis Sci 59:280-288
Nattestad, Maria; Goodwin, Sara; Ng, Karen et al. (2018) Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res 28:1126-1135
Salzberg, Steven L (2018) Open questions: How many genes do we have? BMC Biol 16:94
Fang, Han; Huang, Yi-Fei; Radhakrishnan, Aditya et al. (2018) Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution. Cell Syst 6:180-191.e4
Pertea, Mihaela; Shumate, Alaina; Pertea, Geo et al. (2018) CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 19:208
El-Diwany, Ramy; Soliman, Mary; Sugawara, Sho et al. (2018) CMPK2 and BCL-G are associated with type 1 interferon-induced HIV restriction in humans. Sci Adv 4:eaat0843
Breitwieser, F P; Baker, D N; Salzberg, S L (2018) KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19:198
Sedlazeck, Fritz J; Rescheneder, Philipp; Smolka, Moritz et al. (2018) Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15:461-468

Showing the most recent 10 out of 88 publications