Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery

Salzberg, Steven

Abstract

Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways. First, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis, our lab has previously developed two widely used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with the two most widely used human gene databases than either of them agrees with the other, and we will improve it further so that it can provide a basis for biomedical research for many years to come.

Public Health Relevance

Many biomedical researchers use high-throughput DNA sequencing to study human disease and biology, and to do so they rely heavily on the human genome sequence and its annotated genes. The analysis of these very large, complex sequence data sets requires highly sophisticated, efficient software that can assemble DNA fragments to reconstruct a genome or assemble RNA sequences to identify genes and gene isoforms. This project will develop new algorithms, software, and data that will provide researchers with the necessary tools to address relevant biological questions in humans and a wide range of other species.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006677-20
Application #: 10147905
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sofia, Heidi J

Project Start: 1999-09-01
Project End: 2025-02-28
Budget Start: 2021-03-01
Budget End: 2022-02-28
Support Year: 20
Fiscal Year: 2021

Institution

Name: Johns Hopkins University
Department: Genetics
Type: Schools of Medicine
DUNS #: 001910777

City: Baltimore
State: MD
Country: United States
Zip Code: 21218

Related projects


NIH 2021 R01 HG	Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery Salzberg, Steven L. / Johns Hopkins University
NIH 2020 R01 HG	Computational Methods for Genome Assembly, Transcript Assembly, and Gene Discovery Salzberg, Steven L. / Johns Hopkins University
NIH 2018 R01 HG	Computational Methods for Genome Assembly, Transcript Assembly, and Variant Discovery Salzberg, Steven L. / Johns Hopkins University
NIH 2017 R01 HG	Computational Methods for Genome Assembly, Transcript Assembly, and Variant Discovery Salzberg, Steven L. / Johns Hopkins University
NIH 2016 R01 HG	Computational Methods for Genome Assembly, Transcript Assembly, and Variant Discovery Salzberg, Steven L. / Johns Hopkins University
NIH 2015 R01 HG	Computational Methods for Genome Assembly, Transcript Assembly, and Variant Discovery Salzberg, Steven L. / Johns Hopkins University	$600,000
NIH 2013 R01 HG	Computational Gene Modeling and Genome Sequence Assembly Salzberg, Steven L. / Johns Hopkins University	$575,512
NIH 2012 R01 HG	Computational Gene Modeling and Genome Sequence Assembly Salzberg, Steven L. / Johns Hopkins University	$595,227
NIH 2011 R01 HG	Computational Gene Modeling and Genome Sequence Assembly Salzberg, Steven L. / Johns Hopkins University	$712,968

Publications

Simner, Patricia J; Antar, Annukka A R; Hao, Stephanie et al. (2018) Antibiotic pressure on the acquisition and loss of antibiotic resistance genes in Klebsiella pneumoniae. J Antimicrob Chemother :

Gómez-Romero, Laura; Palacios-Flores, Kim; Reyes, José et al. (2018) Precise detection of de novo single nucleotide variants in human genomes. Proc Natl Acad Sci U S A 115:5516-5521

Li, Zhigang; Breitwieser, Florian P; Lu, Jennifer et al. (2018) Identifying Corneal Infections in Formalin-Fixed Specimens Using Next Generation Sequencing. Invest Ophthalmol Vis Sci 59:280-288

Nattestad, Maria; Goodwin, Sara; Ng, Karen et al. (2018) Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res 28:1126-1135

Salzberg, Steven L (2018) Open questions: How many genes do we have? BMC Biol 16:94

Fang, Han; Huang, Yi-Fei; Radhakrishnan, Aditya et al. (2018) Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution. Cell Syst 6:180-191.e4

Pertea, Mihaela; Shumate, Alaina; Pertea, Geo et al. (2018) CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 19:208

El-Diwany, Ramy; Soliman, Mary; Sugawara, Sho et al. (2018) CMPK2 and BCL-G are associated with type 1 interferon-induced HIV restriction in humans. Sci Adv 4:eaat0843

Breitwieser, F P; Baker, D N; Salzberg, S L (2018) KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19:198

Sedlazeck, Fritz J; Rescheneder, Philipp; Smolka, Moritz et al. (2018) Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15:461-468

Showing the most recent 10 out of 88 publications
