Improvements in sequencing technology have spurred a tremendous increase in the use of sequencing to answer a wide range of questions in biology and medicine. Thousands of new human genomes are being sequenced each year in efforts to track down the genetic causes of human diseases. In parallel with this increase in whole-genome sequencing, RNA sequencing has also exploded in popularity, due to its power to characterize gene expression in a multitude of cell types and conditions, and to its potential to discover new genes and new splice variants. These enormous data sets require highly efficient and accurate computational methods for analysis, and they also present opportunities for discovery. Furthermore, to properly analyze the many diverse humans being sequenced, we can no longer afford to rely on a single reference genome that is missing much of the variation found in the human population, and that makes it very difficult to analyze sequences that do not match the reference. We propose to address these challenges in four specific ways. First, we will develop new and improved assembly algorithms that take advantage of the latest long-read technology to create genomes of unprecedented contiguity and completeness. This effort will include a method for creating haplotype-resolved assemblies when sequences from both parents are available, and a method to use an existing reference genome to create a highly contiguous assembly at minimal cost. Second, we will apply these methods to build new human reference genomes, assembled and annotated as thoroughly as the current human reference. These genomes, each representing a single individual, can then serve as the basis for many future studies of the relevant populations. Third, in the area of RNA-seq analysis, our lab has previously developed two widely used spliced aligners, TopHat and HISAT, and two equally popular transcriptome assemblers, Cufflinks and StringTie, which now have many thousands of users. We will extend and improve the StringTie algorithm, augmenting its novel network flow algorithm with de novo assembly plus new alignment methods to handle long reads and to improve its construction and quantification of transcripts. Fourth, we propose to systematically assemble thousands of RNA-seq experiments to discover new genes and to re-build the human gene catalog, an effort that could have a major impact on a broad array of human genetic and genomic studies. We have recently released our first version of this effort as CHESS, a human gene catalog built from a massive RNA-seq database that represents a comprehensive, reproducible, and open method for annotating the human genome. The CHESS database already agrees more closely with the two most widely used human gene databases than either of them agrees with the other, and we will improve it further so that it can provide a basis for biomedical research for many years to come.

Public Health Relevance

Many biomedical researchers use high-throughput DNA sequencing to study human disease and biology, and to do so they rely heavily on the human genome sequence and its annotated genes. The analysis of these very large, complex sequence data sets requires highly sophisticated, efficient software that can assemble DNA fragments to reconstruct a genome or assemble RNA sequences to identify genes and gene isoforms. This project will develop new algorithms, software, and data that will provide researchers with the necessary tools to address relevant biological questions in humans and a wide range of other species.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
2R01HG006677-19
Application #
9965200
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Sofia, Heidi J
Project Start
1999-09-01
Project End
2025-02-28
Budget Start
2020-05-01
Budget End
2021-02-28
Support Year
19
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Johns Hopkins University
Department
Genetics
Type
Schools of Medicine
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21205
Salzberg, Steven L (2018) Open questions: How many genes do we have? BMC Biol 16:94
Fang, Han; Huang, Yi-Fei; Radhakrishnan, Aditya et al. (2018) Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution. Cell Syst 6:180-191.e4
Pertea, Mihaela; Shumate, Alaina; Pertea, Geo et al. (2018) CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 19:208
El-Diwany, Ramy; Soliman, Mary; Sugawara, Sho et al. (2018) CMPK2 and BCL-G are associated with type 1 interferon-induced HIV restriction in humans. Sci Adv 4:eaat0843
Breitwieser, F P; Baker, D N; Salzberg, S L (2018) KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19:198
Sedlazeck, Fritz J; Rescheneder, Philipp; Smolka, Moritz et al. (2018) Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15:461-468
Simner, Patricia J; Antar, Annukka A R; Hao, Stephanie et al. (2018) Antibiotic pressure on the acquisition and loss of antibiotic resistance genes in Klebsiella pneumoniae. J Antimicrob Chemother :
Gómez-Romero, Laura; Palacios-Flores, Kim; Reyes, José et al. (2018) Precise detection of de novo single nucleotide variants in human genomes. Proc Natl Acad Sci U S A 115:5516-5521
Li, Zhigang; Breitwieser, Florian P; Lu, Jennifer et al. (2018) Identifying Corneal Infections in Formalin-Fixed Specimens Using Next Generation Sequencing. Invest Ophthalmol Vis Sci 59:280-288
Nattestad, Maria; Goodwin, Sara; Ng, Karen et al. (2018) Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res 28:1126-1135

Showing the most recent 10 out of 88 publications