The latest generation of DNA sequencing technology has spurred a tremendous increase in the use of sequencing to answer fundamental questions in biology and medicine. Whole-genome sequencing is being used to study cancer, to identify common disease-causing variants in the human genome, and to create a better picture of human diversity. Sequencing of messenger RNA through the protocol known as RNA-seq has led to an explosion of projects to characterize the transcriptomes of many cell types in many species. These sequencing-based studies generate enormous amounts of data, which in turn require sophisticated, efficient computational tools to align the DNA sequences back to a reference genome and to help interpret the results. Our group has developed a suite of software tools for alignment of DNA and RNA to a reference genome. These include Bowtie, a very fast short-read alignment program; TopHat, an alignment program that aligns spliced transcripts (mRNA) across introns; and Cufflinks, a program that assembles complete transcripts, including alternative splice variants, from the alignments that TopHat produces. Our tools have been designed to handle very large next-generation sequence data sets, reducing alignment times that took multiple CPU-days with previous tools to just minutes. They also have relatively modest memory requirements, allowing them to be run on a desktop computer. For these and other reasons, these programs have become the preferred tools for numerous research groups; the Bowtie program alone has already attracted a very large user base, with over 20,000 downloads since its initial release in 2008. In this proposal, we ask for support to maintain these open-source software programs, adapt them to continuously changing DNA sequencing technology, and add new features designed to improve the alignments and to assist investigators with their analyses.

Public Health Relevance

Many biomedical researchers are now using large-scale DNA sequencing to study human disease and genetic mutations. The analysis of these new types of sequence data requires highly sophisticated software that can align billions of DNA fragments to the human genome and identify various types of genetic variations. This proposal will support a suite of software tools that will help researchers take advantage of the latest DNA sequencing technology and apply it to the study of human genetics.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG006102-01
Application #: 8068060
Study Section: Special Emphasis Panel (ZRG1-BST-Z (02))
Program Officer: Good, Peter J
Project Start: 2011-07-06
Project End: 2014-04-30
Budget Start: 2011-07-06
Budget End: 2012-04-30
Support Year: 1
Fiscal Year: 2011
Total Cost: $707,654
Indirect Cost:

Name: Johns Hopkins University
Department: Genetics
Type: Schools of Medicine
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21218

Zimin, Aleksey V; Puiu, Daniela; Hall, Richard et al. (2017) The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum. Gigascience 6:1-7
Canzar, Stefan; Salzberg, Steven L (2017) Short Read Mapping: An Algorithmic Tour. Proc IEEE Inst Electr Electron Eng 105:436-458
Dinalankara, Wikum; Bravo, Héctor Corrada (2015) Gene Expression Signatures Based on Variability can Robustly Predict Tumor Progression and Prognosis. Cancer Inform 14:71-81
Chelaru, Florin; Corrada Bravo, Héctor (2015) Epiviz: a view inside the design of an integrated visual analysis software for genomics. BMC Bioinformatics 16 Suppl 11:S4
Pertea, Mihaela; Pertea, Geo M; Antonescu, Corina M et al. (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33:290-5
Kim, Daehwan; Langmead, Ben; Salzberg, Steven L (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357-60
Chelaru, Florin; Smith, Llewellyn; Goldstein, Naomi et al. (2014) Epiviz: interactive visual analytics for functional genomics data. Nat Methods 11:938-40
Ye, Chengxi; Hsiao, Chiaowen; Corrada Bravo, Héctor (2014) BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution. Bioinformatics 30:1214-9
Salzberg, Steven L; Pertea, Mihaela; Fahrner, Jill A et al. (2014) DIAMUND: direct comparison of genomes to detect mutations. Hum Mutat 35:283-8
Schatz, Michael C; Langmead, Ben (2013) The DNA Data Deluge: Fast, efficient genome sequencing machines are spewing out more data than geneticists can analyze. IEEE Spectr 50:26-33

Showing the most recent 10 out of 42 publications