The latest generation of DNA sequencing technology has spurred a tremendous increase in the use of sequencing to answer fundamental questions in biology and medicine. Whole-genome sequencing is being used to study cancer, to study common disease-causing variants in the human genome, and to create a better picture of human diversity. Sequencing of messenger RNA through the protocol known as RNA-seq has led to an explosion of projects to characterize the transcriptome of many cell types in many species. These sequencing-based studies generate enormous amounts of data, which in turn require sophisticated, efficient computational tools to align the DNA sequence back to a reference genome and to help interpret the results. Our group has developed a suite of software tools for alignment of DNA and RNA to a reference genome. These include Bowtie, a very fast short-read alignment program; TopHat, an alignment program that aligns spliced transcripts (mRNA) across introns; and Cufflinks, a program that assembles complete transcripts, including alternative splice variants, from the alignments that TopHat produces. Our tools have been designed to handle very large next-generation sequence data sets, reducing alignment times that took multiple CPU-days with previous tools to just minutes. They also have relatively modest memory requirements, allowing them to be run on a desktop computer. For these and other reasons, these programs have become the preferred tools for numerous research groups; the Bowtie program alone has already attracted a very large user base, with over 20,000 downloads since its initial release in 2008. In this proposal, we ask for support to maintain these open-source software programs, adapt them to continuously changing DNA sequencing technology, and add new features designed to improve the alignments and to assist investigators with their analyses.

Public Health Relevance

Many biomedical researchers are now using large-scale DNA sequencing to study human disease and genetic mutations. The analysis of these new types of sequence data requires highly sophisticated software that can align billions of DNA fragments to the human genome and identify various types of genetic variations. This proposal will support a suite of software tools that will help researchers take advantage of the latest DNA sequencing technology and apply it to the study of human genetics.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG006102-03
Application #
8464182
Study Section
Special Emphasis Panel (ZRG1-BST-Z (02))
Program Officer
Bonazzi, Vivien
Project Start
2011-07-06
Project End
2014-04-30
Budget Start
2013-05-01
Budget End
2014-04-30
Support Year
3
Fiscal Year
2013
Total Cost
$660,194
Indirect Cost
$204,645
Name
Johns Hopkins University
Department
Genetics
Type
Schools of Medicine
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21218
Chelaru, Florin; Smith, Llewellyn; Goldstein, Naomi et al. (2014) Epiviz: interactive visual analytics for functional genomics data. Nat Methods 11:938-40
Salzberg, Steven L; Pertea, Mihaela; Fahrner, Jill A et al. (2014) DIAMUND: direct comparison of genomes to detect mutations. Hum Mutat 35:283-8
Ye, Chengxi; Hsiao, Chiaowen; Corrada Bravo, Héctor (2014) BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution. Bioinformatics 30:1214-9
Magoc, Tanja; Pabinger, Stephan; Canzar, Stefan et al. (2013) GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 29:1718-25
Kim, Daehwan; Pertea, Geo; Trapnell, Cole et al. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36
Magoc, Tanja; Wood, Derrick; Salzberg, Steven L (2013) EDGE-pro: Estimated Degree of Gene Expression in Prokaryotic Genomes. Evol Bioinform Online 9:127-36
Schatz, Michael C; Phillippy, Adam M; Sommer, Daniel D et al. (2013) Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform 14:213-24
Treangen, Todd J; Salzberg, Steven L (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 13:36-46
Trapnell, Cole; Roberts, Adam; Goff, Loyal et al. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7:562-78
Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22:557-67

Showing the most recent 10 out of 14 publications