Next-generation sequencing is ubiquitous in the study of biology and disease. More projects are generating vast NGS datasets and more investigators and trainees are using sophisticated software to analyze them. We propose a research program with the goal of hardening and improving the Bowtie and Bowtie 2 read alignment software tools. Bowtie 1&2 were created and published by the Principal Investigator. They have become widely used and crucial tools. Given a sequencing dataset and a reference genome, a read aligner determines where each sequencing read originated with respect to the reference. This puts reads in a common coordinate system and enables downstream analyses such as variant calling and isoform assembly. Alignment is computationally challenging, but it is also a very common need. Bowtie 1 and 2 are used at many stages in software tools for analyzing RNA, bisulfite, ChIP, metagenomics, and other sequencing data.
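To make the alignment task described above concrete, here is a minimal toy sketch in Python. It is not how Bowtie or Bowtie 2 work internally; it assumes a simple k-mer lookup table plus a mismatch count, and the function names `build_kmer_index` and `align_read` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return (offset, mismatches) pairs for candidate origins of the read.

    The first k bases of the read serve as an exact-match seed; each seed hit
    is then checked by counting mismatches over the full read length.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    # Best hit first: fewest mismatches, then leftmost offset
    return sorted(hits, key=lambda h: (h[1], h[0]))


# Toy data, invented for illustration only
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
exact_hits = align_read("TAGCCGAT", reference, index, k=4)
noisy_hits = align_read("TAGCCGTT", reference, index, k=4)
```

Each hit places a read at an offset in the reference, which is the common coordinate system that downstream analyses depend on. Production aligners use compressed full-text indexes and handle gaps, base qualities, and paired ends, none of which this sketch attempts.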
In Aim 1 we will support and improve Bowtie 1&2. We will (a) add an application programming interface (API) so other tools can more easily access Bowtie functionality, (b) enable Bowtie 1&2 to obtain data directly from public archives, (c) extend Bowtie 2 to work efficiently with long, error-prone reads, such as those from Nanopore sequencers, and (d) improve Bowtie 1&2 to make better use of the many processor cores available on current and upcoming computer architectures.
In Aim 2, we address a pressing interpretability issue: reference bias. We propose a mix of short- and long-term solutions; we will (a) compile and disseminate major-allele reference sequences, and software for creating new ones, (b) create novel, efficient methods for just-in-time editing of the reference genome and associated index, (c) make Bowtie 1&2 compatible with graph-shaped genomes like GRCh38, and (d) investigate a novel minimizer-based graph indexing strategy that avoids blow-up.
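As an illustration of the major-allele reference idea in Aim 2(a), the following Python sketch replaces each reference base with the most frequent allele at that site. It is a simplified assumption of how such a reference could be built, restricted to single-nucleotide variants; the function name `major_allele_reference` and the frequency table are hypothetical and are not taken from the project's software.

```python
def major_allele_reference(reference, allele_freqs):
    """Return a copy of the reference with the major allele at each variant site.

    allele_freqs maps a 0-based position to a dict of {base: population frequency}.
    Only single-base substitutions are handled in this sketch.
    """
    bases = list(reference)
    for pos, freqs in allele_freqs.items():
        if bases[pos] not in freqs:
            raise ValueError(f"reference base at {pos} missing from allele table")
        # Pick the allele with the highest frequency at this site
        bases[pos] = max(freqs, key=freqs.get)
    return "".join(bases)


# Toy data, invented for illustration only
reference = "ACGTACGT"
allele_freqs = {
    2: {"G": 0.30, "T": 0.70},  # reference base G is the minor allele here
    5: {"C": 0.90, "A": 0.10},  # reference base C is already the major allele
}
major_ref = major_allele_reference(reference, allele_freqs)
```

Because the edited sequence has the same length as the original, coordinates are unchanged. Insertions and deletions would shift coordinates and require an additional mapping, which this sketch leaves out.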
In Aim 3, we will create a new software system called Rail that enables scaling of Bowtie-based analyses so researchers and trainees can analyze very large public datasets in a manner that is fault tolerant, secure, reproducible, and inexpensive. Rail's design is based on our previous work on the scalable, cloud-enabled tools Crossbow, Myrna and Rail-RNA. We will work with leaders of the Galaxy project to incorporate Rail so that Galaxy users can more easily build analysis tools with highly scalable components.
Aims 2 and 3 are motivated and validated by scientific collaborations addressing difficult analysis problems in (a) allele-specific expression, (b) methylation analysis of inbred strains and crosses, (c) mosaic variant detection, and (d) detection of expressed repetitive elements. Bowtie 1&2 are open source software; all software and data generated by the project will be freely available under an open source license.

Public Health Relevance

A large and increasing number of researchers use DNA sequencing to study genetic diseases, cancer, and other aspects of human biology. Analyzing DNA sequencing data requires sophisticated software that is capable of piecing together puzzles made of billions of fragments of DNA. This proposal supports and improves the popular Bowtie suite of software tools, which will allow researchers to leverage the latest DNA sequencing technology and apply it to the study of human disease.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM118568-02
Application #: 9247225
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Ravichandran, Veerasamy
Project Start: 2016-04-01
Project End: 2021-03-31
Budget Start: 2017-04-01
Budget End: 2018-03-31
Support Year: 2
Fiscal Year: 2017
Total Cost: $377,550
Indirect Cost: $140,085

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Engineering
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21205

Langmead, Ben; Nellore, Abhinav (2018) Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 19:325
Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav et al. (2018) Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples. Bioinformatics 34:114-116
Langmead, Ben; Nellore, Abhinav (2018) Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 19:208-219
Breitwieser, F P; Baker, D N; Salzberg, S L (2018) KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19:198
Parsana, Princy; Amend, Sarah R; Hernandez, James et al. (2017) Identifying global expression patterns and key regulators in epithelial to mesenchymal transition through multi-study integration. BMC Cancer 17:447
Langmead, Ben (2017) A tandem simulation framework for predicting mapping quality. Genome Biol 18:152
Nellore, Abhinav; Jaffe, Andrew E; Fortin, Jean-Philippe et al. (2016) Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol 17:266
Nellore, Abhinav; Wilks, Christopher; Hansen, Kasper D et al. (2016) Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics 32:2551-3
Pritt, Jacob; Langmead, Ben (2016) Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res 44:e133