Hardening and Scaling Core Genomics Software

Langmead, Benjamin

Abstract

Next-generation sequencing is ubiquitous in the study of biology and disease. More projects are generating vast NGS datasets and more investigators and trainees are using sophisticated software to analyze them. We propose a research program with the goal of hardening and improving the Bowtie and Bowtie 2 read alignment software tools. Bowtie 1&2 were created and published by the Principal Investigator. They have become widely used and crucial tools. Given a sequencing dataset and a reference genome, a read aligner determines where each sequencing read originated with respect to the reference. This puts reads in a common coordinate system and enables downstream analyses such as variant calling and isoform assembly. Alignment is computationally challenging, but it is also a very common need. Bowtie 1 and 2 are used at many stages in software tools for analyzing RNA, bisulfite, ChIP, metagenomics, and other sequencing data.
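As a toy illustration of what a read aligner computes, the sketch below maps a read to a reference by looking up exact k-mer seeds and then counting mismatches at each candidate offset. It is not how Bowtie or Bowtie 2 work internally; the function names, the k-mer hash index, and the `max_mismatches` parameter are assumptions made for this example only.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (offset, mismatches) pairs for ungapped placements of the read.

    Candidate offsets come from exact k-mer seed hits; each candidate is then
    verified by counting mismatches over the full read length.
    """
    candidates = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            start = pos - j
            # Keep only placements that lie fully inside the reference.
            if 0 <= start and start + len(read) <= len(reference):
                candidates.add(start)

    results = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            results.append((start, mismatches))
    return results


reference = "ACGTACGTTAGCCGATAGGCT"
index = build_kmer_index(reference, k=4)
exact_hits = align_read("TAGCCGAT", reference, index, k=4)
one_mismatch_hits = align_read("TAGCCGTT", reference, index, k=4)
```

Each reported offset places the read in the reference coordinate system, which is the property downstream analyses rely on.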
In Aim 1 we will support and improve Bowtie 1&2. We will (a) add an application programming interface (API) so other tools can more easily access Bowtie functionality, (b) enable Bowtie 1&2 to obtain data directly from public archives, (c) extend Bowtie 2 to work efficiently with long, error-prone reads, such as those from Nanopore sequencers, and (d) improve Bowtie 1&2 to make better use of the many processor cores available on current and upcoming computer architectures.
In Aim 2, we address a pressing interpretability issue: reference bias. We propose a mix of short- and long-term solutions; we will (a) compile and disseminate major-allele reference sequences, and software for creating new ones, (b) create novel, efficient methods for just-in-time editing of the reference genome and associated index, (c) make Bowtie 1&2 compatible with graph-shaped genomes like GRCh38, and (d) investigate a novel minimizer-based graph indexing strategy that avoids blow-up.
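The idea behind a major-allele reference in Aim 2(a) can be sketched as follows: wherever the alternate allele is more common in the population than the reference allele, substitute it into the reference sequence. This is a minimal sketch assuming biallelic SNPs only, one record per position, 0-based coordinates, and a hypothetical tuple layout of `(position, ref_allele, alt_allele, alt_frequency)`; it is not the project's actual software.

```python
def major_allele_reference(reference, snps):
    """Return a copy of the reference with the more frequent allele at each SNP.

    `snps` is an iterable of (position, ref_allele, alt_allele, alt_frequency)
    tuples. The layout is an assumption for this example: positions are
    0-based and each position appears at most once.
    """
    seq = list(reference)
    for pos, ref_allele, alt_allele, alt_freq in snps:
        # Guard against records that do not match the reference sequence.
        if reference[pos] != ref_allele:
            raise ValueError(f"reference mismatch at position {pos}")
        # Substitute only when the alternate allele is the major allele.
        if alt_freq > 0.5:
            seq[pos] = alt_allele
    return "".join(seq)


snps = [
    (1, "C", "T", 0.8),  # alternate allele is the major allele: substituted
    (3, "T", "G", 0.2),  # reference allele is already the major allele: kept
]
major_ref = major_allele_reference("ACGTAC", snps)
```

Because only single-base substitutions are applied, the result has the same length and coordinates as the input reference.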
In Aim 3, we will create a new software system called Rail that enables scaling of Bowtie-based analyses so researchers and trainees can analyze very large public datasets in a manner that is fault tolerant, secure, reproducible, and inexpensive. Rail's design is based on our previous work on the scalable, cloud-enabled tools Crossbow, Myrna and Rail-RNA. We will work with leaders of the Galaxy project to incorporate Rail so that Galaxy users can more easily build analysis tools with highly scalable components.
Aims 2 and 3 are motivated and validated by scientific collaborations addressing difficult analysis problems in (a) allele-specific expression, (b) methylation analysis of inbred strains and crosses, (c) mosaic variant detection, and (d) detection of expressed repetitive elements. Bowtie 1&2 are open source software; all software and data generated by the project will be freely available under an open source license.

Public Health Relevance

A large and increasing number of researchers use DNA sequencing to study genetic diseases, cancer, and other aspects of human biology. Analyzing DNA sequencing data requires sophisticated software that is capable of piecing together puzzles made of billions of fragments of DNA. This proposal supports and improves the popular Bowtie suite of software tools, which will allow researchers to leverage the latest DNA sequencing technology and apply it to the study of human disease.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM118568-04
Application #: 9668155
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Ravichandran, Veerasamy

Project Start: 2016-04-01
Project End: 2021-03-31
Budget Start: 2019-04-01
Budget End: 2020-03-31
Support Year: 4
Fiscal Year: 2019
Total Cost
Indirect Cost

Institution

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Biomed Engr/Col Engr/Engr Sta
DUNS #: 001910777

City: Baltimore
State: MD
Country: United States
Zip Code: 21205

Related projects


NIH 2020 R01 GM	Hardening and Scaling Core Genomics Software Langmead, Benjamin Thomas / Johns Hopkins University
NIH 2019 R01 GM	Hardening and Scaling Core Genomics Software Langmead, Benjamin Thomas / Johns Hopkins University
NIH 2018 R01 GM	Hardening and Scaling Core Genomics Software Langmead, Benjamin Thomas / Johns Hopkins University
NIH 2017 R01 GM	Hardening and Scaling Core Genomics Software Langmead, Benjamin Thomas / Johns Hopkins University	$377,550
NIH 2016 R01 GM	Hardening and Scaling Core Genomics Software Langmead, Benjamin Thomas / Johns Hopkins University

Publications

Breitwieser, F P; Baker, D N; Salzberg, S L (2018) KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19:198

Langmead, Ben; Nellore, Abhinav (2018) Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 19:325

Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav et al. (2018) Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples. Bioinformatics 34:114-116

Langmead, Ben; Nellore, Abhinav (2018) Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 19:208-219

Parsana, Princy; Amend, Sarah R; Hernandez, James et al. (2017) Identifying global expression patterns and key regulators in epithelial to mesenchymal transition through multi-study integration. BMC Cancer 17:447

Langmead, Ben (2017) A tandem simulation framework for predicting mapping quality. Genome Biol 18:152

Pritt, Jacob; Langmead, Ben (2016) Boiler: lossy compression of RNA-seq alignments using coverage vectors. Nucleic Acids Res 44:e133

Nellore, Abhinav; Jaffe, Andrew E; Fortin, Jean-Philippe et al. (2016) Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol 17:266

Nellore, Abhinav; Wilks, Christopher; Hansen, Kasper D et al. (2016) Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics 32:2551-3
