Next-generation sequencing (NGS) is ubiquitous in the study of biology and disease. More projects are generating vast NGS datasets, and more investigators and trainees are using sophisticated software to analyze them. We propose a research program with the goal of hardening and improving the Bowtie and Bowtie 2 read alignment software tools. Bowtie 1&2 were created and published by the Principal Investigator and have become widely used, crucial tools. Given a sequencing dataset and a reference genome, a read aligner determines where each sequencing read originated with respect to the reference. This puts reads in a common coordinate system and enables downstream analyses such as variant calling and isoform assembly. Alignment is computationally challenging, but it is also a very common need. Bowtie 1&2 are used at many stages in software tools for analyzing RNA, bisulfite, ChIP, metagenomics, and other sequencing data.
In Aim 1 we will support and improve Bowtie 1&2. We will (a) add an application programming interface (API) so other tools can more easily access Bowtie functionality, (b) enable Bowtie 1&2 to obtain data directly from public archives, (c) extend Bowtie 2 to work efficiently with long, error-prone reads, such as those from Nanopore sequencers, and (d) improve Bowtie 1&2 to make better use of the many processor cores available on current and upcoming computer architectures.
In Aim 2, we address a pressing interpretability issue: reference bias. We propose a mix of short- and long-term solutions; we will (a) compile and disseminate major-allele reference sequences, and software for creating new ones, (b) create novel, efficient methods for just-in-time editing of the reference genome and associated index, (c) make Bowtie 1&2 compatible with graph-shaped genomes like GRCh38, and (d) investigate a novel minimizer-based graph indexing strategy that avoids blow-up.
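To illustrate the idea behind item (a), the following is a minimal sketch of how a major-allele reference could be built, assuming the usual meaning of the term: at each known variant site, the reference carries whichever allele is most common in the population. The `Snv` record, the `major_allele_reference` function, and the restriction to single-nucleotide variants are hypothetical simplifications for illustration; this is not the project's software, which would also need to handle indels and coordinate shifts.

```python
from typing import Iterable, NamedTuple


class Snv(NamedTuple):
    """Hypothetical record for a single-nucleotide variant."""
    pos: int         # 0-based position in the reference sequence
    ref: str         # allele present in the standard reference
    alt: str         # alternate allele
    alt_freq: float  # population frequency of the alternate allele


def major_allele_reference(reference: str, snvs: Iterable[Snv]) -> str:
    """Return a copy of `reference` carrying the major allele at each SNV site.

    Assumption: an alternate allele is "major" when its frequency exceeds 0.5.
    """
    seq = list(reference)
    for v in snvs:
        # Guard against variant records that do not match the reference.
        if reference[v.pos] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        if v.alt_freq > 0.5:
            seq[v.pos] = v.alt
    return "".join(seq)


if __name__ == "__main__":
    variants = [Snv(1, "C", "T", 0.8), Snv(4, "A", "G", 0.2)]
    # Only the first variant is swapped in; the second alt allele is minor.
    print(major_allele_reference("ACGTACGT", variants))
```

Because only substitutions are applied in this sketch, the output has the same length and coordinates as the input, so alignments against it remain directly comparable to alignments against the standard reference.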
In Aim 3, we will create a new software system called Rail that enables scaling of Bowtie-based analyses so researchers and trainees can analyze very large public datasets in a manner that is fault tolerant, secure, reproducible, and inexpensive. Rail's design is based on our previous work on the scalable, cloud-enabled tools Crossbow, Myrna and Rail-RNA. We will work with leaders of the Galaxy project to incorporate Rail so that Galaxy users can more easily build analysis tools with highly scalable components.
Aims 2 and 3 are motivated and validated by scientific collaborations addressing difficult analysis problems in (a) allele-specific expression, (b) methylation analysis of inbred strains and crosses, (c) mosaic variant detection, and (d) detection of expressed repetitive elements. Bowtie 1&2 are open source software; all software and data generated by the project will be freely available under an open source license.
A large and increasing number of researchers use DNA sequencing to study genetic diseases, cancer, and other aspects of human biology. Analyzing DNA sequencing data requires sophisticated software that is capable of piecing together puzzles made of billions of fragments of DNA. This proposal supports and improves the popular Bowtie suite of software tools, which will allow researchers to leverage the latest DNA sequencing technology and apply it to the study of human disease.