Next-generation sequencing is ubiquitous in the study of biology and disease. More projects are generating vast NGS datasets and more investigators and trainees are using sophisticated software to analyze them. We propose a research program with the goal of hardening and improving the Bowtie and Bowtie 2 read alignment software tools. Bowtie 1&2 were created and published by the Principal Investigator. They have become widely used and crucial tools. Given a sequencing dataset and a reference genome, a read aligner determines where each sequencing read originated with respect to the reference. This puts reads in a common coordinate system and enables downstream analyses such as variant calling and isoform assembly. Alignment is computationally challenging, but it is also a very common need. Bowtie 1 and 2 are used at many stages in software tools for analyzing RNA, bisulfite, ChIP, metagenomics, and other sequencing data.
In Aim 1 we will support and improve Bowtie 1&2. We will (a) add an application programming interface (API) so other tools can more easily access Bowtie functionality, (b) enable Bowtie 1&2 to obtain data directly from public archives, (c) extend Bowtie 2 to work efficiently with long, error-prone reads, such as those from Nanopore sequencers, and (d) improve Bowtie 1&2 to make better use of the many processor cores available on current and upcoming computer architectures.
In Aim 2, we address a pressing interpretability issue: reference bias. We propose a mix of short- and long-term solutions; we will (a) compile and disseminate major-allele reference sequences, and software for creating new ones, (b) create novel, efficient methods for just-in-time editing of the reference genome and associated index, (c) make Bowtie 1&2 compatible with graph-shaped genomes like GRCh38, and (d) investigate a novel minimizer-based graph indexing strategy that avoids blow-up.
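As a minimal sketch of the major-allele idea in Aim 2(a), the Python below substitutes the most frequent allele at each single-nucleotide variant site of a reference string. The function name `major_allele_reference`, the in-memory `snps` mapping, and the tie-breaking rule are assumptions made for illustration only; they are not part of Bowtie or of the software proposed here, and a real tool would read variant frequencies from a variant-call file and handle insertions and deletions.

```python
from typing import Dict


def major_allele_reference(reference: str, snps: Dict[int, Dict[str, float]]) -> str:
    """Return a copy of `reference` carrying the most frequent allele at each SNP.

    `snps` (hypothetical input format) maps a 0-based position to a dict of
    {single-base allele: population frequency}. On a frequency tie, the
    allele already present in the reference is kept.
    """
    seq = list(reference)
    for pos, freqs in snps.items():
        if not 0 <= pos < len(seq):
            raise ValueError(f"position {pos} is outside the reference")
        if not freqs:
            continue  # no frequency information: leave the site unchanged
        ref_base = seq[pos]
        # Pick the highest frequency; prefer the reference base on ties.
        major = max(freqs, key=lambda allele: (freqs[allele], allele == ref_base))
        if len(major) != 1:
            raise ValueError("this sketch only handles single-base substitutions")
        seq[pos] = major
    return "".join(seq)


if __name__ == "__main__":
    ref = "ACGTACGT"
    variants = {
        1: {"C": 0.3, "T": 0.7},  # alternate allele T is the major allele
        4: {"A": 0.9, "G": 0.1},  # reference allele A is already major
    }
    print(major_allele_reference(ref, variants))
```

Because only substitutions are applied, the edited sequence keeps the coordinate system of the original reference, so alignments against it remain directly comparable.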
In Aim 3, we will create a new software system called Rail that enables scaling of Bowtie-based analyses so researchers and trainees can analyze very large public datasets in a manner that is fault tolerant, secure, reproducible, and inexpensive. Rail's design is based on our previous work on the scalable, cloud-enabled tools Crossbow, Myrna and Rail-RNA. We will work with leaders of the Galaxy project to incorporate Rail so that Galaxy users can more easily build analysis tools with highly scalable components.
Aims 2 and 3 are motivated and validated by scientific collaborations addressing difficult analysis problems in (a) allele-specific expression, (b) methylation analysis of inbred strains and crosses, (c) mosaic variant detection, and (d) detection of expressed repetitive elements. Bowtie 1&2 are open source software; all software and data generated by the project will be freely available under an open source license.
A large and increasing number of researchers use DNA sequencing to study genetic diseases, cancer, and other aspects of human biology. Analyzing DNA sequencing data requires sophisticated software that is capable of piecing together puzzles made of billions of fragments of DNA. This proposal supports and improves the popular Bowtie suite of software tools, which will allow researchers to leverage the latest DNA sequencing technology and apply it to the study of human disease.