Next-generation sequencing is ubiquitous in the study of biology and disease. The first step when analyzing a sequencing dataset is read alignment: the process of determining where each snippet of sequencing data ("read") came from with respect to a reference genome. Currently, genomics research is hampered by the use of a single, arbitrary reference. This fails to account for the vast genetic diversity that exists among humans and model organisms. Further, it can result in "reference bias," in turn leading to false or misleading scientific results. We propose a three-aim project that addresses the reference bias problem on multiple fronts.
In Aim 1, we will develop new methods and a new software tool called biastools for summarizing and visualizing reference bias.
In Aim 2, we will develop new software and methods that address reference bias by enabling alignment to multiple representative reference genomes. In one subproject, we will use genotype imputation to infer a personalized genome with the help of a large panel of reference haplotypes. In a second subproject, we will use small collections of representative genomes connected in a "flow graph," so that reads are ultimately analyzed with respect to the most appropriate reference. The methods described in both subprojects will be implemented as part of a new software tool called pals. Also as part of this aim, we will release a software library and tool called jector for transforming alignments from one reference coordinate system to another.
In Aim 3, we will apply a novel text-indexing method called r-index to enable alignment of reads to large panels of reference haplotypes. We will release this software as a library and tool called pandex.
Successful completion of the project will provide the community with new methods and references that leverage the genetic information we are gleaning from large-scale genotyping studies and from new long-read assemblies. All software will be made available under an open-source license.
Many researchers use DNA sequencing to study disease and biology, and analyzing this data requires sophisticated software capable of piecing together puzzles made of billions of fragments of DNA. The main strategy used to assemble the puzzle, aligning sequencing reads to a genome, suffers from "reference bias," which causes it to give incorrect answers downstream. Here we propose a suite of new methods, visualizations, software tools, and genome representations that help researchers to analyze sequencing data while avoiding the perils of reference bias.