Next-generation sequencing is ubiquitous in the study of biology and disease. The first step when analyzing a sequencing dataset is read alignment: the process of determining where each snippet of sequencing data ("read") came from with respect to a reference genome. Currently, genomics research is hampered by the use of a single, arbitrary reference. This fails to account for the vast genetic diversity that exists among humans and model organisms. Further, it can result in "reference bias," in turn leading to false or misleading scientific results. We propose a three-aim project that addresses the reference bias problem on multiple fronts.
In Aim 1, we will develop new methods and a new software tool called biastools for summarizing and visualizing reference bias.
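As an illustration only, one common way to quantify reference bias is to look at heterozygous sites, where roughly half of the overlapping reads would be expected to carry the reference allele; a mean fraction noticeably above 0.5 suggests that reads carrying the alternate allele are being lost or misaligned. The sketch below assumes per-site read counts are already available. The names `HetSite`, `ref_fraction`, and `summarize_bias` are hypothetical and are not taken from biastools.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HetSite:
    """Read counts at one heterozygous site (hypothetical input record)."""
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def ref_fraction(site: HetSite) -> Optional[float]:
    """Fraction of reads carrying the reference allele, or None if uncovered."""
    total = site.ref_reads + site.alt_reads
    if total == 0:
        return None
    return site.ref_reads / total


def summarize_bias(sites: Iterable[HetSite]) -> Optional[float]:
    """Mean reference-allele fraction over covered sites.

    A value near 0.5 is consistent with no bias under this simple model;
    values above 0.5 point toward the reference allele being favored.
    """
    fractions = [f for f in (ref_fraction(s) for s in sites) if f is not None]
    if not fractions:
        return None
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    example = [HetSite(10, 10), HetSite(15, 5), HetSite(0, 0)]
    print(summarize_bias(example))
```

This simple average treats every covered site equally; a depth-weighted summary or a per-site statistical test would be reasonable alternatives.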
In Aim 2, we will develop new software and methods that address reference bias by enabling alignment to multiple representative reference genomes. In one subproject, we will use genotype imputation to infer a personalized genome with the help of a large panel of reference haplotypes. In a second subproject, we will use small collections of representative genomes connected in a "flow graph," so that reads are ultimately analyzed with respect to the most appropriate reference. The methods described in both subprojects will be implemented as part of a new software tool called pals. Also as part of this aim, we will release a software library and tool called jector for transforming alignments from one reference coordinate system to another.
In Aim 3, we will apply a novel text-indexing method called the r-index to enable alignment of reads to large panels of reference haplotypes. We will release the software as a library and tool called pandex.
Successful completion of the project will provide the community with new methods and references that leverage the genetic information we are gleaning from large-scale genotyping studies and from new long-read assemblies. All software will be made available under an open source license.

Public Health Relevance

Many researchers use DNA sequencing to study disease and biology, and analyzing this data requires sophisticated software capable of piecing together puzzles made of billions of fragments of DNA. The main strategy used to assemble the puzzle, aligning sequencing reads to a genome, suffers from "reference bias," which causes it to give incorrect answers downstream. Here we propose a suite of new methods, visualizations, software tools, and genome representations that help researchers to analyze sequencing data while avoiding the perils of reference bias.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG011392-01
Application #
10057490
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Sofia, Heidi J
Project Start
2020-09-01
Project End
2025-06-30
Budget Start
2020-09-01
Budget End
2021-06-30
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Johns Hopkins University
Department
Biostatistics & Other Math Sci
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21205