The current human reference genome, GRCh38, plays a central role in medical and population human genetics. It primarily models a single human individual and is missing hundreds of thousands of large structural variations segregating in the population. This underrepresentation of genetic diversity leads to various artifacts in data analysis and significantly hampers our understanding of the functional and medical relevance of these large human variations, which may collectively have a pervasive impact. To address this issue, we will extend our previous work on sequence graphs and alignment algorithms and construct a pan-genome reference graph from hundreds of long-read human assemblies that more completely represent genetic diversity. Specifically, we will (1) design a reference graph model with a stable coordinate system compatible with GRCh38 and develop toolkits and libraries to interact with this model; (2) develop minimizer-based sequence-to-graph alignment algorithms for short and long sequences; (3) incrementally construct a reference graph by mapping assemblies to an existing graph and updating the graph; and (4) develop a graph-based genotyping algorithm and apply it to short-read-based projects to call structural variations missed by the current pipelines. Upon completion, the proposed project could replace the current practices based on a linear genome and will enable the profiling and study of complex human variations missed in most current research.
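
For readers unfamiliar with the minimizers named in aim (2), the following is a minimal illustrative sketch, not the project's implementation. It assumes the simplest textbook definition: in every window of `w` consecutive k-mers, keep the lexicographically smallest one, taking the leftmost on ties. The function name `minimizers` and the parameters `k` and `w` are hypothetical choices made for this example; practical aligners typically order k-mers by a hash and also consider the reverse strand, both of which are omitted here.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return the distinct (position, k-mer) minimizers of seq.

    Illustrative assumption: k-mers are ordered lexicographically and
    ties inside a window are broken by taking the leftmost k-mer.
    """
    # All k-mers of the sequence, indexed by their start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    # Slide a window covering w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first smallest element, i.e. the leftmost on ties.
        offset = min(range(w), key=lambda j: window[j])
        hit = (start + offset, window[offset])
        # Adjacent windows often select the same k-mer; record it once.
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGT", k=3, w=2):
        print(pos, kmer)
```

Two sequences that share a sufficiently long exact match also share the minimizers inside it, which is why a sparse minimizer index can seed alignment without storing every k-mer.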

Public Health Relevance

The human reference genome sequence is at the heart of all sequence-related biomedical research, such as population genetics, phenotype association, cancer genomics, and sequencing-based diagnostics and disease treatment. The proposed studies aim to deliver a more complete human reference genome and tools supporting the new representation. The success of this proposal could push the study of complex human variations and their association with diseases to the next level.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project--Cooperative Agreements (U01)
Project #
1U01HG010961-01
Application #
9904877
Study Section
Special Emphasis Panel (ZHG1)
Program Officer
Sofia, Heidi J
Project Start
2020-03-01
Project End
2023-02-28
Budget Start
2020-03-01
Budget End
2021-02-28
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Dana-Farber Cancer Institute
Department
Type
DUNS #
076580745
City
Boston
State
MA
Country
United States
Zip Code
02215