The current human reference genome, GRCh38, plays a central role in human medical and population genetics. It primarily models a single human individual and is missing hundreds of thousands of large structural variations segregating in the population. This underrepresentation of genetic diversity leads to various artifacts in data analysis and significantly hampers our understanding of the functional and medical relevance of these large variations, which may collectively have a pervasive impact. To address this issue, we will extend our previous work on sequence graphs and alignment algorithms and construct a pan-genome reference graph from hundreds of long-read human assemblies that more completely represent genetic diversity. Specifically, we will (1) design a reference graph model with a stable coordinate system compatible with GRCh38 and develop toolkits and libraries to interact with this model; (2) develop minimizer-based sequence-to-graph alignment algorithms for short and long sequences; (3) incrementally construct a reference graph by mapping assemblies to an existing graph and updating the graph; and (4) develop a graph-based genotyping algorithm and apply it to short-read-based projects to call structural variations missed by current pipelines. Upon completion, the proposed project could replace current practices based on a linear genome and will enable the profiling and study of complex human variations missed by most current research.
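As a rough illustration of the minimizer idea behind aim (2), the sketch below selects (w, k)-minimizers from a sequence and uses shared minimizers as candidate alignment anchors. It is a minimal sketch on linear sequences only, not the proposed sequence-to-graph algorithm. The function names `minimizers` and `shared_anchors` are hypothetical, and k-mers are ordered lexicographically for simplicity, whereas practical tools typically order them by a hash of the canonical k-mer.

```python
def minimizers(seq, k, w):
    """Return the sorted (position, k-mer) minimizers of seq.

    In every window of w consecutive k-mers, the smallest k-mer is kept.
    Assumption: lexicographic order, ties broken by leftmost position.
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def shared_anchors(query, target, k, w):
    """Pair up positions whose minimizer k-mers match in query and target."""
    index = {}
    for pos, kmer in minimizers(target, k, w):
        index.setdefault(kmer, []).append(pos)
    return [
        (qpos, tpos)
        for qpos, kmer in minimizers(query, k, w)
        for tpos in index.get(kmer, [])
    ]


if __name__ == "__main__":
    # Toy sequences chosen only for illustration.
    print(minimizers("ACGTACGT", k=3, w=2))
    print(shared_anchors("GTACG", "ACGTACGT", k=3, w=2))
```

Anchors that fall on the same diagonal (equal target position minus query position) would then be chained into a candidate alignment. On a graph, the target positions would instead refer to locations on graph segments.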
The human reference genome sequence is at the heart of all sequence-related biomedical research, including population genetics, phenotype association, cancer genomics, and sequencing-based diagnostics and disease treatment. The proposed studies aim to deliver a more complete human reference genome and tools supporting the new representation. The success of this proposal could push the study of complex human variations and their association with diseases to the next level.