Enabling Comparative Pangenomics

To many in the field, it is clear that we are moving rapidly toward a golden age of vertebrate comparative genomics in which thousands of high-quality genomes of different species are publicly available and used in understanding the human genome. Despite the opportunity presented by the growth in available genomes, the software used to compare complete genomes has relatively stagnated: most of it is old and limited in capabilities. To remedy this situation, we will create a hardened toolkit for genome comparison and annotation that can be robustly applied to thousands of vertebrate genomes. To demonstrate this toolkit and deliver its results to the broader genomics community, we will apply it to create a resource within the existing UCSC and Ensembl Genome Browsers that will incorporate thousands of vertebrate genomes. Large, well-organized consortia have coalesced to take on the challenge of sequencing and assembling vertebrate genomes. Our alignments will form a backbone of these projects' analyses, and our synthesis of their data will create a resource that is much greater than the sum of what might otherwise be a series of smaller, fragmented and not directly comparable efforts. We will gather more than 600 vertebrate genomes into our proposed resource in the first year of the proposal, rapidly delivering results.

Paralleling the growth in available reference genomes, the last decade has been marked by an explosion in population sequencing projects. Although much of the cataloged human variation has a very recent evolutionary origin, there is a tremendous opportunity to combine, and so better understand, intra- and inter-species change using models from population genetics. We will create pangenome software to (i) avoid reference bias in species comparisons (i.e., avoid assumptions about which alleles are fixed when comparing between species, which is important in quasi-species such as cichlids), (ii) allow ancestral alleles to be comprehensively estimated, including those that are part of structural variation, and (iii) more easily enable the study of balancing selection. To demonstrate the utility of comprehensive variation integration, we will create a prototype pangenome graph for the apes. We will use this graph to identify ancestral alleles and to dynamically convert annotations between species and assembly versions, and, via population mapping experiments, we will demonstrate its power for typing segregating but ancient variation. Using knowledge of ape evolution, we will ultimately extend this graph to adequately model the most complex regions of the human genome.

Public Health Relevance

Project Title: Enabling Comparative Pangenomics

Project Narrative

This proposal will create a flexible, scalable toolkit for accurate, reference-free, duplication-aware genome alignment and annotation of diverse diploid genomes. It will apply this toolkit to the alignment, annotation, visualization and ancestral history reconstruction of thousands of vertebrate genomes. Building on this toolkit, it will integrate inter- and intra-species variation, focusing on humans and ape outgroups, to enhance our understanding of human variation and its ancestral origins.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG010485-02
Application #
10117276
Study Section
Genetic Variation and Evolution Study Section (GVE)
Program Officer
Sofia, Heidi J
Project Start
2020-03-02
Project End
2023-12-31
Budget Start
2021-01-01
Budget End
2021-12-31
Support Year
2
Fiscal Year
2021
Total Cost
Indirect Cost
Name
University of California Santa Cruz
Department
Engineering (All Types)
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
125084723
City
Santa Cruz
State
CA
Country
United States
Zip Code
95064