(Project 1: Maintaining, improving, and providing the human reference) We propose to combine the best current methods and practices with a practical, scalable model to create and share a broadly useful pan-human genome reference (the pan-genome) based on the assemblies and raw data delivered by the genome production center. The pan-genome reference we propose does not wholly replace the existing GRCh38 reference; rather, it builds substantially upon GRCh38 to create a reference resource that incorporates a much richer representation of human genetic diversity. To deliver this plan, we have assembled a strong group of individuals with complementary expertise who together will construct, share, maintain and improve a state-of-the-art human pan-genome resource. To convert the initial genome assemblies into a high-quality human genome reference cohort, we will perform assembly quality control, error correction and validation, mirroring the successful processes we have created for maintaining and improving the existing human reference genome. To make this collection of assemblies comparable, we will then create a comprehensive map of the genomic variants that exist among them. Using the human genome reference cohort and variation map, we will create a human pan-genome reference that has three complementary and essential parts: (i) a sequence graph that encodes the genomes in a non-redundant manner by merging together shared sequences; (ii) a searchable encoding of the haplotypes in the sequence graph, something that the graph itself does not capture; and (iii) a coordinate system that makes it possible to refer to all the variation equally while preserving backwards compatibility with GRCh38. To build this graph, we will use tools that we have developed and will work with the community to test and prototype the approach, releasing stable and versioned pan-genome references.
We also propose a plan to handle error reports from the consortium and broader community, and to fix errors in the references via data analysis, curation and targeted sequencing. We will follow a completely open model to ensure all data (primary data, genome assemblies and pan-genome reference) are simultaneously available via AnVIL and accessioned through appropriate international databases. We will add value to this resource by creating a core set of pan-genome functional annotations, focusing primarily on genes. To make working with and migrating to the proposed pan-genome reference straightforward, we will foster the creation of new and updated tools. Our strategy harnesses the community by working with tool developers, in particular establishing and promoting exchange formats and creating benchmarks to promote best-of-breed community methods.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Biotechnology Resource Cooperative Agreements (U41)
Project #: 5U41HG010972-02
Application #: 10020432
Study Section: Special Emphasis Panel (ZHG1)
Project Start:
Project End:
Budget Start: 2020-08-01
Budget End: 2021-07-31
Support Year: 2
Fiscal Year: 2020
Total Cost:
Indirect Cost:
Name: Washington University
Department:
Type:
DUNS #: 068552207
City: Saint Louis
State: MO
Country: United States
Zip Code: 63130