The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is more efficient computationally, provides accurate representation in the context of populations and facilitates the analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference.

First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes, by 1) generating an index of all K-mers in human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for directly analyzing compressed genomic data.

Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as of clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index to allow annotating genetic variation to a particular genome reference.

Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement and to enable contributions from the research community.

We expect that completion of these aims will provide: a scalable computational architecture which incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; a universally accessible portal using cloud computing.

This work will help solve the issues of multiple assemblies. It will improve researchers' ability to understand the relationship of variants and disease, while also providing great savings over the long-term in infrastructure and computing costs.
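
To make the indexing idea in the first two aims concrete, the following is a minimal sketch, not the project's implementation. It assumes a plain in-memory dictionary that maps each K-mer to the coordinates where it occurs, and it uses short toy sequences rather than GRCh38. The names `KmerIndex`, `add_sequence`, and `lookup` are hypothetical and chosen only for illustration. The sketch shows three things: building an index from a reference, adding a second haplotype incrementally while reporting the novel K-mers it contributes, and answering a query with one hash lookup, so query cost does not depend on how many genomes have been added.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

# Hypothetical metadata record: (sequence name, 0-based start position).
Occurrence = Tuple[str, int]


def kmers(seq: str, k: int) -> Iterator[Tuple[int, str]]:
    """Yield (position, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


class KmerIndex:
    """Toy K-mer index; a real pan-genome index would use compact encodings."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.table: Dict[str, List[Occurrence]] = defaultdict(list)

    def add_sequence(self, name: str, seq: str) -> Set[str]:
        """Index seq and return the k-mers that were absent from the index."""
        novel: Set[str] = set()
        for pos, kmer in kmers(seq.upper(), self.k):
            if kmer not in self.table:  # membership test does not create a key
                novel.add(kmer)
            self.table[kmer].append((name, pos))
        return novel

    def lookup(self, kmer: str) -> List[Occurrence]:
        """Return every recorded (sequence, position) for kmer."""
        return list(self.table.get(kmer.upper(), []))


# Toy data (assumed for illustration): the haplotype differs from the
# reference by a single G-to-C substitution at position 6.
index = KmerIndex(k=4)
index.add_sequence("toy_ref", "ACGTACGTGA")
novel = index.add_sequence("hap1", "ACGTACCTGA")

print(sorted(novel))         # k-mers introduced by the variant
print(index.lookup("ACGT"))  # coordinates in both sequences
```

In this toy case the single substitution introduces exactly four novel 4-mers, one for each window that overlaps the changed base. Storing the sequence name and position alongside each K-mer is one simple way to link K-mers back to conventional coordinates, as the second aim describes.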

Public Health Relevance

A comprehensive understanding of human biology and disease requires having a guide to the complete genetic code for humans. Referred to as a genome reference, the current code does not fully describe the features seen in many human genomes. To address this issue and enable forward compatibility, we will develop a strategy that enables any person to have their genome analyzed and annotated with greater speed and accuracy than is currently feasible.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01HG010963-02
Application #
10093116
Study Section
Special Emphasis Panel (ZHG1)
Program Officer
Sofia, Heidi J
Project Start
2020-02-01
Project End
2023-01-31
Budget Start
2021-02-01
Budget End
2022-01-31
Support Year
2
Fiscal Year
2021
Total Cost
Indirect Cost
Name
Stanford University
Department
Internal Medicine/Medicine
Type
Schools of Medicine
DUNS #
009214214
City
Stanford
State
CA
Country
United States
Zip Code
94305