The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and has been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is more efficient computationally, provides accurate representation in the context of populations and facilitates the analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference.
First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes, by 1) generating an index of all K-mers in the human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for directly analyzing compressed genomic data.
Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as queries for clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index to allow annotating genetic variation to a particular genome reference.
Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement and to enable contributions from the research community.
We expect that completion of these aims will provide: a scalable computational architecture which incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; and a universally accessible portal using cloud computing.
This work will help solve the issues of multiple assemblies. It will improve researchers' ability to understand the relationship of variants and disease, while also providing great savings over the long term in infrastructure and computing costs.
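The indexing idea behind the first two aims can be illustrated with a small sketch. It is not the architecture proposed here: the summary does not specify a data structure, so the example assumes a plain in-memory Python dictionary, and the class name `KmerIndex`, the labels `ref` and `hap1`, and the nine-base toy sequences are all hypothetical. It shows a K-mer table that carries coordinate metadata, an incremental update that adds only the novel K-mers of a new haplotype, and a constant-time lookup.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class KmerIndex:
    """Toy pan-genome index: K-mer -> set of (genome label, 0-based offset)."""
    k: int
    table: dict = field(default_factory=lambda: defaultdict(set))

    def add_sequence(self, label: str, seq: str) -> int:
        """Index every K-mer of seq and return how many were new to the index."""
        novel = 0
        for pos in range(len(seq) - self.k + 1):
            kmer = seq[pos:pos + self.k]
            if kmer not in self.table:  # membership test does not create an entry
                novel += 1
            self.table[kmer].add((label, pos))
        return novel

    def lookup(self, kmer: str) -> set:
        """Return the (label, offset) metadata for a K-mer, or an empty set."""
        return set(self.table.get(kmer, ()))


# Hypothetical toy data: a "reference" and one haplotype that differs by a
# single substitution (A -> T) at 0-based position 4.
REF = "ACGTACGGA"
HAP = "ACGTTCGGA"

index = KmerIndex(k=4)
ref_novel = index.add_sequence("ref", REF)   # builds the initial index
hap_novel = index.add_sequence("hap1", HAP)  # incremental update

print(ref_novel, hap_novel)
print(sorted(index.lookup("ACGT")))  # K-mer shared by both genomes
print(sorted(index.lookup("GTTC")))  # K-mer that exists only because of the variant
```

In this toy case the single substitution contributes four novel K-mers, one for each window of length four that overlaps the changed base, while the shared K-mers simply gain an extra coordinate entry. Dictionary lookups take expected constant time regardless of how many genomes have been added, which is the property the expected-outcomes statement refers to; a production index would need a compressed, disk-backed representation rather than a Python dictionary.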
A comprehensive understanding of human biology and disease requires having a guide to the complete genetic code for humans. Referred to as a genome reference, the current code does not fully describe the features seen in many human genomes. To address this issue and enable forward compatibility, we will develop a strategy that enables any person to have their genome analyzed and annotated with greater speed and accuracy than is currently feasible.