K-mer indexing for pan-genome reference annotation

Ji, Hanlee; Weissman, Tsachy

Abstract

The human genome reference sequence is one of the foundations of genome sciences, especially in the context of next-generation sequencing (NGS) analysis. The reference has enabled discoveries in biomedical research and been particularly instrumental in human disease gene identification. However, the human genome reference is limited by its static and linear nature. Specifically, the current reference lacks the featural and contextual flexibility to represent the breadth of human variation. Important elements of individual genomes are either missed or incorrectly represented. As a solution that will bridge the next generation of reference assemblies with population genome sequencing studies, we have developed a K-mer-based indexing approach. This method is more efficient computationally, provides accurate representation in the context of populations and facilitates the analysis of diverse human genomes. Our goal is to use this strategy in developing a robust computational architecture that will encode and annotate large collections of genomes in the context of a pan-genome reference.

First, we plan to develop a scalable, efficient K-mer representation of a large collection of haplotype/phased reference genomes, by 1) generating an index of all K-mers in human reference genome GRCh38 in a manner that can efficiently store variant information as metadata, and then 2) incrementally updating the K-mer index to include all novel K-mers derived from ongoing population sequencing efforts, while 3) developing schemes for directly analyzing compressed genomic data.

Second, we plan to apply K-mer representation to genomic analysis by 1) providing the entirety of known human genetic variation in an aggregated index that is computationally efficient and easy to understand, 2) developing functions for our pan-genomic index that support ultra-rapid queries, such as of clinically important variants, and 3) linking conventional coordinate information to the K-mer metadata in the pan-genome index to allow annotating genetic variation to a particular genome reference.

Third, we will create an online web portal for the pan-genome, using cloud computing, to maximize the utility of our approach, to promote community engagement and to enable contributions from the research community.

We expect that completion of these aims will provide: a scalable computational architecture which incorporates the continuous addition of variant information without loss of resolution or accuracy; rapid query speeds that will remain nearly constant as the database grows; and a universally accessible portal using cloud computing. This work will help solve the issues of multiple assemblies. It will improve researchers' ability to understand the relationship of variants and disease, while also providing great savings over the long term in infrastructure and computing costs.
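As an illustration of the general idea behind the first aim (index every K-mer of a reference, attach metadata, then fold in novel K-mers from additional genomes), the following is a minimal Python sketch. It is not the project's implementation: the class name `KmerIndex`, its methods, the in-memory dictionary, the choice of K = 4, and the toy sequences are all assumptions made for illustration. A production index over GRCh38 would need a compressed, disk-backed structure.

```python
from collections import defaultdict


class KmerIndex:
    """Toy K-mer index (hypothetical): maps each K-mer to (source, offset) metadata."""

    def __init__(self, k):
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        # K-mer string -> list of (source name, 0-based offset) occurrences
        self.occurrences = defaultdict(list)

    def add_sequence(self, source, sequence):
        """Index every K-mer of `sequence`; return the set of K-mers not seen before."""
        sequence = sequence.upper()
        novel = set()
        for offset in range(len(sequence) - self.k + 1):
            kmer = sequence[offset:offset + self.k]
            if kmer not in self.occurrences:
                novel.add(kmer)
            self.occurrences[kmer].append((source, offset))
        return novel

    def lookup(self, kmer):
        """Return all recorded (source, offset) pairs for `kmer` (empty if absent)."""
        return list(self.occurrences.get(kmer.upper(), []))


# Toy data: a made-up "reference" and a made-up sample differing by one base.
index = KmerIndex(k=4)
novel_in_reference = index.add_sequence("reference", "ACGTACGT")
novel_in_sample = index.add_sequence("sample_1", "ACGTTCGT")
shared_hits = index.lookup("ACGT")
```

In this sketch a lookup is a single hash-table access, so its cost does not depend on how many sequences have been added. That property is what the abstract's "nearly constant" query speed alludes to, although the project's actual data structure is not described here. The `(source, offset)` pairs stand in for the coordinate metadata that the second aim proposes to link to each K-mer.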

Public Health Relevance

A comprehensive understanding of human biology and disease requires having a guide to the complete genetic code for humans. Referred to as a genome reference, the current code does not fully describe the features seen in many human genomes. To address this issue and enable forward compatibility, we will develop a strategy that enables any person to have their genome analyzed and annotated with greater speed and accuracy than is currently feasible.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01HG010963-02
Application #: 10093116
Study Section: Special Emphasis Panel (ZHG1)
Program Officer: Sofia, Heidi J

Project Start: 2020-02-01
Project End: 2023-01-31
Budget Start: 2021-02-01
Budget End: 2022-01-31
Support Year: 2
Fiscal Year: 2021

Institution

Name: Stanford University
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 009214214

City: Stanford
State: CA
Country: United States
Zip Code: 94305

Related projects


NIH 2021 U01 HG	K-mer indexing for pan-genome reference annotation Ji, Hanlee P.; Weissman, Tsachy / Stanford University
NIH 2020 U01 HG	K-mer indexing for pan-genome reference annotation Ji, Hanlee P.; Weissman, Tsachy / Stanford University
