The NHLBI TOPMed whole genome sequencing (WGS) studies are generating sequence reads at an unprecedented scale, totaling >2 quadrillion bases and >300 million variants across >20,000 individuals. While >97% of accessible genomic regions are exhaustively interrogated by existing variant calling methods, the ~3% of genomic regions that are repeat-rich are insufficiently interrogated because of the limited ability to call short tandem repeats (STRs). Because ~50% of short insertions and deletions (indels) are found in repeat-rich regions of the genome, comprehensive STR calling is important for reaching near-complete sensitivity in identifying disease-causing variants from TOPMed WGS studies. In this application, we build on our record of developing innovative methods and analyzing petabytes of TOPMed WGS reads to generate comprehensive and accurate short variant calls, with a focus on STRs, from TOPMed WGS studies. We leverage related and duplicated samples to improve the quality of STR calls. We also propose to estimate mitochondrial DNA copy numbers and telomere lengths from the sequence data, and to perform genome-wide association studies to demonstrate the power of the new STR-augmented callset.
Short tandem repeats (STRs) constitute a large fraction (>50%) of short insertions and deletions (indels) but are currently undercalled by most existing variant calling methods. Because STRs differ from simple biallelic SNPs and indels in mutational mechanisms, recurrence rate, allele frequency spectrum, and error rates in sequencing and alignment, they are often poorly tagged by existing array-based SNPs and potentially explain a large fraction of the missing heritability of complex traits. By comprehensively calling STRs and performing genome-wide association analysis of two traits, mitochondrial DNA copy numbers and telomere lengths, both of which are reported to be associated with many cardiovascular and hematologic traits, we expect that our analysis will reveal novel biological insights into the genetic architecture of these traits. In addition, the variant callset that will be generated and deposited from our proposed study will motivate other investigators of TOPMed WGS studies to broaden their analyses to encompass STRs and other complex variants identified exclusively in our callset.
Regier, Allison A; Farjoun, Yossi; Larson, David E et al. (2018) Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat Commun 9:4038