Technological advances in high-throughput sequencing and custom genotyping arrays are making genetic studies larger than ever. The number of studies generating whole genome sequencing (WGS) data has increased substantially over the past few years, and the NHLBI alone is expected to generate WGS data on over 30,000 individuals in the next year. Ongoing collections of exome sequence data are now approaching 200,000 subjects, and future NIH-funded and privately funded projects will soon generate WGS data with similar sample sizes. There is a great need to make full use of these newly generated data, including better ways to identify and utilize relatedness information. Owing to its computational efficiency, our relationship inference tool (KING) has been the main software tool for inferring relationships in large genetic studies over the past few years. Given the challenges and opportunities presented by high-throughput genotyping, exome, and whole genome sequence data, an even faster, more reliable, and more powerful relationship inference procedure and tool is urgently needed. Such a tool would open possibilities to inform rare variant association analysis beyond currently available approaches. We propose to develop robust and computationally efficient algorithms to infer close and distant family relationships in large datasets of thousands to hundreds of thousands of individuals. The fast algorithm will allow identification of close relationships in datasets of more than 100,000 individuals. The algorithms proposed specifically for rare variant data from WGS will allow us to infer relationships more reliably in the presence of inbreeding, population structure (including population admixture), and/or sample contamination, and to resolve relationships of more distant degree.
Further, we plan to develop an integrated toolset built on our fast relationship inference algorithms, including pedigree reconstruction, quality control (QC), and family-based association methods. Preliminary data analysis shows that our algorithm can identify all close relationships in the 1000 Genomes data in 12 seconds. We also successfully inferred an extended pedigree containing only distant (2nd- and 3rd-degree) relationships, consisting of an aunt, her niece, and her first cousin. Our proposed methods will be implemented in freely distributed software (KING), allowing other investigators to apply them directly to their own sequencing and other high-throughput array data. We expect the relationship inference methods developed here to play an important role in the quality control and analysis of large sets of genetic and genomic data in the coming years.

Public Health Relevance

We propose to develop statistical methods and computationally efficient tools to infer cryptic family relationships, establish extended pedigrees/lineages, and increase power and resolution for mapping common and rare variants that contribute to human diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Li, Rongling
University of Virginia
Public Health & Prev Medicine
Schools of Medicine
United States