Over the next few years, large genotyped cohorts will become available (e.g., TOPMed, UK Biobank, All of Us, Million Veteran Program). As sample sizes approach 0.1%-1% of the total population, such samples contain many distant relatives and extensive identity-by-descent (IBD) information. This information will enable more sophisticated and powerful genetic analyses that go beyond single-variant analyses. However, current informatics methods are not efficient enough to handle genotype data at that scale. We will develop new genome informatics methods for biobank-scale genotyped cohorts. We have developed an efficient tool, RaPID, the first computationally feasible method for inferring IBD segments among individuals in a biobank-scale cohort. We demonstrated that RaPID achieves running time linear in the sample size and is over 100 times faster than existing methods. At the same time, RaPID detects more IBD segments, with higher accuracy and sharper segment boundaries, than existing methods. In this application, we propose to develop (1) the RaPID+ method for pairwise IBD detection, which can tolerate and correct phasing errors, offers a principled way of tuning parameters, and can work with genotype data across sequencing and array platforms; (2) the RaPID-diploid method for detecting IBD2 segments; (3) the RaPID-multiway method for identifying IBD clusters; and (4) the RaPID-ancestry method for local ancestry inference across subcontinental populations. The methods will be rigorously tested in simulations based on realistic population demographic models, as well as on real data from large cohorts. All methods will be implemented as software that is free for academic use. This project will advance genetic research by developing efficient informatics tools that reveal detailed genetic relationships in very large genotyped cohorts.
(Public Health Relevance Statement) The aim of the project is to develop and evaluate accurate and efficient methods and tools for detecting identity-by-descent (IBD) and local ancestry information in large genotyped cohorts, which are resources of increasing importance in the era of precision medicine. If successful, this project will advance genetic research by offering researchers efficient informatics tools that can reveal detailed genetic relationships in very large genotyped cohorts.