Consider several unrelated human individuals from a population. It is common sense that these individuals descend from common ancestors if one traces back far enough in time. Indeed, the sampled individuals share a genealogical history that specifies their ancestry. This genealogy is very informative: it can tell, for example, which individuals are closely related. One potential application of the genealogy is understanding why some individuals are more prone than others to certain phenotypic traits (such as diabetes or cancer). It may be that individuals sharing a trait are more closely related to each other in the genealogy than to the rest of the population.
Genealogy, although useful, cannot be directly observed. Genetic variation, however, provides hints about the underlying genealogy of the sampled individuals. One common type of genetic variation is the single nucleotide polymorphism (SNP). A SNP is a genomic position at which individuals in a population may carry different nucleotides. A moment's thought suggests that individuals with the same nucleotide at a SNP tend to be more closely related than individuals with different nucleotides. This may allow one to infer a plausible underlying genealogy from genetic data collected at multiple SNPs. Inferring genealogy from real SNP data is, however, much more complex than this. One main difficulty is caused by meiotic recombination. Without recombination, the genealogy can be modeled as a tree (similar to the familiar tree-of-life model that has been extensively studied in biology). Recombination allows one genome to have more than one ancestor and thus violates the basic property of this simple tree model. Recombination essentially breaks the genome into many small segments, each of which may originate from a different ancestor. That is, the genealogical history may differ from one genomic position to another. Genealogy with recombination is thus much more complex than genealogy without it.
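As a small illustration of how SNP data can reveal that a single tree is not enough, the sketch below applies the classical four-gamete test to a toy data set. It is not the inference method proposed in this project. It assumes haplotypes coded as strings of 0/1 alleles and assumes that each SNP arose from a single mutation (the infinite-sites assumption); the function name and the toy data are hypothetical. Under these assumptions, if all four allele combinations 00, 01, 10 and 11 appear at a pair of SNPs, no single tree can explain both sites, which suggests recombination between them.

```python
from itertools import combinations


def four_gamete_conflicts(haplotypes):
    """Return pairs of SNP indices (i, j) at which all four gametes appear.

    Assumptions (not from the original text): every haplotype is a string of
    '0'/'1' alleles of equal length, and each SNP mutated only once.
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct allele combinations observed at sites i and j.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            conflicts.append((i, j))
    return conflicts


# Hypothetical toy data: four haplotypes typed at three biallelic SNPs.
toy_haplotypes = ["000", "010", "011", "101"]
print(four_gamete_conflicts(toy_haplotypes))
```

In this toy example only the second and third SNPs show all four combinations, so they cannot share a single tree under the stated assumptions, while the other pairs of SNPs are compatible with one.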
This project aims to develop effective computational methods for analyzing the large-scale population genetic data that have become available during the past several years. The main goals are, first, to infer accurately the genealogical history of sampled individuals from the genetic data and, second, to use the inferred genealogy to address several population genetic problems. The successful completion of the proposed research will produce new computational tools and software that may allow population geneticists to better understand the implications of large-scale population genetic data. Potential applications of these tools include, for example, mapping the genomic positions associated with complex traits, inferring the history of population admixture, and finding regions of the genome that are under natural selection.
The intellectual merits of this project are as follows. This project will develop efficient and accurate computational methods for inferring population history from haplotypes based on inferred gene genealogies. A gene genealogy is the evolutionary history of extant population haplotypes, and it captures the underlying linkage disequilibrium (LD) information. While gene genealogies are fundamental to population genetics, most existing inference methods do not use them explicitly because genealogies are not directly observable. Inferring gene genealogies from haplotypes is only now becoming feasible, owing to recent developments in genomic technologies and genealogy inference methods. This project aims to develop effective computational methods for the following two problems. First, new methods for inferring gene genealogies from haplotypes will be developed. Second, new methods for inferring population demographic history (e.g., population admixture) will be developed. Successful completion of the proposed research will produce new, efficient and accurate algorithms, implemented in practical software tools, that allow population biologists to infer population history from genome-scale data.
The broader impacts of this project include the following. The software tools developed will be made freely available to the multidisciplinary research community, and are expected to enable novel biological applications in the inference of complex population history. Research results will be integrated into classroom teaching. The project will ensure broad dissemination of the research results and teaching materials. The proposed educational and outreach activities include training future researchers with unique interdisciplinary skills.