Single nucleotide polymorphism (SNP, pronounced 'snip') data are quickly becoming popular for addressing an array of problems in human genetics and in population and evolutionary biology in general. Modern SNP discovery techniques make it feasible to survey multiple populations and numerous individuals using thousands of SNPs distributed over the entire genome. These large SNP datasets are not only important in human genetics, but will also increase the precision of many population genetic and evolutionary studies in other species. However, the analysis of SNP data is not straightforward. For instance, it is necessary to take into account the condition that every SNP is variable; the appropriate ascertainment correction will depend on the nature of the SNP discovery process. Another difficulty is that each SNP contributes only a minute amount of population genetic or phylogenetic information. Only when information can be extracted efficiently from large SNP datasets do SNPs become more informative about the population or species history than traditional sequence data. Coalescent-based approaches, which take into account the genealogical relationships among sampled individuals, suffer from these restrictions, and no current implementation works well with thousands of independent SNP loci. Our proposed research will focus on three areas of analytical improvement that will facilitate the analysis of SNP data: (1) allowing for biased genealogy ascertainment by finding mathematical formulae that enable us to collapse large genealogies into simpler ones in which all sub-trees whose tips have the same SNP allele, occur in the same population, or belong to the same selection class are combined; (2) improving the precision of the correction for ascertainment bias by exploring different correction schemes; and (3) increasing model accuracy by accommodating different substitution patterns in exons and introns.
Collapsing trees into smaller constructs will be particularly important in reducing the computational burden associated with the analysis of large SNP datasets. These analytical improvements will be implemented in a computer program, which will allow researchers who work on difficult problems in human genetics, human ancestry, and other fields to analyze large numbers of SNP loci in a reasonable amount of time. Single nucleotide polymorphism (SNP) data are already abundant, but current programs are not able to handle this flood of data. We will develop algorithms that collapse the large genealogies of sampled individuals into smaller constructs and that account for biased sampling of SNPs. These methods will reduce the computational burden and will thus enable researchers to work on complicated population models using large numbers of SNP loci in a reasonable amount of time.
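
The condition that every SNP is variable, and the dependence of the correction on the discovery process, can be illustrated with a minimal sketch. It is not the method proposed here. It assumes a standard neutral, constant-size population in which a site carries i derived copies among n sampled chromosomes with weight proportional to 1/i, and a hypothetical discovery scheme in which a SNP is retained only if it is polymorphic in a random panel of m of the n chromosomes. All function names are illustrative.

```python
from math import comb


def neutral_sfs_weights(n):
    """Relative weights 1/i for derived-allele counts i = 1..n-1.

    Assumption: standard neutral model, constant population size.
    """
    return {i: 1.0 / i for i in range(1, n)}


def prob_variable_in_panel(i, n, m):
    """Probability that a site with i derived copies among n chromosomes
    is polymorphic in a random discovery panel of m of those chromosomes
    (hypothetical ascertainment scheme).
    """
    # The panel is monomorphic if it holds only derived or only ancestral copies.
    monomorphic = (comb(i, m) + comb(n - i, m)) / comb(n, m)
    return 1.0 - monomorphic


def ascertained_sfs(n, m):
    """Allele-count distribution conditional on discovery in a panel of size m."""
    weights = {
        i: w * prob_variable_in_panel(i, n, m)
        for i, w in neutral_sfs_weights(n).items()
    }
    total = sum(weights.values())  # probability of being ascertained, up to a constant
    return {i: w / total for i, w in weights.items()}


if __name__ == "__main__":
    # Panel equal to the full sample: condition only on the site being variable.
    print(ascertained_sfs(10, 10))
    # Small discovery panel: intermediate-frequency SNPs are over-represented.
    print(ascertained_sfs(10, 2))
```

With m equal to n the sketch conditions only on the site being variable in the sample; with a small panel, rare alleles are often missed during discovery, which is why the appropriate correction depends on how the SNPs were found.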