The investigator studies ways to estimate the average amount of selection for or against new mutations at a particular genetic locus, using information contained in aligned DNA or protein sequences. If the estimated amount of selection is significantly negative, this indicates that the region is under stabilizing selection and may have an important biological function. If the estimated average amount of selection is positive, then new variants in this region are favored, as is common for some immune system genes. Models that allow the amount of selection to vary randomly across sites are also considered. The methods studied are computationally easy to carry out and have easily obtained significance levels, but they assume statistical independence between polymorphic sites rather than the more usual assumption of tight linkage. Studies have shown that in many cases short-segment gene conversion and recombination act at a greater rate than point mutation, so tight linkage is not always a reasonable assumption. Simulations are carried out to test whether small to moderate amounts of gene conversion and recombination are sufficient to guarantee enough independence between sites for the proposed estimators to have desirable statistical properties. The models are applied to aligned sequences from public databases. They are also applied to human SNP datasets, for which the assumption of sitewise independence is more clearly justified.
Recent advances in biology have made a vast amount of data available about the human genome and the genomes of other species. Tools for understanding the significance of these data, and in particular the significance of variation between individuals, have lagged behind. One question is whether new genetic variants at a particular genetic locus (or segment of DNA) have on average been favored in the past, have rendered their carriers less fit, or have left the fate of the individuals carrying them essentially unchanged. Tests are proposed that give a computationally easy way of determining which of these three cases applies at a particular genetic locus. These tests are based on the configurations of variant sites in DNA from several individuals at that locus. The results of these tests can give information about the function of this stretch of DNA. They can also give information about how a particular gene has been affected, or is currently being affected, by evolution. Data from related biological species can provide further information. The main difficulty with the proposed methods is that they assume that different portions of the genetic locus are evolving approximately independently. For some genetic data (for example, single-nucleotide polymorphism, or SNP, data) this can generally be assumed. The sensitivity of the proposed methods to this independence assumption is tested for non-SNP data.
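As a rough illustration of how independence between sites makes such an estimate computationally easy, the sketch below fits a single scaled selection parameter to the counts of variant sites observed at each sample frequency. The abstract does not specify the model. The Poisson-count setup, the diffusion-style frequency weight, the 2*gamma scaling, the grid search, and all function names here are assumptions made for illustration only, not the investigator's method.

```python
import math


def selection_weight(x, gamma):
    # Assumed weight (1 - exp(-2*gamma*(1 - x))) / (1 - exp(-2*gamma)) at
    # population frequency x; it tends to the neutral value (1 - x) as gamma -> 0.
    if abs(gamma) < 1e-8:
        return 1.0 - x
    return math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)


def expected_counts(n, theta, gamma, steps=400):
    # Expected number of sites at which i of n sampled sequences carry the
    # new variant, for i = 1..n-1, using midpoint-rule integration over x.
    h = 1.0 / steps
    means = []
    for i in range(1, n):
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) * h
            total += x ** (i - 1) * (1.0 - x) ** (n - i - 1) * selection_weight(x, gamma)
        means.append(theta * math.comb(n, i) * total * h)
    return means


def profile_loglik(counts, gamma):
    # Independent Poisson counts: with the mutation-rate parameter theta
    # replaced by its maximizer, the log-likelihood depends on gamma only
    # through the relative class proportions (constants in gamma dropped).
    n = len(counts) + 1
    shape = expected_counts(n, 1.0, gamma)
    total = sum(shape)
    return sum(k * math.log(g / total) for k, g in zip(counts, shape) if k > 0)


def estimate_gamma(counts, grid=None):
    # Crude grid search over gamma in [-10, 10]; negative values suggest
    # selection against new variants, positive values selection for them.
    if grid is None:
        grid = [g / 10.0 for g in range(-100, 101)]
    return max(grid, key=lambda gamma: profile_loglik(counts, gamma))
```

Here counts[i - 1] is the number of variant sites at which exactly i of the n = len(counts) + 1 sampled sequences carry the new variant. Under these assumptions, counts proportional to 1/i (for example, estimate_gamma([60, 30, 20, 15, 12, 10]) for a sample of 7) should give an estimate at or near zero, while an excess of rare variants pushes the estimate negative. A significance level would additionally require comparing the maximized value against the value at gamma = 0, which this sketch omits.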

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer: Mary Ann Horn
Washington University
Saint Louis
United States