In this research project, the investigators construct statistical methods and algorithms for SNP analysis that are designed to enhance the biological realism of the underlying models. The project has three primary aims. The first extends the development of a new method for constructing hierarchical trees from sequence data using maximum likelihood and modal inference. The tree construction applies either of these two inference methods to an ancestral mixture model, a model whose parameters describe the population structure at each fixed time point T in the past. If one estimates this structure over a fine grid of time points T, the relationship between the estimates over time can be described graphically as a hierarchical tree. The second aim is to enhance the biological realism of the ancestral mixture model to include (a) multi-state characters, (b) advanced models of sequence evolution, and (c) recombination. The extensions are based on diffusion kernels constructed from continuous-time Markov chains. The investigators also propose to use empirical Bayes methods to improve overall estimation precision. The third aim is the development of a new method for reconstructing haplotype sequences from genotype data without parental information. The methods and algorithms are based on the ancestral mixture models together with a multi-moment approach that simplifies computation. In addition, the investigators propose to extend the method to long sequences by sliding a window along longer genotype sequences and then using the information from the overlapping estimates to construct longer haplotype estimates.
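
As a rough illustration of what a kernel built from a continuous-time Markov chain can look like, the sketch below uses the simplest case: a symmetric k-state chain in which every state changes to each other state at the same rate. This is an assumption chosen for illustration, not the investigators' model. The names site_kernel, sequence_kernel, and mixture_density, the rate parameter mu, and the example weights are all hypothetical.

```python
import math


def site_kernel(same, T, mu=1.0, k=2):
    """Transition probability over time T for one site.

    Assumes a symmetric k-state continuous-time Markov chain in which each
    state moves to each other state at rate mu (an illustrative assumption).
    """
    decay = math.exp(-k * mu * T)
    if same:
        return 1.0 / k + (1.0 - 1.0 / k) * decay
    return (1.0 - decay) / k


def sequence_kernel(ancestor, descendant, T, mu=1.0, k=2):
    """Probability of a descendant sequence given an ancestor, sites independent."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must have the same length")
    prob = 1.0
    for a, d in zip(ancestor, descendant):
        prob *= site_kernel(a == d, T, mu, k)
    return prob


def mixture_density(sequence, ancestors, weights, T, mu=1.0, k=2):
    """Mixture over candidate ancestral sequences, weighted by their proportions."""
    return sum(
        w * sequence_kernel(anc, sequence, T, mu, k)
        for anc, w in zip(ancestors, weights)
    )


if __name__ == "__main__":
    # Hypothetical ancestral sequences and proportions at time T in the past.
    ancestors = ["0000", "1111"]
    weights = [0.5, 0.5]
    # Evaluate the same observed sequence over a small grid of time points.
    for T in (0.05, 0.2, 1.0):
        print(T, mixture_density("0001", ancestors, weights, T))
```

For small T the kernel concentrates on sequences that match an ancestor closely, and as T grows it flattens toward the uniform distribution over sequences. That gradual smoothing is the property that lets estimates at different values of T be compared along a grid.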

The current release of dbSNP, the database maintained by the National Center for Biotechnology Information (NCBI), contains over 11.5 million human single nucleotide polymorphism (SNP) records, a 10-fold increase over the last 4 years. Analysis of such data has become a focus of much research within bioinformatics and computational biology because SNPs carry the information that distinguishes the individuals within a species. Encoded in this data is important information about the relationship between the characteristics of an individual and their genetic code. The investigators are developing methods and models for this data that differ fundamentally in approach from the standard techniques currently used in two areas where SNP data is used, coalescent and phylogenetic inference, in which one reconstructs genetic relationships similar to family trees based on the current descendants only. The broader impacts of this project include the development of methods and models with potentially wide-ranging uses across scientific disciplines, the increased fusion of biological and mathematical innovation, and opportunities for broad interdisciplinary training of diverse students.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0714949
Program Officer: Mary Ann Horn
Budget Start: 2007-08-15
Budget End: 2013-02-28
Fiscal Year: 2007
Total Cost: $624,592
Name: Arizona State University
City: Tempe
State: AZ
Country: United States
Zip Code: 85281