Statistical Methods and Algorithms for Genomic Data

Lindsay, Bruce

Abstract

In this research project, the investigators construct statistical methods and algorithms for SNP analysis that are designed to enhance the biological realism of the underlying models. The project has three primary aims. The first extends the development of a new method for constructing hierarchical trees from sequence data using maximum likelihood and modal inference. The tree construction applies either of these two inference methods to an ancestral mixture model, whose parameters describe the population structure at a fixed time point T in the past. If one estimates this structure over a fine grid of time points T, the relationship between the estimates over time can be displayed graphically as a hierarchical tree. The second aim is to enhance the biological realism of the ancestral mixture model by including (a) multi-state characters, (b) advanced models of sequence evolution, and (c) recombination. These extensions are based on diffusion kernels constructed from continuous-time Markov chains. The investigators also propose empirical Bayes methods to improve overall estimation precision. The third aim is the development of a new method for reconstructing haplotype sequences from genotype data without parental information. The methods and algorithms are based on the ancestral mixture models together with a multi-moment approach that simplifies computation. In addition, the investigators propose to extend the method to long sequences by sliding a window along the genotype sequences and then using the information from the overlapping estimates to construct longer haplotype estimates.
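The idea of estimating population structure at each time point and reading off a tree can be illustrated with a small sketch. It is not the investigators' software. It assumes binary sequences, a kernel in which each site mutates independently with probability p, and a symmetric two-state continuous-time Markov chain that maps a time T to p. The function names (`mutation_prob`, `modal_ascent`, `modes_at`) and the toy data are hypothetical. Each observed sequence is moved uphill to a mode of the kernel mixture; sequences that reach the same mode at a given p would share a branch of the tree at that level.

```python
import math

def mutation_prob(t, rate=1.0):
    """Assumed two-state symmetric Markov chain: probability that a site
    differs from its ancestor after time t."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))

def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def modal_ascent(start, seqs, p, max_iter=100):
    """Fixed-point iteration toward a mode of the kernel mixture
    (1/n) * sum_i p^d(y, x_i) * (1 - p)^(L - d(y, x_i)), for 0 <= p < 0.5."""
    y = tuple(start)
    length = len(y)
    for _ in range(max_iter):
        # Weight of each observed sequence given the current candidate y.
        dists = [hamming(s, y) for s in seqs]
        w = [p ** d * (1.0 - p) ** (length - d) for d in dists]
        total = sum(w)
        # Site-wise weighted majority vote (ties resolve to 0).
        new = tuple(
            1 if sum(wi for wi, s in zip(w, seqs) if s[j] == 1) > total / 2 else 0
            for j in range(length)
        )
        if new == y:
            break
        y = new
    return y

def modes_at(seqs, p):
    """Mode reached from each observed sequence at mutation probability p."""
    return [modal_ascent(s, seqs, p) for s in seqs]

if __name__ == "__main__":
    # Hypothetical toy data: two groups of closely related binary sequences.
    seqs = [
        (0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 1),
        (1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 1, 1), (1, 1, 1, 1, 1, 0),
    ]
    # For this toy data: 4 distinct modes at p = 0.1 and 2 at p = 0.4.
    for p in (0.1, 0.4):
        print(p, len(set(modes_at(seqs, p))))
```

Running the same computation over a fine grid of p values (or of times T converted with `mutation_prob`) and recording which sequences share a mode at each level gives the nested groupings that a hierarchical tree would display.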

The current release of the National Center for Biotechnology Information's (NCBI) database dbSNP contains over 11.5 million human single nucleotide polymorphism (SNP) records, representing a 10-fold increase over the last 4 years. Analysis of such data has become a focus of much research within bioinformatics and computational biology because the SNPs carry the information that distinguishes the individuals within a species. Encoded in these data is important information about the relationship between the characteristics of an individual and their genetic code. The investigators are developing methods and models for these data that differ fundamentally in approach from the standard techniques currently used in two areas where SNP data are used, coalescent and phylogenetic inference, in which one reconstructs genetic relationships similar to family trees based only on present-day individuals. The broader impacts of this project include the development of methods and models with potential wide-ranging uses across scientific disciplines, the increased fusion of biological and mathematical innovation, and opportunities for broad interdisciplinary training of diverse students.

Project Report

The goal of this project was to develop new methods for the analysis of biological sequence data (such as DNA) based on the idea of mixture tree modeling. This methodology creates an estimated tree of relationships for a set of individuals that have some common genetic ancestry, but does so without any information about the ancestors beyond what is contained in the DNA (or other) sequences of the individuals. Up to the time of the proposal, all work had been done on binary sequences and had been applied to single nucleotide polymorphism (SNP) data. The grant activity proceeded on several fronts. We extended the method to non-binary data so that it could be applied to full four-letter DNA sequences. We greatly increased the speed of the computations with new algorithms that were hundreds of times faster. We produced new software so that one can compute the estimates, plot the trees, and create visualizations of relationships. We developed new theory that sharpened the quality of the estimation. The project was funded during the period 2007-2011 and was then awarded a no-cost extension until July 31, 2012. During the grant period, five graduate students at Penn State University worked on the project. Three of them earned Ph.D.s working on the subject of the grant, and one is continuing on the project this next year, unfunded. In addition, the Penn State PI mentored the junior PI at Arizona State University. The majority of the participants were female.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0714839
Program Officer: Mary Ann Horn

Budget Start: 2007-08-15
Budget End: 2012-07-31
Fiscal Year: 2007
Total Cost: $290,000

Institution

Pennsylvania State University, University Park, PA, United States
