The investigator and his colleagues will develop methods for high-throughput biological data that provide innovative extensions of modern statistical building blocks, including the use of random effects for regularization, shrinkage estimation, Bayesian statistics, and mixtures for posterior classification and prediction. Novel modifications of the expectation-maximization algorithm are proposed for scalable and efficient model fitting and inference in several important biological applications.

The goal of this project is to develop new statistical models, computational algorithms, and decision-theoretic analyses of estimates for high-throughput biological data. The new methods can be applied to the analysis of microarrays, RNA sequencing counts, label-free shotgun proteomics, and metabolomics, as well as to the identification of quantitative trait loci and association mapping.

Project Report

The main accomplishment was the development of a bivariate statistical model for testing treatment effects in high-throughput data. A canonical example is simultaneous testing for changes in gene expression using microarray data, but the methods developed can be applied in a variety of settings. Most previous work on this topic has focused on testing for differences between mean levels of expression, but it has recently been recognized that testing for differential variation is sometimes equally important. A bivariate modeling strategy was proposed that allows for both mean differences and differential variation, and for which estimation and testing can be implemented using an extremely efficient expectation-maximization algorithm. This work has been peer reviewed and accepted by the Journal of the American Statistical Association. Other work related to the project's major goals included an investigation of the feasibility of using variational Bayes methods for inference with the hierarchical mixture models that are widely used in the analysis of high-throughput biological data. Other articles examined theoretical aspects of mean and covariance estimation for the multivariate normal distribution in the high-dimensional setting, in which the dimension of the response is greater than the sample size. Foundational issues in general model selection were also studied in the context of spherically symmetric distributions. More recent work concerned the estimation of sparse canonical vectors for discriminant analysis.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1208488
Program Officer: Gabor Szekely
Budget Start: 2012-09-15
Budget End: 2014-08-31
Fiscal Year: 2012
Total Cost: $94,895
Name: Cornell University
City: Ithaca
State: NY
Country: United States
Zip Code: 14850