The goal of this proposal is to develop improved methods for statistical inference from data arising in genomic studies, specifically from microarray platforms. Statistical algorithms, particularly those based on Markov chain Monte Carlo (MCMC), have become widely used in data analysis in all fields. In applications to genomic studies they have become particularly prevalent, in part due to the enormous amount of data collected and their ability to handle complex models. We address three specific aims:
Specific Aim 1: Develop missing data methods applicable to SNP association genetics. In such studies, where one is looking to associate a quantitative trait with SNPs, it is typical to get information on a large number of SNPs. As the information is typically not complete, we must deal with missing data, which causes two difficulties: (i) accurate modeling must take into account the SNP correlation structure, which causes problems for standard missing data methods, and (ii) the large number of SNPs brings along computational and statistical problems. We are developing a Gibbs sampler that shows great promise in allowing efficient estimation of SNP effects in these problems.
Specific Aim 2: Clustering and classification methods for time-course microarray data. We continue our development of clustering methods for time-course data based on Bayesian hierarchical models and a Metropolis-Hastings search algorithm, with the specific goal of developing a new classifier that associates clusters, or gene patterns, with clinical outcomes.
Specific Aim 3: Testing for the existence of clusters. Although there are many methods for clustering data, there are few methods for assessing whether the clusters are significant. We propose a Bayesian model selection methodology to derive a test for the existence of clusters. As many phenotypes show quantitative variation, detection of clusters is a preliminary step that would suggest further genomic analysis to determine the existence of SNPs controlling the observed quantitative traits.
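
To make the missing-data idea in Specific Aim 1 concrete, the following is a minimal sketch of a data-augmentation Gibbs sampler for a single SNP. It is not the sampler developed in this project: the model (one SNP, a linear effect on a quantitative trait, flat priors, a known minor allele frequency with Hardy-Weinberg genotype probabilities) and all function names are assumptions made for illustration, and it deliberately ignores the SNP correlation structure that the aim identifies as the main difficulty.

```python
import math
import random


def hwe_prior(maf):
    """Hardy-Weinberg probabilities for 0, 1, 2 copies of the minor allele."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def gibbs_snp(y, x_obs, maf, n_iter=2000, burn_in=500, seed=0):
    """Posterior mean of beta in y = mu + beta * x + noise.

    x_obs holds genotypes 0/1/2, with None marking a missing genotype.
    Hypothetical helper: assumes at least one observed non-zero genotype.
    """
    rng = random.Random(seed)
    prior = hwe_prior(maf)
    n = len(y)
    missing = [i for i, g in enumerate(x_obs) if g is None]
    # Initialise missing genotypes with draws from the prior.
    x = [g if g is not None else rng.choices([0, 1, 2], weights=prior)[0]
         for g in x_obs]
    mu, beta, sigma2 = 0.0, 0.0, 1.0
    draws = []
    for it in range(n_iter):
        # 1. Impute each missing genotype from its full conditional,
        #    proportional to prior(g) * Normal(y_i; mu + beta * g, sigma2).
        for i in missing:
            logw = [math.log(prior[g]) - (y[i] - mu - beta * g) ** 2 / (2 * sigma2)
                    for g in (0, 1, 2)]
            top = max(logw)  # subtract the max to avoid underflow
            w = [math.exp(v - top) for v in logw]
            x[i] = rng.choices([0, 1, 2], weights=w)[0]
        # 2. Intercept given everything else (flat prior).
        mu = rng.gauss(sum(yi - beta * xi for yi, xi in zip(y, x)) / n,
                       math.sqrt(sigma2 / n))
        # 3. SNP effect given everything else (flat prior).
        sxx = sum(xi * xi for xi in x)
        sxy = sum(xi * (yi - mu) for yi, xi in zip(y, x))
        beta = rng.gauss(sxy / sxx, math.sqrt(sigma2 / sxx))
        # 4. Error variance: inverse-gamma(n/2, SSE/2) under a 1/sigma2 prior.
        sse = sum((yi - mu - beta * xi) ** 2 for yi, xi in zip(y, x))
        sigma2 = (sse / 2) / rng.gammavariate(n / 2, 1.0)
        if it >= burn_in:
            draws.append(beta)
    return sum(draws) / len(draws)
```

On simulated data with a true effect of 1.0 and roughly 20% of genotypes removed at random, the returned posterior mean should land close to the true effect. A realistic analysis would need to model many correlated SNPs jointly, which this sketch does not attempt.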

Public Health Relevance

The methods that will be developed are motivated by a number of studies that promise to have an impact on disease management. In particular, we look to apply our missing data methods to a SNP discovery data set from lupus patients to find associations between SNPs and disease status, and our gene-based classifier can aid physicians in managing the treatment of trauma patients. The proposed cluster test can provide a screening tool to identify data sets with possible genetic associations.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM081704-02
Application #
7683009
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Remington, Karin A
Project Start
2008-09-01
Project End
2011-08-31
Budget Start
2009-09-01
Budget End
2010-08-31
Support Year
2
Fiscal Year
2009
Total Cost
$145,105
Indirect Cost
Name
University of Florida
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
969663814
City
Gainesville
State
FL
Country
United States
Zip Code
32611
León-Novelo, Luis G; Müller, Peter; Arap, Wadih et al. (2013) Bayesian decision theoretic multiple comparison procedures: an application to phage display data. Biom J 55:478-89
León-Novelo, Luis G; Müller, Peter; Arap, Wadih et al. (2013) Semiparametric Bayesian inference for phage display data. Biometrics 69:174-83
León-Novelo, Luis; Kemppainen, Kaisa M; Ardissone, Alexandria et al. (2013) Two applications of permutation tests in biostatistics. Bol Soc Mat Mex 19:255-266
Graze, R M; Novelo, L L; Amin, V et al. (2012) Allelic imbalance in Drosophila hybrid heads: exons, isoforms, and evolution. Mol Biol Evol 29:1521-32
Leon-Novelo, Luis; Moreno, Elias; Casella, George (2012) Objective Bayes model selection in probit models. Stat Med 31:353-65
Yang, Jie; Casella, George; McIntyre, Lauren M (2011) Generalized shrinkage F-like statistics for testing an interaction term in gene expression analysis in the presence of heteroscedasticity. BMC Bioinformatics 12:427
Joo, Yongsung; Casella, G; Hobert, J (2010) Bayesian model-based tight clustering for time course data. Comput Stat 25:17-38
Verhoeven, Koen J F; Casella, George; McIntyre, Lauren M (2010) Epistasis: obstacle or advantage for mapping complex traits? PLoS One 5:e12264
Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G et al. (2010) PANGEA: pipeline for analysis of next generation amplicons. ISME J 4:852-61
Fuentes, Claudio; Casella, George (2009) Testing for the existence of clusters. Sort (Barc) 33:115-157

Showing the most recent 10 out of 12 publications