Model selection is among the most fundamental and commonly encountered statistical problems in medical research. It is the process by which researchers identify the relationships between measured quantities; thus it plays a central role in the analysis of essentially all high-throughput screening data. Model selection procedures are the primary analytical mechanism through which associations between diseases and large numbers of biochemical, genetic, and pharmacological variables are discovered. The fundamental hypothesis tested in this application is that a new class of model selection procedures can effectively identify associations between biological variables and disease outcomes, even in settings where there are many more potential biological correlates than there are observations on each variable. The goals of this project are to develop these variable selection procedures so that they can be applied to high-throughput screening data, and to apply the resulting methodology in three important application areas.

To achieve these goals, the following specific aims will be addressed. Known theoretical properties of the proposed model selection procedures will be extended to cases in which there are many more biological measurements available than there are observations on each measurement (i.e., the p ≫ n setting). Constraints on the number of variables that can be included in final models for outcome variables will be determined, and efficient numerical algorithms will be developed so that these methods can be applied to actual high-throughput screening data. The new model selection procedures will be used to define binary classification algorithms that can predict clinical outcomes from high-dimensional gene expression data sets. They will be used to identify and analyze interactions between genes that are associated with cancer and other diseases in genome-wide association studies using single-nucleotide polymorphism data. They will also be used to analyze biological pathways as informed by high-throughput molecular interrogation data. The algorithms developed during this project constitute a major innovation in the field of model selection and will provide medical researchers with a new and unique set of tools for effectively identifying biological associations among biomarkers, disease attributes, and patient outcomes from high-throughput screening data.

Public Health Relevance

Model selection procedures are statistical techniques that allow researchers to discover the associations between disease and the large number of variables that are measured in emerging high-throughput screening technologies. For example, model selection techniques are used to discover which genes are associated with particular forms of cancer. This project proposes a new class of model selection procedures that will make it easier for researchers to discover such associations.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Dunn, Michelle C
University of Texas MD Anderson Cancer Center
Biostatistics & Other Math Sci
Other Domestic Higher Education
United States
Liu, Suyu; Johnson, Valen E (2016) A robust Bayesian dose-finding design for phase I/II clinical trials. Biostatistics 17:249-63
Yajima, Masanao; Telesca, Donatello; Ji, Yuan et al. (2015) Detecting differential patterns of interaction in molecular pathways. Biostatistics 16:240-51
Jung, Yoonsuh; Hu, Jianhua (2015) A K-fold Averaging Cross-validation Procedure. J Nonparametr Stat 27:167-179
Wang, Yuan; Hobbs, Brian P; Hu, Jianhua et al. (2015) Predictive classification of correlated targets with application to detection of metastatic cancer using functional CT imaging. Biometrics 71:792-802
Hu, Jianhua; Zhu, Hongjian; Hu, Feifang (2015) A Unified Family of Covariate-Adjusted Response-Adaptive Designs Based on Efficiency and Ethics. J Am Stat Assoc 110:357-367
Hu, Jianhua; Wang, Peng; Qu, Annie (2015) Estimating and Identifying Unspecified Correlation Structure for Longitudinal Data. J Comput Graph Stat 24:455-476
Stephan-Otto Attolini, Camille; Peña, Victor; Rossell, David (2015) Designing alternative splicing RNA-seq studies. Beyond generic guidelines. Bioinformatics 31:3631-7
Rossell, David (2015) Big data and statistics: A statistician's perspective. Metode Sci Stud J 5:143-149
Rossell, David; Stephan-Otto Attolini, Camille; Kroiss, Manuel et al. (2014) Quantifying alternative splicing from paired-end RNA-sequencing data. Ann Appl Stat 8:309-330
Jung, Yoonsuh; Huang, Jianhua Z; Hu, Jianhua (2014) Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA. J Am Stat Assoc 109:1355-1367

Showing the most recent 10 out of 18 publications