The overall goal of this research is to develop novel statistical methods for addressing the difficult issue of multiplicity in current cancer etiology studies. Cancer etiology studies, which aim to identify determinants of cancer and quantify their roles, are intrinsically multi-factorial because of the multi-step nature of carcinogenesis and the multiple extrinsic factors that turn normal cells into malignant ones. Multiplicity inflates the false-positive rate. In the simplest example, searching for a cutpoint of a single quantitative biomarker for disease status, the common practice of examining different cutpoints and picking the one with the smallest p-value results in a highly inflated false-positive rate. Even in the largest studies, statistical power for testing interactions quickly diminishes, sample sizes rapidly become inadequate with stratification, and risk estimates become unstable. Because there are so many risk factors, model overfitting is a common problem and the predictive performance of the statistical model is poor. It is thus not surprising that even main effects (e.g., candidate gene associations) have proven notoriously difficult to replicate, and reported interactions even harder. The multiplicity issue is acute today as more biomarkers of risk exposures, and even entire pathways that can easily comprise dozens of genes and their environmental substrates, become available. An effective means of reducing overfitting and prediction error is to constrain the model parameters, as in the least absolute shrinkage and selection operator (lasso), so as to eliminate the large number of irrelevant variables (e.g., genes). Finding the maximum likelihood estimate (MLE) in such regression models with a large number of variables is challenging. Since some measures of exposure may not be indicative of cancer, and these irrelevant variables reduce the accuracy of the regression model, selecting the most relevant variables for the model would be a significant step.
However, classic methods for model/variable selection have not had much success in biomedical applications because they eliminate significant predictors too aggressively and are numerically unstable due to collinearity. This pilot project application focuses on the logistic regression model commonly used in cancer etiology studies. Building upon the novel accelerated expectation-maximization (EM) algorithm we developed for variable selection in linear models, we propose to develop fast variable selection procedures for the logistic regression model that reduce overfitting and have improved predictive properties; to develop computer programs and conduct simulation studies to assess the performance of the method/algorithm; and to analyze the esophageal data from two studies currently funded by NCI. Upon completion of the proposed research, the methods/algorithms developed can be used to analyze cancer epidemiology data more effectively and efficiently. The research also provides a basis for further development of the approach, potentially into an R01 application. Future work can include extensions to multinomial (i.e., multi-class) logistic regression models for cancer outcomes, to the Cox regression model for time-to-event data such as time to advanced cancer, and to Bayesian hierarchical modeling and model selection that incorporate prior biological knowledge about pathways, which will enhance the ability to detect real causal effects in cancer etiology data.
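
As an illustration of the idea of constraining logistic regression coefficients so that irrelevant variables drop out of the model, the following is a minimal sketch of lasso-penalized logistic regression. It is not the accelerated EM algorithm proposed in this project; it substitutes a generic proximal-gradient (soft-thresholding) iteration, which is a standard way to solve the same penalized problem. The data are simulated, and the function names, the penalty value `lam=0.1`, the sample size, and the true coefficients are all assumptions chosen only for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero, clipping small values to exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_lasso_logistic(X, y, lam, n_iter=300):
    """Minimize mean logistic loss + lam * sum(|beta_j|) by proximal gradient descent.

    The intercept b0 is not penalized. This is a generic solver used for
    illustration, not the EM-based procedure described in the abstract.
    """
    n, p = len(X), len(X[0])
    # Step size 1/L, where L bounds the curvature of the mean logistic loss:
    # L <= 0.25 * (1 + mean squared row norm), the 1 accounting for the intercept.
    L = 0.25 * (1.0 + sum(sum(v * v for v in row) for row in X) / n)
    step = 1.0 / L
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            # Residual: fitted probability minus observed outcome.
            r = sigmoid(b0 + sum(bj * xj for bj, xj in zip(beta, row))) - yi
            g0 += r
            for j in range(p):
                g[j] += r * row[j]
        b0 -= step * g0 / n
        beta = [soft_threshold(bj - step * gj / n, step * lam)
                for bj, gj in zip(beta, g)]
    return b0, beta


# Simulated data (hypothetical): 10 candidate risk factors, only the first two matter.
rng = random.Random(0)
n, p = 200, 10
true_beta = [2.0, -2.0] + [0.0] * (p - 2)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1.0 if rng.random() < sigmoid(sum(b * x for b, x in zip(true_beta, row))) else 0.0
     for row in X]

b0, beta = fit_lasso_logistic(X, y, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected variable indices:", selected)
print("coefficients:", [round(b, 3) for b in beta])
```

Because the soft-thresholding step sets small coefficients to exactly zero, variable selection and estimation happen in one fit. In practice the penalty `lam` would be tuned, for example by cross-validation, rather than fixed in advance as it is here.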

Agency
National Institutes of Health (NIH)
Institute
National Cancer Institute (NCI)
Type
Small Research Grants (R03)
Project #
5R03CA119758-02
Application #
7127228
Study Section
Special Emphasis Panel (ZCA1-SRRB-Q (O1))
Program Officer
Choudhry, Jawahar
Project Start
2005-09-30
Project End
2007-08-31
Budget Start
2006-09-01
Budget End
2007-08-31
Support Year
2
Fiscal Year
2006
Total Cost
$72,505
Indirect Cost
Name
University of Maryland Baltimore
Department
Public Health & Prev Medicine
Type
Schools of Medicine
DUNS #
188435911
City
Baltimore
State
MD
Country
United States
Zip Code
21201
Tian, Guo-Liang; Ng, Kai Wang; Li, Kai-Can et al. (2009) Non-iterative sampling-based Bayesian methods for identifying changepoints in the sequence of cases of haemolytic uraemic syndrome. Comput Stat Data Anal 53:3314-3323
Tian, Guo-Liang; Tang, Man-Lai; Fang, Hong-Bin et al. (2008) Efficient methods for estimating constrained parameters with applications to lasso logistic regression. Comput Stat Data Anal 52:3528-3542
Tian, Guo-Liang; Yu, Jun-Wu; Tang, Man-Lai et al. (2007) A new non-randomized model for analysing sensitive questions with binary outcomes. Stat Med 26:4238-52
Liu, Zhenqiu; Jiang, Feng; Tian, Guoliang et al. (2007) Sparse logistic regression with Lp penalty for biomarker identification. Stat Appl Genet Mol Biol 6:Article6