We investigated the potential for using data from current and future genome-wide association studies (GWAS) to improve the performance of models for predicting disease risk. A new mathematical paradigm was developed to characterize the predictive performance of polygenic models in terms of the sample size of the training dataset, the number of underlying susceptibility loci, and the distribution of their effect sizes. The paradigm was then applied to project the performance of risk prediction models for ten different complex traits, including cancers. These projections revealed that extremely large future GWAS, with sample sizes an order of magnitude larger than even some of the largest GWAS to date, would be needed to build genetic risk models with substantially improved predictive performance. A new method was developed for assessing gene-environment interactions using data from case-control genome-wide association studies that rely on publicly available genetic controls. It was shown that, under a set of assumptions, it is possible to characterize joint gene-environment effects from such studies if data on environmental exposures are available from an internal case-control study, even if the controls in that study are not genotyped. New methods were developed for evaluating the association of SNP markers with an ordinal disease outcome that reflects various stages in the progression of a disease. Two alternative tests, the maximum score test (MAX) and the adaptive P-value combination test (Adapt-P), were proposed with the aim of striking a balance between efficiency and robustness over the possible alternative models by which a SNP might be involved in the various stages. Simulation studies were used to demonstrate that MAX and Adapt-P have the most robust performance among a range of tests under various realistic scenarios.
A permutation-based resampling method was developed for using metabolomic data to test the hypothesis that the effect of an exposure (e.g., smoking) on the risk of a disease (e.g., lung cancer) is mediated through intermediate biomarkers. Extensive simulation studies were used to examine the validity and power of the proposed test. Methods were developed for the analysis of population-based case-control studies with complex sampling designs. Two methods were developed for incorporating the information contained in the sample weights by modeling the sample expectation of the weights conditional on design variables. These methods have higher efficiency and smaller finite-sample bias than the standard estimators that use the original sample weights. The methods were applied to the U.S. Kidney Cancer Case-Control Study to identify risk factors. A linear-expit regression model (LEXPIT) was developed to incorporate linear and nonlinear risk effects when estimating absolute risk from studies of a binary outcome. The LEXPIT is a generalization of both the binomial linear and logistic regression models. The coefficients of the LEXPIT linear terms estimate adjusted risk differences, while the exponentiated coefficients of the nonlinear terms estimate residual odds ratios. The LEXPIT could be particularly useful for epidemiological studies of risk association, where adjustment for multiple confounding variables is common. The method was applied to estimate the absolute five-year risk of cervical precancer or cancer associated with different Pap and human papillomavirus (HPV) test results in 167,171 women undergoing screening at Kaiser Permanente Northern California. The LEXPIT model found an increased risk due to an abnormal Pap test in HPV-negative women that was not detected with logistic regression. An R package, blm, was developed to provide free and easy-to-use software for fitting the LEXPIT model.
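
To make the LEXPIT description above concrete, the following is a minimal sketch in Python (not the blm package and not its interface). It assumes the risk takes the form risk = x'beta + expit(z'gamma), which is our reading of "linear terms" plus "nonlinear terms" in the text. The function names and all coefficient values are hypothetical and chosen only for illustration. The sketch evaluates a risk for given coefficients; it does not fit the model, and it does not enforce the constraint that the resulting risk lie between 0 and 1.

```python
import math


def expit(t):
    """Inverse logit, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def lexpit_risk(x, beta, z, gamma):
    """Assumed LEXPIT form: additive linear part plus an expit part.

    x, beta  : covariates and coefficients of the linear (risk-difference) terms
    z, gamma : covariates and coefficients of the nonlinear (expit) terms
    """
    linear = sum(b * xi for b, xi in zip(beta, x))
    nonlinear = sum(g * zi for g, zi in zip(gamma, z))
    return linear + expit(nonlinear)


# Hypothetical coefficients, for illustration only.
beta = [0.02]         # risk difference for a binary exposure in the linear part
gamma = [-3.0, 0.5]   # intercept and log odds ratio in the expit part

# Same expit covariates, exposure switched off and on in the linear part.
risk_unexposed = lexpit_risk([0], beta, [1, 0], gamma)
risk_exposed = lexpit_risk([1], beta, [1, 0], gamma)

# The linear coefficient is recovered as an absolute risk difference.
risk_difference = risk_exposed - risk_unexposed

# Changing the expit covariate instead changes the residual odds by exp(0.5).
p0 = expit(gamma[0])
p1 = expit(gamma[0] + gamma[1])
residual_odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
```

Under this assumed form, the exposure in the linear part shifts the risk by the same absolute amount whatever the other covariates are, while a covariate in the expit part acts multiplicatively on the odds of the remaining risk. That is the sense in which the text contrasts adjusted risk differences with residual odds ratios.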