Correlated and high-dimensional data arise frequently in health sciences research, especially in cancer research. Correlated data arise in longitudinal and familial studies, while high-dimensional data have emerged in recent years as a consequence of rapid advances in genomic and proteomic research. In this application, we propose to develop nonparametric and semiparametric regression methods for clustered/longitudinal data and for high-dimensional genomic and proteomic data. Specifically, we propose to develop (1) the kernel (spline) profile EM method for generalized semiparametric mixed models for clustered/longitudinal data; (2) nonparametric and semiparametric regression models for longitudinal data with dropouts; (3) the mixed model kernel machine method for generalized semiparametric regression models and semiparametric Cox models for the analysis of gene expression pathways and tag single nucleotide polymorphisms (SNPs) within a candidate gene, and the sparse kernel machine (SKM) method for selecting genes and tag SNPs from a large pool of genes or tag SNPs; and (4) the joint modeling method using functional wavelet models and generalized semiparametric models for mass spectrometry proteomic data and disease outcomes. Asymptotic properties of the proposed methods will be investigated, and simulation studies will be conducted to evaluate their finite-sample performance. Efficient numerical algorithms and user-friendly statistical software will be developed, with the goal of disseminating these models and methods to health sciences researchers. In collaboration with biomedical investigators, we will apply the proposed models and methods to several motivating data sets from cancer research and other fields.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Method to Extend Research in Time (MERIT) Award (R37)
Study Section: Special Emphasis Panel (NSS)
Program Officer: Dunn, Michelle C
Harvard University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Chen, Han; Wang, Chaolong; Conomos, Matthew P et al. (2016) Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. Am J Hum Genet 98:653-66
Chen, Jun; Just, Allan C; Schwartz, Joel et al. (2016) CpGFilter: model-based CpG probe filtering with replicates for epigenome-wide association studies. Bioinformatics 32:469-71
Lin, Xinyi; Lee, Seunggeun; Wu, Michael C et al. (2016) Test for rare variants by environment interactions in sequencing association studies. Biometrics 72:156-64
Yung, Godwin; Lin, Xihong (2016) Validity of using ad hoc methods to analyze secondary traits in case-control association studies. Genet Epidemiol 40:732-743
Barnett, Ian J; Lin, Xihong (2014) Analytic P-value calculation for the higher criticism test in finite d problems. Biometrika 101:964-970
Hu, Tianle; Lin, Xihong; Nan, Bin (2014) Cross-ratio estimation for bivariate failure times with left truncation. Lifetime Data Anal 20:23-37
Liao, Shu-Yi; Lin, Xihong; Christiani, David C (2014) Genome-wide association and network analysis of lung function in the Framingham Heart Study. Genet Epidemiol 38:572-8
Carmona, Juan Jose; Sofer, Tamar; Hutchinson, John et al. (2014) Short-term airborne particulate matter exposure alters the epigenetic landscape of human genes associated with the mitogen-activated protein kinase network: a cross-sectional study. Environ Health 13:94
Lee, Seunggeung; Abecasis, Gonçalo R; Boehnke, Michael et al. (2014) Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 95:5-23
Seow, Wei Jie; Kile, Molly L; Baccarelli, Andrea A et al. (2014) Epigenome-wide DNA methylation changes with development of arsenic-induced skin lesions in Bangladesh: a case-control follow-up study. Environ Mol Mutagen 55:449-56

Showing the most recent 10 out of 62 publications