Methods for Genetic Epidemiology

Investigations have been conducted on designs for next-generation association studies based on next-generation genotyping and sequencing technologies. One project investigated the optimal balance between coverage depth and number of samples for variant detection and disease-association testing using next-generation sequencing technologies. A second project investigated the potential utility of two-stage designs in which all subjects may be genotyped on a standard platform of the kind commonly used in existing genome-wide association studies (GWAS), and a small number of subjects may be genotyped on a newer, denser platform designed to provide higher coverage of less common variants.

New statistical methods for calling copy number variations (CNVs) in large-scale GWAS have been developed. The currently available CNV-calling packages have low power to detect short CNVs and hence insufficient power to detect short CNVs associated with disease. Dr. Shi has been working to integrate B-allele frequencies, linkage disequilibrium among SNP probes, and relatedness among subjects to improve CNV calling. A C++ software package, SegCNV, has been developed for calling CNVs in unrelated subjects.

A new method has been developed for conducting meta-analysis of GWAS in the presence of heterogeneity. Combining GWAS of distinct but possibly related traits is now considered a particularly promising direction for discovering loci with small but common pleiotropic effects. Classical approaches to meta-analysis or pooled analysis, however, are not optimal in this setting, in which individual variants are likely to be associated with only a subset of the traits or to show effects in different directions.
We propose a method that generalizes standard fixed-effects approaches by agnostically exploring subsets of studies for the presence of true association signals, either in the same direction or in opposite directions. An efficient approximation is used for rapid evaluation of p-values. An R software package, ASSET, has been developed and is being freely distributed for external use.

A number of projects involved the development of methods for studies of gene-environment interactions. One project studied strategies for incorporating sampling weights in case-control studies of gene-environment interaction. A second project developed a novel method for studying gene-environment interaction using multiple SNPs from the same genetic region.

General statistical methods

A new, highly efficient p-value evaluation procedure has been developed for general resampling-based tests. When evaluating a test statistic with a small p-value, it requires as little as 1% or even 0.0002% of the computing time required by standard resampling-based procedures. Empirical estimation of such small probabilities is of common interest in many molecular epidemiology studies that use rigorous testing procedures to avoid false positives. The method has been implemented in a user-friendly R package that other researchers can use freely.

BB members have developed a new, more flexible logistic regression model that allows some exposures to be modeled purely linearly. The advantage of this model is that it can automatically calculate absolute risks and risk differences, the two key quantities in clinical epidemiology and translational medicine. The model is related to some of the special models that Jay Lubin and Dale Preston have developed for other purposes in Epicure. A number of challenging computational issues have been solved, and general-purpose software for fitting the model has been made publicly available.
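To illustrate the subset-based meta-analysis idea described earlier in this section, the following is a minimal sketch in Python, not the ASSET implementation. It computes the usual inverse-variance fixed-effects z-statistic and then searches all non-empty subsets of studies for the one with the largest absolute z. The effect estimates and standard errors are hypothetical, the function names are illustrative, and the sketch does not handle effects in opposite directions. The maximum over subsets cannot be referred to a standard normal distribution; the efficient p-value approximation mentioned above is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def fixed_effects_z(betas, ses):
    """Inverse-variance weighted fixed-effects z-statistic."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled / pooled_se


def best_subset_z(betas, ses):
    """Exhaustively search non-empty subsets of studies for the largest |z|.

    Returns (z, subset), where subset is a tuple of study indices.
    """
    best = (0.0, ())
    for size in range(1, len(betas) + 1):
        for subset in combinations(range(len(betas)), size):
            z = fixed_effects_z([betas[i] for i in subset],
                                [ses[i] for i in subset])
            if abs(z) > abs(best[0]):
                best = (z, subset)
    return best


# Hypothetical data: the variant affects only the first two traits.
betas = [0.30, 0.28, 0.00, -0.02]
ses = [0.10, 0.10, 0.10, 0.10]

z_all = fixed_effects_z(betas, ses)          # all four studies pooled
z_best, subset_best = best_subset_z(betas, ses)
print(f"all studies: z = {z_all:.2f}")
print(f"best subset {subset_best}: z = {z_best:.2f}")
```

In this toy example, pooling all four studies dilutes the signal carried by the first two, and the subset search recovers it. The exhaustive search grows as 2^k in the number of studies k, so it is only practical for a modest number of studies.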
lows 100q% of the population at highest risk. PNF(p) assesses the feasibility of covering 100p% of cases by indicating how much of the population at highest risk must be followed. Showed the relationship of these two criteria to the Lorenz curve and its inverse, and presented distribution theory for estimates of PCF and PNF. Developed new methods, based on influence functions, for inference on a single risk model and for comparing the PCFs and PNFs of two risk models evaluated in the same validation data.

Proposed a class of goodness-of-fit (GOF) tests for the mean structure of generalized linear mixed models (GLMMs). Our test statistic is a quadratic form of the difference between the observed values and the values expected under the estimated model, in cells defined by a partition of the covariate space. We show that this test statistic has an asymptotic chi-squared distribution when model parameters are estimated by maximum likelihood, and we study its power under local alternatives both analytically and in simulations. For linear mixed models, we also study the setting in which parameters are estimated by least squares and the method of moments. Several data examples illustrate the methods.

Conducted extensive simulations to compare fractional polynomials (FPs) of degree 2 (FP2) and degree 4 (FP4) with two variants of P-splines that used generalized cross-validation (GCV) and restricted maximum likelihood (REML) for smoothing parameter selection. Evaluated the ability of P-splines and FPs to recover the "true" functional form of the association between exposure and continuous, binary, and survival outcomes for linear, quadratic, and more complex non-linear functions, using different sample sizes and signal-to-noise ratios. For more strongly curved functions, FP2, the current default in implementations for fitting FPs, showed considerable bias and consistently higher mean squared error (MSE) than spline-based estimators and FP4, which performed equally well in most simulation settings.
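As a brief illustration of the PCF and PNF criteria described above, the following Python sketch computes simple empirical versions from validation data. It assumes that PCF(q) is the proportion of cases captured when the 100q% of the population at highest predicted risk is followed, and that PNF(p) is the smallest proportion of the population, taken from highest risk downward, needed to capture 100p% of cases. The risks, case indicators, and function names are hypothetical; ties in risk are not treated specially; and the distribution theory and influence-function inference mentioned above are not implemented.

```python
import math


def _by_descending_risk(risks):
    """Indices of subjects ordered from highest to lowest predicted risk."""
    return sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)


def pcf(risks, is_case, q):
    """Empirical proportion of cases among the top 100q% by predicted risk."""
    order = _by_descending_risk(risks)
    # Small tolerance guards against floating-point error in q * n.
    n_follow = math.ceil(q * len(risks) - 1e-9)
    covered = sum(is_case[i] for i in order[:n_follow])
    return covered / sum(is_case)


def pnf(risks, is_case, p):
    """Empirical proportion of the population needed to cover 100p% of cases."""
    order = _by_descending_risk(risks)
    needed = math.ceil(p * sum(is_case) - 1e-9)
    if needed <= 0:
        return 0.0
    covered = 0
    for rank, i in enumerate(order, start=1):
        covered += is_case[i]
        if covered >= needed:
            return rank / len(risks)
    return 1.0


# Hypothetical validation data: 10 subjects, 4 of whom are cases.
risks = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
is_case = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

pcf_20 = pcf(risks, is_case, 0.20)   # cases captured by following the top 20%
pnf_75 = pnf(risks, is_case, 0.75)   # population needed to capture 75% of cases
print(pcf_20, pnf_75)
```

Plotting pcf over a grid of q values traces a curve that is closely related to the Lorenz curve of the risk distribution, which is the connection noted in the text.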
FPs, however, are prone to artefacts due to the specific choice of the origin, while P-splines based on GCV sometimes yield wiggly estimates, in particular for small sample sizes.

Exposure Assessment, Errors in Exposure Measurements, and Missing Exposure Data

Compared the diagnostic accuracy and agreement of two diagnostic tests under subsampling. Developed methods for the analysis of disease incidence in cohort studies that incorporate data on multiple disease traits, using a two-stage semiparametric Cox proportional hazards regression model that allows one to examine heterogeneity in the effects of covariates across levels of the different disease traits. Proposed an estimating-equation approach for handling missing cause of failure in competing-risks data. Proved asymptotic unbiasedness of the estimating-equation method under a general missing-at-random assumption and proposed a novel influence-function-based sandwich variance estimator. The methods are illustrated using simulation studies and a real-data application involving the Cancer Prevention Study II (CPS-II) Nutrition Cohort.

Methods for descriptive epidemiologic studies

Identifying regions with the highest and lowest mortality rates and producing the corresponding color-coded maps help epidemiologists identify promising areas for analytic etiological studies. Based on a two-stage Poisson-Gamma model with covariates, we used information on known risk factors, such as smoking prevalence, to adjust mortality rates and reveal residual variation in relative risks that may reflect previously masked etiological associations. In addition to covariate adjustment, we studied rankings based on standardized mortality ratios (SMRs), empirical Bayes (EB) estimates, and a posterior percentile ranking (PPR) method, and indicated the circumstances that warrant the more complex procedures in order to obtain a high probability of correctly classifying the regions with high and low relative risks.
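The difference between ranking regions by SMRs and by shrinkage estimates can be seen in a small Python sketch. It assumes a Poisson-Gamma model without covariates: observed deaths O_i are Poisson with mean E_i * theta_i, and the relative risk theta_i has a Gamma prior with the given shape and rate, so the posterior mean is (shape + O_i) / (rate + E_i). The hyperparameters are treated as known here, whereas an empirical Bayes analysis would estimate them from the data. The counts, hyperparameters, and function names are hypothetical, and the covariate adjustment and PPR method described above are not implemented.

```python
def smr(observed, expected):
    """Standardized mortality ratio O_i / E_i for each region."""
    return [o / e for o, e in zip(observed, expected)]


def posterior_mean(observed, expected, shape, rate):
    """Posterior mean relative risk under a Gamma(shape, rate) prior."""
    return [(shape + o) / (rate + e) for o, e in zip(observed, expected)]


def ranking(values):
    """Region indices ordered from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical regions: small, medium, and large expected death counts.
observed = [3, 30, 105]
expected = [1.0, 20.0, 100.0]

# Assumed hyperparameters giving a prior mean relative risk of 1.
shape, rate = 10.0, 10.0

smr_values = smr(observed, expected)
eb_values = posterior_mean(observed, expected, shape, rate)
rank_smr = ranking(smr_values)
rank_eb = ranking(eb_values)
print("SMR ranking:", rank_smr)
print("Shrinkage ranking:", rank_eb)
```

The smallest region has the highest SMR on the basis of only three deaths, but its estimate is pulled strongly toward the prior mean, so it no longer ranks first after shrinkage. This is the kind of instability in raw SMR rankings that motivates the more complex procedures.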

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Investigator-Initiated Intramural Research Projects (ZIA)
Division of Cancer Epidemiology and Genetics
Flegal, Katherine M; Graubard, Barry I; Yi, Sang-Wook (2017) Comparative effects of the restriction method in two large observational studies of body mass index and mortality among adults. Eur J Clin Invest 47:415-421
Tomassi, Diego; Forzani, Liliana; Bura, Efstathia et al. (2017) Sufficient dimension reduction for censored predictors. Biometrics 73:220-231
Grill, Sonja; Ankerst, Donna P; Gail, Mitchell H et al. (2017) Comparison of approaches for incorporating new information into existing risk prediction models. Stat Med 36:1134-1156
Boca, Simina M; Pfeiffer, Ruth M; Sampson, Joshua N (2017) Multivariate meta-analysis with an increasing number of parameters. Biom J 59:496-510
Chatterjee, Nilanjan; Chen, Yi-Hau; Maas, Paige et al. (2016) Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources. J Am Stat Assoc 111:107-117
Kant, Ashima K; Graubard, Barry I (2016) A prospective study of water intake and subsequent risk of all-cause mortality in a national cohort. Am J Clin Nutr :
Wang, Lingxiao; Graubard, Barry I; Li, Yan (2016) A composite likelihood approach in testing for Hardy Weinberg Equilibrium using family-based genetic survey data. Stat Med 35:5040-5050
Maas, Paige; Barrdahl, Myrto; Joshi, Amit D et al. (2016) Breast Cancer Risk From Modifiable and Nonmodifiable Risk Factors Among White Women in the United States. JAMA Oncol 2:1295-1302
Espinosa, Pablo; Pfeiffer, Ruth M; García-Casado, Zaida et al. (2016) Risk factors for keratinocyte skin cancer in patients diagnosed with melanoma, a large retrospective study. Eur J Cancer 53:115-24
Zhang, Han; Wu, Colin O; Yang, Yifan et al. (2016) A multi-locus genetic association test for a dichotomous trait and its secondary phenotype. Stat Methods Med Res :

Showing the most recent 10 out of 169 publications