Methods for Genetic Epidemiology

Hidden population substructure in case-control data can distort the performance of Cochran-Armitage trend tests (CATTs) for genetic associations. A unified approach is presented for deriving the bias and variance distortion under three scenarios for any CATT in a general family. Our results provide insight into the properties of some proposed corrections for bias and variance distortion and show why they may not fully correct for the effects of population substructure.

Compared several procedures for combining genome-wide association data, both in terms of the power to detect a disease-associated SNP while controlling the genome-wide significance level and in terms of the detection probability, namely the probability that a particular disease-associated SNP will be among the T most promising SNPs selected on the basis of low p-values. Meta-analytic approaches that focus on a single degree of freedom had higher power and detection probabilities than global methods.

Studied the situation in which the primary disease is rare and the secondary phenotype and genetic markers are dichotomous. An analysis of the association between a genetic marker and the secondary phenotype based on controls only is valid. Developed an adaptively weighted method that combines the case and control data to study the association, while reducing to the controls-only analysis if there is strong evidence of an interaction.

Developed tools to estimate the number of susceptibility loci and the distribution of their effect sizes for a trait based on discoveries from existing GWAS, and then to project the statistical power and risk-prediction utility of future studies by integrating over the estimated distributions of effect sizes.
Used reported GWAS findings for height, Crohn's disease, and cancers of the breast, prostate, and colorectum (BPC) to estimate that each of these traits is likely to harbor additional loci within the spectrum of low-penetrance common variants that together could explain at least 15-20% of known heritability. By conducting sufficiently large studies, it will be possible to discover a large set of additional loci for these traits. However, for diseases with modest familial aggregation, like BPC cancers, the projected discoveries are unlikely to lead to high discriminatory power for clinically important risk models.

Used data from the CGEMS case-control genome-wide association study of breast cancer to demonstrate empirically that case-only and related methods can produce false positives on a large scale when population stratification (PS) creates long-range linkage disequilibrium in the genome. We show that the bias can be removed by considering methods that assume gene-gene independence between unlinked loci, not in the entire population, but only within homogeneous strata that can be defined based on the principal components of a suitably large panel of PS markers. We propose both parametric and non-parametric strategies for exploiting such conditional gene-gene independence assumptions.

Proposed a False Negative Report Probability (FNRP), analogous to Sholom Wacholder's False Positive Report Probability (FPRP), for analyzing multiple genetic variants simultaneously. Proposed using the Bayes factors underlying FPRP and FNRP as the pure statistical evidence for any variant-disease relationship, and showed that their ratio approximates the true Bayes factor, thus unifying the FPRP/FNRP framework with traditional Bayesian statistics.
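As a rough illustration of the quantities named above, the following Python sketch computes the FPRP in its commonly cited form, the probability that there is no true association given a statistically significant finding, from a significance level, a power, and a prior probability of association. The FNRP function is only an assumed mirror image of that definition (the probability of a true association given a non-significant finding); the exact definition used in the work summarized here is not given in this report. The function and parameter names are hypothetical.

```python
def fprp(alpha: float, power: float, prior: float) -> float:
    """False Positive Report Probability: P(no association | significant).

    alpha: significance level (probability of a significant result under the null)
    power: probability of a significant result under the alternative
    prior: prior probability that the association is real
    """
    false_pos = alpha * (1.0 - prior)   # significant, but null is true
    true_pos = power * prior            # significant, and association is real
    return false_pos / (false_pos + true_pos)


def fnrp(alpha: float, power: float, prior: float) -> float:
    """ASSUMED analog of FPRP: P(association | not significant).

    This mirrors the FPRP definition and is an illustration only; it is not
    taken from the report.
    """
    false_neg = (1.0 - power) * prior           # not significant, association is real
    true_neg = (1.0 - alpha) * (1.0 - prior)    # not significant, null is true
    return false_neg / (false_neg + true_neg)


if __name__ == "__main__":
    # Hypothetical inputs: 5% significance level, 80% power, 1% prior probability.
    print(f"FPRP = {fprp(0.05, 0.80, 0.01):.4f}")
    print(f"FNRP = {fnrp(0.05, 0.80, 0.01):.4f}")
```

With a low prior probability, most significant findings are false positives even at good power, which is the motivation usually given for reporting such probabilities alongside p-values.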
Risk modeling

Proposed and studied two criteria to assess the usefulness of models that predict risk of disease incidence for screening and prevention, or the usefulness of prognostic models for management following disease diagnosis: the proportion of cases followed, PCF(q), and the proportion needed to follow up, PNF(p). PCF(q) assesses the effectiveness of a program that follows the 100q% of the population at highest risk. PNF(p) assesses the feasibility of covering 100p% of cases by indicating how much of the population at highest risk must be followed. Showed the relationship of these two criteria to the Lorenz curve and its inverse, and presented distribution theory for estimates of PCF and PNF. Developed new methods, based on influence functions, for inference for a single risk model and for comparing the PCFs and PNFs of two risk models that were evaluated in the same validation data.

Proposed a class of goodness-of-fit (GOF) tests for the mean structure of generalized linear mixed models (GLMMs). Our test statistic is a quadratic form of the difference between observed values and the values expected under the estimated model in cells defined by a partition of the covariate space. We show that this test statistic has an asymptotic chi-squared distribution when model parameters are estimated by maximum likelihood, and study its power under local alternatives both analytically and in simulations. For the case of linear mixed models, we also study the setting in which parameters are estimated by least squares and the method of moments. Several data examples illustrate the methods.

Conducted extensive simulations to compare fractional polynomials (FPs) of degree 2 (FP2) and degree 4 (FP4) and two variants of P-splines that used generalized cross validation (GCV) and restricted maximum likelihood (REML) for smoothing parameter selection.
The ability of P-splines and FPs to recover the "true" functional form of the association between exposure and continuous, binary, and survival outcomes was evaluated for linear, quadratic, and more complex non-linear functions, using different sample sizes and signal-to-noise ratios. For more strongly curved functions, FP2, the current default setting in implementations for fitting FPs, showed considerable bias and consistently higher mean squared error (MSE) than the spline-based estimators and FP4, which performed equally well in most simulation settings. FPs, however, are prone to artefacts due to the specific choice of the origin, while P-splines based on GCV sometimes yield wiggly estimates, particularly for small sample sizes.

Exposure Assessment, Errors in Exposure Measurements, and Missing Exposure Data

Compared the diagnostic accuracy and agreement of two diagnostic tests under subsampling.

Developed methods for analysis of disease incidence in cohort studies incorporating data on multiple disease traits, using a two-stage semiparametric Cox proportional hazards regression model that allows one to examine the heterogeneity in the effect of the covariates by the levels of the different disease traits.

Proposed an estimating-equation approach for handling missing cause of failure in competing-risk data. Proved asymptotic unbiasedness of the estimating-equation method under a general missing-at-random assumption and proposed a novel influence-function-based sandwich variance estimator. The methods are illustrated using simulation studies and a real data application involving the Cancer Prevention Study (CPS-II) nutrition cohort.

Methods for descriptive epidemiologic studies

Identifying regions with the highest and lowest mortality rates and producing the corresponding color-coded maps helps epidemiologists identify promising areas for analytic etiological studies.
Based on a two-stage Poisson-Gamma model with covariates, we used information on known risk factors, such as smoking prevalence, to adjust mortality rates and reveal residual variation in relative risks that may reflect previously masked etiological associations. In addition to covariate adjustment, we studied rankings based on standardized mortality ratios (SMRs), empirical Bayes (EB) estimates, and a posterior percentile ranking (PPR) method, and indicated the circumstances that warrant the more complex procedures in order to obtain a high probability of correctly classifying the regions with high and low relative risks.
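To make the contrast between SMR-based and model-based rankings concrete, the sketch below uses a simplified Poisson-Gamma model without covariates, which is an assumption made for illustration and not the two-stage model with covariates described above. Observed deaths O_i are taken to be Poisson with mean E_i * theta_i, where E_i is the expected count and the relative risk theta_i has a Gamma prior with shape a and rate b; the posterior mean of theta_i is then (O_i + a) / (E_i + b). The hyperparameters are treated as known here, whereas an empirical Bayes analysis would estimate them from the data. All names and numbers are hypothetical.

```python
from typing import List


def smr(observed: List[float], expected: List[float]) -> List[float]:
    """Standardized mortality ratio O_i / E_i for each region."""
    return [o / e for o, e in zip(observed, expected)]


def gamma_posterior_mean(observed: List[float], expected: List[float],
                         shape: float, rate: float) -> List[float]:
    """Posterior mean relative risk under O_i ~ Poisson(E_i * theta_i),
    theta_i ~ Gamma(shape, rate). Hyperparameters are assumed known."""
    return [(o + shape) / (e + rate) for o, e in zip(observed, expected)]


def rank_regions(estimates: List[float]) -> List[int]:
    """Region indices ordered from highest to lowest estimated relative risk."""
    return sorted(range(len(estimates)), key=lambda i: estimates[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: a small region with 2 deaths where 1 was expected,
    # and a large region with 75 deaths where 50 were expected.
    observed = [2.0, 75.0]
    expected = [1.0, 50.0]

    raw = smr(observed, expected)
    shrunk = gamma_posterior_mean(observed, expected, shape=10.0, rate=10.0)

    print("SMR:", raw, "ranking:", rank_regions(raw))
    print("Posterior mean:", shrunk, "ranking:", rank_regions(shrunk))
```

In this toy example the small region has the larger SMR, but its estimate is pulled strongly toward the prior mean of 1, so the large region ranks first under the posterior mean. This kind of reordering is one reason rankings based on raw SMRs and on model-based estimates can disagree.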