Methods for Genetic Epidemiology

Sample Size: We developed sample size calculations for the Risch-Teng tests for genetic association with disease in sibships ascertained on the basis of a fixed number of affected and unaffected children. Most previous sample size calculations for case-control studies to detect genetic associations with disease have assumed that the disease gene locus is known, whereas, in fact, markers are used. We calculated sample sizes for unmatched and sibling case-control studies to detect associations between a biallelic marker and a disease governed by a biallelic disease locus. We evaluated the sample size requirements for two-sided trend tests with additive scores applied to marker genotypes. The main factors influencing sample size, apart from alpha levels and relative risk parameters, are: 1) the degree of agreement between marker allele and disease allele frequencies, which determines the maximal linkage disequilibrium; 2) the percent of maximal linkage disequilibrium present; and 3) the attributable risk from the disease allele. The large sample size requirements represent a formidable challenge to studies of this type and may partly explain why many genetic associations based on SNPs have not been confirmed in subsequent studies.

Marker Selection and Tests for Genetic Associations and Gene-Environment Interactions: We developed efficiency robust tests that have high power for a wide family of possible underlying genetic models. The transmission/disequilibrium test (TDT) is optimum when the underlying model is additive or multiplicative but less powerful when the disease is recessive. The maximum of the optimum tests for the dominant, additive and recessive models was shown to have greater power robustness than the TDT. We also obtained a procedure that has greater power robustness than the usual additive scores used in the Cochran-Armitage trend test.
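As a point of reference for the trend tests mentioned above, the following is a minimal sketch of the standard Cochran-Armitage trend test with additive scores (0, 1, 2) applied to genotype counts in cases and controls. It is not the efficiency robust procedure developed in this project. The function name and the genotype counts are hypothetical, and the variance uses the common N (rather than N - 1) convention.

```python
import math


def cochran_armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Two-sided Cochran-Armitage trend test for a 2 x k table.

    cases, controls: counts per genotype category (illustrative inputs).
    scores: additive scores, by default the number of variant alleles.
    Returns (z statistic, two-sided p-value from the normal approximation).
    """
    if not (len(cases) == len(controls) == len(scores)):
        raise ValueError("cases, controls and scores must have equal length")

    totals = [r + s for r, s in zip(cases, controls)]
    n_cases = sum(cases)
    n_controls = sum(controls)
    n = n_cases + n_controls

    # Score-weighted difference between observed and expected case counts.
    u = sum(x * (r - t * n_cases / n) for x, r, t in zip(scores, cases, totals))

    # Variance of u under the null hypothesis of no association
    # (N convention; some texts use N - 1 in one factor).
    sx = sum(x * t for x, t in zip(scores, totals))
    sxx = sum(x * x * t for x, t in zip(scores, totals))
    var_u = n_cases * n_controls / n**3 * (n * sxx - sx**2)

    z = u / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Hypothetical counts by genotype (0, 1, 2 copies of the variant allele).
z, p = cochran_armitage_trend(cases=[10, 20, 30], controls=[30, 20, 10])
```

Replacing the additive scores with (0, 1, 1) or (0, 0, 1) gives the corresponding tests tailored to dominant and recessive models.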
We showed that methods for family studies that take ascertainment and random genetic correlations into account are robust to misspecification of the unobserved genetic mechanism. We developed semiparametric maximum likelihood estimates (SPMLE) for case-control studies of gene-environment interactions under the assumption of independence of gene and environmental factors. Traditional logistic regression analysis is not efficient in this setting. We use a profile-likelihood technique to obtain the SPMLE and study its asymptotic theory. The results are extended to deal with situations where genetic and environmental factors are independent conditional on some other factors. The method is applied to ovarian cancer case-control data to investigate the interplay of BRCA1/2 mutations and oral contraceptive use. We studied the false positive report probability (FPRP), the probability of no true association given a significant finding. FPRP depends not only on the observed p-value but also on the prior probability that the association is real and on the power of the test. We proposed a four-step approach that uses a decision rule based on FPRP to evaluate the chance that a finding deemed noteworthy is, in fact, truly associated with disease. We investigated current proposals for selecting single nucleotide polymorphisms (SNPs) to define haplotypes for disease association studies in independent samples of cases and controls. Current proposals based on diversity measures often lead to sub-optimal selections of subsets of SNPs. We evaluated the power of SNP-by-SNP versus haplotype-based analysis of case-control association studies by simulating data using a panel of 13 genes with known SNP and haplotype structure. When disease risk is conferred by a SNP, SNP-by-SNP analysis, with the false discovery rate (FDR) used for multiplicity control, appears to be superior to haplotype-based analysis.
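The two multiplicity adjustments compared in this part of the work can be sketched as follows: the Benjamini-Hochberg step-up procedure, which controls the FDR, and Hochberg's step-up procedure, which controls the familywise error rate. This is a minimal illustration under the usual textbook definitions, not the simulation code used in the study; the function names and the p-values are hypothetical.

```python
def _step_up(pvalues, threshold):
    """Generic step-up rule: reject the k smallest p-values, where k is the
    largest rank whose ordered p-value falls at or below threshold(rank, m)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= threshold(rank, m):
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def benjamini_hochberg(pvalues, alpha=0.05):
    """FDR control: compare the k-th smallest p-value with k * alpha / m."""
    return _step_up(pvalues, lambda k, m: k * alpha / m)


def hochberg(pvalues, alpha=0.05):
    """Familywise error control: compare the k-th smallest p-value
    with alpha / (m - k + 1)."""
    return _step_up(pvalues, lambda k, m: alpha / (m - k + 1))


# Hypothetical per-SNP p-values from single-marker tests.
snp_pvalues = [0.001, 0.012, 0.02, 0.03, 0.2]
bh_rejections = benjamini_hochberg(snp_pvalues)
hochberg_rejections = hochberg(snp_pvalues)
```

Because k * alpha / m is never smaller than alpha / (m - k + 1), every hypothesis rejected by Hochberg's procedure is also rejected by the Benjamini-Hochberg procedure at the same alpha, which is why FDR control cannot have lower power in such comparisons.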
The superiority of this SNP-based approach was largest when certain relatively common SNPs (variant allele frequency 0.20) were associated with disease. FDR control provided 1-3% more power than Hochberg's familywise error rate controlling procedure. Conversely, when disease risk is conferred by a haplotype, haplotype-based analysis was sometimes substantially superior to SNP-by-SNP, but for other haplotypes SNP-by-SNP analysis had similar or higher power. In the kin-cohort design, the data consist of the event history data of the relatives of a sample of genotyped subjects. Existing methods may produce biased estimates of risk when multiple events are related to the genetic mutation and follow-up of some of the events may be censored by the onset of the other events. We show that cause-specific hazard functions for carriers and non-carriers are identifiable from kin-cohort data and estimate parameters using a composite-likelihood. We illustrate the use of the proposed method for estimation of risk of ovarian cancer from BRCA1/2 mutations in the absence of breast cancer. DNA microarrays depict the expression levels of thousands of genes simultaneously and can be used to identify differentially expressed genes across different groups of samples. Among those differentially expressed genes, a further reduction may lead to the identification of the "necessary" genes for distinguishing class membership (e.g., tumor type) in the samples. We introduce a simple dimension estimation technique, sliced average variance estimation, to infer the dimension of the classification problem and obtain linear combinations of genes that are sufficient to discriminate between pre-defined tumor types. Logistic regression models are then fitted to predict tumor class and the performance of the class predictors is assessed by cross-validation.
The method worked well on cDNA microarray data on BRCA1 and BRCA2 mutation carriers as well as sporadic tumors, and on oligonucleotide microarray data on acute leukemia.

Detecting Familial Aggregation of Disease

We use data on lymphoma in families of Hodgkin lymphoma (HL) cases from the Swedish Family Cancer Database to illustrate survival methods for detecting familial aggregation in first-degree relatives of case probands compared to first-degree relatives of control probands. Because more than one case may occur in a given family, the first-degree relatives of case probands are not necessarily independent, and we present procedures that allow for such dependence. A bootstrap procedure also accommodates matching of case and control probands. Regarding families as independent sampling units leads to inference based on "sandwich variance estimators" and accounts for dependencies from having more than one proband in a family but not for matching. We compare these methods in analysis of familial aggregation of HL and also present simulations to compare survival analyses with analyses of binary outcome data.

Design and Analysis of Case-Control and Cohort Studies

We have developed methods to estimate survival (absolute risk) and population attributable risk (PAR) from sampled cohort studies. Prior to this research there were no practical semiparametric estimators, and no non-parametric estimators. This research also establishes the considerations for the efficient design of risk estimation in all sampled cohort studies, and provides efficient estimators. We are near completion of documenting software programs that allow implementation of these results in R. Though the focus was on survival estimation, the paper contains the same results for relative risk estimation in the Cox proportional hazards model.
The Cornfield inequality is commonly used to assess whether an omitted variable can explain a significant association found in an epidemiologic study, but such omission can also mask a real association. We adapted the method to re-analyze data from an epidemiologic study of birth defects
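The classical setting behind the Cornfield inequality can be illustrated with a short calculation. The sketch below assumes a single binary omitted variable that multiplies disease risk by a factor gamma, with no interaction with the exposure; it shows the textbook bound only and is not the adapted method used in the re-analysis, which is not described here. All function names and numerical values are hypothetical.

```python
def confounding_factor(gamma, p_exposed, p_unexposed):
    """Relative risk that would appear for an exposure with no effect,
    induced solely by a binary omitted variable.

    gamma: relative risk of disease associated with the omitted variable.
    p_exposed, p_unexposed: prevalence of the omitted variable among
    exposed and unexposed subjects.
    """
    return (1 + (gamma - 1) * p_exposed) / (1 + (gamma - 1) * p_unexposed)


def could_fully_explain(rr_observed, gamma, p_exposed, p_unexposed):
    """True if the omitted variable alone could produce rr_observed."""
    return confounding_factor(gamma, p_exposed, p_unexposed) >= rr_observed


# Omitted variable more common among the exposed: spurious association.
spurious = confounding_factor(gamma=10, p_exposed=0.6, p_unexposed=0.2)

# Cornfield-type bounds: the induced relative risk cannot exceed either
# the prevalence ratio or the omitted variable's own relative risk.
prevalence_ratio = 0.6 / 0.2

# Omitted variable more common among the unexposed: a true relative risk
# of 2 would be attenuated below 1, masking a real association.
masking = confounding_factor(gamma=10, p_exposed=0.2, p_unexposed=0.6)
apparent_rr = 2.0 * masking
```

In the first scenario an observed relative risk above the prevalence ratio could not be attributed to the omitted variable alone; in the second, the same mechanism hides a real effect.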