We are developing methods for using exposures measured in pooled specimens from several individuals, together with genotypes measured separately on each individual, to study gene-environment interactions. Suppose one has a case-control study and has genotyped each individual at a panel of SNPs (single nucleotide polymorphisms). Suppose that one also has biological specimens (e.g., serum or urine) from the same individuals but lacks the budget to assay each individual specimen for an exposure of interest. Pooling specimens and assaying the resulting pooled specimens will not only save assay costs but also preserve specimen volume for future uses. In the past, we have developed methods for analyzing case-control studies with exposures measured in pooled specimens. Those methods assume, reasonably, that the measured value on the pooled specimen is the average of the values for the individual specimens. With those methods, testing gene-environment interactions at a single SNP required creating specimen pools within strata of individuals who all had the same genotype for that SNP. To study gene-environment interactions for a panel of SNPs, our previous methods would require creating new pooled specimens for each SNP studied, and the potential savings in assay costs would disappear. The approach that we are developing regards the individual measurements as missing data and uses the pooled specimens in a principled way to impute those missing data. With a given set of imputed data in hand, we can use standard statistical methods for case-control data to estimate gene-environment interactions. In practice, we use a multiple-imputation approach: creating multiple sets of imputed data, doing a case-control analysis for each set, and combining the results from the multiple analyses. This approach has shown some promise, but some problems remain to be resolved, and work on them is ongoing.
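The two mechanical ingredients described above, a pooled value equal to the average of the individual values and the combining of results across imputed data sets, can be illustrated with a minimal sketch. It is not the imputation model under development. The function names `impute_pool` and `rubin_combine` are hypothetical, the normal draw ignores genotype and disease status (a deliberate simplification), and the use of Rubin's rules for combining is an assumption, since the text does not say which combining rule is used.

```python
import random
import statistics


def impute_pool(pool_value, pool_size, sd, rng):
    """Draw individual exposures whose average equals the pooled measurement.

    Assumption: individual values are i.i.d. normal with standard deviation
    `sd`; centering the draws gives their distribution conditional on the mean.
    """
    draws = [rng.gauss(0.0, sd) for _ in range(pool_size)]
    center = statistics.fmean(draws)
    # Shift the centered draws so that their mean is exactly the pooled value.
    return [pool_value + d - center for d in draws]


def rubin_combine(estimates, variances):
    """Combine per-imputation estimates with Rubin's rules (assumed here).

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


if __name__ == "__main__":
    rng = random.Random(0)
    # One pool of four specimens measured at 3.0 (illustrative numbers only).
    print(impute_pool(3.0, 4, 1.0, rng))
    # Interaction estimates from three imputed data sets (illustrative only).
    print(rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

In a fuller analysis, each call to `impute_pool` would be replaced by a draw from a model that conditions on genotype and case status, and each element of `estimates` would come from a standard case-control analysis of one imputed data set.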
Identification of causative SNPs in a genome-wide study can be challenging when individual SNPs have small marginal effects, because testing thresholds must reflect the large number of SNPs under study. For complex diseases, particular combinations of SNPs may dramatically increase risk, a kind of epistasis or gene-gene interaction. We are currently investigating the use of a machine learning technique for the discovery of sets of SNPs that together cause disease (causative SNPs) in case-parents data. First, we devised a way to use actual case-parent triad genotypes to create simulated genome-wide data sets that reflect realistic linkage disequilibrium structure and are seeded with known sets of causative SNPs. We are currently working to better characterize the genetic properties of populations simulated in this way. Second, we implemented an existing stochastic search algorithm (called GA-KNN), which is based on an evolutionary algorithm, to find multiple sets of k SNPs that are predictive of disease (here k is a small number, say 2 or 4). By cataloguing the SNPs that appear most frequently among the sets that are predictive of disease, we hope to uncover the sets of causative SNPs. In preliminary trials on simulated data seeded with two interacting sets of four SNPs each, our approach shows promise. In ongoing work, we are attempting to speed up the algorithm and to see whether the promising performance is maintained in more complex situations. (See also Z01 ES040007; PI Clare Weinberg. Min Shi is also a within-lab collaborator on this project; her time is allocated in Weinberg's project but not in this one.)
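The search-and-catalogue idea can be sketched as a toy example: an evolutionary search over small SNP sets, scored by nearest-neighbour classification, with SNPs tallied across repeated runs. This is a simplified illustration, not the GA-KNN implementation used in the project. It assumes plain case/control labels rather than case-parent triads, uses mutation only, and all function names and parameters are hypothetical. Here `set_size` plays the role of k in the text, while `n_neighbors` is the separate neighbour count of the classifier.

```python
import random
from collections import Counter


def loo_knn_accuracy(genotypes, labels, snps, n_neighbors=3):
    """Leave-one-out accuracy of a nearest-neighbour vote using only `snps`."""
    correct = 0
    for i, row in enumerate(genotypes):
        # Manhattan distance on the selected SNPs to every other individual.
        dists = sorted(
            (sum(abs(row[s] - other[s]) for s in snps), j)
            for j, other in enumerate(genotypes)
            if j != i
        )
        votes = [labels[j] for _, j in dists[:n_neighbors]]
        predicted = Counter(votes).most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(genotypes)


def evolve_snp_set(genotypes, labels, set_size, rng, pop_size=20, generations=15):
    """Evolve one SNP set; requires more SNPs in the data than `set_size`."""
    n_snps = len(genotypes[0])
    population = [rng.sample(range(n_snps), set_size) for _ in range(pop_size)]

    def fitness(snps):
        return loo_knn_accuracy(genotypes, labels, snps)

    for _ in range(generations):
        # Keep the fitter half, then refill with one-SNP mutations of survivors.
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            pos = rng.randrange(set_size)
            child[pos] = rng.choice([s for s in range(n_snps) if s not in child])
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


def snp_frequencies(genotypes, labels, set_size, n_runs, seed=0):
    """Tally how often each SNP appears in the best set of each independent run."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(evolve_snp_set(genotypes, labels, set_size, rng))
    return counts


if __name__ == "__main__":
    data_rng = random.Random(1)
    labels = [0] * 5 + [1] * 5
    # Toy data: SNP 0 tracks the label; the other five SNPs are noise.
    genotypes = [[2 * y] + [data_rng.randrange(3) for _ in range(5)] for y in labels]
    print(snp_frequencies(genotypes, labels, set_size=2, n_runs=5).most_common())
```

SNPs that recur across many independent runs are the candidates one would examine further; a real application would also need crossover, a stopping criterion, and far more efficient distance computations.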