Large-scale gene expression profiling studies provide valuable information about how the expression of individual genes changes in response to environmental toxicants and stressors. However, investigators often struggle to interpret these changes from a broader perspective, because the tools for integrating individual genes into functional pathways and networks remain rudimentary. Statistical and data mining approaches are urgently needed to make optimal use of these high-dimensional data. This need grows as genomics data become larger and more complex and as the biological questions to be addressed become more sophisticated. We have proposed a method called the genetic algorithm/k-nearest-neighbor (GA/KNN) approach. It is a multivariate stochastic search algorithm that selects a subset of genes that can discriminate between different classes of samples, e.g., normal versus tumor tissue, or unexposed versus exposed tissue. This tool has proved able to identify differentially expressed genes and, when used in conjunction with clustering methods, to reveal subcategories that share distinct, characteristic patterns of response (e.g., tumor subtypes). We have also developed methods, based on order-restricted statistical inference, for classifying effects on expression over time or dose. In another project, we developed a non-linear regression model for quantitatively analyzing periodic gene expression in studies of experimentally synchronized cells. The model permits identification of genes whose expression varies with the cell cycle, and it permits hypothesis testing about biologically meaningful parameters that characterize cycling genes. Currently, we are developing methods that combine gene expression data and genomic sequence data to identify families of genes that may be functionally related and to help understand gene regulation. Towards this goal, we have created a human-mouse gene ortholog promoter sequence data set.
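To make the GA/KNN idea mentioned earlier in this paragraph more concrete, the following is a minimal sketch rather than the published implementation. It assumes a fitness score equal to the leave-one-out accuracy of a majority-vote k-nearest-neighbor classifier restricted to a candidate gene subset, and a very simple mutation-only genetic search with elitist selection. The function names (`knn_loo_accuracy`, `ga_select`), the parameter defaults, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def knn_loo_accuracy(expr, labels, genes, k=3):
    """Leave-one-out k-NN accuracy using only the selected gene indices.

    expr   : list of samples, each a list of expression values (one per gene)
    labels : class label for each sample
    genes  : indices of the genes in the candidate subset
    """
    correct = 0
    for i, sample in enumerate(expr):
        # Euclidean distance from sample i to every other sample,
        # computed on the candidate genes only.
        dists = []
        for j, other in enumerate(expr):
            if j == i:
                continue
            d = math.sqrt(sum((sample[g] - other[g]) ** 2 for g in genes))
            dists.append((d, labels[j]))
        dists.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in dists[:k])
        predicted = votes.most_common(1)[0][0]  # simple majority vote (assumption)
        correct += predicted == labels[i]
    return correct / len(expr)


def ga_select(expr, labels, subset_size=2, pop_size=20, generations=30, k=3, seed=0):
    """Mutation-only genetic search for a discriminative gene subset.

    Assumes subset_size is smaller than the number of genes.
    Returns (sorted gene indices, leave-one-out accuracy).
    """
    rng = random.Random(seed)
    n_genes = len(expr[0])

    def fitness(subset):
        return knn_loo_accuracy(expr, labels, subset, k)

    # Start from random gene subsets ("chromosomes").
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half, then refill the population with mutated copies.
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            pos = rng.randrange(subset_size)
            # Swap one gene for a gene not already in the subset.
            child[pos] = rng.choice([g for g in range(n_genes) if g not in child])
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    return sorted(best), fitness(best)


# Hypothetical toy data: gene 0 separates the classes, genes 1 and 2 are noise.
toy_expr = [
    [0.0, 5, 1], [0.1, 1, 4], [0.2, 3, 2], [0.3, 4, 5],   # "normal"
    [5.0, 4, 1], [5.1, 2, 5], [5.2, 5, 3], [5.3, 1, 2],   # "tumor"
]
toy_labels = ["normal"] * 4 + ["tumor"] * 4

best_genes, accuracy = ga_select(toy_expr, toy_labels, subset_size=1)
```

In practice a stochastic search like this would typically be repeated from many random starting points, with genes ranked by how often they appear in high-scoring subsets. That step is omitted here to keep the sketch short.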
We have developed a sequence alignment algorithm for identifying promoter regions that are conserved between the two species. In addition, we have implemented a computational algorithm that scans the promoter sequences in the data set to identify binding sites for known transcription factors. We are also developing algorithms based on a statistical sampling approach called the Gibbs sampler to identify common motifs (both known and unknown) that are present in a set of human and mouse promoter sequences.
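As an illustration of what scanning promoter sequences for known transcription factor binding sites can look like, here is a minimal sketch. It assumes a position weight matrix (PWM) with log-odds scoring against a uniform background, which is one common way to do this and is not necessarily the algorithm described above. The function names, the pseudocount, the threshold, and the TATA-like count matrix are hypothetical. Only the forward strand is scanned.

```python
import math


def build_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log-odds position weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan_sequence(sequence, pwm, threshold):
    """Return (start, site, score) for each window scoring at or above threshold."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


# Hypothetical count matrix for a TATA-like motif (10 aligned example sites).
tata_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}, {"A": 10}, {"A": 10}]
tata_pwm = build_log_odds(tata_counts)

# Hypothetical promoter fragment with one exact match starting at index 3.
hits = scan_sequence("GGCTATAAAGGC", tata_pwm, threshold=8.0)
```

With this toy matrix only an exact `TATAAA` match clears the threshold of 8.0, so `hits` contains a single entry at position 3. A real scan would use curated matrices, scan both strands, and choose thresholds with the expected false-positive rate in mind.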