Finding the individual cis-regulatory elements for a gene is an important initial step toward understanding its regulation and function. We would like to take the analysis one step further. We hypothesize that genes with similar motif patterns in the promoter region, where the same pattern of motifs occurs in both human and mouse, might be functionally related or involved in the same biological process. This hypothesis can be extended to exploit the notion that the expression of a gene may be governed by a set of regulatory motifs acting in a combinatorial fashion. We accordingly use paired promoter sequences (human/mouse orthologs) to aid in identifying the "true" cis-regulatory motifs. This approach has the potential to identify new genes involved in a known pathway or biological process.
Toward this goal, we are taking four steps. First, we created a database of putative human-mouse ortholog promoter sequences. Second, to mine this data set effectively, we have developed a sequence alignment algorithm for identifying conserved segments in the paired promoter regions of human and mouse ortholog genes. We assume that transcription factor binding sites are more likely to be present in conserved (i.e., sequence-similar) regions than in non-conserved regions. Third, we have implemented a computational algorithm that scans the promoter sequences in the data set to identify binding sites for known transcription factors. Finally, we are developing algorithms based on a mathematical approach called the Gibbs sampler to identify common motifs (both known and unknown) that are present in a set of human and mouse promoter sequences.
In the next section, I will explain each of these four steps and illustrate the idea of new gene discovery using a learning set of 17 base excision repair (BER) genes as an example.
Besides this new initiative, we are also developing methods for the analysis of microarray data and proteomics data.
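To make the binding-site scanning step for promoter sequences more concrete, here is a minimal sketch in Python. It is not our implementation: the count matrix `COUNTS`, the function names, the pseudocount, the uniform background, and the score threshold are all hypothetical choices made for illustration. The sketch also simplifies the use of conservation by only asking whether a motif scores above threshold in both the human and the mouse promoter, without aligning them first.

```python
import math

# Hypothetical toy count matrix for a 4-bp motif with consensus TGAC.
# One dict per motif position; this is NOT a real transcription factor profile.
COUNTS = [
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 0, "C": 1, "G": 8, "T": 1},
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 1, "C": 8, "G": 1, "T": 0},
]


def log_odds(counts, pseudo=0.5, background=0.25):
    """Convert base counts to log2-odds scores against a uniform background."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudo
        pwm.append(
            {b: math.log2((col[b] + pseudo) / total / background) for b in "ACGT"}
        )
    return pwm


def scan(seq, pwm, threshold):
    """Return (position, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[j][b] for j, b in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits


def shared_motif(human_seq, mouse_seq, pwm, threshold):
    """True if the motif is found in both orthologous promoter sequences."""
    return bool(scan(human_seq, pwm, threshold)) and bool(
        scan(mouse_seq, pwm, threshold)
    )
```

In use, one would call `log_odds(COUNTS)` once per known factor and then apply `shared_motif` to each human/mouse promoter pair. Restricting the scan to aligned conserved segments, as described above, would replace the simple "found in both" test.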
Notably, we have proposed a method called the genetic algorithm/k-nearest-neighbor (GA/KNN) approach. It is a multivariate stochastic search algorithm that selects a subset of genes able to discriminate between different classes of samples, e.g., normal versus tumor tissue, or unexposed versus exposed tissue. This tool has proved able to identify differentially expressed genes and, when used in conjunction with clustering methods, to reveal subcategories of samples that share characteristic patterns of response (e.g., important tumor subtypes) and that may be etiologically distinct.
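The general idea of pairing a genetic algorithm with a k-nearest-neighbor classifier can be sketched as follows. This is a simplified illustration under stated assumptions, not the published GA/KNN procedure: the function names, the mutation-only evolution scheme, the population size, and the use of leave-one-out accuracy as the fitness function are all assumptions made to keep the example short.

```python
import random


def knn_loo_accuracy(data, labels, genes, k=3):
    """Leave-one-out k-NN accuracy using only the selected gene indices."""
    correct = 0
    for i, x in enumerate(data):
        # Squared Euclidean distance to every other sample, on the chosen genes.
        dists = sorted(
            (sum((x[g] - y[g]) ** 2 for g in genes), labels[j])
            for j, y in enumerate(data)
            if j != i
        )
        votes = [lab for _, lab in dists[:k]]
        if max(set(votes), key=votes.count) == labels[i]:
            correct += 1
    return correct / len(data)


def ga_select(data, labels, n_genes, subset_size=2, pop_size=20,
              generations=15, seed=0):
    """Search for a gene subset with high k-NN accuracy (illustrative only)."""
    rng = random.Random(seed)

    def fitness(chrom):
        return knn_loo_accuracy(data, labels, chrom)

    # Each chromosome is a list of distinct gene indices.
    pop = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == 1.0:
            break  # a perfectly discriminating subset was found
        survivors = pop[:pop_size // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            # Mutation: swap one gene for a gene not already in the subset.
            pos = rng.randrange(subset_size)
            child[pos] = rng.choice(
                [g for g in range(n_genes) if g not in child]
            )
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

Here `data` is a list of samples, each a list of expression values, and `labels` holds the class of each sample (for example 0 for normal and 1 for tumor). Because the search is stochastic, different seeds can return different subsets, which is why repeated runs are informative when judging which genes discriminate reliably.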