Optimized mixed Markov models for motif identification

Identifying functional elements, such as transcription factor binding sites, is a fundamental step in reconstructing gene regulatory networks. It remains challenging, largely because training samples are scarce. We introduced a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods that allow model complexity to be adjusted for different motifs. In comparison with other leading methods, OMiMa can incorporate more than the pairwise dependencies of NNSplice; OMiMa avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and OMiMa requires smaller training samples than the Maximum Entropy Model (MEM). Testing on both simulated and actual data (regulatory cis-elements and splice sites), we found OMiMa's performance superior to the other leading methods in terms of prediction accuracy, required size of training data or computational time. Our optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif.

Accurate anchoring alignment of divergent sequences

Obtaining high-quality alignments of divergent homologous sequences for cross-species sequence comparison remains a challenge. We proposed a novel pair-wise sequence alignment algorithm, ACANA (ACcurate ANchoring Alignment), for aligning biological sequences at both local and global levels. Like many fast heuristic methods, ACANA uses an anchoring strategy. Unlike the others, however, ACANA uses a Smith-Waterman-like dynamic programming algorithm to recursively identify near-optimal regions as anchors for a global alignment. Performance evaluations using a simulated benchmark dataset and real promoter sequences suggest that ACANA is accurate and consistent, especially for divergent sequences.
Specifically, using a simulated benchmark dataset, we showed that ACANA has the highest sensitivity in aligning constrained functional sites compared to BLASTZ, CHAOS and DIALIGN for local alignment and compared to AVID, ClustalW, DIALIGN and LAGAN for global alignment. Applied to 6007 pairs of human-mouse orthologous promoter sequences, ACANA identified the largest number of conserved regions (defined as greater than 70% identity over 100 bp) compared to AVID, ClustalW, DIALIGN and LAGAN. In addition, the average length of the conserved regions identified by ACANA was the longest. Thus, we suggest that ACANA is a useful tool for identifying functional elements in cross-species sequence analysis, such as predicting transcription factor binding sites in non-coding DNA.

A method for non-linear association measurement between genes and between a phenotypic endpoint and gene expression

Understanding the mechanisms of liver damage upon chemical exposure is an important step in preventing such damage and developing early biomarkers. The NCT has profiled paired liver and blood gene expression in rats that were treated with seven chemical toxicants and one non-toxic analog at three dose levels (low, medium, and high) and four time points (6, 12, 24, and 48 hr) after treatment. This data set consists of 318 pairs of arrays, each of which contains nearly 20,500 probes. We propose a non-linear association measure between the alanine transaminase (ALT) level in the blood and gene expression levels in both blood and liver. We are currently applying this measure to identify biological pathways and processes in both organs that may be associated with ALT levels in the blood.

Optimize position weight matrix (PWM) for motif detection

PWMs have been widely used to scan promoter sequences for putative transcription factor binding sites. PWMs for many transcription factors are currently available in the TRANSFAC database.
However, PWMs are usually created from a few known (but not always validated) transcription factor binding sites, which yields a poor estimate of the PWM. Thus, the results obtained with these PWMs may not be reliable. While identifying and validating a binding site is time-consuming, chromatin immunoprecipitation (ChIP) combined with microarrays (ChIP-on-chip) provides an alternative way to identify many low-resolution binding sites (i.e., regions rather than specific binding sites). Recently, this technology has been successfully applied to Oct4, Sox2, Nanog, and p53. Because ChIP cannot pinpoint the exact location of a binding site, such data are rarely used in constructing a PWM. We are currently developing a method that would allow one to build a statistical model from the ChIP data.
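
For readers unfamiliar with PWM scanning, the conventional approach that the paragraph above refers to can be sketched as follows. This is only a minimal illustration of a standard log-odds PWM built from a handful of aligned sites; it is not the ChIP-based method under development. The function names, the pseudocount of 1, the uniform background, and the example sites used with it are all assumptions made for illustration.

```python
import math

BASES = "ACGT"  # only uppercase A/C/G/T are handled in this sketch


def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2-odds PWM (one dict per position) from aligned sites.

    pseudocount and the uniform background are illustrative assumptions.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(width):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in BASES}
        )
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return a list of (start, score)."""
    width = len(pwm)
    return [
        (start, sum(pwm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])
```

With only a few training sites, the pseudocount dominates the estimated frequencies, which is one way to see why PWMs built from small site collections can be unreliable.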