Transcriptional regulation is a highly coordinated process in the human genome. A significant component of this regulation is the interaction between transcription factor proteins (TFs) and cis-regulatory DNA elements. The goal of this project is to computationally predict and experimentally validate DNA sequence motifs that explain promoter function. The results will be direct functional measurements of sequence motifs at base-pair resolution. These measurements will allow us to assess the sensitivity and specificity of motif-prediction algorithms that can be applied immediately to the whole genome. They will also help determine the proportion of TF binding events that are functionally relevant. The three aims of our project are:
Aim 1: We will use two machine learning algorithms (support vector machines and random forests) to determine the subset of known transcription factor binding motifs that is most predictive of promoter activity.
Aim 2: We will then use Bayesian networks to select the most predictive motif features. These features are the strength of each motif match, as scored by a position-specific scoring matrix (PSSM), and the positions of individual sites relative to each other and to the transcription start site.
Aim 3: We will then mutagenize the informative positions within the 900 sites identified in Aim 2 and measure the resulting promoter activities using transient transfection assays. We will also test 100 lower-ranking sites to determine the sensitivity and specificity of our algorithms. In addition, we will develop an oligo competition assay as a new approach to increase the throughput of experimental motif analysis for the rest of the genome. The data generated in this project will be the first systematic functional analysis of TF binding sites at base-pair resolution.
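
To make the motif features named in Aim 2 concrete, the following is a minimal sketch of how a PSSM match strength and a site position relative to the transcription start site (TSS) could be computed. It is an illustration under stated assumptions, not the project's actual pipeline: the function names (build_pssm, score_site, best_hit), the uniform background frequencies, the pseudocount of 1, and the toy count matrix are all hypothetical choices made for this example.

```python
import math

BASES = "ACGT"


def build_pssm(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts, one per motif position, mapping base -> count.
    background: assumed base frequencies (uniform 0.25 if not given).
    """
    background = background or {b: 0.25 for b in BASES}
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + pseudocount * len(BASES)
        pssm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def score_site(pssm, site):
    """Motif strength: sum of per-position log-odds scores for one site."""
    return sum(column[base] for column, base in zip(pssm, site))


def best_hit(pssm, promoter, tss_index):
    """Return (score, offset) for the highest-scoring window in a promoter.

    offset is the window start relative to the TSS (negative = upstream).
    Returns None if the promoter is shorter than the motif.
    """
    width = len(pssm)
    best = None
    for start in range(len(promoter) - width + 1):
        score = score_site(pssm, promoter[start:start + width])
        if best is None or score > best[0]:
            best = (score, start - tss_index)
    return best


if __name__ == "__main__":
    # Toy count matrix for a hypothetical 4-bp motif with consensus TATA.
    toy_counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
    toy_pssm = build_pssm(toy_counts)
    # Hypothetical promoter fragment with the TSS assumed at index 10.
    print(best_hit(toy_pssm, "GGCGTATAGGCC", tss_index=10))
```

The score and offset returned by best_hit correspond to the two kinds of features described in Aim 2 (motif strength and site position); in practice they would be computed for every known motif and promoter and then passed to the feature-selection step. Only the forward strand is scanned here, which is a simplification.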