Gains or losses of transcription factor binding at specific locations in the genome have been linked to a wide range of human diseases. Although much is known about the determinants of transcription factor-DNA interaction, it remains challenging to accurately predict changes in transcription factor binding caused by genetic and epigenetic variants in the genome. Several critical gaps remain in our understanding of how sequence and non-sequence information on endogenous genomic DNA is integrated to give rise to the genome-wide binding patterns of transcription factors. Our long-term vision is to shed light on how genome and epigenome variation, which leads to variation in the genome-wide targets of transcription factors, affects the regulatory networks of the cell and the gene expression programs that give rise to phenotypic diversity. Our previous study characterized the genome-wide binding locations of more than 500 transcription factors on the Arabidopsis thaliana reference genome. Our integrative computational analysis revealed the features of endogenous genome context, consisting of sequence motif, DNA shape, and 5-methylcytosine modification of genomic DNA, that play a role in determining the binding landscape of transcription factors from major structural families. To further study the variability of these binding sites, driven by native genome and epigenome variation, we generated genome-wide, base-resolution maps of 5-methylcytosine, an epigenomic mark on DNA, in a collection of over 1,000 worldwide natural strains (accessions) of A. thaliana, complementing the efforts to catalog genome sequence variation in these accessions.
Guided by the diversity in the genome and epigenome, a wealth of phenotypic data, and preliminary results suggesting transcription factor binding variation in these accessions, our goals for the next five years are to address three major challenges in understanding natural variation in transcription factor binding: 1) to determine the genome-wide transcription factor binding variation across multiple accession genomes; 2) to characterize the effect of transcription factor coding variants on their genome-wide binding specificities and target genes; and 3) to investigate how natural variation of protein-protein interactions alters target genes and genome-wide binding specificities for interacting transcription factors. All three projects will use computational modeling to evaluate the contributions from features in the binding site environment. Our proposed experiments and computational models will make a broad impact by characterizing transcription factor binding variation and understanding the role played by sequence and non-sequence features of endogenous genomic DNA. Our results will shed light on the fundamental principles underlying the regulatory functions of genome and epigenome variation, empowering the discovery and prediction of regulatory variants and their molecular mechanisms.
Variation in transcription factor-DNA interactions contributes to phenotypic variation, from differences in traits to a wide range of human diseases. Our proposed studies aim to understand how naturally occurring genetic and epigenetic variation in the genome gives rise to variation in transcription factor-DNA interactions, and how this variation collectively alters the regulatory networks in the cell. Our research has the potential to reveal novel principles for predicting the regulatory functions of genetic and epigenetic variants, a critical element in the development of precision medicine.