The availability of massive amounts of genomic data, in the form of genome sequence, mRNA gene expression, and DNA structural information, has opened up a major opportunity to develop investigative tools for inferring the molecular basis of biological function. However, existing statistical methods do not adequately address the complexities that arise in analyzing such large data sets, including complex dependence structures, many missing observations, and varying resolutions across data types; the result is biased inference. The goals of this project are to provide robust and efficient analysis tools for data with these complexities through development of: (i) novel unified Bayesian statistical methodology for detecting transcription factor binding sites using genomic sequence and data generated by high-throughput technologies; (ii) innovative statistical classification methods for precise detection of elements of chromatin structure using data from high-resolution genome tiling arrays, and for identification of the underlying sequence characteristics that determine chromatin structure and function; and (iii) publicly available software implementing these methods for use by the scientific research community. The methods will be applied and validated on data from genomes at three levels of complexity (yeast, C. elegans, and human), leading in the long term to major scientific advances in characterizing the distinguishing features of chromatin regulation in complex genomes and to a better understanding of the causes of various cellular processes, including a variety of disease states.
Elucidating the factors that underlie chromatin structure and the binding of transcription factors to genomic DNA has enormous potential implications for human health. Most fundamental cellular processes involve protein-DNA interactions that are influenced by chromatin structure. Achieving the goals of this project would provide a detailed understanding of how biological function is encoded through genomic sequence and structure, and could substantially advance the understanding of disease, potentially leading to new breakthroughs in genomic medicine.