The rapid development of statistical methodology is driven largely by the need to describe, model, and analyze complex, large-scale data sets generated in various scientific and engineering disciplines. To make full use of the large amounts of available and incoming data on gene regulation, this proposal aims (1) to develop predictive modeling approaches that combine sequence analyses, gene expression data, and protein binding data; and (2) to develop a fully Bayesian model for de novo identification of cis-regulatory modules (combinatorial patterns of multiple sequence motifs that mediate the interactions between regulatory proteins and DNA sequences) in multiple related species. For the first project, the use of contemporary statistical learning methods, such as boosting, random forests, multivariate adaptive regression splines (MARS), and Bayesian additive regression trees (BART), is investigated for detecting influential sequence signals and predicting protein-DNA interactions. Multi-level models are proposed to incorporate the uncertainty in covariates into a statistical learning framework, and efficient computational algorithms are developed for the inference. The statistical aspects of the second project involve modeling multiple interacting stochastic processes by coupling chains of random variables. Efficient algorithms that use two-dimensional dynamic programming and advanced Monte Carlo techniques, such as tempering and equi-energy jumps, are developed for the challenging Bayesian inference under the proposed model.

The proposed research is expected to have a direct and immediate impact on various fields in molecular biology, genetics, and medical sciences, in which gene regulation analyses play critical roles. In addition to methodological development, algorithms and software will be delivered for biologists to use on their own experimental data. Many statistical components of these projects, such as the coupling of hidden Markov models and the design of advanced Monte Carlo sampling with dynamic programming, are also expected to contribute significantly to statistics and other computational sciences.

Project Report

The goal of this project is to develop statistical modeling approaches for analyzing high-throughput genomics data. The field of genomics has produced huge amounts of data aimed at a better and more systematic understanding of how genes and proteins work together to regulate various cellular processes. Without principled statistical modeling and estimation methods, one cannot extract and fully exploit the rich information contained in these data. More specifically, we developed a few novel statistical methods for three types of genomics data: gene expression, protein binding, and DNA sequence data. These methods involve the use and further development of a variety of statistical techniques, such as predictive modeling, hidden Markov models, discriminative learning, and Markov chain Monte Carlo. We have delivered a few software tools, free for academic use, for analyzing these data. The new statistical methods also contributed to a better understanding of a few important biological problems: we have analyzed data generated from mouse embryonic stem cells and, more recently, RNA-Seq data for alternative splicing. This enhances the broader impact of our project. In addition, participants in this project, including the PI and his graduate students, have presented its results and outcomes to peers and biologists. Some of the developed statistical methods have been incorporated into graduate-level courses taught by the PI at UCLA. This project also trained a few PhD and MS students in Statistics.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0805491
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2008-07-15
Budget End: 2012-06-30
Support Year:
Fiscal Year: 2008
Total Cost: $138,767
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095