Over the past two decades, statisticians and other quantitative researchers have come to appreciate the power of Monte Carlo integration and optimization methods. This proposal focuses on developing a novel Markov chain Monte Carlo (MCMC) framework that promises to greatly enhance our ability and flexibility to design effective Monte Carlo algorithms. More precisely, the investigator proposes a unified framework that generalizes the standard Metropolis-Hastings approach to designing Markov chains, and shows its deep relationship with several existing MCMC methods, such as multigrid Monte Carlo, configurational-bias Monte Carlo, and orientational-bias Monte Carlo. The investigator will also focus on one of the fastest-growing application areas, protein bioinformatics (including multiple sequence alignment, protein function annotation, protein-protein interactions, and protein structural modeling), which serves both as an important application and as a rich source of significant challenges to existing MCMC methods. On the one hand, the investigator seeks to apply the new MCMC framework to design novel protein structure and sequence analysis tools; on the other hand, the challenging problems encountered in these efforts will motivate and steer the development of new MCMC strategies.
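
For readers unfamiliar with the standard Metropolis-Hastings approach that the proposed framework generalizes, the following is a minimal sketch of a random-walk Metropolis sampler. It is not the investigator's framework: the function name, the Gaussian proposal, the step size, and the standard normal target are all illustrative assumptions.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target: function returning the log of an unnormalized density.
    All parameter names here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        # Symmetric Gaussian proposal centered at the current state.
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # For a symmetric proposal the Hastings ratio reduces to the
        # ratio of target densities; accept with probability min(1, ratio).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


def log_std_normal(x):
    # Log density of a standard normal, up to an additive constant.
    return -0.5 * x * x


if __name__ == "__main__":
    draws, rate = metropolis_hastings(log_std_normal, x0=0.0,
                                      n_steps=20000, step_size=2.0)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(f"acceptance rate {rate:.2f}, mean {mean:.3f}, variance {var:.3f}")
```

The sampler only needs the target density up to a normalizing constant, which is why the method suits high-dimensional problems where that constant is intractable.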

With the ever-growing need for quantitative (statistical) analysis of very large datasets with complex structures (such as genomics data, consumer goods data, and internet data), there is an equally strong need for more efficient computational methods to analyze these data and make useful predictions. This proposal has three interrelated themes: (1) to develop a novel Monte Carlo framework, which can be generally understood as a new way of using computer-generated random numbers to approximately solve an optimization or integration problem; (2) to develop novel statistical models for biological sequence and protein structure analysis; and (3) to apply these new computational methods and statistical models to infer molecular mechanisms of protein function and to predict protein structures. The proposed research will not only significantly advance Monte Carlo methodology and computational statistics theory, which are applicable to a wide range of optimization and simulation problems in different application areas, but will also bring the power of these new methods and theory to bear on one of the most important application areas, computational biology. In particular, it will advance the modeling, analysis, and computational techniques of protein bioinformatics. It will help educators revise existing courses and create new ones on computational biology and Monte Carlo methodology for both undergraduate and graduate students. It will also provide interdisciplinary research opportunities for such students, and will result in software and methodologies that may be of interest to the pharmaceutical industry.
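
The idea of using computer-generated random numbers to approximately solve an integration problem can be illustrated with a minimal sketch of plain Monte Carlo integration. This is the textbook estimator, not the framework proposed here; the function name, the uniform sampling scheme, and the example integrand are assumptions made for illustration.

```python
import math
import random


def mc_integrate(f, a, b, n_samples, seed=0):
    """Estimate the integral of f over [a, b] by averaging f at uniform draws.

    Returns the estimate and its approximate standard error.
    The name and signature are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        y = f(rng.uniform(a, b))
        total += y
        total_sq += y * y
    mean = total / n_samples
    # Sample variance of f(U); clamp at zero to guard against rounding.
    var = max(total_sq / n_samples - mean * mean, 0.0)
    estimate = (b - a) * mean
    std_error = (b - a) * math.sqrt(var / n_samples)
    return estimate, std_error


if __name__ == "__main__":
    # The exact value of the integral of sin(x) over [0, pi] is 2.
    est, se = mc_integrate(math.sin, 0.0, math.pi, n_samples=100000)
    print(f"estimate {est:.4f} +/- {se:.4f}")
```

The standard error shrinks in proportion to one over the square root of the sample size regardless of dimension, which is the property that makes Monte Carlo attractive for very high-dimensional problems.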

Project Report

This project aimed to develop a general framework for designing efficient Monte Carlo algorithms that can handle very high-dimensional integration and optimization problems, to develop novel statistical models for protein/DNA sequence and structure analysis, and to apply these models and algorithms to study the molecular mechanisms of protein function. The grant supported the PI's team in conducting numerous research activities, resulting in 25 research articles published in peer-reviewed scientific journals and several publicly available bioinformatics software tools. The PI has led a number of bioinformatics studies. One study compared several popular computational methods for predicting gene expression patterns using only the sequence information surrounding each gene, and found that a tree-based Bayesian prediction method performed best. The team then developed a series of innovations on this method, such as allowing the algorithm to select only useful predictors and to detect interactions among predictors that help improve prediction accuracy. These efforts evolved into a whole new class of methods. The PI's team demonstrated that these methods can be applied effectively to a wide variety of problems, such as finding gene-gene and gene-environment interactions that may increase the risk of a certain disease; predicting cooperative mutations in the HIV genome that may be responsible for drug resistance; and detecting nonlinear gene regulatory effects. The PI's team also conducted a Bayesian meta-analysis to discover cell-cycle genes, which concluded, surprisingly, that a large number of genes (>2000) are "cell-cycle"-related, many more than is commonly believed by the community. They also designed a Bayesian network algorithm for predicting protein-protein interactions (publicly available), and further extended the method to focus on network-related properties of the interaction information.
The PI's team participated in the CASP 10 competition held in July 2012 for protein structure prediction, and made significant improvements to a Monte Carlo-based computational algorithm for structural simulation. In particular, the team proposed a novel method for generating any legitimate segment of a protein structure according to its Boltzmann distribution, which improves significantly upon earlier methods. They found that this algorithm is already very good at finding low-energy structural conformations for real proteins (in many cases the structures found have lower potential energies than the native structures). The method's ability to effectively generate new local variations of a protein is an important step toward rational drug design. Besides the bioinformatics-related research, the PI's team has also made progress in general computational theory and methodology. They introduced a new phrase/word discovery method for unstructured text data, which has been applied to analyzing historical Chinese documents; developed a novel model and algorithm for discovering item associations in text and in market-basket-type data, which is complementary to the popular topic modeling approach to text analysis; invented a novel sequential Monte Carlo method for dealing efficiently with complex data structures; conducted a fruitful theoretical study and generalization of the popular Wang-Landau algorithm used for studying statistical physics models; and published a new method for nonparametric hierarchical Bayes analysis based on Bernstein polynomials. Methodologies developed with the support of this grant have had an impact on general statistical research and machine learning. In particular, the protein simulation strategy can potentially be adapted for rational drug design. The Bayesian methods developed for discovering item associations can be used to find interesting interactions and can help in the study of historical documents.
The protein-protein interaction prediction software can be helpful to medical and biological researchers. The protein folding algorithm the PI developed can have an impact on rational drug design, which is a key element in modern drug and treatment development. The text mining tools the team developed for protein-protein interactions and for item associations can have an impact on individualized medicine, the detection of trends, and the prediction of patient and customer behavior. The epistasis detection and co-mutation discovery tools (for some target proteins of HIV) can be useful for developing new treatments for HIV patients. Together, these tools (for protein local structures and for predicting cooperative mutations) can become part of the individualized medicine toolkit and may change how people make medical treatment decisions. The grant has also helped the PI train a number of Ph.D. students (X Fan, Gutman, J Gu, Y Yuan, T Zhang, K Bartz) who have graduated and gone on to academic research. Four current Ph.D. students (D Fernandez, Simeng Han, Lei Guo, and Jiexing Wu) are now working on extensions of the project. The grant has also contributed to the research experience and training of postdoctoral fellows such as Jinfeng Zhang, Rajesh Chowdhary, and Ke Deng. The PI has also recruited two undergraduate students into the research project. The related algorithmic developments are useful for teaching graduate and undergraduate students. Together with his colleague X Shirley Liu, the PI continued to develop an undergraduate course on computational biology and bioinformatics, and revised a graduate-level course on statistical computing and Monte Carlo methods.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0706989
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2007-07-01
Budget End: 2013-06-30
Support Year:
Fiscal Year: 2007
Total Cost: $629,206
Indirect Cost:
Name: Harvard University
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02138