Title: Computational inference, Monte Carlo methods, and scientific applications
With the advent of automated, high-throughput experimental protocols and data collection techniques, research and discovery in many areas of science and technology have become increasingly data-driven and computation-intensive. The applications motivating the research in this project arise from molecular biology, biotechnology and neural science. The rapid accumulation of experimental data in these areas has outstripped scientists' ability to analyze the data, and advanced statistical methods are needed to automate the analysis process and to exploit the complex data structure and extensive scientific knowledge underlying such studies. Computational inference refers to statistical modeling and inference procedures that rely on intensive computation to extract information from large-scale data and knowledge-based models. The broad, long-term goal of this project is to advance the methodologies of computational inference and apply them towards the solution of several important problems in the aforementioned scientific areas.
A critical step in almost all large-scale computational inference procedures is the study of the posterior density through Monte Carlo sampling (or the related problem of studying the likelihood function). Successful sampling leads immediately to the inference of any parameter or prediction of interest to the investigator. Thus the first specific goal of this project is to develop Monte Carlo simulation methods that are effective in sampling complex, multimodal distributions. Advances in this core computational problem will not only facilitate effective computational inference, but will also be of interest to other scientific tasks such as simulation of molecular structures and combinatorial optimization. Three approaches will be investigated: a) an evolutionary Monte Carlo approach in which a population of structures is evolved and individual structures, including recombinant ones, continuously compete for survival in the population, b) further development of sequential importance sampling and dynamic importance sampling through better methods for handling skewed weight distributions, c) multi-level computational models. Hybrid algorithms combining the above approaches will also be investigated. Some of these methods will be used to investigate the grand-challenge problem of understanding the energy landscape of protein conformation.
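To make the skewed-weight issue in approach b) concrete, the following is a minimal sketch of one standard remedy, not the method proposed in this project: monitor the effective sample size (ESS) of the importance weights and resample when it falls too low. The function names, the use of systematic resampling, and the example weights are all assumptions chosen for illustration.

```python
import math
import random


def normalize_log_weights(log_weights):
    """Convert log importance weights to normalized weights that sum to 1.

    Subtracting the maximum before exponentiating avoids overflow.
    """
    m = max(log_weights)
    unnormalized = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights.

    Equals n when all weights are equal and approaches 1 when a single
    sample carries almost all of the weight (a highly skewed distribution).
    """
    return 1.0 / sum(w * w for w in weights)


def systematic_resample(weights, rng):
    """Return n sample indices drawn in proportion to the normalized weights.

    Systematic resampling (an illustrative choice) uses one random offset
    and n evenly spaced points on the cumulative weight scale.
    """
    n = len(weights)
    start = rng.random() / n
    indices = []
    i = 0
    cumulative = weights[0]
    for k in range(n):
        point = start + k / n
        # Advance to the first sample whose cumulative weight covers the point.
        while point > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical log weights: one sample dominates the rest.
    log_w = [0.0, -5.0, -6.0, -7.0]
    w = normalize_log_weights(log_w)
    ess = effective_sample_size(w)
    print("ESS:", ess)
    # A common heuristic (an assumption here) is to resample when ESS < n / 2.
    if ess < len(w) / 2:
        print("resampled indices:", systematic_resample(w, rng))
```

Resampling duplicates heavily weighted samples and discards negligible ones, after which all weights are reset to be equal. It trades weight skewness for extra Monte Carlo variance, which is why it is usually triggered only when the ESS drops below a threshold.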
The second specific goal of this project is the development of computational inference tools for two further scientific problems: i) multiple alignment and clustering of DNA and protein sequences based on hidden Markov models, and the use of these in the analysis of human genome coding regions, ii) the development of hierarchical computational models for low-level vision tasks such as texture recognition and primal sketching.
If successful, the methods developed in this project will enable the wider application of computational inference and will also result in direct contributions to three problems of considerable importance at the current scientific frontier.