This application presents a comprehensive research plan for investigating a general framework and several new methods for handling complex, large-scale data sets generated by biological and medical studies as well as other scientific studies. The proposal articulates two goals: theory development and application in biology and medicine. The first focuses on a general, model-free core framework that effectively addresses the major issues arising from high dimensional data. For the second, the investigators seek to apply the methods developed in the theory part to machine learning problems that arise in biology and medicine. In particular, the team intends to study problems related to predicting biological and medical responses to treatment, clinical diagnosis of diseases (such as cancers), discovery of protein-protein interactions, and construction of biological networks related to disease etiology and motif identification. To achieve these two goals, the investigators will study theoretical and practical properties under a general setting and evaluate a series of novel statistical and computational procedures and software, which will then be tested on a broad range of real and simulated data, some from ongoing studies.
The emergence of high dimensional data in most scientific fields poses new challenges for statisticians. Methods that succeed with low dimensional data are no longer effective for high dimensional data. One of the greatest difficulties in analyzing these data is identifying the informative variables (features) and their associated clusters, and deciphering the characteristics of the interactions between these variables and clusters. To meet current and future needs for extracting hidden knowledge from high dimensional data comprehensively and systematically, new methods must be developed. The current project is a direct response to this need. Building on theoretical evidence (preliminary results) already obtained in extracting low dimensional information, the team plans to apply and develop effective procedures that address practically important problems in biology and medicine. The investigators will study a novel screening process, applicable across fields, to demonstrate how high quality classifiers of low dimensionality can be identified while the joint information among the influential variables is fully utilized. To support further interpretation and biological validation, the team will study how to construct biological networks based on low dimensional classifiers and how to identify significant association patterns among them. A feedback mechanism will be established between the methodology development and biological validation teams, through which statistical and computational results will be regularly discussed and biologically validated. It is anticipated that the key ideas and methods developed here will find numerous applications in disciplines other than biology and medicine. The proposed research is likely to advance knowledge substantially and to benefit current and future efforts in molecular biology, statistics, computational biology, disease prediction, and drug discovery.
The project will also provide valuable research experience and training to undergraduates.