This project develops the computational and statistical principles of mining local latent correlations in extremely high-dimensional data. Recent advances in experimental technologies have made it possible to collect data of extremely high dimensionality. Examples include gene expression, genetic variation, and protein and DNA sequence data. A key problem in analyzing such data is finding latent local correlations among features. Such local correlations exist only in feature subspaces and may involve more than two features. The large number of features and the noisy nature of the data make modeling and identifying such local correlations, and assessing their statistical significance, a challenging research problem.
The project aims to develop tools that enable users to mine and explore local correlations efficiently and effectively. It seeks to develop (1) effective models to capture the local correlations among features; (2) scalable algorithms to identify local correlations in extremely high-dimensional data; and (3) robust methods to assess the statistical significance of the identified correlations. The proposed methods combine the advantages of dimension reduction, intrinsic dimensionality estimation, information-theoretic approaches, and hypothesis testing for modeling and identifying significant local correlations.
The resulting tools will assist scientists in many disciplines, including biologists studying gene function and medical doctors seeking to understand disease progression and to find new and effective treatments. The research results will be published in peer-reviewed data mining and bioinformatics journals and conferences and integrated into the educational and outreach programs at CWRU. The project Web site (http://engr.case.edu/zhang_xiang) will be used to disseminate research results, including publications, data, and software.