High-throughput experimental methods have revolutionized scientific inquiry. In contrast to the hypothesis-driven scientific method, data-driven science seeks to discover and explore hypotheses supported by the huge volume of data generated in high-throughput experiments. Such datasets are large and high-dimensional: they consist of a multitude of samples and many measured attributes for each sample. A typical hypothesis corresponds to a subspace of this dataset: a subset of samples that share similar values on a subset of attributes.
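
As a concrete illustration of the subspace notion described above, the following minimal sketch checks whether a chosen subset of samples shares similar values on a chosen subset of attributes. The function name `is_subspace_pattern`, the tolerance parameter `tol`, and the toy data are hypothetical and used only for illustration; they are not the project's methods, which the abstract does not specify at this level of detail.

```python
def is_subspace_pattern(data, samples, attributes, tol):
    """Return True if, on every chosen attribute, the values of the chosen
    samples lie within a range of at most `tol`.

    `data` is a list of rows (one per sample); `samples` and `attributes`
    are non-empty lists of row and column indices. All names here are
    illustrative assumptions, not part of the project itself.
    """
    for j in attributes:
        column = [data[i][j] for i in samples]
        if max(column) - min(column) > tol:
            return False
    return True


# Toy dataset: 4 samples (rows) by 4 attributes (columns).
data = [
    [5.0, 1.0, 9.0, 2.0],
    [5.1, 7.0, 9.2, 8.0],
    [0.3, 4.0, 2.5, 6.0],
    [4.9, 3.0, 8.9, 0.5],
]

# Samples 0, 1, 3 agree closely on attributes 0 and 2.
coherent = is_subspace_pattern(data, samples=[0, 1, 3], attributes=[0, 2], tol=0.5)

# Swapping in sample 2 breaks the agreement on those attributes.
incoherent = is_subspace_pattern(data, samples=[0, 1, 2], attributes=[0, 2], tol=0.5)
```

Verifying one candidate is cheap; the difficulty lies in the number of candidates, since every combination of a sample subset with an attribute subset is a potential subspace.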

The goal of this project is to develop a series of new data mining methods that can effectively discover these subspaces, the embedded patterns among the values, and the relationships between patterns. The underlying problems are highly combinatorial, so efficient algorithms are required to enable users to mine and explore subspace patterns in large and complex datasets. The proposed methods combine the advantages of efficient matrix decomposition, effective sampling techniques, and advanced graph algorithms. Solutions to these research problems will be integrated into an interactive visual interface for exploring subspace patterns mined from experimental data.

While the proposed methods are applicable across a wide range of domains, the focus of this project is the analysis of gene regulatory networks and of protein structure, in collaboration with geneticists and pharmacologists, respectively.

Current progress and results are accessible and continuously updated at http://compgen.unc.edu/deps/

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0812464
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2008-09-01
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2008
Total Cost: $460,711
Indirect Cost:
Name: University of North Carolina Chapel Hill
Department:
Type:
DUNS #:
City: Chapel Hill
State: NC
Country: United States
Zip Code: 27599