Several modern data mining applications involve predictive modeling on large amounts of multi-relational data with added structures such as product hierarchies or social networks among customers. The broad goal of this proposal is to develop a comprehensive framework for predictive modeling on large, heterogeneous, multi-relational data based on "Simultaneous Decomposition and Prediction" (SDaP) approaches that iteratively partition the problem into more homogeneous and manageable pieces while concurrently building multiple predictive models, one for each piece. Such approaches lead to simpler and more accurate solutions. The proposed algorithmic strategies that determine how many models to learn and where they should apply, which data to discard and which to keep, how to learn multiple related tasks defined on multi-modal data, and how to scalably implement the solutions on distributed computers, provide practical solutions to certain real-world problems for which current learning and data mining techniques are severely lacking. Application domains of ecology, bioinformatics, market research and web mining are specifically identified and targeted.

There are two broad research impacts of the proposed project: (a) it further vitalizes research in data mining towards better algorithms for predictive modeling on rich and heterogeneous multi-modal data, and (b) it provides and promotes the SDaP approach as a fundamental data analysis tool across multiple disciplines. The PI will organize a workshop and offer a tutorial at major data mining conferences to foster and promote research on various aspects of SDaP analysis. Moreover, the curated complex datasets and software developed under this project will be shared with the scientific community via a public web site as part of the proposed one-of-a-kind multi-relational data benchmarking facility. The PI will further develop a novel graduate course on Modeling and Analysis of Complex Data. Outreach modules that illustrate data analysis concepts and capabilities at levels appropriate for pre-college students will also be developed. For further information, see the project web site: www.ideal.ece.utexas.edu/projects/sdap/

Project Report

The broad goal of this project was to develop a comprehensive framework for predictive modeling on large, heterogeneous dyadic data, where the entities may have associated covariates or other "side information". We researched and designed several approaches that are primarily based on "Simultaneous Decomposition and Prediction" (SDaP) methods, which iteratively partition the problem into more homogeneous and manageable pieces while concurrently building multiple predictive models, one for each piece. Such approaches lead to simpler and more accurate solutions.

We also formulated a novel framework called C3E that combines classifier ensembles and cluster ensembles to exploit both labeled and (subsequent) unlabeled data, even when the underlying models change over time. We showed its power via extensive experiments, and a journal version of C3E is to appear in ACM TKDD. We also devised an ensemble-based approach to the imbalanced class problem (when one class is very rare) using alpha-divergence. A journal paper on this work has appeared in IEEE Trans. KDE.

A theory of Constrained Relative Entropy Minimization was developed that applies Bayesian methods in situations where it is not easy for domain knowledge to be captured via a prior. Several publications have already resulted from this line of research (including one that received Amazon's Best Student Paper at UAI'13), and a journal

The problem of ranking on networks is closely related to this project. We have developed a set of tools using ideas of monotonicity and convexity, with results that outperform state-of-the-art ranking methods such as CofiRank.

Broader Impacts: The algorithms developed in this project have widespread applicability. Applications to bioinformatics, market research and web mining (e.g. recommender systems, ranking) have already been demonstrated. We have demonstrated improved ways of associating genes with diseases. We have also applied SDaP concepts to high-throughput phenotype extraction from large-scale EHR data. Since phenotyping is currently very tedious and time-consuming, this has enormous implications for health information technology. Three students who were supported by this grant have successfully completed their PhDs. The insights arising from this project have been widely disseminated through publications, public-domain code, lectures in class and in conferences, and multiple keynote talks.
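
The alternating idea behind SDaP described above (partition the data, fit one model per piece, then repartition using those models) can be illustrated with a minimal sketch. This is not the project's algorithm: it substitutes plain clusterwise linear regression as a stand-in, and the function name `fit_piecewise_linear` and the parameters `k`, `n_iter`, and `seed` are hypothetical names chosen for illustration.

```python
import numpy as np

def fit_piecewise_linear(X, y, k=2, n_iter=20, seed=0):
    """Illustrative stand-in (clusterwise linear regression), not SDaP itself.

    Alternates between (1) fitting one least-squares model per piece and
    (2) reassigning each point to the piece whose model predicts it best.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])   # append a bias column
    assign = rng.integers(0, k, size=n)    # random initial partition
    W = np.zeros((k, Xb.shape[1]))         # one weight vector per piece
    for _ in range(n_iter):
        # Prediction step: refit the model of every sufficiently large piece.
        for j in range(k):
            mask = assign == j
            if mask.sum() >= Xb.shape[1]:
                W[j], *_ = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)
        # Decomposition step: move each point to its best-fitting model.
        errors = (Xb @ W.T - y[:, None]) ** 2   # shape (n, k)
        new_assign = errors.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                               # partition is stable
        assign = new_assign
    return W, assign

# Usage on synthetic data with two linear regimes (hypothetical example).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.where(X[:, 0] < 0, 2 * X[:, 0] + 1, -3 * X[:, 0] + 1)
W, assign = fit_piecewise_linear(X, y, k=2)
```

Under these assumptions, each pass can only lower (or leave unchanged) the total squared error of the points under their assigned models, which is why the partition and the per-piece models are refined together rather than in two separate stages.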

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1017614
Program Officer: Sylvia Spengler
Budget Start: 2010-09-01
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $489,323
Name: University of Texas Austin
City: Austin
State: TX
Country: United States
Zip Code: 78759