III: Small: Simultaneous Decomposition and Predictive Modeling on Large Multi-Modal Data

Ghosh, Joydeep

Abstract

Several modern data mining applications involve predictive modeling on large amounts of multi-relational data with added structures such as product hierarchies or social networks among customers. The broad goal of this proposal is to develop a comprehensive framework for predictive modeling on large, heterogeneous, multi-relational data based on "Simultaneous Decomposition and Prediction" (SDaP) approaches, which iteratively partition the problem into more homogeneous and manageable pieces while concurrently building multiple predictive models, one for each piece. Such approaches lead to simpler and more accurate solutions. The proposed algorithmic strategies determine how many models to learn and where they should apply, which data to discard and which to keep, how to learn multiple related tasks defined on multi-modal data, and how to implement the solutions scalably on distributed computers. Together they provide practical solutions to certain real-world problems for which current learning and data mining techniques are severely lacking. The application domains of ecology, bioinformatics, market research and web mining are specifically identified and targeted.

The proposed project has two broad research impacts: (a) it further invigorates data mining research on better algorithms for predictive modeling on rich and heterogeneous multi-modal data, and (b) it provides and promotes the SDaP approach as a fundamental data analysis tool across multiple disciplines. The PI will organize a workshop and offer a tutorial at major data mining conferences to foster and promote research on various aspects of SDaP analysis. Moreover, the curated complex datasets and software developed under this project will be shared with the scientific community via a public web site as part of the proposed one-of-a-kind multi-relational data benchmarking facility. The PI will further develop a novel graduate course on Modeling and Analysis of Complex Data. Outreach modules that illustrate data analysis concepts and capabilities at levels appropriate for pre-college students will also be developed. For further information, see the project web site: www.ideal.ece.utexas.edu/projects/sdap/

Project Report

The broad goal of this project was to develop a comprehensive framework for predictive modeling on large, heterogeneous dyadic data, where the entities may have associated covariates or other "side information". We researched and designed several approaches that are primarily based on "Simultaneous Decomposition and Prediction" (SDaP) methods, which iteratively partition the problem into more homogeneous and manageable pieces while concurrently building multiple predictive models, one for each piece. Such approaches lead to simpler and more accurate solutions.

We also formulated a novel framework called C3E that combines classifier ensembles and cluster ensembles to make use of both labeled and (subsequent) unlabeled data, even when the underlying models change over time. We demonstrated its effectiveness through extensive experiments, and a journal version of C3E is to appear in ACM TKDD. We also devised an ensemble-based approach to the imbalanced class problem (when one class is very rare) using alpha-divergence. A journal paper on this work has appeared in IEEE Trans. KDE. We developed a theory of Constrained Relative Entropy Minimization that applies Bayesian methods in situations where domain knowledge is not easily captured via a prior. Several publications have already resulted from this line of research (including one that received Amazon's Best Student Paper award at UAI'13), and a journal [...] The problem of ranking on networks is closely related to this project. We have developed a set of tools using ideas of monotonicity and convexity, with results that outperform state-of-the-art ranking methods such as CofiRank.

Broader Impacts: The algorithms developed in this project have widespread applicability. Applications to bioinformatics, market research and web mining (e.g. recommender systems, ranking) have already been demonstrated. We have demonstrated improved ways of associating genes with diseases. We have also applied SDaP concepts to high-throughput phenotype extraction from large-scale electronic health record (EHR) data. Since phenotyping is currently very tedious and time-consuming, this has enormous implications for health information technology. Three students who were supported by this grant have successfully completed their PhDs. The insights arising from this project have been widely disseminated through publications, public-domain code, lectures in class and at conferences, and multiple keynote talks.
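
The alternating idea behind SDaP described in the report (partition the data, fit one model per piece, repeat) can be illustrated with a deliberately simplified sketch. This is not the project's algorithm: it assumes ordinary feature vectors rather than dyadic data, swaps in clusterwise linear regression as a stand-in, and the function name `fit_sdap_sketch` and all of its parameters are hypothetical.

```python
import numpy as np

def fit_sdap_sketch(X, y, n_pieces=2, n_iters=20, seed=0):
    """Illustrative only: alternate between fitting one linear model per
    piece and reassigning each sample to the piece that predicts it best."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    Xb = np.hstack([X, np.ones((n_samples, 1))])      # append intercept column
    labels = rng.integers(n_pieces, size=n_samples)   # random initial partition
    weights = np.zeros((n_pieces, Xb.shape[1]))
    for _ in range(n_iters):
        # Prediction step: least-squares fit on each piece.
        for k in range(n_pieces):
            mask = labels == k
            if mask.sum() >= Xb.shape[1]:  # skip pieces too small to fit
                weights[k], *_ = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)
        # Decomposition step: move each sample to its best-fitting model.
        errors = (Xb @ weights.T - y[:, None]) ** 2
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # partition is stable, so the models will not change either
        labels = new_labels
    return weights, labels
```

Calling `fit_sdap_sketch(X, y, n_pieces=3)` on an `(n, d)` feature array and a length-`n` target returns one weight vector per piece (with the intercept last) and a piece label for every sample. Each step can only lower the total squared error or leave it unchanged, but the result depends on the random initial partition and may be a local optimum.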

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1017614
Program Officer: Sylvia Spengler

Budget Start: 2010-09-01
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $489,323

Institution: University of Texas Austin, Austin, TX, United States
