PI Name: Cherkassky, Vladimir S.
PI Institution: University of Minnesota-Twin Cities

The objective of this research is to investigate emerging technologies for estimating predictive models from sparse heterogeneous data. Such problems are common in biomedical applications, such as micro-array data analysis and structural and functional brain imaging. In medical applications, diagnostic (predictive) models are usually estimated from heterogeneous data. For example, for cancer diagnosis, patients' data may include clinical, demographic and genomic inputs. The proposed approach emphasizes a direct formulation of the learning problem that takes full advantage of application-domain requirements and known characteristics of the application data. The proposed work will investigate several novel learning formulations, including Learning with Structured Data and the closely related Multi-Task Learning, as well as a new method for incorporating a priori knowledge into the learning process, called Learning through Contradictions. These non-standard learning approaches can potentially improve classification (diagnostic) accuracy for many biomedical applications.

Intellectual merit is the development and improved understanding of several alternative learning settings, which show great promise for predictive modeling with sparse heterogeneous data. Relative advantages and limitations of these new approaches (vs. standard inductive learning) will be investigated for several biomedical applications.

Broader impacts include:
- Improved diagnosis for biomedical applications that utilize heterogeneous data (i.e., clinical, genetic and demographic information).
- Methodological impact of the growing importance of alternative learning formulations on data mining applications.
- Incorporating new (non-standard) learning methodologies into graduate and undergraduate curriculum.

Project Report

Learning is the ability to make inferences from repeated events (observations) in order to predict or anticipate future events. Two inter-related characteristics of learning, i.e. the ability to explain the past and to predict the future, have been known since ancient times (see Fig. 1). However, quantitative models for managing uncertainty and risk were developed fairly recently, in the 20th century, due to advances in computer technology and in the mathematical tools of statistics and machine learning.

Existing methods for data-analytic learning are based on the standard inductive-deductive approach, which comprises two distinct steps: induction, when a predictive model is estimated from past data, and deduction, when this model is used to make predictions for new inputs. Examples include most statistical and machine learning methods. Many challenging applications involve heterogeneous and high-dimensional data. For example, in medical diagnostic applications, patients' features include genetic, clinical, demographic and imaging data. Such applications are very ill-posed and require alternative learning methods, which can be viewed as extensions of standard inductive learning.

This project investigated several emerging non-standard learning settings for predictive modeling, including their mathematical formulation, the development of practical strategies for tuning the parameters of these new formulations, and several real-life medical applications. Specific technical accomplishments include the analysis and development of two non-standard learning methodologies, SVM-based Multi-Task Learning and the Universum SVM, as detailed next.

The non-standard learning methods investigated in this project can be explained using the task of gender recognition of human faces. In this case, standard inductive learning amounts to estimating a binary classifier from labeled examples of human faces. The estimated classification rule is then used to classify new (test) input images.
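As a minimal sketch of the two-step inductive-deductive approach described above, the following Python example estimates a model from labeled data (induction) and then applies it to a new input (deduction). The nearest-centroid rule, the function names `induce` and `deduce`, and the toy data are illustrative assumptions, not the methods or data used in the project.

```python
from statistics import mean


def induce(samples, labels):
    """Induction: estimate a model (one centroid per class) from labeled data."""
    centroids = {}
    for label in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == label]
        # Average each feature (column) over the samples of this class.
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def deduce(model, x):
    """Deduction: apply the estimated model to a new (test) input."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # Predict the class whose centroid is closest to x.
    return min(model, key=lambda label: sq_dist(model[label], x))


# Hypothetical two-feature training data with class labels "A" and "B".
train_x = [[0.0, 0.2], [0.1, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = ["A", "A", "B", "B"]

model = induce(train_x, train_y)        # step 1: induction
prediction = deduce(model, [0.8, 0.8])  # step 2: deduction
print(prediction)
```

The model is estimated once from past data and then reused unchanged for every new input, which is the separation between the two steps that the non-standard settings discussed in this report relax or extend.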
In many practical situations, there exists additional information about the data samples. For example, face images may come with information about a person's age, so the problem may be to recognize the gender of human faces for two separate groups, i.e. for old people and for young people. This leads to multi-task learning, where the goal is to estimate two (related) classifiers, one for 'old' faces and one for 'young' faces.

Another possibility arises when, in addition to labeled training samples, there also exist unlabeled data samples that can improve learning. In the case of gender recognition of human faces, there may be additional face images which can be readily recognized as human faces, but cannot be unambiguously classified as male or female faces. These additional samples are known as the Universum (see Fig. 2). Incorporating such unlabeled data into learning leads to so-called Universum learning, also known as learning through contradiction. Our research investigated the conditions under which the inclusion of Universum data can significantly improve generalization performance.

The main intellectual merit is the development and improved understanding of several advanced learning methodologies, such as Learning with Structured Data, Multi-Task Learning (MTL), and Universum Learning. The advanced learning methods developed in this project have been applied to several biomedical data sets, in collaboration with medical researchers from the Mayo Clinic and the University of Minnesota Medical School, who provided data for the diagnosis of Graft-versus-Host Disease in bone marrow transplant patients. Research results have been incorporated into a new textbook, Predictive Learning, published in 2013 and available at www.VCtextbook.com

Broader impacts include:
- Development of new methods for computer-aided diagnosis in biomedical applications that utilize heterogeneous data (i.e., clinical, genetic and demographic information).
- Development of software tools for non-inductive learning, including provisions for model selection.
- Methodological impact of alternative learning formulations on data mining and machine learning. In particular, these new learning formulations emphasize the application-domain knowledge required for a proper learning problem setting. This can be contrasted with a priori knowledge about the 'true' or 'good' model under existing data mining and machine learning algorithms. This new methodology leads to a different understanding of data-driven knowledge discovery in many biomedical applications.
- Incorporating new non-inductive methodologies into graduate and undergraduate curriculum.
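The Universum learning setting described in this report can be illustrated with a small Python sketch. It assumes a linear classifier trained by plain subgradient descent on a common Universum SVM objective: hinge loss on the labeled samples plus an epsilon-insensitive penalty that keeps Universum samples close to the decision boundary. The function names, the parameters `C`, `C_u`, `eps`, `lr` and `epochs`, and the toy data are all hypothetical and are not taken from the project's software.

```python
def decision(w, b, x):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_universum_svm(labeled, universum, C=1.0, C_u=1.0, eps=0.1,
                        lr=0.01, epochs=2000):
    """Subgradient descent on
    0.5*||w||^2 + C * sum(hinge) + C_u * sum(max(0, |f(u)| - eps))."""
    dim = len(labeled[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regularizer
        for x, y in labeled:
            # Hinge loss is active when the margin is below 1.
            if y * decision(w, b, x) < 1.0:
                for j in range(dim):
                    grad_w[j] -= C * y * x[j]
                grad_b -= C * y
        for u in universum:
            f = decision(w, b, u)
            # Penalize Universum samples that fall outside the eps-tube.
            if abs(f) > eps:
                sign = 1.0 if f > 0 else -1.0
                for j in range(dim):
                    grad_w[j] += C_u * sign * u[j]
                grad_b += C_u * sign
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical data: labels are +1 / -1; Universum samples have no label.
labeled = [([2.0, 1.0], 1), ([2.0, -1.0], 1), ([3.0, 0.5], 1),
           ([-2.0, 1.0], -1), ([-2.0, -1.0], -1), ([-3.0, -0.5], -1)]
universum = [[0.0, 2.0], [0.0, -2.0], [0.1, 1.0]]

w, b = train_universum_svm(labeled, universum)
predictions = [1 if decision(w, b, x) > 0 else -1 for x, _ in labeled]
universum_scores = [decision(w, b, u) for u in universum]
print(predictions, universum_scores)
```

In this sketch the Universum samples act as a form of a priori knowledge: among classifiers that separate the labeled data, the penalty favors those whose boundary passes near samples known to belong to neither class.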

Agency: National Science Foundation (NSF)
Institute: Division of Electrical, Communications and Cyber Systems (ECCS)
Application #: 0802056
Program Officer: Paul Werbos
Project Start:
Project End:
Budget Start: 2008-05-01
Budget End: 2013-04-30
Support Year:
Fiscal Year: 2008
Total Cost: $323,796
Indirect Cost:

Name: University of Minnesota Twin Cities
Department:
Type:
DUNS #:
City: Minneapolis
State: MN
Country: United States
Zip Code: 55455