Learning Classifiers From Autonomous, Semantically Heterogeneous, Distributed Data

Basu, Samik; Honavar, Vasant

Abstract

Advances in networks, sensors, storage, computing, and high-throughput data acquisition have led to a proliferation of autonomous, distributed data sources in many areas of human activity. New discoveries in the biological, physical, and social sciences and in engineering are being driven by our ability to discover, share, integrate, and analyze disparate types of data. Statistically based machine learning algorithms offer some of the most cost-effective approaches to the discovery of experimentally testable predictive models and hypotheses from data. However, the large size, distributed nature, and autonomy of the data sources (and the attendant differences in access, queries allowed, processing capabilities, structure, organization, and underlying data models and data semantics) present hurdles to the effective use of machine learning. This research aims to overcome these hurdles by developing efficient, resource-aware distributed algorithms and software services to support collaborative, integrative knowledge acquisition in such a setting. The research team will implement, deploy, and evaluate the resulting algorithms using benchmark data sets, associated data models and ontologies, and user-specified inter-ontology mappings on a distributed test-bed of networked databases and services at Iowa State University and Kansas State University. The resulting open-source software can potentially transform collaborative e-science in the same way that the Web has transformed information sharing. Broader impacts of this research include enhanced opportunities for research-based training of graduate and undergraduate students, interdisciplinary collaborations, participation of under-represented groups, and development of increasingly sophisticated software to support collaborative, integrative e-science. The project web site (www.cild.iastate.edu/projects/indus.html) provides access to information about the project, benchmark data, publications, software, and documentation.

Project Report

Advances in networks, sensors, storage, computing, and high-throughput data acquisition have led to a proliferation of autonomous, distributed data sources in many areas of human activity. New discoveries in the biological, physical, and social sciences and in engineering are increasingly being driven by our ability to discover, share, integrate, and analyze disparate types of data. Statistical machine learning algorithms offer some of the most cost-effective approaches to the discovery of experimentally testable predictive models and hypotheses from data. However, the large size, distributed nature, and autonomy of the data sources (and the attendant differences in access, queries allowed, processing capabilities, structure, organization, and underlying data models and data semantics) present hurdles to the effective use of machine learning. This project aimed to address these challenges.

The key accomplishments of the project include: (i) development of the theoretical underpinnings of, and practical algorithms for, learning predictive models (e.g., classifiers) from very large, autonomous, distributed, semantically heterogeneous data sets in settings where centralized access to such data or in-memory processing of the data is neither desirable nor feasible, leading to foundational advances in big data analytics; (ii) design, implementation, documentation, and dissemination of a suite of open-source software for learning predictive models from large, distributed, richly structured, semantically disparate data sets, yielding software tools for big data analytics; (iii) successful applications of the resulting algorithms to data-driven prediction problems in bioinformatics, computational biology, and social network analytics; (iv) training of a new generation of researchers and educators (twelve PhD graduates, including five women) in machine learning, bioinformatics, and big data analytics who have gone on to pursue productive careers in academia and industry.

The results of the project have been broadly disseminated through (i) over 50 publications in rigorously refereed scientific journals and conferences in artificial intelligence, machine learning, data mining, bioinformatics, and big data analytics; (ii) several keynote talks and research presentations at professional meetings and conferences; (iii) release of open-source software; (iv) sharing of publications and software through the project website: www.cs.iastate.edu/~honavar/indus.html

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Application #: 0711356
Program Officer: Mitra Basu

Budget Start: 2007-08-15
Budget End: 2013-07-31
Fiscal Year: 2007
Total Cost: $392,467

Institution

Iowa State University, Ames, IA, United States