Advances in networks, sensors, storage, computing, and high-throughput data acquisition have led to a proliferation of autonomous, distributed data sources in many areas of human activity. New discoveries in the biological, physical, and social sciences and in engineering are being driven by our ability to discover, share, integrate, and analyze disparate types of data. Statistically based machine learning algorithms offer some of the most cost-effective approaches to the discovery of experimentally testable predictive models and hypotheses from data. However, the large size, distributed nature, and autonomy of the data sources (and the attendant differences in access, queries allowed, processing capabilities, structure, organization, and underlying data models and data semantics) present hurdles to the effective use of machine learning. This research aims to overcome these hurdles by developing efficient, resource-aware distributed algorithms and software services to support collaborative, integrative knowledge acquisition in such a setting. The research team will implement, deploy, and evaluate the resulting algorithms using benchmark data sets, associated data models and ontologies, and user-specified inter-ontology mappings on a distributed test-bed of networked databases and services at Iowa State University and Kansas State University. The resulting open-source software can potentially transform collaborative e-science in the same way that the Web has transformed information sharing. Broader impacts of this research include enhanced opportunities for research-based training of graduate and undergraduate students, interdisciplinary collaborations, participation of under-represented groups, and development of increasingly sophisticated software to support collaborative, integrative e-science. The project web site (www.cild.iastate.edu/projects/indus.html) provides access to information about the project, benchmark data, publications, software, and documentation.

Project Report

Advances in networks, sensors, storage, computing, and high-throughput data acquisition have led to a proliferation of autonomous, distributed data sources in many areas of human activity. New discoveries in the biological, physical, and social sciences and in engineering are increasingly being driven by our ability to discover, share, integrate, and analyze disparate types of data. Statistical machine learning algorithms offer some of the most cost-effective approaches to the discovery of experimentally testable predictive models and hypotheses from data. However, the large size, distributed nature, and autonomy of the data sources (and the attendant differences in access, queries allowed, processing capabilities, structure, organization, and underlying data models and data semantics) present hurdles to the effective use of machine learning. This project aimed to address these challenges. Its key accomplishments include: (i) development of the theoretical underpinnings of, and practical algorithms for, learning predictive models (e.g., classifiers) from very large, autonomous, distributed, semantically heterogeneous data sets in settings where centralized access to the data or in-memory processing of the data is neither desirable nor feasible, a foundational advance in big data analytics; (ii) design, implementation, documentation, and dissemination of a suite of open-source software tools for learning predictive models from large, distributed, richly structured, semantically disparate data sets; (iii) successful application of the resulting algorithms to data-driven prediction problems in bioinformatics, computational biology, and social network analytics; and (iv) training of a new generation of researchers and educators (twelve PhD graduates, including five women) in machine learning, bioinformatics, and big data analytics, who have gone on to pursue productive careers in academia and industry.

The results of the project have been broadly disseminated through (i) over 50 publications in rigorously refereed scientific journals and conferences in artificial intelligence, machine learning, data mining, bioinformatics, and big data analytics; (ii) several keynote talks and research presentations at professional meetings and conferences; (iii) release of open-source software; and (iv) sharing of publications and software through the project website (www.cs.iastate.edu/~honavar/indus.html).
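To make accomplishment (i) concrete, the sketch below shows one simple way a classifier can be learned from distributed sources without centralizing the raw records. It is an illustration under stated assumptions, not a description of the project's algorithms or software: it uses naive Bayes because that model depends on the data only through counts, and the names `DataSource`, `count_statistics`, `learn_naive_bayes`, and `predict` are hypothetical.

```python
from collections import Counter
from math import log


class DataSource:
    """Hypothetical stand-in for one autonomous source.

    Assumption: the source keeps its rows private and answers only
    aggregate count queries.
    """

    def __init__(self, rows):
        # rows: list of (feature_tuple, label) pairs held locally
        self._rows = rows

    def count_statistics(self):
        """Return (class counts, per-feature value counts) for the local rows."""
        class_counts = Counter()
        feature_counts = Counter()
        for features, label in self._rows:
            class_counts[label] += 1
            for i, value in enumerate(features):
                feature_counts[(i, value, label)] += 1
        return class_counts, feature_counts


def learn_naive_bayes(sources):
    """Sum the counts reported by each source; no raw rows are transferred."""
    class_counts, feature_counts = Counter(), Counter()
    for source in sources:
        local_classes, local_features = source.count_statistics()
        class_counts.update(local_classes)      # Counter.update adds counts
        feature_counts.update(local_features)
    return class_counts, feature_counts


def predict(model, features, values_per_feature):
    """Pick the most probable label, using add-one (Laplace) smoothing.

    values_per_feature[i] is the number of distinct values feature i can take.
    """
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = log(n / total)
        for i, value in enumerate(features):
            k = values_per_feature[i]
            score += log((feature_counts[(i, value, label)] + 1) / (n + k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage: two sources, each holding part of the data.
source_a = DataSource([(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")])
source_b = DataSource([(("sunny", "mild"), "no"), (("rainy", "hot"), "yes")])
model = learn_naive_bayes([source_a, source_b])
print(predict(model, ("rainy", "mild"), values_per_feature=[2, 2]))
```

Because the counts from separate sources simply add, the model obtained this way is identical to the one that would be learned from the pooled data. The sketch assumes all sources share one schema and vocabulary; the semantic heterogeneity described above would additionally require mapping each source's terms into a common ontology before the counts are combined.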

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0711356
Program Officer: Mitra Basu
Project Start:
Project End:
Budget Start: 2007-08-15
Budget End: 2013-07-31
Support Year:
Fiscal Year: 2007
Total Cost: $392,467
Indirect Cost:
Name: Iowa State University
Department:
Type:
DUNS #:
City: Ames
State: IA
Country: United States
Zip Code: 50011