Scalable Learning with Ensemble Techniques and Parallel Computing

Nilsson, Erik

Abstract

The ability to conduct basic and applied biomedical research is becoming increasingly dependent on data produced by new and emerging technologies. This data has an unprecedented amount of detail and volume. Researchers therefore depend on computational tools to visualize, analyze, model, and interpret these large and complex data sets. Tools for disease detection, diagnosis, treatment, and prevention are common goals of many, if not all, biomedical research programs. Sound analytical and statistical theory and methodology for class prediction and class discovery lay the foundation for building these tools; the machine learning techniques of classification (supervised learning) and clustering (unsupervised learning) are crucial among them.
Our goal is to produce software for analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies. Ensemble techniques are recent advances in machine learning theory and methodology that have led to great improvements in accuracy and stability in data set analysis and interpretation. The results from a committee of primary machine learners (classifiers or clusterers) that have been trained on different instance or feature subsets are combined through techniques such as voting. The high prediction accuracy of classifier ensembles (such as boosting, bagging, and random forests) has generated much excitement in the statistics and machine learning communities. Recent research extends the ensemble methodology to clustering, where class information is unavailable, also yielding superior performance in terms of accuracy and stability.
In theory, most ensemble techniques are inherently parallel. However, existing implementations are generally serial and assume the data set is memory resident. Therefore, current software will not scale to the large data sets produced in today's biomedical research.
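The committee-and-voting idea described above can be illustrated with a small sketch. This is not the project's software: it is a minimal, standard-library-only example of bagging, assuming a hypothetical `train_stump` base learner (a one-feature threshold classifier) and hypothetical helper names `bagging` and `vote`. Each committee member is trained on a bootstrap resample of the instances, and predictions are combined by majority vote.

```python
import random
from collections import Counter


def train_stump(rows):
    """Fit a decision stump (one feature, one threshold) on (features, label) rows.

    Hypothetical base learner chosen for illustration only.
    """
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for x, _ in rows:
            t = x[f]  # candidate threshold taken from the data
            left = [y for xs, y in rows if xs[f] <= t]
            right = [y for xs, y in rows if xs[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label


def bagging(rows, n_learners, seed=0):
    """Train n_learners stumps, each on a bootstrap resample of rows."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows])
            for _ in range(n_learners)]


def vote(committee, x):
    """Combine the committee's predictions for x by majority vote."""
    return Counter(member(x) for member in committee).most_common(1)[0][0]


# Toy usage: one feature, label 0 below 10 and label 1 from 10 upward.
rows = [([float(i)], 0 if i < 10 else 1) for i in range(20)]
committee = bagging(rows, n_learners=25)
print(vote(committee, [3.0]), vote(committee, [16.0]))
```

Because each member is trained independently of the others, the loop inside `bagging` is the part that could, in principle, be distributed across processors.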
We propose to take two approaches to scale ensemble techniques to large data sets: data partitioning approaches and parallel computing. The focus of Phase I will be to prototype scalable classifier ensembles using parallel architectures. We intend to: establish the parallel computing infrastructures; produce a preliminary architecture and software design; investigate a wide range of ensemble generation schemes using data partitioning strategies; and implement scalable bagging and random forests based on the preliminary design. The focus of Phase II will be to complete the software architecture and implement the scalable classifier ensembles and scalable clusterer ensembles within this framework. We intend to: complete research and development of classifier ensembles; extend the classification framework to clusterer ensembles; research and develop a unified interface for building ensembles with differing generation mechanisms and combination strategies; and evaluate the effectiveness of the software on simulated and real data.
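As a rough illustration of how data partitioning and parallel training might fit together, the sketch below splits the instances into disjoint chunks, trains one learner per chunk through a worker pool, and combines the learners by voting. Everything here is an assumption made for illustration, not the proposed architecture: the function names (`partition`, `train_centroid`, `train_partitioned`, `vote`), the nearest-centroid base learner, and the use of Python's `concurrent.futures` are all hypothetical choices. A thread pool is used only to keep the example self-contained; CPython threads do not run this pure-Python training loop on several cores at once, so a process pool or a cluster scheduler would take its place in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts disjoint chunks by striding."""
    return [rows[i::n_parts] for i in range(n_parts)]


def train_centroid(rows):
    """Fit a nearest-centroid classifier on (features, label) rows.

    Hypothetical base learner: only one chunk needs to be in memory
    while a given member is being trained.
    """
    sums = {}
    counts = Counter()
    for x, y in rows:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Return the label whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def train_partitioned(rows, n_parts, max_workers=4):
    """Train one learner per disjoint chunk, submitting chunks to a pool."""
    chunks = partition(rows, n_parts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_centroid, chunks))


def vote(committee, x):
    """Combine the committee's predictions for x by majority vote."""
    return Counter(member(x) for member in committee).most_common(1)[0][0]


# Toy usage: two features, label decided by the first one.
rows = [([float(i), float(i % 3)], 0 if i < 30 else 1) for i in range(60)]
committee = train_partitioned(rows, n_parts=4)
print(vote(committee, [2.0, 1.0]), vote(committee, [55.0, 1.0]))
```

Striding is only one possible partitioning strategy; random or stratified splits would slot into `partition` without changing the rest of the sketch.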

Public Health Relevance

The development of tools for disease detection, diagnosis, treatment, and prevention is a common goal of many, if not all, biomedical research programs. These programs often rely on new types of data that have an unprecedented amount of detail and volume. Our goal is to produce software for the analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies, so that researchers who depend on computational tools can visualize, analyze, model, and interpret these large and complex data sets.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 5R44GM083965-04
Application #: 8045486
Study Section: Special Emphasis Panel (ZRG1-BST-D (10))
Program Officer: Lyster, Peter

Project Start: 2008-05-01
Project End: 2013-02-28
Budget Start: 2011-03-01
Budget End: 2013-02-28
Support Year: 4
Fiscal Year: 2011
Total Cost: $374,673

Institution

Name: Insilicos
DUNS #: 126643241

City: Seattle
State: WA
Country: United States
Zip Code: 98109

Related projects


NIH 2011 R44 GM	Scalable Learning with Ensemble Techniques and Parallel Computing	Nilsson, Erik J. / Insilicos	$374,673
NIH 2010 R44 GM	Scalable Learning with Ensemble Techniques and Parallel Computing	Nilsson, Erik J. / Insilicos	$376,899
NIH 2008 R44 GM	Scalable Learning with Ensemble Techniques and Parallel Computing	Gong, Lixin / Insightful Corporation	$25,548
NIH 2008 R44 GM	Scalable Learning with Ensemble Techniques and Parallel Computing	Nilsson, Erik J. / Insilicos	$143,361
