The ability to conduct basic and applied biomedical research is becoming increasingly dependent on data produced by new and emerging technologies. This data has an unprecedented amount of detail and volume. Researchers are therefore dependent on computing and computational tools to be able to visualize, analyze, model, and interpret these large and complex sets of data. Tools for disease detection, diagnosis, treatment, and prevention are common goals of many, if not all, biomedical research programs. Sound analytical and statistical theory and methodology for class pre- diction and class discovery lay the foundation for building these tools, of which the machine learning techniques of classification (supervised learning) and clustering (unsupervised learning) are crucial. Our goal is to produce software for analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies. Ensemble techniques are recent advances in machine learning theory and methodology leading to great improvements in accuracy and stability in data set analysis and interpretation. The results from a committee of primary machine learners (classifiers or clusterers) that have been trained on different instance or feature subsets are combined through techniques such as voting. The high prediction accuracy of classifier ensembles (such as boosting, bagging, and random forests) has generated much excitement in the statistics and machine learning communities. Recent research extends the ensemble methodology to clustering, where class information is unavailable, also yielding superior performance in terms of accuracy and stability. In theory, most ensemble techniques are inherently parallel. However, existing implementations are generally serial and assume the data set is memory resident. Therefore current software will not scale to the large data sets produced in today's biomedical research. We propose to take two approaches to scale ensemble techniques to large data sets: data partitioning approaches and parallel computing. The focus of Phase I will be to prototype scalable classifier ensembles using parallel architectures. We intend to: establish the parallel computing infrastructures;produce a preliminary architecture and software design;investigate a wide range of ensemble generation schemes using data partitioning strategies;and implement scalable bagging and random forests based on the preliminary design. The focus of Phase II will be to complete the software architecture and implement the scalable classifier ensembles and scalable clusterer ensembles within this framework. We intend to: complete research and development of classifier ensembles;extend the classification framework to clusterer ensembles;research and develop a unified interface for building ensembles with differing generation mechanisms and combination strategies;and evaluate the effectiveness of the software on simulated and real data.
The common goals to many, if not all, biomedical research programs are the development of tools for disease detection, diagnosis, treatment, and prevention. These programs often rely on new types of data that have an unprecedented amount of detail and volume. Our goal is to produce software for the analysis and interpretation of large data sets using ensemble machine learning techniques and parallel computing technologies to enable researchers who are dependent on computational tools to have the ability to visualize, analyze, model, and interpret these large and complex sets of data.