This research develops theory for random forests with the aim of facilitating its use in practical settings. Theoretical considerations include balancedness, subtrees, node distributions, node splitting, depth of variables, and other novel tree concepts. These concepts are used to improve prediction and variable selection for random forests in both high- and low-dimensional problems.

One of the simplest techniques for improving the performance of a statistical method such as a tree is to take its average over multiple instances of the data. This averaging process is often referred to as ensemble learning and has attracted considerable attention, as it has been widely observed that combining elementary learners can yield a predictor with superior prediction performance. One of the most successful tree ensemble learners is random forests. Random forests has met with considerable empirical success, yet much is still unknown about it. This research seeks to improve our understanding of random forests and to use this knowledge to enhance its application in practical settings. The applications considered are cardiovascular disease, the number one cause of death in the developed world; cancer staging and prognostication for cancer patients; and identifying and developing genotype signatures for myelodysplastic syndromes, a heterogeneous group of diseases of blood stem cells with no current curative medical therapy.

Project Report

Ensemble learning involves the simple task of taking elementary procedures (base-learners) and combining them to form an ensemble. This simple process often yields a predictor with superior performance; one of the most successful examples is random forests (RF), an ensemble formed using random tree base-learners. In this project a unified theory for the splitting properties of RF was developed, which yields not only a deeper understanding of the method but also points to means for improving it in applications. It was shown that a class of weighted splitting rules possesses a unique adaptive property to signal and noise; in particular, under noise, weighted splitting favors end-cut splits (splits near the edge of a variable's range). While end-cut splits have traditionally been viewed as undesirable for single trees, they are beneficial to RF for several reasons. This points to means for developing more general splitting rules, including unsupervised rules and multivariate rules, which could be used for missing data analysis and multivariate regression problems. The project has also contributed to the extension of RF to more general problems. RF has traditionally been used for regression and classification. This project has extended its applications to other data analysis settings, such as competing risks, a data problem often seen in medical studies. A unified, user-friendly, parallel-enabled, open-source RF software package was developed and is available to the general public; it should be useful to scientists worldwide as a general, powerful data analysis tool.
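The end-cut behavior described above can be illustrated with a small simulation. The sketch below is not the project's software. It assumes the standard weighted variance (CART-style) regression split as one member of the weighted class, and the function names, sample size, jump size, and edge fraction are hypothetical choices made only for illustration. On pure noise, the best split should land near the edges far more often than a uniformly placed split would; with a clear step in the response, the best split should land near the step.

```python
import random


def best_split_position(y):
    """Return the left-node size k (1..n-1) of the best weighted variance split.

    y is assumed to be ordered by the splitting variable x. Minimizing
    (n_L/n)*Var_L + (n_R/n)*Var_R is equivalent to maximizing
    S_L^2/n_L + S_R^2/n_R, where S_L and S_R are the node sums of y.
    """
    n = len(y)
    total = sum(y)
    best_k, best_score = None, float("-inf")
    left_sum = 0.0
    for k in range(1, n):
        left_sum += y[k - 1]
        right_sum = total - left_sum
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def end_cut_fraction(n=100, trials=500, edge=0.10, seed=0):
    """Fraction of pure-noise trials whose best split falls in the outer edges.

    A split counts as end-cut when k <= edge*n or k >= (1 - edge)*n
    (hypothetical threshold, chosen for illustration).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no signal at all
        k = best_split_position(y)
        if k <= edge * n or k >= (1 - edge) * n:
            hits += 1
    return hits / trials


def step_signal_sample(n=100, jump=5.0, seed=1):
    """Noisy response with a mean shift of `jump` at the midpoint of x."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) + (jump if i >= n // 2 else 0.0)
            for i in range(n)]


if __name__ == "__main__":
    # Roughly 20% of candidate positions lie in the outer edges, so a value
    # well above 0.2 indicates a preference for end-cut splits under noise.
    print("end-cut fraction under noise:", end_cut_fraction())
    # With a strong step signal the chosen split should sit near n/2.
    print("best split with signal:", best_split_position(step_signal_sample()))
```

The two cases together show the adaptive behavior in miniature: the same rule cuts near the edge when there is nothing to find and cuts near the change point when there is.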

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1148991
Program Officer: Gabor Szekely
Budget Start: 2011-07-01
Budget End: 2014-06-30
Fiscal Year: 2011
Total Cost: $159,999
Name: University of Miami School of Medicine
City: Coral Gables
State: FL
Country: United States
Zip Code: 33146