This research develops theory for random forests with the specific aim of facilitating its use in practical settings. Theoretical considerations include balancedness, subtrees, node distributions, node splitting, depth of variables, and other novel tree concepts. These concepts are used to improve prediction and variable selection for random forests in both high- and low-dimensional problems.

One of the simplest techniques for improving the performance of a statistical method such as a tree is to average it over multiple instances of the data. This averaging process is often referred to as ensemble learning and has attracted considerable attention, as it has been widely observed that combining elementary learners can yield a predictor with superior prediction performance. One of the most successful tree ensemble learners is random forests. Random forests has met with considerable empirical success, yet much is still unknown about it. This research seeks to improve our understanding of random forests and to use this knowledge to enhance its application in practical settings. The research focuses on cardiovascular disease, the number one cause of death in the developed world; cancer staging and prognostication for cancer patients; and identifying and developing genotype signatures for myelodysplastic syndromes, a heterogeneous group of diseases of blood stem cells with no current curative medical therapy.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1104830
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2011-07-01
Budget End: 2011-10-31
Support Year:
Fiscal Year: 2011
Total Cost: $63,874
Indirect Cost:
Name: Cleveland Clinic Lerner
Department:
Type:
DUNS #:
City: Cleveland
State: OH
Country: United States
Zip Code: 44195