Theory and Applications of Random Forests

Ishwaran, Hemant

Abstract

This research develops theory for random forests with the specific aim of facilitating its use in practical settings. Theoretical considerations include balancedness, subtrees, node distributions, node splitting, depth of variables, and other novel tree concepts. These concepts are used to improve prediction and variable selection for random forests in both high- and low-dimensional problems.

One of the simplest techniques for improving the performance of a statistical method such as a tree is to take its average over multiple instances of the data. This averaging process is often referred to as ensemble learning and has attracted considerable attention, as it has been widely observed that combining elementary learners can yield a predictor with superior prediction performance. One of the most successful tree ensemble learners is random forests. Random forests has met with considerable empirical success, yet much is still unknown about it. This research seeks to improve our understanding of random forests and to use this knowledge to enhance its application in practical settings. The applications in focus are cardiovascular disease, the number one cause of death in the developed world; cancer staging and prognostication for cancer patients; and identifying and developing genotype signatures for myelodysplastic syndromes, a heterogeneous group of diseases of blood stem cells with no current curative medical therapy.

Project Report

Ensemble learning involves the simple task of taking elementary procedures (base-learners) and combining them to form an ensemble. This simple process often yields a predictor with superior performance; one of the most successful examples is random forests (RF), an ensemble formed using random tree base-learners. In this project a unified theory for the splitting properties of RF was developed, which yields not only a deeper understanding of the method but also points to means for improving it in applications. It was shown that a class of weighted splitting rules possesses a unique adaptive property to signal and noise; in particular, under noise, weighted splitting favors end-cut splits (splits near the edge of a variable's range). While end-cut splits have traditionally been viewed as undesirable for single trees, they are beneficial to RF for several reasons. This points to means for developing more general splitting rules, including unsupervised rules and multivariate rules, which could be used for missing data analysis and multivariate regression problems. The project has also contributed to the extension of RF to more general problems. RF has traditionally been used for regression and classification. This project has extended its applications to other data analysis settings, such as competing risks, a data problem often seen in medical studies. A unified, user-friendly, parallel-enabled, open-source RF software package was developed and is available to the general public; it will be useful to scientists worldwide as a general, powerful data analysis tool.
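The end-cut behavior described above can be illustrated with a small simulation. The sketch below is not the project's software. It assumes the familiar weighted-variance regression split as one example of a weighted splitting rule, and all function names and parameter values (sample size, number of replications, edge width, signal size) are hypothetical choices made for illustration. It measures how often the best split lands within the outer positions of a single ordered variable when the response is pure noise, compared with when there is a jump in the mean at the midpoint.

```python
import random


def best_weighted_variance_split(y):
    """Return the split index k (1 <= k < n) that minimizes the weighted
    within-node variance when y[:k] goes left and y[k:] goes right.

    Minimizing (n_L/n)*Var_L + (n_R/n)*Var_R is equivalent to maximizing
    S_L^2/n_L + S_R^2/n_R, where S is the sum of responses in a node.
    The responses are assumed to be ordered by the splitting variable.
    """
    n = len(y)
    total = sum(y)
    left_sum = 0.0
    best_k, best_score = None, float("-inf")
    for k in range(1, n):
        left_sum += y[k - 1]
        right_sum = total - left_sum
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def edge_split_fraction(n=100, reps=500, edge=10, signal=0.0, seed=0):
    """Fraction of simulated data sets whose best split falls within
    `edge` positions of either end of the ordered variable.

    signal=0.0 gives pure noise; signal>0 adds a jump at the midpoint.
    (Illustrative settings only, not taken from the project.)
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) + (signal if i >= n // 2 else 0.0)
             for i in range(n)]
        k = best_weighted_variance_split(y)
        if k <= edge or k >= n - edge:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # If split positions were chosen uniformly, about 20 of the 99
    # candidate positions (roughly 0.20) would count as "edge" here.
    print("noise :", edge_split_fraction(signal=0.0))
    print("signal:", edge_split_fraction(signal=3.0))
```

Under these assumptions, the noise case should place far more than the uniform share of splits near the edges, while the signal case should place almost none there, since the rule then tracks the jump at the midpoint.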

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1148991
Program Officer: Gabor Szekely

Project Start
Project End
Budget Start: 2011-07-01
Budget End: 2014-06-30
Support Year
Fiscal Year: 2011
Total Cost: $159,999
Indirect Cost

Institution: University of Miami School of Medicine, Coral Gables, FL, United States
