This research develops theory for random forests with the specific aim of facilitating its use in practical settings. Theoretical considerations include balancedness, subtrees, node distributions, node splitting, depth of variables, and other novel tree concepts. These concepts are used to improve prediction and variable selection for random forests in both high- and low-dimensional problems.
One of the simplest techniques for improving the performance of a statistical method such as a tree is to take its average over multiple instances of the data. This averaging process is often referred to as ensemble learning, and it has attracted considerable attention because combining elementary learners has been widely observed to yield a predictor with superior prediction performance. One of the most successful tree ensemble learners is random forests. Random forests has met with considerable empirical success, yet much is still unknown about it. This research seeks to improve our understanding of random forests and to use this knowledge to enhance its application in practical settings. This research focuses on three application areas: cardiovascular disease, the number one cause of death in the developed world; cancer staging and prognostication for cancer patients; and identifying and developing genotype signatures for myelodysplastic syndromes, a heterogeneous group of diseases of blood stem cells with no current curative medical therapy.