Prediction and data exploration are important aspects of modern commercial and scientific life. Regression methods predict dependent variables (e.g., tumor growth, severity of disease), while classification methods predict class membership (e.g., tumor or disease type). Both use a vector of independent variables to make the predictions. Modern machine learning methods based on tree ensembles, such as RANDOM FORESTS and MART, have become leading analytical methods because they are often superior predictors, can handle large numbers of observations and large numbers of variables, often yield insight into the data not provided by other methods, and can adapt to arbitrarily complex relationships. Here we propose to commercially implement RULEFIT, a recent innovative method extending the RANDOM FORESTS and MART approaches that shows strong evidence of being consistently more accurate than either ensemble. RULEFIT also includes groundbreaking new methods for variable selection in the face of huge numbers of predictors, and for identifying interactions and ranking their importance. Optionally, RULEFIT extracts "rules" of special interest: succinct statements of conditions under which an outcome is especially likely or unlikely, or especially large or small. The primary output of RULEFIT is a numeric value reflecting a prediction of the value of the dependent variable or the probability of class membership. RULEFIT is likely to become a leading technique in machine learning and statistics. It builds on RANDOM FORESTS and MART and includes all their useful benefits, such as variable selection, data exploration, data reduction, outlier detection, and missing value imputation, while enhancing and extending these benefits.

COMMERCIAL POTENTIAL

The market for advanced analytical tools has been growing strongly over the last decade, and the growth shows no signs of diminishing.
Modelers and data analysts in both university-based and commercial settings are increasingly aware of the power and value of new analytical tools derived from modern statistics and machine learning research. The increased accuracy of the new methods and the acceleration they provide to the analysis of complex data are fueling demand for this new technology. The advances embedded in the proposed product represent substantial improvements to existing technology and include methods to solve vexing problems in contemporary data analysis, and thus should find a welcoming market.

There are further reasons to forecast robust commercial potential for this product. The applicant organization has a strong track record in the industry and is widely recognized as a developer of high-quality software. We have been working with consultant Friedman since 1990 and have gained exclusive rights to the proprietary source code for a number of his innovations, including CART, MARS, MART, and PRIM. With the addition of RULEFIT and its associated sub-components, these products represent a unique collection of pedigreed tools. We have also forged a similar relationship with the late Leo Breiman and hold the exclusive rights to commercialize Breiman's Random Forests source code. Our proposed package thus occupies a distinctive position in machine learning software that cannot be replicated by other vendors.

Keywords: machine learning; classification; prediction; supervised learning; variable importance; interaction detection

Justification

Dr. Steinberg has extensive experience in software development for advanced statistical and machine learning methods, particularly in the areas of classification and regression trees, survival analysis, adaptive modeling, RANDOM FORESTS, and MART. He will oversee all aspects of the project. He will work with Dr. Cardell, Professor Friedman, Mr.
Colla, and the Salford Systems software development engineer in creating and studying the software and methods used in this proposal. He will also be responsible for the architecture of the Phase I software. Professor Friedman and Dr. Cardell will provide technical support as follows. Dr. Friedman is an expert on machine learning methods and is one of the developers of the RULEFIT technique; regular consultation with him will be in this area. Dr. Cardell is an expert in asymptotic theory and in the design of Monte Carlo and other tests for the evaluation of machine learning algorithms. He also has extensive experience in machine learning, including adaptive modeling, neural networks, logistic regression, and classification methods. He will review the core algorithms of RULEFIT for possible improvement and extension, and will design the Monte Carlo tests. Mr. Colla has extensive experience in software development and with machine learning methods, including work on the commercial implementations of CART, MARS, RANDOM FORESTS, and MART. Working with Dr. Cardell, he will be responsible for much of the new software coding.

Principal Investigator/Program Director (Last, first, middle): Steinberg, Dan

Prediction models based upon classification and regression tree ensembles have become important in medical and other research. There are currently no commercial products available that implement the proposed RuleFit methodology. These methods have significant advantages over existing techniques and will aid researchers in obtaining the best possible predictions.
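The summary above describes RULEFIT output as a numeric prediction, optionally accompanied by "rules" stating conditions under which an outcome is especially large or small. The minimal sketch below illustrates that idea only. It assumes the additive rule-ensemble form (an intercept plus one coefficient per satisfied rule), and every variable name, threshold, and coefficient in it is hypothetical and chosen for illustration. It is not the proposed product's code and does not show how rules or coefficients would be learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass
class Rule:
    """A human-readable condition with a fitted weight (values here are made up)."""
    description: str
    condition: Callable[[Record], bool]
    coefficient: float


def fired_rules(record: Record, rules: List[Rule]) -> List[str]:
    """Return the descriptions of the rules this record satisfies."""
    return [r.description for r in rules if r.condition(record)]


def predict(record: Record, intercept: float, rules: List[Rule]) -> float:
    """Intercept plus the coefficient of every rule the record satisfies."""
    return intercept + sum(r.coefficient for r in rules if r.condition(record))


# Hypothetical rules and weights, for illustration only.
INTERCEPT = 2.0
EXAMPLE_RULES = [
    Rule(
        description="age > 60 and marker_a > 2.5",
        condition=lambda x: x["age"] > 60 and x["marker_a"] > 2.5,
        coefficient=1.5,
    ),
    Rule(
        description="tumor_size_mm <= 10",
        condition=lambda x: x["tumor_size_mm"] <= 10,
        coefficient=-0.5,
    ),
]

if __name__ == "__main__":
    patient = {"age": 67, "marker_a": 3.1, "tumor_size_mm": 14}
    print(predict(patient, INTERCEPT, EXAMPLE_RULES))
    print(fired_rules(patient, EXAMPLE_RULES))
```

Because each rule is a plain statement of conditions, the list returned by `fired_rules` shows which conditions drove a given prediction up or down, which is the kind of insight the summary attributes to rule extraction.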

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Small Business Innovation Research Grants (SBIR) - Phase I (R43)
Project #
Application #
Study Section
Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer
Tiwari, Ram C
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Salford Systems
San Diego
United States
Zip Code