This SBIR project aims to produce superior methods and software for classification and regression when there are many potential predictor variables to choose from. The methods should (1) produce stable results, where small changes in the data do not produce major changes in the variables selected or in model predictions; (2) produce accurate predictions; (3) facilitate scientific interpretation, by selecting a smaller subset of predictors that provides the best predictions; (4) allow continuous and categorical variables; and (5) support linear regression, logistic regression (predicting a binary outcome), survival analysis, and other types of regression.
This project is based on least angle regression, which unifies and provides a fast implementation for a number of modern regression techniques. Least angle regression has great potential, but currently available software is limited in scope and robustness. The outcome of this project should be software that is more robust and widely applicable. The software would apply broadly, including to medical diagnosis, cancer detection, feature selection in microarrays, and modeling of patient characteristics such as blood pressure.
Phase I work demonstrates feasibility by extending least angle regression in three key directions: categorical predictors, logistic regression, and a numerically accurate implementation. Phase II goals include extensions to other types of explanatory variables (e.g. polynomial or spline functions, and interactions between variables), to survival and other additional regression models, and to missing data and massive data sets.
The proposed software will enable medical researchers to obtain high prediction accuracy, together with stable and interpretable results, in high-dimensional situations. Predicting outcomes based on covariates, determining which covariates most affect outcomes, and adjusting treatment effect estimates for covariates are among the most important problems in biostatistics.
Prediction and feature selection are particularly difficult when there are more possible features than samples; gene microarrays and protein mass spectrometry are extreme examples of this, producing thousands to millions of measurements per sample. Least angle regression (LARS) excels at feature selection; the proposed software should enable medical researchers to obtain stable and interpretable models with better prediction accuracy in high-dimensional situations.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 7R44GM074313-04
Application #: 7748342
Study Section: Special Emphasis Panel (ZRG1-HOP-E (10))
Program Officer: Lyster, Peter
Project Start: 2005-05-15
Project End: 2011-09-30
Budget Start: 2008-12-01
Budget End: 2011-09-30
Support Year: 4
Fiscal Year: 2007
Total Cost: $168,203
Indirect Cost:
Name: Insilicos
Department:
Type:
DUNS #: 126643241
City: Seattle
State: WA
Country: United States
Zip Code: 98109
Fraley, Chris; Percival, Daniel (2015) Model-Averaged ℓ1 Regularization using Markov Chain Monte Carlo Model Composition. J Stat Comput Simul 85:1090-1101
Percival, Daniel; Roeder, Kathryn; Rosenfeld, Roni et al. (2011) Structured, Sparse Regression with Application to HIV Drug Resistance. Ann Appl Stat 5:628-644