Historically, statistics has dealt with the problem of extracting as much information as possible from a small data set. Over the last decade, however, technological advances in fields such as image processing, computational biology, climatology, economics and finance have made the analysis of data sets with enormous numbers of predictors one of the most important active research topics in statistics. Such large-scale problems may be abstracted as statistical regression and classification problems in which the number of explanatory variables is much larger than the number of observations. In these situations some form of regularization is essential. The investigators study a general class of penalty functions and the theoretical properties of the resulting regularization methods in regression and classification settings. In addition, two specific penalty functions, each motivating a different methodology, are developed. The theoretical and empirical properties of these methods are investigated in the most common setting, linear regression. Finally, the investigators study extensions of the methodologies to areas that are less well explored in the high-dimensional setting, namely mixed effects models, functional linear regression, and classification problems.
The proposed research is expected to have a broad impact on the practice and teaching of statistics, as well as on fields outside statistics. The common theme underlying the entire proposal is the development of general regularization penalties and related methodologies for high-dimensional problems. Together, the investigators have direct connections to many fields outside statistics, such as Computational Biology, Finance, Marketing, Machine Learning, and Econometrics. The investigators will systematically develop software implementing the proposed methods as free software packages, for example for R, make them readily available, and publicize them in all these fields. High-dimensional data are becoming increasingly common, so the developed methodologies and software will be widely utilized. The research will also contribute to the training and development of future data analysts, including both statisticians and researchers outside statistics who analyze data.
This three-year project resulted in multiple novel statistical methods for variable selection, which can be used to analyze the big data sets that commonly arise in various scientific areas and real-life applications. Variable selection is a technique for selecting a subset of important variables (features) to enable robust statistical analyses. It is especially useful for analyzing big data sets, and it helps people better understand their data by identifying which features are important and how they are related to each other. For example, in order to accurately predict movie revenues for marketing and other strategic planning, it is important for decision makers to know which variables are most related to movie revenues and how they are related. In the study of cancer, it is crucial for biologists to know which genes, out of thousands, are expressed differently in cancer patients and in healthy individuals. Classic variable selection methods can handle only a small number of variables, and thus are in general insufficient for analyzing big data sets with a large number of variables. Regularization is one of the most popular methods for variable selection. In this project, the PI and co-PI proposed and studied innovative variable selection methods via regularization in commonly used statistical models, including the functional regression model, mixed effects models, and the robust regression model. The proposed methods have been carefully studied and justified. For instance, the functional additive regression method has been applied to the Hollywood Stock Exchange data, and the important variables for movie revenue prediction have been successfully identified. As a consequence, the prediction accuracy has been improved in comparison to currently available methods.
A gene expression data set for studying cancer has been analyzed using the proposed adaptive robust variable selection method, and the important genes accounting for cancer classification have been selected. In summary, the methods proposed in this project have improved variable selection results in various statistical models and enhanced statistical modeling and predictive power.
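
To make the idea of variable selection via regularization concrete, the sketch below fits a standard L1-penalized (lasso) linear regression by cyclic coordinate descent on simulated data with more predictors than observations. It is only an illustration under stated assumptions: the lasso is a textbook penalty swapped in for demonstration and is not one of the penalties or methods developed in this project, and the function names (`soft_threshold`, `lasso_cd`), the simulated data, and the penalty level `lam` are all hypothetical choices.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values in [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic
    coordinate descent. X is a list of n rows, each a list of p floats.
    (Illustrative helper, not the project's method.)"""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual, i.e. the
            # residual with predictor j's current contribution added back.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Simulated example (assumed setup): 30 observations, 60 predictors,
# of which only the first three actually influence the response.
random.seed(0)
n_obs, n_pred = 30, 60
true_beta = [3.0, -2.0, 1.5] + [0.0] * (n_pred - 3)
X_sim = [[random.gauss(0.0, 1.0) for _ in range(n_pred)] for _ in range(n_obs)]
y_sim = [
    sum(x * b for x, b in zip(row, true_beta)) + 0.1 * random.gauss(0.0, 1.0)
    for row in X_sim
]

beta_hat = lasso_cd(X_sim, y_sim, lam=0.3)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("selected predictor indices:", selected)
```

The penalty sets many coefficients exactly to zero, so the predictors with nonzero coefficients form the selected subset. Larger values of `lam` select fewer variables; in practice the penalty level would be chosen by a data-driven procedure such as cross-validation.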