Recent advances in science and technology pose the major challenge of handling large amounts of data. Often the main objective is to retain the relevant features or covariates in the data and to filter out the redundant variables. In the statistical framework, this is known as variable selection. The problem becomes extremely difficult when the number of covariates is large relative to the available sample size. Despite a great deal of research effort on variable selection, knowledge of how to model the dependence between the important variables is very limited and urgently needed in many fields. Penalization-based techniques like the Lasso can be used for variable selection, but they cannot detect any grouping or dependency structure among the covariates. While methods like the Elastic-Net and the Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) may be used to incorporate grouping, one apparent major drawback of such methods is that they are not based on any probabilistic framework. The project aims to develop probabilistic models or priors that incorporate the dependency relationship, for simultaneous variable selection and grouping of closely related variables. The developed model will automate the process as much as possible using highly flexible Bayesian models characterized by a special dependency structure. This dependency structure is formulated through an extension of the Laplacian matrices of graphs (graph Laplacians) used in the machine learning and pattern recognition literature for finding good clusters. The proposed prior distribution for the graph Laplacian allows conjugacy and thereby greatly simplifies the computation. The graph Laplacian prior proposed in this research is well suited to data sets of small and moderately high dimension. For data sets with a massive number of predictors, explicit modeling of the pairwise dependence through the graph Laplacian is infeasible. Therefore, another goal of this research is to build a coherent Bayesian model for very high dimensional data sets that reduces the dimension and at the same time detects the clusters of nonzero coefficients through the graph Laplacian prior formulation, applied to the reduced-dimension data set. The proposed Bayesian variable selection and grouping methods will be developed for continuous response data as well as for binary and count response data.
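
To make the role of the graph Laplacian concrete, the following minimal sketch builds the standard (unextended) Laplacian L = D - W of a small covariate graph and evaluates the quadratic form beta' L beta, which equals the weighted sum of squared differences between coefficients of linked covariates. It is not the project's prior or its extension of the Laplacian; the four-covariate graph, the coefficient vectors, and the function names are hypothetical and chosen only for illustration.

```python
def graph_laplacian(weights):
    """Return L = D - W for a symmetric weight matrix (list of lists)."""
    n = len(weights)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        # Diagonal entry is the weighted degree of node i (self-loops ignored).
        lap[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return lap


def quadratic_form(lap, beta):
    """Compute beta' L beta."""
    n = len(beta)
    return sum(beta[i] * lap[i][j] * beta[j] for i in range(n) for j in range(n))


def pairwise_penalty(weights, beta):
    """Compute the sum over edges i<j of w_ij * (beta_i - beta_j)^2."""
    n = len(beta)
    return sum(
        weights[i][j] * (beta[i] - beta[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Hypothetical graph on four covariates: 0-1 are linked and 2-3 are linked.
W = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
L = graph_laplacian(W)

# Hypothetical coefficient vectors: one agrees with the graph's groups, one does not.
grouped = [2.0, 2.0, -1.0, -1.0]
ungrouped = [2.0, -1.0, 2.0, -1.0]

print(quadratic_form(L, grouped))
print(quadratic_form(L, ungrouped))
```

Coefficients that are equal within each linked pair give a quadratic form of zero, while coefficients that differ across linked covariates give a larger value. A Gaussian-type prior whose precision is built from such a matrix therefore favors similar coefficients for connected covariates, which is the intuition behind using a Laplacian to encourage grouping.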

Owing to enormous progress in computer technology, the explosion of internet-based information, and emerging fields in the biological sciences, high-dimensional complex data sets are now very common. The first big challenge in handling such data sets is to identify the features or covariates that are directly related to, and most important for, the desired response or outcome. In statistics this is commonly referred to as variable selection, and it is the first objective of this project. The second goal of the project is to find any grouping pattern among the selected variables and to improve the understanding of how the features or covariates are related to one another. The investigator proposes a new methodological framework to address these challenges. To account for the associated uncertainty, the proposed methodology is based on a probabilistic, or Bayesian, framework. The proposed models will be implemented in practice through fast computer algorithms that are able to handle data sets of any size. The proposed work has enormous potential for real-life applications, especially in computer science, engineering, genetics, marketing research, and medicine. For example, in biology the methods can be used to detect active genes and to study the relationships among those genes. Another example arises in market segmentation, where the methods can help decision makers target smaller markets and reach all customers effectively. The principal investigator will distribute freely available, easy-to-use software along with a short tutorial, which will allow researchers from other disciplines to address their own scientific questions using the proposed methods. Based on the ideas developed in this project, the principal investigator will develop new graduate courses, enhance existing courses, and actively involve students in the research. This will train the current generation of students to deal with future challenges.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1106717
Program Officer: Gabor J. Szekely
Budget Start: 2011-08-15
Budget End: 2015-07-31
Fiscal Year: 2011
Total Cost: $121,279
Name: University of Missouri-Columbia
City: Columbia
State: MO
Country: United States
Zip Code: 65211