The overall objective of the research project is to provide statisticians and biological scientists with new and improved statistical tools for measuring variable importance and selecting key variables in high-dimensional biological data. The two major ingredients underlying the research are a distance-based procedure across multiple dimensions and a tilting/weighting-based importance measure for any specific dimension. The distance-based methods, e.g., the multi-response permutation procedure and the distance covariance, can handle data even when the number of dimensions exceeds the sample size. The tilting/weighting-based procedures allow variable importance to be evaluated for any dimension in the presence of any number of other variables. Thus, variable importance is evaluated in the multivariate context rather than on univariate marginal distributions. In addition, the new methods will allow the number of selected variables to exceed the sample size; allow forward selection, backward selection, and sparse penalized weighting; minimize perturbation to the dependence structures actually present in the data; require minimal structural assumptions; and be sensitive to a wide range of multivariate dependencies, including some that are difficult or even impossible to detect with existing methods.
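As a rough illustration of why distance-based statistics remain usable when the number of dimensions exceeds the sample size, the sketch below computes the standard sample distance covariance and distance correlation from pairwise Euclidean distances between observations. It is not the project's tilting/weighting procedure; the function names (`pairwise_distances`, `double_center`, `distance_covariance_sq`, `distance_correlation`) and the toy data are assumptions made only for this example.

```python
import math


def pairwise_distances(rows):
    """Euclidean distance matrix between observations (one row per sample)."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    n = len(d)
    row_means = [sum(r) / n for r in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_covariance_sq(x, y):
    """Squared sample distance covariance (V-statistic form)."""
    n = len(x)
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2


def distance_correlation(x, y):
    """Sample distance correlation; returns 0.0 if either sample is constant."""
    v_xy = distance_covariance_sq(x, y)
    v_x = distance_covariance_sq(x, x)
    v_y = distance_covariance_sq(y, y)
    if v_x * v_y == 0:
        return 0.0
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(v_xy, 0.0) / math.sqrt(v_x * v_y))


# Toy data (hypothetical): 5 samples, 8 predictor dimensions, 1 response.
x = [[(i * j) % 7 for j in range(1, 9)] for i in range(5)]
y = [[float(i**2)] for i in range(5)]
print(distance_correlation(x, y))
```

Only the n-by-n distance matrices enter the computation, so the cost of adding dimensions is confined to the distance calculation, and no p-by-p covariance matrix has to be estimated or inverted.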

The methods developed as part of this project have a wide range of applications in the biomedical and agricultural industries. Modern genomics tools allow researchers to simultaneously measure thousands of variables that contain information about DNA, RNA, and protein characteristics of organisms. The high-dimensional data generated by these modern high-throughput technologies must be mined to identify the variables that are most associated with health outcomes or other important traits. Uncovering such associations is crucial in a variety of areas, including drug discovery, genetic risk analysis, personalized medicine, and plant and animal breeding. This research project will provide tools to help make these discoveries possible. Reliable software implementations of the new methods will be created, maintained, archived in public repositories, and freely disseminated to genomics researchers and industry practitioners working with a diverse range of organisms and different high-throughput technologies. The research activity will enhance collaborations and partnerships among researchers from both computational/statistical fields and experimental/biomedical fields.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
1313224
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2013-09-15
Budget End
2016-08-31
Support Year
Fiscal Year
2013
Total Cost
$150,000
Indirect Cost
Name
Iowa State University
Department
Type
DUNS #
City
Ames
State
IA
Country
United States
Zip Code
50011