Over the last decade, the analysis of high-dimensional data and, in particular, estimation on high-dimensional spaces has drawn increasing attention. Such spaces are often structured, either through constraints on specific components of the space or through prior information that identifies co-dependence patterns between components and helps carry out the inference. Given the recent predominance of discrete inference problems in many influential fields, the investigator takes on the critically important task of discrete estimation in high-dimensional settings and aims to lay foundational principles upon which estimation and characterization of high-dimensional discrete spaces can be performed efficiently, with structural properties of the space adequately taken into account. More specifically, the PI explores estimators formally derived from statistical decision theory and based on loss functions that more naturally capture the features of the discrete space, so that the resulting estimators are arguably better representatives of the ensemble. If the discrete space is constrained, obtaining an efficient procedure for estimation is of prime concern given the large size of the space; to this end, the PI also proposes to develop a general framework that can be used to design efficient procedures for inference, assess the computational complexity of the proposed estimation, and derive approximation schemes when needed. In addition, the investigator applies the proposed foundations to highlight important features of the discrete space, such as regions of high concentration of probability mass, and studies a method to jointly elucidate features and identify good subspace representatives.

Many problems from fields like genetics, social sciences, molecular biology, and environmental studies can be cast as statistical inference problems on a large number of unknowns. Even though modern, high-throughput technology has enabled the collection of large datasets, these problems remain hard since the number of parameters describing the data-generating process grows with the number of observations. In this setting, it is helpful to impose structure on the model in order to guide the inference. The investigator studies novel, principled estimators that address two issues under this high-dimensional regime: effectively capturing structural relationships among variables in the model, and efficiently deriving solutions through computationally feasible routines. The PI intends to implement and publish the resulting methods as open-source software that benefits both academia and industry and further fosters the development of algorithms and practical implementations. Through this research project the PI also intends to promote the integration of research and education by developing new courses and raising awareness of statistical analysis of high-dimensional data and inference on discrete spaces with state-of-the-art methods. Finally, the PI expects to encourage collaborations between statisticians and researchers from other fields and promote statistical methods in interdisciplinary areas.

Project Report

This research aimed to develop statistical estimators for problems that have discrete parameters, that is, problems in which each of the unknowns can take on only a finite number of values. Even though the number of options is finite, this research focused on challenging problems where the number of parameters is very large and where the parameters can show complex relationships, that is, problems defined on high-dimensional and structured parameter spaces. The main contribution of this project was the derivation of "centroid" estimators that have good theoretical properties -- for instance, estimators that are good representatives of the space of possible solutions -- and that can be obtained in a computationally efficient way. While the proposed estimators can be applied to many different problems, in this work we concentrated on three specific applications: (i) land cover classification, an application to environmental studies and remote sensing where we seek to identify types of land cover -- say, deciduous forests, or croplands, or water -- in satellite images to help assess global changes; (ii) community detection in social networks, an application to social sciences where we try to group "actors" according to their interaction patterns with other individuals such that within-group interactions are dense and between-group interactions are sparse; and (iii) genome-wide association studies (GWAS), an application to genetics, where the goal is to identify which genetic markers are associated with a particular human disease, a version of a problem in Statistics known as "variable selection" but here severely ill-posed, with many more variables than observations. In all three problems we found evidence that the proposed centroid estimators and inferential methods give better results than state-of-the-art methods.
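To make the idea of a "good representative" concrete, here is a minimal sketch of one kind of centroid estimator. It is an illustration, not the project's released software. It assumes binary parameters, an unconstrained parameter space, Hamming loss, and a set of posterior samples already in hand; none of these specifics are stated in the report. Under those assumptions, the configuration with the smallest average Hamming distance to the samples is the componentwise majority vote, which can differ from the single most frequent configuration. The function names and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np


def centroid_estimate(samples):
    """Componentwise majority vote over binary samples.

    On an unconstrained binary space this minimizes the average
    Hamming distance to the samples (assumed loss, see lead-in).
    """
    samples = np.asarray(samples)
    return (samples.mean(axis=0) > 0.5).astype(int)


def mode_estimate(samples):
    """Most frequently sampled configuration (a sample-based MAP proxy)."""
    counts = Counter(tuple(int(v) for v in row) for row in samples)
    return np.array(counts.most_common(1)[0][0])


def mean_hamming(estimate, samples):
    """Average number of components in which `estimate` disagrees with a sample."""
    samples = np.asarray(samples)
    return float((samples != np.asarray(estimate)).sum(axis=1).mean())


# Hypothetical toy posterior sample over three binary parameters.
demo_samples = np.array(
    [[0, 0, 0]] * 3 + [[1, 1, 0]] * 2 + [[1, 0, 1]] * 2 + [[1, 1, 1]] * 2
)

if __name__ == "__main__":
    for name, est in [
        ("centroid", centroid_estimate(demo_samples)),
        ("mode", mode_estimate(demo_samples)),
    ]:
        print(name, est.tolist(), round(mean_hamming(est, demo_samples), 3))
```

In this toy sample the most frequent configuration sits apart from the bulk of the probability mass, while the majority-vote configuration lies closer to the samples on average even though it was never sampled itself. When the space is constrained, a componentwise vote may produce an infeasible configuration, so a different optimization routine would be needed; that case is not covered by this sketch.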
As broader impacts, this project has yielded three published papers, four papers under review, and three more in preparation, along with a number of presentations and posters at conferences. One PhD student has graduated, and two more PhD students are expected to graduate within a year; the PI served as mentor for two MSc graduate students, three undergraduate students, and one high school student. Moreover, all the methods developed in this research have been implemented in statistical software packages that have been released to the public under open-source licenses.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1107067
Program Officer: Gabor J. Szekely
Budget Start: 2011-07-01
Budget End: 2014-06-30
Fiscal Year: 2011
Total Cost: $64,495
Name: Boston University
City: Boston
State: MA
Country: United States
Zip Code: 02215