In the era of Big Data, researchers often encounter datasets that are large and structurally complex, where the information of interest is usually contained in "components" hidden within an enormous amount of noise. Examples include communities in large social networks, topics in text documents, and confounding factors in genome-wide association studies (GWAS). Extracting these hidden components is an interesting but challenging problem. This project will address that challenge, with applications to several scientific areas, including social networks, text mining, genomics, and genetics. The project will include (a) collection of large social network data, (b) development of new models, methods, and theory for extracting hidden components in network analysis, text mining, and genome-wide association studies, and (c) a study of knowledge discovery using academic research data such as co-authorship and citation relationships. The research will have an impact in linguistics, social sciences, cancer research, and knowledge discovery.

This project aims to develop statistical models, methods, and theory for inferring and utilizing hidden components in complex data, especially matrix data. The goals of the project include: (1) Development of simple and fast methods for network mixed membership estimation and topic model estimation. These methods, based on nontrivial modifications of Principal Component Analysis (PCA), are easy to implement and scale to very large datasets. (2) New methods and theory for detecting and estimating rare and weak effects in GWAS, including the optimal ranking of genes in the presence of complex correlation structures and the detection of weak spikes in large covariance matrices. (3) A study of social network structures of scientific researchers. The PI and her collaborators will collect meta-information from published articles in representative statistics journals to understand social network structures and other features of the statistics community. (4) Development of new random matrix theory (RMT) for statistical analysis.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1925845
Program Officer: Gabor Szekely
Budget Start: 2018-07-01
Budget End: 2020-06-30
Fiscal Year: 2019
Total Cost: $141,781

Name: Harvard University
City: Cambridge
State: MA
Country: United States
Zip Code: 02138