A large amount of text and social network data is emerging in scientific research as well as everyday life. This project will develop statistical methods for analyzing such data, leading to new scientific, sociological, and biomedical discoveries. The research faces several fundamental challenges due to features of the data: (1) large scale, which requires advanced strategies for storage, computation, and quality control; (2) complicated structure, which makes careful statistical modeling a critical need; and (3) strong noise, which requires sophisticated de-noising techniques. To address these challenges, the PI proposes a universal probabilistic factor modeling approach. The research will provide an array of statistical tools for social network analysis, natural language processing, RNA-sequencing data analysis, and electronic health records analysis. This project will also help train graduate and undergraduate students in data collection, data cleaning, and statistical methodology and theory. In addition, this project will release new software and data sets for network and text analysis, providing useful resources for both education and research.

Probabilistic factor models refer to factor models whose factors or factor loadings are connected to probability mass functions. Examples include topic models in text mining and mixed membership models in social networks. Due to the nonnegativity constraints and the dependent, heteroscedastic noise in these models, statistical estimation and inference are extremely challenging. This project will tackle these challenges and apply the proposed methods to different applications. The first thrust aims to develop a novel framework for exploring sparsity in topic models. It proposes a new notion of "sparsity" on the vocabulary, which differs from the conventional notion of sparsity in high-dimensional statistics. The framework will provide a theoretical foundation for dimension reduction in text mining, as well as new word screening methods and new spectral methods for topic weight estimation. The second thrust aims to study the fundamental statistical limits of network mixed membership estimation. It will lead to a new optimality theory for mixed membership estimation, especially for network models with a large degree of heterogeneity, and new random matrix theory for empirical eigenvectors. It will also produce data sets on the networks among academic researchers in statistics-related fields and generate discoveries about the trends and patterns in academic research. The third thrust aims to adapt the above technical tools to biomedical data, including bulk and single-cell RNA-sequencing data and electronic health records data. It will result in new mixture models and statistical inference tools for biomedical data.
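To make the setup concrete, the following is a minimal sketch of a probabilistic factor model in the pLSI-style topic-model sense described above: a word-document frequency matrix factors as loadings times factors, where both are column-stochastic, and multinomial sampling of words introduces the dependent, heteroscedastic noise the abstract mentions. All dimensions and variable names here are illustrative assumptions, not specifics from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: vocabulary V, topics K, documents n, words per document N.
V, K, n, N = 200, 3, 50, 5000

# Factor loadings A (V x K): each column is a topic, i.e., a pmf over words.
A = rng.dirichlet(np.ones(V), size=K).T   # columns sum to 1
# Factors W (K x n): each column is a document's topic-weight vector.
W = rng.dirichlet(np.ones(K), size=n).T   # columns sum to 1

# Expected word-frequency matrix D0 = A @ W; its columns are also pmfs,
# reflecting the nonnegativity constraints of the model.
D0 = A @ W

# Observed data: multinomial word counts per document. The noise variance
# depends on the entries of D0, hence heteroscedastic and dependent.
counts = np.column_stack([rng.multinomial(N, D0[:, j]) for j in range(n)])
D_hat = counts / N                        # empirical frequencies

# As N grows, D_hat concentrates around the low-rank matrix D0, which is
# what spectral estimation methods exploit.
err = np.abs(D_hat - D0).max()
```

The key structural point is that estimation targets (`A`, `W`) live on probability simplices rather than in unconstrained Euclidean space, which is why standard high-dimensional sparsity notions and noise models do not apply directly.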

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1943902
Program Officer: Huixia Wang
Budget Start: 2020-07-01
Budget End: 2025-06-30
Fiscal Year: 2019
Total Cost: $75,766
Name: Harvard University
City: Cambridge
State: MA
Country: United States
Zip Code: 02138