In the era of Big Data, researchers often encounter datasets that are large and structurally complex, in which the information of interest is usually contained in "components" hidden beneath an enormous amount of noise. Examples include communities in large social networks, topics in text documents, and confounding factors in genome-wide association studies (GWAS). Extracting these hidden components is an interesting but challenging problem. This project will address that problem, with applications in social network analysis, text mining, genomics, and genetics. The project will include (a) the collection of large social network data, (b) the development of new models, methods, and theory for extracting hidden components in network analysis, text mining, and GWAS, and (c) a study of knowledge discovery using academic research data such as co-authorship and citation relationships. The research will have an impact on linguistics, the social sciences, cancer research, and knowledge discovery.
This project aims to develop statistical models, methods, and theory for inferring and utilizing hidden components in complex data, especially matrix data. The goals of the project include: (1) Development of simple and fast methods for network mixed membership estimation and topic model estimation. These methods, based on nontrivial modifications of Principal Component Analysis (PCA), are easy to implement and can handle very large datasets. (2) Development of new methods and theory for detecting and estimating rare and weak effects in GWAS. Related problems to be considered include the optimal ranking of genes in the presence of complex correlation structures and the detection of weak spikes in large covariance matrices. (3) A study of the social network structures of scientific researchers. The PI and her collaborators will collect meta-information from articles published in representative statistics journals to understand the social network structures and other features of the statistics community. (4) Development of new random matrix theory (RMT) for statistical analysis.