Recent technological advances in data collection and processing have led to the accumulation of vast amounts of digital information, along with various types of auxiliary information such as prior data, external covariates, domain knowledge and expert insights. However, with few analytical tools available, much of this relevant data and auxiliary information has been severely underexploited in current studies. The analysis of big data with complex structures poses significant challenges and calls for new theory and methodology for information integration. This collaborative research aims to develop new procedures, computational algorithms and statistical software that provide powerful tools for researchers in various scientific fields who routinely collect and analyze high-dimensional data, helping to translate dispersed and heterogeneous data sources into new knowledge effectively.
This NSF project aims to develop new principles, theoretical foundations and methodologies for integrative large-scale data analysis and statistical inference. A central theme is how to combine information from multiple sources within a unified framework. The project focuses on four types of problems: (i) inference on two sparse objects; (ii) structured simultaneous inference; (iii) simultaneous set-wise inference and multi-stage inference; and (iv) applications in genomics and network analysis. The new integrative framework provides a powerful approach for extracting and pooling information from different parts of massive data sets, and can improve upon conventional methods by delivering more accurate, informative and interpretable results.