With rapid advances in computing power and other modern technology, high-throughput data of unprecedented size and complexity are becoming commonplace in diverse fields. Examples include genetic, microarray, proteomic, fMRI and cancer clinical trial data, as well as high-frequency financial data. Such high-dimensional data characterize many important contemporary problems in statistics, and feature selection plays a pivotal role in these problems. This research project aims to develop cutting-edge statistical theory and methods for high-dimensional variable selection. In particular, the PI proposes the following interrelated research topics for investigation: (1) grouped-variable screening with sparse linear models; (2) nonparametric component screening with sparse additive models; (3) parametric component screening with sparse semiparametric models; and (4) their further extensions. The proposed methods will be studied theoretically for their sure screening behavior and compared empirically with existing methods in terms of computational expediency, statistical accuracy and algorithmic stability.
The outlined research project on variable selection in high dimensions tackles fundamental problems in statistical learning and will stimulate interest from a large group of scientists and researchers in diverse fields of sciences, engineering and humanities, ranging from genomics and health sciences to economics and finance. Another key aspect of this project is the integration of research and education, which will be achieved by developing two new courses, one on statistical learning and one on nonparametric and semiparametric inference, and by proposing specific projects for students in these classes. It will enable the participation of all citizens from various disciplines, including students from underrepresented groups.
Outcomes that address the intellectual merit: During the past period, the PI, together with her collaborators, has developed several variable screening methods, including censored rank independence screening (CRIS) for high-dimensional survival data (Song et al. 2013), varying-coefficient independence screening (VIS) for high-dimensional longitudinal/functional data (Song et al. 2013) and forward selection methods for developing dynamic treatment regimens (Fan et al. 2013).
Outcomes that address the broader impacts: With the support of the NSF grant, the PI developed a graduate course on topics in high-dimensional statistical inference that draws on her research findings. The PI is currently advising four graduate students, Ms. Ailin Fan, Ms. Runchao Jiang, Mr. Shikai Luo and Mr. Zhongkai Liu. All four students have contributed to various aspects of the research. Two of the PI's current students are female. The PI has established collaborations with other local scientists. The PI has also presented the research findings at conferences and seminars.
Outcomes of the entire award: With the support of the NSF grant, the PI has made significant progress in developing variable screening methods for high-dimensional data with semi-nonparametric methods. Fan and Song (2010) thoroughly studied sure independence screening for generalized linear models. Fan et al. (2011) proposed nonparametric independence screening (NIS) for ultra-high-dimensional additive models. Song et al. (2012a) developed an integrative prescreening approach that pools and analyzes multiple datasets to improve reproducibility and reduce dimensionality in cancer genomic studies. Song et al. (2013) studied censored rank independence screening (CRIS) for high-dimensional survival data. Song et al. (2014) proposed varying-coefficient independence screening (VIS) for high-dimensional longitudinal/functional data. Fan et al. (2013) considered variable screening methods for developing dynamic treatment regimens. Four graduate students were involved and have contributed to various aspects of the research. The PI presented the research findings at conferences and seminars. This proposal focused on statistical research, but the proposed approaches for analyzing high-dimensional data will have direct and important applications in diverse fields of sciences, engineering and humanities.