Feature selection has proven efficient and effective in preparing high-dimensional data for data mining and machine learning applications, especially when the original features are important for model interpretation and knowledge extraction. Data is growing rapidly in both size and complexity as the capacity to collect it increases dramatically. Such big data poses tremendous challenges for traditional feature selection methods, which are usually designed to handle homogeneous, static data in a centralized fashion. Moreover, in many real-world domains big data is unlabeled, which further exacerbates the difficulty. The majority of existing feature selection methods are therefore not well prepared for big data, which calls for the development of novel unsupervised feature selection methods for unlabeled big data. The project extends state-of-the-art feature selection research to the new frontier of taming big data. It has the potential to benefit a number of real-world applications in disciplines such as Computer Science, Business, Education, Politics, Healthcare and Bioinformatics.
This project proposes a suite of novel approaches to unsupervised feature selection that facilitate the computational understanding of big data, investigating the associated fundamental research issues and developing effective algorithms. It consists of three major thrusts. First, it studies strategies for scaling unsupervised feature selection to large-scale and distributed data, and investigates distributed unsupervised feature selection with structured features and under asynchronous updates. Second, it develops a family of unsupervised feature selection methods for heterogeneous data, covering multiple types of heterogeneity. Third, it defines unsupervised feature selection for various streaming scenarios and develops new algorithms that improve its capability in the corresponding streaming settings. Several means of disseminating the project and its findings are planned, including web-enabled data and software repositories, books, journal and conference publications, special-purpose workshops or tutorials, and external collaborations. The project lies at the confluence of feature selection, big data analysis, machine learning and data mining. It can be effectively integrated into undergraduate and graduate courses as well as student research projects.