A plethora of digital data is being generated at unparalleled speed and with an inordinate number of dimensions. Machine learning and data mining can help us keep pace with rapidly advancing data gathering and storage techniques and mine nuggets, or patterns, from high-dimensional data. Semi-supervised learning can be interpreted as supervised learning that uses additional information from unlabeled data, or as unsupervised learning guided by constraints formed from labeled data. This research addresses two pressing issues with massive data: high dimensionality and a shortage of labeled data. In particular, the project: investigates semi-supervised feature selection to remove irrelevant features; studies the combination of feature extraction and model selection to further reduce dimensionality; and develops a novel framework to integrate feature selection and feature extraction based on sparse learning. This study is an explicit attempt to connect and unify feature selection and extraction for hypothesis space reduction. The project directly facilitates basic machine learning research and practical data mining, and advances innovative research beyond feature selection and extraction. The work engages students in both teaching and research, and the algorithms, tools, and databases will be made publicly available for research purposes and for use as teaching resources.
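To make the idea of semi-supervised feature selection concrete, the sketch below is an illustrative toy scorer only, not the project's actual algorithm: it combines a supervised Fisher-style class-separation score computed on the few labeled samples with an unsupervised variance term computed over all samples, so that unlabeled data also informs which features are retained. All function and variable names here are hypothetical.

```python
import numpy as np

def semi_supervised_scores(X_lab, y_lab, X_unlab):
    """Toy semi-supervised relevance score (illustration only):
    supervised Fisher-style separation on labeled samples, weighted
    by unsupervised feature variance over labeled + unlabeled data."""
    X_all = np.vstack([X_lab, X_unlab])
    overall_mean = X_lab.mean(axis=0)
    between = np.zeros(X_lab.shape[1])
    within = np.zeros(X_lab.shape[1])
    for c in np.unique(y_lab):
        Xc = X_lab[y_lab == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    fisher = between / (within + 1e-12)   # supervised term
    spread = X_all.var(axis=0)            # unsupervised term
    return fisher * spread / (spread.max() + 1e-12)

# Toy data: feature 0 separates the two classes, feature 1 is noise.
rng = np.random.default_rng(1)
X_lab = np.array([[0.0, 0.0], [0.2, 1.0], [3.0, 0.5], [3.2, 0.0]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = rng.standard_normal((50, 2)) + np.array([1.5, 0.0])
scores = semi_supervised_scores(X_lab, y_lab, X_unlab)
# Features can then be ranked by score and the lowest-ranked removed.
```

Ranking features by such a score and discarding the weakest ones is one simple way unlabeled data can assist selection when labeled samples are scarce.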

Project Report

High-dimensional data is ubiquitous in real-world applications. The shortage of labeled data, a result of high labeling costs, necessitates exploring machine learning approaches beyond the classic classification and clustering paradigms. Semi-supervised learning is one such approach that demonstrates its potential in handling data with small labeled samples and reducing the need for expensive labeled data. However, high-dimensional data with small labeled samples permits too large a hypothesis space constrained by too few labeled instances. The combination of these two data characteristics manifests a new research challenge. Employing computational and statistical learning theory, we analyze the specific challenges presented by such data, present preliminary studies, delineate the need to integrate feature selection and extraction in a novel framework to reduce the hypothesis space, design efficient and novel algorithms, and conduct theoretical and empirical studies to understand the complex relationships between high-dimensional data and classification performance. We propose an integrated framework that promotes and facilitates the computational understanding of machine learning and data mining, and goes beyond the state of the art to bridge feature selection and extraction. Though they share a common interest, these two lines of research have run largely in parallel. Building on our extensive research in each area, we join our expertise in feature selection and extraction and leverage the two approaches for effective reduction of the hypothesis space for high-dimensional data with small labeled samples.
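Sparse learning is the glue the framework uses to tie feature selection to model estimation: an L1 penalty drives irrelevant feature weights exactly to zero, so selection falls out of fitting the model. The snippet below is a minimal sketch of that general idea, an L1-regularized least-squares (lasso) solver via iterative soft-thresholding (ISTA), and does not reproduce the project's algorithms or the SLEP package's API; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam=0.05, n_iter=500):
    """Minimize (1/2n)||Xw - y||^2 + lam*||w||_1 via ISTA.
    Features whose weights end up exactly zero are deselected."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size = 1 / Lipschitz constant of the smooth part's gradient.
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy data: only features 0 and 2 carry signal among 5 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.standard_normal(100)
w = lasso_ista(X, y, lam=0.05)
selected = np.flatnonzero(np.abs(w) > 1e-3)  # indices of kept features
```

Here the nonzero pattern of `w` identifies the relevant features while the weights themselves define the predictive model, illustrating in miniature how sparse learning can unify selection and estimation.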
The proposed framework presents an explicit attempt to connect and unify feature selection and extraction for hypothesis space reduction, adds to the existing theory and practice on evolving data with increasingly large dimensionality and few labeled instances, and expands the current capability of handling high-dimensional data with small labeled samples. The joint framework connects and unifies feature selection and feature extraction at both the theoretical and empirical levels, giving rise to new research and curriculum opportunities. Learning from high-dimensional data with small labeled samples is also a problem in many fields outside of machine learning and data mining. For example, the proposed techniques can be used in computer vision for image and video processing, in computational biology for gene expression pattern image analysis, and in analytical chemistry for quality control of raw materials, intermediates, and final products.

Representative Outcomes

Two web-based resources: (1) A Feature Selection Repository, http://featureselection.asu.edu/, and (2) A Sparse Learning Software Package, www.public.asu.edu/~jye02/Software/SLEP

Two workshops and one tutorial on feature selection: (1) FSDM'08 at ECML-PKDD'08, (2) FSDM'10 at PAKDD'10, and (3) a tutorial at SDM'10

Selected Publications

Zheng Alan Zhao and Huan Liu. "Spectral Feature Selection for Data Mining", Chapman and Hall/CRC Press, 2012.
Zheng Zhao, Lei Wang, Huan Liu, and Jieping Ye. "On Similarity Preserving Feature Selection", IEEE Transactions on Knowledge and Data Engineering (TKDE), to appear.
Jieping Ye and Jun Liu. "Sparse Methods for Biomedical Data", SIGKDD Explorations, to appear.

Project website: www.public.asu.edu/~huanliu/projects/NSF08/

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0812551
Program Officer
Vijayalakshmi Atluri
Budget Start
2008-09-01
Budget End
2012-08-31
Fiscal Year
2008
Total Cost
$462,605
Name
Arizona State University
City
Tempe
State
AZ
Country
United States
Zip Code
85281