Nonparametric methods are increasingly applied to regression, classification, and density estimation, both in statistics and in related areas such as data mining and machine learning. However, a key difficulty with nonparametric models is model fitting for high-dimensional data, owing to the curse of dimensionality. Another difficulty is model inference and interpretation, i.e., how to evaluate or test the effects of individual variables on a complex fitted surface. For heterogeneous data with complicated covariance structure, nonparametric model estimation is even more challenging. The objective of this proposal is to develop novel and widely applicable procedures for simultaneous model selection and estimation in nonparametric models and their related paradigms in data mining. In the framework of reproducing kernel Hilbert spaces (RKHS), the PI proposes a host of new regularization techniques for several families of models: smoothing spline ANOVA models for correlated data, semiparametric regression models, and support vector machines for supervised and semi-supervised learning. The proposed methodologies constitute key advances over standard methods through their unified framework for achieving model sparsity and function smoothing simultaneously, their tractable theoretical properties, and their easy adaptation to high-dimensional problems. The PI will study the asymptotic behavior of the proposed estimators, explore data-driven procedures for tuning regularization parameters, and develop computational algorithms and software to implement the proposed procedures. The PI will also examine the finite-sample performance of the new methods via extensive simulation studies and real data analysis.

In the current information era, the volume and complexity of scientific and industrial databases have been expanding exponentially. As a consequence, data are becoming increasingly high-dimensional. Analysis of such data poses new challenges to statisticians and is becoming one of the most important research topics in modern statistics. The purpose of this project is to significantly expand the available tools for analyzing complex high-dimensional data. In this project, the PI aims to accomplish the following three goals: (1) meet the challenges of nonparametric model estimation and selection within a unified mathematical framework; (2) develop flexible methods with desirable statistical properties and high-performance statistical software for mining massive data; (3) integrate research opportunities and findings from the above two activities into disciplinary and interdisciplinary statistical education at the graduate, undergraduate, and high school levels. This research will broaden the traditional understanding of nonparametric inference and model selection; provide a broad range of researchers and practitioners in fields including sociology, economics, and the environmental, biological, and medical sciences with state-of-the-art data analysis tools; and help prepare the next generation of students with the necessary modern statistical perspectives.

Project Report

During this research project, the PI has endeavored to integrate her research, teaching, and service to professional communities and the general public. In this project, the PI has devoted herself to the development of flexible and robust statistical inference methods and analysis tools that are mathematically appealing, computationally feasible, and highly effective in performance, in order to reveal crucial information from large, complex, and noisy data. Her research in nonparametric statistics and data mining has been innovative and productive, making fundamental contributions to the theory and applications of nonparametric model smoothing, selection, and sparse estimation. Through this NSF CAREER award project, the PI has published 34 refereed research articles, many in highly regarded journals, and co-authored a textbook, "Principles and Theory for Data Mining and Machine Learning," published by Springer in 2009. These works have opened new research avenues and sparked new ideas in the field. Central to statistical machine learning is high-performance computing. The PI has developed efficient computational algorithms to handle big data, and user-friendly software packages are provided freely for public use. For example, she has made a serious commitment to developing, maintaining, and updating R packages for public use. Furthermore, the PI has collaborated extensively with researchers and scientists in other fields, helped them develop powerful quantitative tools, and facilitated the process of making new discoveries and conducting translational research. Her interdisciplinary work has led to new and sophisticated data analysis methods useful in bioinformatics and biomedical research. The research achievements and results from this project are well recognized within the professional community. The PI has been invited to present at more than 40 national and international conferences, workshops, and departmental and institutional colloquia.
She also served on or chaired the program committees of various national and international conferences. In addition, the PI has actively promoted statistics and mathematics to the general public at various events, such as "Math and City" and "Tech in Tucson Showcase," exhibiting recent advancements in research and education programs at the University of Arizona. The PI believes that today's undergraduate and graduate students in STEM disciplines should be exposed to, and eventually master, cutting-edge work in modern statistics to better prepare for their careers in the era of big data. During the project period, she served on the committees of over 40 Ph.D. students at NCSU, UNC-Chapel Hill, and the University of Arizona. The PI designed and taught new special topics courses on nonparametric statistics and on machine learning at both North Carolina State University and the University of Arizona. She has incorporated course material into the students' research activities, helping them gain research experience and master problem-solving skills. Furthermore, the PI participated in the NC State Kenan Fellows Program for Curriculum and Leadership Development, a statewide initiative to retain K-12 math and science teachers in local public school districts. During the project period, the PI served as a mentor to two Kenan Fellows (both high school math and science teachers), engaging the Fellows in her research and helping them develop innovative curricula for teaching AP Statistics.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1347844
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2013-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2013
Total Cost: $96,133
Indirect Cost:

Name: University of Arizona
Department:
Type:
DUNS #:
City: Tucson
State: AZ
Country: United States
Zip Code: 85719