Dimension Reduction, Model Selection and Classification in Functional Data Analysis.

Stufken, John; Li, Yehua

Abstract

Functional data analysis aims to model and analyze data sets where a datum is a random function, e.g. a curve or a high-dimensional image. Due to the fast growth of modern data collection methods, such data sets are becoming increasingly prevalent in many biological, medical and industrial applications. Functional data are viewed as infinite-dimensional vectors in a functional space, and are usually observed at discrete points and measured with error. Because functional data are infinite-dimensional, dimension reduction is essential for visualizing, modeling and making inferences about these data. In the proposed project, the investigator will study new, computationally efficient dimension reduction methods for functional data based on spline approximations, and use asymptotic theory to develop new statistical devices for model selection and inference. The investigator will also study classification problems in functional data by combining the proposed dimension reduction techniques with modern machine learning methods.

The proposed research is motivated by data from colon carcinogenesis experiments, hypertension studies, AIDS clinical trials and functional magnetic resonance imaging experiments. The proposed project will benefit society by advancing knowledge in these scientific fields. To disseminate the research results more broadly, the investigator will provide free and user-friendly software to all scientific researchers. A new course on functional data analysis will be developed at the investigator's institution. The new course aims to nurture students' ability to analyze real and innovative data sets and to help them gain a deeper understanding of modern statistical methods and theory.

Project Report

In this project, the investigators developed novel statistical methods for regression, clustering and classification problems that involve a new type of data called functional data. Functional data analysis (FDA) aims to model and analyze data sets where a datum is a random function, e.g. a curve or a high-dimensional image. Due to the fast growth of modern data collection methods, such data sets are rapidly becoming available in many biological, medical and industrial applications. In FDA, the data are viewed as infinite-dimensional vectors in a functional space, which are usually observed at discrete points and measured with error. Non-Gaussian longitudinal data are usually modeled by a Generalized Linear Mixed Model, where the latent longitudinal process can also be modeled as functional data. Because functional data are infinite-dimensional, dimension reduction becomes essential for visualization, modeling and inference, yet the unique features of functional data raise many new challenges. The methods developed in this project address many of the estimation, model selection and inference problems related to dimension reduction for functional data.

Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool in FDA, and selecting the number of principal components is the most important model selection problem in almost all contexts of FDA. The investigators considered functional data measured at random, subject-specific time points and contaminated with measurement error, allowing for both sparse and dense functional data, and proposed novel information criteria to select the number of principal components. The new information criteria vastly outperform existing methods in terms of both theoretical properties and numerical performance.

The investigators applied a latent functional data analysis approach to model non-Gaussian longitudinal data that arose from a cocaine dependence treatment study. They used FPCA to flexibly model the within-subject correlation structure in these longitudinal trajectories. The dimension reduction provided by FPCA kept the model parsimonious and tractable. By jointly modeling the cocaine use history before treatment and the cocaine use behavior after treatment, the investigators were able to detect clusters among different drug use patterns. This model can be used to subtype patients, with the hope of developing personalized treatments. The investigators also jointly modeled the baseline cocaine use history with another important endpoint variable, time to first relapse. This model can be used to predict a patient's treatment outcome from the patient's drug use pattern at baseline. These new methods are all based on FDA and can potentially be used in many applications in the social sciences, substance abuse treatment studies and behavioral studies.

The investigators studied hypothesis testing problems in a class of functional analysis of covariance (fANCOVA) models. In many longitudinal studies, patients are assigned to different treatment groups and the response variable is measured repeatedly over time. The treatment effects are thus represented as nonparametric functions of time. The investigators modeled such data with a class of semiparametric fANCOVA models, which accommodate not only the nonparametric treatment effects but also the parametric effects of other covariates. To test for the nonparametric treatment effects, they proposed a generalized quasi-likelihood ratio test and investigated its theoretical properties. The proposed model and test procedure can be widely used in clinical trials, substance abuse rehabilitation studies and industrial applications.

The investigators also proposed novel global inference tools for the covariance function of functional data using simultaneous confidence envelopes based on tensor product splines, and investigated other dimension reduction tools such as functional sliced average variance estimation. These statistical methods have wide applications in classification problems in voice recognition, medical longitudinal studies and early detection of eye diseases.
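
To make the role of FPCA as a dimension reduction tool concrete, the following is a minimal Python sketch under simplifying assumptions that are not taken from the project: curves are observed densely on a common, equally spaced grid, no smoothing or spline approximation is applied, and the number of components is chosen by a simple fraction-of-variance-explained threshold rather than by the information criteria the investigators proposed. The function name `fpca_dense`, the parameter `var_threshold` and the simulated data are all hypothetical and serve only as illustration.

```python
import numpy as np


def fpca_dense(curves, grid, var_threshold=0.95):
    """Illustrative FPCA for curves on a common, equally spaced grid.

    curves: array of shape (n_subjects, n_grid_points)
    grid:   array of shape (n_grid_points,)
    The number of components is picked by cumulative fraction of variance
    explained (an assumption of this sketch, not the project's criteria).
    """
    n = curves.shape[0]
    dt = grid[1] - grid[0]                      # grid spacing (assumed constant)
    mean = curves.mean(axis=0)                  # pointwise mean function
    centered = curves - mean
    cov = centered.T @ centered / (n - 1)       # sample covariance surface

    # Discretized covariance operator: multiply by dt (Riemann-sum weight).
    evals, evecs = np.linalg.eigh(cov * dt)
    order = np.argsort(evals)[::-1]             # sort eigenvalues descending
    evals = np.clip(evals[order], 0.0, None)    # guard against tiny negatives
    evecs = evecs[:, order]

    # Smallest k whose cumulative variance fraction reaches the threshold.
    fve = np.cumsum(evals) / evals.sum()
    k = int(np.searchsorted(fve, var_threshold)) + 1

    # Rescale so each eigenfunction has unit L2 norm under the Riemann sum.
    eigenfunctions = evecs[:, :k] / np.sqrt(dt)
    scores = centered @ eigenfunctions * dt     # projections onto eigenfunctions
    return mean, evals[:k], eigenfunctions, scores


# Simulated example: two true components plus small measurement noise.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 51)
basis = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * grid),
                   np.sqrt(2) * np.cos(2 * np.pi * grid)])
true_scores = rng.normal(size=(200, 2)) * np.array([2.0, 1.0])
curves = true_scores @ basis + rng.normal(scale=0.1, size=(200, 51))

mean, evals, phi, scores = fpca_dense(curves, grid)
print("components kept:", phi.shape[1])
```

Each curve of 51 grid values is summarized by a short vector of scores, which can then be passed to a regression, clustering or classification method. Sparse, irregularly sampled or noisy curves, which the project explicitly considers, would need smoothing of the mean and covariance functions first, and that step is not shown here.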

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1105634
Program Officer: Gabor Szekely

Budget Start: 2011-09-01
Budget End: 2014-08-31
Fiscal Year: 2011
Total Cost: $119,999

Institution

University of Georgia, Athens, GA, United States
