This proposal concentrates on various topics relating to the statistical analysis of dependent data. The first project extends the spectral envelope concept for analyzing DNA sequences. A common problem in analyzing long DNA sequence data is identifying protein-coding sequences that are dispersed throughout the sequence and separated by noncoding regions. DNA sequences are heterogeneous, so it is necessary to expand the methodology to capture the local behavior of such sequences. To address this problem, a local spectral envelope with estimation via mixtures of smoothing splines will be explored. It is hoped that this methodology will help emphasize any periodic feature that exists in a categorical sequence of virtually any length in a quick and automated fashion. Projects such as the Human Genome Project have produced large amounts of data, and the methods established in this project will prove useful in analyzing the vast quantities of data being produced by various genome projects.

In another project, the focus is on the analysis of longitudinal data and the development of a practical nonparametric procedure for estimating the within-subject correlation structure. This technique is used to develop a data-driven functional principal components analysis (FPCA) procedure. Because observations made within a subject are often correlated in longitudinal data, an effective analysis must account for this within-subject correlation. When the parametric form of the covariance structure is unknown, using a misspecified structure can result in biased and inefficient estimates. This project focuses on longitudinal data that can be modeled as noisy observations, made at discrete time points, of smooth subject trajectories that are realizations of a stochastic process.
The high dimensionality and complexity of longitudinal data have made FPCA a popular tool for data reduction and visualization, because it captures the primary modes of variation of the stochastic process generating the data. Scientists are often interested in using longitudinal data to determine the effect that a set of possibly time-varying covariates has on a given response over time. Functional linear models, and in particular the varying-coefficient model, provide a framework for analyzing such data. In many of these data sets, the functional coefficients have shapes that cannot be modeled parametrically. An effective analysis must both account for the within-subject correlation and allow for the flexible shapes of the coefficients. Because a parametric form for the within-subject covariance is not always known, a third project focuses on creating an iterative, data-driven, spline-based procedure for fitting varying-coefficient models.
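
To make the FPCA idea mentioned above concrete, the following is a minimal sketch, assuming every subject is observed on the same dense time grid so that the plain sample covariance can be used. It is not the nonparametric covariance estimator proposed in this project; the function name `fpca` and the simulated trajectories are hypothetical and serve only as an illustration.

```python
import numpy as np

def fpca(X, n_components=2):
    """Basic FPCA on a common grid (illustrative sketch only).

    X has shape (n_subjects, n_timepoints); each row is one subject's
    noisy trajectory sampled at the same time points.
    """
    mean = X.mean(axis=0)                      # estimated mean trajectory
    Xc = X - mean                              # centered trajectories
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance surface
    evals, evecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]
    scores = Xc @ evecs                        # subject-level component scores
    return mean, evals, evecs, scores

# Simulated example (assumed data): two smooth modes of variation plus noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50, endpoint=False)
a = rng.normal(0.0, 2.0, size=200)             # dominant mode amplitudes
b = rng.normal(0.0, 0.5, size=200)             # secondary mode amplitudes
X = (np.outer(a, np.sin(2 * np.pi * t))
     + np.outer(b, np.cos(2 * np.pi * t))
     + rng.normal(0.0, 0.1, size=(200, 50)))
mean, evals, evecs, scores = fpca(X)
```

In this simulation the first estimated eigenvector should align closely with the sine-shaped mode, since that mode carries most of the variation. Sparse or irregularly sampled longitudinal data would require smoothing the covariance first, which is the setting the project addresses.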

This proposal concentrates on solving problems involved in the analysis of dependent data. The first project will develop a method for detecting genes in long DNA sequences. Projects such as the Human Genome Project have produced large amounts of data, and the methods established in this project will prove useful in analyzing the vast quantities of data being produced by various genome projects. A second proposed project focuses on the analysis of complex data collected over time. This project is also motivated by the analysis of DNA, and in particular, the analysis of gene expression data. In a third project, the investigators will focus on a class of techniques called functional linear models. For example, techniques will be developed for studying the effect that a growth factor should have on the decision to supplement chemotherapy with antiangiogenic therapy when treating ovarian cancer.

Project Report

In this work, we concentrated on various topics relating to the statistical analysis of dependent data. We extended the "spectral envelope" concept, which we first proposed as a method for analyzing qualitative dependent data. While the original motivation of the research was to analyze EEG sleep-state data, the method was found to be useful in the analysis of DNA sequences. A common problem in analyzing long DNA sequence data is identifying protein-coding sequences (CDS) that are dispersed throughout the sequence and separated by noncoding regions. DNA sequences are heterogeneous, so we expanded the methodology to capture the local behavior of such sequences. Our method was able to help emphasize any regular feature in a DNA sequence of virtually any length in a quick and automated fashion. Projects such as the Human Genome Project have produced large amounts of data, and our methods have proved useful in analyzing the vast quantities of data being produced by various genome projects. In addition, the techniques were expanded for use in Sleep Medicine and Circadian Biology. In this case, we developed a method to help analyze the effect of various conditions, such as depression, on sleep-state cycling. The technology, which is now used in the field of Sleep Medicine, is a new class of statistical methods that can be used to assess connections between patterns in sleep as it evolves over time and the subject's general quality of life. We were also able to discover that patterns in heart rate during sleep can be used to predict the effectiveness of behavioral treatments of insomnia and to help health care workers choose appropriate patient-specific treatments for poor sleep.
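
As a rough illustration of the basic spectral envelope idea referred to above, the sketch below computes a global sample spectral envelope for a DNA string: the largest eigenvalue, at each frequency, of the smoothed periodogram matrix of the nucleotide indicator series, standardized by their covariance. It is not the local, spline-based estimator developed in this project. The function name `spectral_envelope`, the moving-average smoother, and the simulated sequence are all assumptions made for illustration.

```python
import numpy as np

def spectral_envelope(seq, alphabet="ACGT", span=5):
    """Global sample spectral envelope of a categorical sequence (sketch).

    Assumes every letter of `alphabet` occurs in `seq`, so that the
    indicator covariance matrix is invertible.
    """
    seq = np.asarray(list(seq))
    n = len(seq)
    # Indicator series for all categories except the last (the reference).
    Y = np.column_stack([(seq == c).astype(float) for c in alphabet[:-1]])
    Yc = Y - Y.mean(axis=0)
    V = Yc.T @ Yc / n                          # covariance of the indicators
    # Discrete Fourier transform at frequencies j/n, j = 1, ..., n // 2.
    D = np.fft.fft(Yc, axis=0)[1 : n // 2 + 1]
    # Real part of the periodogram matrix at each frequency.
    I_re = np.real(D[:, :, None] * np.conj(D[:, None, :])) / n
    # Simple moving-average smoothing across neighboring frequencies.
    kernel = np.ones(span) / span
    f_re = np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="same"), 0, I_re
    )
    # Standardize by V^{-1/2} and take the largest eigenvalue per frequency.
    w, U = np.linalg.eigh(V)
    V_inv_half = U @ np.diag(w ** -0.5) @ U.T
    M = V_inv_half @ f_re @ V_inv_half
    lam = np.linalg.eigvalsh(M)[:, -1]
    freqs = np.arange(1, n // 2 + 1) / n
    return freqs, lam

# Simulated example (assumed data): random DNA with extra G at every third
# position, mimicking the period-3 signal often seen in coding regions.
rng = np.random.default_rng(0)
n = 600
letters = rng.choice(list("ACGT"), size=n)
boost = (np.arange(n) % 3 == 0) & (rng.random(n) < 0.6)
letters[boost] = "G"
freqs, lam = spectral_envelope("".join(letters))
peak_frequency = freqs[np.argmax(lam)]
```

For a sequence with a strong period-3 pattern, the envelope is expected to peak near frequency 1/3. A local version would apply a similar computation to segments or time-varying spectra rather than to the whole sequence at once.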

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0805050
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2008-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2008
Total Cost: $319,999
Indirect Cost:
Name: University of Pittsburgh
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213