Studies of high-dimensional data sets may answer fundamental questions about how the brain's structure evolves with age and how transcription factors control gene activity. Massive amounts of medical imaging and genome sequencing data are now being collected, so these questions can be investigated by combining the expertise of neuroscientists and biologists with powerful data analysis tools designed to capture subtle structure and uncover hidden patterns in the data. This project aims to develop an integrated toolkit of scalable, robust, and theoretically sound nonparametric and semiparametric methods for high-throughput estimation of complex biological systems. It addresses two major problems. The first concerns massive, high-dimensional, complex-structured data whose generating distributions and interrelationships often cannot be captured by simple linear systems. Such data are usually noisy and contain numerous outliers. The second concerns data that exhibit temporal and spatial correlations and a relatively weak signal. Assuming that such data are independent and identically distributed could lead to erroneous estimation and prediction, and in turn to inaccurate interpretation of biological systems. This project aims to provide methods that solve both problems.

This research project puts forward new methods for the effective analysis of biological systems that handle the challenges above in a unified fashion. One essential feature is large-scale robust nonparametric and semiparametric inference. In particular, the project aims to develop an integrated toolkit of methods that are: (1) easily scalable to high-dimensional data with a large sample size; (2) robust to violations of modeling assumptions and to different kinds of data contamination; and (3) nonparametric or semiparametric, meaning that the corresponding generative models contain infinite-dimensional components that capture as much of the information and subtlety in the data as possible. To illustrate, the investigator intends to construct, explore, and apply high-dimensional generalized regression models, (generalized) partially linear models, shape-constrained regression models, and copula time series models, among others, to unveil hidden patterns in biological systems. The methods under development are designed to be optimal, namely, to attain either a nonparametric minimax rate or a semiparametric efficiency bound.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1712536
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2017-06-15
Budget End: 2020-05-31
Support Year:
Fiscal Year: 2017
Total Cost: $160,000
Indirect Cost:
Name: University of Washington
Department:
Type:
DUNS #:
City: Seattle
State: WA
Country: United States
Zip Code: 98195