In many areas of biological or medical research, investigators are faced with the task of analyzing data sets that can be described as "large sample, moderate dimension". An important example, which is the specific focus of this project, is multi-parameter flow cytometry, where the number of data points ranges from tens of thousands to several million, and each data point provides measurements on multiple variables (5 to 60). New statistical tools are needed to analyze and visualize these data, and to address the associated hypothesis testing and modeling challenges. The goal of this research is to develop new statistical methods for this type of multivariate data and, based on these methods, to create and provide more effective data analysis and interpretation tools for multi-parameter flow cytometry. First, we will develop a new approach to multivariate density estimation based on the approximation of the density by simple functions. These estimates are essentially histograms based on data-adaptive partitions of the basic multivariate domain.
The aim is to attain effective learning of these partitions using methods with strong theoretical justification and good empirical performance. We will also implement and further develop these methods for the analysis of multi-parameter flow cytometry data. Particular attention will be paid to mass cytometry, a new cytometry modality that can greatly increase the number of variables measured per cell compared with classical polychromatic flow cytometry.
The aim is not only to improve primary analysis tasks such as cell population identification, but also to develop new methods for downstream analysis tasks such as graphical modeling of the variables being analyzed. The methods will be disseminated through distribution of computer programs and also through a web service.
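
To make the idea of a histogram on a data-adaptive partition concrete, the following is a minimal sketch and not the project's method, which the abstract does not specify. It assumes a simple stand-in rule: recursively split the widest side of each cell at the median of the points inside it, stop when a cell holds at most `max_leaf` points, and assign each cell the constant density (fraction of points in the cell) / (cell volume). The function name `adaptive_histogram` and the parameter `max_leaf` are hypothetical.

```python
import numpy as np


def adaptive_histogram(points, lower, upper, max_leaf=50):
    """Piecewise-constant density estimate on a data-adaptive partition.

    Illustrative assumption: cells are split at the median along their
    widest side. Returns a list of (cell_lower, cell_upper, density).
    """
    points = np.asarray(points, dtype=float)
    n_total = len(points)
    cells = []
    stack = [(points, np.asarray(lower, float), np.asarray(upper, float))]

    while stack:
        pts, lo, hi = stack.pop()
        dim = int(np.argmax(hi - lo))  # widest side of the current cell
        cut = float(np.median(pts[:, dim])) if len(pts) else lo[dim]
        right_mask = pts[:, dim] > cut

        # Stop when the cell is small enough or the cut cannot split it.
        is_leaf = (
            len(pts) <= max_leaf
            or not (lo[dim] < cut < hi[dim])
            or not right_mask.any()
        )
        if is_leaf:
            volume = float(np.prod(hi - lo))
            cells.append((lo, hi, len(pts) / (n_total * volume)))
            continue

        hi_left = hi.copy()
        hi_left[dim] = cut
        lo_right = lo.copy()
        lo_right[dim] = cut
        stack.append((pts[~right_mask], lo, hi_left))
        stack.append((pts[right_mask], lo_right, hi))

    return cells


# Example: 2000 simulated "cells" measured on 3 variables.
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 3))
cells = adaptive_histogram(data, data.min(axis=0), data.max(axis=0))
total_mass = sum(d * np.prod(hi - lo) for lo, hi, d in cells)
print(len(cells), round(total_mass, 6))
```

Because every point falls in exactly one cell, the estimate integrates to one over the bounding box. Dense regions end up covered by many small cells and sparse regions by a few large ones, which is the sense in which the partition adapts to the data.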

Public Health Relevance

The goal of this research is to develop new statistical methods for nonparametric learning from large amounts of multi-dimensional data. Based on these methods, we will create and provide more effective data analysis and interpretation tools for multi-parameter flow cytometry. This research has two broad impacts. First, the new analytical and computational methods will open up novel ways to use flow cytometry in many areas of study in current biology and medicine. Second, the density estimation methods and software resulting from this research have general applicability beyond flow cytometry analysis, and can be used as building blocks to design new statistical analysis and modeling tools useful in other areas.

National Institutes of Health (NIH)
Research Project (R01)
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Brazhnik, Paul
Stanford University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States