Very high dimensional count and binary data are now common in many fields, including machine learning, imaging and marketing. In high-throughput biology, ultra-high-throughput sequencing technologies, which produce count and categorical data, are displacing microarrays and other "omics" measurement devices. The output of these devices is counts per gene or other biological subunit for tens of thousands of responses per sample, or presence/absence for features such as single nucleotide polymorphisms (SNPs) for possibly millions of responses per sample. Similar data can be derived from features of satellite images, medical scans, monitoring devices and other very high dimensional measurement devices. The investigator will extend highly multivariate and multiple testing methods developed for continuous (primarily normally distributed) data to discrete data. New methods will be developed in four areas: A) analyses for differences in distribution for discrete data that can accommodate complex experimental designs, using generalized linear mixed models with overdispersion and Bayesian or empirical Bayes shrinkage; B) methods for supervised clustering of samples and variables in the discrete data setting, taking into account the error structure of the discrete predictors; C) classical and sufficient dimension reduction methods, such as canonical correlation and sliced inverse regression, for discrete data; and D) extension of concepts and methods in multiple testing, such as false discovery rate estimation, to the discrete setting, in which the p-values from independent or weakly dependent tests may have different null distributions, using conditional mixture modeling. The methods will be tested on genomics and imaging data.
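
The point in area D, that p-values from discrete tests may have different null distributions, can be illustrated with a minimal sketch. It is not the investigator's method: it assumes a one-sided exact binomial test of a success probability of 0.5, chosen only because its null distribution is easy to enumerate, and the function names are hypothetical. The attainable p-values, and therefore the chance of falling below a fixed threshold under the null, change with the total count n.

```python
from math import comb


def null_pvalue_distribution(n):
    """Map each attainable p-value to its probability under H0.

    Assumed test (illustration only): X ~ Binomial(n, 0.5) under H0,
    one-sided p-value for an observed count x is P(X >= x).
    """
    pmf = [comb(n, x) / 2**n for x in range(n + 1)]
    dist = {}
    for x in range(n + 1):
        p_value = sum(pmf[x:])  # upper-tail probability
        dist[p_value] = dist.get(p_value, 0.0) + pmf[x]
    return dist


def null_rejection_rate(n, alpha=0.05):
    """Probability under H0 that the p-value is at most alpha."""
    dist = null_pvalue_distribution(n)
    return sum(prob for p, prob in dist.items() if p <= alpha)


if __name__ == "__main__":
    # With continuous data this rate would equal alpha for every test;
    # here it depends on n because only n + 1 p-values are attainable.
    for n in (4, 5, 20):
        dist = null_pvalue_distribution(n)
        print(n, min(dist), null_rejection_rate(n))
```

In this sketch a test based on 4 observations can never reach p <= 0.05, whereas a test based on 5 observations can, so pooling such p-values as if they were uniformly distributed under the null would be misleading.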

Very highly multivariate data are now the norm in fields as diverse as cell biology, marketing, medical and satellite imaging, meteorology, epidemiology, fraud detection and cancer research. These data may include thousands or millions of measurements on each item in the sample. For example, genotyping services provide individuals with information on hundreds of thousands of genetic variants in their cells, and retailer databases may have information on the sales of tens of thousands of items for each store in the chain. Many of these data come in the form of counts (such as the number of items of each type in inventory, or the number of mRNA molecules encoding a particular protein) or in the form of categories (such as on/off, present/absent, or genotype AA, aa or Aa). Methodology for highly multivariate continuous measurements such as blood pressure and temperature is well developed but does not apply directly to count and categorical data. The investigator will develop statistical methodology and software to improve analysis and summary of count and categorical data. Four main areas of research are proposed: A) statistical models and tests to determine if the variables are associated with differences among groups; B) statistical methods for prediction or classification of group membership; C) methods to summarize the data with a much smaller set of derived variables which preserve the predictive power of the full data; and D) multiple comparisons methods to estimate the error rates. For example, in a study of the genes associated with metastatic versus non-metastatic cancer, the methods could be used to determine which genes are expressed differently in tumors which did or did not advance to metastasis, select a smaller set of genes which could be used as a diagnostic tool, and then provide convenient summaries which can readily be interpreted by clinicians.
In a study of stresses on a machine part, the pixels of scans of the part before and during the application of the stresses could be used to determine precise locations at which the part might fail and differences among features of the scan between parts which fail at low versus high stress. In studies in which a large number of models are fitted or tests conducted, it is necessary to tolerate a small percentage of errors. Concepts and methods in multiple testing which have been developed for continuous data will be extended to assist in estimating and controlling the number of false conclusions with count and categorical data.
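
As a point of reference for the multiple testing methods mentioned above, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, a standard false discovery rate method usually justified for continuous p-values. It is shown only to make the idea of tolerating a small percentage of errors concrete; it is not the extension proposed here, and the function name and the example p-values are arbitrary illustrations.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Standard step-up rule: sort the m p-values, find the largest rank k
    with p_(k) <= k * q / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Arbitrary illustrative p-values, not taken from any study.
    example = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, q=0.05))
```

The rule allows an expected fraction q of the rejected hypotheses to be false, rather than guarding against any single false rejection.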

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1007801
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2010-06-01
Budget End: 2015-05-31
Support Year:
Fiscal Year: 2010
Total Cost: $200,000
Indirect Cost:
Name: Pennsylvania State University
Department:
Type:
DUNS #:
City: University Park
State: PA
Country: United States
Zip Code: 16802