The investigators study statistical analysis of multinomial counts with a large number K of categories and a small sample size n. This "large K and small n" problem is very challenging and requires thinking about statistical inference at a fundamental level. Employing an auxiliary data-generating approach to reason directly toward probabilistic inference, the PIs develop methods that are probabilistic and have desirable frequency properties. The topics to be investigated include 1) one-sample and two-sample "large K and small n" multinomial inference; 2) large-scale simultaneous hypothesis testing; 3) applications in genome-wide association studies; and 4) associated efficient computational methods.

The "large K and small n" multinomial inference problem is motivated by genome-wide association studies, in which single nucleotide polymorphism (SNP) data give rise to a large number of genotypes. SNPs are major genetic variants that may be associated with common diseases such as cancer and heart disease. With the new statistical methods and computing software to be developed in the project, the research is expected to generate useful tools for applied statisticians and scientists who are challenged by very-high-dimensional count data.
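To make the "large K and small n" regime concrete, the following is a minimal sketch, not part of the project's software. It draws n observations from a uniform multinomial over K categories and summarizes how sparse the resulting count vector is. The uniform cell probabilities, the function names, and the values of K and n are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_counts(K, n, seed=0):
    """Draw n observations from a uniform multinomial over K categories.

    Uniform cell probabilities are an illustrative assumption only.
    Returns a Counter mapping category index -> observed count.
    """
    rng = random.Random(seed)
    return Counter(rng.randrange(K) for _ in range(n))


def sparsity_summary(K, n, seed=0):
    """Summarize how sparse the count vector is when K is much larger than n."""
    counts = simulate_counts(K, n, seed)
    observed = len(counts)  # categories with at least one observation
    return {
        "K": K,
        "n": n,
        "observed_categories": observed,
        "empty_fraction": 1 - observed / K,
        "expected_count_per_cell": n / K,
    }


if __name__ == "__main__":
    # Hypothetical sizes: 10,000 categories, 100 observations.
    print(sparsity_summary(K=10_000, n=100))
```

With n observations, at most n of the K categories can be non-empty, so for K = 10,000 and n = 100 at least 99% of the cells are zero and the expected count per cell is 0.01. Large-sample approximations that rely on moderately large expected cell counts are not trustworthy in this setting.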

Project Report

Statistical analysis of multinomial counts with a large number K of categories and a small sample size n has proven challenging for both frequentist and Bayesian methods. This "large K and small n" problem is typical in genome-wide association studies when a block of SNPs (single nucleotide polymorphisms) is considered jointly. For scientific discovery where prior knowledge is very limited, if available at all, multinomial models that allow for exploratory data analysis and model building ought to play a fundamental role. A main objective of the awarded project is to develop a new method for large-scale multinomial inference. The PI and Co-PI took the Inferential Model (IM) approach and developed methods for (1) large-scale simultaneous hypothesis testing (Liu and Xie, 2014a) and (2) large-scale two-sample multinomial inference and its applications in genome-wide association studies (Liu and Xie, 2014b). After observing confounding effects in genome-wide association studies, the PI and Co-PI also carried out a careful exploratory analysis of a real data set and proposed methods for multi-locus testing and correction for confounding effects (Chen, Liu, and Xie, 2014).
Research on inference with large-scale multinomial models requires thinking about statistical inference at a very fundamental level and looking for novel ideas beyond the two currently dominant schools of thought, frequentist and Bayesian. The IM framework, recently proposed by Zhang and Liu (2011) and Martin and Liu (2013, 2014a,b), is a promising alternative. Unlike the existing schools of thought, IM produces prior-free probabilistic output as an assessment of evidence for assertions of interest in scientific inference. One example of such an assertion is that a particular block of SNPs is associated with a given disease.
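As a toy illustration of what prior-free probabilistic output with frequency guarantees can look like, the sketch below computes an IM-style plausibility for the assertion that a normal mean equals a given value. This is the textbook single-observation normal example with a symmetric default random set, in the spirit of Martin and Liu (2013). It is not the multinomial method developed in the project, and the function names and simulation settings are assumptions made for illustration.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def plausibility(x, theta0):
    """Plausibility of the assertion theta = theta0 given one draw x ~ N(theta, 1).

    Toy setup (assumption): x = theta + Z with auxiliary Z ~ N(0, 1), and a
    symmetric default random set for the auxiliary variable.
    """
    return 1.0 - abs(2.0 * normal_cdf(x - theta0) - 1.0)


def rejection_rate(theta_true, alpha, reps=20000, seed=1):
    """Fraction of simulated data sets in which the TRUE value has plausibility <= alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.gauss(theta_true, 1.0)
        if plausibility(x, theta_true) <= alpha:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print(plausibility(x=1.3, theta0=0.0))          # evidence for theta = 0 given x = 1.3
    print(rejection_rate(theta_true=0.0, alpha=0.05))  # should be close to 0.05
```

No prior distribution on the mean appears anywhere in the calculation. In this toy case the plausibility of the true value is uniformly distributed over repeated samples, so it falls at or below a threshold alpha in roughly a fraction alpha of data sets. This is the kind of frequency calibration the report refers to.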
Taking the IM approach, the PI and Co-PI employed an auxiliary data-generating device, similar to the pivotal quantity used in Fisher's fiducial argument, to reason directly toward probabilistic inference rather than to construct fiducial probabilities as a replacement for Bayesian posterior probabilities. Nonetheless, the resulting inference is probabilistic and thus resembles Bayesian inference and its extension, the Dempster-Shafer theory of belief functions. Unlike Bayesian procedures applied without credible prior knowledge, the proposed method provides probabilistic inference that has desirable frequency properties.
The PIs proposed an IM method for large-scale simultaneous hypothesis testing, one of the proposed research topics. They also developed a generalized IM method for large-scale two-sample multinomial inference, another proposed research topic. The technical details of both methods have been published in Liu and Xie (2014a,b). The methods were applied to genome-wide association studies, the third proposed research topic. In analyzing data from a genome-wide association study to identify associations between genetic variants and rheumatoid arthritis, the PIs recognized that confounding effects must be handled carefully for valid inference. Beyond the proposed research topics, they therefore developed methods for multi-locus testing and correction for confounding effects in genome-wide association studies. The technical details are given in Chen, Liu, and Xie (2014). In carrying out this research, the PIs also developed parallel computing methods and computer software.
It is expected that this research on a new approach to multinomial inference will motivate researchers to think deeply about realistic assessment of uncertainty in scientific discovery. The PIs believe that the research will thereby help demonstrate the importance of statistics in scientific investigations.
At the same time, the project has provided graduate students with research topics in statistical analysis and has supported the development of statistics courses on massive data analysis.

References

Chen, D., Liu, C., and Xie, J. (2014). Multi-locus Test and Correction for Confounding Effects in Genome-Wide Association Studies. Submitted.
Liu, C. and Xie, J. (2014a). Probabilistic Inference for Multiple Testing. International Journal of Approximate Reasoning, 55, 654-665.
Liu, C. and Xie, J. (2014b). Large Scale Two-Sample Multinomial Inferences and Its Applications in Genome Wide Association Studies. International Journal of Approximate Reasoning, 55, 330-340.
Martin, R. and Liu, C. (2013). Inferential models: a framework for prior-free posterior probabilistic inference. Journal of the American Statistical Association, 108, 301-313.
Martin, R. and Liu, C. (2014a). Conditional inferential models: combining information for prior-free probabilistic inference. Journal of the Royal Statistical Society, Series B, in press.
Martin, R. and Liu, C. (2014b). Marginal inferential models: prior-free probabilistic inference on interest parameters. Journal of the American Statistical Association, in press.
Zhang, J. and Liu, C. (2011). Dempster-Shafer inference with weak beliefs. Statistica Sinica, 21, 475-494.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
1007678
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2010-08-01
Budget End
2014-07-31
Support Year
Fiscal Year
2010
Total Cost
$230,000
Indirect Cost
Name
Purdue University
Department
Type
DUNS #
City
West Lafayette
State
IN
Country
United States
Zip Code
47907