Multiple hypothesis testing has become an increasingly active area of research because of its usefulness as a statistical tool for analyzing data from modern scientific investigations, such as DNA microarray and functional magnetic resonance imaging (fMRI) studies. While several statistical methods have been put forward to address the multiple testing problems arising in these studies, they were often developed without fully utilizing the fact that the underlying test statistics are dependent. Such dependence is a natural phenomenon for data from these studies, and it might lead to misleading conclusions if not properly taken into consideration. The proposed research project seeks to develop new and innovative multiple testing methods by properly addressing the so-called "dependence issues". It focuses on three broad areas of research: (i) developing new multiple testing methods controlling generalized versions of some standard error rates that allow a few false rejections, (ii) developing new multiple testing methods controlling multiple false directional errors, and (iii) developing new data-adaptive methods controlling the familywise error and false discovery rates.
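For orientation, the error rates named in (i) and (iii) are commonly written as below. The symbols are notation assumed here rather than taken from the abstract: V is the number of true null hypotheses that are rejected, R is the total number of rejections, and k is the number of false rejections one is willing to tolerate. The k-FWER is shown only as one common example of a generalized error rate; the abstract does not say which generalizations the project studies.

```latex
% Assumed notation: V = number of false rejections, R = total number of rejections.
\[
\mathrm{FWER} = \Pr(V \ge 1), \qquad
k\text{-}\mathrm{FWER} = \Pr(V \ge k), \qquad
\mathrm{FDR} = E\!\left[\frac{V}{\max(R, 1)}\right]
\]
```

Setting k = 1 in the k-FWER recovers the usual familywise error rate, which is the sense in which it allows "a few false rejections" when k > 1.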
This project is expected to have a broad impact on the theory and practice of statistics. The results from this project will be of importance to virtually any statistical investigation where questions are posed in terms of testing several hypotheses. For instance, our project can potentially offer new and improved methodologies for microarray or fMRI studies, where detection of differentially expressed genes or active voxels is often framed as a multiple testing problem, and for pharmaceutical investigations, where multiple testing techniques are routinely used in dose-response studies or in evaluating a drug's efficacy relative to a standard drug or placebo. The project would also benefit education through the training of graduate students and the incorporation of the developed methodologies into statistics courses. The results will be disseminated through presentations and discussions at national and international conferences, and through visits to other institutions. The software to be developed under this project will be made available, free of charge, to the scientific community.
Hypothesis testing is a statistical method of making decisions, based on data, regarding the truth or falsity of a certain hypothesis in a scientific study. This extends to what is referred to as multiple testing when there are multiple hypotheses to be tested simultaneously. In modern scientific investigations based on advanced technologies, such as DNA microarray and functional magnetic resonance imaging (fMRI) studies, a breathtaking increase in data-acquisition capabilities is now generating a large number of hypotheses to be tested simultaneously, much larger than what the standard multiple testing methods were originally designed for. For instance, in modern biomedical research where one is interested in determining which genes might be associated with a particular type of cancer, a typical study may record expression levels of thousands of genes for perhaps 100 subjects, with only half having the cancer and the other half serving as a control group. In other words, the number of observations available in this study is much smaller than the number of variables, each of which is associated with a hypothesis. This is a typical example of modern applications of large-scale multiple testing where traditional statistical methods would not be adequate, or even applicable, making an urgent case for the development of newer and more appropriate methods. In response, a tremendous upsurge of research has taken place in the area of multiple testing in the last fifteen years or so, producing many new theories and methodologies. Despite those activities, some fundamentally important issues related to the proper use of these methodologies in many current scientific studies, given some of the nuances these studies bring in, remain to be fully investigated. This collaborative research project started with the primary goal of addressing some of these outstanding issues.
The project has (i) produced fourteen articles, of which eight are published and six have been submitted for publication, all in peer-reviewed journals or special volumes; (ii) created a synergy between a senior researcher with an established research record and a junior researcher with great potential to carry out fundamental research on theory and methodology, and thus fostered and advanced quality research in an emerging scientific field through proper mentoring; (iii) educated two graduate students who earned PhDs in the general area of large-scale multiple testing, a challenging scientific field of modern importance, one of whom is now engaged in teaching and research at a university; (iv) contributed toward modernizing graduate education by providing newer course materials for an area of modern importance; and (v) advanced the knowledge base through dissemination of its results at various conferences and institutions.
The results generated through the project provide a better understanding of the proper use of statistical tools when applied to many modern scientific investigations. The following are three examples of such applications.
A study of how genes are associated with a particular type of cancer typically involves a large number of endpoints (i.e., genetic markers) and hence may be quite expensive. It has become more and more attractive to carry out such genetic studies in two stages. A two-stage design can be cost-effective and efficient, since the genes can be tested in the first stage with relatively few observations to determine whether they are cancer-causing, not cancer-causing, or in need of further investigation in the second stage using additional observations. A statistical method for testing the hypotheses in such a two-stage design framework, with control over falsely discovered cancer-causing genes, has been urgently needed. This project gives such a method.
Biologists are often interested in determining whether some pre-defined gene sets are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Moreover, genes belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in a loss of power, an inflated false positive rate, or both. Our project gives a multiple-testing-based methodology that makes use of available information regarding biologically relevant subsets within each pre-defined gene set while exploiting the underlying dependence structure among the genes.
Hypothesis testing plays a pivotal role in pharmaceutical research related to drug discovery. With the increasing importance given to studies of how DNA variation in the human genome affects the safety and efficacy of drugs, there is now a demand for statistical methods for testing not one but hundreds or even thousands of hypotheses at the same time in modern drug discovery. Our project offers such statistical methods.