This investigation concerns nonparametric methodology for sound inference in large-scale multiple testing problems. In particular, it considers situations in which one observes a large number of small data sets and wishes to test as many hypotheses as there are data sets. A primary concern is ensuring that the tests remain valid when the data-generating mechanism is largely unknown. The fundamental model is one in which the distribution of the data within a data set is the same, up to location and scale, for all data sets. This common distribution is assumed to determine the sampling distributions of all test statistics; hence, given an estimate of the within-data-set distribution, the bootstrap can be used to estimate the requisite sampling distributions. Cluster analysis is investigated as a means of nonparametrically estimating both the common within-data-set distribution and the joint distribution of the location and scale parameters across data sets. Asymptotic properties of these estimates are investigated under a regime in which the number of data sets tends to infinity while the sizes of the individual data sets remain bounded. Extensions to models in which the distributions differ across data sets in more than location and scale are also considered. A key idea in these extensions is to define, and consistently estimate, one or a small number of reference distributions that determine critical values for all test statistics. This allows existing methods to be used to control the false discovery rate even though the test statistics all have different sampling distributions, none of which can be estimated consistently.
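The location-scale idea can be illustrated with a minimal sketch. It is not the cluster-analysis estimator investigated under this grant: as a crude stand-in, it pools standardized residuals across data sets to approximate the common within-data-set distribution, bootstraps the null distribution of a one-sample t-statistic from that pool, and applies the standard Benjamini-Hochberg step-up procedure. The simulated data, the function names, and all parameter values are assumptions made for illustration only. With very small data sets, standardized residuals are a distorted picture of the underlying error distribution, which is one reason a more careful estimator is of interest.

```python
import numpy as np

def simulate(m, n, rng, frac_alt=0.1, shift=2.0):
    """Hypothetical data: m data sets of size n whose errors share one skewed law."""
    errors = rng.exponential(1.0, size=(m, n)) - 1.0   # common mean-zero error distribution
    scale = rng.uniform(0.5, 2.0, size=(m, 1))         # scale differs across data sets
    is_alt = rng.random(m) < frac_alt                  # data sets with nonzero location
    loc = np.where(is_alt, shift, 0.0)[:, None]
    return loc + scale * errors, is_alt

def t_stats(x):
    """One-sample t-statistic per row, testing location = 0."""
    n = x.shape[1]
    return np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)

def pooled_residuals(x):
    """Naive estimate of the common distribution: pool standardized residuals."""
    z = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)
    return z.ravel()

def bootstrap_null(pool, n, n_boot, rng):
    """Resample data sets of size n from the pool and recompute the t-statistic."""
    draws = rng.choice(pool, size=(n_boot, n), replace=True)
    return t_stats(draws)

def two_sided_pvalues(t_obs, t_null):
    """Bootstrap p-values with the usual +1 correction."""
    abs_null = np.sort(np.abs(t_null))
    n_ge = abs_null.size - np.searchsorted(abs_null, np.abs(t_obs), side="left")
    return (1.0 + n_ge) / (abs_null.size + 1.0)

def benjamini_hochberg(p, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()    # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

rng = np.random.default_rng(0)
x, is_alt = simulate(m=500, n=5, rng=rng)
pool = pooled_residuals(x)
t_null = bootstrap_null(pool, n=5, n_boot=5000, rng=rng)
p = two_sided_pvalues(t_stats(x), t_null)
reject = benjamini_hochberg(p, q=0.05)
print(reject.sum(), "rejections among", reject.size, "tests")
```

Because the t-statistic is invariant to scale under the null hypothesis, a single bootstrap null distribution serves every data set in this sketch, which is what makes pooling across many small data sets attractive.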
Important areas of application for the research funded by this grant are microarray analysis and proteomics, both of which provide enormous insight into the study of genetics. The methods investigated have the potential to improve the analysis of microarray and proteomics data. Genetics has had, and will continue to have, a tremendous impact on society, particularly in medicine. Therefore, any method that improves upon existing technology for analyzing genetic data has the potential to enhance the general quality of life.