DNA microarrays can provide insight into genetic changes occurring during the stagewise progression of diseases like cancer. Accurate identification of these changes has significant therapeutic and diagnostic implications. Statistical analysis of such data is, however, challenging due to the sheer volume of information. With new microarray technology it is possible to measure expression on nearly 60,000 transcripts for each sample of tissue analyzed. To properly understand the evolution of a progressive disease, expression values are collected over all possible biological stages, so the number of parameters in such problems can be in the hundreds of thousands, or even millions. This high dimensionality presents theoretical problems for standard ANOVA-based extensions of two-sample Z-tests, a popular method for detecting differentially expressed genes in two groups. Additionally, standard approaches that focus on controlling false detection rates apply primarily to simpler experimental designs; moreover, these approaches tend to be conservative and are expected to perform worse in multigroup settings. This work introduces a new methodology called Bayesian ANOVA for Microarrays (BAM) for reliably detecting differentially expressed genes in complex experimental settings. The approach rests on a high-dimensional variable selection method that exploits a rescaled spike and slab hierarchical model. BAM is shown to be risk optimal in terms of the total number of misclassified genes. The exact mechanisms for this risk optimality are theoretically delineated as a selective shrinkage effect. The theory guides development of graphical devices for adaptive optimal gene selection. A large multistage colon cancer microarray repository collected at the Ireland Cancer Center of Case Western Reserve University serves as a testbed for the methods. In parallel, JAVA-based software for implementing BAM is being developed.
The software uses a menu-driven GUI and requires a minimal number of user-specified tuning parameters, making it user friendly for other molecular biology laboratories.
DNA microarrays allow for high-throughput analysis of potential genetic determinants of diseases like cancer. It is now typical to have expression measurements on nearly 60,000 transcripts for each sample of tissue analyzed. These data can potentially reveal which genes are involved in the stagewise development of cancer and indicate novel therapeutic and diagnostic targets. However, statistical inference to identify interesting genes is challenging due to the large number of statistical tests that are run. Standard approaches employ ANOVA test statistics and are prone to high numbers of false detections. False detection rate control methods tend to be overly conservative and do not extend naturally to more complex multistage experimental designs. This work introduces a new methodology called Bayesian ANOVA for Microarrays (BAM), which reliably detects differentially expressed genes in multigroup experimental design settings. The method employs a special hierarchical model that imparts an oracle-like behaviour for gene selection: ultimately, only the genes that are truly differentially expressed are selected. The reasons for this behaviour are theoretically delineated in this research, and the theory guides the development of novel graphical devices for adaptively optimal gene selection in real microarray datasets. A large multistage colon cancer microarray repository collected at the Ireland Cancer Center of Case Western Reserve University serves as a testbed for the methods and also provides a tremendous opportunity to understand the colon cancer disease process, a topic of great medical importance. While colon cancer has a well-defined progression characterized by clinical stage, very little is known about its molecular evolution.
In parallel, JAVA-based software is being developed with a menu-driven GUI and a minimal number of user-specified tuning parameters, making it feasible to port the software to molecular biology laboratories for active use in the analysis of other disease processes and potentially other high-throughput sources of data.