Variation is a fundamental property of data obtained in modern cancer research, be they genomic changes, gene expression abnormalities, or phenotypic data from gene mapping. Statistical methods arise from a logic that decomposes variation into that which is sporadic and that which may have some biological significance. The purpose of this research proposal is to develop statistical methods tailored to current and emerging data structures in cancer biology. If successful, this research will improve inference about cancer biology by enabling more efficient and robust extraction of information from the complex data that are now emerging. Four specific problems will be tackled. New microarray-based technologies have enabled DNA sequence copy-number variations to be measured at very high resolution in tumor cells, thus enhancing the characterization of suppressor genes and oncogenes. However, sources of variation in these measurements complicate inference.
The first aim is to develop statistical methods for analyzing copy-number variation by extending existing models of allelic-imbalance data. New mathematical formulations and inference methods are proposed for this purpose. Microarray technology is also creating a wealth of data on gene expression in cancer cells.
In Aim 2, hierarchical modeling methods are proposed to characterize the normal variation of these expression profiles, to enable comparison at various levels, such as among genes or among microarrays, and to enable data reduction via nonparametric mixture modeling.
The third aim concerns interval mapping methods, which have for some time enabled the localization of genes in controlled animal experiments. Methods that are nonparametric in the phenotype distribution are highly robust, but available methods can lose too much information by working with sums of ranks. Sensitive nonparametric interval mapping methodology is proposed to enhance efficiency. Finally, phenotype-driven mutagenesis experiments based on quantitative phenotypes require statistical methods to screen mutagenized animals efficiently and to trace mutant genotypes through progeny testing and mapping. Parametric and nonparametric methods are proposed for this purpose. Developments on these four specific aims are linked by common biological features, by structural similarities in the statistical models, and by the computational issues raised by data analysis.