Modern science, especially biochemistry, has become dependent on the numerical analysis of large amounts of data generated in nearly every experiment. Scientific advancement in biology and in understanding disease pathogenesis will likely depend on the analysis of the huge corpus of biomolecular data (e.g., microarray, RNA and DNA sequence data). This advancement is linked to the field's ability to continue developing statistical methodologies capable of identifying a robust ``signal'' that can be reproducibly detected across multiple experiments, all of which generate noisy data. The PI has shown how the theoretical framework of spectral analysis with Markov chains unifies several statistical methods for identifying structure in data observed with noise: discrete Fourier analysis, correspondence analysis, principal components analysis, and spectral clustering. This unifying framework also provides insight into, and generalizations of, these more traditional methods. The PI's proposed research therefore has two major directions. In one direction, it will continue basic methodological development of exploratory data analysis, with a focus on methods capable of identifying biological signals observed under noisy experimental conditions. In the other, it will focus on rigorous statistical analysis of this methodology, which is in wide use in statistics, computer science and bioinformatics.
Statistical methods developed here will be aimed particularly at the study of cellular regulation of gene and protein expression. These cellular mechanisms have wide-ranging importance in understanding human disease, including cancer and infectious disease. The data analytic methods developed under this grant will be implemented and made publicly available through Bioconductor, an open-source repository of R packages. The broad goal of this proposal is to work towards a unification of methods from statistics, biology and computer science as applied to biomolecular data. Thus, it falls roughly into the field of bioinformatics.