The analysis of large datasets from computational biology and medicine represents an important challenge for statisticians. These data typically have a large number of correlated features with relatively weak signals for predicting phenotypes of interest. Examples of such data include DNA sequences and GWAS, mass-spectra, MRI and EEG images, RNAseq and protein arrays, to name a few. The broad goal of this ongoing three-investigator grant is to develop and study statistical techniques that enhance the analysis and interpretation of these data. Our team combines experience in statistical modeling, algorithmic development, and theoretical analysis of these techniques. In the new projects, our focus is the development of state-of-the-art methods to exploit known or implied structure in order to extract useful information from high-dimensional data. The renewal will address these goals through four Specific Aims. The investigators will study:
1. Principal curves for modeling chromatin architecture. We propose new statistical methodology for modeling the chromatin structure of DNA based on contact maps derived from Hi-C assays. We use techniques inspired by principal curves, but applied in the context of metric scaling, that take into account local structure along the chromosome.
2. Fitting sparse models to large data and to summary data. Many modern datasets (e.g. GWAS with 1M SNPs and 500K subjects) are computationally challenging. We propose computational advances that enable the lasso to scale to such scenarios. Often the authors of published GWAS studies do not share the raw data for privacy and other reasons. We propose techniques for approximately fitting multivariate versions of these models given only the univariate summary scores typically reported.
3. Estimating high-dimensional eigenstructure in virology and genetics. We will exploit low-rank structure in sequence data to compare different methods for inference about sectors in viral proteins. For quantitative genetics, we will develop statistical theory, methods and software for eigenanalysis of multiple levels of variation, and specifically for genetic covariance matrices.
4. Prediction with side information. Many studies seek biomarker signatures that are predictive of outcomes such as disease status under various treatments. We propose a statistical approach for exploiting side information, such as membership in gene pathways or quantitative measures for each biomarker, in order to increase the power for discovering signatures in these challenging domains.
Working together, the investigators and their students will implement the new statistical tools in publicly available software, following a pattern established in earlier cycles of this grant, in which our packages have found wide use among medical researchers both at Stanford and around the world.
Statistical methods such as those to be developed in this research help medical researchers discover and validate new basic science results (for example in genomics, virology and fundamental biology) that can lead to new therapies. They also aid in the planning and analysis of clinical investigations of new treatments, so as to use the large amount of data collected in current research in the most efficient manner while accurately describing the degree of uncertainty in the conclusions.