Rapid technological advances now allow molecular profiling across multiple domains from a single tumor sample, supporting clinical decision-making in many diseases, especially cancer. The key challenges are to assimilate information effectively across these domains in order to identify genomic signatures and biological entities that may be targeted by drugs, to develop accurate risk prediction profiles for future patients, and to identify novel patient subgroups for tailored therapy and monitoring. The primary objective of this project is the development of an innovative, flexible and scalable statistical framework for analyzing multi-domain, complex-structured, and high-throughput 'omics datasets from modern array and next-generation sequencing platforms. The work is motivated by several investigations related to lung cancer; however, the proposed methods and computational tools are broadly applicable in a variety of contexts involving high-dimensional data. From a broader scientific perspective, the application of these novel methodologies to the motivating clinical and genomic datasets will allow for principled "structure hunting". This will provide more accurate prediction of clinical outcomes, greater statistical power to detect biologically actionable biomarkers for improved risk estimation and treatment selection in cancer diagnosis and prognosis, and better use of biological domain knowledge to find relationships between different platforms. It will lead to the subsequent implementation of rational, biomarker-based, individualized clinical trials that increase the success rate of personalized therapies based on molecular markers.
To achieve these goals, the following specific aims are proposed: (1) Develop versatile and flexible statistical techniques for identifying differential genomic signatures for lung cancer in mixed, heterogeneously scaled single-domain datasets arising from array and next-generation sequencing-based studies. A general class of nonparametric Bayesian models based on sound theoretical justifications will be developed and implemented using efficient, scalable algorithms. These models provide biologically interpretable summaries and are applicable to a wide variety of high-throughput datasets. (2) Formulate integrative probabilistic frameworks for massive multiple-domain data, which coherently incorporate dependence within and between domains to accurately detect tumor subtypes and predict clinical outcomes, thus providing a catalogue of genomic aberrations associated with cancer taxonomy. (3) Develop massively parallel algorithms and high-performance computational and inferential tools that drastically reduce computation time and scale to high-throughput datasets. These scalable inferential procedures are able to assimilate information from several platforms and select flexible models with the appropriate dependence structures, while detecting optimally sparse, non-linear mechanisms for predicting and identifying tumor subtypes. Because these formulations are fully probabilistic, they offer substantial improvements over purely algorithmic approaches by accounting for different sources of variation and providing measures of inference uncertainty. Since existing simulation-based algorithms do not scale to massive datasets, theoretical properties of these models will be exploited to devise data-squashing algorithms for efficient inference.
Furthermore, as traditional CPUs are limited by energy consumption, heat generation and memory access, software that harnesses the power of low-cost massively parallel computing tools such as graphics processing units (GPUs) will be developed and made freely available.