Biomarker identification is becoming an important use for high-throughput technologies like microarrays and mass spectrometry. These high-throughput data (especially microarray data) are used extensively for classifying tissue types (including various tumor types) and for predicting patient survival time, time to relapse, and other clinically relevant temporal quantities. They measure the activity levels of thousands of potential predictors (genes in the case of gene expression data, and peptides in the case of mass spectrometry or protein microarray data). Analyzing these data poses difficult statistical problems because the number of features measured is far larger than the number of tissue samples typically available. Moreover, many different sets of predictors produce similar prediction accuracies. Here, we propose to incorporate biological knowledge into a supervised framework to identify biologically meaningful predictors for classification and survival analysis. Towards this end, we will develop Bayesian Model Averaging (BMA) methods to produce simple, reliable, robust, and interpretable predictions. BMA also provides a probabilistic multivariate feature selection method. As part of this effort, we will extend the recently developed latent position cluster model for social networks to infer biological networks and identify network modules. Network properties (e.g., modules and degree of connectivity) carry biological meaning. Hence, we will integrate network properties into a supervised framework to identify biologically meaningful predictors. We will extend the BMA methods to determine predictive network modules and pre-defined gene categories (e.g., GO categories, KEGG pathways).
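To illustrate how BMA can act as a probabilistic feature selection method, the following is a minimal sketch and not the method proposed here. It assumes a linear regression outcome rather than the classification and survival models this proposal targets, approximates posterior model probabilities with BIC weights, and enumerates all small predictor subsets, which is feasible only for a handful of candidate predictors. The function name `bma_inclusion_probabilities`, the `max_size` parameter, and the simulated data are hypothetical.

```python
import itertools

import numpy as np


def bma_inclusion_probabilities(X, y, max_size=3):
    """BIC-weighted model averaging over linear models with at most max_size predictors.

    Returns (inclusion, models): inclusion[j] is the summed weight of all models
    containing predictor j; models is a list of (subset, weight) pairs.
    """
    n, p = X.shape
    subsets, bics = [], []
    for k in range(max_size + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the selected predictors.
            design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ coef) ** 2))
            # BIC for a Gaussian linear model, up to an additive constant.
            bics.append(n * np.log(rss / n) + design.shape[1] * np.log(n))
            subsets.append(subset)

    bics = np.array(bics)
    # Approximate posterior model probabilities (equal prior weight per model).
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()

    inclusion = np.zeros(p)
    for subset, w in zip(subsets, weights):
        for j in subset:
            inclusion[j] += w
    return inclusion, list(zip(subsets, weights))


# Simulated data (an assumption for illustration): only predictors 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=60)

inclusion, models = bma_inclusion_probabilities(X, y)
top_subset, top_weight = max(models, key=lambda m: m[1])
print("inclusion probabilities:", np.round(inclusion, 3))
print("highest-weight model:", top_subset)
```

The inclusion probabilities summarize how strongly the data support each predictor across all candidate models, rather than relying on a single selected model. With thousands of genes, exhaustive enumeration is not practical, so a real analysis would need a search or sampling strategy over the model space.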
This proposal has two main computational thrusts: (1) the development of BMA methods for multi-class classification and survival analysis (Aim 1); and (2) the development of a latent position cluster model for inferring biological networks and identifying network modules (Aim 3). These two thrusts are unified in Aim 2, in which we use network modules and properties in the supervised BMA framework.
In Aim 4, we will generate expression perturbation data to evaluate our network construction methods. Finally, we will make the software and data generated publicly available. The methods developed in this proposal are generally applicable to many high-throughput data types. However, since we will generate expression perturbation data to validate and refine the constructed expression networks, we will focus on applying our developed methods to gene expression data.
Biomarker identification is becoming an important use for high-throughput technologies like microarrays. This proposal aims to identify biologically meaningful predictive biomarkers for classifying tissue types (including various tumor types) and for predicting patient survival time, time to relapse, and other clinically relevant temporal quantities. This project could lead to inexpensive, accurate, and robust diagnostic tests that increase the accuracy of diagnoses or prognoses for patients with cancer or other diseases.