Biomarker identification is becoming an important use of high-throughput technologies such as microarrays and mass spectrometry. High-throughput data, especially microarray data, are used extensively to classify tissue types (including various tumor types) and to predict patient survival time, time to relapse, and other clinically relevant temporal quantities. These data measure the activity levels of thousands of potential predictors: genes in the case of gene expression data, and peptides in the case of mass spectrometry or protein microarray data. Analyzing these data poses difficult statistical problems because the number of features measured is far larger than the number of tissue samples typically available. Moreover, many different sets of predictors produce similar prediction accuracies. Here, we propose to incorporate biological knowledge into a supervised framework to identify biologically meaningful predictors for classification and survival analysis. Toward this end, we will develop Bayesian Model Averaging (BMA) methods to produce simple, reliable, robust, and interpretable predictions. BMA also provides a probabilistic multivariate feature selection method. As part of this effort, we will extend the recently developed latent position cluster model for social networks to infer biological networks and identify network modules. Network properties (e.g., modules and the degree of connectivity) carry biological meaning. Hence, we will integrate network properties into a supervised framework to identify biologically meaningful predictors. We will also extend the BMA methods to determine predictive network modules and pre-defined gene categories (e.g., GO categories and KEGG pathways).
This proposal has two main computational thrusts: (1) the development of BMA methods for multi-class classification and survival analysis (Aim 1); and (2) the development of a latent position cluster model for inferring biological networks and identifying network modules (Aim 3). These two thrusts are unified in Aim 2, in which we use network modules and properties in the supervised BMA framework.
In Aim 4, we will generate expression perturbation data to evaluate our network construction methods. Finally, we will make the software and data generated publicly available. The methods developed in this proposal are generally applicable to many high-throughput data types. However, since we will generate expression perturbation data to validate and refine the constructed expression networks, we will focus on applying our developed methods to gene expression data.
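
To make the idea of BMA as a probabilistic feature selection method concrete, the following is a minimal sketch and not the method developed in this project. It assumes a plain linear regression response (the proposal targets classification and survival outcomes), a BIC approximation to the posterior model probabilities, and exhaustive enumeration of small predictor subsets. The function names `bic_linear` and `bma_linear` and the simulated data are hypothetical and serve only as illustration.

```python
import itertools
import numpy as np


def bic_linear(X, y):
    """Return the BIC of a least-squares fit with intercept (Gaussian errors)."""
    n = len(y)
    # An empty predictor set leaves the intercept-only model.
    design = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)


def bma_linear(X, y, max_size=2):
    """Average over all predictor subsets of size <= max_size (assumed small).

    Returns the candidate models, their approximate posterior probabilities,
    and the posterior inclusion probability of each predictor.
    """
    p = X.shape[1]
    models, bics = [], []
    for k in range(max_size + 1):
        for subset in itertools.combinations(range(p), k):
            models.append(subset)
            bics.append(bic_linear(X[:, list(subset)], y))
    bics = np.array(bics)
    # Posterior model probability is approximately proportional to exp(-BIC / 2),
    # assuming equal prior probability for every candidate model.
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()
    # Inclusion probability: total weight of the models containing a predictor.
    inclusion = np.zeros(p)
    for subset, w in zip(models, weights):
        inclusion[list(subset)] += w
    return models, weights, inclusion


# Simulated example: only predictors 0 and 2 influence the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(size=100)
_, demo_weights, demo_inclusion = bma_linear(X_demo, y_demo, max_size=2)
print(np.round(demo_inclusion, 3))
```

Predictors with high inclusion probability are the candidates a BMA-style analysis would favor. With thousands of genes, exhaustive enumeration is infeasible, so a practical method would need a search or sampling strategy over models instead.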

Public Health Relevance

Biomarker identification is becoming an important use of high-throughput technologies such as microarrays. This proposal aims to identify biologically meaningful predictive biomarkers for tissue type classification (including various tumor types) and for predicting patient survival time, time to relapse, and other clinically relevant temporal quantities. This project could lead to inexpensive, accurate, and robust diagnostic tests that improve diagnoses or prognoses for patients with cancer or other diseases.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM084163-03
Application #
7884325
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Remington, Karin A
Project Start
2008-09-01
Project End
2013-06-30
Budget Start
2010-07-01
Budget End
2011-06-30
Support Year
3
Fiscal Year
2010
Total Cost
$462,742
Indirect Cost
Name
University of Washington
Department
Microbiology/Immun/Virology
Type
Schools of Medicine
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195
Fraley, Chris; Percival, Daniel (2015) Model-Averaged ℓ1 Regularization using Markov Chain Monte Carlo Model Composition. J Stat Comput Simul 85:1090-1101
Fronczuk, Maciej; Raftery, Adrian E; Yeung, Ka Yee (2015) CyNetworkBMA: a Cytoscape app for inferring gene regulatory networks. Source Code Biol Med 10:11
Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee (2014) Fast Bayesian inference for gene regulatory networks using ScanBMA. BMC Syst Biol 8:47
Lenkoski, Alex; Eicher, Theo S; Raftery, Adrian E (2014) Two-Stage Bayesian Model Averaging in Endogenous Variable Models. Econom Rev 33:
Yeung, K Y; Gooley, T A; Zhang, A et al. (2012) Predicting relapse prior to transplantation in chronic myeloid leukemia by integrating expert knowledge and expression data. Bioinformatics 28:823-30
Raftery, Adrian E; Niu, Xiaoyue; Hoff, Peter D et al. (2012) Fast Inference for the Latent Space Network Model Using a Case-Control Approximate Likelihood. J Comput Graph Stat 21:901-919
Lo, Kenneth; Raftery, Adrian E; Dombek, Kenneth M et al. (2012) Integrating external biological knowledge in the construction of regulatory networks from time-series expression data. BMC Syst Biol 6:101
McCormick, Tyler H; Raftery, Adrian E; Madigan, David et al. (2012) Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics 68:23-30
Yeung, Ka Yee; Dombek, Kenneth M; Lo, Kenneth et al. (2011) Construction of regulatory networks using expression time-series data of a genotyped population. Proc Natl Acad Sci U S A 108:19436-41
Steele, Russell J; Wang, Naisyin; Raftery, Adrian E (2010) Inference from Multiple Imputation for Missing Data Using Mixtures of Normals. Stat Methodol 7:351-364

Showing the most recent 10 out of 13 publications