Biomarker identification is becoming an important use of high-throughput technologies such as microarrays and mass spectrometry. High-throughput data (especially microarray data) are used extensively to classify tissue types, including various tumor types, and to predict patient survival time, time to relapse, and other clinically relevant temporal quantities. These data measure the activity levels of thousands of potential predictors (genes in the case of gene expression data, and peptides in the case of mass spectrometry or protein microarray data). Analyzing them poses difficult statistical problems because the number of features measured is far larger than the number of tissue samples typically available. Moreover, many different sets of predictors produce similar prediction accuracies. Here, we propose to incorporate biological knowledge into a supervised framework to identify biologically meaningful predictors for classification and survival analysis. To this end, we will develop Bayesian Model Averaging (BMA) methods to produce simple, reliable, robust, and interpretable predictions. BMA also provides a probabilistic multivariate feature selection method. As part of this effort, we will extend the recently developed latent position cluster model for social networks to infer biological networks and identify network modules. Network properties (e.g., modules and degree of connectivity) carry biological meaning. Hence, we will integrate network properties into a supervised framework to identify biologically meaningful predictors. We will also extend the BMA methods to determine predictive network modules and pre-defined gene categories (e.g., GO categories, KEGG pathways).
This proposal has two main computational thrusts: (1) the development of BMA methods for multi-class classification and survival analysis (Aim 1); and (2) the development of a latent position cluster model for inferring biological networks and identifying network modules (Aim 3). These two thrusts are unified in Aim 2, in which we use network modules and properties in the supervised BMA framework.
In Aim 4, we will generate expression perturbation data to evaluate our network construction methods. Finally, we will make the software and data generated publicly available. The methods developed in this proposal are generally applicable to many high-throughput data types. However, since we will generate expression perturbation data to validate and refine the constructed expression networks, we will focus on applying our developed methods to gene expression data.
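
As an illustration of how BMA yields a probabilistic multivariate feature selection, the sketch below averages over all subsets of a small set of predictors and reports a posterior inclusion probability for each one. It is a minimal sketch under stated assumptions, not the methods proposed here: it uses ordinary linear regression rather than multi-class classification or survival models, approximates posterior model probabilities with BIC weights, and enumerates every subset, which is feasible only for a handful of predictors. The function name `bma_linear` and the simulated data are hypothetical.

```python
import itertools
import numpy as np


def bma_linear(X, y):
    """BIC-approximated Bayesian Model Averaging over all column subsets of X.

    Assumption: a Gaussian linear model with equal prior probability for
    every subset; this is an illustrative stand-in, not the proposal's method.
    """
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the chosen predictors.
            A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ beta) ** 2))
            # BIC for a Gaussian linear model, up to an additive constant.
            bic = n * np.log(rss / n) + A.shape[1] * np.log(n)
            models.append(subset)
            bics.append(bic)

    bics = np.array(bics)
    # Posterior model probabilities approximated by exp(-BIC / 2), normalized.
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()

    # Posterior inclusion probability: total weight of models containing j.
    inclusion = np.zeros(p)
    for subset, w in zip(models, weights):
        for j in subset:
            inclusion[j] += w
    return models, weights, inclusion


# Simulated data (hypothetical): only predictors 0 and 2 affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)

models, weights, inclusion = bma_linear(X, y)
for j, prob in enumerate(inclusion):
    print(f"predictor {j}: posterior inclusion probability = {prob:.3f}")
```

With thousands of genes, enumerating all subsets is impossible, so a practical method has to restrict or search the model space; the exhaustive loop above is kept only to make the averaging step explicit.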

Public Health Relevance

Biomarker identification is becoming an important use of high-throughput technologies such as microarrays. This proposal aims to identify biologically meaningful predictive biomarkers for classifying tissue types, including various tumor types, and for predicting patient survival time, time to relapse, and other clinically relevant temporal quantities. This project could lead to inexpensive, accurate, and robust diagnostic tests that improve diagnoses and prognoses for patients with cancer or other diseases.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM084163-05
Application #: 8294758
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Ravichandran, Veerasamy
Project Start: 2008-09-01
Project End: 2014-06-30
Budget Start: 2012-07-01
Budget End: 2014-06-30
Support Year: 5
Fiscal Year: 2012
Total Cost: $457,112
Indirect Cost: $172,665

Name: University of Washington
Department: Microbiology/Immun/Virology
Type: Schools of Medicine
DUNS #: 605799469
City: Seattle
State: WA
Country: United States
Zip Code: 98195

Lenkoski, Alex; Eicher, Theo S; Raftery, Adrian E (2014) Two-Stage Bayesian Model Averaging in Endogenous Variable Models. Econom Rev 33:
Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee (2014) Fast Bayesian inference for gene regulatory networks using ScanBMA. BMC Syst Biol 8:47
Yeung, K Y; Gooley, T A; Zhang, A et al. (2012) Predicting relapse prior to transplantation in chronic myeloid leukemia by integrating expert knowledge and expression data. Bioinformatics 28:823-30
McCormick, Tyler H; Raftery, Adrian E; Madigan, David et al. (2012) Dynamic logistic regression and dynamic model averaging for binary classification. Biometrics 68:23-30
Yeung, Ka Yee; Dombek, Kenneth M; Lo, Kenneth et al. (2011) Construction of regulatory networks using expression time-series data of a genotyped population. Proc Natl Acad Sci U S A 108:19436-41
Zarbl, Helmut; Gallo, Michael A; Glick, James et al. (2010) The vanishing zero revisited: thresholds in the age of genomics. Chem Biol Interact 184:273-8
Steele, Russell J; Wang, Naisyin; Raftery, Adrian E (2010) Inference from Multiple Imputation for Missing Data Using Mixtures of Normals. Stat Methodol 7:351-364
Oehler, Vivian G; Yeung, Ka Yee; Choi, Yongjae E et al. (2009) The derivation of diagnostic markers of chronic myeloid leukemia progression from microarray data. Blood 114:3292-8
Annest, Amalia; Bumgarner, Roger E; Raftery, Adrian E et al. (2009) Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinformatics 10:72