SYSTEMS LEVEL CAUSAL DISCOVERY IN HETEROGENEOUS TOPMED DATA

ABSTRACT

The advent of new technologies for collecting and analyzing multiple heterogeneous data streams from the same individual makes possible the detailed phenotypic characterization of diseases and paves the way for the development of individualized precision therapies. A major bottleneck in this process is the lack of robust, efficient and truly integrative analytic methods for such multi-modal data. This proposal builds on the ongoing efforts of our group in the area of causal learning in biomedicine. The objective of this application is to extend, modify and tailor our causal probabilistic graphical models to data typically collected by TOPMed projects, such as -omics data (SNPs, metabolomics, RNA-seq, etc.), imaging, patient history, and clinical data.

COPDGene is one of the TOPMed projects and has generated datasets with those modalities for 10,000 patients with chronic obstructive pulmonary disease (COPD), the third leading cause of death and a major cause of disability and health care costs in the US. The prevailing view is that COPD is a syndrome consisting of multiple diseases with different characteristics. There is currently no satisfactory method for COPD subtyping or for predicting disease progression. In this project we will apply, test and validate our approaches on COPDGene and another large independent COPD cohort.

Extending and applying our methods to cross-sectional and longitudinal data will also allow us to investigate a number of important questions related to COPD. Mechanistically, we will investigate how SNPs, genes and their networks are causally linked to disease phenotypes. In pathology, we will identify conditional biomarkers, which will lead to disease sub-classification and identification of causal components in each subtype. In pathophysiology, we will identify features that are directly linked to lung function decline and outcome.
We will make all our algorithms and results available to the community through web and public cloud interfaces. The deliverables will be (1) new probabilistic approaches for integration and analysis of multi-modal cross-sectional and longitudinal data, including SNPs, blood biomarkers, CT scans and clinical data; (2) a new cloud-based server to make these approaches available to the research community; (3) results on the mechanism, pathology and pathophysiology of COPD facilitation and progression. To support the success of the project we have assembled a team of experts in genomics, machine learning, cloud computing and COPD. This cross-disciplinary team project will have a positive impact beyond the above deliverables, since the generality of our approaches makes them applicable to any disease. We expect that during this U01 we will have the opportunity to collaborate with other teams in the TOPMed consortium to help them investigate the causes of their corresponding disease phenotypes. We believe that data integration in a single probabilistic framework will be at the heart of future precision medicine strategies, when massive high-throughput data collection becomes a routine diagnostic and prognostic procedure in all hospitals.

Public Health Relevance

Current technologies for high-throughput biomedical data collection allow the interrogation of multiple modalities from a single patient. Promising new analytical methods are emerging that can analyze such multi-modal data in a holistic way. Chronic obstructive pulmonary disease (COPD) is the third leading cause of death and a major cause of disability and health care costs in the US. The prevailing view is that COPD is a syndrome consisting of multiple diseases, each with its own characteristics. There is currently no satisfactory method for COPD subtyping. We will apply, test and validate new probabilistic approaches on two cohorts of COPD patients. We will investigate the mechanisms of disease facilitation, identify patient cohorts with specific characteristics (disease subtypes), and investigate risk factors and causal variants for disease progression in each subtype.

National Institutes of Health (NIH)
National Heart, Lung, and Blood Institute (NHLBI)
Research Project--Cooperative Agreements (U01)
Study Section
Special Emphasis Panel (ZHL1-CSR-Q (F1))
Program Officer
Gan, Weiniu
University of Pittsburgh
Schools of Medicine
United States
Raghu, Vineet K; Ramsey, Joseph D; Morris, Alison et al. (2018) Comparison of strategies for scalable causal discovery of latent variable models from mixed data. Int J Data Sci Anal 6:33-45
Kitsios, Georgios D; Fitch, Adam; Manatakis, Dimitris V et al. (2018) Respiratory Microbiome Profiling for Etiologic Diagnosis of Pneumonia in Mechanically Ventilated Patients. Front Microbiol 9:1413
Manatakis, Dimitris V; Raghu, Vineet K; Benos, Panayiotis V (2018) piMGM: incorporating multi-source priors in mixed graphical models for learning disease networks. Bioinformatics 34:i848-i856
Ping, Peipei; Hermjakob, Henning; Polson, Jennifer S et al. (2018) Biomedical Informatics on the Cloud: A Treasure Hunt for Advancing Cardiovascular Medicine. Circ Res 122:1290-1301
Raghu, Vineet K; Beckwitt, Colin H; Warita, Katsuhiko et al. (2018) Biomarker identification for statin sensitivity of cancer cell lines. Biochem Biophys Res Commun 495:659-665
Andrews, Bryan; Ramsey, Joseph; Cooper, Gregory F (2018) Scoring Bayesian Networks of Mixed Variables. Int J Data Sci Anal 6:3-18