New technologies afford the acquisition of dense ?data clouds? of individual humans. However, heterogeneity, dimensionality and multi-scale nature of such data (genomes, transcriptomes, clinical variables, etc.) pose a new challenge: How can one query such dense data clouds of mixed data as an integrated set (as opposed to variable by variable) against multiple knowledge bases, and translate the joint molecular information into the clinical realm? Current lexical mapping and brute-force data mining seek to make heterogeneous data interoperable and accessible but their output is fragmented and requires expertise to assemble into coherent actionable information. We propose DeepTranslate, an innovative approach that incorporates the known actual physical organization of biological entities that are the substrate of pathogenesis into (i) networks (data graphs) and (ii) hierarchies of concepts that span the multiscale space from molecule to clinic. Organizing data sources along such natural structures will allow translation of burgeoning high-dimensional data sets into concepts familiar to clinicians, while capturing mechanistic relationships. DeepTranslate will take a hybrid approach to learn and organize its content from both (i) existing generic comprehensive knowledge sources (GO, KEGG, IDC, etc.) and (ii) newly measured instances of individual data clouds from two demonstration projects: (1) ISB?s Pioneer 100 and (2) St. Jude Lifetime cancer survivors. We will focus on diabetes as test case. These two studies cover a deep biological scale-space and thus can test the full extent of the multiscale capacity of DeepTranslate in a focused application. 1. TYPES OF RESEARCH QUESTION ENABLED. How can a clinician find out that the dozens of ?out of range? variables observed in a patient?s data cloud, form a connected set with respect to pathophysiology pathways, from gene to clinical variable? How can the high-dimensional data of studies that measure for each individual 100+ data points of various types (?personal data clouds?) be analyzed as one set in an integrated fashion (as opposed to variable by variable) against existing knowledge bases and also be used to improve the databases? DeepTranslate addresses these two types of questions and thereby will accelerate translation of future personal data clouds into (A) care decisions and (B) hypotheses on new disease mechanisms / treatments, thereby benefiting providers as well as researchers. 2. USE OF EXPERTISE AND RESOURCES. ? ISB: pioneer in personalized, big-data driven medicine (Demo Project 1); biomedical content expertise; multiscale omics and molecular pathogenesis, big data analysis, housing of databases for public access; query engine designs, GUI. ? UCSD: leader in biomedical data integration; automated assembly of molecular and clinical data into hierarchical structures; translation between data types ? U Montreal: biomedical database curation from literature and construction of gene/protein/drug interaction networks; machine learning, open resource database ? St Jude CRH: Cancer monitoring Demo Project 2, cancer patient data analytics. 3. POTENTIAL DATA AND INFRASTRUCTURE CHALLENGES. (a) Existing comprehensive clinical data sources are not uniform and not explicitly based on biological networks; cross-mapping is being performed at NLM based on lexical relationships: HPO (phenotypes) vs. SNOMED CT (for EMR) vs. IDC or Merck Manual (for diseases). Careful selection of these sources in close collaboration with NLM is needed. (b) Existing molecular pathway databases are static, based on averages of heterogeneous non-stratified populations, while the newly measured high-dimensional data clouds are varied due to intra-individual temporal fluctuation and inter-individual variation. How this will affect building of ontotypes in our hybrid approach, and how large cohorts of data clouds must be to offer statistical power is yet to be determined. Our two Demonstration Projects with their uniquely deep (high-dimensional and multiscale) data in cohorts of limited but growing size are thus crucial first steps in a long journey of collective learning in the TRANSLATOR community.

National Institute of Health (NIH)
National Center for Advancing Translational Sciences (NCATS)
Project #
Application #
Study Section
Special Emphasis Panel (ZTR1-SRC (99))
Program Officer
Colvis, Christine
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Institute for Systems Biology
United States
Zip Code
Huang, Justin K; Carlin, Daniel E; Yu, Michael Ku et al. (2018) Systematic Evaluation of Molecular Networks for Discovery of Disease Genes. Cell Syst 6:484-495.e5
Ma, Jianzhu; Yu, Michael Ku; Fong, Samson et al. (2018) Using deep learning to model the hierarchical structure and function of a cell. Nat Methods 15:290-298
Wang, Sheng; Ma, Jianzhu; Yu, Michael Ku et al. (2018) Annotating gene sets by mining large literature collections with protein networks. Pac Symp Biocomput 23:602-613
Glusman, Gustavo; Rose, Peter W; Prli?, Andreas et al. (2017) Mapping genetic variations to three-dimensional protein structures to enhance variant interpretation: a proposed framework. Genome Med 9:113