Ever-increasing amounts of physical, functional, and statistical interaction data among biomolecules, including DNA regulatory regions, functional RNAs, proteins, metabolites, and lipids, as well as among genomic variants, offer unprecedented opportunities for computational discovery and for constructing a unified systems view of the cellular machinery. These data and associated formalisms have enabled systems approaches that have led to unique advances in biomedical sciences. Unfortunately, storage schemes, data structures, representations, and query mechanisms for network data are considerably more complex than those for flat or low-dimensional data representations (e.g., sequences or molecular expression). This complexity is even more evident when we consider the heterogeneity of the interactions that can occur in the cell. For example, a pair of protein-coding genes can interact in a variety of ways: (i) physical interaction between their gene products, i.e., protein-protein interaction; (ii) interaction of one gene's product with the promoter, enhancer, or silencer region of the other gene; or (iii) genetic interaction, in which a double mutant has a phenotype significantly different from the combined effect of the single mutations. The complexity grows further when one considers different versions of datasets, the different techniques used for assaying and gathering interactions between molecules, linkages across data, and interfaces with other tools.
This project seeks to answer several fundamental questions about the efficient use of large network-structured datasets: What are (provably) optimal storage schemes for large network-structured databases? How should multiple versions of the same or related datasets be stored? How does one trade off compression against query efficiency? And how does one suitably abstract network data so that users can interactively interrogate them using front ends such as Cytoscape?
This project aims to answer these questions by developing theoretically grounded and computationally validated storage schemes, algorithms, and software that enable efficient and effective storage, update, processing, and querying of biological networks. We will develop compression techniques for efficient storage, version control mechanisms that allow users to create their own versions of networks, algorithms for efficient query processing on these networks, and implementations of these algorithms in broadly accessible and user-friendly software. This research will result in novel computational tools that will be disseminated to the community as open-source, public-domain software. Our tools will render network data fundamentally more accessible to the broader biomedical sciences community. This will make the use of network data more commonplace in applications including the identification of composite prognostic and diagnostic markers, disease gene prioritization, modeling of tumor heterogeneity and progression in cancers, informing treatment, identification of therapeutic targets, and drug repositioning. In these respects, the algorithms and software are expected to have far-reaching and deep impact.

Public Health Relevance

Biochemical networks provide a unified systems view of the cellular machinery in living organisms, but the complexity of network-structured data poses challenges in storing, analyzing, and querying large collections of networks. This project aims to develop compression techniques for efficient storage of 'big' network data, version control mechanisms that allow users to create their own versions of networks, and algorithms for efficient query processing on these networks. All of these methods will be implemented in accessible software and made publicly available.

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01CA198941-03
Application #: 9301507
Study Section: Special Emphasis Panel (ZRG1-BST-N (50)R)
Program Officer: Li, Jerry
Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2017-06-01
Budget End: 2018-05-31
Support Year: 3
Fiscal Year: 2017
Total Cost: $436,692
Indirect Cost: $33,310

Name: Case Western Reserve University
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 077758407
City: Cleveland
State: OH
Country: United States
Zip Code: 44106

Qiao, Shi; Koyutürk, Mehmet; Ozsoyoglu, Meral Z (2018) Querying of Disparate Association and Interaction Data in Biomedical Applications. IEEE/ACM Trans Comput Biol Bioinform 15:1052-1065
Maxwell, Sean; Chance, Mark R; Koyutürk, Mehmet (2017) Linearity of network proximity measures: implications for set-based queries and significance testing. Bioinformatics 33:1354-1361
Mohammadi, Shahin; Gleich, David F; Kolda, Tamara G et al. (2017) Triangular Alignment (TAME): A Tensor-Based Approach for Higher-Order Network Alignment. IEEE/ACM Trans Comput Biol Bioinform 14:1446-1458
Savel, Daniel; LaFramboise, Thomas; Grama, Ananth et al. (2017) Pluribus-Exploring the Limits of Error Correction Using a Suffix Tree. IEEE/ACM Trans Comput Biol Bioinform 14:1378-1388
Stanfield, Zachary; Coşkun, Mustafa; Koyutürk, Mehmet (2017) Drug Response Prediction as a Link Prediction Problem. Sci Rep 7:40321
Cowman, Tyler; Koyutürk, Mehmet (2017) Prioritizing tests of epistasis through hierarchical representation of genomic redundancies. Nucleic Acids Res 45:e131
Mukund, Kavitha; Subramaniam, Shankar (2017) Co-expression Network Approach Reveals Functional Similarities among Diseases Affecting Human Skeletal Muscle. Front Physiol 8:980
Ajami, Nassim E; Gupta, Shakti; Maurya, Mano R et al. (2017) Systems biology analysis of longitudinal functional response of endothelial cells to shear stress. Proc Natl Acad Sci U S A 114:10990-10995
Magner, Abram; Kihara, Daisuke; Szpankowski, Wojciech (2017) A Study of the Boltzmann Sequence-Structure Channel. Proc IEEE Inst Electr Electron Eng 105:286-305
Perez-Riverol, Yasset; Bai, Mingze; da Veiga Leprevost, Felipe et al. (2017) Discovering and linking public omics data sets using the Omics Discovery Index. Nat Biotechnol 35:406-409

Showing the most recent 10 out of 17 publications