The Protein Data Bank (PDB) archive has doubled in size since 2008 and exceeded 100,000 entries in 2014. At the same time, the size and complexity of structures are increasing dramatically; for example, the recently determined structure of the HIV capsid contains about 2.5 million atoms. The emerging techniques of integrative Structural Biology are starting to determine structures of molecular machines in the mega-Dalton range by combining cryo-Electron Microscopy, Small-Angle X-ray Scattering, X-ray, and NMR at increasingly high resolution. Interactive visualization of large complexes exceeds the network bandwidth and memory available on scientists' typical desktops, laptops, or mobile devices. Large-scale structural analyses and queries of the archive have become a Big Data challenge. To make these structures accessible to all scientists, educators, and students, new ways of representing these data are required. In domains such as high-definition television, satellite communication, and video or audio streaming, high-efficiency compression has been key to delivering interactive media to phones, tablets, laptops, and desktops. A similar trend has emerged in the handling of whole-genome sequence data. An entire discipline, Compressive Genomics, has been developed to deal with data compression and the development of algorithms to process these data. This proposal introduces the concept of Compressive Structural Bioinformatics, a set of compression algorithms, applications, and workflows that analyze and visualize large structures and large sets of structures at unprecedented speed (100- to 1000-fold speedup) and with minimal client-side overhead.
The aims of this project are to: 1. Develop a compact and extensible representation of 3-D biomolecular structures, 2. Enable interactive visualization of large complexes by reducing network bandwidth and enabling data streaming, 3. Enable large-scale analyses of the PDB archive for I/O-bound workflows, and 4. Develop open source software libraries. Through collaboration with developers of widely used visualization applications and distributed data-parallel workflow systems, the new techniques will be implemented and benchmarked, and reference implementations will be provided in several programming languages for easy adoption. It is expected that these new Compressive Structural Bioinformatics tools will enable transformative research as intended by the NIH's Big Data to Knowledge initiative.

Public Health Relevance

The 3-D structures (shapes) of proteins and nucleic acids, the building blocks of life, are fundamental to the understanding of disease processes, the mechanisms of drug action, and the development of new medicines. We develop data compression and streaming techniques for large 3-D structures, similar to what YouTube does for videos, to enable access, large-scale analysis, and interactive visualization of very large biomolecules by scientists, educators, and students.

Agency
National Institutes of Health (NIH)
Institute
National Cancer Institute (NCI)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01CA198942-02
Application #
9070726
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Li, Jerry
Project Start
2015-06-01
Project End
2018-05-31
Budget Start
2016-06-01
Budget End
2017-05-31
Support Year
2
Fiscal Year
2016
Total Cost
Indirect Cost
Name
University of California San Diego
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
804355790
City
La Jolla
State
CA
Country
United States
Zip Code
92093
Rose, Alexander S; Bradley, Anthony R; Valasatava, Yana et al. (2018) NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 34:3755-3758
Rose, Peter W; Prlić, Andreas; Altunkaya, Ali et al. (2017) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res 45:D271-D281
Valasatava, Yana; Bradley, Anthony R; Rose, Alexander S et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLoS One 12:e0174846