Advanced correlation analyses to infer sequence and structural determinants of protein function

Neuwald, Andrew

Abstract

A long-term goal of molecular biology is assigning functional and mechanistic roles to specific protein residues, beyond the obvious roles in catalysis. Although this task is hindered by the relative sparsity of experimentally based sequence annotations, it is facilitated by an abundance of sequence data augmented by structural data. This has spurred sequence- and structure-based prediction of function-determining residues using a wide variety of methods. However, by focusing on experimentally characterized functions, these methods disfavor recognition of residues involved in important uncharacterized functions, insofar as these will be benchmarked incorrectly as false positives. Instead, this project focuses more generally on inferring functionally relevant residues (FRRs) by allowing the sequence data itself to reveal its most statistically surprising properties without making assumptions about what will be found. We argue that, in the absence of experimental annotations, it is only possible to directly link individual residues to other residues and such residue sets to structural features. This project will make such associations by identifying sequence-to-sequence and sequence-to-structure correlations, and will focus solely on the observed data rather than on predicting (unseen) biochemical properties. The goal is to obtain hypothesis-generating observations for experimental follow-up.
Aim 1 will create advanced tools for characterizing correlated residue patterns that arise from functional divergence, with each pattern consisting of an arbitrary number of residues.
Aim 2 will develop a tool to probabilistically assess correlations between independent sequence-defined and structurally defined residue sets. This tool will be modified for other purposes, including the evaluation of FRR-prediction programs.
Aim 3 will integrate the methods of Aims 1 & 2 and direct coupling analysis (DCA) into a nearly comprehensive system for sequence/structural correlation analysis. (Unlike the correlations under Aims 1 & 2, DCA focuses on direct correlations between residue pairs.) This strategy involves a high degree of model complexity: it optimizes jointly over diverse, interdependent sequence properties and over alternative models and parameters. Hence, considerable care is required to ensure reliable results. We will therefore apply information-theoretic principles to adjust accurately for multiple hypotheses, to avoid under- and over-fitting to the data, and to eliminate inherent biases. Aim 3 will also characterize the relationships among the various types of correlations. We will apply these tools to large, functionally diverse superfamilies in collaboration with researchers interested in these proteins.
Aim 4 will rigorously benchmark the performance of tools developed under Aims 1 & 3 relative to competing methods, using tools developed under Aim 2 and hundreds of conserved domain datasets.
This project will aid research efforts in protein engineering, the molecular basis of human disease, drug design and personalized medicine.
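As a rough illustration of what "probabilistically assessing" the correlation between a sequence-defined residue set and a structurally defined residue set (Aim 2) could look like, the sketch below computes an upper-tail hypergeometric probability for the overlap of two sets of alignment positions. This is a generic textbook test, swapped in for illustration; the abstract does not describe the project's actual statistical model. The function name `overlap_pvalue`, the 1-based position numbering, and the example residue sets are all hypothetical.

```python
from math import comb


def overlap_pvalue(n_positions, seq_set, struct_set):
    """Return (overlap, p) for two residue sets drawn from 1..n_positions.

    p is the probability of seeing at least `overlap` shared positions if
    struct_set were a uniformly random subset of the same size
    (hypergeometric upper tail). Illustrative only: an assumption, not the
    method used in the project.
    """
    a, b = set(seq_set), set(struct_set)
    if any(not 1 <= x <= n_positions for x in a | b):
        raise ValueError("positions must lie in 1..n_positions")
    k = len(a & b)               # observed overlap
    K, n = len(a), len(b)        # sizes of the two sets
    total = comb(n_positions, n)  # number of possible subsets of size n
    # Sum exact integer counts first, then divide once.
    tail = sum(comb(K, i) * comb(n_positions - K, n - i)
               for i in range(k, min(K, n) + 1))
    return k, tail / total


# Hypothetical example: a correlated residue pattern from a sequence analysis
# versus residues assumed to lie near a bound ligand in a structure.
pattern_residues = {10, 11, 12, 45, 46, 80}
ligand_shell = {11, 12, 13, 45, 46, 47, 90}
overlap, p = overlap_pvalue(100, pattern_residues, ligand_shell)
print(f"overlap={overlap}, p={p:.3g}")
```

A small p here would only flag the overlap as surprising under random placement; it treats positions as exchangeable and makes no correction for testing many set pairs, which the abstract identifies as something that needs careful adjustment.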

Public Health Relevance

By developing advanced, statistically based programs for identifying and structurally visualizing biologically relevant correlations in vast amounts of sequence data, this project will obtain clues to important protein properties that thus far have evaded characterization. This is relevant to human health because biomedical research breakthroughs require formulation of plausible hypotheses that, when tested, are likely to be supported experimentally. Hence, this project will aid research efforts in understanding the molecular basis of human disease, in drug design and in personalized medicine.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM125878-04
Application #: 10093067
Study Section: Macromolecular Structure and Function D Study Section (MSFD)
Program Officer: Lyster, Peter

Project Start: 2018-02-01
Project End: 2022-01-31
Budget Start: 2021-02-01
Budget End: 2022-01-31
Support Year: 4
Fiscal Year: 2021
Total Cost
Indirect Cost

Institution

Name: University of Maryland Baltimore
Department: Biochemistry
Type: Schools of Medicine
DUNS #: 188435911

City: Baltimore
State: MD
Country: United States
Zip Code: 21201

Related projects


NIH 2021 R01 GM	Advanced correlation analyses to infer sequence and structural determinants of protein function Neuwald, Andrew F. / University of Maryland Baltimore
NIH 2020 R01 GM	Advanced correlation analyses to infer sequence and structural determinants of protein function Neuwald, Andrew F. / University of Maryland Baltimore
NIH 2019 R01 GM	Advanced correlation analyses to infer sequence and structural determinants of protein function Neuwald, Andrew F. / University of Maryland Baltimore
NIH 2019 R01 GM	Advanced correlation analyses to infer sequence and structural determinants of protein function Neuwald, Andrew F. / University of Maryland Baltimore
NIH 2018 R01 GM	Advanced correlation analyses to infer sequence and structural determinants of protein function Neuwald, Andrew F. / University of Maryland Baltimore

Publications

Neuwald, Andrew F; Aravind, L; Altschul, Stephen F (2018) Inferring joint sequence-structural determinants of protein functional specificity. eLife 7:
