The research involves the design and analysis of a framework to compute the spatial arrangements, also known as conformations, in which a protein chain of amino acids is biologically active (that is, in its native state). This is an important step towards understanding protein function. While proteins are central to many biochemical processes, little is known about the millions of protein sequences obtained from organismal genomes.
Intellectual Merit: The intellectual merit of this work lies in the development of a novel computational framework that combines probabilistic exploration with the theory of statistical mechanics to efficiently enhance the sampling of the conformational space near the native state. Low-dimensional projections guide the exploration towards low-energy and geometrically-diverse conformations. Additional intellectual merit lies in the incorporation of knowledge and observations emerging from biophysical theory and experiment, such as the use of coarse graining, relation between energy barrier height and temperature, and hierarchical organization of tertiary structure. Algorithmic components of the framework will be systematically evaluated for efficiency, accuracy, and how they enhance the sampling of the conformational space near the native state.
Broader Impact: The broader impact of this research will be the creation of a filter that efficiently computes diverse coarse-grained conformations relevant to the protein native state, which can then be further refined through detailed biophysical studies. The work lies at the interface between computer science and protein biophysics and can benefit both communities. On the computational side, the work will lead to new algorithms for modeling articulated chains characterized by continuous high-dimensional search spaces and complex energy surfaces. On the biophysical side, the framework will elucidate which aspects of our understanding of proteins allow efficient and accurate modeling. The work will impact both undergraduate and graduate students. New courses are proposed by the investigator as part of efforts to introduce computational biology into the computer science curriculum at George Mason University. The work will be employed as a pedagogic device in courses and educational outreach venues to spark and sustain interest in computer science, with a particular focus on women and minorities.
Protein molecules are central to our biology. They achieve their biological function by adopting specific three-dimensional shapes or structures with which they are then able to interact with other molecules in the cell. A central goal of genomic sequencing efforts has been to decode the sequences of proteins operating in a species' cells. Unfortunately, the ability to understand the biological functions of novel proteins is tightly linked to our ability to resolve the three-dimensional, biologically-active structures adopted by the decoded sequences of these proteins. Therefore, extracting structural information about a protein from its amino-acid sequence is a central problem in molecular biology. The problem is known as de novo protein structure prediction, where de novo indicates that the prediction has to be performed from first principles, in the absence of other structures of proteins of similar sequence to the target sequence. Computational approaches can complement wet-laboratory efforts in this endeavor, promising both speed and accuracy. This project has advanced algorithmic research to address this problem. Its main contribution has been in designing more powerful algorithms capable of efficiently navigating the vast space of alternative spatial arrangements of a protein's chain of amino acids in search of those arrangements representing the biologically-active structure. The space of such arrangements, also known as conformations, is vast and high-dimensional, as a protein chain is highly deformable and composed of hundreds of amino acids or more. The algorithmic framework put forth by this project essentially conducts a guided navigation of the conformational space, exploiting the knowledge that conformations adopted to perform a biological function position atoms so as to maximize favorable interactions and minimize unfavorable ones. Summing these interactions in an energy function allows associating an energy score with a conformation.
Thus, de novo structure prediction can be viewed as an optimization problem. Given that the space of conformations is vast and high-dimensional, and the underlying energy surface associated with this space is rich in local minima, baseline optimization algorithms typically underperform. Most suffer from premature convergence to local minima that do not represent the biologically-active structure. The algorithmic framework that has resulted from this project is able to avoid premature convergence. The framework navigates the conformational space by growing a tree in it. Vertices of the tree are conformations, and the tree grows in iterations. At every iteration, a decision is made on which vertex to expand so as to grow an additional branch terminating with a new vertex added to the tree. The selection mechanism is key to guiding the exploration towards both low-energy and geometrically-diverse conformations, thus reconciling these two conflicting optimization objectives. Discretization layers over low-dimensional projections of the conformational space and energy surface are employed to achieve these objectives. Branches in the tree are short Metropolis Monte Carlo (MMC) runs. This project has made general contributions and advanced algorithmic research in stochastic optimization for vast, high-dimensional, and non-linear search spaces. It has proposed that discretization layers can be employed to facilitate gathering statistics over the identified conflicting optimization objectives, which can then be used to formulate probability distribution functions that guide the directions of growth for the search. The concepts put forth in this project are of special relevance to sub-areas of computer science pursuing stochastic optimization, such as evolutionary computation and robotics-inspired motion planning. The latter has served as an inspiration for the tree-based search framework in this project, building on analogies between proteins and articulated mechanisms.
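The tree-based search described above can be illustrated with a minimal Python sketch. It is not the project's implementation, and every specific in it is an assumption made for illustration: the toy energy function over a vector of angles, the two-dimensional projection, the bin sizes ENERGY_BIN and CELL, the weighting formulas, and the names energy, project, mmc_branch, and tree_search. It only shows the general pattern: a discretization layer over energy levels and projection cells supplies the statistics used to select a vertex, and each branch is a short MMC run.

```python
import math
import random
from collections import defaultdict

ENERGY_BIN = 1.0   # assumed width of an energy level
CELL = 0.2         # assumed width of a projection cell


def energy(conf):
    # Hypothetical rugged toy energy over angles (not a protein energy function).
    return sum(1.0 - math.cos(a) + 0.5 * math.sin(3.0 * a) ** 2 for a in conf)


def project(conf):
    # Hypothetical 2-D projection: mean cosine and mean sine of the angles.
    n = len(conf)
    return (sum(math.cos(a) for a in conf) / n,
            sum(math.sin(a) for a in conf) / n)


def mmc_branch(conf, steps, temperature, rng):
    """Short Metropolis Monte Carlo run; returns the final state and its energy."""
    current, e_cur = list(conf), energy(conf)
    for _ in range(steps):
        trial = list(current)
        trial[rng.randrange(len(trial))] += rng.gauss(0.0, 0.3)
        e_new = energy(trial)
        # Metropolis criterion: accept downhill moves, sometimes accept uphill.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / temperature):
            current, e_cur = trial, e_new
    return current, e_cur


def tree_search(n_angles=8, iterations=300, seed=0):
    rng = random.Random(seed)
    root = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    vertices = [(root, energy(root), None)]  # (conformation, energy, parent index)
    # Discretization layer: energy level -> projection cell -> vertex indices.
    layers = defaultdict(lambda: defaultdict(list))

    def register(idx):
        conf, e, _ = vertices[idx]
        x, y = project(conf)
        cell = (math.floor(x / CELL), math.floor(y / CELL))
        layers[int(e / ENERGY_BIN)][cell].append(idx)

    register(0)
    for _ in range(iterations):
        # Bias 1: favor low energy levels (weights are an assumed choice).
        levels = list(layers)
        level = rng.choices(levels, weights=[1.0 / (1.0 + l) for l in levels])[0]
        # Bias 2: favor sparsely populated cells to keep geometric diversity.
        cells = list(layers[level])
        cell = rng.choices(cells, weights=[1.0 / len(layers[level][c]) for c in cells])[0]
        parent = rng.choice(layers[level][cell])
        # Expand the selected vertex with a short MMC branch.
        child, e_child = mmc_branch(vertices[parent][0], 20, 1.0, rng)
        vertices.append((child, e_child, parent))
        register(len(vertices) - 1)
    return vertices
```

Calling tree_search() returns the list of tree vertices, from which the lowest-energy conformation can be read with min(vertices, key=lambda v: v[1]). Making the level weights flatter or steeper is one way to experiment with weaker or stronger energy bias in this toy setting.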
In protein modeling, this project has demonstrated that the framework is capable of dealing with inaccurate protein energy functions. Specifically, it has demonstrated that a weak rather than a strong energy bias delays convergence. Various geometric projection coordinates have been compared for their ability to summarize conformations and project the conformational space. Specifically, contact-based coordinates have been shown to be more effective than shape-based ones. A popular technique in protein structure modeling, molecular fragment replacement, has been integrated into the branch computations that grow the tree. Prior to this project, the multistart MMC framework was considered the standard in conformational exploration for de novo structure prediction. This project has demonstrated that other frameworks that build on MMC but pursue richer search structures can afford higher exploration capability. This project has also laid out additional important directions that broaden the search structure from tree-based to graph-based, and it has shown promise in additional applications that compute the structures mediating the transition of a protein between two functionally-relevant states, for those proteins with a rich functional repertoire in the cell.
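To make the distinction between contact-based and shape-based projection coordinates concrete, the following Python sketch computes one example of each from coarse-grained coordinates (one point per amino acid). The report does not name the specific coordinates that were compared, so the two shown here, a residue contact count and the radius of gyration, are assumptions chosen as common examples, as are the function names, the 8.0 distance cutoff, and the minimum sequence separation of 3.

```python
import math


def contact_count(coords, cutoff=8.0, min_separation=3):
    """Contact-based coordinate (assumed example): number of residue pairs
    closer than `cutoff`, skipping pairs that are near in sequence."""
    count = 0
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                count += 1
    return count


def radius_of_gyration(coords):
    """Shape-based coordinate (assumed example): root-mean-square distance
    of the points from their centroid."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)
```

Either function maps a conformation to a single number, so a pair of such coordinates can define the cells of a low-dimensional projection grid.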