The physical structure that a protein molecule folds up into in the cell is critical to understanding the function that the protein performs in the body. Predicting this folded structure from the basic sequence of amino acids comprising the protein molecule is a key computational task in molecular biology and bioinformatics, and a subject of intense research into computer methods, due to its broad impact in the life sciences and ultimately human health. This project capitalizes on a recent breakthrough by the investigator in both the computational speed and accuracy of algorithms for predicting a discrete form of folded structure known as protein secondary structure. The project aims to further break the speed barrier for protein secondary structure prediction through faster methods for an essential algorithmic step called nearest neighbor search over a database, and to further break the accuracy barrier through more accurate methods for automatically learning the proximity measure used in nearest neighbor search. The project will impact national infrastructure through the release of open-source software tools implementing these algorithms, the training of doctoral students in research, and integrating this research into the teaching of university-level undergraduate and graduate bioinformatics courses.
To achieve these goals for faster and more accurate protein secondary structure prediction and related protein property prediction tasks, the project builds on a radically-different computational approach that forgoes the costly sequence database homology searches employed by all current state-of-the-art methods, and instead leverages nearest neighbor search on fixed-length strings under a distance metric to estimate residue structure probabilities, followed by dynamic programming to compute a globally-optimal, maximum-likelihood, physically-valid, secondary structure prediction. To further break the speed and accuracy barrier, the project will develop new faster data structures for the core problem of nearest neighbor search on strings under a distance metric, and new more accurate formulations of distance metric learning for nearest-neighbor-like classification. The techniques for nearest neighbor search and distance metric learning are general, which would yield advances in these fundamental computational problems beyond the motivating bioinformatics applications of protein property prediction.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.