Determining protein structure and function from genomic sequences, and classifying proteins, remain among the most significant challenges in modern computational biology. A significant enhancement of the capacity of algorithms to predict protein shapes from sequences is proposed, focusing on major bottlenecks such as the folding energy and the ability to detect approximate matches. Algorithms that determine protein shapes from sequences have two major components. The first component (sampling) generates a set of plausible protein shapes; at least one of the sampled shapes is expected to be similar to the correct fold. The second component (scoring) evaluates the candidate structures and selects the best model. The radius of convergence of the energy function must be large enough that approximate matches are detected as well (in threading, approximate matches may include deletions and insertions). Poor scoring functions (or energies) that cannot identify the correct fold therefore limit the capacity of the folding algorithm. At present, it is easy to generate a set of decoy (wrong) structures that confuse existing energy functions. Mathematical programming and machine learning techniques (Support Vector Machines) will be used to design enhanced folding and threading potentials. Training by these methods is automated and will lead to monotonic improvement in recognition as a function of the data size. To cover protein space more effectively, the goal is to fit 100 million data points with a single consistent potential that has at most 10,000 parameters. Automated large-scale learning is crucial at a time when the information on sequences and structures is growing rapidly. A threading prediction server, based on the old and the new potentials, is and will remain available to the community at http://ser-loopp.tc.cornell.edu/Ioopp.html
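The scoring requirement described above, that the native structure should receive a lower energy than every decoy, can be illustrated with a small sketch. It assumes an energy that is linear in its parameters, E = w · f, where f is a feature vector such as counts of contact types, and it uses a simple margin-perceptron update as a stand-in for the mathematical programming and Support Vector Machine training named in the abstract. The function names, the two hypothetical contact-type features, and the toy data are illustrative assumptions and are not taken from the project.

```python
# Toy sketch: learn weights w of a linear energy E = w . f so that each
# native structure scores lower than its decoy by at least a margin.
# ASSUMPTION: margin-perceptron updates stand in for the LP/SVM training
# described in the abstract; features and data are hypothetical.

def energy(weights, features):
    """Linear energy: dot product of parameters and structure features."""
    return sum(w * f for w, f in zip(weights, features))

def train_potential(pairs, n_params, margin=1.0, lr=0.5, epochs=200):
    """pairs is a list of (native_features, decoy_features) tuples."""
    weights = [0.0] * n_params
    for _ in range(epochs):
        violations = 0
        for native, decoy in pairs:
            gap = energy(weights, decoy) - energy(weights, native)
            if gap < margin:
                # Constraint E(decoy) - E(native) >= margin is violated:
                # move the weights along the feature difference.
                violations += 1
                for k in range(n_params):
                    weights[k] += lr * (decoy[k] - native[k])
        if violations == 0:
            break  # every native is separated from its decoy
    return weights

def count_violations(weights, pairs):
    """Number of pairs where the decoy scores as well as or better than the native."""
    return sum(
        1 for native, decoy in pairs
        if energy(weights, decoy) <= energy(weights, native)
    )

# Hypothetical feature vectors: counts of two contact types per structure.
TOY_PAIRS = [
    ([3, 1], [1, 3]),
    ([4, 0], [2, 1]),
    ([2, 2], [2, 4]),
]

learned = train_potential(TOY_PAIRS, n_params=2)
print("learned weights:", learned)
print("misranked pairs:", count_violations(learned, TOY_PAIRS))
```

Each (native, decoy) pair contributes one linear inequality in the parameters, so adding more pairs only adds constraints. A linear program or SVM solver would handle the same inequalities at the much larger scale the abstract targets.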