Determining homology by pair-wise sequence alignment is notoriously insensitive because many proteins with similar structure and function share only 8-10% sequence identity, well below the detection threshold of conventional methods. Transforming protein sequences into vectors of properties associated with sequence and structure has been shown to significantly improve the detection of these remote homologues (proteins with low sequence identity). However, existing feature-based methods are limited to classifying a protein into a family, so a protein cannot be classified unless it falls into a pre-defined family. Methods devised to overcome this limitation by assessing pair-wise similarity have relied primarily on network propagation, because pair-wise training requires an extremely large training space. For example, a small benchmark dataset of 4000 proteins equates to over 8 million pairs. Unfortunately, these network propagation methods have demonstrated only marginal improvement over the state-of-the-art PSI-BLAST method. This is largely because (1) only a small number of features are used and (2) the network propagation method relies on a similarity network derived from BLAST scores. Thus, the use of statistical discrimination methods to answer the pair-wise question has remained beyond reach in the homology detection field. This limitation is a serious technological gap for large-scale genome sequencing, since automated annotation is not possible without highly reliable homology detection.

The development of a biologically driven, integrated protein feature representation will significantly improve remote homology detection. Additionally, the use of a support vector machine (SVM), which requires only a linear computation for the classification task, will offer fast computation. These two components, faster sequence comparisons and improved sensitivity, will break a long-standing time/sensitivity paradigm in the field of remote homology detection. The proposed pair-wise SVM implementation can also be applied to other large, diverse real-world science and engineering problems characterized by classification through association.

The PI already holds a joint faculty appointment at WSU and currently serves as a committee member for two students. For the proposed work, one additional Ph.D. graduate student from the WSU computer science department will perform thesis work on components of the proposed project, giving her hands-on access to unique supercomputing facilities.
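The pair-wise framing described above can be illustrated with a minimal sketch. It is not the project's implementation: the feature vectors, the weights `w` and bias `b`, the absolute-difference pair representation, and the helper names `count_pairs`, `pair_features`, and `linear_score` are all hypothetical choices made for illustration. The sketch shows how quickly the number of pairs grows with the number of proteins, and why scoring one pair with an already-trained linear classifier costs only a single dot product.

```python
from itertools import combinations


def count_pairs(n, include_self=False):
    """Number of unordered protein pairs among n proteins."""
    return n * (n + 1) // 2 if include_self else n * (n - 1) // 2


def pair_features(u, v):
    """Hypothetical symmetric pair representation: element-wise |u - v|."""
    return [abs(a - b) for a, b in zip(u, v)]


def linear_score(w, b, x):
    """Linear decision value w.x + b; cost is linear in the feature count."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy, made-up feature vectors for three proteins (assumption, not real data).
proteins = {
    "P1": [0.9, 0.1, 0.4],
    "P2": [0.8, 0.2, 0.5],
    "P3": [0.1, 0.9, 0.0],
}

# Made-up weights standing in for a trained linear classifier (assumption).
w = [-1.0, -1.0, -1.0]
b = 0.5

# Score every unordered pair; a positive score is read as "homologous".
predicted = {}
for p, q in combinations(sorted(proteins), 2):
    x = pair_features(proteins[p], proteins[q])
    predicted[(p, q)] = linear_score(w, b, x) > 0

# Pair counts for the 4000-protein benchmark size mentioned in the abstract.
pairs_distinct = count_pairs(4000)                      # 7,998,000
pairs_with_self = count_pairs(4000, include_self=True)  # 8,002,000
```

With these toy numbers only the P1/P2 pair scores above zero. For 4000 proteins the count is roughly 8 million either way; it exceeds 8 million only if each protein is also paired with itself.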

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0742553
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2007-09-15
Budget End: 2009-08-31
Support Year:
Fiscal Year: 2007
Total Cost: $200,000
Indirect Cost:

Name: Battelle Memorial Institute
Department:
Type:
DUNS #:
City: Richland
State: WA
Country: United States
Zip Code: 99354