The 'twilight zone' of protein sequence comparison is the region in which sequence similarity alone does not suffice to infer, for example, structural similarity. The vast majority of protein pairs with similar structures populate a 'midnight zone', i.e. their sequences differ too much to be detected by sequence-based comparison. Here, we propose to refine, extend, and specialise methods that combine sequence alignment, structure prediction, and functional information. Our goal is to uncover hidden similarities in entirely sequenced organisms with a reliable, automatic tool. Towards the end of our project, sequences for most protein families realised by life will presumably be available. We hope that our system will correctly detect relations for most of these.

(1) Prediction-based threading combines sequence alignments with predictions of secondary structure and accessibility to find remote similarities. We hope to considerably improve detection and alignment accuracy by comparing families with families rather than single proteins with single proteins.

(2) About one third of all proteins in worm and fly appear to have long regions lacking regular secondary structure. We hope to develop a method tailored to reliably detect and compare such regions.

(3) No current method finds similarities between extremely diverged membrane proteins. We propose to develop such a method by combining 'membrane threading' with classifications of membrane proteins.

(4) Since sequence comparison in the twilight zone and below is an extremely demanding task, most existing methods have very low accuracy. In practice, experts compare aspects of function between the two proteins under investigation. We want to develop an automatic method that evaluates functional aspects. In particular, we intend to start with DNA-binding proteins. The tasks will be to (i) predict DNA-binding sites in proteins, and (ii) restrict the threading to the subset of proteins for which binding regions were found. In a subsequent step, we hope to use general sequence motifs for the automatic comparison.

(5) Threading entire genomes: the first task will be to find all proteins in an entire organism for which we know the structure. However, the particular edge of our method will be its ability to find remote similarities even in the absence of experimental information about structure.
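Aim (1) describes prediction-based threading as combining sequence information with predicted secondary structure and accessibility. The sketch below illustrates that general idea only; it is not the scoring scheme of the proposed method. It assumes each residue is annotated with a predicted secondary structure state (H, E, or L) and a two-state accessibility label (e for exposed, b for buried), and it aligns two annotated proteins with a standard global dynamic-programming alignment. The function names, the weights, and the linear gap penalty are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# One annotated residue: (amino acid, predicted secondary structure, predicted accessibility).
Residue = Tuple[str, str, str]


def annotate(seq: str, sec_struct: str, access: str) -> List[Residue]:
    """Zip three equal-length per-residue strings into a list of annotated residues."""
    if not (len(seq) == len(sec_struct) == len(access)):
        raise ValueError("sequence and annotation strings must have equal length")
    return list(zip(seq, sec_struct, access))


def position_score(a: Residue, b: Residue,
                   w_seq: float, w_ss: float, w_acc: float) -> float:
    """Weighted sum of match (+1) / mismatch (-1) terms for the three features.

    Hypothetical scoring: a real method would use substitution matrices or
    profile-profile scores rather than simple identity terms.
    """
    weights = (w_seq, w_ss, w_acc)
    return sum(w * (1.0 if x == y else -1.0) for w, x, y in zip(weights, a, b))


def thread_score(query: List[Residue], template: List[Residue],
                 w_seq: float = 1.0, w_ss: float = 1.0, w_acc: float = 0.5,
                 gap: float = -2.0) -> float:
    """Global alignment score (Needleman-Wunsch recursion, linear gap penalty)."""
    n, m = len(query), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + position_score(
                query[i - 1], template[j - 1], w_seq, w_ss, w_acc)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# Toy example: no identical residues, but identical predicted structure.
query = annotate("AKLV", "HHEE", "ebbe")
template = annotate("GRIT", "HHEE", "ebbe")

print(thread_score(query, template))                        # 2.0
print(thread_score(query, template, w_ss=0.0, w_acc=0.0))   # -4.0
```

In this toy case the two proteins share no residues, so a sequence-only score is negative, while the agreement of the predicted structural states gives a positive combined score. Comparing families with families, as proposed above, would replace the single-residue identity term with a comparison of alignment profiles.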

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-BBCB (01))
Program Officer: Edmonds, Charles G
Columbia University (N.Y.)
Schools of Medicine
New York
United States
Ofran, Yanay; Rost, Burkhard (2007) ISIS: interaction sites identified from sequence. Bioinformatics 23:e13-6
Mika, Sven; Rost, Burkhard (2006) Protein-protein interactions more conserved within species than across species. PLoS Comput Biol 2:e79
Mika, Sven; Rost, Burkhard (2005) NMPdb: Database of Nuclear Matrix Proteins. Nucleic Acids Res 33:D160-3
Schlessinger, Avner; Rost, Burkhard (2005) Protein flexibility and rigidity predicted from sequence. Proteins 61:115-26
Mika, Sven; Rost, Burkhard (2004) Protein names precisely peeled off free text. Bioinformatics 20 Suppl 1:i241-7
Liu, Jinfeng; Rost, Burkhard (2004) Sequence-based prediction of protein domains. Nucleic Acids Res 32:3522-30
Bigelow, Henry R; Petrey, Donald S; Liu, Jinfeng et al. (2004) Predicting transmembrane beta-barrels in proteomes. Nucleic Acids Res 32:2566-77
Wrzeszczynski, Kazimierz O; Rost, Burkhard (2004) Cataloging proteins in cell cycle control. Methods Mol Biol 241:219-33
Przybylski, Dariusz; Rost, Burkhard (2004) Improving fold recognition without folds. J Mol Biol 341:255-69
Liu, Jinfeng; Rost, Burkhard (2003) Domains, motifs and clusters in the protein universe. Curr Opin Chem Biol 7:5-11
