The value of a newly sequenced genome depends directly on our ability to assign function to the genes within that genome. This is especially true in an era in which pathogen genomes may be fully sequenced shortly after a pathogen is isolated, and in which sequencing capacity has already outstripped the ability of experimentalists to examine a large number of genes within each individual organism in the lab. Unfortunately, the ability to assign gene functions has not kept pace with sequencing output, and a significant fraction of the genes in many new genomes remains unassigned. Our group and others have also identified a related problem: previously characterized enzyme activities that have no assigned sequence data. In fact, over a third of activities with assigned E.C. numbers have neither gene nor protein sequence information associated with them. These "orphan enzyme activities" represent both a significant problem and an opportunity in biomedical research. Most notably, as long as these activities remain "orphan" and devoid of sequence, they will never be predicted as functions for any genes in newly sequenced genomes. It is our hypothesis that there is likely to be significant overlap between these orphan activities and many of the genes that currently have no assigned function. As a consequence, it is critically important for modern, genome-driven biology to find sequences for orphan activities. We propose to develop a systematic approach for resolving the problem of orphan activities by identifying a gene sequence associated with each such activity. We will carry out an initial literature review stage that will confirm the orphan status of each activity, a phase that we expect will yield 200-300 artifactual orphans, immediately adding a large body of activities associated with sequence to public databases.
This will be followed by laboratory work that will identify sequences for 21 major orphan activities and help lay the groundwork for future large- and small-scale orphan identifications, with the eventual goal of enabling the identification of at least one gene for each activity.

Public Health Relevance

This project will generate a list of orphan activities with a demonstrated lack of sequence, capture literature and other key data related to those orphans, resolve hundreds of artifactual orphan activities, identify sequences for 21 major orphan activities, and provide guidelines for other investigators to identify additional orphans. The resolution of 200-300 artifactual orphan activities and 21 genuine orphan activities will help reduce wasteful duplicated efforts in enzymology and will enhance the quality of all future genome annotations.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Macromolecular Structure and Function D Study Section (MSFD)
Program Officer
Jones, Warren
SRI International
Menlo Park
United States
Shearer, Alexander G; Altman, Tomer; Rhee, Christine D (2014) Finding sequences for over 270 orphan enzymes. PLoS One 9:e97250
Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil et al. (2013) Rapid identification of sequences for orphan enzymes to power accurate protein annotation. PLoS One 8:e84508