The value of a newly sequenced genome depends directly on our ability to assign functions to the genes within it. This is especially true in an era in which pathogen genomes may be fully sequenced shortly after a pathogen is isolated, and in which sequencing capacity has already outstripped the ability of experimentalists to examine a large number of genes within each individual organism in the lab. Unfortunately, the ability to assign gene functions has not kept pace with sequencing output, and a significant fraction of the genes in many new genomes remain unassigned. Our group and others have also identified a related problem: previously characterized enzyme activities that have no assigned sequence data. In fact, over a third of activities with assigned E.C. numbers have neither gene nor protein sequence information associated with them. These "orphan enzyme activities" represent both a significant problem and an opportunity in biomedical research. Most notably, as long as these activities remain "orphan" and devoid of sequence, they will never be predicted as functions for any genes in newly sequenced genomes. We hypothesize that there is likely to be significant overlap between these orphan activities and many of the genes that currently have no assigned function. As a consequence, it is critically important for modern, genome-driven biology to find sequences for orphan activities. We propose to develop a systematic approach for resolving the problem of orphan activities by identifying a gene sequence associated with each such activity. We will carry out an initial literature review stage that will confirm the orphan status of each activity, a phase that we expect will yield 200-300 artifactual orphans, immediately adding a large body of activities associated with sequence to public databases. This will be followed by laboratory work that will identify sequences for 21 major orphan activities and help lay the groundwork for future large- and small-scale orphan identifications, with the eventual goal of enabling the identification of at least one gene for each activity.

Public Health Relevance

This project will generate a list of orphan activities with a demonstrated lack of sequence, capture literature and other key data related to those orphans, resolve hundreds of artifactual orphan activities, identify sequences for 21 major orphan activities, and provide guidelines for other investigators to identify additional orphans. The resolution of 200-300 artifactual orphan activities and 21 genuine orphan activities will help reduce wasteful duplication of effort in enzymology and will enhance the quality of all future genome annotations.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM086755-02
Application #: 7808849
Study Section: Macromolecular Structure and Function D Study Section (MSFD)
Program Officer: Jones, Warren
Project Start: 2009-05-01
Project End: 2012-10-31
Budget Start: 2010-05-01
Budget End: 2012-10-31
Support Year: 2
Fiscal Year: 2010
Total Cost: $325,924
Indirect Cost:

Name: SRI International
Department:
Type:
DUNS #: 009232752
City: Menlo Park
State: CA
Country: United States
Zip Code: 94025

Shearer, Alexander G; Altman, Tomer; Rhee, Christine D (2014) Finding sequences for over 270 orphan enzymes. PLoS One 9:e97250
Ramkissoon, Kevin R; Miller, Jennifer K; Ojha, Sunil et al. (2013) Rapid identification of sequences for orphan enzymes to power accurate protein annotation. PLoS One 8:e84508