Among the most compelling windfalls provided by the genome sequencing projects is the opportunity to begin to explore protein function space using systems approaches. Proteins play a critical role in biology and medicine; they are the targets of virtually all drugs. The study of protein function begins by expressing proteins from cloned copies of the corresponding cDNAs. Functional proteomics exploits new high-throughput technologies to study many proteins simultaneously using large collections of protein expression cDNA clones. However, the ability to interpret and rely upon the resulting data requires confidence that the clones accurately reflect the natural cDNA sequences. The successful automated production of these clone collections has exerted significant pressure to develop rapid and accurate methods for quality control. There is currently no software available that automates the sequence validation and evaluation of such clones, so this work must be done by hand, a tedious and error-prone process. The purpose of this proposal is to develop, maintain and openly distribute software that will automate and facilitate the process of biologically evaluating protein expression cDNA clones. This modular software will compare the assembled sequence of a cDNA clone to a user-specified reference sequence to identify and categorize all discrepancies. Most notably, the software will analyze: 1) the polypeptide effects of discrepancies between the clone and reference sequences, 2) whether the discrepancies are due to natural polymorphisms, and 3) whether these differences render the clone unacceptable for a given experiment. Based upon user-defined penalties for each discrepancy type (truncation, conservative amino acid substitution, etc.), it will then recommend whether a clone should pass, fail or receive further review. By resetting the penalty scheme to meet different experimental requirements, users can re-evaluate the same clone set or any other sequenced clone set.
This software will benefit molecular biologists who must validate the sequences of cDNA clones or clone collections, whether produced in their own labs or obtained from other sources. It may also prove useful in identifying the natural genetic diversity of expressed mRNA. In this revised submission, we address the two major concerns expressed in the summary statement. First, we include letters of support from more than 20 labs that are currently building large clone collections and have expressed the need for this software. Second, we affirm that this software is exportable and not dependent on other software in our lab. Finally, we note that an alpha version of the software has now been implemented and tested on 6000 yeast clones. For a sample of 192 clones that were also evaluated by hand, the software's analysis was correct for all 192.
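The penalty-based evaluation described in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the amino acid groupings used to call a substitution "conservative", the penalty values and the two thresholds are all hypothetical. The sketch also assumes the clone and reference are in-frame coding sequences of equal length (substitutions only, no insertions or deletions), and it does not attempt the polymorphism check.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical grouping of similar residues; a real tool might use a
# substitution matrix instead.
CONSERVATIVE_GROUPS = ["AVLIM", "FYW", "ST", "NQ", "DE", "KRH"]


def is_conservative(ref_aa, clone_aa):
    """True if both residues fall in the same (assumed) similarity group."""
    return any(ref_aa in g and clone_aa in g for g in CONSERVATIVE_GROUPS)


def classify_discrepancies(reference, clone):
    """Compare codon by codon; return (codon_number, discrepancy_type) pairs."""
    if len(reference) != len(clone) or len(reference) % 3:
        raise ValueError("sketch assumes equal-length, in-frame sequences")
    found = []
    for i in range(0, len(reference), 3):
        ref_codon, clone_codon = reference[i:i + 3], clone[i:i + 3]
        if ref_codon == clone_codon:
            continue
        ref_aa, clone_aa = CODON_TABLE[ref_codon], CODON_TABLE[clone_codon]
        if ref_aa == clone_aa:
            kind = "silent"
        elif clone_aa == "*":
            kind = "truncation"
        elif is_conservative(ref_aa, clone_aa):
            kind = "conservative"
        else:
            kind = "nonconservative"
        found.append((i // 3 + 1, kind))
        if kind == "truncation":
            break  # codons after a premature stop are not translated
    return found


def recommend(discrepancies, penalties, pass_max, fail_min):
    """Sum user-defined penalties and map the total to pass/review/fail."""
    score = sum(penalties[kind] for _, kind in discrepancies)
    if score >= fail_min:
        return "fail"
    if score > pass_max:
        return "review"
    return "pass"


# Illustrative penalty scheme (values are assumptions, not from the source).
PENALTIES = {"silent": 0, "conservative": 2, "nonconservative": 5, "truncation": 10}

reference = "ATGGCTGAAAAATAA"       # M A E K stop
truncated = "ATGGCTGAATAATAA"       # codon 4 changed from AAA (K) to TAA (stop)
calls = classify_discrepancies(reference, truncated)
print(calls, recommend(calls, PENALTIES, pass_max=1, fail_min=10))
```

Because the recommendation depends only on the penalty table and thresholds, the same list of discrepancies can be re-scored under a different scheme without repeating the sequence comparison, which mirrors the re-evaluation use case the abstract describes.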

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG003041-02
Application #
6929793
Study Section
Special Emphasis Panel (ZRG1-BST-D (51))
Program Officer
Bonazzi, Vivien
Project Start
2004-08-01
Project End
2007-07-31
Budget Start
2005-08-01
Budget End
2006-07-31
Support Year
2
Fiscal Year
2005
Total Cost
$350,000
Indirect Cost
Name
Harvard University
Department
Biochemistry
Type
Schools of Medicine
DUNS #
047006379
City
Boston
State
MA
Country
United States
Zip Code
02115
Taycher, Elena; Rolfs, Andreas; Hu, Yanhui et al. (2007) A novel approach to sequence validating protein expression clones with automated decision making. BMC Bioinformatics 8:198