The Carnegie Institution of Washington is awarded a grant to create a network of plant metabolism databases. At the center of the network will be PlantCyc, which will contain pathways from many plants, supported by experimental evidence for pathways, reactions or enzymes. PlantCyc will be initialized from currently available plant metabolism databases such as AraCyc, TomatoCyc, RiceCyc, MedicagoCyc and SoyBase. It will be used as a reference database (in conjunction with MetaCyc) to create multiple plant pathway genome databases (PGDBs) for species with substantial sequence data. To build a PGDB, putative enzyme sequences will be identified for each organism using several sequence analysis methods, and Pathway Tools software will be used to generate the initial PGDBs from the annotated sequences. As each PGDB is built, all of the pathways and enzymes in the new PGDB will be validated, added to PlantCyc, and subsequently curated. Therefore, with each round of PGDB prediction, the quantity and quality of the data in PlantCyc will increase. The project will leverage the curation teams at other databases interested in different species, as well as biochemistry experts interested in specific domains of metabolism, to improve the content of PlantCyc and the PGDBs. All of the data will be made freely available and updates will be released on a regular basis.

As the worldwide demand for production of biofuels, food, animal feed and new medicines continues to grow, there is an increasingly urgent need to develop new technologies using plants. The long-term goal of developing these technologies has prompted the sequencing of plant genomes and gene complements. There is a growing need to place the sequenced and annotated genomes in a biochemical context in order to facilitate the discovery of enzymes and the engineering of metabolism. This project will generate an infrastructure for comprehensive plant metabolism information that addresses the need to store, analyze and display the growing body of data emerging from both conventional biochemistry and high-throughput, large-scale experiments. The proposed network of databases will facilitate the discovery of new enzymes and pathways, the engineering of metabolic pathways, and the curation of new findings in the context of the overall metabolic scheme of an organism.

Project Report

Plant metabolism produces the oxygen we need to breathe and transforms simple compounds like carbon dioxide in the air and nitrate and water in the soil into an amazing array of chemicals that impact our lives in many ways. The vitamin A in papayas, the resveratrol in wine grapes, the caffeine in coffee, the corn cell walls that we turn into ethanol, the fragrances floating from roses, the beautiful fall color of poplar leaves, and the starch of cassava that provides food security to millions of people are all synthesized by plant enzymes. At the Plant Metabolic Network (PMN), with collaborating institutions, we bring together information about compounds, enzymes, reactions, and biochemical pathways discovered over decades by researchers around the globe. We provide this information freely to researchers, educators, students, and the general public at our website (www.plantcyc.org). There visitors can access the PlantCyc database, which houses over 900 biochemical pathways based on data for over 400 plant species. We also provide species-specific databases that focus on the complete set of metabolic pathways that enable plants like corn, soybeans, poplar trees, and wine grapes to survive and thrive. CornCyc, PoplarCyc, GrapeCyc and our other resources have been built using a bioinformatics pipeline that starts with the proteins identified in a sequenced plant genome and predicts which biochemical reactions they may catalyze. We created the enzyme prediction pipeline using a combination of available algorithms together with consistent and robust training and performance assessment routines. The pipeline predicts the set of enzymes and metabolic pathways present in any species in a consistent manner, allowing easy cross-species comparisons. The predicted enzymes were then used to identify reactions and pathways using Pathway Tools software developed by SRI International.
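The flow described above (proteins, then predicted enzymes, then reactions and pathways) can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the predictors, protein and reaction identifiers, the majority-vote rule, and the coverage threshold are all hypothetical, and the sketch does not reproduce the PMN pipeline or the pathway inference performed by Pathway Tools.

```python
from collections import Counter

# Hypothetical toy data: each dict stands for one predictor's output,
# mapping a protein ID to the reaction it is predicted to catalyze.
PREDICTIONS = [
    {"p1": "rxnA", "p2": "rxnB", "p3": "rxnC"},
    {"p1": "rxnA", "p2": "rxnB"},
    {"p1": "rxnA", "p2": "rxnD", "p3": "rxnC"},
]

# Hypothetical reference pathways: pathway name -> set of reaction IDs.
REFERENCE_PATHWAYS = {
    "pathway-1": {"rxnA", "rxnB"},
    "pathway-2": {"rxnC", "rxnE", "rxnF"},
}


def consensus_enzymes(predictions, min_votes=2):
    """Keep a protein -> reaction call only if at least min_votes predictors agree.

    Majority voting is an assumed combination rule, chosen for illustration.
    """
    calls = {}
    proteins = set().union(*predictions)
    for protein in sorted(proteins):
        votes = Counter(p[protein] for p in predictions if protein in p)
        reaction, count = votes.most_common(1)[0]
        if count >= min_votes:
            calls[protein] = reaction
    return calls


def infer_pathways(enzyme_calls, reference, min_coverage=0.5):
    """Report reference pathways whose reactions are sufficiently covered.

    The coverage threshold is an assumption, not the Pathway Tools criterion.
    """
    covered = set(enzyme_calls.values())
    return sorted(
        name
        for name, reactions in reference.items()
        if len(reactions & covered) / len(reactions) >= min_coverage
    )


if __name__ == "__main__":
    enzymes = consensus_enzymes(PREDICTIONS)
    print(enzymes)
    print(infer_pathways(enzymes, REFERENCE_PATHWAYS))
```

Applying the same rules to every species is what makes the resulting databases comparable across species, which is the property the report emphasizes.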
To date, we have ten single-species databases for Arabidopsis thaliana (a model plant related to broccoli), Zea mays (corn), Glycine max (soybean), Vitis vinifera (wine grape), Carica papaya (papaya), Populus trichocarpa (poplar), Manihot esculenta (cassava), Physcomitrella patens (a moss), Selaginella moellendorffii (a seedless vascular plant), and Chlamydomonas reinhardtii (a unicellular green alga). In our most recent release, AraCyc (version 10.0) has the largest number of pathways (540), reactions (3,418), and compounds (3,323), whereas SoyCyc (version 3.0) contains the highest number of enzymes (13,055). The vast majority of the enzymes across all ten single-species databases (65,055 out of 67,596, or 96%) have been predicted computationally using the PMN pipeline because few enzymes have been experimentally characterized in these species. Meanwhile, the PlantCyc database (version 7.0) contains close to 1,000 pathways and 2,619 enzymes that have been experimentally verified in a diverse range of species (343), including Zingiber officinale (ginger), Triticum aestivum (wheat), Solanum tuberosum (potato), Hevea brasiliensis (rubber tree), Gossypium hirsutum (cotton), Pinus sylvestris (Scotch pine), and Eschscholzia californica (California poppy). All of the PMN databases can contribute to research on metabolic engineering, evolution and biodiversity, agriculture, pharmaceutical development, and more. Usage of our data has increased over the years, with an average of 6,711 unique visitors accessing the PMN every month since our latest release at the end of August 2012. During this period, visitors from 144 countries have generated an average of 32,916 page views per month. In addition, since the PMN site launched in June 2008, over 327 people have downloaded the complete database files to further analyze the data on their own computers.
We have provided training for these users through online tutorials, presentations, and one-on-one help sessions at 19 conferences and 8 universities. We have also published 8 articles in scientific journals, which have been cited 476 times according to the ISI Web of Science database. To improve the quality and increase the content of our databases over time, we have worked collaboratively with several organizations, including SRI International, MaizeGDB, the Sol Genomics Network, TAIR, SoyBase, the Noble Foundation, SABIO-RK, Gramene, an Arabidopsis Metabolic Reconstruction working group, and the Plant Metabolomics consortium, as well as with over 84 individual scientific contributors and 28 editorial board members. Meanwhile, at the Carnegie Institution, 3 post-docs, 3 biocurators, 5 undergraduate interns, one programmer, and one Master's intern have contributed to and learned from this project.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 0640769
Program Officer: Peter H. McCartney
Project Start:
Project End:
Budget Start: 2008-03-15
Budget End: 2013-02-28
Support Year:
Fiscal Year: 2006
Total Cost: $1,477,870
Indirect Cost:
Name: Carnegie Institution of Washington
Department:
Type:
DUNS #:
City: Washington
State: DC
Country: United States
Zip Code: 20005