Genome data enable scientists to pose a host of compelling questions spanning diverse disciplines. However, relational databases are inefficient at modeling the complex relationships between genes and the proteins they encode. The PI will enable biologists to answer these questions efficiently and automatically by developing a computational infrastructure that models the inherent structure of biological data. The core of this infrastructure is a graph database of genome and proteome data for the human genome and related eukaryotic genomes, designed to model relationships (evolutionary, interaction, regulatory) that cannot be represented effectively in relational databases. Nodes will represent different biological entities (genes, proteins, species), and edges between nodes will represent the relationships between these entities. For example, edges between genes and proteins can represent "Gene G encodes protein P" or "Gene G is regulated by protein P". Edges between proteins can represent physical interaction or homology. Each node will store different types of features for its entity, and the team will use the network structure and statistical modeling methods to enable precise predictions of various aspects of "function": molecular function, metabolic pathway, biological process, cellular localization, inter-molecular interactions, protein 3D structure, etc. Functional annotation will be automated, with results produced in both machine-readable and human-readable formats. Intuitive web-based interfaces will be provided for navigation and interpretation of data by experimental biologists. Provenance of predicted functions will be provided, allowing biologists to drill down to examine the underlying support and evidence. All core software tools will be released as open source, and data will be downloadable.
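The node-and-edge model described above can be illustrated with a minimal in-memory sketch. This is not the project's actual schema or software: the class names (Node, BioGraph), the relation labels, the identifiers (gene_G, protein_P, protein_Q), and the feature and evidence values are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A biological entity; both fields are illustrative assumptions."""
    node_id: str  # placeholder identifier, e.g. "gene_G"
    kind: str     # "gene", "protein", or "species"


class BioGraph:
    """Typed, directed graph with per-node features and per-edge evidence."""

    def __init__(self):
        self.features = {}               # Node -> dict of feature values
        self._edges = defaultdict(list)  # Node -> [(relation, target, evidence)]

    def add_node(self, node, **features):
        # Register the node and merge in any features supplied.
        self.features.setdefault(node, {}).update(features)

    def add_edge(self, source, relation, target, evidence=None, symmetric=False):
        # Make sure both endpoints exist, then record the typed edge.
        self.add_node(source)
        self.add_node(target)
        self._edges[source].append((relation, target, evidence))
        if symmetric:
            # Relations such as physical interaction hold in both directions.
            self._edges[target].append((relation, source, evidence))

    def neighbors(self, node, relation=None):
        # Targets reachable from node, optionally filtered by relation type.
        return [t for r, t, _ in self._edges.get(node, [])
                if relation is None or r == relation]

    def evidence(self, source, relation, target):
        # Provenance lookup: evidence recorded for one specific edge.
        return [e for r, t, e in self._edges.get(source, [])
                if r == relation and t == target]


# Placeholder entities and made-up values, for illustration only.
gene_g = Node("gene_G", "gene")
protein_p = Node("protein_P", "protein")
protein_q = Node("protein_Q", "protein")

graph = BioGraph()
graph.add_node(protein_p, localization="nucleus")
graph.add_edge(gene_g, "encodes", protein_p,
               evidence="placeholder annotation record")
graph.add_edge(gene_g, "regulated_by", protein_q,
               evidence="placeholder regulatory record")
graph.add_edge(protein_p, "interacts_with", protein_q,
               evidence="placeholder interaction record", symmetric=True)

print(graph.neighbors(gene_g, "encodes"))
print(graph.evidence(gene_g, "encodes", protein_p))
```

Storing the relation type on each edge lets a single structure hold the encoding, regulatory, and interaction relationships side by side, and attaching evidence to each edge is one simple way to support the kind of provenance drill-down the abstract mentions.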
This project will contribute curriculum materials suitable for inclusion in undergraduate and graduate courses in bioinformatics, genomics, phylogenomics and evolutionary biology, and provide a resource for researchers in vertebrate genomes.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1355632
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2013-09-15
Budget End: 2015-08-31
Support Year:
Fiscal Year: 2013
Total Cost: $250,000
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94710