This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.
The explosion in DNA sequencing capacity, wireless sensor technologies, and data storage capabilities makes it possible for biologists to generate and store vast amounts of data in the pursuit of biomedical discoveries. To realize the full potential of this approach, the new, data-driven biology requires a cyberinfrastructure that is readily accessible to all scientists and that facilitates the movement, storage, and analysis of large amounts of data. To meet this need, a variety of novel tools and utilities must be created, including: 1) automated pipelining tools that allow users to analyze vast quantities of data; 2) access to highly integrated database resources, so that collected data can be instantly linked to existing knowledge; 3) a software workbench where federated data can be manipulated and visualized in a user-friendly environment; 4) access to computational resources to drive the calculations through grid computing; and 5) tools to store and share the results of individual investigations. These tools must be designed to require minimal resources on the part of the user: computations must be carried out on the server side, graphics must be lightweight, and data must be provided in forms that allow for interoperability. Access to computational resources must also be transparent: the scientist must be able to use grid computing resources without needing to know where these resources are, or even that they are grid computing resources.
The Encyclopedia of Life (EOL) is a Grand Challenge project aimed at creating precisely this type of cyberinfrastructure for the proteomics community of the 21st century. The EOL consists of three elements. The first is a software pipeline for the automated annotation of sequenced genomes. This pipeline consists of protein sequence and structure prediction and annotation tools that run on whole genomes, and a workflow system that maps these calculations onto distributed resources at partner institutions throughout the world. The second is a reference database: annotations derived from the pipeline are stored in a normalized reference database that is federated with seven other major biological databases, allowing direct queries across several areas of specialization. The third element of EOL is focused on use and distribution of the data: all data generated, stored, and federated by the EOL project will be presented to the user for analysis and distribution using innovative Web services-based data sharing tools, including a web browser-based encyclopedia of annotated genomes and a virtual user notebook that allows for data storage, preservation of workflow information, and peer-to-peer data sharing. These distribution tools are currently either implemented in alpha form or under development. The EOL cyberinfrastructure is designed to yield a significant number of generic software tools and middleware that are not confined to proteomics but can be applied across the biomedical community. These tools can be applied directly, with a minimum of additional effort, to facilitate the exchange of information and the analysis of data within any given domain of biology and, importantly, between domains.