The Carnegie Institution for Science is awarded a grant to build an application programming interface (API) that will provide high-performance, web-based computational access to plant genomics data. The project will include a new, open technology for generating computational data web services; an open, portable, optimized data warehouse that supports very fast queries of plant biology data; and a plant biology query language, query builder, and query optimizer that will provide a simple way to limit query results to only the required data. The project will provide a range of data access methods to serve the needs of computational biologists and bench biologists for large and custom datasets. Implementation of this software at TAIR will provide REST and SOAP web services for computational use of TAIR data, RSS feeds for TAIR objects, and a new query builder for TAIR.
While strong advances have been made in data generation methods, including new genome sequencing methods, high-throughput phenotyping, protein localization, and others, computational access to the resulting data still requires large amounts of both human and machine resources. This project addresses the issue at the architectural level, combining advances in software engineering and applying them to the specific domain of plant genomics. In particular, developing a minimal but effective plant genomics schema using modern data modeling; leveraging model-driven architecture to enable generation of high-performance web services from platform-independent models; and developing a basic, well-formed query language for the plant genomics domain are intellectually challenging tasks that will have a significant impact on the technology required for computational access. Wide adoption of a standard set of web interfaces for computational access to plant genomics resources will greatly reduce the effort required to access and integrate plant genomic data, thereby facilitating computational analyses of the data. By providing an easy, robust, and consistent route to computational data access, standard web interfaces will also facilitate development of new resources that could transform existing datasets and present them in new ways, analogous to mashups that combine Google Maps data with real estate listings, weather data, Wikipedia entries, etc. By providing computational APIs, and the technological infrastructure to create them, as open source tools, this project makes a key set of technologies available to computational biologists beyond TAIR. As the technology proves itself, it can move beyond plant biology into the more general biological realm. Further information about this project may be found at the TAIR website: http://arabidopsis.org.
Continuing technical advances in high-throughput sequencing and other genome-scale technologies have fueled a scientific revolution with immense potential for extending biological knowledge. These advances have also posed an immense challenge: how to make optimal use of vast quantities of biological data. Without long-term, high-quality mechanisms for computational access to and analysis of the data, the full potential of these large new datasets for transformative scientific research will not be realized. The goals of the PLAIN project over the period of the award were to (1) develop a set of reusable tools that make it as easy as possible to access plant genomic data by computational means and (2) apply this toolset and additional methods to provide a computational interface for genomic data on Arabidopsis thaliana, a well-studied reference plant for which many types of genetic and genomic data have been captured in TAIR. Tools developed in this funding period include a set of data warehouses for specific types of plant genomic data, a new query language and parser (PSQL) for querying plant genome data, and web services for accessing the data from the new data warehouses.
Intellectual Merit: We have made significant progress in implementing a domain-specific query language for expressing biological queries, and we have constructed a set of data warehouses for the queries to operate against. The data warehouses have been deployed within TAIR’s infrastructure, where they already provide improved access to specific data types. The tools developed here have also been used by another scientific database to extract data for display to researchers.
Broader Impacts: A publisher of scientific journals has made use of these tools to add a new feature to research articles about Arabidopsis genes. Each time an Arabidopsis gene is mentioned in an article, the online version of the publication uses the tools developed in this project to display a sidebar containing background information about that gene drawn from the TAIR website, thereby making the research data contained in TAIR more widely available to researchers, students, and the general public. Project information and source code can be accessed from www.arabidopsis.org/about/plain.jsp.