Biodiversity comprises all variations of life at all levels of biological organization, most of which arise from genomic diversity. As genomic technologies become available across the biological sciences, a full characterization of biodiversity demands a full characterization of genomes. Similarly, data synthesis across the full range of biodiversity research domains demands development, implementation, integration and harmonization of data exchange standards. Such interoperable informatics would be transformational for our understanding of biology, with consequent impact on environmental and conservation policy. Adding to the transformational potential is the fact that the microbial world represents half of the world's biomass and nearly all of its biodiversity, yet is still effectively invisible and intractable to traditional biodiversity research. Metagenomic data are not amenable to the concepts, standards, semantics, and methods of traditional eukaryotic biodiversity research and therefore require an alternative informatics framework.
The EAGER will transform the collaborations between two previously separate research communities: the informaticists of the traditional biodiversity community, who employ the Darwin Core (DwC) as a standard, and the informaticists of the Genomic Standards Consortium (GSC), who have developed the Minimum Information about any (x) Sequence (MIxS) standard for genomics, metagenomics and marker genes. Together, these groups will engage in a unified informatics effort to develop three layers of interoperability: standards, syntactic, and semantic. First, the EAGER will harmonize the observational (DwC) and genomic (MIxS) standards, building on a community dialogue and interdisciplinary networking hosted and established under an NSF Research Coordination Network. This standards interoperability is the basis for the next two layers. Second, the EAGER will support syntactic interoperability (in the context of Internet APIs and a database Reference Model). It will assemble experts from the two communities to (a) devise a database Reference Model that integrates the DwC and GSC MIxS standards; and (b) for effective data management, create specific implementations for different database platforms to foster adoption. The practical implementation of the Reference Model on different database systems will allow, for the first time, systematic comparative testing of technical performance and of use cases (e.g., which implementation best serves which complex data query). Third, the EAGER will create task groups to establish the infrastructure for managing ontologies, and to construct a reference model on the purely semantic level in order to fuse the two worlds of data standards, both of which are advanced enough to engage in useful interoperability.
In developing an interdisciplinary information infrastructure to achieve data interoperability across domains, this EAGER would advance understanding of complex environmental phenomena and, thereby, inform future policy decisions. Indeed, by leading to an informatics standards platform to conceive a novel conceptual and theoretical framework for the world of microbial "dark matter," the EAGER would have a transformational impact beyond science.