The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) has been involved in the development of Water Data Services (WDS) through the CUAHSI Hydrologic Information Systems (HIS) project. The vision for WDS is to bring together the nation's (and, potentially, the earth's) water data in a federated system of servers linked using a services-oriented architecture. CUAHSI WDS is used by academic researchers and by government data providers at both the Federal and State levels. A critical challenge in achieving this vision is understanding and reconciling structural and semantic differences across publishers of hydrologic data. The HIS project achieved interoperability between different data repositories by developing a common relational schema (Observations Data Model, ODM V1.1), an XML schema for exchanging hydrologic observations (Water Markup Language, WaterML V1.0), and a prototype ontology (V1.0) of hydrologic concepts that is used for data discovery. Experience has shown that the prototypes for WaterML and the ontology are in need of further development. The prototype ontology provided semantic mediation among measured properties, but as this ontology was applied to more diverse data holdings, it became apparent that (i) semantic heterogeneity continued to exist at various levels, including sampled media, sampling environment, chemical speciation, and units; (ii) the ontology needs to cover a wider range of parameters (such as those from the EPA Substance Registry System); and (iii) the development of semantic knowledge needs to be placed on a much broader community base. WaterML has proven to be of great utility as a standardized way of transferring water information, but it needs to be moved through a formalized standardization process.
This project will address the underlying semantic problem through the development of a more comprehensive, extensible ontology that harmonizes the more generic information model contained within ODM with those from various existing federal information sources. This includes the development of a community process for an evolving hydrologic ontology. In addition, WaterML, which has been adopted by USGS and NCDC, will be extended to reflect this more generic information model. The project will contribute to forming an international standard. The fundamental goal of the project is the organization of hydrologic concepts in a way that allows publishers to describe their data unambiguously and helps users to discover data easily yet with a precise understanding of the properties measured and their context.
The goal of this project was to improve the ability of scientists from multiple disciplines, as well as the general public, to discover data about water by developing controlled vocabularies and ontologies. Controlled vocabularies are simply lists of words that provide a set of options for discovering data. By restricting the description of data to a set of specific words, data are described in a more consistent manner. An ontology describes the logical relationships among terms and permits searches to be done more flexibly. In this project, we focused on ways of organizing words from broad concepts to narrower, more precise concepts, either using hierarchies of concepts or less structured approaches that permit the controlled vocabularies simply to be tagged with more general terms. Water data are particularly useful to consider for discovery because multiple scientific disciplines depend on such data in their research. Water is important to many aspects of the geosciences but is also essential for the biological and social sciences. Thus, this project formed a case study in how scientists discover and reuse data that can be applied to other disciplines. This project developed tools, such as a thesaurus of terms used by multiple geoscience disciplines, and observed how scientists used these tools in discovering data.
We found that scientists were not content to describe their data with general terms; they demanded very precise terms. This created a challenge to develop controlled vocabularies that met the precision demanded but remained manageable. We found that we had to carefully construct controlled vocabularies that described aspects of the data in a small number of mutually exclusive terms. By doing so, we avoided a rapid multiplication of terms needed to describe data. This is a substantial operational challenge for data centers. On the other hand, scientists were quite flexible in how terms were organized. So long as they could understand the logic behind the hierarchy, they could navigate from broad to more specific concepts. For scientists searching for data from other disciplines, as well as for students and non-specialists interested in the data, the ontological hierarchies were critical for successful discovery of data.
We found that scientists have difficulty in understanding the terms from other, even fairly closely related, disciplines. For example, hydrologists cannot easily interpret many terms used by atmospheric scientists. Often disciplines have implicit reporting conventions that make it easy for non-specialists to misinterpret the data; such conventions must be made explicit in cross-disciplinary vocabularies. Each discipline needs to explain data from other disciplines in terms that its own scientists can readily understand.
The results of this project are being transmitted to the CUAHSI Water Data Center (WDC) for inclusion in its data publishing and discovery software stack. The CUAHSI WDC is working with data centers from related disciplines such as Unidata (atmospheric sciences) and IEDA (solid-phase geochemistry) to improve data sharing in the geosciences. The intellectual merit of this project was a systematic evaluation of the use of "data about data" (known as metadata) for the purpose of data discovery and reuse, as well as the examination of how scientists actually attempt to discover and to reuse data. The broader impacts of this work will be to enable more effective cross-disciplinary and interdisciplinary research in environmental science and geoscience and to permit broader access to data by non-specialists.
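Two of the findings above, that vocabularies stay manageable when each aspect of the data is described by a small set of mutually exclusive terms, and that hierarchies let users move from broad to narrow concepts, can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the facet names, the terms, the broader-term links, and the sample records are not taken from the project's actual vocabularies or software.

```python
import math

# Hypothetical facets: each is a small, mutually exclusive controlled vocabulary.
FACETS = {
    "property": ["nitrogen", "phosphorus", "temperature"],
    "medium": ["surface water", "groundwater", "precipitation"],
    "speciation": ["as N", "as P", "not applicable"],
}

# Hypothetical hierarchy: each term points to its broader concept.
BROADER = {
    "nitrogen": "nutrient",
    "phosphorus": "nutrient",
    "nutrient": "water quality",
    "temperature": "physical property",
}

# Hypothetical data records, each described with one term per facet.
RECORDS = [
    {"id": "site-1", "property": "nitrogen", "medium": "groundwater", "speciation": "as N"},
    {"id": "site-2", "property": "temperature", "medium": "surface water", "speciation": "not applicable"},
    {"id": "site-3", "property": "phosphorus", "medium": "surface water", "speciation": "as P"},
]


def validate(record):
    """Check that every facet of a record uses a term from its vocabulary."""
    return all(record[facet] in terms for facet, terms in FACETS.items())


def ancestors(term):
    """Return the term followed by every broader concept above it."""
    chain = [term]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def search(records, concept):
    """Find records whose measured property is the concept or narrower than it."""
    return [r for r in records if concept in ancestors(r["property"])]


# Terms to maintain with separate facets versus one combined list of
# every possible property/medium/speciation combination.
facet_terms = sum(len(terms) for terms in FACETS.values())
combined_terms = math.prod(len(terms) for terms in FACETS.values())

if __name__ == "__main__":
    print([r["id"] for r in search(RECORDS, "nutrient")])
    print(facet_terms, combined_terms)
```

In this sketch, a search for the broad concept "nutrient" returns the nitrogen and phosphorus records without the user needing to know the precise terms, while the term counts show how keeping facets separate avoids enumerating every combination as its own vocabulary entry.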