The science community increasingly employs multi-method approaches to scientific exploration, with a growing reliance on computational methods. This is particularly true of the science of climate and global environmental change. With this evolution, data storage and archiving have become fundamentally important. Archiving digital science data remains an open problem that affects many fundamental science initiatives.

We propose an Open Archival Information System (OAIS) (2009) compliant data archival repository that lives early in the scientific research pipeline. The repository supports the ingest and access mechanisms that users have become accustomed to and have staff to support, while also supporting curation and preservation of data and making relational databases, and eventually other databases, more usable in real time by researchers and policy makers. The testbed for this approach is the International Forestry Resources and Institutions (IFRI) database, the most complete data archive of how communities develop strategies for sustainable forest management. The IFRI scientific user community conducts field visits every five years to over 250 diverse sites in 11 countries.

The proposed repository conceptually wraps the original database into a unit that also contains a metadata catalog and provenance collection tool with interaction and replication guided by the OAIS standard. A fundamental research question in this effort is the data model that maps a database schema to an object model which abstracts scientific intent. The abstraction of scientific intent is grounded in a general conceptual model for reasoning about the life cycle of social-ecological systems and their interactions and outcomes. We thus expect to generalize the tools and data model that provide the map from a database to the science-oriented conceptual model expressed as an ontology.

The International Forestry Resources and Institutions network includes twelve Collaborating Research Centers in ten countries on four continents. The early research conducted in this project will form a foundation for outreach through IFRI, with broad potential for science and policy impacts worldwide. The proposal funds a computer science graduate student and a postdoctoral fellow, who will be engaged in interdisciplinary research in an area of emerging importance for many generations to come.

A critical component of the long-term success of the ideas in this proposal will be getting the word out. We will therefore seek to present talks about these tools and this approach to long-term digital data collection projects, particularly ones focused on environmental monitoring such as LTER and OOI.

Project Report

Beth Plale, PI; Elinor Ostrom, Co-PI; Tom Evans, Co-PI; Scott Jensen, Postdoc
Indiana University

Social-ecological research is the study of coupled social-ecological systems; it examines which governance structures work in which situations and how the sustainability of such resources can be enhanced. The data arising from the study of social-ecological systems is highly complex. In this project, computer scientists and social-ecological researchers collaborated on new ways for data gathered through the study of social-ecological interactions to be more easily shared and preserved. Over 30 years ago, the late Elinor Ostrom, 2009 Nobel Laureate in Economics and one of the Co-PIs on this project, in collaboration with researchers around the world, identified factors that are critical to the actions and outcomes of complex social-ecological systems. Dr. Ostrom pioneered the SES Framework, a multi-tiered framework of the critical variables of complex social-ecological systems. The SES Framework identifies critical characteristics of resource systems, the users of the resources, the systems governing the use of the resources, the interactions between these elements, and the resulting outcomes. We adopt the SES Framework as a common framework to organize findings. We transform a database consisting of 18 years of social-ecological system data collected by the International Forestry Resources and Institutions (IFRI) research program at the University of Michigan. The IFRI data is challenging in that, like most social-ecological data, it is complex: it is captured using a survey instrument consisting of 10 forms totaling over 180 pages with 922 distinct questions. The responses are stored in a relational database. We leverage techniques from the Semantic Web to convert a relational schema to a hierarchy of relations called logical objects.
This mapping is possible because, although the IFRI survey instrument captures complex relationships between resource systems, users, and governance structures, the fields in the database are simply answers to individual questions gathered during a visit to a site. Through mapping files that can be created using common spreadsheet tools, each question in the research instrument is mapped to a second-tier category in the SES Framework and grouped into generalized logical objects. By categorizing each survey question in accordance with the SES Framework, we obtain new views on the data. For instance, the heatmap in the figure shows the data density for each site study in the IFRI data set. A row is a site study; a column is a second-tier variable in the SES Framework. A cell with high data density is red; one with low data density is blue. This kind of visualization can help a researcher better locate and understand an existing data set. We also worked on automating the manual task of mapping a social-ecological dataset to the SES Framework. In the case of the IFRI study instrument, this requires mapping 922 questions. We studied how well the classification of questions could be automated using machine learning. One question we addressed is whether, as the number of datasets classified against the SES Framework grows, the classification of an earlier dataset can be used to teach a machine classifier to categorize previously unseen research instruments. To test this approach, we used the leave-one-out cross-validation (LOOCV) method to classify each question. The initial results are promising: when classifying each question to a single SES second-tier category, the F-measure is 0.597. We expect that an increase in the number of research instruments classified, combined with refinements to our machine learning approach, will increase the classification accuracy.
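The data-density view described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the question IDs, the category labels in `QUESTION_TO_SES`, the sample responses, and the definition of density as the fraction of a category's questions that have a non-empty answer are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping-file contents: question ID -> SES second-tier category.
# The IDs and category labels are placeholders; in practice such a mapping
# could be exported from a spreadsheet as CSV.
QUESTION_TO_SES = {
    "F_Q1": "RS3",
    "F_Q2": "RS3",
    "U_Q1": "U1",
    "U_Q2": "U1",
    "G_Q1": "GS4",
}

# Hypothetical responses: site study -> {question ID: answer, or None if missing}.
RESPONSES = {
    "site_A": {"F_Q1": "120 ha", "F_Q2": None, "U_Q1": "45", "U_Q2": "12", "G_Q1": None},
    "site_B": {"F_Q1": "80 ha", "F_Q2": "mixed", "U_Q1": None, "U_Q2": None, "G_Q1": "communal"},
}


def data_density(responses, question_to_ses):
    """Return {site: {category: fraction of that category's questions answered}}."""
    # Count how many questions map to each category.
    totals = defaultdict(int)
    for category in question_to_ses.values():
        totals[category] += 1

    density = {}
    for site, answers in responses.items():
        answered = defaultdict(int)
        for question, value in answers.items():
            category = question_to_ses.get(question)
            # Skip unmapped questions and empty answers.
            if category is not None and value not in (None, ""):
                answered[category] += 1
        density[site] = {c: answered[c] / totals[c] for c in totals}
    return density


if __name__ == "__main__":
    for site, row in data_density(RESPONSES, QUESTION_TO_SES).items():
        print(site, row)
```

Each row of the resulting dictionary corresponds to one row of a heatmap, with one value between 0 and 1 per second-tier category; a plotting library could then map those values to a blue-to-red color scale.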
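The leave-one-out evaluation can be illustrated as follows. This is a sketch under stated assumptions rather than the classifier used in the project: the example questions and labels are invented, a nearest-neighbour classifier over word overlap (Jaccard similarity) stands in for whatever machine learning method was actually used, and the macro-averaged F-measure is only one of several ways to summarize single-label results.

```python
import re

# Invented (question text, SES category) pairs; the labels are placeholders.
LABELED_QUESTIONS = [
    ("What is the size of the forest in hectares?", "RS3"),
    ("What is the total area of the forest?", "RS3"),
    ("How many households use the forest?", "U1"),
    ("How many individuals are in the user group?", "U1"),
    ("Who holds the property rights to the forest?", "GS4"),
    ("Which rights do users hold to harvest forest products?", "GS4"),
]


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(question, training):
    """1-nearest-neighbour: label of the most similar training question."""
    q = tokens(question)
    _, best_label = max(training, key=lambda pair: jaccard(q, tokens(pair[0])))
    return best_label


def loocv(data):
    """Hold out each question once, train on the rest; return (true, predicted) pairs."""
    results = []
    for i, (text, label) in enumerate(data):
        training = data[:i] + data[i + 1:]
        results.append((label, predict(text, training)))
    return results


def macro_f_measure(results):
    """Average the per-category F1 score over all categories seen."""
    labels = {t for t, _ in results} | {p for _, p in results}
    scores = []
    for label in labels:
        tp = sum(1 for t, p in results if t == label and p == label)
        fp = sum(1 for t, p in results if t != label and p == label)
        fn = sum(1 for t, p in results if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    results = loocv(LABELED_QUESTIONS)
    print("macro F-measure:", macro_f_measure(results))
```

Because every question is held out exactly once, each prediction is made by a classifier that has never seen that question, which mirrors the goal of categorizing previously unseen research instruments.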
We additionally explored using the Data Documentation Initiative (DDI) to describe the IFRI data. DDI, which has wide adoption in social science research, captures the coding of survey responses. The IFRI data, however, does not contain any particular coding; it is raw site visit data. Early investigation under this EAGER grant suggests that DDI2 could represent a case study that includes IFRI data as contributing evidence, but DDI2 is not suited to representing raw, uncoded data. DDI v 3.0 describes complex data and relationships, so it has the potential to be a better fit for describing the IFRI data. Early investigation suggests that while it would be possible to describe the structure of a relational database, the result would likely be more difficult for a social-ecological researcher to understand than the existing relational database.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1058452
Program Officer: Marilyn McClure
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $204,991
Indirect Cost:
Name: Indiana University
Department:
Type:
DUNS #:
City: Bloomington
State: IN
Country: United States
Zip Code: 47401