A large percentage of valuable geoscience data is based on the analysis of discrete samples and is collected manually (e.g., paleontological collections, structural/tectonic data, petrographic/mineralogic data, economic data, geochemical measurements, rock mechanics, etc.). Often, these data are reported only in tables in the published literature or in PDF files or spreadsheets on individual investigator websites. Commonly, these data are not registered on or entered into standardized, publicly accessible databases. As a result, for these data to be discovered and used or reused, researchers or other interested parties must manually comb through the text, figures, and appendices of journal articles or the websites of individual investigators, sometimes having to sift through raw experimental data. This process is extremely time intensive and slows both scientific discovery and the verification of research results. Consequently, a vast amount of surface earth geoscience data is currently inaccessible. This inaccessible data is termed "dark data". This EAGER combines the expertise of top-notch computer scientists and geoscientists whose goal is to create a search algorithm to bring this dark data to light in a way that will enable the next generation of integrative geoscience research. The approach will involve development of an innovative search engine "crawler" that will comb the geoscience literature and bring dark data to light from the text and figures in this corpus. The cyberinfrastructure tool being developed will be able to interpret the semantics of English text and the concepts of geoscience. The tool will be piloted by examining entries in the Macrostrat database, a structured spatial database of lithologic and geochronologic information, and then employing a geoscience ontology by means of the Hazy framework for information extraction.
The questions to be addressed are to what extent dark data is presently accessible and whether it can be extracted and placed into an accessible format and repository where it can be discovered by web services or other search engines. Broader impacts of the work include training of graduate students and strengthening the infrastructure for science through the development of a new and much needed data search tool.
During this project, we took the first steps towards the two stated project goals: (1) linking the geological literature to a structured knowledge base called Macrostrat, and (2) extracting and aggregating measurement data. During this project we have extracted facts from over 122k. In particular, we have extracted over one million formations and over one million measurements with precision over 95% and 87%, respectively. These quality numbers required close joint work between the computer science and geoscience collaborators, which was facilitated by this grant. To support (1), we have linked over 489k formations to their corresponding units in Macrostrat. Since the database is structured, we are able to compute a wide range of aggregations over this data set. We are in the process of understanding how these aggregations can be used to support further scientific study by the broader geological community. To create these data sets, we employed state-of-the-art statistical reasoning and natural language processing techniques. This processing is expensive: it required over 78 machine years, which was only possible using the Condor High-throughput Infrastructure. As a result of this data collection effort, we have begun to collect facts about paleobiology. This allows us to compare our results to previous data collection efforts like PaleoDB. Such efforts have taken over a decade and involved hundreds of scientists. However, our approach is able to achieve over 90% precision on the same data collection effort with a much smaller, less expensive team. We believe this is an interesting first step toward demonstrating this new technology. We have achieved the broader impacts of this work by advertising the work in the geoscience community, with several top-tier conference publications, and by training students in geoscience and computer science in these emerging data processing systems.