A large percentage of valuable geoscience data is based on the analysis of discrete samples and is collected manually (e.g., paleontological collections, structural/tectonic data, petrographic/mineralogic data, economic data, geochemical measurements, and rock mechanics). Often, these data are reported only in tables in the published literature or in PDFs or spreadsheets on individual investigator websites. Commonly, these data are not registered in or entered into standardized, publicly accessible databases. As a result, for these data to be discovered and reused, researchers or other interested parties must manually comb through the text, figures, and appendices of journal articles or the websites of individual investigators, sometimes having to sift through raw experimental data. This process is extremely time intensive and slows both scientific discovery and the verification of research results. Consequently, a vast amount of surface earth geoscience data is currently inaccessible; such inaccessible data is termed "dark data".

This EAGER combines the expertise of top-notch computer scientists and geoscientists whose goal is to create a search algorithm that brings this dark data to light in a way that will enable the next generation of integrative geoscience research. The approach will involve development of an innovative search engine "crawler" that will comb the geoscience literature and bring dark data to light from the text and figures in this corpus. The cyberinfrastructure tool being developed will be able to interpret the semantics of English text and the concepts of geoscience. The tool will be piloted by examining entries in the Macrostrat database, a structured spatial database of lithologic and geochronologic information, and then employing a geoscience ontology by means of the Hazy framework for information extraction. Questions to be addressed include the extent to which dark data is presently accessible and whether it can be extracted and placed into an accessible format and repository where it can be discovered by web services or other search engines. Broader impacts of the work include the training of graduate students and the strengthening of scientific infrastructure through the development of a new and much needed data search tool.
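As a rough illustration only (the term list, example sentence, and function below are hypothetical and are not drawn from Macrostrat or the Hazy framework, whose extraction goes well beyond simple lookups), a first pass at pulling lithologic mentions out of article text might amount to a dictionary match:

```python
import re

# Hypothetical mini-vocabulary of lithologic terms (illustrative only;
# a real pipeline might draw its term list from a resource like Macrostrat).
LITHOLOGY_TERMS = {"sandstone", "limestone", "shale", "dolomite", "basalt"}

def tag_lithologies(sentence):
    """Return the vocabulary terms that appear in a sentence of article text."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sorted(set(tokens) & LITHOLOGY_TERMS)

snippet = ("The formation consists of interbedded sandstone and shale, "
           "overlain by a thin limestone.")
print(tag_lithologies(snippet))   # ['limestone', 'sandstone', 'shale']
```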
This project focused on 'dark data', the hidden, unused data that exist across the sciences, with particular emphasis on ocean geological data. Hidden data can be located with word searches, and with suitable techniques exact matches are not necessary: parts of words or related concepts are enough. This is done by linking words into semantic 'meaning-nets' and by recognizing the parts (stems) of words. Our part of the project compiled vocabularies of rock, sediment, soil, and ice terms that could be used in this way. The vocabularies were written so that computer programs could pick up the terms and quickly scan the web, documents, and databases for hidden data resources on specialist topics such as sub-seabed ice (hydrates) or minerals. In addition to working with other high-throughput computing labs, we applied these techniques ourselves to mapping the coastlines, harbours, and reefs of shorelines around the world from many different hidden data resources. A surprisingly large amount of data was recovered and will be combined with existing data to map the difficult ocean zone in the surf that is too shallow for ships and muddied by rivers. Post-award, the software and vocabularies will continue to be available for other groups to pick up, use, and improve; this is one way the 'dark data' problem will be reduced. CJ
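A minimal sketch of the stem-based matching idea described above (the vocabulary entries and matching rule here are simplified assumptions, not the project's actual vocabularies or software) shows why exact word matches are unnecessary:

```python
# Hypothetical vocabulary: word stems mapped to concept labels. The entries
# are illustrative assumptions, not the project's compiled term lists.
VOCAB_STEMS = {
    "glaci": "glacial ice",       # glacier, glacial, glaciated, ...
    "hydrat": "gas hydrate",      # hydrate, hydrated, hydrates, ...
    "carbonat": "carbonate",      # carbonate, carbonates, carbonatite, ...
}

def concepts_in(text):
    """Match vocabulary stems against tokens in free text, so that
    'glaciated', 'glacial', and 'glaciers' all hit the same concept."""
    found = set()
    for raw in text.lower().split():
        token = raw.strip(".,;:()")
        for stem, concept in VOCAB_STEMS.items():
            if token.startswith(stem):
                found.add(concept)
    return sorted(found)

print(concepts_in("Hydrated carbonate muds beneath glaciated margins."))
# -> ['carbonate', 'gas hydrate', 'glacial ice']
```

A production scanner would use a proper stemmer and richer synonym links rather than this crude prefix test, but the principle is the same: the vocabulary supplies the concepts, and partial matches are enough to flag a document as a candidate hidden data resource.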