The scientific community is awash in 'big data', but few practicing ecologists use these data to answer important ecological questions. Instead, they rely on the traditional approach of collecting new experimental data focused on particular species, habitats, or problems. In addition, the data-intensive computational methods commonly needed to analyze big datasets are not easily accessible to most researchers. This high-risk, high-reward project could dramatically alter both the ways in which ecologists address questions and the types of questions that they tackle. It therefore represents a major contribution to NSF's efforts to extend ecological research in new directions and to answer more complex questions.
Timely scientific progress in an era of big data requires a knowledge-driven, open access system that 'learns' and becomes more efficient and easier to use as data streams increase in variety and size. This approach centers on linking databases with hypothesis-based inquiry, so that improved access to dynamic databases leads to new or refined hypotheses. The investigators recently implemented a hypothesis-driven, process-based analytical methodology that was conceptually integrated with a data-intensive machine learning approach. This integrated approach allowed them to use multiple long-term datasets to narrow a diverse suite of mechanistic explanations to a single, most likely process. That process was then tested with a short-term experiment, which saved time and money and yielded a more definitive answer than the traditional approach described above. This project will test, refine, and automate this integrative approach to develop a prototype cyber-infrastructure capable of significantly advancing the environmental sciences. Open access data, programming scripts, and derived data products will reduce the time lag for knowledge transfer from an individual to the research community, likely increase the speed of scientific progress, and provide a filter and memory for how to deal with large amounts of data of mixed quality. A postdoctoral researcher will work collaboratively with computer scientists, ecologists, and eco-informatics experts from three universities (New Mexico State University, University of Texas El Paso, and Evergreen College) and one corporation (Microsoft) to develop, test, and automate this knowledge-learning analytics system. Two workshops will be organized to test the ability of the system to learn while using diverse datasets and to introduce the approach to a wide variety of users.