Scientific challenges in hydrology and water resources, such as understanding the impacts of a variable climate, the sustainability of water supply under population growth and land use change, and the impacts of hydrologic change on ecosystems and humans, are increasingly data intensive. The volume of data produced by environmental scientists to study hydrologic systems requires advanced software tools for effective data visualization, analysis, and modeling. Scientists spend much of their time accessing, organizing, and preparing datasets for analyses, which can be a barrier to efficient analysis and can hinder scientific inquiry and advances. This project will develop new software that will enhance scientists' ability to apply advanced data visualization and analysis methods (collectively referred to as "data science" methods) in the hydrology and water resources domain. The project will promote standardized software tools and data formats to help scientists enhance the consistency, shareability, and reproducibility of the analyses they perform - all of which are important in building trust in scientific results. The software developed in the project will make data loading and organization for analysis easier, reducing the time scientists spend choosing appropriate data structures and writing computer code to read and parse data. It will enable users to automatically retrieve data from the HydroShare system, which is a hydrology domain data repository, as well as from important national water data sources like the United States Geological Survey's National Water Information System. The software will automatically load data from these sources into standardized, high-performance data structures that are targeted to specific scientific data types and that integrate with visualization, analysis, and other data science capabilities commonly used by scientists in the hydrology and water resources domains.
The project will also reduce the technical burden that water scientists face in creating a computational environment in which to execute their analyses by installing and maintaining the project's Python packages within the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) HydroShare-linked JupyterHub environment. Finally, the project will demonstrate the functionality and use of the software by producing a set of educational modules based on real water-data science applications. These modules will provide a specific mechanism for delivering the software to the community and promoting its use in classroom and research environments.
Scientific and related management challenges in the water domain are inherently multi-disciplinary, requiring synthesis of data of multiple types from multiple domains. Many data manipulation, visualization, and analysis tasks performed by water scientists are difficult because (1) datasets are becoming larger and more complex; (2) standard data formats for common data types are not always agreed upon, and, when they are, they are not always mapped to an efficient structure for visualization and/or analysis within an analytical environment; and (3) water scientists generally lack training in data-intensive scientific methods that would enable them to use new and existing tools to efficiently tackle large and complex datasets. This project will advance Data Science and Analytics for Water (DSAW) by developing: (1) an advanced object data model that maps common water-related data types to high-performance data structures within the object-oriented Python language and analytical environment, based upon standard file, data, and content types established by the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) HydroShare system; and (2) two new Python packages that enable users to write Python code that automates retrieval of desired water data, loads it into high-performance in-memory objects specified by the object data model designed in the project, and performs analysis in a reproducible way that can be shared, used for collaboration, and formally published for reuse. The project will use domain-specific data science applications to demonstrate how the new Python packages can be paired with the powerful data science capabilities of existing Python packages like pandas, NumPy, and scikit-learn to develop advanced analytical workflows within cloud and desktop environments.
The project aims to extend the data access, collaboration, and archival capabilities of the HydroShare data and model repository and promote its use as a platform for reproducible water-data science. The project also aims to overcome barriers associated with accessing, organizing, and preparing datasets for data-science-intensive analyses. Overcoming these barriers will help transform scientific inquiry and advance the application of data science methods in the hydrology and water resources domains.
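As a rough illustration of the object data model idea described above (mapping a common water data type to a typed in-memory object that analysis code can use directly), the following minimal Python sketch may help. It is not the project's actual software: the `TimeSeries` class, the `load_time_series` function, the CSV column names, and the sample values are all hypothetical and chosen only for illustration. It uses only the Python standard library; a real implementation would more likely build on pandas or similar packages.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimeSeries:
    """Hypothetical in-memory object for one time series (illustration only)."""

    variable: str
    units: str
    times: list[datetime] = field(default_factory=list)
    values: list[float] = field(default_factory=list)

    def mean(self) -> float:
        """Return the arithmetic mean of the observed values."""
        return sum(self.values) / len(self.values)


def load_time_series(csv_text: str, variable: str, units: str) -> TimeSeries:
    """Parse CSV text with 'datetime' and 'value' columns into a TimeSeries.

    The column names are an assumption made for this sketch.
    """
    series = TimeSeries(variable=variable, units=units)
    for row in csv.DictReader(io.StringIO(csv_text)):
        series.times.append(datetime.fromisoformat(row["datetime"]))
        series.values.append(float(row["value"]))
    return series


# Made-up sample data, standing in for a file retrieved from a repository.
SAMPLE_CSV = """datetime,value
2020-01-01T00:00:00,1.5
2020-01-01T01:00:00,2.5
2020-01-01T02:00:00,2.0
"""

discharge = load_time_series(SAMPLE_CSV, variable="discharge", units="m^3/s")
print(discharge.variable, len(discharge.values), discharge.mean())
```

The point of the sketch is the separation of concerns: parsing happens once in a loader, and downstream analysis works against a typed object with known fields and units rather than against raw files.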
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.