The success of machine learning (ML) in applications where large-scale data is available has raised expectations of similar accomplishments in scientific disciplines. Data science is particularly promising for scientific problems involving processes that are not completely understood. However, a purely data-driven approach to modeling a physical process can be problematic. For example, it can produce a complex model that is neither generalizable beyond the data on which it was trained nor physically interpretable. The problem worsens when training data is scarce, which is common in science and engineering domains. A machine learning model grounded in explainable theory is less likely to learn spurious patterns from the data that fail to generalize. This is especially important for critical, high-risk problems (e.g., extreme weather or the collapse of an ecosystem). Hence, neither an ML-only nor a scientific-knowledge-only approach can be considered sufficient for knowledge discovery in complex scientific and engineering applications. This project is developing novel techniques to explore the continuum between knowledge-based and ML models, where scientific knowledge and data are integrated synergistically. Such integrated methods have the potential to accelerate discovery in a range of scientific and engineering disciplines. The project will train interdisciplinary scientists who are well versed in these methods and will disseminate its results through peer-reviewed publications, open-source software, and a series of workshops that engage the broader scientific community.

This project aims to develop a framework that uses the unique capability of data science models to automatically learn patterns and models from data, without ignoring the wealth of accumulated scientific knowledge. Specifically, the project builds the foundations of knowledge-guided machine learning (KGML) by exploring several ways of combining scientific knowledge with machine learning models, using pilot applications from four domains: aquatic ecodynamics, climate and weather, hydrology, and translational biology. These pilot applications were selected because they are at tipping points where KGML can have a transformative effect. KGML has the potential to give scientists and engineers new insights into their domains of interest, and it will require the development of new machine learning approaches and architectures that can incorporate scientific principles. The scientific knowledge, KGML methods, and software developed in this project could potentially be extended to a wide range of scientific applications where mechanistic (also known as process-based) models are used.

This project is part of the National Science Foundation's Harnessing the Data Revolution (HDR) Big Idea activity.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Advanced CyberInfrastructure (ACI)
Program Officer: Eva Zanzerkia
University of Minnesota Twin Cities
United States