In almost every field of science, it is now possible to capture large amounts of data. This has led machine learning to play an increasingly important role in scientific discovery, for example by sifting through large amounts of data to identify interesting events. But modern machine learning techniques are less well suited to the critical tasks of devising hypotheses consistent with the data or imagining new experiments to test those hypotheses. The goal of this Expeditions project is to develop new learning techniques that can help automate this process of generating scientific theories from data. To ground this research in real applications, the project focuses on four domains where these techniques can have the most significant impact: organic chemistry, RNA splicing, cognitive and behavioral science, and computing systems. Machine learning is already demonstrating value in all of these domains, with applications that include predicting properties of organic compounds, recognizing complex social activities, and modeling the performance of computer systems. However, the proposed techniques could have a transformative impact in all of these domains by helping scientists gain a deeper understanding of the processes that give rise to their data. This deeper understanding could lead to important contributions ranging from more efficient drug discovery to improved teaching methods grounded in a better understanding of cognition.

To realize this vision, the project will develop new methods for learning neurosymbolic models that combine neural elements capable of identifying complex patterns in data with symbolic constructs that can represent higher-level concepts. The approach is based on the observation that programming languages provide a uniquely expressive formalism for describing complex models. The aim is therefore to develop learning techniques that produce models resembling those that scientists already write by hand in code. These neurosymbolic techniques will more easily incorporate prior knowledge about the phenomena being modeled, and will produce interpretable models that can be analyzed to devise new experiments or to infer causal relations. By developing these techniques and building them into tools that can be used by scientists in a variety of fields, this project has the potential to revolutionize the way scientific knowledge is derived from data. More broadly, these new techniques will be useful in any setting that requires learning more interpretable models with strong requirements on their desired behavior.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 1918865
Program Officer: Nina Amla
Project Start:
Project End:
Budget Start: 2020-04-01
Budget End: 2025-03-31
Support Year:
Fiscal Year: 2019
Total Cost: $319,998
Indirect Cost:
Name: California Institute of Technology
Department:
Type:
DUNS #:
City: Pasadena
State: CA
Country: United States
Zip Code: 91125