People are highly motivated to find explanations and solutions for the pressing problems facing our country and our world, but they often lack the analytical tools needed to move beyond limited-scope analyses, guesswork, and instinct. This project aims to tackle this problem by developing a set of tools that make use of the massive amount of data that is freely available on the web. Anyone with a modern web browser will be able to run these tools and take part in a collaborative effort to construct comprehensive causal models for complex socio-economic and other systems. Uncovering the causal relations that exist among the variables in multivariate datasets is one of the ultimate goals of data analytics. A causal assertion distinguishes cause from effect: for example, it states that "smoking causes cancer", but not the reverse. This directionality is what makes a causal model more definite than a correlation. Causal models are attractive because they are inherently interpretable: they can directly explain the complex interactions that exist in the underlying data. The online tools developed in this project will support collaboration in constructing comprehensive causal models of unprecedented scale for complex socio-economic and other systems.

This exploratory project aims to develop a set of tools that use freely available, web-scale data for the collaborative construction of comprehensive causal models of unprecedented scale for complex socio-economic and other systems. The project will break new ground in how the creative energies of experts and non-experts can be harnessed to (1) identify datasets on the web that can add novel aspects (variables) to an evolving causal model, and (2) integrate these novel aspects (variables) as new nodes and causal edges into the model. Since a large-scale causal model becomes harder to build as it grows, the project will produce several new automated tools that hide this complexity from human users, aid them in dealing with incomplete or adverse data, and provide inspiration for possible refinements. At the same time, novel techniques will be developed that ensure the validity and correctness of the evolving causal model in the presence of concurrent users. To hide complexity, the project will produce new techniques that break a large causal model into a set of human-manageable subgraphs that nevertheless retain sufficient information about the particular thematic aspect to be refined. A subgraph will be visualized as a causal flowchart that effectively shows the propagation of causal relationships and supports users who may lack the domain knowledge, intuition, or other helpful information needed to identify promising variables that could make the model more expressive. The project will develop new techniques based on the paradigm of word embeddings to assist users in this discovery process. Word embeddings map words that are mentioned in similar contexts in large text corpora into close neighborhoods in a high-dimensional space.
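
The neighborhood idea behind word embeddings can be illustrated with a minimal sketch. It is not the project's method: the three-dimensional vectors below are invented for illustration (real embeddings are learned from large corpora and have far more dimensions), and the names `EMBEDDINGS`, `cosine_similarity`, and `nearest_words` are hypothetical.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# Real embeddings are learned from text and have hundreds of dimensions.
EMBEDDINGS = {
    "unemployment": [0.9, 0.1, 0.0],
    "poverty": [0.8, 0.2, 0.1],
    "income": [0.7, 0.3, 0.0],
    "rainfall": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to that of `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Candidate variables that sit closest to an existing model variable.
candidates = nearest_words("unemployment", EMBEDDINGS)
print(candidates)
```

In this toy setting, words used in related contexts ("poverty", "income") rank ahead of an unrelated one ("rainfall"), which is the property a candidate-variable suggestion tool would rely on.
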
A 2D map-like visualization will be developed that places words (denoting candidate variables) from the causal subgraph's thematic context near the labels of semantically related variables already in the model. Human model editors can then inspect this visual map of words (candidate variables), hypothesize possible new causal relations involving these new variables, search for associated data on the web or in the evolving causal model, and test and embed the new causal relations into the subgraph using the system's causal inference engine. Behind the scenes, an automated causal network manager will then derive causal edges to other variables and so fully evolve the model. Since automated causal inference from observed data can occasionally generate wrongly directed or undirected edges, the interface will also provide new paradigms that allow human model appraisers to verify the generated edges and suggest changes. A set of carefully designed user experiments will be conducted to verify and optimize all system components. The research is expected to yield new theoretical knowledge and algorithms on human-centered computational causal reasoning and on the utilization of the vast body of data available online. It will also deliver new insights into how humans interact with tools for deriving and exploring causal models. The platform and tools generated in this research will be applicable to multiple fields of knowledge and will enable the construction of causal models capable of explaining how the various aspects and fields relate in a larger context. The developed tools will be made available as part of the online platform.
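
As a rough illustration of two ideas described above, keeping an evolving model valid and cutting out a human-manageable subgraph, the following sketch may help. It rests on assumptions the abstract does not state: that the model is kept acyclic, and that a subgraph is simply the set of variables within a fixed number of edges of a focus variable. The class `CausalGraph`, its methods, and the "mortality" variable are hypothetical.

```python
from collections import deque


class CausalGraph:
    """Toy directed causal model; the design is an illustrative assumption."""

    def __init__(self):
        # Maps each variable to the set of variables it directly causes.
        self.children = {}

    def add_variable(self, name):
        self.children.setdefault(name, set())

    def _reaches(self, start, goal):
        """Depth-first search: is there a directed path from start to goal?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.children[node])
        return False

    def add_edge(self, cause, effect):
        """Add cause -> effect unless it would create a directed cycle."""
        self.add_variable(cause)
        self.add_variable(effect)
        if self._reaches(effect, cause):
            return False  # rejected: the effect already influences the cause
        self.children[cause].add(effect)
        return True

    def neighborhood(self, focus, radius=1):
        """Variables within `radius` edges of `focus`, ignoring direction,
        together with the causal edges among them."""
        linked = {v: set(kids) for v, kids in self.children.items()}
        for v, kids in self.children.items():
            for kid in kids:
                linked[kid].add(v)
        depth = {focus: 0}
        queue = deque([focus])
        while queue:
            node = queue.popleft()
            if depth[node] == radius:
                continue
            for other in linked[node]:
                if other not in depth:
                    depth[other] = depth[node] + 1
                    queue.append(other)
        kept = set(depth)
        edges = {(c, e) for c in kept for e in self.children[c] if e in kept}
        return kept, edges


model = CausalGraph()
first = model.add_edge("smoking", "cancer")
second = model.add_edge("cancer", "mortality")
reverse = model.add_edge("mortality", "smoking")  # would close a cycle
variables, edges = model.neighborhood("smoking", radius=1)
print(first, second, reverse, sorted(variables), sorted(edges))
```

Rejecting an edge outright is only one possible policy; a collaborative system could instead flag the conflict for a human appraiser to resolve, in line with the edge verification step described above.
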

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1941613
Program Officer: Hector Munoz-Avila
Project Start:
Project End:
Budget Start: 2019-09-01
Budget End: 2021-08-31
Support Year:
Fiscal Year: 2019
Total Cost: $198,971
Indirect Cost:
Name: State University New York Stony Brook
Department:
Type:
DUNS #:
City: Stony Brook
State: NY
Country: United States
Zip Code: 11794