Causality is central to scientific inquiry across the sciences. Without causal information, researchers cannot predict the effects of new interventions, estimate retrospective counterfactuals, or, perhaps more importantly, construct meaningful, in-depth explanations of the phenomenon under investigation. With the unprecedented accumulation of data, the challenge of finding meaningful explanations can be summarized under the rubric of "data-fusion" -- namely, deriving a causal interpretation from a combination of experimental and observational studies collected under disparate, non-exchangeable conditions (Bareinboim and Pearl, Proc. Natl. Acad. Sci. U.S.A., 2016). Despite recent progress, it is still non-trivial to apply state-of-the-art causal inference methods in many large-scale settings. In particular, the scientist's available knowledge does not always match what the theory expects, and the theory neither accepts as input nor generates as output more relaxed causal specifications. Given the completeness of the theory, these requirements cannot be strictly waived. In practice, however, some researchers continue to make their claims even when the required conditions are not met. There is increasing recognition throughout the empirical disciplines that many of the scientific findings reported today are too fragile: they cannot withstand more rigorous scrutiny, and some cannot even be reproduced. The goal of this project is to bridge the gap between the conditions entailed by the theory (which, if followed, would generate robust and scientifically grounded claims) and the knowledge actually available to the scientist.
Specifically, the project seeks (1) to characterize the trade-off between the available combination of data and background knowledge (scientific theories) and the strength of newly hypothesized causal explanations, and (2) to construct approximation schemes that allow inputs that are coarse and imprecise while generating outputs that are still causally meaningful. The proposed research is expected to offer foundational grounding for most of the data science inferences made today, which will impact the practice of several data-intensive fields that are built on cause-and-effect relationships, including econometrics, education, bioinformatics, and medicine. The project also contains a significant educational component. Just as physics and calculus were central to basic science education in the 20th century, causal inference will be a vital component of the undergraduate curriculum in a modern, data-rich society. The project develops a new educational platform tailored to teaching causal inference concepts, principles, and tools to STEM students. The primary goal of this new platform is to move away from acausal claims obtained from pervasive regression-based techniques, as well as from vague and self-evident statements such as "association does not imply causation", and toward a more fundamental understanding of the conditions necessary to support causal statements.

The goal of this proposal is to develop a principled framework for approximations in causal inference. There are two possible approximation dimensions, one regarding the input and the other the output of a given problem instance. First, we will develop necessary and sufficient identification conditions for settings in which a fully specified model (e.g., a causal DAG) is not available as input and only a coarser description of the phenomenon is given. We will further develop effective procedures for determining whether a causal quantity can be approximated from a combination of observational and experimental datasets, given structural knowledge about the underlying data-generating process. The project will then leverage both results to design efficient learning algorithms under the relaxed assumptions that the input is only partially specified and that the output can be an approximation of the target causal distribution. Finally, we will consider the problem of learning causal explanations when multiple biased datasets are available, including datasets plagued by selection bias, confounding bias, and structural heterogeneity. The goal is to develop a general algorithmic theory of approximate causal inference that is capable of producing more robust, reproducible, and generalizable causal explanations.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer: Rebecca Hwa
Purdue University
West Lafayette
United States