Unmeasured confounding factors can lead to incorrect statistical and causal inferences when they are correlated with the observed data. This phenomenon has been well documented in at least two important applications. One is identifying genetic variation involved in disease from populations of related individuals. The second is identifying genes active in a disease by comparing disease and healthy samples. We propose a new approach to correcting for unobserved confounders that takes advantage of insights into how confounders affect high dimensional data. These insights motivate a formal definition of a specific type of confounder, which we term a 'low-rank confounder.' Formalizing this definition allows us to motivate methods for correcting for the effects of these types of confounders even when the confounders are not observed. Our proposal will develop a theory of how confounders affect data and under what conditions unobserved confounders can be corrected for. The proposed theory is related to recent developments in the understanding of sparsity, which has been well studied in electrical engineering, computer science and statistics. The proposed work will lead to improved methods for applications where such confounders are present.

Nontechnical Abstract

Inference of knowledge from high dimensional data is a fundamental problem affecting virtually all areas of science, including physics, astronomy, chemistry, computer science, social science and many areas of biology. Many of these problems are driven by recently available large sources of data and by advances in measurement or data collection technologies. A major challenge is the presence of unknown (and unmeasured) confounding factors. Confounding factors are variables that are often not observed in the data but are correlated with various features of the data. Unfortunately, confounding factors can cause incorrect inferences. This phenomenon has been well documented in at least two important applications: one is identifying genetic variation involved in disease from populations of related individuals, and the second is identifying genes active in a disease by comparing disease and healthy samples. Traditional approaches exist for performing inference when the confounders are observed in the data. However, dealing with unobserved confounders is more difficult. This project will develop and study a new approach to correcting for unobserved confounders, taking advantage of insights into how confounders affect high dimensional data. The project has broad impact through its utility in a wide range of scientific questions, through the interdisciplinary research opportunities provided to undergraduate and graduate students, and through the distribution of software and data.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1320589
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2013-06-01
Budget End: 2017-05-31
Support Year:
Fiscal Year: 2013
Total Cost: $499,919
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095