Unmeasured confounding factors can lead to incorrect statistical and causal inferences when they are correlated with the observed data. This phenomenon has been well documented in at least two important applications. One is identifying genetic variation involved in disease from populations of related individuals. The other is identifying genes active in a disease by comparing diseased and healthy samples. We propose a new approach to correcting for unobserved confounders that takes advantage of insights into how confounders affect high-dimensional data. These insights motivate a formal definition of a specific type of confounder, which we term a 'low-rank confounder.' Formalizing this definition allows us to motivate methods that correct for the effects of these types of confounders even when the confounders are not observed. Our proposal will develop a theory of how confounders affect data and of the conditions under which unobserved confounders can be corrected. The proposed theory is related to recent developments in understanding sparsity, which has been well studied in electrical engineering, computer science, and statistics. The proposed work will lead to improved methods for applications where such confounders are present.
Inferring knowledge from high-dimensional data is a fundamental problem affecting virtually all areas of science, including physics, astronomy, chemistry, computer science, social science, and many areas of biology. Many of these problems are driven by recently available large sources of data and by advances in measurement and data collection technologies. A major challenge is the presence of unknown (and unmeasured) confounding factors. Confounding factors are variables that are often not observed in the data but are correlated with various features of the data. Unfortunately, confounding factors can cause incorrect inferences. This phenomenon has been well documented in at least two important applications: one is identifying genetic variation involved in disease from populations of related individuals, and the other is identifying genes active in a disease by comparing diseased and healthy samples. Traditional approaches exist for performing inference when the confounders are observed in the data. Dealing with unobserved confounders, however, is more difficult. This project will develop and study a new approach to correcting for unobserved confounders, taking advantage of insights into how confounders affect high-dimensional data. The project has broad impact through its utility in a wide range of scientific questions, through the interdisciplinary research opportunities it provides to undergraduate and graduate students, and through the distribution of software and data.