The long-term goal of the research proposed here is to develop, validate, and apply methods for principled causal discovery that scale up to massive datasets such as those found in bioinformatics, electronic patient records, and bibliographic systems. The explosive proliferation and growth (in sample size, number of variables, and quality) of such datasets creates tremendous opportunities for biomedical discoveries; hence, powerful methods for causal discovery have the potential to revolutionize biomedicine. To address this problem of scale, the co-PIs have developed several novel causal discovery algorithms with well-defined properties and guarantees that employ a principled local approach: these algorithms focus only on the local causal neighborhood (e.g., direct causes and effects or, alternatively, the Markov Blanket) of one or several "target" variables, and they are built on a formal framework for representing and learning causality. Extensive preliminary experiments with simulated and real data suggest that the algorithms are sound and highly scalable. Given their assumptions, the local algorithms are expected to apply across a broad range of domains, including bioinformatics, epidemiology, text analysis, and clinical medicine. The proposed research intends to take two focused steps in this broad application space. The local algorithms will be applied to (a) gene expression data from patients with lung cancer and (b) data from a large epidemiologic analysis of factors that influence the development of breast cancer in patients with non-invasive breast disease. It is hypothesized that novel and potentially significant causal relationships will be discovered. This hypothesis bears great biomedical and methodological significance.
The specific aims are to (i) validate the novel causal algorithms; (ii) induce novel hypotheses about the immediate causes and effects of a selected group of genes implicated in lung cancer; (iii) induce novel causal hypotheses about the causes of breast cancer; (iv) compare the performance of the novel local algorithms to state-of-the-art alternatives; and (v) disseminate new and powerful causal discovery tools. The methods to evaluate the novel causal algorithms and the hypotheses generated by them are (a) validation against existing knowledge using structured, evidence-based, blinded literature review by domain experts; (b) selective experimentation in cell lines (lung cancer domain); and (c) statistical performance metrics.
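To make the notion of a "local causal neighborhood" concrete, the following minimal Python sketch reads off the direct causes and effects, and the Markov Blanket, of a target variable from a causal graph that is already known. It illustrates the definitions only and is not one of the proposed algorithms, which must learn these sets from data. The function names, the toy graph, and its variable names (A, B, C, D, T) are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Set


def direct_causes_and_effects(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the parents (direct causes) and children (direct effects) of target.

    `parents` maps every variable in a directed acyclic graph to the set of
    its parents. This representation is an assumption made for the sketch.
    """
    causes = set(parents.get(target, set()))
    effects = {v for v, ps in parents.items() if target in ps}
    return causes | effects


def markov_blanket(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the Markov Blanket of target: parents, children, and the
    other parents of its children (often called spouses)."""
    causes = set(parents.get(target, set()))
    effects = {v for v, ps in parents.items() if target in ps}
    spouses: Set[str] = set()
    for child in effects:
        spouses |= parents[child]
    return (causes | effects | spouses) - {target}


# Hypothetical toy graph: A -> T -> C <- B, and C -> D.
toy_dag = {
    "A": set(),
    "B": set(),
    "T": {"A"},
    "C": {"T", "B"},
    "D": {"C"},
}

print(sorted(direct_causes_and_effects(toy_dag, "T")))
print(sorted(markov_blanket(toy_dag, "T")))
```

In this toy graph, D lies outside both sets for target T, which is the sense in which a local method can ignore most of a very large graph.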