The long-term goal of the proposed research is to develop, validate, and apply principled causal discovery methods that scale to massive datasets such as those found in bioinformatics, electronic patient records, and bibliographic systems. The explosive proliferation and growth (in sample size, number of variables, and quality) of such datasets creates tremendous opportunities for biomedical discovery; hence, powerful methods for causal discovery have the potential to revolutionize biomedicine. To address this problem of scale, the co-PIs have developed several novel causal discovery algorithms with well-defined properties and guarantees that employ a principled local approach: these algorithms focus only on the local causal neighborhood (e.g., the direct causes and effects or, alternatively, the Markov blanket) of one or several "target" variables, and they are built on a formal framework for representing and learning causality. Numerous preliminary experiments with simulated and real data suggest that the algorithms are sound and highly scalable. Given their assumptions, the local algorithms are expected to apply across a broad range of domains that includes bioinformatics, epidemiology, text analysis, and clinical medicine. The proposed research will take two focused steps in this broad application space. The local algorithms will be applied to (a) gene expression data from patients with lung cancer and (b) data from a large epidemiologic analysis of factors that influence the development of breast cancer in patients with non-invasive breast disease. It is hypothesized that novel and potentially significant causal relationships will be discovered. This hypothesis bears great biomedical and methodological significance.
The specific aims are to (i) validate the novel causal algorithms; (ii) induce novel hypotheses about the immediate causes and effects of a selected group of genes implicated in lung cancer; (iii) induce novel causal hypotheses about the causes of breast cancer; (iv) compare the performance of the novel local algorithms to state-of-the-art alternatives; and (v) disseminate new and powerful causal discovery tools. The novel causal algorithms and the hypotheses they generate will be evaluated by (a) validation against existing knowledge using structured, evidence-based, blinded literature review by domain experts; (b) selective experimentation in cell lines (lung cancer domain); and (c) statistical performance metrics.
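To make the "local causal neighborhood" in the abstract concrete, the following minimal Python sketch reads the direct causes, direct effects, and Markov blanket of a target off a causal graph that is already known. It illustrates the standard definition only and is not the co-PIs' discovery algorithms, which must infer this neighborhood from data. The function name `markov_blanket`, the `parents` dictionary representation, and the toy graph with nodes `A`, `B`, `C`, `D`, and `T` are all hypothetical and chosen for illustration.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a known causal DAG.

    `parents` maps every node to the set of its direct causes.
    The blanket is the union of the target's direct causes, its direct
    effects, and the other direct causes of those effects ("spouses").
    """
    direct_causes = set(parents.get(target, ()))
    direct_effects = {node for node, ps in parents.items() if target in ps}
    spouses = {p for child in direct_effects for p in parents[child]} - {target}
    return direct_causes | direct_effects | spouses


# Hypothetical toy graph: A -> T -> C <- B, and C -> D.
toy_parents = {
    "A": set(),
    "B": set(),
    "T": {"A"},
    "C": {"T", "B"},
    "D": {"C"},
}

blanket = markov_blanket(toy_parents, "T")
print(sorted(blanket))
```

In this toy graph, `D` lies outside the blanket of `T` because it is connected to `T` only through `C`. A local method needs to reason about this small neighborhood rather than the full graph, which is the source of the scalability argument in the abstract.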

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 1R01LM007948-01
Application #: 6670333
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Sim, Hua-Chuan
Project Start: 2003-08-01
Project End: 2006-07-31
Budget Start: 2003-08-01
Budget End: 2004-07-31
Support Year: 1
Fiscal Year: 2003
Total Cost: $199,320
Indirect Cost:
Name: Vanderbilt University Medical Center
Department: Anatomy/Cell Biology
Type: Schools of Medicine
DUNS #: 004413456
City: Nashville
State: TN
Country: United States
Zip Code: 37212
Statnikov, Alexander; Li, Chun; Aliferis, Constantin F (2007) Effects of environment, genetics and data analysis pitfalls in an esophageal cancer genome-wide association study. PLoS One 2:e958
Aphinyanaphongs, Yindalon; Statnikov, Alexander; Aliferis, Constantin F (2006) A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents. J Am Med Inform Assoc 13:446-55
Aphinyanaphongs, Yindalon; Tsamardinos, Ioannis; Statnikov, Alexander et al. (2005) Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc 12:207-16
Statnikov, Alexander; Aliferis, Constantin F; Tsamardinos, Ioannis et al. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21:631-43
Statnikov, Alexander; Tsamardinos, Ioannis; Dosbayev, Yerbolat et al. (2005) GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 74:491-503
Brown, Laura E; Tsamardinos, Ioannis; Aliferis, Constantin F (2004) A novel algorithm for scalable and accurate Bayesian network learning. Medinfo 11:711-5
Statnikov, Alexander; Aliferis, Constantin F; Tsamardinos, Ioannis (2004) Methods for multi-category cancer diagnosis from gene expression data: a comprehensive evaluation to inform decision support system development. Medinfo 11:813-7