Rare diseases are studied in isolated laboratories, overlooked by mainstream pharmaceutical companies, and regarded almost as academic curiosities. Identifying variables that correlate with or cause rare diseases (a condition is considered rare when it affects fewer than 1 person in 2,000) is difficult: the low number of cases and the sparse nature of the reports make it hard to obtain statistically significant, meaningful results. There are two ways to address this problem. The first is to integrate reported cases and associations from the literature to generate enough statistical power. The second is to use an independent data set large enough to cover rare cases. Each approach has intrinsic problems. A literature search pools different studies, each with its own biases in population, methodology, and objectives. Blind searches for associations in large databases, on the other hand, produce many false positives because of multiple hypothesis testing. These problems could be mitigated by developing innovative methods that integrate information and methodologies from the literature and from longitudinal databases. To achieve this goal, we propose a team that combines expertise in natural language processing systems (Carol Friedman), electronic health records (George Hripcsak), and statistics in combined databases and computational virology (Raul Rabadan). This team will take an interdisciplinary approach to mining and integrating the literature and the data set collected at Columbia/New York Presbyterian hospital. Identifying unusual correlations in rare diseases is the first step toward understanding the origin of these diseases and finding cures for them. We hypothesize that effective methods for improving our understanding of rare diseases can be developed by combining hypothesis testing and hypothesis discovery, and by integrating information from the literature and from the patient record to obtain increased statistical power.
This will involve using natural language processing and statistical methods to mine both the literature and the electronic health record (EHR).
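
The multiple hypothesis testing concern raised above can be illustrated with simple arithmetic. The sketch below is not part of the proposed methodology; it assumes independent tests in which every null hypothesis is true, and the function names, the significance level of 0.05, and the count of 10,000 candidate associations are hypothetical values chosen only for illustration.

```python
# Illustrative arithmetic only: assumes independent tests in which every
# null hypothesis is true. All names and numbers here are hypothetical.


def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of spurious 'significant' results among num_tests."""
    return num_tests * alpha


def family_wise_error_rate(num_tests: int, alpha: float) -> float:
    """Probability of at least one false positive across num_tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    num_tests = 10_000  # hypothetical number of candidate associations screened
    alpha = 0.05        # conventional per-test significance level

    print("Expected false positives:", expected_false_positives(num_tests, alpha))
    print("Family-wise error rate:", family_wise_error_rate(num_tests, alpha))
    print("Bonferroni per-test threshold:", bonferroni_threshold(num_tests, alpha))
```

Under these assumptions, an uncorrected blind screen would be expected to flag hundreds of spurious associations. A Bonferroni-style correction controls this, but at the cost of the statistical power that is already scarce for rare conditions.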