Owing to exponential growth in scientific literature, it has become increasingly difficult for researchers to keep up with the latest developments in their fields of study. Hence, computational approaches that automatically mine large amounts of free text to extract essential information have gained popularity. This information is typically represented in the form of binary relations between different biomedical concepts. In this context, automatic extraction of meaningful relations from natural language narratives, a task often termed biomedical relation extraction (BRE), has garnered attention from informaticians. The extracted relations are used in high-level applications including information retrieval (IR), literature-based discovery (LBD), question answering, and text summarization. Most current BRE efforts focus on a specific subdomain in biomedicine. For example, researchers have built models that extract gene-protein or gene-gene interactions; in the clinical domain, recent results focus on drug-drug and drug-disease interactions mentioned in clinical narratives. The only effort that extracts a broad set of relations adhering to a large standardized vocabulary is the rule-based SemRep program being developed by researchers at the National Library of Medicine (NLM). SemRep extracts binary relations, called semantic predications, between biomedical entities from the UMLS Metathesaurus, with predicates coming from an extension of the UMLS Semantic Network. Although SemRep achieves reasonable precision, its recall is very low on a gold standard dataset created for its evaluation. Given that many applications in LBD and IR already use the predication database SemMedDB (obtained by running SemRep on all biomedical citations made available through PubMed), a predication extraction framework with higher recall and an acceptably low loss in precision is desirable, especially if it can complement SemRep's extractions.
We propose to build and evaluate a supervised BRE framework that converts syntactic relations obtained using the paradigm of open information extraction (OIE) to semantic predications by leveraging the existing database of predications in SemMedDB and relations from the UMLS Metathesaurus through distant supervision. We will conduct domain-independent evaluation based on a gold standard dataset built by researchers at the NLM for evaluating SemRep. We will also conduct application-oriented evaluations by simulating predication-graph-based document and passage retrieval using the Text REtrieval Conference (TREC) Genomics and OHSUMED datasets for IR experiments. In addition, we will evaluate the quality of subgraphs resulting from LBD experiments to rediscover nine well-known biomedical discoveries. We hypothesize that the predications extracted through our methods will complement those in SemMedDB and that the combined predication dataset will result in improved overall performance compared with using SemMedDB alone.
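As an illustration of the distant supervision idea, the following minimal sketch labels OIE triples by looking up their entity pairs in a small knowledge base of known predications. It is not the proposed framework itself: the names OIETriple, distant_label, and build_training_set, the placeholder concept identifiers, and the in-memory dictionary standing in for SemMedDB and UMLS Metathesaurus relations are all assumptions made for this example, and the mapping of phrases to concepts is assumed to have been done upstream.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class OIETriple(NamedTuple):
    """A syntactic relation from an OIE system (hypothetical structure)."""
    subject_id: str       # concept ID the subject phrase was mapped to upstream
    relation_phrase: str  # free-text relation phrase from the sentence
    object_id: str        # concept ID the object phrase was mapped to upstream


# Hypothetical stand-in for known predications:
# (subject concept, object concept) -> predicate. IDs are placeholders.
KnowledgeBase = Dict[Tuple[str, str], str]


def distant_label(triple: OIETriple, kb: KnowledgeBase) -> Optional[str]:
    """Return the predicate the knowledge base records for this entity pair, if any."""
    return kb.get((triple.subject_id, triple.object_id))


def build_training_set(
    triples: List[OIETriple], kb: KnowledgeBase
) -> List[Tuple[str, str]]:
    """Pair each relation phrase with its distantly supervised predicate label.

    Triples whose entity pair is absent from the knowledge base get no label
    and are skipped here.
    """
    examples = []
    for triple in triples:
        label = distant_label(triple, kb)
        if label is not None:
            examples.append((triple.relation_phrase, label))
    return examples


if __name__ == "__main__":
    kb = {("ID_DRUG_A", "ID_DISEASE_B"): "TREATS"}
    triples = [
        OIETriple("ID_DRUG_A", "is effective against", "ID_DISEASE_B"),
        OIETriple("ID_GENE_C", "is expressed in", "ID_TISSUE_D"),
    ]
    print(build_training_set(triples, kb))
```

The resulting (relation phrase, predicate) pairs could then serve as noisy training examples for a supervised classifier. Labels obtained this way are only as reliable as the knowledge base, so some filtering of noisy matches would likely be needed in practice.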
Semantic predications are binary relations extracted from biomedical text by the SemRep program; they connect biomedical entities through a fixed set of relation types. Although SemRep extractions have reasonable precision, their recall is very low. We propose to build a supervised predication extraction framework whose results will complement SemRep's extractions, improving performance in both direct gold standard evaluation and application-oriented evaluation in the context of information retrieval and literature-based discovery.
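To make the notion of complementary extractions concrete, the sketch below computes precision and recall for two sets of predications and for their union against a gold standard. All data are toy placeholders invented for the example, not results, and the function name precision_recall is an assumption; predications are modeled simply as (subject, predicate, object) tuples.

```python
from typing import Set, Tuple

Predication = Tuple[str, str, str]  # (subject, predicate, object)


def precision_recall(
    extracted: Set[Predication], gold: Set[Predication]
) -> Tuple[float, float]:
    """Precision and recall of an extracted set against a gold standard set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy placeholder data, for illustration only.
    gold = {("a", "TREATS", "b"), ("c", "CAUSES", "d"),
            ("e", "TREATS", "f"), ("g", "CAUSES", "h")}
    rule_based = {("a", "TREATS", "b"), ("x", "TREATS", "y")}
    supervised = {("c", "CAUSES", "d"), ("e", "TREATS", "f"),
                  ("p", "CAUSES", "q")}

    for name, extracted in [("rule-based", rule_based),
                            ("supervised", supervised),
                            ("combined", rule_based | supervised)]:
        p, r = precision_recall(extracted, gold)
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setting the union recovers more gold predications than either set alone, which is the kind of effect the direct gold standard evaluation is meant to test.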