Mining useful knowledge from the biomedical literature is beginning to realize its potential for improving literature search, automating biological data curation, and enabling large-scale studies that would not otherwise be possible. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as the relationships among them. In FY17, we continued the development of our own text mining tools for named entity recognition and relation extraction. We first developed tmVar 2.0, an improved version of our previous tool for recognizing genetic variants and normalizing them to unique identifiers (dbSNP RSIDs). We validated our method in a benchmarking evaluation, where it compared favorably to the state of the art. We then applied our approach to the entire PubMed and demonstrated that combining text mining with manual database curation can greatly assist human efforts in prioritizing variants in genomic research. Following the recognition of sequence variants, we created an extraction system for genotype-phenotype relationships. The method extracts disease-gene-variant triplets using a machine learning approach that identifies the genes and protein products associated with each mutation from both the local context (the same document) and the global context (i.e., all of PubMed). The system was evaluated both by comparison with entries in the human-curated database UniProt and on benchmark datasets, where it demonstrated a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. While the aforementioned feature-based machine learning has been the most commonly used approach for text mining for some time, deep learning methods are rapidly being adopted due to their improved performance and reduced need for feature engineering.
We created a novel deep learning model (McDepCNN) for extracting protein-protein interactions, which uses multiple channels to combine information from each word with information from the syntactic head of that word. The method is thus capable of capturing long-distance features, and was shown to achieve a 24.4% relative improvement in F1-score over existing methods in cross-corpus evaluation and a 12% improvement in F1-score over kernel-based methods on difficult instances. In addition to purely methodological work, we also developed several systems that combine novel methodologies with applications of potential clinical importance. We developed the DIGNiFi (Disease causing GeNe FInder) method, which uses machine learning and features extracted from protein-protein interaction networks to identify disease-causing genes. We evaluated our method using 1184 known disease genes from 128 orphan diseases collected from Orphanet, demonstrating that it outperforms existing methods for identifying disease-causing genes. We also demonstrated that combining the human-curated protein-protein interaction network with interactions text-mined from the biomedical literature yields an additional performance improvement. Finally, to demonstrate the utility of text mining in creating knowledge bases for large-scale studies, we created a new database containing over 100,000 labeled chest X-rays (ChestX-ray8) by text mining the associated radiological reports for eight commonly occurring thoracic diseases. We further demonstrated that these diseases can be automatically detected and even spatially located via a unified weakly supervised multi-label image classification and disease localization framework.
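Several of the results above are stated as improvements in F1. As a minimal sketch of how such figures are usually derived, the Python below computes F1 as the harmonic mean of precision and recall, and a relative improvement as the gain divided by the baseline. The function names are illustrative assumptions, not part of any of the tools described here. Applied to the rounded F1 values quoted for genotype-phenotype extraction (0.62 and 0.79), the relative gain comes out at about 27%, close to the reported 28%; the small difference may come from the reported figure being based on unrounded scores, which is an assumption on our part.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, new: float) -> float:
    """Fractional gain of `new` over `baseline` (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Rounded F1 values quoted in the text for genotype-phenotype extraction.
gain = relative_improvement(0.62, 0.79)
print(f"relative F1 gain: {gain:.1%}")
```

The same helper distinguishes a relative gain from an absolute one: moving from 0.62 to 0.79 is a gain of 0.17 F1 points in absolute terms, but roughly 27% relative to the baseline.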