Mining useful knowledge from the biomedical literature is beginning to realize its potential for improving literature search, automating biological data curation, and enabling large-scale studies that would not otherwise be possible. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as the relationships among them. In FY17, we continued the development of our own text mining tools for named entity recognition and relation extraction. We first developed tmVar 2.0, an improvement on our previous tool for recognizing genetic variants and normalizing them to unique identifiers (dbSNP RSIDs). We validated our method in a benchmarking evaluation, where it compared favorably to the state of the art. We then applied our approach to the entire PubMed and demonstrated that combining text mining with manual database curation can greatly assist human efforts in prioritizing variants in genomic research. Following the recognition of sequence variants, we created an extraction system for genotype-phenotype relationships. The method extracts disease-gene-variant triplets using a machine learning approach that identifies the genes and protein products associated with each mutation from both the local context (i.e., the same document) and the global context (i.e., all of PubMed). The system was evaluated both by comparison with entries in the human-curated database UniProt and on benchmark datasets, where it demonstrated a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. While the feature-based machine learning described above has been the most commonly used approach to text mining for some time, deep learning methods are rapidly being adopted due to their improved performance and reduced need for feature engineering.
We created a novel deep learning model (McDepCNN) for extracting protein-protein interactions, which uses multiple channels to combine information from each word with information from the head of the corresponding word. The method is thus capable of capturing long-distance features, and was shown to achieve a 24.4% relative improvement in F1-score over existing methods in cross-corpus evaluation and a 12% improvement in F1-score over kernel-based methods on difficult instances. In addition to purely methodological work, we also developed several systems that combine novel methodologies with applications of potential clinical importance. We developed the DIGNiFI (Disease causing GeNe FInder) method, which uses machine learning and features extracted from protein-protein interaction networks to identify disease-causing genes. We evaluated our method using 1184 known disease genes from 128 orphan diseases collected from Orphanet, and showed that it outperforms existing methods for identifying disease-causing genes. We also demonstrated that combining the human-curated protein-protein interaction network with interactions text-mined from the biomedical literature yields an additional performance improvement. Finally, to demonstrate the utility of text mining in creating knowledge bases for large-scale studies, we created a new database containing over 100,000 labeled chest X-rays (ChestX-ray8) by text mining the associated radiological reports for eight commonly occurring thoracic diseases. We further demonstrated that these diseases can be automatically detected and even spatially located via a unified weakly supervised multi-label image classification and disease localization framework.

Support Year: 5
Fiscal Year: 2017
Name: National Library of Medicine
Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536
Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196
Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:
Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390
Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:
Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529
Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981
van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568
Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15
Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications