Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as the relationships among them.
Synonyms pose another challenge for high-quality relevance search. This is a problem for ordinary words, but it is even more difficult for entities that can be named in a number of different ways. LitVar addresses this problem for genetic variants. For example, searching for any one of A146T, c.436G>A, or rs121913527 also finds instances of the other two. The goal is to extend this ability to other entity types.
We participated in the CHEMPROT track at BioCreative VI, which aims to assess the state of the art in automatically extracting chemical-protein relations from running text (PubMed abstracts). We proposed an ensemble of three systems: a support vector machine, a convolutional neural network, and a recurrent neural network. Their outputs are combined using majority voting or stacking to produce the final predictions. Our system obtained a precision of 0.7266 and a recall of 0.5735, for an F-score of 0.6410, the highest performance among all team submissions during the challenge.
In addition to tackling relation extraction with supervised machine-learning methods, we proposed a novel adversarial learning algorithm for unsupervised domain adaptation, where no labeled data are available in the target domain. We show that domain-invariant features can be learned in recent neural networks, such that classifiers trained for one relation type (protein-protein) can be repurposed for others (drug-drug). Compared with prior convolutional and recurrent neural network-based relation classification methods without domain adaptation, we achieve improvements of up to 30% in F1-score.
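
The majority-voting step of the CHEMPROT ensemble, and the relationship between the reported precision, recall, and F-score, can be illustrated with a minimal sketch. This is not the submitted system: the function names, the tie-breaking rule, and the relation labels ("upregulator", "inhibitor", "substrate", "none") are assumptions made only for illustration, and the per-classifier outputs are invented toy data.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties are broken in favor of the classifier listed first
    (an assumed rule; the source does not specify one).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical outputs of the three classifiers on three candidate pairs.
svm = ["upregulator", "inhibitor", "none"]
cnn = ["upregulator", "none", "none"]
rnn = ["inhibitor", "inhibitor", "substrate"]

ensemble = [majority_vote(votes) for votes in zip(svm, cnn, rnn)]
print(ensemble)  # ['upregulator', 'inhibitor', 'none']

# The reported precision and recall reproduce the reported F-score.
print(f"{f_score(0.7266, 0.5735):.4f}")  # 0.6410
```

Stacking, the other combination strategy mentioned above, would instead train a second-level classifier on the three systems' outputs; it is not shown here.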
To further assist NLP tasks that lack pre-existing training data, we developed ezTag, a web-based annotation tool that allows users to annotate text and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central.
Negative and uncertain medical findings are frequent in radiology reports, but discriminating them from positive findings remains challenging for information extraction. Here, we propose a new algorithm, NegBio, to detect negative and uncertain findings in radiology reports. Unlike previous rule-based methods, NegBio uses patterns defined on universal dependencies to identify the scope of triggers that indicate negation or uncertainty. We evaluated NegBio on four datasets: two public benchmark corpora of radiology reports, a new radiology corpus that we annotated for this work, and a public corpus of general clinical texts. Evaluation on these datasets demonstrates that NegBio is highly accurate in detecting negative and uncertain findings and compares favorably with the current state of the art.
One promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we applied deep learning techniques to the literature triage process of UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog for genomic variation, in collaboration with their database curators. Both manual curation teams confirmed that our method achieved higher precision than their previous query-based triage methods without compromising recall. These results show that our method is more efficient and can replace the traditional query-based triage methods of manually curated databases. Our method can give human curators more time to focus on more challenging tasks, such as the curation itself and the discovery of novel papers and experimental techniques to consider for inclusion.
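
The idea of using a dependency structure to bound the scope of a negation trigger, as described for NegBio above, can be sketched as follows. This is a deliberately simplified illustration and not NegBio's actual pattern set or code: the trigger list, the rule "scope = subtree of the trigger's head", the function names, and the hand-written toy parses are all assumptions.

```python
# Hypothetical trigger list; NegBio's real patterns are far richer.
NEGATION_TRIGGERS = {"no", "without"}


def subtree(heads, root):
    """Return the indices of `root` and all of its descendants.

    `heads[i]` is the index of token i's head, or -1 for the sentence root.
    """
    children = {}
    for dep, head in enumerate(heads):
        children.setdefault(head, []).append(dep)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def negated_findings(tokens, heads, findings):
    """Return findings that fall inside the scope of a negation trigger.

    Assumed rule: the scope of a trigger is the subtree of its head.
    """
    negated = set()
    for i, token in enumerate(tokens):
        if token.lower() in NEGATION_TRIGGERS and heads[i] >= 0:
            scope = subtree(heads, heads[i])
            negated.update(
                tokens[j].lower() for j in scope if tokens[j].lower() in findings
            )
    return negated


findings = {"pneumothorax", "effusion"}

# Toy parse of "No evidence of pneumothorax" (heads written by hand).
tokens_a = ["No", "evidence", "of", "pneumothorax"]
heads_a = [1, -1, 3, 1]

# Toy parse of "Effusion is present but no pneumothorax".
tokens_b = ["Effusion", "is", "present", "but", "no", "pneumothorax"]
heads_b = [2, 2, -1, 5, 5, 2]

print(negated_findings(tokens_a, heads_a, findings))  # {'pneumothorax'}
print(negated_findings(tokens_b, heads_b, findings))  # {'pneumothorax'}
```

In the second sentence, the scope of "no" covers only the "pneumothorax" subtree, so "effusion" is left as a positive finding even though it appears in the same sentence. That separation is the kind of distinction a fixed word window around the trigger would tend to miss.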
Deep learning, a class of machine learning algorithms, has shown impressive results in several of our recent studies in FY18, as described above. In addition to its applications in natural language processing, we have also seen its success in medical image analysis, such as processing chest X-ray images and color fundus photographs.

Support Year: 6
Fiscal Year: 2018
Name: National Library of Medicine
Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536
Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196
Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:
Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390
Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:
Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529
Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981
van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568
Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15
Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications