Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as the relationships among them. Synonyms pose another challenge for high-quality relevance search. This is a problem for ordinary words, but it is even more difficult for entities that can be named in a number of different ways. LitVar addresses this problem for genetic variants. For example, searching for any one of A146T, c.436G>A, or rs121913527 also finds instances of the other two. The goal is to extend this ability to other entity types.
We participated in the CHEMPROT track at BioCreative VI, which aims to assess the state of the art in automatically extracting chemical-protein relations from running text (PubMed abstracts). We proposed an ensemble of three systems: a support vector machine, a convolutional neural network, and a recurrent neural network. Their outputs are combined using majority voting or stacking to produce the final predictions. Our system obtained a precision of 0.7266 and a recall of 0.5735, for an F-score of 0.6410, the highest performance among all team submissions during the challenge.
In addition to tackling relation extraction tasks with supervised machine-learning methods, we proposed a novel adversarial learning algorithm for unsupervised domain adaptation tasks, where no labeled data are available in the target domain. We show that domain-invariant features can be learned in the latest neural networks, such that classifiers trained for one relation type (protein-protein) can be repurposed for others (drug-drug). Compared to prior convolutional and recurrent neural network-based relation classification methods without domain adaptation, we achieve improvements as high as 30% in F1-score.
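The majority-voting step of the CHEMPROT ensemble and the relation between the reported precision, recall, and F-score can be illustrated with a minimal Python sketch. It is not the submitted system: the function names, the relation labels, and the per-classifier predictions are all hypothetical, and ties are simply resolved in favor of the first label seen, which is an assumption made here for illustration. Only the precision and recall values come from the text above.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the first label seen (assumption)."""
    return Counter(labels).most_common(1)[0][0]


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions from three classifiers for three candidate
# chemical-protein pairs (the labels are illustrative, not the track's label set).
svm_preds = ["activator", "inhibitor", "none"]
cnn_preds = ["activator", "none", "none"]
rnn_preds = ["inhibitor", "inhibitor", "substrate"]

# Combine the three predictions for each candidate pair by majority vote.
ensemble_preds = [majority_vote(votes) for votes in zip(svm_preds, cnn_preds, rnn_preds)]

# Precision and recall as reported for the challenge submission.
reported_f = f_score(0.7266, 0.5735)

print(ensemble_preds)
print(f"{reported_f:.4f}")
```

The harmonic mean of the reported precision and recall agrees with the reported F-score of 0.6410 to within rounding. Stacking, the alternative combination strategy mentioned above, would replace `majority_vote` with a trained meta-classifier and is not shown here.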
To further assist NLP tasks that lack pre-existing training data, we developed ezTag, a web-based annotation tool that allows users to annotate text and produce training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central.
Negative and uncertain medical findings are frequent in radiology reports, but discriminating them from positive findings remains challenging for information extraction. We propose a new algorithm, NegBio, to detect negative and uncertain findings in radiology reports. Unlike previous rule-based methods, NegBio uses patterns on universal dependencies to identify the scope of triggers that indicate negation or uncertainty. We evaluated NegBio on four datasets: two public benchmarking corpora of radiology reports, a new radiology corpus that we annotated for this work, and a public corpus of general clinical texts. Evaluation on these datasets demonstrates that NegBio is highly accurate in detecting negative and uncertain findings and compares favorably to the current state of the art.
One promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we applied automated deep learning techniques to the literature triage process of UniProtKB/Swiss-Prot and of the NHGRI-EBI GWAS Catalog for genomic variation, in collaboration with their database curators. Both manual curation teams confirmed that our method achieved higher precision than their previous query-based triage methods without compromising recall. These results show that our method is more efficient and can replace the traditional query-based triage methods of manually curated databases. Our method can give human curators more time to focus on more challenging tasks, such as the curation itself and the discovery of novel papers and experimental techniques to consider for inclusion.
Deep learning, a class of machine learning algorithms, has shown impressive results in several of our recent studies, as described above for FY18. In addition to its applications in natural language processing, we have also seen its success in our medical image analysis work, such as the processing of chest X-ray images and color fundus photographs.