Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as the relationships among them. Building on our previous success with PubTator, which has served annotated PubMed abstracts in response to 300 million requests, we expanded its scope to include automated concept annotations for full-length articles in the newly released PubTator Central (PTC) system. Specifically, PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full-text articles). Annotations are downloadable in multiple formats (XML, JSON, and tab-delimited) via the online interface, a RESTful web service, and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. To facilitate research on pre-trained language representations in the biomedical domain, we introduced the Biomedical Language Understanding Evaluation (BLUE) benchmark. The benchmark consists of five tasks with ten datasets that cover both biomedical and clinical texts and vary in size and difficulty. We also evaluated several baselines based on BERT and ELMo and found that the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results. We have made the datasets, pre-trained models, and code publicly available. In 2019, we developed ML-Net, a novel end-to-end deep learning framework for multi-label classification of biomedical texts, in which each document is associated with one or more labels. Because multi-label classification is an important task with broad applications in biomedicine, a number of different computational methods have been proposed for it.
Many of these methods, however, have only modest accuracy or efficiency and limited success in practical use. Our ML-Net method combines a label prediction network with an automated label count prediction mechanism to provide an optimal set of labels. This is accomplished by leveraging both the predicted confidence score of each label and the deep contextual information (modeled by ELMo) in the target document. Our benchmarking results show that ML-Net compares favorably to state-of-the-art methods in multi-label classification of biomedical text. ML-Net is also shown to be robust when evaluated on different text genres in biomedicine. Supervised machine-learning methods for text mining typically require a large amount of annotated data for training. To reduce this burden of manual annotation, we proposed a novel semi-supervised learning algorithm based on variational autoencoders (VAEs) that makes use of unlabeled data. Our model consists of three parts: a classifier, an encoder, and a decoder. The classifier is implemented using multi-layer convolutional neural networks (CNNs), and the encoder and decoder are implemented using bidirectional long short-term memory networks (Bi-LSTMs) and CNNs, respectively. The semi-supervised mechanism allows our model to learn features from both the labeled and unlabeled data. We evaluated our method on multiple public protein-protein interaction (PPI), drug-drug interaction (DDI), and chemical-protein interaction (CPI) corpora. Experimental results show that our method effectively exploits the unlabeled data to improve performance and reduce the dependence on labeled data. To the best of our knowledge, this is the first semi-supervised VAE-based method for (biomedical) relation extraction. As mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process.
In this regard, we organized a challenge task on text mining for precision medicine through BioCreative VI. The challenge was organized into two subtasks: (i) a document triage subtask, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations, and (ii) a relation extraction subtask, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage subtask, and six teams contributed 14 different text-mining systems for the relation extraction subtask. Separately, we have applied cutting-edge deep learning techniques to the literature triage process for human kinome curation and to the identification, from the biomedical literature, of translational research in genomic medicine beyond bench to bedside. In both applications, the automated results were compared with and validated by the expert curators of the two databases: neXtA5 and the CDC's Public Health Genomics Knowledge Base. In both cases, the results show that our deep-learning-based method is more efficient than the traditional manual triage methods of those databases and can replace them. Our methods can give human curators more time to focus on more challenging tasks, such as the curation itself and the discovery of novel papers and experimental techniques to consider for inclusion. As shown above, deep learning, a class of machine learning algorithms, has shown impressive results in several of our recent studies this year. In addition to applying it to natural language processing, we have also seen its success in our medical image analysis work, such as processing chest X-rays, CT images, and various kinds of retinal images for autonomous disease diagnosis and prognosis.
One such project relates to age-related macular degeneration (AMD), which is the leading cause of blindness in developed countries and, by 2040, will affect approximately 300 million people worldwide. Accurate detection of AMD severity and prediction of progression to the sight-threatening late stage of the disease are thus of significant importance for personalizing monitoring and preventative interventions. As a joint effort between the National Library of Medicine and the National Eye Institute, we developed novel machine-learning approaches to automatically classify AMD severity levels and predict the risk of progression to late AMD based on retinal images taken from two US multi-center longitudinal studies. We demonstrated that the fully automated deep learning model outperformed existing clinical standards and has the potential to improve patient care.
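To make the ML-Net label-selection idea described earlier more concrete, the following is a minimal sketch of how a per-label confidence score and a predicted label count might be combined to pick the final label set. It is not the ML-Net implementation: the function name `select_labels`, the example labels, and the scores are all hypothetical, and both the scores and the predicted count are assumed to be supplied by upstream networks that are not shown here.

```python
from typing import Dict, List


def select_labels(scores: Dict[str, float], predicted_count: int) -> List[str]:
    """Return the `predicted_count` highest-scoring labels.

    `scores` maps each candidate label to a confidence score (assumed to
    come from a label prediction network). `predicted_count` stands in for
    the output of a label count predictor. Both are hypothetical inputs.
    """
    if not scores:
        return []
    # Each document has one or more labels, so clamp the count to [1, len(scores)].
    k = max(1, min(predicted_count, len(scores)))
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:k]]


# Made-up scores for a single document, for illustration only.
example_scores = {"Neoplasms": 0.91, "Humans": 0.85, "Mice": 0.12, "Apoptosis": 0.40}
selected = select_labels(example_scores, predicted_count=2)
print(selected)  # ['Neoplasms', 'Humans']
```

Compared with applying one fixed score threshold to every document, a per-document count lets the number of assigned labels vary with the document, which is the motivation the summary gives for pairing label prediction with label count prediction.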