Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. To assess the state of the art in biomedical entity recognition and relation extraction, we organized a scientific competition at BioCreative V, an international challenge event for evaluating advances in text mining research for biology. Specifically, we designed the chemical-disease relation (CDR) task with two challenge subtasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. To assist system development and assessment, we created a large annotated text corpus consisting of human annotations of chemicals, diseases, and their interactions from 1500 PubMed articles. In total, 34 teams worldwide participated in the CDR task: 16 in DNER and 18 in CID. The best systems achieved an F-score of 86.46% for the DNER task, a result that approaches the human inter-annotator agreement (88.75%), and an F-score of 57.03% for the CID task, the highest results ever reported for such tasks. Given the level of participation and the team results, we consider the task successful in engaging the text-mining research community, producing a large annotated corpus, and improving the results of automatic disease recognition and CDR extraction. In addition to organizing the BioCreative task, we continued our own development of biomedical named entity taggers in 2015-2016. 
First and foremost, we created a general toolkit called TaggerOne, the first machine learning model for joint named entity recognition and normalization. TaggerOne is an all-purpose tagger (i.e., not specific to any entity type) that requires only annotated training data and a corresponding lexicon, and it has been optimized for high throughput. We validated TaggerOne on multiple gold-standard corpora containing both mention- and concept-level annotations. Its results compare favorably to the previous state of the art, notwithstanding the greater flexibility of the model. TaggerOne is implemented in Java, and its source code has been made publicly available to the research community. However, large-scale use of open-source tools sometimes requires a significant investment in infrastructure and maintenance time. These investments not only impede the continued adoption of text mining tools but also reduce the ability of individual researchers to explore applying text mining to problems in their own research areas. In contrast, web services provide on-demand access to software tools over the Internet using straightforward interfaces and data formats. Providing text mining tools as web services thus lowers the barrier to use for biocurators and bioinformatics researchers who do not work specifically in text mining, allowing free exploration and a focus on results rather than methodology. Accordingly, in 2015 we developed the NCBI text-mining web services, an online version of our text mining tool suite for biomedical concept recognition and information extraction. Our service incorporates multiple state-of-the-art tools for identifying critical entity types: DNorm (diseases), GNormPlus (genes and proteins), SR4GN (species), tmChem (chemicals and drugs), and tmVar (variants). 
Since its inception, our web service has processed over 60 million requests from researchers in 46 countries, supporting research projects in biocuration, crowdsourcing, and translational bioinformatics. We anticipate that providing text mining tools as web services will greatly expand their utility to the biomedical research community. Finally, as mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we continued to improve PubTator, our existing curation-assistance tool, and to collaborate with domain experts (in this case, human database curators). As a result of these efforts, PubTator is used daily in the production curation pipelines of two external databases:
1. HuGE Navigator, a CDC knowledgebase of human genome epidemiology
2. SwissProt, an annotated database of protein sequence and functional information

Support Year: 4
Fiscal Year: 2016
Name: National Library of Medicine
Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536
Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196
Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:
Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390
Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:
Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529
Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981
van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568
Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15
Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications