Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. To assess the state of the art in biomedical entity recognition and relation extraction, we organized a competition at BioCreative V, an international challenge event for evaluating advances in text mining research for biology. Specifically, we designed two subtasks within the chemical-disease relation (CDR) task: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. To assist system development and assessment, we created a large annotated text corpus consisting of human annotations of chemicals, diseases and their interactions in 1500 PubMed articles. In total, 34 teams worldwide participated in the CDR task: 16 in DNER and 18 in CID. The best systems achieved an F-score of 86.46% for the DNER task, a result that approaches the human inter-annotator agreement (88.75%), and an F-score of 57.03% for the CID task, the highest results ever reported for such tasks. Given the level of participation and the team results, we consider the task successful in engaging the text-mining research community, producing a large annotated corpus, and improving the results of automatic disease recognition and CDR extraction. In addition to organizing the BioCreative task, we continued our own development of biomedical named entity taggers in 2015-2016.
First and foremost, we created a general toolkit called TaggerOne, the first machine learning model for joint named entity recognition and normalization. TaggerOne is an all-purpose tagger (i.e. not specific to any entity type) that requires only annotated training data and a corresponding lexicon, and it has been optimized for high throughput. We validated TaggerOne on multiple gold-standard corpora containing both mention- and concept-level annotations. Its results compare favorably to the previous state of the art, notwithstanding the greater flexibility of the model. TaggerOne is implemented in Java, and its source code has been made publicly available to the research community. However, large-scale use of open-source tools sometimes requires a significant investment in infrastructure and maintenance time. These investments not only hinder the continued adoption of text mining tools, but also reduce the ability of individual researchers to explore applying text mining to problems in their own research area. In contrast, web services provide on-demand access to software tools over the Internet using straightforward interfaces and data formats. Providing text mining tools as web services thus lowers the barrier to use for biocurators and bioinformatics researchers not working specifically in text mining, allowing free exploration and the ability to focus on results rather than methodology. For this reason, in 2015 we developed the NCBI text-mining web services, an online version of our text mining tool suite for biomedical concept recognition and information extraction. The service incorporates multiple state-of-the-art tools for identifying critical entity types: DNorm (diseases), GNormPlus (genes and proteins), SR4GN (species), tmChem (chemicals and drugs), and tmVar (variants).
Since its inception, our web service has processed over 60 million requests from researchers in 46 countries, supporting research projects in biocuration, crowdsourcing and translational bioinformatics. We anticipate that providing text mining tools as web services will greatly expand their utility to the biomedical research community. Finally, as mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we continued to improve our existing curation-assistance tool, PubTator, and to collaborate with domain experts, in this case human database curators. As a result of these efforts, PubTator is used daily in the production curation pipelines of two external databases:
1. HuGE Navigator, a CDC knowledge base of human genome epidemiology
2. SwissProt, an annotated database of protein sequence and functional information