Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search certain biomedical concepts more often than others and that there exist strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names.

Our recent research introduced a state-of-the-art system called DNorm for disease normalization based on pairwise learning to rank. In 2013-2014, we investigated the difference in DNorm's performance when applied to clinical narratives versus biomedical publications. We used closure properties to compare the richness of the vocabulary in clinical narrative text to that of biomedical publications. We found that while the size of the overall vocabulary is similar between clinical narrative and biomedical publications, clinical narrative uses a richer terminology to describe disorders, which we believe to be one of the primary causes of the reduced performance on clinical narrative. Accordingly, we introduced several lexical enhancements, generalizable to other clinical NLP tasks, that improved the ability of DNorm to handle this variation. The clinical version of DNorm (DNorm-C) is now openly available to the research community, along with our other open source tools.

One common challenge in biomedical named entity recognition (NER) and normalization is the identification and resolution of composite named entities, where a single span refers to more than one concept (e.g., BRCA1/2). Previous NER and normalization studies have either ignored composite mentions, used simple ad hoc rules, or handled only coordination ellipsis, so a robust approach for handling multi-type composite mentions is greatly needed. In 2014-2015, we proposed a hybrid method that integrates a machine-learning model with a pattern identification strategy to identify the individual components of each composite mention. Our method, which we have named SimConcept, is the first to systematically handle many types of composite mentions. The technique achieves high performance in identifying and resolving composite mentions for three key biological entities: genes (F-measure 90.42%), diseases (F-measure 86.47%), and chemicals (F-measure 86.05%). Furthermore, our results show that using SimConcept subsequently improves the performance of gene and disease concept recognition and normalization.

As mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we continued to improve our previously developed curation-assisting tool PubTator and to collaborate with domain experts, in this case human database curators. With these efforts, our PubTator system is now being used daily in the production curation pipeline of two external databases:
1. HuGE Navigator: a CDC knowledgebase of human genome epidemiology
2. SwissProt: an annotated database of protein sequence and functional information.
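To make the composite-mention problem described above concrete, the following is a minimal, pattern-only sketch in Python of how a mention such as BRCA1/2 or "breast and ovarian cancer" can be expanded into its individual concepts. It is an illustration only, not the SimConcept implementation, which combines a machine-learning model with a much broader pattern inventory; the function name and the two regular expressions are assumptions made for this example.

```python
import re

# Hypothetical, pattern-only sketch of composite-mention splitting. The actual
# SimConcept system combines a machine-learning model with a larger pattern
# inventory; the patterns below cover only two common cases.

def expand_composite_mention(mention):
    # Pattern 1: shared alphabetic prefix with slash-separated numeric suffixes,
    # e.g., "BRCA1/2" -> ["BRCA1", "BRCA2"].
    m = re.fullmatch(r"([A-Za-z]+)(\d+)((?:/\d+)+)", mention)
    if m:
        prefix, first, rest = m.groups()
        suffixes = [first] + rest.lstrip("/").split("/")
        return [prefix + s for s in suffixes]

    # Pattern 2: coordination ellipsis with a shared head word,
    # e.g., "breast and ovarian cancer" -> ["breast cancer", "ovarian cancer"].
    m = re.fullmatch(r"(\w+) and (\w+) (\w+)", mention)
    if m:
        left, right, head = m.groups()
        return [f"{left} {head}", f"{right} {head}"]

    # Not recognized as a composite mention: return it unchanged.
    return [mention]

print(expand_composite_mention("BRCA1/2"))                    # ['BRCA1', 'BRCA2']
print(expand_composite_mention("breast and ovarian cancer"))  # ['breast cancer', 'ovarian cancer']
```

In practice a purely rule-based splitter of this kind misses many mention forms, which is why SimConcept pairs pattern identification with a machine-learning model.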
In 2014-2015, we also investigated the feasibility of using crowdsourcing to assist gene-mutation curation and drug-indication cataloging, given the high cost of expert annotation. In both studies, we first translated the complex expert annotation task into human intelligence tasks (HITs) suitable for non-expert workers. For instance, instead of asking people to find drug indications in free text (e.g., lengthy paragraphs), we simplified the task so that each HIT only required a worker to make a binary judgment of whether a highlighted disease, in the context of a given drug label sentence, is an indication. We then recruited annotators from a large, anonymous pool of workers through Amazon Mechanical Turk (MTurk), and judgments from the crowd were aggregated to produce the final answer. For evaluation, we assessed the ability of our proposed method to achieve high-quality annotations in a time-efficient and cost-effective manner. In comparison with the expert annotations, we find that our crowdsourcing approach not only results in significant cost and time savings, but also achieves accuracy comparable to that of domain experts. We therefore conclude that our crowdsourcing-based approach provides a readily scalable and cost-effective approach to assisting manual curation.
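The aggregation step can be illustrated with a minimal sketch, assuming each HIT receives binary judgments from several MTurk workers and the final label is taken by majority vote. The record format, label strings, and tie-breaking rule below are assumptions for illustration; the actual studies may have used a different aggregation scheme.

```python
from collections import Counter, defaultdict

# Minimal sketch of aggregating crowd judgments into one final label per HIT.
# Each record is assumed to be (hit_id, worker_id, judgment) with a binary
# judgment of "indication" or "not_indication".

def aggregate_by_majority(judgments):
    votes = defaultdict(Counter)
    for hit_id, worker_id, label in judgments:
        votes[hit_id][label] += 1

    final = {}
    for hit_id, counts in votes.items():
        top = counts.most_common(2)
        # Break ties conservatively by defaulting to "not_indication".
        if len(top) > 1 and top[0][1] == top[1][1]:
            final[hit_id] = "not_indication"
        else:
            final[hit_id] = top[0][0]
    return final

example = [
    ("hit-1", "w1", "indication"),
    ("hit-1", "w2", "indication"),
    ("hit-1", "w3", "not_indication"),
    ("hit-2", "w1", "not_indication"),
    ("hit-2", "w4", "not_indication"),
]
print(aggregate_by_majority(example))
# {'hit-1': 'indication', 'hit-2': 'not_indication'}
```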