Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. Our recent research introduced DNorm, a machine learning method for disease name normalization based on pairwise learning to rank. In 2013-2014, we continued to improve DNorm, increasing its scalability through a dimension reduction technique based on low-rank matrix approximation. When assessed on our recently developed NCBI disease corpus, the new algorithm achieves a significant reduction in the number of parameters to be learned while maintaining high accuracy. Besides disease named entity recognition (NER), we also improved the state of the art in chemical NER. Through participation in the BioCreative IV CHEMDNER task, we introduced tmChem, a chemical named entity recognizer that combines two independent machine-learning models in an ensemble. We used the challenge task corpus to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the mention-level evaluation, the highest performance among all participating teams. To improve interoperability among the biomedical text-mining tools our group has created over the years (e.g. DNorm, tmChem, tmVar), we recently adopted BioC, a newly proposed community-wide scheme for handling heterogeneity and variety in data formats. 
Specifically, we modified our tools to read and write data in the BioC format. The resulting BioC-wrapped toolkit is named tmBioC. Through empirical studies, we demonstrated that the tools in tmBioC can be integrated more efficiently with each other as well as with external tools: our experimental results show that using BioC reduces the lines of code needed for text-mining tool integration by more than 60%. As mentioned earlier, one promising application area for text-mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we focused on Gene Ontology (GO) annotation, a common task among model organism database (MOD) groups and one often considered a bottleneck in literature curation. There is a growing need for semi- or fully automated GO curation techniques that help database curators rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven useful in assisting real-world GO curation. The lack of sentence-level training data, and of opportunities for interaction between text-mining developers and GO curators, has limited advances in algorithm development and their use in practice. To address this, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two sub-tasks: a) automatically locating text passages that contain GO-relevant information (a text retrieval task) and b) automatically identifying relevant GO terms for the genes in a given article (a concept recognition task). With support from five MODs, we provided teams with nearly 4,000 unique text passages that served as the basis for each GO annotation in our task data. 
Such evidence text has long been recognized as critical for text-mining algorithm development but had never been made available because of the high cost of curation. In total, seven teams participated in the challenge task. The team results show an overall improvement in GO term recognition performance compared with similar tasks in the past. Future work should focus on further improving GO concept recognition and on bringing the practical benefits of text-mining tools into real-world GO annotation.
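
For readers unfamiliar with the metric, the micro-averaged f-measure quoted above for tmChem pools true positives, false positives, and false negatives over all mentions in the corpus before computing precision and recall, rather than averaging per-document scores. The following is a minimal sketch of that calculation, not the official CHEMDNER evaluation script: the function name `micro_prf`, the exact-span matching rule, and the toy mentions are all assumptions made purely for illustration.

```python
from typing import Set, Tuple

# A mention is identified by (document id, start offset, end offset).
# Exact span matching is an assumption made for this sketch.
Mention = Tuple[str, int, int]


def micro_prf(gold: Set[Mention], predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, f-measure) over all mentions."""
    true_pos = len(gold & predicted)    # mentions found with the exact gold span
    false_pos = len(predicted - gold)   # predicted mentions absent from the gold set
    false_neg = len(gold - predicted)   # gold mentions the system missed

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: five gold mentions across three documents,
# four predictions of which three match a gold span exactly.
GOLD = {("doc1", 0, 7), ("doc1", 20, 29), ("doc2", 5, 12), ("doc2", 40, 48), ("doc3", 3, 10)}
PREDICTED = {("doc1", 0, 7), ("doc1", 20, 29), ("doc2", 5, 12), ("doc3", 50, 58)}

p, r, f = micro_prf(GOLD, PREDICTED)
print(f"precision={p:.4f} recall={r:.4f} f-measure={f:.4f}")
```

Because the counts are pooled before the ratios are taken, documents with many mentions weigh more heavily in the final score than documents with few.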