As a document retrieval system, PubMed aims to provide efficient access to millions of scientific documents. For this purpose, it relies on matching keywords and semantic representations of PubMed documents to user queries. One type of semantic representation used in MEDLINE citations is Medical Subject Headings (MeSH) indexing terms, which are assigned by professional human indexers at the National Library of Medicine. Alternatively, author keywords, provided by authors when submitting an article, capture the essence of a document's topic from the authors' perspective. Last but not least, readers have their own opinions about which words are important to an article, and these may or may not agree with either the MeSH terms or the author keywords of the same article.

PubMed relies on human indexers to assign the appropriate MeSH indexing terms to PubMed articles, a very time- and labor-intensive process. As a result, these terms are not immediately available for new articles. In fact, our analysis shows that, on average, it takes over 90 days for a PubMed citation to be manually annotated with MeSH terms. In response, we have developed a machine learning algorithm that automatically predicts MeSH terms using a set of novel features. When compared to other state-of-the-art methods, our approach achieved significantly better performance. We are currently exploring its potential for assisting the manual MeSH curation process in practice.

Whereas MeSH terms require human curation, author keywords can be obtained freely from journal articles when they are available. We conducted a first study of author keywords in biomedical articles, in which we described the growth of author keywords in biomedical journal articles and presented a comparative study of author keywords and MeSH indexing terms. A similarity metric from our past study was used to automatically assess the relatedness between pairs of author keywords and MeSH indexing terms.
Furthermore, a set of 300 pairs was manually reviewed to evaluate the metric and characterize the relationships between the term types. Results show that author keywords are increasingly available in biomedical articles and that over 60% of author keywords can be linked to a closely related indexing term. The results of this work have implications for both MEDLINE document indexing and MeSH terminology development.

Finally, by comparison, we found that neither MeSH terms nor author keywords overlap significantly with the words that are important from the users' point of view. This motivated us to learn what characteristics make document words important from a collective user perspective. Specifically, we applied machine learning to identify document keywords that are likely to be used frequently in user queries. Each word was represented by a set of features covering different types of information, such as semantic type, part-of-speech tag, TF-IDF weight, and location in the abstract. We examined both traditional features such as TF-IDF and novel ones such as named entities, which had not been explored before in this context. We identified the most important features and evaluated our model using months of real-world PubMed log data. Our results suggest that, in addition to carrying high TF-IDF weight, important words from the users' perspective tend to be biomedical entities, to appear in article titles, and to occur repeatedly in article abstracts. This study enabled us to automatically predict words likely to appear in user queries that lead to document clicks. The relative importance of predicted words can also play a role in ranking documents by relevance.
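To make the per-word feature representation described above more concrete, the following is a minimal Python sketch, not the method used in the study. It computes three of the signals mentioned in the text for each word of a document: a TF-IDF weight, whether the word appears in the title, and how often it occurs in the abstract. The function names (tokenize, word_features), the smoothed IDF formula, the simple tokenizer, and the toy corpus are all assumptions made for illustration; semantic type, part-of-speech tag, and named-entity features are omitted because they would require external resources.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_features(title, abstract, corpus):
    """Return {word: feature dict} for one document (hypothetical helper).

    corpus is a list of token sets, used only to estimate document frequency.
    """
    title_tokens = set(tokenize(title))
    abstract_counts = Counter(tokenize(abstract))
    doc_counts = Counter(tokenize(title + " " + abstract))
    n_docs = len(corpus)

    features = {}
    for word, tf in doc_counts.items():
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumed variant; the study's weighting is not specified).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        features[word] = {
            "tfidf": tf * idf,
            "in_title": word in title_tokens,
            "abstract_count": abstract_counts[word],
        }
    return features


# Toy corpus, invented purely for illustration.
corpus = [
    set(tokenize("BRCA1 mutations in breast cancer")),
    set(tokenize("Protein folding in yeast")),
    set(tokenize("Cancer risk factors")),
]

feats = word_features(
    "BRCA1 mutations in breast cancer",
    "We study BRCA1. BRCA1 mutations raise cancer risk.",
    corpus,
)

# Rank words by TF-IDF weight and show the top three with their features.
for word, f in sorted(feats.items(), key=lambda kv: -kv[1]["tfidf"])[:3]:
    print(word, f)
```

In a setup like this, each feature dictionary would be paired with a label derived from query logs (for example, whether the word appeared in queries that led to a click on the document) and passed to a classifier. That labeling step is not shown here because the text does not describe its details.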
Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8
Wilbur, W John; Kim, Won (2014) Stochastic Gradient Descent and the Prediction of MeSH for PubMed Records. AMIA Annu Symp Proc 2014:1198-207
Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database (Oxford) 2012:bas026
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40:D13-25
Yeganova, Lana; Kim, Won; Comeau, Donald C et al. (2012) Finding biomedical categories in Medline®. J Biomed Semantics 3 Suppl 3:S3
Islamaj Dogan, Rezarta; Lu, Zhiyong (2010) Click-words: Learning to Predict Document Keywords from a User Perspective. Bioinformatics :