Over the last decade, online search for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs and the way in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI.

One resource for understanding and characterizing the users of search engines is transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several applications that assist users with query formulation and retrieval in PubMed, namely Related Queries, Query Autocomplete and Author Name Disambiguation. Inspired by this success, we have continued using log analysis to identify research problems that are closely related to NCBI operations.

Among all Entrez databases, PubMed is the most used and often serves as an entry point for people to access related data in other Entrez databases. Previously, we developed an unsupervised method for finding closely related semantic patterns in PubMed queries (e.g. DrugA versus DrugB and DrugA DrugB Comparison) using latent semantic analysis (LSA) techniques. In 2015-2016, we investigated how such pattern pairs can be used in PubMed search, specifically for query expansion and specification. We first focused on understanding PubMed users' information needs, specifically the search semantics of entity searches. We then studied automatic query expansion for two real-world scenarios: an entity pair search with an explicit relation mention (e.g. comparison between albuterol and levalbuterol) or without one (e.g. albuterol levalbuterol). Our results show that in these cases better PubMed retrieval effectiveness, in terms of recall and precision, can be achieved, demonstrating the practical utility of our proposed framework.

Through query log analysis, we also observed that the link between a query and a document is frequently not established because they use different forms of a term. These differences may be morphological variations of a word, related by derivation or inflection (e.g. autoimmune, autoimmunities, autoimmunity), synonyms (e.g. kidney disease and renal disease), abbreviations, etc. To find pairs of string variants that have the same meaning, we created PubTermVariants, a high-quality, data-driven resource of term variant pairs that can improve search results in PubMed. We consider two terms to be variants if they stem to the same form, pass the hypergeometric test, and pass the morpho-semantic test. A manual evaluation of a subset of PubTermVariants confirmed the high quality of the candidate pairs. We further conducted experiments that demonstrated their usefulness for improving PubMed search.

To satisfy the ultimate information needs of our PubMed users (learned through its search logs), we launched a new project in late 2015 that aims to develop a next-generation relevance-ranking system for PubMed search with improved user experience. Our goal is to deliver the most relevant results (from more than 26 million articles) within a fraction of a second to drive accelerated discovery and better health.
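
To make the term-variant criterion described above more concrete, the following is a minimal Python sketch of its first two conditions: stemming to the same form and passing a hypergeometric test. It is an illustration under stated assumptions, not the PubTermVariants implementation. The suffix stripper `toy_stem`, the function `is_candidate_variant`, the threshold `alpha`, and the choice to apply the hypergeometric test to the overlap between the sets of documents containing each term are all hypothetical. The morpho-semantic test is omitted because the text does not describe it.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: population size, K: number of "success" items, n: number of draws.
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def toy_stem(term):
    """Crude suffix stripper for illustration only (not a real stemmer)."""
    for suffix in ("ities", "ity", "es", "e", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 3:
            return term[: -len(suffix)]
    return term


def is_candidate_variant(a, b, docs_a, docs_b, n_docs, alpha=0.01):
    """Check two of the three conditions for a term-variant pair.

    docs_a, docs_b: sets of document ids containing each term (assumed input).
    n_docs: total number of documents in the collection.
    alpha: hypothetical significance threshold.
    """
    # Condition 1: both terms reduce to the same stem.
    if toy_stem(a) != toy_stem(b):
        return False
    # Condition 2 (assumed form): the two terms share documents far more
    # often than chance would predict under a hypergeometric model.
    overlap = len(docs_a & docs_b)
    p_value = hypergeom_sf(overlap, n_docs, len(docs_a), len(docs_b))
    return p_value < alpha
```

With this sketch, autoimmune, autoimmunities and autoimmunity all reduce to the same toy stem, so a pair of them is accepted only if their document sets overlap significantly. Two terms with disjoint document sets get a p-value of 1 and are rejected. Synonyms such as kidney and renal fail the stem condition and would need a different mechanism.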