Over the last decade, online searching for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the ways in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the users of a search engine is its transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several useful applications that assist users with query formulation in PubMed, namely Related Queries and Query Autocomplete. Inspired by this success, we have continued using log analysis to identify research problems that are closely related to NCBI operations. Among all Entrez databases, PubMed is the most heavily used and often serves as an entry point for people to access related data in other Entrez databases. In a recent survey, we compared and contrasted PubMed with similar literature search tools developed by other researchers. Based on our investigation, we found areas where PubMed may learn from other tools to improve retrieval and the user search experience. For instance, several tools differ from PubMed in that they allow relevance search, an important feature that can be helpful for some PubMed searches. With respect to the user interface, other tools have attempted to visualize search results using novel schemes such as clusters, word clouds, or networks.
Although these methods have not been formally validated in large-scale user studies, the idea of better visualizing search results may still be worth considering as a way to improve PubMed's current list-based presentation. In 2011, we also studied query logs beyond PubMed. One specific project involves analyzing the user logs of NCBI's Global Search, where user queries are searched against all Entrez databases and results are presented without indicating how relevant each database is to the query. Hence our task is to predict which Entrez database(s) are most likely to contain the information relevant to users, based on their input queries. In our current approach, we first collect a data corpus from the logs in which each data point consists of a user query followed by a user click on a specific database. Next, we apply machine-learning algorithms to learn the characteristics of user queries that distinguish users' intentions in seeking different kinds of biological data. Based on the learned features, we classify new input queries and direct users to results in the database(s) relevant to their search needs. Another use of query logs lies in our work for PubMed Health, a newly launched NCBI service offering up-to-date information on diseases, conditions, drugs, treatment options, and healthy living for both health consumers and healthcare professionals. Based on our log analysis of actual usage on disease and drug topics in PubMed Health, we discovered that approximately 80% of the usage falls on 20% of the database content. That is, the user access pattern follows the Pareto principle (also known as the 80-20 rule), which has implications for further improving our Web service. For instance, the principle suggests that we prioritize our resources on the most heavily accessed content. In addition, query logs were used when we deployed our research on linking related drug and disease pages, presented as portlets in PubMed Health, to enrich user access.
Specifically, we developed text-mining methods for automatically identifying drugs and their closely related diseases (e.g., Lipitor and heart disease). In particular, we took advantage of the co-occurrence of drug and disease mentions in user queries to help determine how strongly related each pair is and how often users request it. As a result, we computed several thousand drug-disease pairs that are not only closely related but also frequently requested by our users.
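
The co-occurrence idea described above can be illustrated with a minimal sketch. It is not the project's actual text-mining method: the two small lexicons, the naive substring matching, and the toy queries are all assumptions made for illustration, and a real system would need proper entity recognition and a much larger vocabulary.

```python
from collections import Counter
from itertools import product

# Hypothetical lexicons; the vocabularies actually used in the project are not given here.
DRUGS = {"lipitor", "metformin"}
DISEASES = {"heart disease", "diabetes"}


def find_mentions(query, lexicon):
    """Return the lexicon terms found in the lower-cased query (naive substring match)."""
    text = query.lower()
    return {term for term in lexicon if term in text}


def cooccurrence_counts(queries):
    """Count how many queries mention each (drug, disease) pair together."""
    counts = Counter()
    for query in queries:
        drugs = find_mentions(query, DRUGS)
        diseases = find_mentions(query, DISEASES)
        # Every drug mentioned in the query is paired with every disease mentioned in it.
        counts.update(product(drugs, diseases))
    return counts


# Toy queries, invented for illustration only.
queries = [
    "lipitor heart disease",
    "Lipitor side effects",
    "metformin for diabetes",
    "lipitor and heart disease risk",
]

pair_counts = cooccurrence_counts(queries)
for (drug, disease), n in pair_counts.most_common():
    print(f"{drug} / {disease}: {n} queries")
```

In this sketch the raw count serves as a stand-in for both relatedness and popularity. Separating the two would likely require normalizing by how often each drug and each disease appears on its own, which is left out here.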

National Library of Medicine