Over the last decade, online searching for biological information has grown rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the ways in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the users of search engines is their transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several applications that assist users with query formulation in PubMed, namely Related Queries and Query Autocomplete. Inspired by their success, we have continued to use log analysis to identify research problems closely related to NCBI operations. Among all Entrez databases, PubMed is the most heavily used and often serves as an entry point for accessing related data in other Entrez databases. In 2011-2012, we studied the usage of PubMed articles in relation to their citations. The number of citations an article receives has long been an important measure of its quality and impact. Recently, there has been increasing interest in the correlation between citations and the number of downloads, and in whether the latter can serve as a predictive indicator or an alternative means of evaluation. Our experiments, based on PubMed citation and query logs, show a strong correlation between citation counts and the number of full-text accesses for PubMed articles. 
The highest correlation, 0.6, was obtained when 6-month total full-text accesses and 2-year total citations were counted and articles with fewer than two citations were excluded. Because there is generally a lag between when an article is published and when it is cited by another article, we found that the best correlation occurs when citations are computed three months after publication. We also analyzed the public PLoS usage data and found that the correlation between citations (from CrossRef) and total PDF downloads is 0.655, which is very similar to the result on our PubMed dataset. Our other query log study in 2011-2012 was the development of search filters from PubMed click-through data to enable topic-specific literature searches. Search filters have been developed and shown to improve access to the immense and ever-growing body of publications in the biomedical domain. However, to date the number of filters remains quite limited because current filter development methods require significant human involvement. We therefore developed an automated method to build topic-specific filters from users' PubMed search logs. Specifically, for a given topic, we first detect relevant user queries and use their corresponding clicks to construct a topic-relevant article set. Next, we use statistics to identify the terms that best represent this topic-relevant document set. Lastly, the selected representative terms are combined with Boolean operators and evaluated on benchmark datasets to derive the final filter with the best performance. We applied our method to develop filters for four clinical topics: nephrology, diabetes, pregnancy and depression. For the nephrology filter, our method obtained performance comparable to the state of the art (sensitivity of 91.3%, specificity of 98.7%, precision of 94.6%, accuracy of 97.2%). 
Similarly, high-performing results (over 90% in all measures) were obtained for the other three search filters.
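
The filter-building steps described above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the keyword-based query detection, the smoothed log-odds term score, the OR-only combination of terms, and all data and function names (`build_topic_set`, `rank_terms`, `apply_filter`, `evaluate`) are assumptions made for illustration, since the abstract does not specify the statistics or Boolean structure used.

```python
from collections import Counter
from math import log


def build_topic_set(click_log, topic_keywords):
    """Collect IDs of articles clicked after queries mentioning a topic keyword.

    Keyword matching is an assumed stand-in for the query detection step.
    """
    relevant = set()
    for query, clicked_ids in click_log:
        if set(query.lower().split()) & topic_keywords:
            relevant.update(clicked_ids)
    return relevant


def rank_terms(docs, topic_ids, top_k=2):
    """Rank terms by smoothed log-odds of occurring in topic vs. other articles.

    The log-odds score is an assumption; the source only says "statistics".
    """
    in_topic, out_topic = Counter(), Counter()
    for doc_id, text in docs.items():
        target = in_topic if doc_id in topic_ids else out_topic
        target.update(set(text.lower().split()))  # document frequency
    n_in = sum(1 for doc_id in docs if doc_id in topic_ids)
    n_out = len(docs) - n_in

    def score(term):
        p_in = (in_topic[term] + 0.5) / (n_in + 1)
        p_out = (out_topic[term] + 0.5) / (n_out + 1)
        return log(p_in / p_out)

    # Sort by descending score, breaking ties alphabetically.
    return sorted(in_topic, key=lambda t: (-score(t), t))[:top_k]


def apply_filter(docs, terms):
    """Retrieve articles containing any selected term (terms joined with OR)."""
    wanted = set(terms)
    return {d for d, text in docs.items() if set(text.lower().split()) & wanted}


def evaluate(retrieved, relevant, all_ids):
    """Compute the four measures reported in the abstract."""
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    tn = len(all_ids) - tp - fp - fn

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(all_ids)),
    }


# Hypothetical toy data: (query, clicked article IDs) and article texts.
click_log = [
    ("chronic kidney disease treatment", ["a1", "a2"]),
    ("kidney dialysis outcomes", ["a2", "a3"]),
    ("breast cancer screening", ["a4"]),
]
docs = {
    "a1": "renal function in chronic kidney disease",
    "a2": "dialysis and renal replacement therapy",
    "a3": "renal transplant outcomes after dialysis",
    "a4": "mammography screening for breast cancer",
    "a5": "insulin therapy in diabetes",
}
gold = {"a1", "a2", "a3"}  # hypothetical benchmark labels

topic_ids = build_topic_set(click_log, {"kidney", "nephrology"})
terms = rank_terms(docs, topic_ids, top_k=2)
metrics = evaluate(apply_filter(docs, terms), gold, set(docs))
print(terms, metrics)
```

In a fuller version, one would presumably try several Boolean combinations of the ranked terms and keep the one with the best benchmark scores, which is the selection step the abstract describes.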

National Library of Medicine
Yeganova, Lana; Kim, Won; Comeau, Donald C et al. (2018) A Field Sensor: computing the composition and intent of PubMed queries. Database (Oxford) 2018:
Fiorini, Nicolas; Canese, Kathi; Starchenko, Grisha et al. (2018) Best Match: New relevance search for PubMed. PLoS Biol 16:e2005343
NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46:D8-D13
Kim, Sun; Yeganova, Lana; Comeau, Donald C et al. (2018) PubMed Phrases, an open set of coherent phrases for searching biomedical literature. Sci Data 5:180104
Fiorini, Nicolas; Lipman, David J; Lu, Zhiyong (2017) Towards PubMed 2.0. Elife 6:
Kim, Sun; Fiorini, Nicolas; Wilbur, W John et al. (2017) Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents. J Biomed Inform 75:122-127
NCBI Resource Coordinators (2017) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res 45:D12-D17
NCBI Resource Coordinators (2016) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 44:D7-19
Huang, Chung-Chi; Lu, Zhiyong (2016) Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation. Database (Oxford) 2016:
NCBI Resource Coordinators (2015) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 43:D6-17

Showing the most recent 10 out of 18 publications