Over the last decade, the online search for biological information has progressed rapidly and has become an integral part of any scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the way in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the patrons of search engines is transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several useful applications that assist user searches and retrieval, such as query formulation in PubMed, namely Related Queries and Query Autocomplete. Inspired by this success, we have continued using log analysis to identify research problems that are closely related to NCBI operations. Among all Entrez databases, PubMed is the most used and often serves as an entry point for people to access related data in other Entrez databases. In a recent survey, we compared and contrasted PubMed with similar literature search tools developed by other researchers. Based on our investigation, we found areas where PubMed may learn from other tools to improve retrieval and the user search experience. For instance, several tools differ from PubMed in that they allow relevance search, an important feature that can be helpful for some PubMed searches. With respect to user interface, other tools have attempted to visualize search results using novel schemes such as clusters, word clouds or networks. 
Though these methods are not formally validated in large-scale user studies, the concept of better visualization of search results might still be worth considering for improving PubMed's current list-based presentation. In 2011, we also studied query logs beyond PubMed. One specific project involves the analysis of user logs of NCBI's Global Search, where user queries are searched against all Entrez databases and results are presented without indicating the relevance of the different databases to the user queries. Hence our task is to predict which Entrez database(s) are most likely to contain the information relevant to users based on their input queries. In our current approach, we first collect a data corpus from logs in which each data point contains a user query followed by a user click to a specific database. Next, we apply machine-learning algorithms to learn the characteristics of user queries that distinguish users' intentions in seeking different biological data. Based on the learned features, we classify new input queries and direct users to results in the relevant database(s) for their search needs. Another use of query logs lies in our work for PubMed Health, a newly launched NCBI service offering up-to-date information on diseases, conditions, drugs, treatment options, and healthy living for both health consumers and healthcare professionals. Based on our log analysis of actual usage on disease and drug topics in PubMed Health, we discovered that approximately 80% of the usage falls on 20% of the database content. That is, the user access pattern satisfies the Pareto principle (also known as the 80-20 rule), which can have many implications for further improving our Web service. For instance, the principle suggests that we prioritize our resources on the heavily accessed content. In addition, query logs were used when we deployed our research on building links for enriching user access between related drug and disease pages as portlets in PubMed Health. 
Specifically, we developed text-mining methods for automatically identifying drugs and their closely related diseases (e.g. Lipitor and heart disease). In particular, we took advantage of the co-occurrence of drug and disease mentions in user queries to help determine the strength of their relatedness and their popularity among user needs. As a result, we computed several thousand drug-disease pairs that are not only closely related but also frequently requested by our users.
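
The co-occurrence idea described above can be illustrated with a minimal sketch. It is not the project's actual text-mining method: the lexicons, the substring matching, and the sample queries are all hypothetical and serve only to show how counting queries that mention both a drug and a disease yields a simple relatedness and popularity signal.

```python
from collections import Counter
from itertools import product

# Hypothetical toy lexicons; a real system would rely on curated vocabularies.
DRUGS = {"lipitor", "metformin"}
DISEASES = {"heart disease", "diabetes"}


def find_mentions(query, lexicon):
    """Return the lexicon terms that occur as substrings of the lowercased query."""
    text = query.lower()
    return {term for term in lexicon if term in text}


def cooccurrence_counts(queries):
    """Count, for each (drug, disease) pair, the queries that mention both."""
    counts = Counter()
    for query in queries:
        drugs = find_mentions(query, DRUGS)
        diseases = find_mentions(query, DISEASES)
        # Every drug found in the query is paired with every disease found.
        counts.update(product(drugs, diseases))
    return counts


# Hypothetical sample queries standing in for real log entries.
queries = [
    "lipitor heart disease risk",
    "Lipitor and heart disease",
    "metformin diabetes dosage",
    "lipitor side effects",
]

pair_counts = cooccurrence_counts(queries)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Raw counts favor popular topics, so a fuller treatment would probably normalize by how often each drug and disease appears on its own; that step is left out here for brevity.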

Support Year: 1
Fiscal Year: 2011
Total Cost: $499,678
Name: National Library of Medicine
NCBI Resource Coordinators (2017) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res 45:D12-D17
NCBI Resource Coordinators (2016) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 44:D7-19
Huang, Chung-Chi; Lu, Zhiyong (2016) Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation. Database (Oxford) 2016:
NCBI Resource Coordinators (2015) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 43:D6-17
Khare, Ritu; Leaman, Robert; Lu, Zhiyong (2014) Accessing biomedical literature in the current information landscape. Methods Mol Biol 1159:11-31
NCBI Resource Coordinators (2014) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 42:D7-17
NCBI Resource Coordinators (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 41:D8-D20
Li, J; Lu, Z (2013) Developing topic-specific search filters for PubMed with click-through data. Methods Inf Med 52:395-402
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40:D13-25
Névéol, Aurélie; Islamaj Doğan, Rezarta; Lu, Zhiyong (2011) Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction. J Biomed Inform 44:310-8

Showing the most recent 10 out of 12 publications