Query Log Analysis for Improving User Access to NCBI Web Services

Lu, Zhiyong

Abstract

Over the last decade, online search for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. A better understanding of the growing population of Entrez users, their information needs, and the ways in which they meet those needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the patrons of search engines is the transaction log. Our previous investigation of PubMed query logs led us to develop and deploy several applications that assist users with query formulation and retrieval in PubMed, namely Related Queries, Query Autocomplete and Author Name Disambiguation. Inspired by this success, we have continued using log analysis to identify research problems that are closely related to NCBI operations.

Among all Entrez databases, PubMed is the most used and often serves as an entry point for people to access related data in other Entrez databases. In 2014-2015, we performed a large-scale semantic analysis of six months' worth of PubMed queries: we applied our three state-of-the-art taggers to identify gene, disease and chemical/drug names in the queries. Based on the tagging results, we constructed inverted lists that allow instant retrieval of answers to several types of questions, including: a) what are the most searched genes, diseases and chemicals/drugs in PubMed (e.g. "cancer" is frequently searched); b) what are the most searched relationships between two entities (e.g. "anemia" and "iron" are frequently searched together); c) what are the most searched terms associated with a specific type of entity (e.g. "treatment" is frequently used in searches with diseases); and d) given a specific entity instance, what are the common co-occurring entities (e.g. for "breast cancer", "tamoxifen" is commonly searched).

Based on the results of this semantic analysis, we further developed an unsupervised method for finding closely related semantic patterns in PubMed queries (e.g. "DrugA versus DrugB" and "DrugA DrugB Comparison"), with the goal of better retrieval effectiveness. Specifically, we adopted latent semantic analysis (LSA), a technique for finding the semantic topics of a set of documents by mining the terms they contain. In our study, we treated queries as documents and used the computed entities both for mining search topics (e.g. drug comparison in the previous example) and for identifying similar patterns (e.g. "versus" is similar to "comparison" in that topic). Our method automatically transforms PubMed queries into query patterns in entity space and then transforms entity space into LSA topic space. We pioneered the application of LSA to the semantic analysis of query patterns (specifically PubMed query patterns) and to the identification of synonymous biomedical patterns without manual seeds. Our evaluations showed that the proposed LSA framework significantly outperforms a baseline approach and can effectively find pattern synonyms covering a wide range of bio-entity relations, such as chemical-disease relationships and drug-drug interactions. Our preliminary results showed that the computationally generated synonymous patterns could improve retrieval effectiveness when applied as query expansion in PubMed searches.
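The inverted lists described in the abstract, and the four kinds of questions (a) to (d) they answer, can be illustrated with a minimal sketch. This is not the project's implementation: the sample queries, the entity type labels ("Gene", "Disease", "Chemical", "Term") and all function names below are assumptions made up for illustration, and a real pipeline would take its input from the gene, disease and chemical/drug taggers.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical tagger output: each query is a list of (text, type) items.
TAGGED_QUERIES = [
    [("breast cancer", "Disease"), ("tamoxifen", "Chemical")],
    [("anemia", "Disease"), ("iron", "Chemical")],
    [("breast cancer", "Disease"), ("treatment", "Term")],
    [("iron", "Chemical"), ("anemia", "Disease")],
    [("breast cancer", "Disease"), ("tamoxifen", "Chemical"), ("treatment", "Term")],
    [("BRCA1", "Gene"), ("breast cancer", "Disease")],
]

# Assumed labels for items that count as bio-entities (as opposed to plain terms).
ENTITY_TYPES = {"Gene", "Disease", "Chemical"}


def build_index(queries):
    """Inverted list: map each (text, type) item to the ids of queries containing it."""
    index = defaultdict(set)
    for qid, items in enumerate(queries):
        for item in items:
            index[item].add(qid)
    return index


def most_searched(index, entity_type):
    """(a) Count how many queries mention each entity of the given type."""
    return Counter({text: len(qids)
                    for (text, etype), qids in index.items()
                    if etype == entity_type})


def entity_pairs(queries):
    """(b) Count unordered pairs of entities that appear in the same query."""
    pairs = Counter()
    for items in queries:
        entities = sorted({i for i in items if i[1] in ENTITY_TYPES})
        pairs.update(combinations(entities, 2))
    return pairs


def terms_with_type(queries, entity_type):
    """(c) Count plain terms used in queries that contain an entity of the given type."""
    terms = Counter()
    for items in queries:
        if any(etype == entity_type for _, etype in items):
            terms.update({text for text, etype in items if etype == "Term"})
    return terms


def co_occurring(index, queries, item):
    """(d) Count entities that share a query with the given entity instance."""
    counts = Counter()
    for qid in index.get(item, ()):
        counts.update({i for i in queries[qid]
                       if i != item and i[1] in ENTITY_TYPES})
    return counts


if __name__ == "__main__":
    index = build_index(TAGGED_QUERIES)
    print(most_searched(index, "Disease").most_common(1))
    print(entity_pairs(TAGGED_QUERIES).most_common(2))
    print(terms_with_type(TAGGED_QUERIES, "Disease").most_common(1))
    print(co_occurring(index, TAGGED_QUERIES, ("breast cancer", "Disease")).most_common(1))
```

Because every item maps directly to the set of queries that contain it, each lookup touches only the relevant posting lists rather than rescanning the whole log, which is what makes this kind of retrieval fast at the scale of months of queries.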

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000001-05
Application #: 9160903
Support Year: 5
Fiscal Year: 2015

Institution

Name: National Library of Medicine

Related projects


NIH 2019 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine
NIH 2018 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine
NIH 2017 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine
NIH 2016 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine
NIH 2015 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine
NIH 2014 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine
NIH 2013 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine	$205,463
NIH 2012 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine	$260,305
NIH 2011 ZIA LM	Query Log Analysis for Improving User Access to NCBI Web Services Lu, Zhiyong / National Library of Medicine	$499,678

Publications

Yeganova, Lana; Kim, Won; Comeau, Donald C et al. (2018) A Field Sensor: computing the composition and intent of PubMed queries. Database (Oxford) 2018:

Fiorini, Nicolas; Canese, Kathi; Starchenko, Grisha et al. (2018) Best Match: New relevance search for PubMed. PLoS Biol 16:e2005343

NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46:D8-D13

Kim, Sun; Yeganova, Lana; Comeau, Donald C et al. (2018) PubMed Phrases, an open set of coherent phrases for searching biomedical literature. Sci Data 5:180104

Fiorini, Nicolas; Lipman, David J; Lu, Zhiyong (2017) Towards PubMed 2.0. Elife 6:

Kim, Sun; Fiorini, Nicolas; Wilbur, W John et al. (2017) Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents. J Biomed Inform 75:122-127

NCBI Resource Coordinators (2017) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res 45:D12-D17

NCBI Resource Coordinators (2016) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 44:D7-19

Huang, Chung-Chi; Lu, Zhiyong (2016) Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation. Database (Oxford) 2016:

NCBI Resource Coordinators (2015) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 43:D6-17

Showing the most recent 10 out of 18 publications
