Over the last decade, online search for biological information has advanced rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. A better understanding of the growing population of Entrez users, their information needs and the ways in which they meet those needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the users of a search engine is its transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several applications that assist users with query formulation and retrieval in PubMed, namely Related Queries, Query Autocomplete and Author Name Disambiguation. Inspired by this success, we have continued to use log analysis to identify research problems that are closely related to NCBI operations.

Among all Entrez databases, PubMed is the most heavily used and often serves as an entry point for people to access related data in other Entrez databases. In 2014-2015, we performed a large-scale semantic analysis of six months' worth of PubMed queries: we applied our three state-of-the-art taggers to identify gene, disease and chemical/drug names in the queries. Based on the tagging results, we constructed inverted lists that allow instant retrieval for multiple types of questions, including: a) what are the most searched genes, diseases and chemicals/drugs in PubMed (e.g. cancer is frequently searched); b) what are the most searched relationships between two entities (e.g. anemia and iron are frequently searched together); c) what are the most searched terms associated with a specific type of entity (e.g. "treatment" is frequently used in searches with diseases); and d) given a specific entity instance, what are the common co-occurring entities (e.g. for breast cancer, tamoxifen is commonly searched).

Building on the results of this semantic analysis, we further developed an unsupervised method for finding closely related semantic patterns in PubMed queries (e.g. "DrugA versus DrugB" and "DrugA DrugB Comparison"), with the goal of better retrieval effectiveness. Specifically, we adopted latent semantic analysis (LSA), a technique for finding the semantic topics of a set of documents by mining the terms they contain. In our study, we treated queries as documents and used the computed entities both to mine search topics (e.g. drug comparison in the previous example) and to identify similar patterns (e.g. "versus" is similar to "comparison" in that topic). Our method automatically transforms PubMed queries into query patterns in entity space and subsequently transforms entity space into LSA topic space. We pioneered the application of LSA to the semantic analysis of query patterns (specifically PubMed query patterns) and the identification of synonymous biomedical patterns without manual seeds. Our evaluations showed that the proposed LSA framework significantly outperforms a baseline approach and can effectively find pattern synonyms covering a wide range of bio-entity relations, such as chemical-disease relationships and drug-drug interactions. Our preliminary results showed that the computationally generated synonymous patterns could improve retrieval effectiveness when applied as query expansion in PubMed searches.
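The inverted lists built from the tagging results can be illustrated with a minimal sketch. It assumes a toy single-token lexicon (`LEXICON`) in place of the actual gene, disease and chemical taggers, and the function names (`tag`, `build_inverted_lists`, `most_searched`, `co_occurring`) and sample queries are hypothetical, chosen only for illustration. It is not the production implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for the gene, disease and chemical taggers.
LEXICON = {
    "anemia": "disease",
    "asthma": "disease",
    "iron": "chemical",
    "tamoxifen": "chemical",
    "brca1": "gene",
}


def tag(query):
    """Return the (entity, type) pairs found in a query (single-token lookup only)."""
    return {(tok, LEXICON[tok]) for tok in query.lower().split() if tok in LEXICON}


def build_inverted_lists(queries):
    """Map each entity to the set of ids of the queries that mention it."""
    postings = defaultdict(set)
    for qid, query in enumerate(queries):
        for entity, _type in tag(query):
            postings[entity].add(qid)
    return postings


def most_searched(postings, entity_type, n=3):
    """Rank entities of one type by the number of queries that mention them."""
    counts = Counter(
        {e: len(ids) for e, ids in postings.items() if LEXICON[e] == entity_type}
    )
    return counts.most_common(n)


def co_occurring(postings, entity):
    """Count entities that share a query with `entity` by intersecting posting lists."""
    target = postings.get(entity, set())
    counts = Counter()
    for other, ids in postings.items():
        shared = len(target & ids)
        if other != entity and shared:
            counts[other] = shared
    return counts.most_common()


# Made-up sample queries, for illustration only.
queries = [
    "anemia iron",
    "iron deficiency anemia",
    "anemia treatment",
    "tamoxifen brca1",
    "asthma treatment",
]
postings = build_inverted_lists(queries)
print(most_searched(postings, "disease"))  # [('anemia', 3), ('asthma', 1)]
print(co_occurring(postings, "anemia"))    # [('iron', 2)]
```

Because each entity maps directly to the queries that contain it, questions such as "what co-occurs with anemia" reduce to set intersections over posting lists rather than a scan of the full log.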
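The two transformations described above (queries to entity-space patterns, then entity space to LSA topic space) can be sketched on a toy scale. This is only an illustration under stated assumptions: `LEXICON` is a hypothetical stand-in for the entity taggers, the matrix uses raw counts, the number of topics is fixed at 2, and a small pure-Python power iteration with deflation stands in for a library SVD routine so that the example has no dependencies. All function names and sample queries are invented for the sketch.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for the entity taggers.
LEXICON = {"aspirin": "CHEMICAL", "ibuprofen": "CHEMICAL",
           "asthma": "DISEASE", "anemia": "DISEASE"}


def to_pattern(query):
    """Entity space: replace each recognized mention with its entity type."""
    return [LEXICON.get(tok, tok) for tok in query.lower().split()]


def term_pattern_matrix(patterns):
    """Rows are terms, columns are query patterns, cells are raw counts."""
    vocab = sorted({term for pattern in patterns for term in pattern})
    counts = [Counter(pattern) for pattern in patterns]
    return vocab, [[float(c[term]) for c in counts] for term in vocab]


def lsa_term_vectors(matrix, k, iterations=200):
    """Rank-k term coordinates (U_k scaled by S_k), found by power iteration
    with deflation on the Gram matrix A A^T."""
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in matrix] for ri in matrix]
    coords = [[] for _ in range(n)]
    for _topic in range(k):
        vec = [1.0 / math.sqrt(n)] * n
        for _step in range(iterations):
            nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue, i.e. the squared singular value.
        eigenvalue = sum(
            v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram)
        )
        sigma = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i].append(vec[i] * sigma)
        # Deflate: remove the component just found before extracting the next one.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords


def cosine(u, v):
    """Cosine similarity between two topic-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up sample queries, for illustration only.
queries = ["aspirin versus ibuprofen", "aspirin ibuprofen comparison",
           "asthma treatment", "anemia therapy"]
patterns = [to_pattern(q) for q in queries]
vocab, matrix = term_pattern_matrix(patterns)
vectors = dict(zip(vocab, lsa_term_vectors(matrix, k=2)))

print(cosine(vectors["versus"], vectors["comparison"]))  # expected to be close to 1
print(cosine(vectors["versus"], vectors["treatment"]))   # expected to be close to 0
```

In this toy data "versus" and "comparison" never appear in the same query, yet they land close together in topic space because both occur in the context of two chemical entities. That is the property that allows synonymous patterns to be found without manual seeds.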