Over the last decade, the online search for biological information has progressed rapidly and has become an integral part of any scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the way in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the patrons of a search engine is its transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several useful applications that assist user searching and retrieval, in particular query formulation in PubMed, namely Related Queries and Query Autocomplete. Inspired by their success, we have continued using log analysis to identify research problems that are closely related to NCBI operations. Among all Entrez databases, PubMed is the most heavily used and often serves as an entry point for people to access related data in other Entrez databases. In 2012-2013, we studied the problem of predicting the clicks of PubMed articles. Predicting the popularity or access usage of an article has the potential to improve the quality of PubMed searches. We modeled the click trend of each article as the change in its accesses over time by mining the PubMed query logs, which contain the previous access history for all articles. In this study, we examined the access patterns produced by PubMed users over two years (July 2009 to July 2011).
More specifically, we explored the time series of accesses for each article in the query logs, modeled the trends with regression approaches, and subsequently used the models for prediction. We show that the click trends of PubMed articles are best fitted with a log-normal regression model. Such a model allows the number of accesses an article receives and the time since it first became available in PubMed to be related via quadratic and logistic functions, with the model parameters estimated via maximum likelihood. Our experiments predicting the number of accesses for an article based on its past usage demonstrate that the mean absolute error and mean absolute percentage error of our model are 4.0% and 8.1% lower, respectively, than those of the power-law regression model. The log-normal model is also shown to perform significantly better than a previous prediction method based on a human memory theory from cognitive science. This work warrants further investigation of the utility of such a log-normal regression approach for improving information access in PubMed. Another query log analysis project we conducted in 2012-2013 was the construction of a Web-based tool that demonstrates the four literature search filters we previously developed using PubMed click-through data. Our high-performance search filters are designed to help users retrieve relevant articles more efficiently and effectively for four different clinical topics: nephrology, diabetes, pregnancy and depression.
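To make the click-trend regression and its two error measures more concrete, the following is a minimal sketch rather than the implementation used in the study. It assumes one common log-normal parameterization, in which the logarithm of the access count is a quadratic function of the logarithm of the number of days since the article appeared; under an additional assumption of Gaussian noise in log space, the least-squares fit coincides with the maximum-likelihood estimate. All function names, parameter values and the click series are hypothetical and synthetic, not taken from the PubMed logs.

```python
import math

def fit_log_quadratic(days, clicks):
    """Least-squares fit of ln(clicks) = a + b*ln(t) + c*ln(t)^2 (assumed form)."""
    xs = [math.log(t) for t in days]
    ys = [math.log(y) for y in clicks]
    # Normal equations for the design matrix with columns [1, x, x^2].
    s = [sum(x ** k for x in xs) for k in range(5)]
    rhs = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def predict(params, day):
    """Predicted number of accesses on a given day since first availability."""
    a, b, c = params
    x = math.log(day)
    return math.exp(a + b * x + c * x * x)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Synthetic, noise-free click series generated from made-up parameters.
true_params = (3.0, 1.2, -0.4)
days = list(range(1, 31))
clicks = [predict(true_params, d) for d in days]

# Fit on the first 20 days, then predict the held-out last 10 days.
fitted = fit_log_quadratic(days[:20], clicks[:20])
held_out_pred = [predict(fitted, d) for d in days[20:]]
print("fitted parameters:", fitted)
print("MAE:", mae(clicks[20:], held_out_pred))
print("MAPE (%):", mape(clicks[20:], held_out_pred))
```

A comparison against a power-law baseline would fit the alternative model on the same early portion of each series and compare the two error measures on the held-out portion.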