Over the last decade, online searching for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the way in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. Among all Entrez databases, PubMed is the most used and often serves as an entry point for people to access related data in other Entrez databases. One source for understanding and characterizing the patrons of a search engine is its transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several useful applications that assist user searches and query formulation in PubMed, namely Related Queries, Query Autocomplete and Author Name Disambiguation. Inspired by their success, we have continued using log analysis to identify research problems that are closely related to NCBI operations.

Many users only look at the first page of search results. With the growing usage of mobile devices, even fewer search results may be seen. While PubMed's traditional reverse date-order results meet the needs of many, improving Best Match results provides quick access to relevant results. In addition to search terms in the title and abstract, these results are based on the length of the query, the journal, the publication date, and whether other users clicked on the article. A first step in providing relevant results is understanding the query.
To better understand queries, we developed a Field Sensor to completely identify the portions and aims of a query. In other words, we identify which part of the query is an author name, a journal title, a date, or key phrases describing the knowledge the searcher would like to uncover. One use for this tool is reminding those looking for information, rather than specific articles, about our improved Best Match searching.

It would seem that queries meant to retrieve a particular article would be the easiest to process. But when one considers the bewildering array of reference formats, and a user's partial and incomplete memory, properly handling a single citation query is more difficult than one might think. Our approach was to match portions of a query to the bibliographic material of a clicked article. From these matches, artificial queries could be constructed that follow the patterns seen in actual queries. Since any number of these artificial queries could be constructed, there was plenty of data to inform and train an algorithm.

Helping the user enter the query they really want is also valuable. Query auto completion is a commonly available feature in search engines that helps users enter queries quickly, efficiently, and accurately, and prompts for useful detail the searcher may not have originally thought to include. The usual approach is to suggest the most popular completion, since popular queries are, by definition, the ones most often entered. However, this is not as useful in a scientific context. Scientists instead are often looking for the novel, the unknown, and the unseen. Using a personalized language model, time-sensitive data, neural nets, and a beam search for diversity, we were able to better predict actual future queries. This could lead to better query auto completion.

Some phrases carry more meaning than is obvious from the words in the phrase. Identifying these phrases in queries and documents helps identify the most relevant documents.
But these phrases cannot be obtained by simply collecting known lists of good phrases. Creative researchers coin and use new phrases all the time. A way of identifying these phrases by comparing an article's title and body has been developed. The value of this approach has been confirmed by the large number of known phrases identified. But it also identifies phrases not yet recognized by curated lists. When any of these phrases is recognized in a query and then used in the search, the results are better than merely searching for the words individually.

Deep Learning and Neural Networks have shown their value in image processing for some time. Now their application in text processing is becoming more important. Several of our projects use word embeddings, or even neural nets directly. One challenge is having models efficient enough to scale for use in a high-capacity search engine. By focusing on the differences between vectors describing a query and vectors describing a document, we have developed a system that could contribute to a relevance ranking algorithm. Another approach directly compares the vectors for words in the query and words in the documents. Using an algorithm motivated by Word Mover's Distance, it provides results better than traditional information retrieval methods. Moreover, when combined with those traditional approaches, the combined approach is superior to either single method.

Now that the full text of more and more articles is available, we want to use the full text to improve search. That is harder than might be expected because authors usually do a good job of summarizing their work in the abstract. A major hurdle is the lack of human-annotated relevance data. One project showed that MeSH terms which overlap with queries can be used as a proxy for direct human relevance judgements.
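
As a rough illustration of the word-vector comparison described above, the following minimal sketch scores documents with a relaxed form of Word Mover's Distance, in which each query word is matched to its nearest document word and the distances are averaged. This is not the algorithm used in our system: the two-dimensional vectors in `TOY_VECTORS`, the function names `relaxed_wmd` and `rank`, and the example documents are all invented for illustration, and a real system would use trained embeddings.

```python
import math

# Hypothetical 2-D word embeddings, invented purely for illustration.
TOY_VECTORS = {
    "tumor": (1.0, 0.0),
    "cancer": (0.9, 0.1),
    "therapy": (0.0, 1.0),
    "treatment": (0.1, 0.9),
    "protein": (0.7, 0.7),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(query_words, doc_words, vectors=TOY_VECTORS):
    """Average, over query words, of the distance to the nearest document word.

    Words without a vector are ignored. Returns infinity when either side
    has no known words, so such documents sort last.
    """
    q = [w for w in query_words if w in vectors]
    d = [w for w in doc_words if w in vectors]
    if not q or not d:
        return math.inf
    total = sum(min(euclidean(vectors[qw], vectors[dw]) for dw in d) for qw in q)
    return total / len(q)


def rank(query, docs):
    """Return document ids ordered from smallest to largest distance."""
    q = query.lower().split()
    scored = [(relaxed_wmd(q, text.lower().split()), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored)]


if __name__ == "__main__":
    docs = {"A": "cancer treatment", "B": "protein"}
    print(rank("tumor therapy", docs))
```

In this toy setting the document "cancer treatment" ranks above "protein" for the query "tumor therapy" even though it shares no words with the query, which is the kind of match that purely term-based retrieval would miss. A score like this could, in principle, be combined with a traditional term-matching score, in the spirit of the combined approach mentioned above.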