Search engines have become modern society's main sources of information. They put vast amounts of knowledge about virtually any topic at our fingertips wherever we go. To do so, a search engine studies our search query and tries to form an understanding of what it is that we were looking for in the first place, when querying for "boston tea party reason", "IRS form 1040" or "pizza near me". This internal representation is then compared with billions of webpages, books, or news articles in an attempt to find the best possible information for our searcher's query. This project will support and innovate this process in two ways. First, it will develop a more accurate understanding of search queries using insights into the way that humans use language, rather than just comparing queries and documents word-by-word. Secondly, using these improved representations of query meaning, the researchers will develop a fundamentally different way of searching for information. Instead of comparing our query with every possible match, they let the search engine come up with an idealized response to the query and then try to find those webpages that are most similar to this optimal answer. The expected consequences will be better search results and faster computation for the machines running the search engines (that, in turn, can lead to reduced electricity demand and CO2 emissions). To ensure a lasting impact even after this project has concluded, the research team will actively reach out to researchers and engineers at search engine companies to raise facilitate widespread adoption of the technology developed in this project. To broaden participation in computer science beyond traditionally well-represented demographics (e.g., in terms of genders, ethnicities, or socio-economic backgrounds), the study team will host a range of technology literacy outreach events among student populations at the college and middle school level. During these events, the researchers, supported by undergraduates from diverse backgrounds, will inform the student participants about effective information search tools and strategies, the research goals of this project, and college-level computer science education in general.

Deep and representation learning have brought promising improvements to various Information Retrieval (IR) tasks. Existing neural IR models estimate a matching score between the information need - such as a query or question - and the documents, using semantic similarities between terms, learned from a large set of relevance information. In contrast to classical IR models where the estimation of matching scores is constrained to only those documents containing the query terms, neural IR models need to trace over all documents, or instead re-rank the top-retrieved documents, obtained from a classical IR model. In addition, since neural IR models are often based on purely distributional representations of term meaning, they lack a grounded understanding of language subtleties such as for example gradable terms. The objective of this project is to design generative information retrieval models enhanced by distributed representations of gradable terms. To accomplish this, the research team plans the following concrete objectives. (1) Generative IR models: Instead of computing matching scores for each query-document pair, a document generative model can effectively approximate a representation in the relevance sub-space for a given query, facilitating efficient fully-neural document retrieval. The investigators will explore generative models to approximate hierarchical representations of relevant documents, and use efficient nearest-neighbor algorithms to find and retrieve the most suitable organic documents in the collection. (2) Distributed representations of gradable terms: The often intangible meaning of gradable terms can be resolved by considering the global context of each term. The project will study a probabilistic formulation of gradable terms based on their hypothetical value ranges and frames of reference, estimated from the collection. (3) Incorporation of gradable term representations into generative retrieval systems: The integration of grounded representations of gradable terms in the generative retrieval model will provide better understanding and support of information needs. The project will study this effect on information needs with and without gradable terms. The expected artifacts produced by the project are peer-reviewed scientific publications, open-source implementations of the proposed models, pre-trained word and phrase embeddings, logged retrieval runs in trec_eval format, a manually annotated Subjective Entailment dataset, and a suite of middle school search literacy education materials. All of these will be shared on the project website under Creative Commons CC0 License.

This project is jointly funded by the Information Integration & Informatics Program in the Division of Information & Intelligent Systems, and the Established Program to Stimulate Competitive Research (EPSCoR).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1956221
Program Officer
Hector Munoz-Avila
Project Start
Project End
Budget Start
2020-09-01
Budget End
2024-08-31
Support Year
Fiscal Year
2019
Total Cost
$300,000
Indirect Cost
Name
Brown University
Department
Type
DUNS #
City
Providence
State
RI
Country
United States
Zip Code
02912