The Principal Investigator (PI) formulates the information procuration problem of converting unstructured data into structured information as one of using limited resources (such as processing time and collection costs) among several available strategies for information acquisition, extraction, collation and aggregation in a sequential and adaptive manner. The proposal aims to build a Markov decision process (MDP) for which both the states and the rewards will be learned, and from which an optimal adaptive strategy for effective information procuration will be extracted.

Recent methods for designing adaptive strategies for multi-armed bandit problems and budgeted learning approaches by the PI will be extended for this purpose, as well as techniques from inverse reinforcement learning. Moreover, given the intended size of the application data sets, the focus will be on on scalable algorithms for these problems. Due to the centrality of the problem, new approaches to making better sense of unstructured data will have much impact both in terms of developing new methods and in practice. The proposed synthesis of methods from Operations Research, Approximation Algorithms and Machine Learning is novel in this context. This proposal will increase the cross-fertilization of ideas between Operations Research and Machine Learning, via a collaboration team formed at this intersection.

Project Report

An important aspect of curating data into useful, structured information is to provide recommendation links that expose relevant yet largely unexplored pages in a robust and reduandant way that maximizes the discoverability of the less visited pages. The process of searcing for relevant information in the world-wide web is greatly aided by the presence of recommendations that link information pages to others that bear closely related information. Such recommendations are also ubiquitous in retail sites, video and movie streaming services as well as in social question-answering sites. The choice of relevant pages to recommend has been well studied with most existing methods using the textual content of the originating and destination pages, as well as some additional information about other pages that are linked to or from them. In this work, we have re-cast the problem of choosing pages to recommend as a problem of exposing as many of undiscovered pages as possible. Typical web-sites that hold information have a very small fraction of highly visited and popular pages and a very large unexplored set of undiscovered pages. Given the paucity of how many pages we can recommend from each popular page, and the very large number of undiscovered pages, we consider the choice of recommendations as a two step process: First, a large list of relevant undiscovered pages are chosen per popular page as candidates using traditional recommendation methods (typically running to the hundreds per page). Second, we choose a much smaller list of candidate pages to recommend (about ten or so among the hundreds) so that the undiscovered pages chosen are redundantly linked from at least one or more popular pages. The precise requirements of this recommendation design problem are specified by how many links can be put out from each popular page, as well as the requirement of how redundantly each undiscovered page must be covered for it to be counted. Our work formulates this class of recommendation subgraph design problems and presents three highly scalable algorithms for providing good solutions. We also analyze these methods on simple theoretical models of random generation of the underlying data, and perform simulations on such data to verify our conclusions abut the relative effectiveness of the three methods depending on the problem settings.Our methods have been useful in the deployment of effective recommendations in many web relevance engines. A full report is publicly available here: http://arxiv.org/abs/1409.2042

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
1347308
Program Officer
Balasubramanian Kalyanasundaram
Project Start
Project End
Budget Start
2013-09-01
Budget End
2014-08-31
Support Year
Fiscal Year
2013
Total Cost
$99,714
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213