This work establishes a new approach to supporting ad hoc ("discovery") queries that require integration and structuring: such queries help scientists learn possible relationships between topics, and help decision-makers or consumers explore options. The work develops a new system and underlying architecture based on an iterative process, in which the system and user engage in a dialogue until the user has answers that meet his or her information need.
The resulting system takes sources on the Web, discovers semantic relationships among them, and allows users to pose discovery queries. It leverages existing extraction, matching, and recommendation algorithms as sources of evidence to generate hypotheses and corresponding queries, and adjusts these hypotheses based on user feedback over the query results. Innovations include scalable models for combining features and learning to reweight hypotheses; query and source recommendation techniques; and means of generalizing tuple-based feedback to support or refute hypotheses.
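As a rough illustration of how feedback on a single result tuple might support or refute the hypotheses behind it, consider the following minimal sketch. It is not the system's actual learning algorithm: the multiplicative update, the `rate` parameter, and the hypothesis names are all assumptions made for the example.

```python
def reweight(weights, supporting, accepted, rate=0.5):
    """Return new hypothesis weights after feedback on one result tuple.

    weights    -- dict mapping a hypothesis name to its current weight
    supporting -- hypotheses that were used to produce the result tuple
    accepted   -- True if the user marked the tuple as a good answer
    rate       -- hypothetical step size, assumed to lie strictly between 0 and 1
    """
    # Boost hypotheses behind accepted tuples, shrink those behind rejected ones.
    factor = 1 + rate if accepted else 1 - rate
    updated = dict(weights)
    for h in supporting:
        updated[h] = weights[h] * factor
    # Renormalize so the weights remain comparable across rounds of feedback.
    total = sum(updated.values())
    return {h: w / total for h, w in updated.items()}


# Two hypothetical hypotheses about how two sources relate, equally weighted.
weights = {"join_on_gene_id": 0.5, "join_on_name": 0.5}

# The user rejects a tuple that was produced by joining on name.
weights = reweight(weights, ["join_on_name"], accepted=False)
```

After the rejection, the hypothesis that produced the bad tuple holds one third of the total weight and the alternative holds two thirds, so later queries would favor the alternative.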
The research impact is a new paradigm for data integration by end users, which scalably combines machine learning and database concepts. The broader impact includes better discovery tools for scientific users and other users who sorely need them; improved integration of existing Web data resources; and new educational material on how networks of data can be as important as networks of systems and people. The PI is incorporating the research concepts into courses in the University of Pennsylvania's new Market and Social Systems Engineering Program, focused on the interface between people, protocols, and systems on the Internet, especially through social and data networks, as well as markets. More information on the project can be found on the project website at www.cis.upenn.edu/~zives/dialogue/
This EAGER exploratory grant resulted in new technologies for assisting users in answering queries across multiple databases on an as-needed basis, an area where there is still significant debate about how much impact automated techniques can have. The grant showed that state-of-the-art techniques can indeed provide significant benefit. Given a collection of databases and a keyword query whose words span multiple pieces of information, our Q System attempts to find "connections" among the data and assemble the different data items into a combined answer. The novelty of the Q System lies in its ability to take user feedback on query results, learn from this feedback, and improve its answers. In this grant we developed new techniques to: (1) Use active learning so that the system seeks feedback on data it is uncertain about, rather than simply returning the answers to which it assigns high scores. (2) Allow the user to specify how much to generalize from limited amounts of feedback, where the user's preference will depend on whether the user is exploring ideas (meaning a good deal of generalization is desirable) or carefully refining answers (meaning changes should be specific to the data on which feedback was given). (3) Make the process scale to large amounts of data and many data sources, which required new algorithms for sharing work across multiple queries and for scaling up the analysis of data whose relationships are captured in a graph structure. The work explored a number of issues at the intersection of machine learning and databases, and showed the promise of such techniques. It has resulted in three publications in top conferences (SIGMOD 2011, ICDE 2012, VLDB 2013), including a runner-up for Best Paper, as well as two journal papers, a workshop paper, and three additional manuscripts in preparation.
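The contrast in technique (1) between returning high-scoring answers and seeking feedback on uncertain ones can be sketched with simple uncertainty sampling. This is only an illustration under stated assumptions: it treats each answer's score as an estimated probability of relevance in [0, 1], and the `Answer` type and both function names are hypothetical rather than part of the Q System.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    label: str
    score: float  # assumed: estimated probability of relevance, in [0, 1]


def top_scored(answers, k):
    """Conventional ranking: return the k answers with the highest scores."""
    return sorted(answers, key=lambda a: a.score, reverse=True)[:k]


def most_uncertain(answers, k):
    """Uncertainty sampling: return the k answers whose scores are closest
    to 0.5, where the system is least sure and feedback is most informative."""
    return sorted(answers, key=lambda a: abs(a.score - 0.5))[:k]


answers = [
    Answer("a", 0.95),
    Answer("b", 0.52),
    Answer("c", 0.10),
    Answer("d", 0.45),
]

shown_by_score = [a.label for a in top_scored(answers, 2)]
asked_for_feedback = [a.label for a in most_uncertain(answers, 2)]
```

Here ranking by score would show answers "a" and "b", while the uncertainty-based selection would ask the user about "b" and "d", the two answers whose scores sit nearest the decision boundary.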
As broader impacts, the grant funded the majority of the training for two PhD students, one of whom will be graduating in 2013, and enabled a summer undergraduate research project. The PI also participated in a number of outreach efforts to engage women and underrepresented minority students at the high school level, and co-authored a definitive textbook on data integration. He also co-designed a new course on Big Data and Cloud Computing (Penn's MKSE 212) and in 2010 was given Penn's Lindback Award for Distinguished Teaching. Moreover, the work resulted in the development of the Q System, which is being incorporated into the www.ieeg.org portal for data-centric neuroscience experiments.