Traditional search engines such as Google typically miss the large amount of information that sits behind the search engines of many online text information sources. Federated text search provides one-stop access to this hidden information through a single interface that connects to the search engines of multiple text information sources. Existing federated search solutions focus only on content relevance and ignore a large amount of valuable information about users and information sources. This project includes novel research on: (1) Multiple Type Resource Representation: model important properties of text information sources such as search response time and search engine effectiveness; (2) Utility-Centric Resource Selection: satisfy a user's search criteria by considering multiple types of evidence such as content relevance, search results from past queries, personal information needs, and search response time; (3) Effective and Efficient Results Merging: produce accurate merged ranked results at little cost of acquiring the content of the returned documents; (4) System Adaptation by Results Analysis: analyze the search results from past queries to make federated search solutions more accurate; (5) System Development and Evaluation: build and test algorithms within research environments as well as in a new FedLemur system for a real-world application. The project advances the state of the art of research in federated search. It will have broad impact on other applications such as peer-to-peer search. The project Web site (www.cs.purdue.edu/~lsi/Federated_Search_Career_Award.html) will be used to disseminate results.

The education component of the project will expand information retrieval instruction to address multi-disciplinary requirements, improve the education of the information technology workforce, and stimulate the interest of K-12 students in search technologies.

Project Report

Federated Text Search (also known as Distributed Information Retrieval or Hidden Web Search) is a technique for searching multiple text sources simultaneously. It is preferred over centralized search alternatives when some hidden information cannot be arbitrarily copied or frequently updated by centralized solutions. Previous studies have shown that the hidden Web is larger, possibly much larger, than traditional Web content. Federated search has three major topics: resource representation (obtaining representative information from each information source), resource selection (selecting a few of the most valuable resources), and results merging (merging the results returned by individual information sources). This research project substantially advances the state of the art of federated text search.

In resource representation, we recognized the importance of factors such as results from past queries and search engine response time in federated text search. The research work (CIKM 2009) utilized search results from past queries to build more accurate resource representations and thereby improve resource selection. In particular, the proposed method analyzes the results from past queries to refine the static resource representation obtained from sampled documents. The research work (SIGIR 2009) proposed the first learning method that predicts search response time for different user queries.

In resource selection, we proposed a joint probabilistic classification framework that estimates the relevance of sources by considering both the content relevance of individual sources and the relationships between individual sources (SIGIR 2010). Current resource selection algorithms focus on the evidence of individual information sources to determine the relevance of available sources. On the other hand, relationship information among individual sources can be important.
For example, an information source tends to be relevant to a user query if it is similar to another source with a high probability of being relevant. The joint probabilistic classification model estimates the probability of relevance of information sources jointly, considering both the evidence of individual sources and their relationships. Our work (SIGIR 2013) was the first to study the trade-off between relevance and novelty in resource selection for federated search, which promises to make federated search more practical.

In results merging, we proposed an effective and efficient solution for merging results from multilingual information sources, which integrates techniques from both results merging in federated text search and cross-lingual information retrieval. A recent Sample-Agglomerate Fitting Estimate (SAFE) algorithm (Shokouhi et al., 2009), proposed by other researchers, extended our SSL results merging algorithm by combining both overlapping sampled documents and non-overlapping ones into a uniform regression model, but it does not distinguish documents' types and levels of importance. We proposed (SIGIR 2012) a novel mixture model with multiple centralized retrieval algorithms for result merging in federated search. Existing result merging algorithms such as SSL and SAFE do not fully address the heterogeneity of information sources in federated search. Their arbitrary choice of a single centralized retrieval algorithm suffers from the fact that information sources inherently differ in source statistics, query processing techniques, and/or document retrieval algorithms. The proposed model combines evidence from multiple centralized retrieval algorithms in a mixture model framework in order to map source-specific document ranks to comparable scores for result merging.
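
As a rough illustration of the regression idea behind the results merging work described above, the Python sketch below fits one linear mapping per source from source-specific scores to the scores a centralized retrieval algorithm assigns to the same documents, then uses that mapping to place all returned documents on a comparable scale. This is a simplified sketch and not the project's SSL, SAFE, or mixture model implementation. The function names `fit_line` and `merge_results`, the data layout, and the decision to skip sources with fewer than two overlapping documents are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

ScoredDoc = Tuple[str, float]  # (document id, score)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # All source scores are identical: fall back to a constant mapping.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def merge_results(
    source_results: Dict[str, List[ScoredDoc]],
    central_scores: Dict[str, float],
) -> List[ScoredDoc]:
    """Merge per-source ranked lists into one list of comparable scores.

    source_results: for each source, the documents it returned with its
        own source-specific scores.
    central_scores: centralized scores for the (hypothetical) sampled
        documents that a centralized retrieval algorithm has also scored.
    """
    merged: List[ScoredDoc] = []
    for _source, results in source_results.items():
        # Training pairs: documents scored by both the source and the
        # centralized algorithm.
        overlap = [(s, central_scores[d]) for d, s in results if d in central_scores]
        if len(overlap) < 2:
            # Assumption of this sketch: too little evidence, skip the source.
            continue
        a, b = fit_line([x for x, _ in overlap], [y for _, y in overlap])
        # Map every returned document onto the centralized score scale.
        merged.extend((doc, a * score + b) for doc, score in results)
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```

Calling `merge_results` with each source's returned list and a dictionary of centralized scores for the overlapping documents yields a single ranked list. A mixture model in the spirit of the SIGIR 2012 work would combine several such centralized scorers rather than rely on one, which this sketch does not attempt.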
Furthermore, information sources in federated search environments may not be willing to disclose the contents of their documents or their own identities. For example, privacy-preserving federated similarity search solutions are needed to detect plagiarized documents between two conferences whose submissions are confidential. We collaborated with colleagues in information privacy to design privacy-preserving protocols that achieve this goal effectively and efficiently. Moreover, the work in (SIGIR 2007) developed a solution to protect the privacy of resource identities in federated search.

We have built prototype federated search systems such as a digital library recommendation system (SIGIR 2010). We published a survey paper (about 100 pages) on federated search (FnTIR 2011) with a collaborator from Microsoft Research, which has had a good impact in the field, with 51 citations in the last couple of years.

This grant has provided substantial opportunities for student training. Seven graduate students and two undergraduate students have been involved in the research project. They have gained experience in areas such as algorithm design and software implementation. Many of them have completed internships at companies in industry. The research project has helped them move to the next stage of their careers.

The research from this project has been presented at many important venues such as SIGIR and CIKM. We have included relevant materials in tutorials (e.g., Machine Learning for Information Retrieval at SIGIR 2011 and Complex Information Retrieval Applications at the Qatar Computing Research Institute in 2013). We have designed education materials for information retrieval and have shared them on the project website.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0746830
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 2008-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2007
Total Cost: $492,983
Indirect Cost:
Name: Purdue University
Department:
Type:
DUNS #:
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907