The objective of this research project is to develop statistical tools for the analysis of information ranking algorithms and to construct efficient simulation methods for their validation and testing. Rapidly growing webs of interconnected, dynamic, and complex information sets, such as the World Wide Web (WWW), scientific data, social networks, news, and national security data, are reaching unprecedented scales. Hence, effective methods for ordering/ranking these data sets are of the utmost importance for making the best use of this wealth of information. Given that the scale and complexity of these information sets will continue to increase, a new probabilistic approach to understanding their average behavior is needed, in much the same way that statistical mechanics was needed to understand large collections of molecules. To this end, statistical tools will be developed for the analysis of a variety of dynamic, distributed, and possibly nonlinear information ranking algorithms. The novel statistical methodology will provide a framework for designing ranking algorithms with pre-specified behavior. As a complement to the analytical work, efficient simulation methods will be constructed for validating modeling assumptions and for testing the second-order properties of ranking algorithms and information webs.

If successful, the results of this research will provide a new framework for the analysis and design of customized ranking algorithms with a predetermined typical behavior, yielding algorithms better tailored to the diverse requirements of specific application areas. Because the work pursues analytically tractable approximation methods, it is expected to produce considerable new insight and practical design rules of thumb for ranking algorithms. Furthermore, the mathematical techniques developed here will significantly enrich the existing literature on weighted stochastic recursions, heavy-tailed large deviations, weighted branching processes, and efficient simulation methods. This work is also expected to have a substantial broader impact, since the preceding mathematical disciplines are heavily used in a wide variety of application areas, including the analysis of algorithms, biology, and statistical mechanics.

Project Report

In the age of "Big Data", where huge amounts of information from a wide range of sources are continuously being collected, the need to organize that information has become crucial. The work funded through this award had as its main goal the analysis of ranking algorithms such as Google's PageRank. PageRank is the algorithm on which Google's search engine was originally based, and it allowed internet users to search efficiently through billions of webpages for relevant information. The idea behind PageRank was to assign to each webpage in the WWW an absolute rank that determines the order in which it appears within a given set of search results. The algorithm worked so well that it allowed Google to grow into the company it is today.

The main goal of this award was to develop a model for analyzing a general class of ranking algorithms that includes PageRank as a special case. Rather than examining the ranks produced by the algorithm on specific data sets using data mining techniques, the idea was to create a mathematical framework for analyzing the rank of a typical node in a large complex network. In particular, we were interested in answering the question: "How does a node get to be highly ranked?" Surprisingly, despite a fair amount of empirical work over the last 20 years, no one had been able to truly answer this question until now.

Through a series of more than 10 journal articles, the PI and Co-PI of this award have provided the first mathematically rigorous analysis of the typical behavior of PageRank and related algorithms. In particular, the theory developed through this grant shows that highly ranked webpages are those that either have a very large number of incoming links or have at least one very highly ranked webpage linking to them. This insight is consistent with empirical findings and heuristics, but because it is derived from a mathematical model, it required no knowledge of the data itself. Moreover, experiments with real data have consistently shown that the predictions of the theoretical model are remarkably accurate.

Beyond the insight that the new theoretical framework provides into existing ranking algorithms, the value of this work lies in its use in the design of new ones. In particular, the theory's predictions for specific choices of a ranking algorithm's parameters will allow us to develop a methodology for designing new algorithms capable of ranking highly specific data attributes. Given the diversity of application areas where organizing data is essential, highly customized ranking algorithms can be extremely useful. In this respect, the outcomes of this award lay the foundation needed to guide the design of new and improved ranking algorithms.
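To make the preceding discussion concrete, the following is a minimal sketch of the standard PageRank iteration. It is an illustration only: the toy link graph, the damping factor of 0.85, and all names in the code are assumptions made for this sketch, not artifacts of the funded work, which analyzes a distributional fixed-point version of the same recursion rather than this matrix iteration. Each page's rank satisfies R(v) = (1 - c)/n + c * sum over pages u linking to v of R(u)/D(u), where D(u) is the number of outgoing links of page u; the two routes to a high rank identified by the theory (many in-links, or one very highly ranked in-link) both enter through the sum on the right-hand side.

    # Minimal sketch of the standard PageRank power iteration (Python).
    # R(v) = (1 - c)/n + c * sum_{u -> v} R(u)/D(u); c, tol, and the toy
    # graph below are illustrative assumptions.
    def pagerank(links, c=0.85, tol=1e-10, max_iter=100):
        """links: dict mapping each node to the list of nodes it links to."""
        nodes = list(links)
        n = len(nodes)
        rank = {v: 1.0 / n for v in nodes}  # uniform initial ranks
        for _ in range(max_iter):
            # Teleportation term (1 - c)/n for every node.
            new_rank = {v: (1.0 - c) / n for v in nodes}
            for u in nodes:
                out = links[u]
                if out:
                    # Spread u's current rank evenly over its out-links.
                    share = c * rank[u] / len(out)
                    for v in out:
                        new_rank[v] += share
                else:
                    # Dangling node: spread its rank uniformly.
                    for v in nodes:
                        new_rank[v] += c * rank[u] / n
            done = max(abs(new_rank[v] - rank[v]) for v in nodes) < tol
            rank = new_rank
            if done:
                break
        return rank

    # Toy graph: 'a' is highly ranked via many in-links, while 'd' is
    # highly ranked via a single in-link from the highly ranked 'a' --
    # the two routes to a high rank described above.
    toy = {'a': ['d'], 'b': ['a'], 'c': ['a'], 'e': ['a'], 'd': ['a']}
    print(pagerank(toy))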

Budget Start: 2011-09-01
Budget End: 2014-12-31
Fiscal Year: 2011
Total Cost: $330,000
Name: Columbia University
City: New York
State: NY
Country: United States
Zip Code: 10027