This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).

This project is designing new data structures and efficient algorithms for scaling modern machine learning techniques to massive datasets, and applying them to recent sky surveys to solve central problems in astronomy. The long-term goal is to scale up all the best machine learning techniques by focusing on key computational primitives, and to create educational initiatives that allow future generations to do the same.

In the shorter term, the project is accelerating the singular value decomposition (SVD), the key computational bottleneck in a number of state-of-the-art methods in machine learning (and well beyond). In addition to the classic principal component analysis, we consider the application of our ideas to kernel ridge regression, graphical model inference, and maximum variance unfolding, each representing a larger class (kernelized, graphical, and convex models). Working with leading astrophysicist collaborators, we validate each of these in a fundamental astronomical data analysis problem: respectively, estimation of the distances to objects, cross-matching of objects in different catalogs, and discovery of new types of objects.
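To make the link between principal component analysis and the SVD concrete: the principal directions of a dataset are the right singular vectors of its mean-centered data matrix. The following is a minimal sketch of that relationship in plain Python. It uses power iteration to recover only the leading singular vector rather than calling a full SVD routine, and the function names (`center`, `top_component`) and the toy data are illustrative assumptions, not code from the project.

```python
import math
import random


def center(rows):
    """Subtract the column means so that PCA sees zero-mean data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Estimate the leading right singular vector and singular value of the
    centered data by power iteration on X^T X (never formed explicitly)."""
    x = center(rows)
    d = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        # Apply X, then X^T: w = X^T (X v).
        xv = [sum(a * b for a, b in zip(r, v)) for r in x]
        w = [sum(xv[i] * x[i][j] for i in range(len(x))) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # The singular value is the length of X v for the converged unit vector v.
    xv = [sum(a * b for a, b in zip(r, v)) for r in x]
    sigma = math.sqrt(sum(c * c for c in xv))
    return v, sigma


# Hypothetical toy data lying close to the line y = 2x.
points = [[t, 2.0 * t + 0.01 * ((-1) ** t)] for t in range(10)]
direction, sigma = top_component(points)
```

Each iteration costs one pass over the data, which is cheap for a single component; computing many components to high accuracy on a large matrix is where the cost of an exact SVD becomes the bottleneck described above.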

The key insight is the use of a new data structure called a cosine tree, which partitions vectors based on their mutual orthogonality, adapting ideas that have proved successful for distance-based geometric problems to enable a new Monte Carlo sampling technique. Preliminary results demonstrate speedups of as much as 20,000 times over exact SVD on moderate-sized problems, with high, user-specifiable approximation accuracy.
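The abstract does not spell out how a cosine tree node is split, so the following is only a sketch of one plausible rule, under stated assumptions: a pivot row is drawn with probability proportional to its squared length (a standard Monte Carlo choice for matrix approximation), and the remaining rows are divided according to whether their absolute cosine with the pivot is closer to the largest or the smallest value observed. The absolute value is an assumption made here so that vectors pointing in opposite directions count as aligned. All names (`sample_pivot`, `cosine_split`) and the toy data are hypothetical.

```python
import math
import random


def norm_sq(v):
    """Squared Euclidean length of a vector."""
    return sum(c * c for c in v)


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is zero)."""
    denom = math.sqrt(norm_sq(u) * norm_sq(v))
    return sum(a * b for a, b in zip(u, v)) / denom if denom else 0.0


def sample_pivot(rows, rng):
    """Pick a row index with probability proportional to its squared length.
    Assumes at least one row is nonzero."""
    weights = [norm_sq(r) for r in rows]
    return rng.choices(range(len(rows)), weights=weights, k=1)[0]


def cosine_split(rows, rng):
    """Split rows into those nearly parallel to a sampled pivot ('near')
    and those closer to orthogonal to it ('far')."""
    pivot = rows[sample_pivot(rows, rng)]
    cosines = [abs(cosine(r, pivot)) for r in rows]
    lo, hi = min(cosines), max(cosines)
    near, far = [], []
    for r, c in zip(rows, cosines):
        # Assign each row to whichever extreme its cosine is closer to.
        (near if hi - c <= c - lo else far).append(r)
    return near, far


# Hypothetical toy data: three rows near the x-axis, three near the y-axis.
rows = [[1.0, 0.0], [0.9, 0.1], [2.0, 0.1],
        [0.0, 1.0], [0.1, 1.5], [0.05, 0.8]]
near, far = cosine_split(rows, random.Random(0))
```

Applying such a split recursively would yield a tree whose leaves hold groups of nearly parallel vectors. How the project uses those groups to drive sampling and to bound the approximation error is not described in this abstract.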

The broader impact of the work is a transformative ability to apply advanced data analysis techniques to the tera- and peta-scale datasets of the present and future, unlocking insights across science, engineering, and business. In support of these goals, the project's educational aims are the deep integration of real-world data analysis and cross-disciplinary thinking into traditional computing programs.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 0845865
Program Officer: Daniel Katz
Project Start:
Project End:
Budget Start: 2009-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2008
Total Cost: $590,000
Indirect Cost:

Name: Georgia Tech Research Corporation
Department:
Type:
DUNS #:
City: Atlanta
State: GA
Country: United States
Zip Code: 30332