Many emerging applications of data mining call for techniques that can handle data instances with millions, if not billions, of dimensions. Hence, there is a need for effective approaches to dealing with extremely high-dimensional data sets.
This project focuses on a class of novel, theoretically well-founded hashing algorithms that allow high-dimensional data to be encoded in a form that can be efficiently processed by standard machine learning algorithms. Specifically, it explores: (1) one-permutation hashing, to dramatically reduce the computational and energy cost of hashing; (2) sparsity-preserving hashing, to take advantage of data sparsity for efficient data storage and improved generalization; and (3) the application of the new hashing techniques with standard algorithms for learning "linear" separators in high-dimensional spaces. The success of this EAGER project could lay the foundations of a longer-term research agenda by the PI and other investigators focused on developing effective methods for building predictive models from extremely high-dimensional data using "standard" machine learning algorithms.
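To make the preprocessing cost concrete, the sketch below (in Python, with illustrative function names and toy data, not project code) shows classic minwise hashing, the baseline that one-permutation hashing is designed to replace: each of the k hash values requires its own random permutation of the feature space.

    import random

    def minwise_signature(feature_ids, dim, k, seed=0):
        # Classic minwise hashing: k independent random permutations of the
        # feature space {0, ..., dim-1}; for each permutation, store the
        # smallest permuted index among the nonzero features.
        # Preprocessing cost is O(k * nonzeros) per data vector.
        rng = random.Random(seed)
        signature = []
        for _ in range(k):
            perm = list(range(dim))
            rng.shuffle(perm)
            signature.append(min(perm[f] for f in feature_ids))
        return signature

    # The fraction of matching entries between two signatures estimates the
    # Jaccard similarity |A intersect B| / |A union B| of the two sparse vectors.
    A, B = {2, 7, 19, 85}, {2, 7, 19, 60}
    sa = minwise_signature(A, dim=100, k=200)
    sb = minwise_signature(B, dim=100, k=200)
    print(sum(x == y for x, y in zip(sa, sb)) / 200)   # close to 3/5 = 0.6

The point of the sketch is the inner loop: every one of the k hash values pays for a fresh permutation, which is exactly the cost that one-permutation hashing removes.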
Broader Impacts: Effective approaches to building predictive models from extremely high dimensional data can impact many areas of science that rely on machine learning as the primary methodology for knowledge acquisition from data. The PI's education and outreach efforts aim to broaden the participation of women and underrepresented groups. The publications, software, and datasets resulting from the project will be freely disseminated to the larger scientific community.
This project has focused on developing novel, practical, and mathematically rigorous hashing algorithms for processing massive and extremely high-dimensional data, which are common in modern applications such as search engines. The core of the project is the development of "one permutation hashing," a safe replacement for "minwise hashing," the standard data processing algorithm used in search. Minwise hashing has been a popular tool for efficiently computing data similarities using small storage space. It is also a standard indexing tool for organizing data in a way that allows searching for nearest neighbors in sublinear time (i.e., faster than linear scans).

The drawback of minwise hashing is that it must permute the data many (hundreds or thousands of) times. This preprocessing cost typically dominates subsequent operations such as search and learning. Even though the preprocessing can be trivially parallelized, that is not an energy-efficient solution. One permutation hashing solves the problem by applying only ONE permutation to the data and exploiting statistical properties that had not been studied in the past. This dramatically reduces the processing cost, by a factor of hundreds or thousands depending on the application. One permutation hashing can be used both for learning (e.g., building SVMs or logistic regression models) and for sublinear-time near neighbor search (i.e., indexing).

However, to use it for indexing sparse data, we must address a crucial issue: the existence of "empty bins." One permutation hashing works by breaking the space (after permutation) into bins and keeping the smallest nonzero entry in each bin. When the data are sparse, empty bins are unavoidable; in fact, they can dominate when the data are extremely sparse. Our solution to this problem is "densified one permutation hashing," which fills each empty bin by borrowing from neighboring non-empty bins. We have shown that this approach is mathematically rigorous, and we have successfully used the method for indexing and sublinear-time near neighbor search.

In summary, this project has been a success: we have developed "one permutation hashing" and "densified one permutation hashing," which are simple, practical, mathematically rigorous, and can safely replace the standard data processing methods used in industry.
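The following is a minimal Python sketch, under simplifying assumptions, of one permutation hashing and the "borrow from a neighboring non-empty bin" densification described above; the function names are illustrative, and the bias-correcting offsets used in the actual densified scheme are omitted for brevity.

    import random

    def one_permutation_hash(feature_ids, dim, num_bins, seed=0):
        # One permutation hashing: apply a SINGLE random permutation to the
        # feature space, split the permuted range into num_bins equal bins,
        # and keep the smallest permuted offset landing in each bin.
        # Bins that receive no nonzero feature stay None ("empty bins").
        rng = random.Random(seed)
        perm = list(range(dim))
        rng.shuffle(perm)
        bin_size = dim // num_bins          # assumes num_bins divides dim
        bins = [None] * num_bins
        for f in feature_ids:
            p = perm[f]
            b, offset = p // bin_size, p % bin_size
            if bins[b] is None or offset < bins[b]:
                bins[b] = offset
        return bins

    def densify(bins):
        # Densification: fill each empty bin by borrowing the value of the
        # nearest non-empty bin to its right (circularly). The full densified
        # scheme also adds an offset per borrowed bin to preserve unbiasedness,
        # which this sketch omits.
        filled = list(bins)
        k = len(bins)
        for i in range(k):
            j = i
            while bins[j] is None:          # assumes at least one non-empty bin
                j = (j + 1) % k
            filled[i] = bins[j]
        return filled

As with minwise hashing, the fraction of positions on which two densified signatures agree can be used to estimate the similarity of the underlying sparse vectors, but the preprocessing now requires only one permutation instead of hundreds or thousands.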