Tree-based classification algorithms have proven very useful for a wide variety of data mining applications, but their performance sometimes suffers for very large data sets or data sets in very high dimensions. There are two fundamental challenges to overcome in these situations: First, as the number of dimensions of the attribute space increases, the sparseness of the data points increases dramatically. Second, as the number of dimensions of the attribute space increases, the standard Euclidian metric becomes less relevant for many problems of practical interest. Tree-based classifiers can be viewed as decomposing feature space into n-dimensional rectangles. If we view these cubes as being based implicitly on a Euclidean distance function, it becomes natural to consider decomposing feature space into n-dimensional rectangles based upon other distance functions, such as those that arise in hyperbolic geometry. In this research, the investigator uses this principle to develop tree-based classification algorithms and clustering algorithms based upon hyperbolic distance functions. He develops novel methods for scaling tree-based classifiers and other partition-based classification methods to very large data sets and to data sets in very large dimensions. In practice, applying classification algorithms to high volume data streams is limited by the difficulty that current network protocols have transporting high volume data flows over wide area networks with high bandwidth delay products. To overcome this, the investigator develops versions of hyperbolic clustering and classification algorithms suitable for high volume data streams by layering these algorithms over a new application layer network protocol developed in his laboratory specifically for wide area networks with high bandwidth delay products.
According to an Army report, the goal of actionable intelligence is to give commanders and soldiers the ability to conduct successful operations by providing them with a high level of situational understanding in a manner that is rapid, accurate, and timely. An important enabling technology for this is the ability to classify, integrate, and route high volume data streams in real time that originate anywhere in the world. Because of the volume of the data, today's data mining algorithms have trouble scaling to the data sets of interest. Most data mining today is done using Euclidean metrics, although there are many other metrics that might be used. The investigator develops new data mining algorithms based upon hyperbolic metrics in order to develop extremely scalable data mining algorithms. Because of the limited capacities of many of today's networks, data mining applications have difficulty working with very large data sets over long distances. Using recent work in his laboratory which developed very high performance network protocols, he develops data mining applications for high performance networks which are effective with even very large data sets. He is preparing open source implementations of these algorithms so that they may be easily used by students and other interested parties.
This award is supported jointly by the NSF and the Intelligence Community. The Approaches to Terrorism program in the Directorate for Mathematics and Physical Sciences supports new concepts in basic research and workforce development with the potential to contribute to national security.