Next-generation DNA sequencing technologies produce datasets that are the epitome of "big data." Resulting files are typically quite large, consisting almost entirely of symbolic (i.e., non-numeric) short DNA sequences. In contrast, the most widely used machine learning algorithms require numerical datasets to learn. Unfortunately, both traditional and cutting-edge methods to numerically represent symbolic data often suffer from high-dimensionality or substantial running time requirements, which hinder the application of powerful machine learning algorithms to modern biological questions. To overcome these crucial issues, this project addresses the fundamental problem of determining the "right" dimension in which to embed symbolic data for a data-mining or classification task. It does so by representing symbolic datasets numerically via a method reminiscent of Global Positioning Systems (GPS) but in a far more general setting. Besides exploring modern biology applications, the project will also investigate how to predict the source of a spread (i.e., ground zero) over large networks. This may assist administrators in determining how best to respond to new epidemics and cyber-threats. Additionally, the project will closely mentor undergraduate and graduate students to become mature data scientists. Its findings will be communicated as notes, open-source software, and video-lectures available to the general public, including students in the Colorado Data Science Team, which encourages the participation of women and under-represented minorities in Engineering education.
Much like GPS uses trilateration to locate a receiver anywhere on the planet, finite metric spaces contain resolving sets, that is sets of points that uniquely identify every point in the space via multilateration (i.e., the vector of distances to points in the set). Associated with any resolving set R, there is a one-to-one transformation from its ambient metric space to a Euclidean space of dimension |R|, the cardinality of R. The smallest resolving set thus induces the lowest-dimensional representation of its ambient space. Importantly, even when the ambient metric space is finite but exponentially large, its metric dimension is often much smaller than its cardinality. Determining the metric dimension is, however, an NP-hard problem in a variety of contexts. Building on this abstracted notion of multilateration, this project will: (1) assess the computational complexity of calculating the metric dimension of Hamming graphs, and characterize the metric dimension of various random graph models to guide the development of new and efficient algorithms to approximate this quantity; (2) explore relaxations and constraints of multilateration, including approximate and probabilistic algorithms, to expand the reach of applications of multilateration to other finite but large metric spaces; and finally (3) provide proofs-of-concept of multilateration to learn non-contiguous regions of dependencies in genomic sequences, develop classifiers for historically elusive virus targets, and identify the source of spread of information or disease in large networks.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.