Recent technological advances have resulted in the emergence of many decentralized/distributed systems comprising interconnected components that communicate among themselves over wireless links and the internet backbone for coordination and decision making. Examples of such systems include sensor networks, Internet-of-Things (IoT) systems, multiagent systems, high-performance computing clusters, and federated computing systems. One defining characteristic of many of these distributed systems is the continuous gathering of new data samples by the individual system components (e.g., motes, robots, IoT devices, cell towers, GPU nodes, etc.). Several use cases for these systems, ranging from smart agriculture and smart homes to smart grids and smart transportation, are being envisioned that extract actionable information in real time from the incoming distributed "data streams" through the adoption of sophisticated machine learning techniques. But the world's growing appetite for data, coupled with the price and capacity projections for sensing, storage, computation, and bandwidth, points to a near future in which (affordable) bandwidth capacity will start lagging behind the rate at which distributed systems gather new data samples. Such a future does not bode well for distributed systems that are expected to rely on machine learning advances for effective decision making. As such, it is crucial to develop distributed learning strategies that accommodate high volumes of data while operating over (relatively) low-throughput communication links. This project addresses this challenge and delivers a comprehensive set of analytical and algorithmic frameworks for communications-aware and optimization-based distributed machine learning from (possibly corrupted) data arriving in the form of (extremely) fast streams at multiple interconnected entities. In doing so, the project directly benefits the national economy through advances in the state of the art in distributed systems, which will lead to reductions in both energy costs and waste, increases in industrial efficiency, better containment of environmental disasters, efficient monitoring of the nation's infrastructure, etc. Further, this project will help address the shortage of talent in the critical areas of machine learning and data science by training two graduate students and several undergraduate students.
This project develops and analyzes an algorithmic framework for real-time, in-network machine learning that acknowledges and accounts for the mismatch between the communications rate and the rate of distributed data streams in many emerging applications, where continuous data gathering is cheap and communication takes place over infrastructure-free device-to-device and/or machine-to-machine links. The investigator formalizes this setting as a distributed stochastic approximation problem, in which the optimal machine learning model is iteratively trained using the random data streaming into individual devices and machines. The research then focuses on the design and analysis of collaborative strategies that operate in the regime of (extremely) fast streaming rates. These strategies account for the topology of the network, the severity of the mismatch between the communications and data streaming rates, and the convexity and structure of the learning problem. Further, they account for the challenges of real-world networks and data gathering, including intermittent communication links, heterogeneous data modalities, correlated data streams, and corrupt or missing data. The result is a comprehensive set of techniques and analyses that provide fast, reliable learning and a thorough understanding of network learning performance.
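To make the distributed stochastic approximation setting concrete, the following minimal sketch (a hypothetical illustration, not drawn from the award itself) shows one way such a scheme could operate under a rate mismatch: each node folds all samples that stream in between two communication rounds into a single local mini-batch step, then averages its iterate with its neighbors over the slower network. The least-squares loss, the gossip matrix W, and the rate ratio "batch" are all illustrative assumptions.

    import numpy as np

    def grad(x, sample):
        # Per-sample stochastic gradient of an illustrative least-squares
        # loss 0.5 * (a @ x - b)**2; any differentiable loss could be used.
        a, b = sample
        return (a @ x - b) * a

    def distributed_sa_round(models, streams, W, step, batch):
        # One round of distributed stochastic approximation (hypothetical
        # sketch). models[i] is node i's iterate; streams[i] yields the
        # samples streaming into node i; W is a doubly stochastic gossip
        # matrix matching the network topology; batch is the number of
        # samples arriving per communication round, i.e., the
        # streaming-to-communications rate ratio.
        n = len(models)
        local = []
        for i, x in enumerate(models):
            # Fast time scale: fold every sample that arrived since the
            # last communication round into one mini-batch gradient step.
            g = np.mean([grad(x, next(streams[i])) for _ in range(batch)],
                        axis=0)
            local.append(x - step * g)
        # Slow time scale: neighbors exchange and average their iterates.
        return [sum(W[i, j] * local[j] for j in range(n)) for i in range(n)]

    # Example usage: two nodes, fully connected, each streaming noisy
    # linear measurements of the same unknown model x_true.
    rng = np.random.default_rng(0)
    x_true = np.ones(3)

    def stream():
        while True:
            a = rng.standard_normal(3)
            yield a, a @ x_true + 0.01 * rng.standard_normal()

    models = [np.zeros(3), np.zeros(3)]
    streams = [stream(), stream()]
    W = np.array([[0.5, 0.5], [0.5, 0.5]])
    for _ in range(200):
        models = distributed_sa_round(models, streams, W, step=0.05, batch=10)

The ratio "batch" captures the rate mismatch that the project targets: the faster the data streams relative to the network, the more samples each node must fold into a single local update before it gets a chance to communicate.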
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.