The project is focused on developing coordination policies for large-scale multi-agent systems operating in uncertain environments through the use of multi-agent reinforcement learning (MARL). Existing MARL techniques do not scale well. This research addresses the scaling issue by using coordination technology to "coordinate" the learning of individual agents so as to speed up convergence and lead to learned policies that better reflect overall system objectives. This novel idea is being implemented using an emergent, low-overhead supervisory organization that exploits non-local information to dynamically coordinate and shape the learning processes of individual agents while still allowing agents to react autonomously to local feedback. A key question is how to automate the development of the supervisory control process (including supervisory information generation and organization formation). One approach to automation is to use a formal model of interactions among agents, together with a model of global system objectives and of the agents' policy space, to derive the information necessary for appropriate supervisory control. Another approach is to formulate the supervision problem as a distributed constraint optimization problem. The results of this work provide a necessary component for the development of a wide variety of next-generation adaptive applications, such as smart power grids, cloud computing, and large-scale sensor networks. The broader impact stems from the wide applicability of the resulting learning technology for distributed control, undergraduate and graduate educational activities at UMass, dissemination efforts that make the experimental domain and algorithms publicly available, and the development of international collaborations.
Cooperative multi-agent systems (CMAS) are finding applications in a wide variety of next-generation adaptive applications, such as smart power grids, cloud computing, large-scale sensor networks, autonomic computing, and disaster management. A CMAS consists of a group of autonomous software agents that are distributed and interact with one another using limited communication bandwidth in order to optimize global performance. A central challenge in building such systems is to design distributed coordination policies that define what actions local agents should take in the context of the state and actions of other agents. Computing optimal policies offline is infeasible for complex systems with unknown environment characteristics that involve tens to thousands of agents with limited communication bandwidth and partial views of the whole system. Multi-agent reinforcement learning (MARL) potentially provides an attractive approximate approach for agents to incrementally develop effective coordination policies, allowing agents to adapt their behaviors to the dynamics of an uncertain and evolving environment. However, existing MARL techniques do not scale effectively as the number of agents increases, because of communication overhead or poor overall performance in terms of the likelihood of convergence to stable policies, the time required for this convergence, and the quality of the learned policies as measured against overall system performance characteristics.

The novel approach explored in this research project is the use of a supervisory software agent organization, with acceptable levels of computation and communication overhead, that exploits non-local information to dynamically coordinate and shape the learning processes of individual learning agents while still allowing agents to react autonomously to local feedback. A key idea motivating this work is interaction sparsity (also called near-decomposability), where agents interact strongly with only a small group of closely related partners (by proximity, similarity, or some task-based measure). This leads to promising opportunities to summarize agent interactions in a compact way. In addition, it may not be necessary to reason about individual interactions but rather about some aggregate interaction effect over groups of anonymous agents that is maintained by a regional supervisory agent.

The three main research explorations focused on: 1) What type of information should a supervisory agent provide to assist the learning of a local agent? 2) Can agents share the intermediate results of their learning with other agents to speed up overall learning in the network? 3) Can supervisory control be implemented in a way that does not require detailed knowledge of the application domain? In all these studies, the result was a significant increase in the speed of learning and, in some cases, an improvement in the quality of the learned policies as well. We demonstrated the effectiveness of this approach on a large-scale distributed task allocation problem with hundreds of agents operating in an unknown environment.

In the first exploration, an adaptive supervisory control mechanism was developed that used both action-shaping and reward-shaping to influence local agent learning. Action-shaping biased agent learning so that the different action options an agent had at its disposal to react to its specific local state were not explored uniformly during learning. In reward-shaping, the learning process used a modified version of the reward that the agent received from the environment when taking a specific action, rather than the actual reward itself. Both of these shaping signals, and their relative importance, were generated dynamically based on the supervisor's view of the evolving state of the group of agents under its control.
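The precise form of the shaping signals is not given in this summary. As an illustration only, the following Python sketch shows one plausible way a supervisor's action-shaping bias and reward-shaping term could enter a standard tabular Q-learning update; the class and parameter names, the weighting scheme, and the form of the supervisor's signals are assumptions, not the project's actual formulation.

```python
import random
from collections import defaultdict

class ShapedQLearner:
    """Tabular Q-learning agent whose exploration and reward are biased by a
    supervisor. Names and signal forms are illustrative assumptions only."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)                   # Q[(state, action)]
        self.action_bias = defaultdict(lambda: 1.0)   # supervisor action-shaping weights
        self.reward_shaping = defaultdict(float)      # supervisor reward-shaping terms
        self.shaping_weight = 0.5                     # how strongly to trust the supervisor

    def receive_supervision(self, action_bias, reward_shaping, weight):
        """Periodic directive from the regional supervisor (hypothetical interface)."""
        self.action_bias.update(action_bias)
        self.reward_shaping.update(reward_shaping)
        self.shaping_weight = weight

    def select_action(self, state):
        # Action-shaping: exploration samples actions in proportion to the
        # supervisor's bias instead of uniformly.
        if random.random() < self.epsilon:
            weights = [self.action_bias[(state, a)] for a in self.actions]
            return random.choices(self.actions, weights=weights)[0]
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, env_reward, next_state):
        # Reward-shaping: the update mixes the environment reward with the
        # supervisor-provided shaping term.
        r = env_reward + self.shaping_weight * self.reward_shaping[(state, action)]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] += self.alpha * (r + self.gamma * best_next - self.q[(state, action)])
```

In this simplified view, the supervisor never overrides the agent's greedy choice; it only reweights exploration and adjusts the learning signal, so the agent still reacts autonomously to local feedback.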
In the second exploration, we developed an approach that adaptively identifies opportunities to periodically transfer experiences among agents in a large network of reinforcement learning agents. This algorithm operates in an on-line, distributed manner, using supervisor-directed transfer, leading to more rapid acquisition of appropriate policies. Our method constructs high-level characterizations of the system, called contexts, and uses them to identify which agents operate under approximately similar dynamics. A set of supervisory agents compute and reason over contextual similarity between agents, identifying candidates for experience sharing, or co-learning (a sketch of this grouping step appears below). Using a tiered architecture, state, action, and reward tuples are propagated among the members of these co-learning groups.

In the third exploration, a domain-independent approach to building supervisory control was developed based on distributed constraint optimization (DCOP) techniques. This approach uses an interaction measure that allows each agent to dynamically identify its beneficial coordination set (i.e., whom to coordinate with) in different situations and to trade off performance against communication cost. By limiting their coordination sets, agents dynamically decompose the coordination network in a distributed way, resulting in dramatically reduced communication for DCOP algorithms without significantly affecting overall learning performance.
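For the second exploration, contexts and co-learning groups are described here only at a high level. The sketch below is a minimal illustration, assuming cosine similarity over context feature vectors and a fixed similarity threshold; the function names, the greedy grouping strategy, and the pooled-buffer form of sharing are hypothetical, not the algorithm developed in the project.

```python
import numpy as np

def form_colearning_groups(contexts, threshold=0.9):
    """Group agents whose contexts (feature vectors summarizing local dynamics)
    are approximately similar; greedy single-pass clustering by cosine
    similarity. Threshold and features are illustrative assumptions."""
    groups, centroids = [], []
    for agent_id, ctx in contexts.items():
        ctx = np.asarray(ctx, dtype=float)
        placed = False
        for group, centroid in zip(groups, centroids):
            sim = ctx @ centroid / (np.linalg.norm(ctx) * np.linalg.norm(centroid) + 1e-9)
            if sim >= threshold:
                group.append(agent_id)
                centroid[:] = (centroid * (len(group) - 1) + ctx) / len(group)  # running mean
                placed = True
                break
        if not placed:
            groups.append([agent_id])
            centroids.append(ctx.copy())
    return groups

def share_experience(groups, experience_buffers):
    """Propagate (state, action, reward, next_state) tuples among members of
    each co-learning group so that agents can learn from their peers."""
    shared = {}
    for group in groups:
        pooled = [t for agent_id in group for t in experience_buffers[agent_id]]
        for agent_id in group:
            shared[agent_id] = pooled
    return shared
```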
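For the third exploration, the specific interaction measure is not detailed in this summary. The following sketch assumes a simple gain-versus-cost measure and a per-agent communication budget to show, in outline, how an agent might prune its coordination set before running a DCOP algorithm; the quantities and the greedy selection rule are illustrative assumptions.

```python
def select_coordination_set(neighbors, interaction_gain, comm_cost, budget):
    """Choose which neighbors to coordinate with. interaction_gain[n] stands in
    for the interaction measure (expected utility loss from ignoring neighbor n);
    comm_cost[n] is the messaging cost of including n. Greedy selection under a
    communication budget; all quantities are hypothetical."""
    ranked = sorted(neighbors,
                    key=lambda n: interaction_gain[n] / max(comm_cost[n], 1e-9),
                    reverse=True)
    chosen, spent = [], 0.0
    for n in ranked:
        if interaction_gain[n] <= 0:
            break                      # remaining neighbors offer no benefit
        if spent + comm_cost[n] > budget:
            continue                   # skip neighbors that exceed the budget
        chosen.append(n)
        spent += comm_cost[n]
    return chosen
```

In this simplified view, restricting each agent's coordination set to its highest-gain neighbors is what decomposes the coordination network and reduces DCOP message traffic.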