Hasso-Plattner-Institut
Prof. Dr. Holger Giese
 

Robust Multi-Agent RL for Self-Adaptive Systems

Context

Recently, AlphaGo Zero [Silver et al. 2017] learned to win the famous game of Go without supervision. Instead, it learned by playing against previous versions of itself. Moreover, as the blue and purple curves here show, this self-play approach achieved higher performance than the previous version of AlphaGo, the one that defeated the world champion.

In another recent advance, Google [Mirhoseini et al. 2021] announced that a reinforcement learning agent was able to design chip layouts (floorplans) for its TPU accelerators, a feat that had been beyond the capacity of existing automated tools.

Real-World Challenges

However, industry surveys report that between 55% and 72% of companies are unable to deploy their AI systems [Algorithmia 2020][Capgemini 2020].

Probable Reason

Current systems cannot adapt to more complex and evolving realities, realities that act like strong adversaries against these AI systems. This inability to adapt is a lack of robustness in these AI systems [Jordan 2019][D’Amour et al. 2020].

Contact

For more information, please contact Christian M. Adriano.

 

    The Nature of the Problem

    Before we search for solutions, let’s try to understand the nature of the difficulty of deploying AI systems. It has two dimensions:

    • how structured the governing laws are, for instance, well-constrained as in quadrants 1 and 2, which correspond to the artificial laws of games or the physical laws of circuit design, and
    • how frequent the events are, for instance, high-frequency as in quadrants 1 and 4 or sparse as in quadrants 2 and 3.

    Many solutions cover quadrants 1, 2, and 4, but quadrant 3 (adversarial laws and sparse events) remains under-explored because it is more challenging. For the last two years, our group has been exploring exactly this quadrant.

     

    Our Approach

    We are extending the standard reinforcement learning architecture [Sutton & Barto 2018]. Our approach is to model the problem as a POMDP (partially observable Markov decision process). The environment and the agent are loosely coupled by an integration layer that decides which internal states are exposed (for this we use a Hidden Markov Model). The environment remains responsible for transitioning its states via its MDP and for calculating the stochastic rewards that correspond to the actions taken by the agent.
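    The sketch below illustrates this coupling under simplifying assumptions: a small discrete MDP for the environment and an HMM-style emission matrix for the integration layer. The class and variable names (HiddenEnvironment, IntegrationLayer, emission matrix B) are illustrative, not the project's actual API.

import numpy as np

class HiddenEnvironment:
    """Environment that transitions hidden failure-mode states via its own MDP
    and computes stochastic rewards for the agent's actions."""
    def __init__(self, n_states, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # P[a, s] is the distribution over next states for action a in state s
        self.P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
        self.R = rng.normal(size=(n_states, n_actions))  # mean reward per (state, action)
        self.state = 0

    def step(self, action):
        self.state = np.random.choice(self.P.shape[-1], p=self.P[action, self.state])
        reward = np.random.normal(self.R[self.state, action], 0.1)  # stochastic reward
        return self.state, reward

class IntegrationLayer:
    """HMM-style emission layer: decides which observation the agent sees
    for a given hidden environment state."""
    def __init__(self, n_states, n_obs, seed=1):
        rng = np.random.default_rng(seed)
        self.B = rng.dirichlet(np.ones(n_obs), size=n_states)  # emission probabilities

    def observe(self, hidden_state):
        return np.random.choice(self.B.shape[1], p=self.B[hidden_state])

# The agent only ever receives observations, never the hidden state itself.
env, layer = HiddenEnvironment(n_states=5, n_actions=3), IntegrationLayer(n_states=5, n_obs=4)
hidden_state, reward = env.step(action=1)
observation = layer.observe(hidden_state)

    The point of this split is that the agent never reads the hidden state directly; every piece of information it receives passes through the integration layer's observations.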

    Meanwhile, on the agent side we have a simulator that lets the agent hallucinate (predict) the behavior of the environment, which has been shown to be more effective for planning actions, particularly when not all states are visible.
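    A minimal sketch of such an agent-side simulator follows: a small PyTorch dynamics model predicts the next observation and reward, and a random-shooting planner scores short imagined rollouts. Both the network architecture and the planner are illustrative assumptions rather than the project's actual implementation.

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Learned simulator: predicts the next observation and reward from (observation, action)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim + 1))  # outputs next observation and reward

    def forward(self, obs, action):
        one_hot = nn.functional.one_hot(action, self.n_actions).float()
        out = self.net(torch.cat([obs, one_hot], dim=-1))
        return out[..., :-1], out[..., -1]  # (next_obs, reward)

def plan(model, obs, horizon=5, n_candidates=32):
    """Random-shooting planning: hallucinate rollouts for random action
    sequences and return the first action of the best-scoring sequence."""
    candidates = torch.randint(model.n_actions, (n_candidates, horizon))
    returns = torch.zeros(n_candidates)
    sim_obs = obs.expand(n_candidates, obs.shape[-1]).clone()
    with torch.no_grad():
        for t in range(horizon):
            sim_obs, reward = model(sim_obs, candidates[:, t])
            returns += reward
    return candidates[returns.argmax(), 0].item()

    In practice the dynamics model is first fitted on transitions collected from the real environment; plan(model, current_observation) then returns the greedy first action of the best imagined trajectory.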

    As a case study, we adopted an e-commerce platform for online shops, whose states represent component failure modes and whose actions repair or optimize the failing components (see the sketch after the technology stack below).

    • Platform: E-Commerce for online shops (mRubis [Vogel et al. 2018])
    • States/Observations: Component failure modes
    • Actions: Self-Repair and Self-Optimization

    Technology stack: PyTorch, OpenAI Gym, GraphQL
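    To make the case study concrete, here is a minimal sketch of how a single shop could be exposed through the OpenAI Gym interface. The number of components, failure probabilities, and reward values are placeholder assumptions and do not reflect the actual mRubis failure model or utility function.

import gym
import numpy as np
from gym import spaces

class ShopEnv(gym.Env):
    """One online shop: observations are per-component failure flags,
    actions repair or optimize a chosen component."""
    N_COMPONENTS = 6  # placeholder, not the real mRubis component count

    def __init__(self):
        self.observation_space = spaces.MultiBinary(self.N_COMPONENTS)
        # action = (component index, 0 = self-repair / 1 = self-optimization)
        self.action_space = spaces.MultiDiscrete([self.N_COMPONENTS, 2])
        self.failures = np.zeros(self.N_COMPONENTS, dtype=np.int8)

    def reset(self):
        self.failures = np.random.binomial(1, 0.3, self.N_COMPONENTS).astype(np.int8)
        return self.failures.copy()

    def step(self, action):
        component, mode = action
        if mode == 0 and self.failures[component]:        # successful self-repair
            self.failures[component] = 0
            reward = 1.0
        elif mode == 1 and not self.failures[component]:  # optimize a healthy component
            reward = 0.5
        else:                                             # wasted action
            reward = -0.1
        # the (possibly adversarial) environment may inject new failures
        self.failures |= np.random.binomial(1, 0.05, self.N_COMPONENTS).astype(np.int8)
        return self.failures.copy(), reward, False, {}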

    We will build on top of previous work [Ghahremani, Adriano & Giese 2018] and on projects on Reinforcement Learning for Self-Adaptive Systems and Machine Learning-Based Control.

    Model-Based Reinforcement Learning Architecture where the Agent has local Models of the Environment.

    Multi-Agent Architecture

    We are developing a multi-agent architecture [Janer et al. 2021][Chen et al. 2021], where agents tend to multiple shops. This has two advantages: specialization and modularity. Specialization allows each agent to manage a set of shops that share the same utility model. Modularity allows partitioning the large state space into smaller sets that can be efficiently monitored for adversarial changes, while still allowing agents to coordinate the information necessary to cope with these changes.
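    The sketch below illustrates the partitioning idea behind specialization and modularity: shops are grouped by utility model, each agent specializes on one group, and agents exchange only a compact summary for coordination. All names, the shop identifiers, and the content of the coordination message are illustrative assumptions.

from collections import defaultdict

def partition_shops(shops, utility_model_of):
    """Group shops by utility model so that each agent receives one homogeneous partition."""
    partitions = defaultdict(list)
    for shop in shops:
        partitions[utility_model_of(shop)].append(shop)
    return partitions

class ShopAgent:
    """Agent that tends to one partition of shops and monitors it locally."""
    def __init__(self, shops):
        self.shops = shops        # the partition this agent specializes on
        self.failure_rates = {}   # locally monitored statistics per shop

    def summary(self):
        """Compact message exchanged with other agents for coordination."""
        mean_rate = (sum(self.failure_rates.values()) / len(self.failure_rates)
                     if self.failure_rates else 0.0)
        return {"n_shops": len(self.shops), "mean_failure_rate": mean_rate}

# Hypothetical wiring: three shops, all sharing the same utility model.
all_shops = ["shop-a", "shop-b", "shop-c"]
utility_model_of = lambda shop: "default-utility"
agents = [ShopAgent(partition) for partition in partition_shops(all_shops, utility_model_of).values()]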


    Project Scope

    Finally, we have a set of clear goals for the project that we will tackle incrementally. The first is reward sparsity, which is caused by large state spaces; then come distribution shift, concept drift, and domain shift, each with its corresponding causes. For each of these goals we will apply state-of-the-art techniques, as exemplified here.
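    As one deliberately simple example of such a technique, the sketch below flags a possible distribution shift by comparing the observation frequencies of a recent window against a reference window; the smoothing, the KL-divergence measure, and the threshold are illustrative choices, not the project's final design.

import numpy as np

def smoothed_histogram(samples, n_bins, eps=1e-6):
    """Frequencies of discrete observations with additive smoothing."""
    counts = np.bincount(np.asarray(samples), minlength=n_bins).astype(float)
    return (counts + eps) / (counts.sum() + eps * n_bins)

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

def shift_detected(reference_obs, recent_obs, n_bins, threshold=0.5):
    """Flag a possible distribution shift between two observation windows."""
    p = smoothed_histogram(reference_obs, n_bins)
    q = smoothed_histogram(recent_obs, n_bins)
    return kl_divergence(q, p) > threshold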

     

     

    References