Policy-Space Response Oracles

In multi-agent learning, reinforcement learning, and game theory, Policy-Space Response Oracles^[1] (PSRO) is a family of multi-agent learning algorithms for training agents in two-player, zero-sum, imperfect-information (partially observable), extensive-form (stochastic-form) games, using deep reinforcement learning as an approximate best-response operator. Knowledge of all players' payoffs is required (complete information). PSRO unifies, and is heavily influenced by, earlier algorithms such as Double Oracle^[2] (DO) and fictitious play (FP). It can be considered a framework of algorithms that, under certain parameterizations, is equivalent to those earlier algorithms. PSRO is closely related to empirical game-theoretic analysis (EGTA).

The multi-agent problem setting involves agents learning to interact with others in a shared environment. PSRO works by iteratively training new policies against past opponent policies (so-called "self-play"). A key property of PSRO is that the distribution over policies it finds provably converges to a normal-form Nash equilibrium (NE) under certain parameterizations. In two-player, zero-sum games, an NE cannot be exploited by any other policy, which makes it a particularly suitable solution concept in this setting.
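The sense in which an NE "cannot be exploited" can be illustrated on a small matrix game. The following is a minimal sketch, not part of the PSRO paper: the helper names `expected_payoff` and `exploitability` are hypothetical, and the payoff matrix is rock-paper-scissors from the row player's point of view. Exploitability is taken here to be the sum of what each player could gain by switching to a best response.

```python
# Illustrative sketch; function names are hypothetical, not from any library.
# `payoff[i][j]` is the row player's payoff; the column player receives its negative.

def expected_payoff(payoff, row_mix, col_mix):
    """Row player's expected payoff when both players use mixed strategies."""
    return sum(
        p * q * payoff[i][j]
        for i, p in enumerate(row_mix)
        for j, q in enumerate(col_mix)
    )


def exploitability(payoff, row_mix, col_mix):
    """Sum of both players' best-response gains; zero at a Nash equilibrium."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    value = expected_payoff(payoff, row_mix, col_mix)
    # Best pure deviation for the row player against col_mix.
    best_row = max(
        sum(q * payoff[i][j] for j, q in enumerate(col_mix))
        for i in range(n_rows)
    )
    # Best pure deviation for the column player (payoff is negated) against row_mix.
    best_col = max(
        -sum(p * payoff[i][j] for i, p in enumerate(row_mix))
        for j in range(n_cols)
    )
    return (best_row - value) + (best_col - (-value))


# Rock-paper-scissors: rows and columns are ordered rock, paper, scissors.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(exploitability(RPS, uniform, uniform))          # the uniform profile is the NE
print(exploitability(RPS, always_rock, always_rock))  # both players can gain by deviating
```

In this sketch the uniform profile has zero exploitability, while a pair of "always rock" policies can each be beaten by switching to paper.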

Many interesting games are two-player, zero-sum ("purely competitive"). Notable projects such as AlphaZero^[3] and AlphaStar^[4] make use of this family of algorithms. For other classes of games, such as those with more than two players or general-sum payoffs, PSRO is not guaranteed to converge. Extensions such as JPSRO^[7] are more suitable for these games, but use different solution concepts.

History

PSRO builds on a long line of earlier breakthroughs and algorithms, including the following.

Double Oracle^[2] (DO).

Empirical game-theoretic analysis (EGTA)

Algorithm

PSRO works by iteratively training a new policy against a distribution over the opponent policies found so far. This step of the algorithm is called the best response (BR); it is commonly approximated using reinforcement learning (RL) with function approximation (typically a neural network).

The distribution over opponent policies is determined by a meta-solver (MS), which in turn determines many of the properties of PSRO. For example, with a uniform distribution PSRO is similar to fictitious self-play (FSP), and with the Nash distribution it is similar to Double Oracle.

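The meta-solver computes this distribution from a meta-game: a normal-form game whose strategies are the policies found so far and whose payoffs are the expected returns between each pair of policies.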
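As an illustration of the two meta-solvers mentioned above, the following sketch tabulates a meta-game and solves it either uniformly or with an approximate Nash solver. All names here (`build_meta_game`, `uniform_meta_solver`, `fictitious_play_meta_solver`) are hypothetical. Fictitious play is used only as a simple stand-in for a Nash solver in the zero-sum case; practical implementations may use linear programming instead.

```python
# Illustrative sketch; all function names are hypothetical.
# The meta-game is stored as player 1's payoff matrix (zero-sum: player 2 gets the negative).

def build_meta_game(policies_1, policies_2, expected_return):
    """Tabulate player 1's expected return for every pair of policies."""
    return [[expected_return(p1, p2) for p2 in policies_2] for p1 in policies_1]


def uniform_meta_solver(payoff):
    """Weight every policy found so far equally."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    return [1 / n_rows] * n_rows, [1 / n_cols] * n_cols


def fictitious_play_meta_solver(payoff, iterations=10_000):
    """Approximate a Nash equilibrium of a zero-sum meta-game by fictitious play."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [1] + [0] * (n_rows - 1)  # both players start with their first policy
    col_counts = [1] + [0] * (n_cols - 1)
    for _ in range(iterations):
        # Each player best responds to the opponent's empirical play so far.
        row_values = [
            sum(col_counts[j] * payoff[i][j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        col_values = [
            -sum(row_counts[i] * payoff[i][j] for i in range(n_rows))
            for j in range(n_cols)
        ]
        row_counts[max(range(n_rows), key=row_values.__getitem__)] += 1
        col_counts[max(range(n_cols), key=col_values.__getitem__)] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return (
        [c / row_total for c in row_counts],
        [c / col_total for c in col_counts],
    )


# Toy example: "policies" are the pure actions of rock-paper-scissors.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
meta_game = build_meta_game(range(3), range(3), lambda a, b: RPS[a][b])
print(uniform_meta_solver(meta_game))
print(fictitious_play_meta_solver(meta_game))
```

In a real PSRO run, `expected_return` would typically be estimated by simulating episodes between the two policies rather than read from a table.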

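function expected_return(policy policy₁, policy policy₂) → (float, float) is
    // Estimate each player's expected payoff when policy₁ plays policy₂,
    // for example by averaging over sampled episodes.
    return payoff₁, payoff₂

function meta_solver(matrix payoff₁, matrix payoff₂) → (dist, dist) is
    // Compute a distribution over each player's policies from the
    // meta-game payoffs, for example uniform or Nash.
    return σ₁, σ₂

function PSRO(game g) → (dist, dist), (list[policy], list[policy]) is

    // Initialize.
    Π₁ := {π_random}
    Π₂ := {π_random}
    // Iterate until convergence.
    for i in 1,... do
        // Build the meta-game from every pair of policies found so far.
        payoff₁, payoff₂ := expected_return(π₁, π₂) for all π₁ ∈ Π₁, π₂ ∈ Π₂
        // Solve the meta-game.
        σ₁, σ₂ := meta_solver(payoff₁, payoff₂)
        // Best respond to the opponent's mixture over its policies.
        β₁ := best_response(g, Π₂, σ₂)
        β₂ := best_response(g, Π₁, σ₁)
        // gap: total gain of the best responses over the meta-strategies,
        // where uₚ denotes player p's expected return.
        gap := (u₁(β₁, σ₂) − u₁(σ₁, σ₂)) + (u₂(σ₁, β₂) − u₂(σ₁, σ₂))
        if gap == 0 then
            break
        // Expand the policy sets.
        Π₁ := Π₁ ∪ {β₁}
        Π₂ := Π₂ ∪ {β₂}
    return (σ₁, σ₂), (Π₁, Π₂)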
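The loop above can be sketched end to end on a game small enough to solve exactly. The following is a minimal illustration under strong simplifying assumptions, not the method of the PSRO paper: policies are pure actions of a zero-sum matrix game, the best-response oracle is an exact search over actions rather than reinforcement learning, and the meta-solver is fictitious play standing in for a Nash solver. With these choices the procedure is essentially Double Oracle. All function names are hypothetical.

```python
# Illustrative sketch; names are hypothetical and the oracle is exact, not learned.
# `full_payoff[i][j]` is player 1's payoff; player 2 receives its negative.

def approximate_nash(payoff, iterations=5_000):
    """Approximate Nash meta-solver for a zero-sum matrix game (fictitious play)."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [1] + [0] * (n_rows - 1)
    col_counts = [1] + [0] * (n_cols - 1)
    for _ in range(iterations):
        row_values = [
            sum(col_counts[j] * payoff[i][j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        col_values = [
            -sum(row_counts[i] * payoff[i][j] for i in range(n_rows))
            for j in range(n_cols)
        ]
        row_counts[max(range(n_rows), key=row_values.__getitem__)] += 1
        col_counts[max(range(n_cols), key=col_values.__getitem__)] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return (
        [c / row_total for c in row_counts],
        [c / col_total for c in col_counts],
    )


def psro(full_payoff, max_iterations=20):
    """Grow each player's policy set with best responses until nothing new is found."""
    n_rows, n_cols = len(full_payoff), len(full_payoff[0])
    rows, cols = [0], [0]  # initial policy sets: one arbitrary pure action each
    for _ in range(max_iterations):
        # Meta-game: payoffs restricted to the policies found so far.
        meta_game = [[full_payoff[i][j] for j in cols] for i in rows]
        sigma_rows, sigma_cols = approximate_nash(meta_game)
        # Oracle step: best response in the full game to the opponent's mixture.
        row_values = [
            sum(q * full_payoff[i][j] for q, j in zip(sigma_cols, cols))
            for i in range(n_rows)
        ]
        col_values = [
            -sum(p * full_payoff[i][j] for p, i in zip(sigma_rows, rows))
            for j in range(n_cols)
        ]
        best_row = max(range(n_rows), key=row_values.__getitem__)
        best_col = max(range(n_cols), key=col_values.__getitem__)
        # Stop when neither best response adds a new policy.
        if best_row in rows and best_col in cols:
            break
        if best_row not in rows:
            rows.append(best_row)
        if best_col not in cols:
            cols.append(best_col)
    return (sigma_rows, sigma_cols), (rows, cols)


RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
(sigma_1, sigma_2), (policies_1, policies_2) = psro(RPS)
print(policies_1, policies_2)
print(sigma_1, sigma_2)
```

On rock-paper-scissors, starting from rock alone, this sketch adds paper and then scissors before stopping with a meta-strategy close to uniform. Because the meta-solver is only approximate, the stopping rule checks whether the best responses are already in the policy sets instead of testing for a gap of exactly zero.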

Other extensions

Performance Extensions

Pipeline PSRO^[5]

TODO

References

[1] Lanctot, Marc; Zambaldi, Vinicius; Gruslys, Audrunas; Lazaridou, Angeliki; Tuyls, Karl; Perolat, Julien; Silver, David; Graepel, Thore (2017). "A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning". arXiv:1711.00832 [cs.AI].
[2] McMahan, H. Brendan; Gordon, Geoffrey J.; Blum, Avrim (21 August 2003). "Planning in the presence of cost functions controlled by an adversary". Proceedings of the 20th International Conference on Machine Learning (ICML-03). pp. 536–543. ISBN 9781577351894.
[3] Silver, David; Hubert, Thomas; Schrittwieser, Julian; Antonoglou, Ioannis; Lai, Matthew; Guez, Arthur; Lanctot, Marc; Sifre, Laurent; Kumaran, Dharshan; Graepel, Thore; Lillicrap, Timothy; Simonyan, Karen; Hassabis, Demis (7 December 2018). "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play". Science. 362 (6419): 1140–1144. Bibcode:2018Sci...362.1140S. doi:10.1126/science.aar6404. PMID 30523106.
[4] Vinyals, Oriol; Babuschkin, Igor; Czarnecki, Wojciech M.; et al. (30 October 2019). "Grandmaster level in StarCraft II using multi-agent reinforcement learning". Nature. 575 (7782): 350–354. Bibcode:2019Natur.575..350V. doi:10.1038/s41586-019-1724-z. PMID 31666705.
[5] McAleer, Stephen; Lanier, John; Fox, Roy; Baldi, Pierre (2020). "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games". arXiv:2006.08555 [cs.GT].
[6] McAleer, Stephen; Lanier, John; Baldi, Pierre; Fox, Roy (2021). "XDO: A Double Oracle Algorithm for Extensive-Form Games". arXiv:2103.06426 [cs.GT].
[7] Marris, Luke; Muller, Paul; Lanctot, Marc; Tuyls, Karl; Graepel, Thore (2021). "Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers". arXiv:2106.09435 [cs.MA].
[8] Muller, Paul; Omidshafiei, Shayegan; Rowland, Mark; Tuyls, Karl; Perolat, Julien; Liu, Siqi; Hennes, Daniel; Marris, Luke; Lanctot, Marc; Hughes, Edward; Wang, Zhe; Lever, Guy; Heess, Nicolas; Graepel, Thore; Munos, Remi (2019). "A Generalized Training Approach for Multiagent Learning". arXiv:1909.12823 [cs.MA].

This article "Policy-Space Response Oracles" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Policy-Space Response Oracles. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
