Ad-Hoc Teamwork (AHT) in Reinforcement Learning

Problem Setting & Motivation

Ad-hoc teamwork concerns agents that must cooperate effectively with previously unknown teammates without any prior coordination, communication protocol, or shared learning history. This is the multi-agent analogue of zero-shot generalization: the agent is dropped into a team and must immediately contribute productively. The setting departs from self-play / centralized training in that the partner distribution at test time is exogenous and possibly non-stationary.

The canonical motivating example is a robot soccer player that must join a team of unknown robots from different labs and play a match — no shared codebase, no joint training, no communication channel.

Stone et al. (2010) — Foundational Definition

“To create a good ad hoc team player, an agent must be able to assess the capabilities of other agents, both in terms of what tasks they can accomplish and how well they coordinate with one another, and then to alter its own behavior accordingly.” ~ Stone, Kaminka, Kraus, Rosenschein (AAAI 2010)

The original paper, Ad Hoc Autonomous Agent Teams: Collaboration without Pre-Coordination, frames the challenge as a research agenda, not a single algorithm. Three core capabilities are required:

  1. Teammate modeling — infer what other agents will do.
  2. Behavior adaptation — best-respond to inferred policies.
  3. Influence — proactively guide the team toward better joint behavior (via demonstrations, leadership, signaling).

The “challenge problem” framing matters: AHT is not a single MDP but a family of coordination problems parameterized by the partner distribution.

[!note] Why AHT is not just MARL
Standard cooperative MARL trains a fixed team via centralized training / decentralized execution (CTDE). The trained policies are tightly co-adapted — they fail catastrophically when paired with unseen partners. AHT explicitly requires generalization across the partner space. See Zero-Shot Coordination for the closely related problem under symmetry assumptions.


Formalism

The AHT Decision Process

We formalize AHT as a decentralized POMDP with unknown teammate types. Let $N$ agents act in environment $\mathcal{E} = \langle S, \{A_i\}, T, R, \gamma \rangle$. The ad-hoc agent (the “learner”, index $1$) controls $\pi_1$. Teammates $2, \dots, N$ have policies $\pi_{-1} = (\pi_2, \dots, \pi_N)$ drawn from a partner distribution $\mathcal{P}$ over policy profiles.

$$ \pi_1^* = \arg\max_{\pi_1} \; \mathbb{E}_{\pi_{-1} \sim \mathcal{P}} \left[ \mathbb{E}_{\tau \sim (\pi_1, \pi_{-1})} \sum_t \gamma^t r_t \right] $$

The crucial twist: the learner does not observe the type of $\pi_{-1}$ at test time. It must infer it from interaction history $h_t = (o_0, a_0, \dots, o_t)$.
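
For concreteness, here is a minimal Monte Carlo sketch of the objective above: sample partner profiles from $\mathcal{P}$, roll out the joint policy, and average discounted returns. The helpers `sample_partner` and `rollout` are hypothetical stand-ins, not taken from any particular paper.

```python
import numpy as np

def estimate_aht_objective(pi_1, sample_partner, rollout, n_samples=100, gamma=0.99):
    """Monte Carlo estimate of E_{pi_-1 ~ P}[ E_tau[ sum_t gamma^t r_t ] ].

    sample_partner() -> one teammate policy profile pi_{-1} drawn from P (assumed helper)
    rollout(pi_1, pi_rest) -> list of per-step rewards from one simulated episode (assumed helper)
    """
    returns = []
    for _ in range(n_samples):
        pi_rest = sample_partner()                     # pi_{-1} ~ P
        rewards = rollout(pi_1, pi_rest)               # run the joint policy for one episode
        discounts = gamma ** np.arange(len(rewards))
        returns.append(float(np.dot(discounts, rewards)))
    return float(np.mean(returns))
```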

Type-Based Reasoning

Assume teammate policies come from a set of types $\Theta$, each with a known policy $\pi_\theta$ and prior $P(\theta)$. The learner maintains a posterior over types from the teammates' observed actions:

$$ P(\theta \mid h_t) \propto P(\theta) \prod_{k=0}^{t-1} \pi_\theta(a_k^{-1} \mid o_k) $$

and acts to maximize expected return under this belief — this is the Bayes-optimal AHT policy, conceptually a BAMDP (Bayes-Adaptive MDP) extended to multi-agent settings.
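
A minimal numpy sketch of this belief update, assuming a finite type set with known type-conditioned policies (all names here are illustrative):

```python
import numpy as np

def update_type_posterior(log_belief, teammate_action, obs, type_policies):
    """One Bayesian update of P(theta | h_t), done in log space for numerical stability.

    log_belief:    array of log P(theta | h_{t-1}), one entry per type
    type_policies: list of callables pi_theta(action, obs) -> probability of that action
    """
    log_lik = np.array([np.log(pi(teammate_action, obs) + 1e-12) for pi in type_policies])
    log_post = log_belief + log_lik        # log prior + log likelihood (unnormalized)
    log_post -= log_post.max()             # shift before exponentiating
    post = np.exp(log_post)
    return np.log(post / post.sum())       # renormalized posterior, back in log space
```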

| Aspect | Single-Agent BAMDP | AHT (Type-Based) |
| --- | --- | --- |
| Hidden variable | MDP parameters $\theta$ | Teammate policy $\pi_{-1}$ |
| Belief update | Bayes on transition data | Bayes on teammate actions |
| Optimal policy | Information-gathering balanced with exploitation | Same, plus influence on the teammate |
| Tractability | Exponential in horizon | Compounded by multi-agent state |

[!tip] Connection to your ToM work
Type inference here is essentially Theory of Mind cast as approximate Bayesian inference. The strategic ToM benchmarks you’ve worked on test exactly this capacity: maintaining and updating beliefs about partners’ cognitive states from interaction. The AHT literature gives you a decision-theoretic objective (regret vs. Bayes-optimal) that strategic-ToM evaluation often lacks.

Optimality and Regret

Performance is typically measured as regret against the best response, in expectation over the partner distribution:

$$ \text{Reg}(\pi_1, \mathcal{P}) = \mathbb{E}_{\pi_{-1} \sim \mathcal{P}}\left[ V^*(\pi_{-1}) - V^{\pi_1}(\pi_{-1}) \right] $$

where $V^*(\pi_{-1})$ is the value of the best response to $\pi_{-1}$. Note that this is not the same as joint-optimal welfare — AHT cannot demand that the teammate play optimally; it accepts the teammate’s policy as exogenous and best-responds.

This decouples AHT from Cooperative Equilibrium Selection: the learner is solving a single-player decision problem where uncertainty lives over partner policies, not a joint optimization over the team.
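
A minimal sketch of estimating this regret over a finite sample of partner profiles, assuming best-response values are available (the function names are illustrative):

```python
import numpy as np

def estimate_regret(pi_1, partners, best_response_value, value):
    """Average best-response regret of pi_1 over sampled partner profiles.

    best_response_value(pi_rest) -> V*(pi_rest), value of the best response to pi_rest
    value(pi_1, pi_rest)         -> value pi_1 actually achieves when paired with pi_rest
    """
    gaps = [best_response_value(p) - value(pi_1, p) for p in partners]
    return float(np.mean(gaps))
```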


Algorithmic Approaches

PLASTIC (Barrett & Stone, 2015)

Planning and Learning to Adapt Swiftly to Teammates to Improve Collaboration. Two variants:

  • PLASTIC-Model: Maintains a library of pretrained teammate models $\{\hat{\pi}_\theta\}$. At test time, performs Bayesian model selection over the library and uses Monte Carlo Tree Search / model-based planning against the believed model.
  • PLASTIC-Policy: Skips the explicit teammate model. Pretrains a library of best-response policies $\{\pi_1^\theta\}$, one per teammate type. At test time, identifies the most likely type and executes the corresponding precomputed best response.
# PLASTIC-Policy sketch (uniform_prior, best_responses, likelihood, normalize are assumed helpers)
beliefs = uniform_prior(types)                       # prior over teammate types
obs = env.reset()
for t in range(T):
    theta_hat = max(beliefs, key=beliefs.get)        # most probable type under current belief
    a = best_responses[theta_hat](obs)               # precomputed best response for that type
    obs_next, r, teammate_actions = env.step(a)
    # Bayesian update using observed teammate actions
    for theta in types:
        beliefs[theta] *= likelihood(teammate_actions, theta, obs)
    beliefs = normalize(beliefs)
    obs = obs_next

Drawback: PLASTIC assumes the test-time teammate is in the type library (or close to it). Performance against out-of-library teammates degrades gracefully only if the library is rich enough. The library is also discrete — no smooth interpolation between types.

AATEAM (Chen et al., 2020)

Attention-based neural network for Ad-hoc TEAMwork. Replaces discrete type identification with a learned attention mechanism over a set of pretrained policy embeddings. Trained end-to-end with a teammate-prediction auxiliary loss. Conceptually a soft, differentiable PLASTIC.
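
A minimal PyTorch sketch of the general idea (soft attention over a bank of type embeddings that mixes per-type best-response logits); this is a schematic stand-in, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SoftTypeAttention(nn.Module):
    """Illustrative soft-attention mixture over pretrained teammate-type embeddings
    (a conceptual sketch, not AATEAM's actual network)."""
    def __init__(self, n_types, embed_dim, n_actions):
        super().__init__()
        self.type_embeddings = nn.Parameter(torch.randn(n_types, embed_dim))   # one embedding per type
        self.query = nn.Linear(embed_dim, embed_dim)
        self.br_heads = nn.ModuleList([nn.Linear(embed_dim, n_actions) for _ in range(n_types)])

    def forward(self, history_embedding):                                  # (batch, embed_dim)
        q = self.query(history_embedding)
        attn = torch.softmax(q @ self.type_embeddings.T, dim=-1)           # (batch, n_types)
        per_type_logits = torch.stack(
            [head(history_embedding) for head in self.br_heads], dim=1)    # (batch, n_types, n_actions)
        return (attn.unsqueeze(-1) * per_type_logits).sum(dim=1)           # attention-weighted policy logits
```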

LIAM (Papoudakis, Christianos & Albrecht, NeurIPS 2021)

Local Information Agent Modelling. Learns teammate representations using only the learner’s local observations and actions — no centralized teammate-action observation at training. Uses a variational encoder–decoder where:

  • Encoder $q_\phi(z_t \mid h_t^{\text{local}})$ produces a teammate embedding from local history.
  • Decoder reconstructs teammate observations and actions during training (centralized).
  • At deployment, only the encoder runs (decentralized).

The crucial CTDE-style trick: privileged information at training, decentralized at execution.

$$ \mathcal{L}_{\text{LIAM}} = \mathbb{E}\left[ \log p_\psi(o_{-1}, a_{-1} \mid z) \right] - \beta \, \text{KL}\left[ q_\phi(z \mid h_1) \,\|\, p(z) \right] + \mathcal{L}_{\text{RL}} $$
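
A simplified PyTorch sketch of this objective; the architecture sizes and GRU encoder are arbitrary choices for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIAMSketch(nn.Module):
    """Schematic LIAM-style encoder-decoder (illustrative only).
    The encoder sees only the learner's local history; the decoder is used
    only during centralized training to reconstruct teammate data."""
    def __init__(self, local_dim, z_dim, teammate_obs_dim, teammate_n_actions):
        super().__init__()
        self.encoder = nn.GRU(local_dim, 64, batch_first=True)
        self.to_mu = nn.Linear(64, z_dim)
        self.to_logvar = nn.Linear(64, z_dim)
        self.decode_obs = nn.Linear(z_dim, teammate_obs_dim)      # reconstruct o_{-1}
        self.decode_act = nn.Linear(z_dim, teammate_n_actions)    # predict a_{-1}

    def forward(self, local_history):                             # (batch, time, local_dim)
        _, h = self.encoder(local_history)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample of z
        return z, mu, logvar

def liam_aux_loss(model, local_history, teammate_obs, teammate_actions, beta=0.1):
    """Reconstruction + KL terms; the RL loss is added separately."""
    z, mu, logvar = model(local_history)
    recon = F.mse_loss(model.decode_obs(z), teammate_obs) \
          + F.cross_entropy(model.decode_act(z), teammate_actions)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL[q(z|h_1) || N(0, I)]
    return recon + beta * kl
```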

LIAM is the strong modern baseline you’ll see compared against in most AHT papers post-2021.

GPL — Graph-based Policy Learning (Rahman et al., ICML 2021)

Targets open AHT where the number and identity of teammates can change mid-episode. Uses a graph neural network to embed the team as a graph with dynamic node sets, producing permutation-invariant and cardinality-invariant policies.

| Property | LIAM | GPL |
| --- | --- | --- |
| Team size | Fixed | Variable |
| Teammate identity | Implicit via embedding | Explicit graph node |
| Open/Closed | Closed | Open |
| Inductive bias | None on relational structure | Permutation-invariant via GNN |
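
A minimal sketch of GPL's key inductive bias: a shared per-node encoder plus masked mean-pooling, which is invariant to teammate ordering and count. This simplified set encoder stands in for the paper's full GNN:

```python
import torch
import torch.nn as nn

class TeamSetEncoder(nn.Module):
    """Permutation- and cardinality-invariant team embedding (simplified stand-in for GPL's GNN)."""
    def __init__(self, node_dim, hidden_dim):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(node_dim, hidden_dim), nn.ReLU(),
                                      nn.Linear(hidden_dim, hidden_dim))

    def forward(self, node_feats, mask):
        # node_feats: (batch, max_agents, node_dim); mask: (batch, max_agents), 1 if agent present
        h = self.node_mlp(node_feats)                 # shared weights => permutation equivariant
        h = h * mask.unsqueeze(-1)                    # zero out agents not currently in the team
        return h.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)   # mean over present agents
```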

ODITS (Gu et al., AAAI 2022)

Learns a latent teammate representation $z$ via an information-bottleneck-style objective: keep $z$ predictive of future return while discarding history detail that is not needed for it:

$$ \max_\phi \; I(z; r_{>t}) - \beta \, I(z; h_t) $$

Empirically reduces overfitting to spurious teammate features (a common AHT failure mode).


Open Ad-Hoc Teamwork

Definition

Standard AHT assumes a fixed set of $N$ agents throughout an episode. Open AHT relaxes this: agents may enter or leave the team dynamically (a robot’s battery dies; a new agent joins mid-episode).

The team composition function $C: t \mapsto 2^{[N_{\max}]}$ is itself a stochastic process the learner must reason over.

This destroys the convenience of fixed-dimension input representations and forces architectures (GPL, transformer-based AHT) that are size- and permutation-invariant. Mou et al. (2024) and follow-ups have begun to characterize the open AHT regret under stochastic agent arrival processes.
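
A toy illustration of such a composition process (the join/leave probabilities are arbitrary), producing a per-step presence mask that a size-invariant encoder like the GPL-style sketch above can consume:

```python
import numpy as np

def sample_team_composition(n_max, horizon, p_join=0.05, p_leave=0.05, rng=None):
    """Toy stochastic team-composition process C(t): each step, absent agents join
    with probability p_join and present agents leave with probability p_leave."""
    if rng is None:
        rng = np.random.default_rng()
    present = np.zeros(n_max, dtype=bool)
    present[0] = True                                  # the ad-hoc learner is always present
    masks = []
    for _ in range(horizon):
        join = rng.random(n_max) < p_join
        leave = rng.random(n_max) < p_leave
        present = (present & ~leave) | join
        present[0] = True
        masks.append(present.copy())
    return np.stack(masks).astype(np.float32)          # (horizon, n_max) presence mask
```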

N-Agent AHT (Wang et al.)

A recent generalization where the learner controls more than one agent: $K$ controlled agents must integrate with $N-K$ unknown ad-hoc teammates. Bridges the gap between full self-play MARL and pure single-agent AHT. Relevant when you have a small “core team” that must absorb arbitrary collaborators.


Relation to Adjacent Problems

AHT vs. Zero-Shot Coordination (ZSC)

ZSC (Hu, Lerer, Peysakhovich, Foerster, ICML 2020) is sometimes conflated with AHT but solves a different problem.

| Feature | AHT | ZSC |
| --- | --- | --- |
| Partner assumption | Arbitrary unknown policies | Independently trained but rational |
| Symmetry | Not assumed | Often exploits problem symmetries |
| Training paradigm | Bayesian / type-based / population | Other-Play, symmetry-breaking |
| Canonical method | PLASTIC, LIAM | Other-Play, off-belief learning |
| Canonical benchmark | Hanabi (sometimes), level-based foraging | Hanabi |

ZSC asks: “How do I train such that pairing with another independent training run yields good cooperation?” — a self-play distribution problem. AHT asks: “How do I act when paired with policies I never trained against?” — a generalization-from-population problem.

The two converge when the partner population in AHT is itself the set of all rational ZSC-trained agents. See Other-Play for the canonical ZSC algorithm.

AHT vs. Opponent Modeling

Opponent modeling (in adversarial / general-sum settings) shares the inference machinery — predict $\pi_{-1}$ from history — but differs in what you do with it:

  • Opponent modeling: Best-respond to potentially adversarial $\pi_{-1}$. The other agent may exploit your model.
  • AHT: Best-respond to cooperative $\pi_{-1}$ — usually no incentive to deceive your model.

The distinction collapses in mixed-motive settings, where AHT becomes Cooperative-Competitive Multi-Agent Learning.

AHT vs. Population-Based Training (PBT)

PBT and Fictitious Co-Play (FCP) (Strouse et al., NeurIPS 2021) are the standard recipes for generating training partners for AHT. FCP specifically argues for training with a diverse population including past checkpoints, not just final-converged policies — because real-world teammates are imperfect.
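
A minimal sketch of the FCP recipe: build the partner pool from early, middle, and final checkpoints of several independent self-play runs, then train the ad-hoc learner against partners sampled from that pool. The `run` and `learner` interfaces here are hypothetical.

```python
import random

def build_fcp_population(self_play_runs):
    """Partner pool with varying skill levels: early, middle, and final checkpoints
    from each independent self-play run (`.checkpoints()` is an assumed interface)."""
    population = []
    for run in self_play_runs:
        ckpts = run.checkpoints()
        population.extend([ckpts[0], ckpts[len(ckpts) // 2], ckpts[-1]])
    return population

def train_against_population(learner, population, env, n_episodes):
    """Best-response training against a diverse partner pool (hypothetical interfaces)."""
    for _ in range(n_episodes):
        partner = random.choice(population)     # fresh partner each episode forces generalization
        learner.train_episode(env, partner)
    return learner
```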