Bayesian Inverse Reinforcement Learning (BIRL)

Background: The IRL Problem Setup

Inverse Reinforcement Learning sits at the intersection of Reinforcement Learning, Bayesian Inference, and Imitation Learning. The core question is: given observed expert behavior, what reward function is the expert optimizing? This is strictly harder than forward RL, and the key challenge is that the problem is fundamentally ill-posed.

The IRL Problem Definition

IRL: Given an MDP without a reward function $\mathcal{M} \setminus R = \langle \mathcal{S}, \mathcal{A}, T, \gamma \rangle$ and a set of expert demonstrations $\mathcal{D} = \{(s_i, a_i)\}$, recover the reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the expert is (approximately) optimizing.

Formally, the full MDP is $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ where:

  • $T(s, a, s') = P(s' \mid s, a)$ is the known transition model
  • $R$ is unknown and is what we want to learn
  • $\gamma \in [0,1)$ is the discount factor
  • $\mathcal{D}$ contains state-action pairs from an expert following policy $\pi_E$

The Ill-Posedness Problem

IRL is fundamentally ill-posed: many distinct reward functions explain the same observed behavior, including the degenerate $R \equiv 0$.

The three classical approaches to resolving this ambiguity are:

| Method | Core Idea | Ambiguity Resolution |
|---|---|---|
| Margin-based (Ng & Russell 2000; Abbeel & Ng 2004) | Find $R$ s.t. $\pi_E$ beats all other $\pi$ by a margin | Max-margin bias; introduces artificial preference |
| MaxEnt IRL (Ziebart et al. 2008) | Choose $R$ inducing the maximum-entropy trajectory distribution | Minimally committed solution; distributional, not point |
| Bayesian IRL (Ramachandran & Amir 2007) | Place a prior over reward functions; compute the posterior | Principled uncertainty quantification; no optimality assumption |

The focus of these notes is the Bayesian approach. The key motivation: MAP and MLE methods collapse all probability mass on a single reward function, which is brittle when data is scarce. BIRL maintains a full distribution over $R$.


The Bayesian Formulation

BIRL Model: Core Inference Setup

The key ingredients mirror Bayesian Linear Regression exactly — prior, likelihood, posterior — but now the unknown is a reward function rather than a weight vector.

$$ \underbrace{p(R \mid \mathcal{D})}_{\text{posterior}} \propto \underbrace{p(\mathcal{D} \mid R)}_{\text{likelihood}} \cdot \underbrace{p(R)}_{\text{prior}} $$
  • Hypothesis: $R$ is the reward function explaining $\pi_E$’s behavior
  • Evidence: $\mathcal{D} = \{(s_i, a_i)\}_{i=1}^N$, the observed expert state-action pairs
  • Prior $p(R)$: encodes prior belief about reward structure (e.g., uniform, Gaussian, Laplacian)
  • Posterior $p(R \mid \mathcal{D})$: updated belief after seeing demonstrations

The normalizing constant $Z = \int p(\mathcal{D} \mid R) p(R) \, dR$ is typically intractable in closed form — this is the central computational challenge that drives the algorithmic design.

The Boltzmann Rationality Likelihood

To define $p(\mathcal{D} \mid R)$ we need a probabilistic model of the expert. The standard assumption is Boltzmann (softmax) rationality: the expert is not perfectly optimal but acts with probability proportional to the exponentiated Q-value.

$$ p(a \mid s, R) = \frac{\exp(\beta \cdot Q^*(s, a; R))}{\sum_{a' \in \mathcal{A}} \exp(\beta \cdot Q^*(s, a'; R))} $$

where:

  • $Q^*(s, a; R)$ is the optimal Q-function under reward $R$, i.e. $Q^*(s,a;R) = R(s,a) + \gamma \sum_{s'} T(s,a,s') V^*(s'; R)$
  • $\beta \in [0, \infty)$ is the inverse temperature (confidence in expert optimality): $\beta \to \infty$ means perfectly optimal expert, $\beta \to 0$ means random policy
Assuming conditionally independent state-action pairs, the full demonstration likelihood factorizes as

$$ p(\mathcal{D} \mid R) = \prod_{(s_i, a_i) \in \mathcal{D}} \frac{\exp(\beta \cdot Q^*(s_i, a_i; R))}{\sum_{a'} \exp(\beta \cdot Q^*(s_i, a'; R))} $$

Warning: This requires solving the forward RL problem (computing $Q^*$) for every candidate $R$ — this is the expensive inner-loop that makes naive BIRL computationally intensive.
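As a concrete sketch of this inner loop (a hypothetical tabular implementation, not code from the original paper), the likelihood can be computed by running value iteration inside every evaluation:

```python
import numpy as np

def q_star(T, R, gamma=0.9, tol=1e-8):
    """Optimal Q-function via value iteration.
    T: (S, A, S') transition probabilities, R: (S, A) rewards."""
    Q = np.zeros(R.shape)
    while True:
        Q_new = R + gamma * T @ Q.max(axis=1)  # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def log_likelihood(D, T, R, gamma=0.9, beta=2.0):
    """log p(D | R) under Boltzmann rationality; D is a list of (s, a) pairs."""
    Q = beta * q_star(T, R, gamma)             # forward RL solve per candidate R
    m = Q.max(axis=1, keepdims=True)           # stable per-state log-softmax normalizer
    log_Z = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
    return sum(Q[s, a] - log_Z[s] for s, a in D)
```

Note the cost: every call to `log_likelihood` solves the forward MDP from scratch, which is exactly the inner-loop expense the warning above describes.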

Connection to Classical IRL as a Special Case

The original IRL of Ng & Russell (2000) is a special case of BIRL with a Laplacian prior on rewards. Ramachandran & Amir (2007) show this directly.

By contrast, the MAP estimate

$$ \hat{R}_{\text{MAP}} = \arg\max_R \left[ \log p(\mathcal{D} \mid R) + \log p(R) \right] $$

collapses all posterior mass onto the mode, which is exactly the limitation BIRL is designed to overcome. Full posterior inference is conceptually preferable for the same reason Bayesian Linear Regression is preferable to Ridge Regression.


Inference: MCMC Methods

The Computational Challenge

The posterior $p(R \mid \mathcal{D})$ has two sources of intractability:

  1. The normalizing constant $Z$ is an integral over a continuous (often high-dimensional) reward space
  2. Every likelihood evaluation requires solving an MDP to compute $Q^*(s,a;R)$ — typically $\mathcal{O}(|\mathcal{S}|^2 |\mathcal{A}|)$ per reward function via value iteration

This makes direct computation impossible; MCMC methods are used to sample from the posterior without computing $Z$ explicitly.

PolicyWalk (Ramachandran & Amir, 2007)

The seminal algorithm. Uses Metropolis-Hastings to sample from $p(R \mid \mathcal{D})$:

  1. Initialize with a random $R_0$
  2. At each step $t$, propose $R' \sim q(R_t)$ (random walk in reward space, step size $\delta$)
  3. Solve the MDP for $R'$ to obtain $Q^*(s, a; R')$, compute likelihood
  4. Accept $R'$ with probability: $$ \alpha = \min\left(1, \frac{p(R' \mid \mathcal{D})}{p(R_t \mid \mathcal{D})}\right) = \min\left(1, \frac{p(\mathcal{D} \mid R') \, p(R')}{p(\mathcal{D} \mid R_t) \, p(R_t)}\right) $$

The normalizing constant $Z$ cancels in the ratio — this is the key computational trick.

Convergence: With a uniform prior and $R_{\max} = \mathcal{O}(1/N)$ where $N = |\mathcal{S}|$, the Markov chain mixes rapidly and converges in $\mathcal{O}(N^2 \log(1/\epsilon))$ steps.

Critical limitation: PolicyWalk requires solving the full MDP at every step. In a tabular environment with $N$ states this is already expensive; it does not scale to continuous or large state spaces.
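The Metropolis-Hastings loop above can be sketched as follows (a minimal sketch with a uniform prior on bounded rewards; helper names are ours, not from the paper):

```python
import numpy as np

def q_star(T, R, gamma=0.9, tol=1e-6):
    """Forward RL solve (value iteration), needed at every MCMC step."""
    Q = np.zeros(R.shape)
    while True:
        Q_new = R + gamma * T @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def log_posterior(R, D, T, gamma=0.9, beta=2.0):
    """log p(R | D) up to the constant log Z (uniform prior on [-1, 1])."""
    Q = beta * q_star(T, R, gamma)
    m = Q.max(axis=1, keepdims=True)
    log_pi = Q - (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True)))
    return sum(log_pi[s, a] for s, a in D)

def policy_walk(D, T, steps=800, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    S, A, _ = T.shape
    R = rng.uniform(-1, 1, size=(S, A))                # 1. random init
    lp = log_posterior(R, D, T)
    samples = []
    for _ in range(steps):
        R_prop = np.clip(R + rng.uniform(-delta, delta, R.shape), -1, 1)  # 2. propose
        lp_prop = log_posterior(R_prop, D, T)          # 3. solve MDP for R'
        if np.log(rng.uniform()) < lp_prop - lp:       # 4. MH accept: Z cancels
            R, lp = R_prop, lp_prop
        samples.append(R.copy())
    return np.array(samples)
```

On a two-state MDP whose demonstrations always pick action 0, the posterior mean over the sampled rewards ends up favoring action 0 in the demonstrated states.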

ValueWalk and Modern Extensions

ValueWalk (Bajgar et al., 2024) addresses PolicyWalk’s scalability by proposing in the space of Q-functions rather than directly in reward space:

  • Observation: $Q^*$ is a sufficient statistic for policy computation; the reward can be read back from $Q^*$ via the Bellman equations
  • Sampling Q-functions avoids solving the full forward problem at each step
  • Allows MCMC-based BIRL to scale to continuous state spaces

AVRIL (Chan & van der Schaar, 2021) takes a variational inference approach:

  • Treats the reward as a latent variable in a VAE-like framework
  • Jointly learns an approximate posterior over $R$ and an imitator policy
  • Entirely offline, no inner-loop MDP solver needed
  • Limitation: variational approximation may underestimate posterior uncertainty

Reward Priors and Their Implications

Choice of Prior

The prior $p(R)$ encodes structural beliefs about rewards. The main choices and their consequences:

| Prior | Effect | Analogy |
|---|---|---|
| Uniform $p(R) \propto 1$ | No preference; posterior shape determined purely by likelihood | MLE limit |
| Gaussian $p(R) = \mathcal{N}(0, \sigma_p^2 I)$ | Shrinks rewards toward zero; smoothness bias | MAP → Ridge (cf. Bayesian Linear Regression) |
| Laplacian $p(R) \propto \exp(-\lambda \|R\|_1)$ | Sparsity; recovers classical Ng & Russell IRL | MAP → Lasso |
| Dirichlet / nonparametric | Mixtures of experts / subgoal discovery | Bayesian nonparametric IRL |

A key insight: with a Gaussian prior, the MAP estimate of BIRL is equivalent to $L_2$-regularized reward learning. With a Laplacian prior, it recovers the original IRL — making BIRL a strict generalization.
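As a small illustration (our own helper names), each prior enters MAP estimation or MCMC as an additive log-density term:

```python
import numpy as np

def log_prior_gaussian(R, sigma=1.0):
    """N(0, sigma^2 I) up to a constant: MAP = L2-regularized reward learning."""
    return -0.5 * np.sum(R ** 2) / sigma ** 2

def log_prior_laplace(R, lam=1.0):
    """Laplace(0, 1/lam) up to a constant: MAP = L1-regularized (sparse) rewards."""
    return -lam * np.sum(np.abs(R))

# The unnormalized log-posterior is then log_likelihood(D, R) + log_prior(R),
# with whichever prior matches the structural belief about rewards.
```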


Apprenticeship Learning and Policy Synthesis

From Posterior to Policy

Once we have samples $\{R^{(k)}\}_{k=1}^K \sim p(R \mid \mathcal{D})$ via MCMC, how do we synthesize a policy? Two main approaches:

$$ \pi^* = \arg\max_\pi \mathbb{E}_{R \sim p(R|\mathcal{D})} \left[ V^\pi(R) \right] $$

$$ \pi^*_{\text{robust}} = \arg\max_\pi \text{CVaR}_\alpha \left[ V^\pi(R) \right] $$

where $\text{CVaR}_\alpha$ focuses on the worst $1-\alpha$ fraction of reward samples — particularly relevant for safety-critical applications. See Risk-Sensitive RL and Bayesian Robust Optimization.

Practical shortcut: Use the posterior mean reward $\bar{R} = \mathbb{E}[R \mid \mathcal{D}]$ estimated from MCMC samples, then run standard RL with $\bar{R}$. This is the BIRL analogue of using the posterior mean in BLR.
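Given posterior samples, the two criteria above reduce to simple statistics over the per-sample policy values $V^\pi(R^{(k)})$ (assumed supplied by some evaluator). A minimal sketch with hypothetical helper names:

```python
import numpy as np

def expected_value(values):
    """Mean of V^pi(R^(k)) over posterior samples: the risk-neutral criterion."""
    return float(np.mean(values))

def cvar(values, alpha=0.95):
    """Mean of the worst (1 - alpha) fraction of sampled values: the robust criterion."""
    values = np.sort(np.asarray(values))
    k = max(1, int(np.ceil((1 - alpha) * len(values))))
    return float(values[:k].mean())

def best_policy(candidates, criterion):
    """Pick the candidate whose sampled values score highest under `criterion`.
    `candidates` maps policy name -> array of V^pi(R^(k)) over the same samples."""
    return max(candidates, key=lambda pi: criterion(candidates[pi]))
```

The two criteria can disagree: a risk-neutral criterion may prefer a high-mean policy with a bad tail, while CVaR prefers the safer one.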

Reward Loss and Evaluation

When the ground-truth reward $R^*$ is available (e.g. in benchmark environments), a standard evaluation metric is the mean per-state reward error:

$$ \mathcal{L}(\hat{R}, R^*) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} |\hat{R}(s) - R^*(s)| $$

The posterior mean minimizes the expected squared loss; the posterior median minimizes the expected linear (absolute) loss. The posterior as a whole captures uncertainty that neither MAP nor MLE can express.
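This can be checked numerically: over any set of samples, the sample mean minimizes average squared error and the sample median minimizes average absolute error. A quick sketch using synthetic, deliberately skewed "posterior samples":

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(size=10_000)  # skewed stand-in for posterior samples of R(s)

# Scan candidate point estimates r and measure both empirical risks.
grid = np.linspace(0.0, 3.0, 301)
sq_risk = [np.mean((samples - r) ** 2) for r in grid]
abs_risk = [np.mean(np.abs(samples - r)) for r in grid]

best_sq = grid[np.argmin(sq_risk)]    # minimizer of squared loss
best_abs = grid[np.argmin(abs_risk)]  # minimizer of absolute loss
```

With a skewed posterior the two point estimates differ noticeably, which is why the choice of loss matters when collapsing the posterior to a single reward.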


BIRL vs. MaxEnt IRL: A Conceptual Comparison

Disambiguation of Two Probabilistic Approaches

Both BIRL and MaxEnt IRL are probabilistic, but they are philosophically distinct:

| Property | BIRL (Ramachandran & Amir 2007) | MaxEnt IRL (Ziebart et al. 2008) |
|---|---|---|
| What is uncertain? | The reward function $R$ | The trajectory distribution $p(\tau)$ |
| Framework | Bayesian (prior + posterior) | Frequentist / MLE |
| Expert model | Boltzmann rationality on $Q^*$ | Maximum entropy trajectory distribution |
| Output | Posterior $p(R \mid \mathcal{D})$ | Point estimate $\hat{\theta}$ (reward weights) |
| Handles suboptimal experts | Yes (via $\beta$) | Partially (stochasticity in distribution) |
| Prior incorporation | Explicit via $p(R)$ | No explicit prior |
| Scalability | Limited (inner-loop MDP) | More scalable (dynamic programming) |

MaxEnt IRL models the trajectory probability as $p(\tau \mid \theta) \propto \exp(\theta^\top \phi(\tau))$ — it maximizes entropy subject to feature matching constraints. This is a frequentist approach to resolving ambiguity, not a Bayesian one. BIRL directly addresses parameter uncertainty by integrating over $R$.

MaxEnt IRL and margin-based methods resolve ambiguity by imposing structural bias. BIRL resolves it by maintaining uncertainty — which is more principled but computationally harder.


Epistemic vs. Aleatoric Uncertainty in BIRL

Decomposition

Following the same principle as Bayesian Linear Regression, the prediction variance in BIRL decomposes via the law of total variance:

$$ \text{Var}(V^\pi) = \underbrace{\mathbb{E}_R[\text{Var}_\pi(V^\pi \mid R)]}_{\text{aleatoric: noise in environment dynamics}} + \underbrace{\text{Var}_R(\mathbb{E}_\pi[V^\pi \mid R])}_{\text{epistemic: uncertainty in } R} $$
  • Epistemic uncertainty in BIRL comes from limited demonstrations — it shrinks as $|\mathcal{D}| \to \infty$
  • Aleatoric uncertainty comes from stochastic transitions $T$ — irreducible regardless of demonstrations
  • The full posterior $p(R \mid \mathcal{D})$ captures both, whereas MAP/MLE only reflect the mode
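The identity can be verified empirically: simulate hypothetical value estimates with per-sample means $\mathbb{E}[V \mid R^{(k)}]$ (epistemic spread) plus rollout noise (aleatoric spread), and the two terms add up exactly to the total variance. The distributions and scales below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 2000, 2000  # posterior samples of R, rollouts per sample

mu = rng.normal(loc=3.0, scale=1.0, size=K)            # E[V | R^(k)]: epistemic spread
V = mu[:, None] + rng.normal(scale=0.5, size=(K, M))   # rollout noise: aleatoric spread

total = V.var()
aleatoric = V.var(axis=1).mean()   # E_R[ Var(V | R) ]
epistemic = V.mean(axis=1).var()   # Var_R( E[V | R] )
```

Here `total` equals `aleatoric + epistemic` (the decomposition is exact for equal-size groups); collecting more demonstrations would shrink only the epistemic term.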

Active Learning with BIRL

Information-Theoretic Query Selection

BIRL provides a natural framework for active IRL: querying the expert at maximally informative states rather than passively collecting demonstrations.

$$ \text{EIG}(\xi) = H[p(R \mid \mathcal{D})] - \mathbb{E}_{\tau \sim \pi_E^\xi}\left[ H[p(R \mid \mathcal{D} \cup \tau)] \right] $$

where $\xi$ parametrizes the environment/query. Maximizing EIG selects demonstration contexts where the posterior would be most reduced.

Practical approximation methods:

  • Policy entropy: Query at states where $H[\pi_E(a \mid s)]$ is high (correlates with reward uncertainty early in learning)
  • Bayesian Optimization: Use a surrogate model of EIG to avoid the expensive nested Monte Carlo estimate
  • Observation: Querying full trajectories is more informative than querying individual states
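The policy-entropy heuristic in particular is cheap to implement (a sketch; the expert-policy estimate would come from the current posterior):

```python
import numpy as np

def policy_entropy(pi):
    """Per-state entropy H[pi(. | s)]; pi is a row-stochastic (S, A) array."""
    p = np.clip(pi, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def select_queries(pi, k=1):
    """Indices of the k states where the (estimated) expert policy is most uncertain."""
    return np.argsort(-policy_entropy(pi))[:k]
```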

Advanced Extensions

Non-Markovian Rewards

Standard IRL assumes the reward is Markovian: $R(s, a)$ depends only on the current state-action pair. Recent work extends BIRL to Reward Machines (RMs) — finite-state machines encoding history-dependent (non-Markovian) reward functions.

The key changes:

  • Hypothesis space becomes the space of RMs (finite automata over $\mathcal{S} \times \mathcal{A}$)
  • Demonstration is projected onto the state space of each product MDP $\mathcal{M} \times \mathcal{R}$
  • MAP estimation via simulated annealing (continuous gradient methods unavailable for discrete RM structure)

Bayesian Theory of Mind (BTOM) and Robust BIRL

Standard BIRL assumes the expert’s internal model of dynamics equals the true model $T$. BTOM (Baker et al. 2011; Wei et al. 2023) simultaneously infers:

  • The expert’s reward $R$
  • The expert’s subjective model of dynamics $\tilde{T}$ (which may differ from true $T$)
The joint posterior over reward and subjective dynamics is

$$ p(R, \tilde{T} \mid \mathcal{D}) \propto p(\mathcal{D} \mid R, \tilde{T}) \, p(R) \, p(\tilde{T}) $$

Key finding: if the prior on $\tilde{T}$ is concentrated around the true $T$ (expert believed to have accurate model), the resulting policy exhibits robust performance — robustness emerges naturally from Bayesian modeling rather than requiring ad hoc pessimistic penalties.

Scalable BIRL via Variational Inference (AVRIL)

For large/continuous state spaces, AVRIL replaces MCMC with variational inference:

  • Reward is treated as a latent variable $z$ in a generative model $\pi_\theta(a \mid s) = \int p(a \mid s, z) q_\phi(z \mid \mathcal{D}) \, dz$
  • Jointly optimizes ELBO over reward posterior $q_\phi(R)$ and imitator policy $\pi_\theta$
  • No inner-loop MDP solve; fully offline

Drawback: Variational posteriors tend to underestimate uncertainty — important for downstream risk-sensitive decision making. MCMC methods remain more faithful to the true posterior.


Summary: BIRL Computational Complexity and Tradeoffs

Practical Considerations

  • PolicyWalk complexity: $\mathcal{O}(N^2 \log(1/\epsilon))$ MCMC steps, each costing $\mathcal{O}(|\mathcal{S}|^2 |\mathcal{A}|)$ for value iteration → roughly $\mathcal{O}(N^4 |\mathcal{A}| \log(1/\epsilon))$ total for tabular MDPs with $N = |\mathcal{S}|$
  • AVRIL complexity: $\mathcal{O}(|\mathcal{D}|)$ offline passes — scalable but approximate
  • ValueWalk: improves effective sample rate of MCMC by sampling Q-space first

Key design principles for applying BIRL:

  • Use Boltzmann likelihood with $\beta$ tuned to expected expert optimality level
  • Use Gaussian prior if reward smoothness is expected; Laplacian if sparsity/selectivity is expected
  • Use MCMC (PolicyWalk/ValueWalk) when posterior fidelity matters; AVRIL when scale matters
  • Use BIRL-based active learning to reduce demonstration requirements by querying informatively
  • For risk-sensitive downstream tasks, use CVaR optimization over posterior samples rather than expected reward

See also: Inverse Reinforcement Learning, Markov Chains, Monte Carlo Methods, Gaussian Processes


Cooperative Inverse Reinforcement Learning (CIRL)

Motivation and the Value Alignment Problem

Classical IRL and BIRL share a fundamental architectural assumption: the human and robot are decoupled. The human acts as if the robot doesn’t exist (the Demonstration By Expert assumption), and the robot passively observes. CIRL (Hadfield-Menell, Dragan, Abbeel & Russell, NeurIPS 2016) challenges this assumption by formalizing the value alignment problem as a cooperative two-player game.

“We had better be quite sure that the purpose put into the machine is the purpose which we really desire.” — Norbert Wiener, 1960

The King Midas problem operationalized: specifying an incorrect reward function leads to catastrophically wrong behavior even from a well-intentioned specification. CIRL proposes that the robot should not be given a fixed reward function at all — it should learn the human’s reward function through cooperative interaction.

The Core Departure from Classical IRL

In classical IRL the robot is a passive observer and the human acts optimally in isolation. In CIRL, both agents are active participants and the human’s optimal strategy may involve deliberately suboptimal actions to better communicate the reward to the robot.

This is a profound shift: the human’s role changes from expert demonstrator to teacher, and the robot’s role changes from reward inferrer to cooperative learner. The key contribution is proving that the classical IRL assumption (Demonstration By Expert, DBE) is not just an approximation — it is formally suboptimal within the CIRL game.


Formal Definition of the CIRL Game

CIRL as a Cooperative Partial-Information Game

A CIRL game $\mathcal{G}$ is a tuple $\langle \mathcal{S}, A_H, A_R, T, \Theta, \hat{R}, P_0 \rangle$ where:

  • $\mathcal{S}$: world state space
  • $A_H, A_R$: action spaces for human $H$ and robot $R$ respectively
  • $T(s' \mid s, a_H, a_R)$: joint transition dynamics
  • $\Theta$: space of reward parameters (the unknown $\theta \sim P_0(\theta)$)
  • $\hat{R}(s, a_H, a_R; \theta)$: the shared reward function, parametrized by $\theta$
  • $P_0$: prior distribution over reward parameters

Critical asymmetry of information:

  • $H$ knows $\theta$ (her own preferences)
  • $R$ does not know $\theta$ and maintains a belief $b(\theta) = P(\theta \mid \text{observations})$
  • Both agents receive reward $\hat{R}(s, a_H, a_R; \theta)$ — they are on the same team
The robot's belief update after observing the human action $a_H$ in state $s$ is

$$ b'(\theta) \propto P(a_H \mid s, \theta, b) \cdot b(\theta) $$

This is Bayesian filtering: the robot maintains a distribution over $\theta$ that it updates via Bayes’ rule as it observes human behavior — directly connecting CIRL to Bayesian Inference and BIRL.
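A minimal sketch of this filter over a discrete $\Theta$ (the per-$\theta$ action likelihood, e.g. Boltzmann in $Q^*(\cdot; \theta)$, is assumed precomputed; names are ours):

```python
import numpy as np

def belief_update(b, s, a_H, action_lik):
    """One Bayesian filtering step: b'(theta) ∝ P(a_H | s, theta) * b(theta).
    action_lik[t, s, a] = P(a | s, theta_t), assumed given (e.g. Boltzmann in Q*)."""
    post = b * action_lik[:, s, a_H]
    return post / post.sum()
```

Repeated observations of $a_H$ concentrate the belief on the $\theta$ that best explains the human's behavior.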

The Demonstration By Expert (DBE) Assumption

DBE: The human acts as if the robot does not exist — i.e., $\pi_H^{\text{DBE}}(s, \theta) = \arg\max_{a_H} Q^*(s, a_H; \theta)$, the optimal action in the single-agent MDP induced by $\theta$.

Classical IRL/BIRL reduces to the robot’s best response under DBE. The key theorem of the CIRL paper:

Theorem (Suboptimality of DBE): The policy pair $(\pi_H^{\text{DBE}}, \pi_R^{\text{IRL}})$ is, in general, not part of an optimal joint policy in a CIRL game.

Intuition: under DBE the human greedily collects reward without regard for informativeness. An optimal human teacher may accept a lower immediate reward in order to signal which states have high reward — making the robot’s inference task much easier. The robot learns more, and the joint long-run reward is higher.


Reduction to POMDP

CIRL Game → POMDP

Computing optimal joint policies in a cooperative game is, in general, the problem of solving a Dec-POMDP (Decentralized Partially Observable MDP) — which is NEXP-complete. CIRL’s special structure rescues this.

Key structural insight: $H$ has full state information (she knows $\theta$ and observes $s$); only $R$ has partial information (it doesn’t know $\theta$). This asymmetry means the CIRL game can be reduced to a single-agent POMDP from $R$’s perspective:

$$ \mathcal{M}_R = \langle \mathcal{S} \times \Theta, A_R, T_R, \hat{R}, b_0 \rangle $$

where the hidden state is $(s, \theta)$, $R$’s belief $b \in \Delta(\mathcal{S} \times \Theta)$ is the sufficient statistic, and $H$’s behavior is encoded into the transition dynamics $T_R$ (treating $H$ as part of the environment given a fixed $\pi_H$).

This reduction is significant: it justifies using all approximate POMDP solvers (POMCP, Perseus, SARSOP, etc.) for CIRL, and proves that R’s optimal policy depends only on its current belief state $b$.

Remaining complexity issue: The naive POMDP action space is $|A_H|^{|\Theta|} \cdot |A_R|$ — exponential in $|\Theta|$ because the coordinator must specify a decision rule for $H$ for each possible $\theta$. Palaniappan et al. (ICML 2017) resolve this with a modified Bellman update that prunes all but $H$’s optimal response, reducing the effective action space to $|A_R|$ and giving speedups of up to $3500\times$ in practice.


The DBE Suboptimality: A Closer Look

Instructive vs. Optimal Demonstrations

The qualitative behavior that emerges from optimal CIRL solutions is distinct from classical IRL input:

| Behavior Type | Classical IRL (DBE) | Optimal CIRL |
|---|---|---|
| Human strategy | Greedy-optimal in isolation | Pedagogically optimal (may sacrifice reward) |
| Robot strategy | Passive inference from demonstrations | Active learning, queries, and information-seeking |
| Communication | One-directional (H → R) | Bidirectional implicit communication |
| Active teaching | Never emerges | Emerges naturally as optimal behavior |
| Legible motion | Not incentivized | Directly incentivized |

Concretely: in a gridworld experiment, under DBE the human walks directly to the highest-reward cell (efficient, but uninformative about the reward landscape). Under optimal CIRL the human detours through other high-reward regions, sacrificing a few steps to give $R$ a richer signal — and the joint policy achieves substantially higher long-run value.

The Two-Phase Apprenticeship Formulation

CIRL formalizes apprenticeship learning as a two-phase turn-based game:

  • Learning phase: Both $H$ and $R$ take actions; $R$ updates its belief $b(\theta)$
  • Deployment phase: $R$ acts alone using its learned belief to maximize $\hat{R}$

Classical IRL is exactly the robot’s best response in the deployment phase when $H$ played DBE in the learning phase. The CIRL result says: even if $R$ expects $H$ to play DBE, $H$ should not play DBE — she can do better by playing an instructive strategy instead.


CIRL and Safety: The Off-Switch Game

Corrigibility from Reward Uncertainty

One of the most important implications of CIRL for AI safety (Hadfield-Menell et al. 2017, “The Off-Switch Game”) is the following insight:

Theorem (informal): An agent optimizing a CIRL objective, uncertain about the human's reward function, will voluntarily accept being shut down by a rational human, because the shutdown command is evidence that its current behavior is misaligned with $\theta$.

Formally, if $R$ maintains uncertainty $b(\theta)$ and $H$ presses a shutdown button, this is evidence that $R$’s current policy is not maximizing $\hat{R}(\cdot; \theta)$. A rational $R$ updates its belief and defers. In contrast, an agent with a fixed reward function has a strong instrumental incentive to prevent shutdown (it cannot achieve its objective if turned off). This is the classic self-preservation problem of classical RL agents.

The mechanism: reward uncertainty creates an incentive to learn more about $\theta$ — including from shutdown signals. Uncertainty is the source of corrigibility, not an external constraint.
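A toy numeric check of the mechanism (our own sketch, not the paper's exact model): the robot holds a belief over the utility $U$ of its intended action and compares acting unilaterally, switching itself off, and deferring to a rational human who permits the action only when $U > 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Robot's posterior belief over the utility U of its intended action.
U = rng.normal(loc=0.2, scale=1.0, size=100_000)

act = U.mean()                      # act without consulting the human
shutdown = 0.0                      # switch itself off
defer = np.maximum(U, 0.0).mean()   # rational human vetoes exactly when U < 0
```

Deferring weakly dominates both alternatives, since $\mathbb{E}[\max(U, 0)] \ge \max(\mathbb{E}[U], 0)$; the advantage vanishes only when the belief collapses to a point, which is the fully updated deference problem.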

Limits of CIRL Corrigibility

This result is not robust to all assumptions. Critical failure modes identified in follow-up work:

  • Model misspecification: If the prior $P_0(\theta)$ or the parametric family $\hat{R}(\cdot;\theta)$ is wrong, $R$ may have beliefs that cause it to interpret shutdown commands as noise rather than information — and resist them.
  • Fully updated deference: As $|\mathcal{D}| \to \infty$, $R$ may become fully confident in its inferred $\theta$. At that point uncertainty vanishes and so does the corrigibility incentive — this is the fully updated deference problem.
  • Single-shot games: Corrigibility properties in CIRL rely on repeated interaction. In one-shot settings the incentive to defer weakens substantially.

CIRL corrigibility is emergent from uncertainty, not a hardcoded guarantee. It is fragile to model misspecification and disappears as the agent becomes more confident. It should be treated as a useful framework, not a solved problem.

The alignment research community largely views CIRL as a valuable problem formalization rather than a complete solution — the gap between the idealized CIRL model and real-world AI systems remains large.


BIRL vs. CIRL: How They Relate

Conceptual Comparison

CIRL strictly generalizes and reframes the BIRL setting:

| Dimension | BIRL | CIRL |
|---|---|---|
| Agent structure | Robot only; human is passive data source | Both human and robot are active agents |
| Human model | Fixed Boltzmann-rational demonstrator | Strategic co-player aware of robot's learning |
| Robot's role | Posterior inference over $R$ | Cooperative policy optimization under partial info |
| Human's role | Demonstrate (DBE assumed) | Teach (pedagogically optimal strategy) |
| Formal framework | Bayesian inference / MCMC | Dec-POMDP reduced to single-agent POMDP |
| Safety implications | Uncertainty for robust policy | Uncertainty as source of corrigibility |
| Optimality | Robot is optimal given DBE assumption | Joint optimality without DBE assumption |

Both frameworks maintain a belief over the reward function — this is the shared core. BIRL is the robot’s problem in isolation; CIRL asks what happens when you take the human’s strategy seriously as a co-optimization.

CIRL as a Generalization of BIRL

Classical BIRL falls out of CIRL as follows. Fix $\pi_H = \pi_H^{\text{DBE}}$ (human plays optimally in isolation, Boltzmann-rational). Then the robot’s problem reduces exactly to:

$$ p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \, p(\theta) $$

which is the BIRL posterior with reward parameter $\theta$. So BIRL is the best response to a fixed, non-strategic human. CIRL asks for the joint optimum where both sides adapt to each other. In practice, BIRL is more tractable; CIRL is more correct.


Practical Algorithms for CIRL

Approximate CIRL via Iterated Best Response

Exact CIRL requires solving a POMDP whose state space grows as $|\mathcal{S}| \times |\Theta|$. Approximate approaches:

Iterated Best Response (IBR):

  1. Initialize with $\pi_H^{(0)} = \pi_H^{\text{DBE}}$
  2. Compute robot’s best response: $\pi_R^{(t)} = \text{BR}(\pi_H^{(t)})$ via POMDP solver
  3. Compute human’s best response: $\pi_H^{(t+1)} = \text{BR}(\pi_R^{(t)})$
  4. Repeat until convergence

The modified Bellman update of Palaniappan et al. makes step 2 tractable by exploiting $H$’s full-information status to prune suboptimal human decision rules during planning, reducing the effective action space from $|A_H|^{|\Theta|} \cdot |A_R|$ to $|A_R|$.

POMCP-based CIRL: Apply Monte Carlo tree search (POMCP) to the reduced POMDP. Works well in continuous or large state spaces where exact value iteration is infeasible.

Relaxing Human Rationality

A key limitation of vanilla CIRL is that $H$ must be optimal — a strong assumption. The modified Bellman update approach also allows relaxing this: instead of requiring $H$ to be fully optimal, it is sufficient that $H$’s policy is parametrized by her Q-values (a weaker condition), enabling more realistic human models including bounded rationality and cognitive biases. See also Bayesian Theory of Mind for a related approach.


Extensions and Open Problems

Multi-Human CIRL and Social Choice

The basic CIRL game has a single human with reward $\theta$. Extensions to multiple humans $\{H_i\}$ with competing rewards $\{\theta_i\}$ raise social choice problems: whose preferences should $R$ optimize? One direction (Fickinger et al.) uses shared control as a social welfare aggregation mechanism. This connects CIRL to Mechanism Design and Multi-Agent RL, relevant for settings like AI assistants interacting with groups rather than individuals — directly relevant for multi-agent AI safety research.

CIRL with Partial Human Self-Knowledge

Humans often do not know their own preferences perfectly — they may be uncertain about $\theta$ themselves. Extensions of CIRL where $H$ also has a belief $b_H(\theta)$ (with $H$ observing $\theta$ noisily) lead to a both-sided partial information game, a fully symmetric Dec-POMDP. This is computationally much harder but more realistic for human-AI alignment scenarios.

Connection to RLHF

Reinforcement Learning from Human Feedback (RLHF) can be viewed as an approximate, scalable instantiation of the CIRL idea:

  • Human preference labels are the “demonstrations” providing evidence about $\theta$
  • A reward model is trained to approximate the posterior mean of $p(\theta \mid \text{preferences})$
  • The policy is then optimized against the reward model

RLHF discards the cooperative game structure and epistemic uncertainty of CIRL for scalability. CIRL provides the theoretical grounding that RLHF practitioners implicitly rely on; RLHF provides the scalability that CIRL lacks. The gap between them is an active research frontier.

See also: Inverse Reinforcement Learning, POMDPs, Dec-POMDPs, Value Alignment, RLHF, Corrigibility, Bayesian Theory of Mind, Mechanism Design