The goal of Inverse Reinforcement Learning (IRL) is to infer a reward function from expert demonstrations.
Given:
- state and action space
- roll-outs from the expert policy $ \pi^* $

Goal:
- recover the reward function
- use the recovered reward to obtain a policy
Maximum Entropy Inverse Reinforcement Learning
Notation:
- trajectory: $ \tau $
- learned reward: $ R_\psi(\tau) $, where $ \psi $ are the learned parameters
- expert demonstrations: $ \mathcal{D} $
MaxEnt formulation:
\[p(\tau)=\frac 1 Z \exp(R_\psi(\tau))\]
\[Z = \int \exp(R_\psi(\tau))d\tau\]
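As a quick numerical illustration of this distribution (not part of the original derivation), the sketch below evaluates $ p(\tau) $ over a small finite set of candidate trajectories; the `returns` values are purely hypothetical.

```python
import numpy as np

# Hypothetical returns R_psi(tau) for four candidate trajectories.
returns = np.array([1.0, 2.0, 0.5, 3.0])

# p(tau) = exp(R_psi(tau)) / Z; Z reduces to a sum over the finite set.
Z = np.sum(np.exp(returns))
p_tau = np.exp(returns) / Z

print(p_tau)        # higher-return trajectories are exponentially more likely
print(p_tau.sum())  # sums to 1
```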
To infer the reward function, we maximize the log likelihood of our set of demonstrations with respect to the parameters of our reward function:
\[\max_\psi\sum_{\tau\in \mathcal{D}}\log p_{r_\psi}(\tau)\]

Maximum Entropy IRL Optimization
\[\mathcal{L}(\psi) = \sum_{\tau\in \mathcal{D}}\log p_{r_\psi}(\tau)\]
\[= \sum_{\tau\in \mathcal{D}} \log \frac 1 Z \exp(R_\psi(\tau))\]
\[= \sum_{\tau\in \mathcal{D}} R_\psi(\tau) - M \log Z\]
\[= \sum_{\tau\in \mathcal{D}} R_\psi(\tau) - M \log \sum_\tau \exp(R_\psi(\tau))\]
where $ M = \vert \mathcal{D} \vert $ is the number of demonstrations. We will optimize $ \mathcal{L} $ with gradient ascent.
\[\nabla_\psi\mathcal{L}(\psi) = \sum_{\tau\in\mathcal{D}} \frac{dR_\psi(\tau)}{d\psi} - M\underbrace{\frac1{\sum_\tau\exp(R_\psi(\tau))}\sum_\tau \exp(R_\psi(\tau))\frac {dR_\psi(\tau)}{d\psi}}_{\text{expectation under the model}}\]
The under-braced term can be written as $ \sum_\tau p(\tau|\psi)\frac{dR_\psi(\tau)}{d\psi} $, i.e. the expected reward gradient under the current model.
Furthermore, since the reward of a trajectory decomposes over the states it visits, $ R_\psi(\tau) = \sum_{s\in\tau} r_\psi(s) $, this expectation over trajectories can be rewritten as a sum over states, $ \sum_s p(s|\psi)\frac{dr_\psi(s)}{d\psi} $, where $ p(s|\psi) $ is the state visitation distribution under the current model.
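For example, under the additional assumption (made here only for illustration) that the reward is linear in state features, $ r_\psi(s) = \psi^\top \phi(s) $, we have $ \frac{dR_\psi(\tau)}{d\psi} = \sum_{s\in\tau}\phi(s) $, and the gradient compares the expert's feature counts with those expected under the current model:
\[\nabla_\psi\mathcal{L}(\psi) = \sum_{\tau\in\mathcal{D}}\sum_{s\in\tau}\phi(s) - M\,\mathbb{E}_{\tau\sim p(\tau|\psi)}\Big[\sum_{s\in\tau}\phi(s)\Big]\]
The gradient vanishes when the two feature counts match, which is the classic feature-matching view of MaxEnt IRL.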
So now the problem becomes calculating $ p(s|\psi) $
Let $ \mu_t(s) $ be the probability of visiting state $ s $ at time step $ t $. It can be computed with dynamic programming (DP):
\[\mu_1(s) = p(s_1 = s)\]
where $ p(s_1) $ is the initial state distribution.
\[\mu_{t+1}(s') = \sum_a\sum_s\mu_t(s)\pi(a|s)p(s'|s,a)\]
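A minimal sketch of this forward DP, assuming a small tabular MDP with known dynamics `P[s, a, s']`, a fixed policy `pi[s, a]`, and an initial distribution `p0` (these array names are illustrative, not from the text):

```python
import numpy as np

def state_visitation(P, pi, p0, T):
    """Return mu with mu[t, s] = probability of being in state s at time step t."""
    S = P.shape[0]
    mu = np.zeros((T, S))
    mu[0] = p0  # mu_1(s) = p(s_1 = s)
    for t in range(T - 1):
        # mu_{t+1}(s') = sum_{s,a} mu_t(s) * pi(a|s) * p(s'|s,a)
        mu[t + 1] = np.einsum('s,sa,sap->p', mu[t], pi, P)
    return mu
```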
Averaging over the $ T $ time steps gives
\[p(s|\psi) = \frac 1 T \sum_t\mu_t(s)\]

MaxEnt IRL Algorithm
1. Initialize $ \psi $, gather demonstrations $ \mathcal{D} $
2. Solve for the (soft) optimal policy $ \pi(a \vert s) $ w.r.t. the current reward $ r_\psi $
3. Solve for the state visitation frequencies $ p(s \vert \psi) $
4. Compute the gradient $ \nabla_\psi\mathcal{L} = \frac 1 {\vert \mathcal{D} \vert} \sum_{\tau_d \in \mathcal{D}} \frac {dr_\psi}{d\psi}(\tau_d) - \sum_s p(s \vert \psi)\frac {dr_\psi}{d\psi}(s) $
5. Update $ \psi $ with one gradient ascent step using $ \nabla_\psi\mathcal{L} $
6. Go to step 2 (a code sketch of the whole loop is given below)
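Putting the steps together, here is a compact sketch of the loop for a small tabular MDP with known dynamics and a fixed horizon $ T $, assuming (purely for illustration) a reward linear in state features, $ r_\psi(s) = \psi^\top\phi(s) $, and finite-horizon soft value iteration for step 2; the inputs `P`, `phi`, `p0`, and `demos` are hypothetical, not from the text.

```python
import numpy as np

def maxent_irl(P, phi, p0, demos, T, n_iters=200, lr=0.1):
    """P: (S, A, S) transition probabilities, phi: (S, F) state features,
    p0: (S,) initial state distribution, demos: list of length-T state-index
    sequences. Returns the learned reward parameters psi."""
    S, A, _ = P.shape
    psi = np.zeros(phi.shape[1])

    # Expert feature counts: average over demos of sum_{s in tau} phi(s).
    f_expert = np.mean([phi[list(tau)].sum(axis=0) for tau in demos], axis=0)

    for _ in range(n_iters):
        r = phi @ psi                          # r_psi(s) for every state

        # Step 2: soft (MaxEnt) policy via backward soft value iteration.
        V = np.zeros(S)
        pi = np.zeros((T, S, A))
        for t in reversed(range(T)):
            Q = r[:, None] + P @ V             # Q[s, a]
            V = np.log(np.exp(Q).sum(axis=1))  # soft maximum over actions
            pi[t] = np.exp(Q - V[:, None])     # pi_t(a | s)

        # Step 3: state visitation probabilities mu_t(s) by forward DP.
        mu = np.zeros((T, S))
        mu[0] = p0
        for t in range(T - 1):
            mu[t + 1] = np.einsum('s,sa,sap->p', mu[t], pi[t], P)
        svf = mu.sum(axis=0)                   # expected visit counts (= T * p(s|psi))

        # Steps 4-5: gradient ascent on the per-demonstration log likelihood:
        # expert feature counts minus expected feature counts under the model.
        grad = f_expert - svf @ phi
        psi += lr * grad

    return psi
```

In practice the soft value iteration step should use a numerically stable log-sum-exp; it is written naively here for brevity.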
Discussion
The algorithm is only suitable for small, discrete state and action spaces, and it requires the state transition probabilities (the dynamics) to be known.