A PyTorch implementation of successor features in an advantage actor-critic model that works for both discrete and continuous action spaces. This code is adapted from https://github.com/lcswillems/rl-starter-files and https://github.com/lcswillems/torch-ac.
Included here is a novel learning rule for successor features, inspired by the off-line TD($\lambda$) algorithm; the derivation is sketched below.
See the requirements file, or use the provided yml file to create a conda env:
conda env create -f environment.yml
python train.py --algo sr --env MiniGrid-Empty-6x6-v0 --frames 50000 --input image --feature-learn curiosity
python train.py --algo sr --env MountainCarContinuous-v0 --frames 100000 --input flat --feature-learn curiosity
RL models can be divided into two categories, model-based and model-free, each of which has advantages and disadvantages.
Model-based RL can be used for transfer learning and latent learning: an agent can learn a state transition model (without a reward signal) that can be used for navigation in novel environments and adapted to distal reward changes.
Consider a maze task with a goal at a certain location. With a model of the environment, the agent can re-plan a path to the goal when the goal is moved, without relearning the structure of the maze; a purely model-free agent would have to relearn its value function from new experience.
Successor representation (SR) provides a middle ground between model-based and model-free methods.
In this paradigm one learns a reward function, $R(s)$, together with a successor representation $M(s, s')$, the expected discounted future occupancy of state $s'$ when starting from state $s$:

$$M(s, s') = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{1}[s_t = s'] \;\middle|\; s_0 = s\right],$$

where $\gamma$ is the discount factor and $\mathbb{1}[\cdot]$ is the indicator function. The SR obeys a Bellman-style recursion, so model-free algorithms for learning a value function (e.g. TD learning, eligibility traces, etc.) can be easily adapted for learning the SR. Furthermore, given the SR and a reward function we can easily compute state values:

$$V(s) = \sum_{s'} M(s, s')\, R(s').$$
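As a concrete illustration (a hypothetical sketch, not code from this repo), a tabular SR can be learned with the same TD(0) machinery used for value functions, and state values then follow from a matrix-vector product:

```python
import numpy as np

def sr_td_update(M, s, s_next, gamma=0.99, alpha=0.1):
    """One TD(0) update of a tabular successor representation M (n_states x n_states)."""
    indicator = np.eye(M.shape[0])[s]                 # 1 at the current state, 0 elsewhere
    td_error = indicator + gamma * M[s_next] - M[s]   # SR analogue of the value TD error
    M[s] += alpha * td_error
    return M

def values_from_sr(M, R):
    """Given the SR and a reward vector R over states, recover state values."""
    return M @ R
```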
The SR can be generalized to continuous states. Let $\phi(s)$ be a feature vector encoding state $s$. The successor features are then defined as

$$\psi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, \phi(s_t) \;\middle|\; s_0 = s\right].$$

The reward should be (approximately) a linear function of the features, $r(s) \approx \phi(s)^\top w$. This is so that the value function can be recovered linearly:

$$V(s) = \psi(s)^\top w.$$
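The same bootstrapped target carries over to the feature case. Below is a minimal PyTorch-style sketch; `phi_t`, `psi_t`, `psi_next`, and `w` are illustrative names, not this repo's API:

```python
import torch

def sf_td_target(phi_t, psi_next, gamma=0.99):
    """Bootstrapped target for successor features: psi(s_t) ~ phi(s_t) + gamma * psi(s_{t+1})."""
    return phi_t + gamma * psi_next.detach()

def value_from_sf(psi_t, w):
    """Recover the value linearly from the successor features and the reward weights w."""
    return psi_t @ w
```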
In past work using successor features, the features were often learned using an autoencoder trained to reconstruct the raw state input. In this implementation the feature-learning objective is selected with the --feature-learn flag (e.g. curiosity, as in the example commands above).
A diagram of the full advantage actor-critic generalized successor estimation (A2C-GSE) model is shown at the top of this readme.
Learning a useful state representation while simultaneously learning values and/or policies is a major obstacle in deep RL.
This is especially true in the context of transfer learning, where the features should ideally be agnostic to the particularities of any one task, and it is particularly pressing in SR models, where these same features are needed to define the successor features.
The network that encodes the states cannot simply be optimized using the loss function of the SR, because a degenerate solution in which every state maps to the same feature vector trivially minimizes that loss.
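To make the failure mode concrete (this is a direct consequence of the successor-feature recursion above, not a statement specific to this implementation): if the encoder collapses to a constant, say zero, the bootstrapped SR target is satisfied exactly and the SR loss tells the encoder nothing about the states,

$$\phi(s) \equiv 0 \;\Longrightarrow\; \psi(s) \equiv 0 \ \text{ already satisfies }\ \psi(s) = \phi(s) + \gamma\,\mathbb{E}\big[\psi(s_{t+1}) \mid s_t = s\big] \ \text{ for all } s.$$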
Generalized advantage estimation (GAE) is used to compute the policy loss.
An exponentially weighted average of the $n$-step advantage estimates is used:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

where $\lambda \in [0, 1]$ controls the bias-variance trade-off. This equation can be rewritten using the successor features by substituting $r_t = \phi(s_t)^\top w$ and $V(s) = \psi(s)^\top w$:

$$\hat{A}_t = \left[\sum_{l=0}^{\infty} (\gamma\lambda)^l \left(\phi(s_{t+l}) + \gamma\,\psi(s_{t+l+1}) - \psi(s_{t+l})\right)\right]^\top w.$$
I call the quantity in brackets the generalized successor estimation (GSE). It can be computed for an episode and used to learn the SR itself.
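As a sketch of how that per-episode computation might look (hypothetical names, episode-termination masking omitted for brevity; the repo's actual rollout code may differ), the GSE can be accumulated with the same backward recursion commonly used for GAE:

```python
import torch

def compute_gse(phis, psis, psi_last, gamma=0.99, lam=0.95):
    """
    phis:     (T, d) features phi(s_t) for one rollout
    psis:     (T, d) predicted successor features psi(s_t)
    psi_last: (d,)   psi(s_T) used to bootstrap the final step
    Returns the (T, d) generalized successor estimation
        gse_t = sum_l (gamma*lam)^l * (phi_{t+l} + gamma*psi_{t+l+1} - psi_{t+l}),
    computed backward via gse_t = delta_t + gamma*lam*gse_{t+1}.
    """
    T, d = phis.shape
    gse = torch.zeros(T, d)
    next_psi = psi_last
    running = torch.zeros(d)
    for t in reversed(range(T)):
        delta = phis[t] + gamma * next_psi - psis[t]
        running = delta + gamma * lam * running
        gse[t] = running
        next_psi = psis[t]
    return gse

# The scalar GAE advantage is then just the projection onto the reward weights:
#     advantages = compute_gse(phis, psis, psi_last) @ w
```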
Using the GSE as the update target for the SR is analogous to the offline TD($\lambda$) algorithm (the $\lambda$-return), applied to the successor features rather than to the scalar value function.
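One plausible way to read this as a concrete loss (a sketch under my own assumptions, continuing the hypothetical `compute_gse` above, and not necessarily the exact rule implemented in this repo): treat $\psi(s_t) + \mathrm{GSE}_t$ as a fixed $\lambda$-return-style target and regress the predicted successor features toward it.

```python
def gse_sr_loss(phis, psis, psi_last, gamma=0.99, lam=0.95):
    """Offline-TD(lambda)-style loss for the successor features, using the GSE
    from the compute_gse() sketch above (all names hypothetical)."""
    with torch.no_grad():
        gse = compute_gse(phis, psis, psi_last, gamma, lam)
        sr_target = psis + gse              # lambda-return analogue of psi(s_t)
    return ((psis - sr_target) ** 2).sum(dim=-1).mean()
```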