Skip to content

RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning

Notifications You must be signed in to change notification settings

bay3s/rl-squared

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RL^2 : Fast Reinforcement Learning via Slow Reinforcement Learning

Reference

Practical Matters

Formulation

  • A trial is a series of episodes of interaction with a fixed task (MDP), each trial may consist of multiple episodes.
  • For each new trial, a separate task is drawn from the distribution and for each episode within the trial a fresh starting state is drawn.
  • When an action is produced by the agent, the environment computes a reward, steps forward, computes the next state, and if the episode has terminated it sets the termination flag to 1.
  • The next state, reward, and termination flag are concatenated to form the input to the recurrent policy in the next step.
  • The recurrent policy conditioned on the hidden state generates the next hidden state and action.
  • The objective is to maximize the total discounted reward accumulated during a single trial (consisting of multiple episodes) rather than a single episode.
  • As long as different strategies are required for different tasks the agent must act differently according to its belief regarding the current task assigned.

Implementation

  • The implementation in this repository uses an Actor-Critic with a memory-augmented policy and value function.
  • GRUs are used in order to alleviate vanishing and exploding gradients, and due to their good empirical performance.
  • The policy and value function do not rely on a shared GRU, although there is empircal evidence to suggest that this form of weight sharing leads to more stable learning.
  • For the policy, the output of the GRU is fed to a fully connected layer followed by a softmax function which forms the distribution over actions for the policy.
  • For the value function, the output of the GRU is fed to an MLP consisting of fully connected layers which directly outputs a value estimate.
  • The repository contains a minor divergence from the original paper which is that PPO is used as the learning algorithm rather than TRPO as in the original paper.

Experimental Results

Multi-Armed Bandits

Theory

Abstract

  • Deep RL has been successful, but the learning process requires a huge number of trials.
  • RL-squared seeks to bridge this gap - rather than designing a fast RL algorithm proposes to represent it in a recurrent neural network and learn it from data.
  • RNN receives all information a typical RL algorithm would receive including the state, reward, actions, and termination flags, and retains it state across episodes in a given MDP.
  • The activations of the RNN store the state of the "fast" RL algorithm on the current previously unseen MDP.
  • After training RL-squared shows performance on new MDPs close to human-designed algorithms with optimality guarantees.
  • RL-squared on vision-based navigation task shows the ability to scale.

Bayesian RL

  • RL success seems to come at the expense of high sample complexity, this is largely due to the lack of a good prior which results in deep RL agents needing to rebuild knowledge about the world from scratch.
  • Bayesian RL provides a framework for incorporating priors into the learning process, but exact computation of the Bayesian update is intractable in all but the simplest cases.
    • Practical Bayesian RL methods incorporate a mix of Bayesian and domain-specific ideas to bring down sample complexity.
    • But,they tend to make assumptions regarding the environment or become computationally intractable in high-dimensional settings.

RL-Squared

  • Rather than hand-designing domain-specific RL algorithms, RL-squared views thee learning process of the agent itself as an objective which can be optimized using standard RL algorithms.
  • The objective is averaged across possible MDPs according to a speicifc distribution whcih reflects the prior that we would like to distill into the agent.
  • The agent is structured as an RNN which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations.
  • Internal state of the RNN is preserved across episodes, it has the capacity to perform learning in its own hidden activations.
  • Thus, the learned algorithm also acts as the learning algorithm and can adapt to the task at hand when deployed.

Fast RL via Slow RL

  • The paper suggests a different approach for designing better RL algorithms: instead of acting as the designers ourselves, learn the algorithm end-to-end using standard RL techniques.
  • The fast RL algorithm is a computation whose state is stored in the RNN activations, and the RNN's weights are learned by a general-purpose slow RL algorithm.

About

RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published