Report

Results

Plot of Rewards

The graphs below show the solution to the Tennis environment. The training score plot shows the scores (blue line, the max of the total rewards of the 2 agents per episode) and the mean of those scores over a 100-episode window. The plot covers training until the agents reach a target mean score of 0.61, which happened after 819 episodes. Below are also graphs of the loss functions for the 2 actor networks and the 2 critic networks trained.

Rewards per episode

The graph shows the max of both agents' accumulated rewards per episode (blue line) and the mean of this score over a 100-episode window (red line).

Plot of agent scores by episode

Actors' loss function

Plot of Actor network loss function

Critics' loss function

Plot of Critic network loss function

Longer training - 1.54 mean score

The same model trained for over 4000 episodes on GPU achieved a mean score of 1.54!

Rewards per episode

The graph shows the max of both agents' accumulated rewards per episode (blue line) and the mean of this score over a 100-episode window (red line).

Plot of agent scores by episode

Actors' loss function

Plot of Actor network loss function

Critics' loss function

Plot of Critic network loss function

Learning Algorithm

This repository, specifically the files below:

  • src/
    • maddpg.py
    • utils.py
    • model.py
    • replaybuffers.py

These files contain an implementation of Multi-Agent Deep Deterministic Policy Gradient (MADDPG), as described in the OpenAI publication. Two agent instances (of MADDPGAgent, defined in src/maddpg.py) each independently train a separate policy network (the Actor), and each contains a separately trained copy of a centralized Critic network.

This is best illustrated by this diagram from the original publication: MADDPG diagram

MADDPG algorithm pseudocode: MADDPG algorithm
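
To make the centralized-critic, decentralized-actor structure concrete, here is a minimal sketch of a single MADDPG update step for one agent. This is an illustration under assumptions, not a copy of src/maddpg.py; attribute names such as actor_local, critic_target, actor_optimizer and index are hypothetical.

import torch
import torch.nn.functional as F

def maddpg_update(agent, agents, batch, gamma=0.99):
    """Illustrative MADDPG update for one agent; attribute names are assumptions."""
    # batch holds per-agent lists of tensors, one entry per agent
    states, actions, rewards, next_states, dones = batch
    i = agent.index

    # --- Critic update: the centralized critic sees ALL agents' observations and actions ---
    with torch.no_grad():
        next_actions = [a.actor_target(next_states[j]) for j, a in enumerate(agents)]
        q_next = agent.critic_target(torch.cat(next_states, dim=1), torch.cat(next_actions, dim=1))
        q_target = rewards[i] + gamma * q_next * (1 - dones[i])

    q_expected = agent.critic_local(torch.cat(states, dim=1), torch.cat(actions, dim=1))
    critic_loss = F.smooth_l1_loss(q_expected, q_target)
    agent.critic_optimizer.zero_grad()
    critic_loss.backward()
    agent.critic_optimizer.step()

    # --- Actor update: each actor acts only on its OWN observation ---
    pred_actions = [a.actor_local(states[j]) if j == i else actions[j].detach()
                    for j, a in enumerate(agents)]
    actor_loss = -agent.critic_local(torch.cat(states, dim=1), torch.cat(pred_actions, dim=1)).mean()
    agent.actor_optimizer.zero_grad()
    actor_loss.backward()
    agent.actor_optimizer.step()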

Exploration vs Exploitation

I found that forcing the agents to explore heavily in the first stage of training was necessary to solve the environment in reasonable time; otherwise the agents tended to barely take any actions, often staying motionless next to the net. To achieve this, the agents take fully random actions for the first 20% of training episodes, simply by drawing action vectors from a standard normal distribution. After that period I switch to an Ornstein-Uhlenbeck process with theta = 0.15 and sigma = 0.2, but scale the noise drawn from it by an epsilon value. Epsilon starts at 0.9 and decays by a factor of 0.999 every 10 episodes, down to a minimum of 0.01. The code snippet below, from the train_agent function in utils.py, shows the important parts explained above:

eps = 0.9   # initial noise scale (epsilon)
# ...
if i_episode < 0.2 * n_episodes:
    # first 20% of episodes: fully random actions from a standard normal distribution
    actions = np.random.standard_normal((num_agents, action_size))
else:
    # afterwards: policy actions with epsilon-scaled Ornstein-Uhlenbeck noise
    actions = np.vstack([a.act(s, add_noise=True, epsilon=eps) for (a, s) in zip(agents, states)])

# ...
if step_counter % 10 == 0:
    eps = max(eps_end, eps_decay * eps)   # decay epsilon towards eps_end (0.01)

This greatly improved the learning speed.
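
For reference, here is a minimal sketch of the Ornstein-Uhlenbeck noise process described above (theta = 0.15, sigma = 0.2), assuming a zero mean; it is an illustration, not the exact class used in this repository.

import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise (illustrative sketch)."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # start each episode back at the long-term mean
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1); the state drifts back towards mu
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
        self.state = self.state + dx
        return self.state

Inside act(), the sampled noise would then be scaled by epsilon and added to the deterministic action, for example action = np.clip(actor_output + epsilon * noise.sample(), -1, 1).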

Neural Network architecture

The file src/model.py contains PyTorch implementations of two small neural networks: one approximating the action-value function Q (the Critic) and one approximating the policy (the Actor).

Actor network

Consists of 4 fully connected layers of the following sizes:

  • Input layer (state_size, 256), ReLU activation
  • Hidden layer (256, 256), ReLU activation
  • Hidden layer (256, 128), ReLU activation
  • Output layer (128, action_size), Tanh activation

Below is how PyTorch prints the Actor network architecture:

ActorQNetwork(
  (fc_1): Linear(in_features=24, out_features=256, bias=True)
  (fc_2): Linear(in_features=256, out_features=256, bias=True)
  (fc_3): Linear(in_features=256, out_features=128, bias=True)
  (output): Linear(in_features=128, out_features=2, bias=True)
)
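
For reference, a minimal PyTorch sketch consistent with the printout above; layer names follow the printout, but initialization and other details of src/model.py are not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorQNetwork(nn.Module):
    """Maps a single agent's observation to a continuous action in [-1, 1]."""
    def __init__(self, state_size=24, action_size=2, hidden_1=256, hidden_2=128):
        super().__init__()
        self.fc_1 = nn.Linear(state_size, hidden_1)
        self.fc_2 = nn.Linear(hidden_1, hidden_1)
        self.fc_3 = nn.Linear(hidden_1, hidden_2)
        self.output = nn.Linear(hidden_2, action_size)

    def forward(self, state):
        x = F.relu(self.fc_1(state))
        x = F.relu(self.fc_2(x))
        x = F.relu(self.fc_3(x))
        return torch.tanh(self.output(x))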

Critic network

Consists of 4 fully connected layers of the following sizes (matching the printout below):

  • Input layer (52, 256), ReLU activation; the input size of 52 comes from (state_size + action_size) * 2
  • Hidden layer (256, 128), ReLU activation
  • Hidden layer (128, 128), ReLU activation
  • Output layer (128, 1), linear activation

Below is how PyTorch prints the Critic network architecture:

CriticQNetwork(
  (fc_1): Linear(in_features=52, out_features=256, bias=True)
  (fc_2): Linear(in_features=256, out_features=128, bias=True)
  (fc_3): Linear(in_features=128, out_features=128, bias=True)
  (output): Linear(in_features=128, out_features=1, bias=True)
)
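
And a matching sketch of the Critic, which scores the concatenated observations and actions of both agents (again consistent with the printout above, not a copy of the source).

import torch
import torch.nn as nn
import torch.nn.functional as F

class CriticQNetwork(nn.Module):
    """Centralized critic: scores the joint (observations, actions) of both agents."""
    def __init__(self, state_size=24, action_size=2, num_agents=2, hidden_1=256, hidden_2=128):
        super().__init__()
        input_size = (state_size + action_size) * num_agents   # 52 for Tennis
        self.fc_1 = nn.Linear(input_size, hidden_1)
        self.fc_2 = nn.Linear(hidden_1, hidden_2)
        self.fc_3 = nn.Linear(hidden_2, hidden_2)
        self.output = nn.Linear(hidden_2, 1)

    def forward(self, states, actions):
        # states: (batch, 48), actions: (batch, 4) -> concatenated to (batch, 52)
        x = torch.cat([states, actions], dim=1)
        x = F.relu(self.fc_1(x))
        x = F.relu(self.fc_2(x))
        x = F.relu(self.fc_3(x))
        return self.output(x)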

For the Tennis environment:

state_size = 24
action_size = 2

The size and number of hidden layers were chosen experimentally.

NN Training

The Critic network was trained by minimizing the Smooth L1 loss with the Adam optimizer. I found that the Smooth L1 loss performed better than mean squared error in my experiments.

Learning rates of LR_ACTOR = 1e-4 for the Actor network and LR_CRITIC = 1e-3 for the Critic network were chosen after some experimentation with values between 1e-5 and 1e-2. All chosen hyperparameter values are listed below. The Training notebook contains an executable demonstration of the training process and results.
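
As an illustration of the setup described above, and assuming the network sketches from the previous section, the optimizers could be constructed roughly like this (variable names are assumptions):

from torch import optim

actor = ActorQNetwork(state_size=24, action_size=2)
critic = CriticQNetwork(state_size=24, action_size=2, num_agents=2)

actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4)                    # LR_ACTOR
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3, weight_decay=0)  # LR_CRITIC, WEIGHT_DECAY

# the critic is regressed onto its Bellman target with the Smooth L1 (Huber) loss,
# e.g. F.smooth_l1_loss(q_expected, q_target), as in the MADDPG update sketch above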

Model weights

The pretrained weights used to generate the results presented here are part of this repository and can be found in the folder model_weights:

  • 2 Actor model parameters:
    • 061_solution_0_actor.pth
    • 061_solution_1_actor.pth
  • 2 Critic model parameters:
    • 061_solution_0_critic.pth
    • 061_solution_1_critic.pth
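
As an illustration, one of the actor checkpoints could be loaded roughly as follows (assuming the ActorQNetwork sketch above; the actual constructor in src/model.py may differ):

import torch

actor_0 = ActorQNetwork(state_size=24, action_size=2)
state_dict = torch.load("model_weights/061_solution_0_actor.pth", map_location="cpu")
actor_0.load_state_dict(state_dict)
actor_0.eval()   # switch to inference mode before generating a rollout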

Hyperparameters used

BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 256         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-2              # for soft update of target parameters               
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic

UPDATE_EVERY = 1        # how often to update the network
WEIGHT_DECAY = 0        # L2 weight decay passed to the optimizer
hidden_1_size = 256     # size of the first hidden layer
hidden_2_size = 128     # size of the second hidden layer
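
TAU above controls the soft update of the target networks. Below is a minimal sketch of that standard DDPG-style step (the function name is an assumption):

import torch

def soft_update(local_model: torch.nn.Module, target_model: torch.nn.Module, tau: float = 1e-2):
    """Blend local weights into the target network: theta_target <- tau*theta_local + (1-tau)*theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)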

Ideas for Future Work

  • test this solution on more environments with continuous control, especially with a larger number of agents and a mixture of collaboration and competition
  • I've found this model to be slow and rather unstable to train, so it would be worthwhile to implement some techniques to train it more efficiently. Perhaps some of the original DQN ideas would be a good start:
    • Prioritized Experience Replay
    • Double or Dueling DQN, or even the Rainbow algorithm
  • try implementing Policy Ensembles, which should improve performance as suggested in the MADDPG paper