#DRAFT

DRL Agent Training

Our Deep Reinforcement Learning approaches are built on StableBaselines3.

Features included so far:

Simple handling of the training script through program parameters
Choose a predefined Deep Neural Network
Create your own custom Multilayer Perceptron via program parameters
Networks will get trained, evaluated and saved
Load your trained agent to continue training
Optionally log training and evaluation data
Enable and modify a custom training curriculum
Multiprocessed rollout collection for training

Quick Start

In one terminal, start the arena simulation:

roslaunch arena_bringup start_arena_gazebo.launch  train_mode:=true use_viz:=true  task_mode:=random

In a second terminal, run the train script:

workon rosnav
roscd arena_local_planner_drl && python scripts/training/train_agent.py --agent MLP_ARENA2D

Training Script

Usage

Generic program call:

train_agent.py [agent flag] [agent_name | unique_agent_name | custom mlp params] [optional flag] [optional flag] ...

Program call	Agent Flag (mutually exclusive)	Usage	Description
`train_agent.py`	`--agent`	agent_name (see below)	initializes a predefined network from scratch
	`--load`	unique_agent_name (see below)	loads the agent with the given name
	`--custom-mlp`	custom_mlp_params (see below)	initializes custom MLP according to given arguments

Custom Multilayer Perceptron parameters are only considered when --custom-mlp is set!

Custom Mlp Flags	Syntax	Description
`--body`	`{num}-{num}-...`	architecture of the shared latent network
`--pi`	`{num}-{num}-...`	architecture of the latent policy network
`--vf`	`{num}-{num}-...`	architecture of the latent value network
`--act_fn`	`{relu, sigmoid or tanh}`	activation function to be applied after each hidden layer

Optional Flags	Description
`--config {string}`, defaults to "default"	Looks for the given config file name in ../arena_local_planner_drl/configs/hyperparameters to load the configurations from
`--n {integer}`	timesteps in total to be generated for training
`--tb`	enables tensorboard logging
`-log`, `--eval_log`	enables logging of evaluation episodes
`--no-gpu`	disables training with GPU
`--num_envs {integer}`	number of environments to collect experiences from for training (for more information refer to Multiprocessed Training)
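
The flag layout described in the tables above can be sketched with Python's argparse. This is a minimal illustration and not the actual parser of train_agent.py: the function name `build_parser`, the metavars and all default values other than the documented "default" config are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags of train_agent.py."""
    parser = argparse.ArgumentParser(prog="train_agent.py")

    # Exactly one of the three agent flags must be given.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--agent", metavar="AGENT_NAME")
    mode.add_argument("--load", metavar="UNIQUE_AGENT_NAME")
    mode.add_argument("--custom-mlp", action="store_true")

    # Custom MLP flags (only meaningful together with --custom-mlp).
    parser.add_argument("--body", default="")
    parser.add_argument("--pi", default="")
    parser.add_argument("--vf", default="")
    parser.add_argument("--act_fn", choices=["relu", "sigmoid", "tanh"], default="relu")

    # Optional flags; defaults other than --config are assumed.
    parser.add_argument("--config", default="default")
    parser.add_argument("--n", type=int, default=None)
    parser.add_argument("--tb", action="store_true")
    parser.add_argument("-log", "--eval_log", action="store_true")
    parser.add_argument("--no-gpu", action="store_true")
    parser.add_argument("--num_envs", type=int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Using a required mutually exclusive group lets argparse itself reject calls that combine, for example, --agent and --load.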

Examples

Training with a predefined DNN

Currently you can choose between several Deep Neural Networks, each of which has been the subject of a research project, for example:

Agent name	Inspired by
MLP_ARENA2D	arena2D
DRL_LOCAL_PLANNER	drl_local_planner
CNN_NAVREP	NavRep

e.g. training with the MLP architecture from arena2D:

train_agent.py --agent MLP_ARENA2D

You can find the most recently implemented neural network architectures in: custom_policy.py

Load a DNN for training

To differentiate between agents with similar architectures but from different runs, a unique agent name is generated whenever training starts from scratch, i.e. when using either the --agent or the --custom-mlp mode.

The name consists of:

[architecture]_[year]_[month]_[day]__[hour]_[minute]

To load a specific agent you simply use the flag --load, e.g.:

train_agent.py --load MLP_ARENA2D_2021_01_19__03_20
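
As an illustration, a name such as MLP_ARENA2D_2021_01_19__03_20 can be produced from an architecture name and a timestamp as sketched below. The helper `unique_agent_name` is hypothetical and only mirrors the pattern visible in the example; the training script may build the name differently.

```python
from datetime import datetime


def unique_agent_name(architecture: str, now: datetime) -> str:
    """Append a date/time suffix (double underscore before the time part)."""
    return f"{architecture}_{now.strftime('%Y_%m_%d__%H_%M')}"


if __name__ == "__main__":
    print(unique_agent_name("MLP_ARENA2D", datetime(2021, 1, 19, 3, 20)))
```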

Note: currently only agents which were trained with PPO given by StableBaselines3 are compatible with the training script.

Training with a custom MLP

The --custom-mlp flag makes it as simple as possible to instantiate an MLP architecture with an arbitrary number of layers and neurons for training. Setting the flag makes the additional flags for the architecture of the latent layers available (see above).

For example, given the following architecture:

					   obs
					    |
					  <256>
					    |
					  ReLU
					    |
					  <128>
					    |
					  ReLU
				    /               \
				 <256>             <16>
				   |                 |
				 action            value

the program must be invoked as follows:

train_agent.py --custom-mlp --body 256-128 --pi 256 --vf 16 --act_fn relu
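
To make the `{num}-{num}-...` syntax concrete, the sketch below parses the three layer strings into a nested layer specification. The helper names are hypothetical, and the output layout (shared layers followed by a dict with `pi` and `vf` branches) is an assumption modelled on the `net_arch` format of older StableBaselines3 releases; it is not necessarily what the training script builds internally.

```python
from typing import Dict, List, Union


def parse_layers(spec: str) -> List[int]:
    """Turn a string such as '256-128' into [256, 128]; '' gives []."""
    return [int(size) for size in spec.split("-") if size]


def build_net_arch(body: str, pi: str, vf: str) -> List[Union[int, Dict[str, List[int]]]]:
    """Shared layers first, then separate policy (pi) and value (vf) heads."""
    return parse_layers(body) + [{"pi": parse_layers(pi), "vf": parse_layers(vf)}]


if __name__ == "__main__":
    # Corresponds to: --body 256-128 --pi 256 --vf 16
    print(build_net_arch("256-128", "256", "16"))
```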

Multiprocessed Training

We provide separate launch scripts for testing and training purposes:

start_arena_gazebo.launch encapsulates the simulation environment featuring the different intermediate planners in a single process. Training is also possible within this simulation.
start_training.launch is the slimmer simulation version, as we target a higher throughput here in order to gather training data as fast as possible. The crucial feature of this launch file is that it can spawn an arbitrary number of environments to collect the rollouts with, which allows for a significant speedup through asynchronicity.

First terminal: Simulation

The first terminal is needed to run arena.

Run these commands:

workon rosnav
roslaunch arena_bringup start_training.launch train_mode:=true use_viz:=false task_mode:=random map_file:=map_small num_envs:=4

Second terminal: Training script

A second terminal is needed to run the training script.

Run these two commands:

workon rosnav
roscd arena_local_planner_drl

Now, run one of the two commands below to start a training session:

python scripts/training/train_agent.py --load pretrained_ppo_mpc --num_envs 4 --eval_log
python scripts/training/train_agent.py --load pretrained_ppo_baseline --num_envs 4 --eval_log

Note: Check how many cores your processor provides in order to fully leverage your local computing capabilities.
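
One way to check the number of available cores from Python is sketched below. The rule of keeping two cores free for the simulation and the operating system is purely an assumption for illustration, not a recommendation taken from this project, and `suggested_num_envs` is a hypothetical helper.

```python
import os


def suggested_num_envs(reserved_cores: int = 2) -> int:
    """Return a candidate value for num_envs based on the logical core count."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, cores - reserved_cores)


if __name__ == "__main__":
    print(f"logical cores: {os.cpu_count()}, suggested num_envs: {suggested_num_envs()}")
```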

Third terminal: Visualization

A third terminal is needed in order to start rviz for visualization.

Run this command:

roslaunch arena_bringup visualization_training.launch ns:=*ENV NAME*

Note: The training environments start with the prefix sim_ and end with the index, for example sim_1, sim_2 and so on. The evaluation environment, which is used for the periodic benchmarking during training, can be shown with ns:=eval_sim.

Ending a training session

When the training script is done, it will print the following information and then exit:

Time passed: {time in seconds}s
Training script will be terminated

Hyperparameters

You can modify the hyperparameters in the upper section of the training script which is located at:

/catkin_ws/src/arena-rosnav/arena_navigation/arena_local_planner/learning_based/arena_local_planner_drl/scripts/training/train_agent.py

The following hyperparameters can be adapted:

Parameter	Description
robot	Robot name to load robot specific .yaml file containing its settings.
gamma	Discount factor
n_steps	The number of steps to run for each environment per update (i.e. the rollout buffer size is n_steps * n_env, where n_env is the number of environment copies running in parallel)
ent_coef	Entropy coefficient for the loss calculation
learning_rate	The learning rate, it can be a function of the current progress remaining (from 1 to 0)
vf_coef	Value function coefficient for the loss calculation
max_grad_norm	The maximum value for the gradient clipping
gae_lambda	Factor for trade-off of bias vs variance for Generalized Advantage Estimator
batch_size	Minibatch size
n_epochs	Number of epochs when optimizing the surrogate loss
clip_range	Clipping parameter, it can be a function of the current progress remaining (from 1 to 0).
reward_fnc	Number of the reward function (defined in ../rl_agent/utils/reward.py)
discrete_action_space	If robot uses discrete action space
task_mode	Mode in which tasks are generated (custom, random, staged). In custom mode one can place obstacles manually via Rviz. In random mode a fixed number of obstacles is spawned at random positions on the map after each episode. In staged mode the training curriculum is used to spawn obstacles. (more info)
curr_stage	Stage to start the training with when "staged" training is activated.

(more information on PPO implementation of SB3)

Note: For now, further parameters like max_steps_per_episode or goal_radius have to be changed inline (where GazeboEnv gets instantiated). n_eval_episodes, the number of evaluation episodes that take place after every eval_freq timesteps, can also be changed (where EvalCallback gets instantiated).
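
Both learning_rate and clip_range can be given as a function of the remaining progress. StableBaselines3 accepts a callable that receives the progress remaining (1 at the start of training, 0 at the end) and returns the current value. A minimal sketch of a linearly decaying schedule follows; the helper name `linear_schedule` and the initial value 3e-4 are illustrative assumptions.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start) to 0.0 (end of training)
        return progress_remaining * initial_value

    return schedule


if __name__ == "__main__":
    lr = linear_schedule(3e-4)
    for progress in (1.0, 0.5, 0.0):
        print(progress, lr(progress))
```

Such a callable could then be passed in place of a constant, for example as the learning_rate argument of the PPO constructor.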

Reward Functions

The reward functions are defined in:

../arena_local_planner_drl/rl_agent/utils/reward.py

At present, one can choose between five reward functions, which can be set in the hyperparameters yaml file:

rule_00

rule_01

rule_02

rule_03

rule_04

Reward function rule_00 at timestep t
$r_{00}^{t} = r_{s}^{t} + r_{c}^{t} + r_{d}^{t} + r_{p}^{t}$

reward	description	value
$r_{s}^{t}$	success reward	$r_{s}^{t} = \begin{cases} 15 & \text{ if goal reached} \\ 0 & \text{ otherwise } \end{cases}$
$r_{c}^{t}$	collision reward	$r_{c}^{t} = \begin{cases} -10 & \text{ if robot collides} \\ 0 & \text{ otherwise } \end{cases}$
$r_{d}^{t}$	danger reward	$r_{d}^{t} = \begin{cases} -0.25 & \text{ if } \exists{o \in O} : d(p_{robot}^t, p_{obs}^t) < D_{s}\\ 0 & \text{ otherwise } \end{cases}$
$r_{p}^{t}$	progress reward	$\text{diff}_{robot,x}^t = d(p_{robot}^{t-1}, p_{x}^{t-1}) - d(p_{robot}^t, p_{x}^t)$ $r_{p}^{t} = \begin{cases} 0.3 * \text{diff}_{robot,goal}^t & \text{ if } \text{diff}_{robot,goal}^t > 0\\ 0.4 * \text{diff}_{robot,goal}^t & \text{ otherwise } \end{cases}$
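
The four terms of $r_{00}^{t}$ can be sketched in Python as follows. This is an illustration of the formulas in the table and not the code from reward.py; the function and parameter names are hypothetical, and `safe_dist` stands for $D_{s}$, whose value is not given here and therefore has to be supplied by the caller.

```python
def progress_reward(prev_goal_dist: float, curr_goal_dist: float) -> float:
    """r_p: 0.3 * diff when approaching the goal, 0.4 * diff otherwise."""
    diff = prev_goal_dist - curr_goal_dist
    return 0.3 * diff if diff > 0 else 0.4 * diff


def reward_rule_00(
    goal_reached: bool,
    collided: bool,
    min_obstacle_dist: float,
    prev_goal_dist: float,
    curr_goal_dist: float,
    safe_dist: float,
) -> float:
    r_s = 15.0 if goal_reached else 0.0                     # success reward
    r_c = -10.0 if collided else 0.0                        # collision reward
    r_d = -0.25 if min_obstacle_dist < safe_dist else 0.0   # danger reward
    r_p = progress_reward(prev_goal_dist, curr_goal_dist)   # progress reward
    return r_s + r_c + r_d + r_p


if __name__ == "__main__":
    # Robot moved 1 m closer to the goal, no obstacle within the safe distance.
    print(reward_rule_00(False, False, 2.0, 5.0, 4.0, safe_dist=0.5))
```

Because moving away from the goal is weighted with 0.4 instead of 0.3, regressing costs more than the same amount of progress earns.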

Reward function rule_01 at timestep t
$r_{01}^{t} = r_{s}^{t} + r_{c}^{t} + r_{d}^{t} + r_{p}^{t} + r_{dt}^{t}$

reward	description	value
$r_{s}^{t}$	success reward	$r_{s}^{t} = \begin{cases} 15 & \text{ if goal reached} \\ 0 & \text{ otherwise } \end{cases}$
$r_{c}^{t}$	collision reward	$r_{c}^{t} = \begin{cases} -10 & \text{ if robot collides} \\ 0 & \text{ otherwise } \end{cases}$
$r_{d}^{t}$	danger reward	$r_{d}^{t} = \begin{cases} -0.25 & \text{ if } \exists{o \in O} : d(p_{robot}^t, p_{obs}^t) < D_{s}\\ 0 & \text{ otherwise } \end{cases}$
$r_{p}^{t}$	progress reward	$\text{diff}_{robot,x}^t = d(p_{robot}^{t-1}, p_{x}^{t-1}) - d(p_{robot}^t, p_{x}^t)$ $r_{p}^{t} = \begin{cases} 0.3 * \text{diff}_{robot,goal}^t & \text{ if } \text{diff}_{robot,goal}^t > 0\\ 0.4 * \text{diff}_{robot,goal}^t & \text{ otherwise } \end{cases}$
$r_{dt}^{t}$	distance travelled reward	$r_{dt}^{t} = (vel_{linear}^{t} + (vel_{angular}^{t} * 0.001)) * -0.0075$

Reward function rule_02 at timestep t
$r_{02}^{t} = r_{s}^{t} + r_{c}^{t} + r_{d}^{t} + r_{p}^{t} + r_{dt}^{t} + r_{fg}^{t}$

reward	description	value
$r_{s}^{t}$	success reward	$r_{s}^{t} = \begin{cases} 15 & \text{ if goal reached} \\ 0 & \text{ otherwise } \end{cases}$
$r_{c}^{t}$	collision reward	$r_{c}^{t} = \begin{cases} -10 & \text{ if robot collides} \\ 0 & \text{ otherwise } \end{cases}$
$r_{d}^{t}$	danger reward	$r_{d}^{t} = \begin{cases} -0.25 & \text{ if } \exists{o \in O} : d(p_{robot}^t, p_{obs}^t) < D_{s}\\ 0 & \text{ otherwise } \end{cases}$
$r_{p}^{t}$	progress reward	$\text{diff}_{robot,x}^t = d(p_{robot}^{t-1}, p_{x}^{t-1}) - d(p_{robot}^t, p_{x}^t)$ $r_{p}^{t} = \begin{cases} 0.3 * \text{diff}_{robot,goal}^t & \text{ if } \text{diff}_{robot,goal}^t > 0\\ 0.4 * \text{diff}_{robot,goal}^t & \text{ otherwise } \end{cases}$
$r_{dt}^{t}$	distance travelled reward	$r_{dt}^{t} = (vel_{linear}^{t} + (vel_{angular}^{t} * 0.001)) * -0.0075$
$r_{fg}^{t}$	following global plan reward	$r_{fg}^{t} = \begin{cases} \begin{aligned} 0.1 * vel_{linear}^{t} & \text{ if } \min_{wp \in G}d(p_{wp}^t, p_{r}^t) < 0.5 \text{m} \\ 0 & \text{ otherwise } \end{aligned} \end{cases}$
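
The following-global-plan term $r_{fg}^{t}$ rewards forward velocity only while the robot stays within 0.5 m of the nearest global plan waypoint. A minimal sketch, assuming 2D positions given as (x, y) tuples and Euclidean distance; the function name and signature are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def following_global_plan_reward(
    robot_xy: Point,
    waypoints: Sequence[Point],
    vel_linear: float,
    max_dist: float = 0.5,
) -> float:
    """Return 0.1 * vel_linear if the closest waypoint is nearer than max_dist."""
    if not waypoints:
        return 0.0
    min_dist = min(
        math.hypot(wx - robot_xy[0], wy - robot_xy[1]) for wx, wy in waypoints
    )
    return 0.1 * vel_linear if min_dist < max_dist else 0.0


if __name__ == "__main__":
    plan = [(0.3, 0.0), (2.0, 2.0)]
    print(following_global_plan_reward((0.0, 0.0), plan, vel_linear=0.5))
```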

Reward function rule_03 at timestep t
$r_{03}^{t} = r_{s}^{t} + r_{c}^{t} + r_{d}^{t} + r_{p}^{t} + r_{fg}^{t} + r_{dg}^{t}$

reward	description	value
$r_{s}^{t}$	success reward	$r_{s}^{t} = \begin{cases} 15 & \text{ if goal reached} \\ 0 & \text{ otherwise } \end{cases}$
$r_{c}^{t}$	collision reward	$r_{c}^{t} = \begin{cases} -10 & \text{ if robot collides} \\ 0 & \text{ otherwise } \end{cases}$
$r_{d}^{t}$	danger reward	$r_{d}^{t} = \begin{cases} -0.25 & \text{ if } \exists{o \in O} : d(p_{robot}^t, p_{obs}^t) < D_{s}\\ 0 & \text{ otherwise } \end{cases}$
$r_{p}^{t}$	progress reward	$\text{diff}_{robot,x}^t = d(p_{robot}^{t-1}, p_{x}^{t-1}) - d(p_{robot}^t, p_{x}^t)$ $r_{p}^{t} = \begin{cases} 0.3 * \text{diff}_{robot,goal}^t & \text{ if } \text{diff}_{robot,goal}^t > 0\\ 0.4 * \text{diff}_{robot,goal}^t & \text{ otherwise } \end{cases}$
$r_{fg}^{t}$	following global plan reward	$r_{fg}^{t} = \begin{cases} \begin{aligned} 0.1 * vel_{linear}^{t} & \text{ if } \min_{wp \in G}d(p_{wp}^t, p_{r}^t) < 0.5 \text{m} \\ 0 & \text{ otherwise } \end{aligned} \end{cases}$
$r_{dg}$	distance to globalplan reward	$r_{dg} = \begin{cases} \begin{aligned} 0.2* \text{diff}_{robot, wp}^{t} & \text{ if }\min_{wp \in G}d(p_{r}^t, p_{wp}^t): \text{diff}_{robot, wp}^{t} > 0 \\ 0.3* \text{diff}_{robot, wp}^{t} & \text{ if } \min_{wp \in G}d(p_{r}^t, p_{wp}^t): \text{diff}_{robot, wp}^{t} <= 0 \\ 0 & \text{ if } \min_{o \in O}d(p_{r}^t, p_{o}^t) < D_s \end{aligned} \end{cases}$

Reward function rule_04 at timestep t
$r_{04}^{t} = r_{s}^{t} + r_{c}^{t} + r_{d}^{t} + r_{p}^{t} + r_{fg}^{t} + r_{dg}^{t} + r_{dc}^{t}$

reward	description	value
$r_{s}^{t}$	success reward	$r_{s}^{t} = \begin{cases} 15 & \text{ if goal reached} \\ 0 & \text{ otherwise } \end{cases}$
$r_{c}^{t}$	collision reward	$r_{c}^{t} = \begin{cases} -10 & \text{ if robot collides} \\ 0 & \text{ otherwise } \end{cases}$
$r_{d}^{t}$	danger reward	$r_{d}^{t} = \begin{cases} -0.25 & \text{ if } \exists{o \in O} : d(p_{robot}^t, p_{obs}^t) < D_{s}\\ 0 & \text{ otherwise } \end{cases}$
$r_{p}^{t}$	progress reward	$\text{diff}_{robot,x}^t = d(p_{robot}^{t-1}, p_{x}^{t-1}) - d(p_{robot}^t, p_{x}^t)$ $r_{p}^{t} = \begin{cases} 0.3 * \text{diff}_{robot,goal}^t & \text{ if } \text{diff}_{robot,goal}^t > 0\\ 0.4 * \text{diff}_{robot,goal}^t & \text{ otherwise } \end{cases}$
$r_{fg}^{t}$	following global plan reward	$r_{fg}^{t} = \begin{cases} \begin{aligned} 0.1 * vel_{linear}^{t} & \text{ if } \min_{wp \in G}d(p_{wp}^t, p_{r}^t) < 0.5 \text{m} \\ 0 & \text{ otherwise } \end{aligned} \end{cases}$
$r_{dg}$	distance to globalplan reward	$r_{dg} = \begin{cases} \begin{aligned} 0.2* \text{diff}_{robot, wp}^{t} & \text{ if }\min_{wp \in G}d(p_{r}^t, p_{wp}^t): \text{diff}_{robot, wp}^{t} > 0 \\ 0.3* \text{diff}_{robot, wp}^{t} & \text{ if } \min_{wp \in G}d(p_{r}^t, p_{wp}^t): \text{diff}_{robot, wp}^{t} <= 0 \\ 0 & \text{ if } \min_{o \in O}d(p_{r}^t, p_{o}^t) < D_s \end{aligned} \end{cases}$
$r_{dc}^{t}$	direction change reward	$r_{dc}^{t} = - \frac{\left| vel_{angular}^{t-1} - vel_{angular}^{t} \right|^{4}}{2500}$
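
The direction change term $r_{dc}^{t}$ penalizes abrupt changes of the angular velocity between two consecutive timesteps. A minimal sketch of the formula; the function name is hypothetical.

```python
def direction_change_reward(prev_vel_angular: float, vel_angular: float) -> float:
    """r_dc = -|w_{t-1} - w_t|^4 / 2500 (always <= 0)."""
    return -(abs(prev_vel_angular - vel_angular) ** 4) / 2500


if __name__ == "__main__":
    # Small changes are almost free, large swings are penalized strongly.
    for delta in (0.1, 1.0, 2.0):
        print(delta, direction_change_reward(0.0, delta))
```

Because of the fourth power, doubling the change in angular velocity makes the penalty sixteen times larger.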

Training Curriculum

To speed up training, an exemplary training curriculum was implemented. A training curriculum divides the training process into difficulty levels, here called stages, in which the agent meets an arbitrary number of obstacles depending on its learning progress. Different metrics can be taken into consideration to measure an agent's performance.

In our implementation, a reward threshold or a certain percentage of successful episodes must be reached to trigger the next stage. The statistics of each evaluation run are calculated and considered. Moreover, when a new best mean reward is reached, the model is saved automatically.

Exemplary training curriculum:

Stage	Static Obstacles	Dynamic Obstacles
1	0	0
2	10	0
3	20	0
4	0	10
5	10	10
6	13	13

For an explicit example, click here.
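
The stage transition described above can be sketched as follows. The obstacle counts are taken from the exemplary curriculum table; the function name, the `threshold_type` values "succ" and "rew" and the threshold numbers are assumptions for illustration and may differ from the actual implementation.

```python
# stage -> (static obstacles, dynamic obstacles), from the exemplary curriculum
CURRICULUM = {1: (0, 0), 2: (10, 0), 3: (20, 0), 4: (0, 10), 5: (10, 10), 6: (13, 13)}


def next_stage(
    stage: int,
    mean_reward: float,
    success_rate: float,
    threshold_type: str = "succ",
    threshold: float = 0.9,
) -> int:
    """Advance one stage when the chosen evaluation metric reaches the threshold."""
    metric = success_rate if threshold_type == "succ" else mean_reward
    if metric >= threshold and stage < max(CURRICULUM):
        return stage + 1
    return stage  # stay in the current stage (or the last one)


if __name__ == "__main__":
    stage = next_stage(1, mean_reward=3.0, success_rate=0.95)
    print(stage, CURRICULUM[stage])
```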

Run the trained Agent

Now that you've trained your agent you surely want to deploy and evaluate it. For that purpose we've implemented a specific task mode in which you can specify your scenarios in a .json file. The agent will then be challenged according to the scenarios defined in the file. Please refer to https://github.com/ignc-research/arena-scenario-gui/ in order to read about the process of creating custom scenarios. Moreover, you can test your agent on custom maps in randomly generated scenarios with a predefined number of dynamic obstacles.

As with the training script, one can start the testing simulation environment with either one of two launch scripts:

start_arena_gazebo.launch:
- Allows for evaluation in continuous simulation time (emulates real time) as well as in controlled time stepping with four different subgoal modes, consisting of 3 intermediate planner approaches (spatial horizon, timed A-star, simple sample) with the DRL agent acting as the local planner.
- Episode information can be logged via rosbag (refer to document)
start_training.launch:
- Starts an evaluation environment in continuous simulation time (emulates real time) as well as in controlled time stepping with either the spatial horizon intermediate planner or the end goal being the only subgoal.
- One can test multiple agents sequentially with run_script.py. This feature is only available with this launch file, as start_arena_gazebo.launch starts its own plan manager, which interferes with the plan manager of the run script. Both plan managers have their own goal radius and thus might detect the end of an episode differently, which can distort the logged statistics.
- Episode information can optionally be logged in a csv file by setting the --log flag for the run script. This requires the dedicated plan manager to control the episodes.

Firstly, you need to start the simulation environment:

# Start the simulation with one of the launch files
roslaunch arena_bringup start_arena_gazebo.launch map_file:="map1"  disable_scenario:="false" scenario_file:="eval/obstacle_map1_obs20.json"

roslaunch arena_bringup start_training.launch num_envs:=1 map_folder_name:=map1 train_mode:=false

Note:

The train_mode parameter determines whether the simulation runs in emulated real time (where step_size and update_rate determine the simulation speed) or in controlled time stepping mode (via the /step_world rostopic).

Then, run the run_agent.py script with the desired scenario file:

python run_agent.py --load DRL_LOCAL_PLANNER_2021_03_22__19_33 --scenario obstacle_map1_obs20

Generic program call:

roscd arena_local_planner_drl/scripts/deployment/
run_agent.py --load [agent_name] -s [scenario_name] -v [number] [optional flag]

Program call	Flags	Usage	Description
`run_agent.py`	`--load`	agent_name (see below)	loads the agent with the given name
	`-s` or `--scenario`	scenario_name (as in ../scenario/eval/)	loads the scenarios from the .json file with the given name
	(optional) `-v` or `--verbose`	0 or 1	verbose level
	(optional) `--no-gpu`	None	disables the gpu for the evaluation
	(optional) `--num_eps`	Integer, defaults to 100	number of episodes the agent or agents are challenged with
	(optional) `--max_steps`	Integer, defaults to np.inf	maximum number of actions per episode before the episode is reset automatically

Example call:

python run_agent.py --load DRL_LOCAL_PLANNER_2021_03_22__19_33 -s obstacle_map1_obs20

Notes:

The --log flag should only be set with start_training.launch as simulation launcher as it requires the dedicated plan manager to control the episodes in order to log correct statistics.
When running start_arena_gazebo.launch: Make sure that drl_mode is activated in plan_fsm_param.yaml
Make sure that the simulation does not run faster than the agent can compute its actions (an obvious indicator: the same action gets published multiple times in succession and the agent therefore moves unreasonably)
If your agent was trained with normalized observations, it's necessary to provide the vec_normalize.pkl

Sequential Evaluation of multiple Agents

For automatic testing of several agents in a sequence, one can specify a list containing an arbitrary number of agent names in run_script.py.

Note:

Guaranteed execution of each agent is currently only provided with the start_training.launch as simulation launcher
The --load flag has to be set to None, otherwise the script will only consider the agent provided with the flag.

Important Directories

Path	Description
`../arena_local_planner_drl/agents`	models and associated hyperparameters.json will be saved to and loaded from here (uniquely named directory)
`../arena_local_planner_drl/configs`	yaml files containing robots action spaces and the training curriculum
`../arena_local_planner_drl/training_logs`	tensorboard logs and evaluation logs
`../arena_local_planner_drl/scripts`	python file containing the predefined DNN architectures and the training script
