Deep reinforcement learning with PyLIS
This part describes a sample environment with a sample agent to perform
deep reinforcement learning.
The terms "environment", "agent", "observation,"
"action," "reward" are used as those in the field of reinforcement
learning.
-
PyLIS/gym-foodhunting/gym_foodhunting/foodhunting/gym_foodhunting.py
- FoodHuntingEnv Class
-
The HSR (Human Support Robot) model is used.
-
Only the left and right wheels are controlled.
-
The robot model can be specified in the following setting file:
-
PyLIS/gym-foodhunting/gym_foodhunting/__init__.py
-
Robot models are specified with the robot_model parameter.
-
-
Implemented classes
-
HSR (all joints and the wheels are controlled with continuous values; learning requires much more time than with the two classes below)
-
HSRSimple (continuous action values for the wheels only)
-
HSRDiscrete (discrete action values for the wheels only)
-
-
Environment compatible with OpenAI Gym [7][8]
- The OpenAI Gym-compatible environment makes it easier to use existing deep reinforcement learning libraries (see below).
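For illustration, the following is a minimal sketch (not taken from the repository) of creating the Gym-compatible environment and inspecting its spaces; it assumes that importing the gym_foodhunting package runs the register() calls in its __init__.py.

# Minimal sketch: create the environment through the standard Gym API.
import gym
import gym_foodhunting  # assumed to register the FoodHunting environment ids

env = gym.make('FoodHuntingHSR-v0')  # environment id from the setting file
print(env.observation_space)         # image observation (see below)
print(env.action_space)              # wheel actions (see below)
env.close()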
-
Procedure
-
One robot and one or more baits are placed randomly in a virtual 3D space.
-
Reset APIs
-
FoodHuntingEnv.reset
-
FoodHuntingEnv._generateFoodPositions
-
-
-
When the robot touches a bait, it earns reward 1 and the bait disappears.
-
Step APIs
-
FoodHuntingEnv.step
-
FoodHuntingEnv._getReward
-
HSRSimple.setAction
-
HSRDiscrete.setAction
-
-
-
When a certain number of steps elapse or all the baits are consumed, the episode is terminated.
-
Step API
- FoodHuntingEnv.step
-
-
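The procedure above corresponds to the standard Gym episode loop. The following minimal sketch (not taken from the repository) runs one episode with a random agent standing in for a learned policy.

# Minimal sketch of one episode, assuming the standard Gym API.
import gym
import gym_foodhunting  # assumed to register the environment ids

env = gym.make('FoodHuntingHSR-v0')
obs = env.reset()                    # the robot and baits are placed randomly
done = False
total_reward = 0.0
while not done:                      # until max_steps elapse or all baits are consumed
    action = env.action_space.sample()          # random agent, for illustration only
    obs, reward, done, info = env.step(action)  # reward +1 when a bait is touched
    total_reward += reward
env.close()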
-
Difficulty of the task
-
Parameters can be set in the setting file (a hypothetical registration sketch follows this list):
- PyLIS/gym-foodhunting/gym_foodhunting/__init__.py
-
Number of baits (num_foods)
- Number of baits with positive reward +1.
If there are many baits, the agent must prioritize them in order to obtain a better result.
-
Number of fake baits (num_fakes)
- Number of baits with negative reward -1.
The agent must seek baits while avoiding fake baits.
-
Size of baits (object_size)
- The larger they are, the easier the task is.
-
Range of random bait distribution
-
Radius (object_radius_scale)
-
Offset (object_radius_offset)
-
Angle (object_angle_scale)
-
The smaller the range is, the easier the task is.
-
-
Max steps per episode (max_steps)
- The episode is reset when the number of steps reaches this value.
-
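As a hypothetical illustration only, a difficulty setting in the setting file might look like the following Gym registration. The entry-point string, the value types (for example, whether robot_model takes a class or a name), and the numeric values are assumptions; the parameter names are the ones listed above.

from gym.envs.registration import register

register(
    id='FoodHuntingHSR-v0',
    # assumed entry point; the class lives in gym_foodhunting/foodhunting/gym_foodhunting.py
    entry_point='gym_foodhunting.foodhunting.gym_foodhunting:FoodHuntingEnv',
    kwargs={
        'robot_model': 'HSRSimple',     # placeholder; HSR, HSRSimple, or HSRDiscrete
        'num_foods': 3,                 # baits with reward +1
        'num_fakes': 0,                 # fake baits with reward -1
        'object_size': 1.0,             # larger baits make the task easier
        'object_radius_scale': 1.0,     # radius of the random distribution
        'object_radius_offset': 1.0,    # offset of the random distribution
        'object_angle_scale': 1.0,      # angular range of the random distribution
        'max_steps': 500,               # steps per episode before reset
    },
)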
-
Based on the input from the camera, the observation is constructed as a tensor by combining the color image (3 RGB channels) and the depth image (1 channel) for each pixel.
- Tensor of width×height×4
-
Color image and depth image
-
Step APIs
-
FoodHuntingEnv.step
-
Robot.getObservation
-
-
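A minimal sketch of the observation layout described above; the image size and the value bounds are placeholders, not values taken from the repository.

import numpy as np
from gym import spaces

height, width = 64, 64                        # placeholder camera resolution
observation_space = spaces.Box(low=0.0, high=1.0,     # placeholder bounds
                               shape=(height, width, 4), dtype=np.float32)

obs = observation_space.sample()              # stands in for FoodHuntingEnv.step output
rgb = obs[:, :, :3]                           # color image (3 channels)
depth = obs[:, :, 3]                          # depth image (1 channel)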
-
Absolute position and absolute direction
-
The absolute position is given as 3D coordinates; the absolute direction is given as a quaternion.
-
Step APIs
-
FoodHuntingEnv.step
-
Robot.getPositionAndOrientation
-
-
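PyBullet, the physics engine used by the environment, exposes this pose directly; the environment itself returns it through Robot.getPositionAndOrientation. The following sketch only shows the data format, using a placeholder body rather than the HSR model.

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
body_id = p.loadURDF('r2d2.urdf')                      # placeholder body, not HSR

position, orientation = p.getBasePositionAndOrientation(body_id)
print(position)       # absolute position (x, y, z)
print(orientation)    # absolute direction, quaternion (x, y, z, w)
p.disconnect()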
-
Locomotion by robot wheels
-
Either "discrete" or "continuous" can be specified for values in the setting file:
-
PyLIS/gym-foodhunting/gym_foodhunting/__init__.py
-
For the robot_model parameter, specify HSRDiscrete for discrete values or HSRSimple for continuous values.
-
-
If discrete:
-
Actions are:
-
Go forward
-
Turn right
-
Turn left
-
-
HSRDiscrete.setAction
-
-
If continuous:
-
Actions are:
-
The velocity of the right wheel
-
The velocity of the left wheel
-
-
HSRSimple.setAction
-
-
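In Gym terms, the two interfaces described above correspond to a discrete and a continuous action space. A minimal sketch, with assumed bounds for the continuous case:

import numpy as np
from gym import spaces

# HSRDiscrete: three discrete actions (go forward, turn right, turn left)
discrete_actions = spaces.Discrete(3)

# HSRSimple: continuous velocities for the right and left wheels (bounds assumed)
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)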
-
Step API
- FoodHuntingEnv.step
When the robot touches a bait, it gets a positive reward of +1.
The number of baits is therefore the maximum total reward of an episode.
At every time step, the agent also gets a small negative reward: -1 × (number of baits) / (max step number).
For example, if the number of baits is 10 and the max step number is 200, the agent gets a reward of -1 × 10/200 = -0.05 per step.
To maximize the total reward, the agent must touch all the baits as quickly as possible.
-
Step APIs
-
FoodHuntingEnv.step
-
FoodHuntingEnv._getReward
-
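The per-step reward described above can be summarized by the following sketch; it illustrates the formula in the text and is not the repository's _getReward implementation.

def step_reward(baits_touched, num_foods, max_steps):
    # +1 per bait touched at this step, minus the constant time penalty
    # of (number of baits) / (max step number).
    return float(baits_touched) - float(num_foods) / float(max_steps)

print(step_reward(0, 10, 200))   # -0.05, the example from the text
print(step_reward(1, 10, 200))   #  0.95 on a step where a bait is touched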
-
Whether or not to render can be selected in the setting file:
- PyLIS/gym-foodhunting/gym_foodhunting/__init__.py
-
Environments whose id parameter contains the string GUI in the setting file are rendered; other environments are not rendered.
For example, FoodHuntingHSR-v0 runs without rendering and FoodHuntingHSRGUI-v0 runs with rendering.
-
Environments without rendering enable faster learning.
-
When evaluating a learned model, use an environment with rendering to visually check the behavior of the robot.
Fig. 3: Running model evaluation with rendering. The robot (center) is HSR, the red spheres are true baits, the red cubes are fake baits, the upper-left window shows the RGB image from the camera position, the middle-left window shows the depth image, and the lower-left window shows the segmentation mask buffer (not used in PyLIS).
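A minimal sketch of selecting the headless or the rendered variant by environment id (ids taken from the example above):

import gym
import gym_foodhunting  # assumed to register both ids

train_env = gym.make('FoodHuntingHSR-v0')    # no rendering, fast learning
eval_env = gym.make('FoodHuntingHSRGUI-v0')  # rendered, for visually checking behavior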
Deep reinforcement learning was implemented using Stable Baselines [9][10], a reinforcement learning library. Stable Baselines is a set of reinforcement learning algorithm implementations developed by Ashley Hill and others, with many improvements over the OpenAI Baselines implementation [9].
Stable Baselines uses TensorFlow [12] internally, and using a GPU can significantly reduce computation time. In this environment, however, the CPU load of the physics engine (PyBullet) is larger than that of the deep reinforcement learning itself, so the benefit of a GPU is limited.
PPO [11] is used as the reinforcement learning algorithm. PPO is currently one of the most powerful reinforcement learning algorithms and supports continuous/discrete actions, recurrent models, and multiprocessing [9]. It can also be replaced with other reinforcement learning methods supported by Stable Baselines, such as A2C [9].
- PyLIS/gym-foodhunting/agents/ppo_agent.py
A CNN+LSTM is used as the neural network model.
-
Policy class
- stable_baselines.common.policies.CnnLstmPolicy
It is also possible to replace it with a normal CNN model (CnnPolicy) or MLP model (MlpPolicy).
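A minimal construction sketch, assuming Stable Baselines version 2, where the PPO implementation is the PPO2 class; swapping CnnLstmPolicy for CnnPolicy or MlpPolicy is the one-line change mentioned above. This is not the repository's ppo_agent.py.

import gym
import gym_foodhunting  # assumed to register the environment ids
from stable_baselines import PPO2
from stable_baselines.common.policies import CnnLstmPolicy  # or CnnPolicy, MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('FoodHuntingHSR-v0')])
# for recurrent policies, nminibatches must divide the number of environments
model = PPO2(CnnLstmPolicy, env, nminibatches=1, verbose=1)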
-
The following parameters can be specified for learning (a hedged training sketch follows this list).
-
Environment name (env_name)
- E.g.: FoodHuntingHSR-v0
-
Load file name (load_file)
- Initial values for the weights of the neural network model are set from the file.
-
Save file name (save_file)
- The learned weights of the neural network model are saved to this file.
-
TensorBoard log file name (tensorboard_log)
- Log data from TensorFlow [12], which is used inside Stable Baselines, is saved to this file. The log file is also used to monitor the progress of learning on TensorBoard [13] (see below).
-
Total time steps (total_timesteps)
- Learning is terminated when this number of steps has elapsed.
-
Number of CPU cores (n_cpu)
-
Run multiple environments simultaneously with multi-processing.
-
Specifying twice the actual number of cores often provides optimal performance.
-
-
Reward threshold (reward_threshold)
- Learning is terminated when the total reward exceeds this threshold.
-
Learning API
- ppo_agent.learn
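The following hedged sketch shows how these parameters might map onto Stable Baselines calls; it is not the actual ppo_agent.learn implementation, the file names are placeholders, and the reward_threshold-based early stop is omitted.

import gym
import gym_foodhunting  # assumed to register the environment ids
from stable_baselines import PPO2
from stable_baselines.common.policies import CnnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env():
    return gym.make('FoodHuntingHSR-v0')                  # env_name

if __name__ == '__main__':
    n_cpu = 8                                             # n_cpu: parallel environments
    env = SubprocVecEnv([make_env for _ in range(n_cpu)])
    model = PPO2(CnnLstmPolicy, env, verbose=1,
                 nminibatches=n_cpu,                      # must divide the number of environments
                 tensorboard_log='FoodHuntingLog')        # tensorboard_log
    # load_file: PPO2.load('foodhunting', env=env) would resume from saved weights instead
    model.learn(total_timesteps=1000000)                  # total_timesteps
    model.save('foodhunting')                             # save_file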
-
Visualizing learning log data on TensorBoard
-
The log data for learning is saved in the directory specified by the TensorBoard log file name ("FoodHuntingLog" in the following example).
-
To start the Tensorboard server, execute the following command:
tensorboard --logdir "FoodHuntingLog"
-
Next, start a web browser and access http://localhost:6006/ to display TensorBoard. Fig. 4 shows TensorBoard displayed in the web browser, where camera images and parameter sequences are visualized.
-
-
Fig. 4: TensorBoard in a web browser
-
The following parameters can be specified for evaluation (a hedged evaluation sketch follows this list):
-
Environment name (env_name)
- E.g.: FoodHuntingHSR-v0
-
Load filename (load_file)
- The weights of the neural network model are loaded from this file.
-
Total time steps (total_timesteps)
- Evaluation is terminated when this number of steps has elapsed.
-
Number of CPU cores (n_cpu)
-
Run multiple environments simultaneously with multi-processing.
-
Specifying twice the actual number of cores often provides optimal performance.
-
With rendering, the number of displayed windows is limited to one to reduce the load; without rendering, the number of environments is as specified by n_cpu.
-
-
-
Evaluation API
- ppo_agent.play
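As a rough sketch of evaluation (not the actual ppo_agent.play implementation), a trained model can be loaded and run in the rendered environment. The file name is a placeholder, and because a recurrent policy keeps per-environment LSTM state, the number of environments is assumed to match the one used for training.

import gym
import gym_foodhunting  # assumed to register the environment ids
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('FoodHuntingHSRGUI-v0')])  # rendered environment
model = PPO2.load('foodhunting', env=env)                      # load_file
obs = env.reset()
state = None                                                   # LSTM hidden state
for _ in range(1000):                                          # roughly total_timesteps
    action, state = model.predict(obs, state=state)
    obs, reward, done, info = env.step(action)                 # the VecEnv resets finished episodes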