- ➤ 📝 About The Project
- ➤ 🔨 Usage
- ➤ 💾 Key Project File Descriptions
- ➤ 🔨 Basic Installation
- ➤ 🚀 Dependencies
- ➤ ☕ Buy me a coffee
- ➤ 📜 Credits
- ➤ License
This repository explores and implements reinforcement learning strategies for active simultaneous localization and mapping (SLAM) using a single robot. The project integrates advanced reinforcement learning algorithms, specifically Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradients (DDPG), enabling a robotic agent to autonomously explore and map unknown environments effectively.
At its core, this project seeks to bridge the gap between theoretical reinforcement learning and practical robotic applications. It focuses on creating a robust learning environment where the robot iteratively updates and adjusts its policy based on real-world dynamics. This approach applies various state-of-the-art reinforcement learning algorithms to enhance both the precision of the spatial maps generated and the efficiency of goal-based exploration in True Unknown Environments (TUE).
Designed for researchers and developers with an interest in robotics and machine learning, this project provides a deep dive into how autonomous agents can learn to adapt and navigate independently, pushing the boundaries of robotic autonomy in exploration and mapping tasks.
To simulate our robotic environment, we create a custom training environment using Gazebo Classic, a physics simulator. This package is responsible for setting up Gazebo with custom maps and our robot model. It also starts up the SLAM algorithm from the slam_toolbox.
This package handles the logic for training our agent in the custom training environment. It creates several nodes that communicate with each other and hosts the main reinforcement learning algorithms, including the training loop.
To allow all our nodes to communicate with each other, we use this package to define the msg and srv files that establish a protocol for sending and receiving information amongst the nodes.
The slam_toolbox is managed and distributed by Steve Macenski, and its GitHub repository can be found here. It allows us to localise the robot and map its environment, which is essentially the basis of this research.
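To give a flavour of the node-to-node communication the interfaces package enables, here is a minimal, generic rclpy service client. It uses the standard std_srvs/srv/Empty type rather than the project's custom srv definitions, and the node and service names are hypothetical:

```python
import rclpy
from rclpy.node import Node
from std_srvs.srv import Empty  # stand-in for the project's custom srv types


class ResetClient(Node):
    """Minimal example of one node calling a service offered by another."""

    def __init__(self):
        super().__init__("reset_client")
        # "reset_simulation" is a hypothetical service name used for illustration.
        self.client = self.create_client(Empty, "reset_simulation")
        while not self.client.wait_for_service(timeout_sec=1.0):
            self.get_logger().info("waiting for service...")

    def send_request(self):
        future = self.client.call_async(Empty.Request())
        rclpy.spin_until_future_complete(self, future)
        return future.result()


def main():
    rclpy.init()
    node = ResetClient()
    node.send_request()
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```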
I highly recommend running this with tmux. Tmux allows us to split the terminal into panes so we can view all the essential ROS2 processes simultaneously. Install it with:
sudo apt install tmux
After completing the steps in the Basic Installation section:
Start tmux with:
tmux
First source ROS2 Humble with the command:
source /opt/ros/humble/setup.bash
Change into the workspace directory:
cd reinforcement-learning-active-slam
and run the following lines to split the panes:
tmux split-window -v -p 30
tmux split-window -h -p 50
tmux split-window -h -p 50 -t 0
tmux split-window -v -p 50 -t 1
tmux split-window -v -p 66
tmux split-window -v -p 50
tmux select-pane -t 6
If you don't have tmux, you can open a separate terminal window for each of the following processes instead.
Launch the Gazebo physics simulation with our robot model:
source install/setup.bash
tmux select-pane -t 4
ros2 launch active_slam_simulations main_world.launch.py
Next launch the SLAM algorithm from the slam toolbox:
source install/setup.bash
tmux select-pane -t 5
ros2 launch active_slam_simulations slam.launch.py
Next run the Gazebo Bridge node:
source install/setup.bash
tmux select-pane -t 1
ros2 run active_slam_learning gazebo_bridge
Next run the Learning Environment node:
source install/setup.bash
tmux select-pane -t 3
ros2 run active_slam_learning learning_environment
Start up RViz to see the map the robot generates each episode:
source install/setup.bash
tmux select-pane -t 0
ros2 launch active_slam_simulations view_map.launch.py
Lastly, start the Learning node (DDPG or PPO):
source install/setup.bash
ros2 run active_slam_learning learning_ddpg
or
ros2 run active_slam_learning learning_ppo
Please refer to this cheatsheet for more information, but two helpful commands are:
- ctrl+b o ~ switch to the next pane
- ctrl+b z ~ zoom in and out of a pane
- main_world.launch.py: This ROS2 launch file currently starts up Gazebo Classic, our physics simulation engine. It also runs the robot state publisher, which broadcasts the robot's state, and spawns the robot model in the simulation environment (a generic launch file sketch follows this list).
- slam.launch.py: This file launches the SLAM toolbox, which we use to perform Active SLAM. This allows us to calculate map certainty and, importantly, view estimated mappings of the environment while the robot explores. The slam_toolbox is managed and distributed by Steve Macenski and its GitHub repository can be found here.
- view_map.launch.py: This file launches RViz, enabling the visualisation of grid maps generated by the SLAM algorithm.
- models folder: This folder contains the model files for our simulation, including the Turtlebot3_burger robot and the goal pad.
- worlds folder: This folder contains the different custom SDF world files, which took a very, very, very long time to create 👍
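For readers new to ROS2 launch files, here is a minimal sketch of the general pattern these files follow. It is illustrative only: the real main_world.launch.py also starts Gazebo Classic, passes the robot description, and spawns the robot model.

```python
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    # The real launch file would also start Gazebo Classic and supply the
    # robot_description (URDF) parameter; this sketch only shows the shape.
    robot_state_publisher = Node(
        package="robot_state_publisher",
        executable="robot_state_publisher",
        parameters=[{"use_sim_time": True}],
    )
    return LaunchDescription([robot_state_publisher])
```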
- learning_environment.py: This file defines a Learning Environment node for simulating a single-robot system that learns to autonomously explore True Unknown Environments using reinforcement learning through ROS2. It manages state updates, action processing, and reward calculations necessary for RL experiments. Key components include handling robot velocities, calculating map uncertainty using D-Optimality, and computing rewards. The environment interacts with the robot and the simulated Gazebo environment, primarily through the custom Gazebo Bridge, to facilitate the training and evaluation of learning agents.
- gazebo_bridge.py: This file implements a custom Gazebo Bridge node, which handles direct communication with Gazebo services and the training environment. It manages the spawning and movement of the goal and robot models, resets the SLAM algorithm provided by the SLAM toolbox, and provides services for managing simulation states such as pausing and unpausing the physics engine.
- reward_function.py: This file implements the reward function for the reinforcement learning (RL) agent (a Python sketch of this calculation follows this list):
  - Initial Reward: The reward calculation's starting point is -0.4.
  - Linear Velocity Penalty: Encourages the robot to maintain a higher linear velocity, calculated as -3 times the difference between the maximum speed and the current linear velocity.
    - This ranges from -1.2 to 0 for linear velocities between -0.2 and 0.2.
  - Angular Velocity Penalty: Penalises higher angular velocities, calculated as -0.2 times the square of the angular velocity.
    - This ranges from -0.968 to 0 for angular velocities between -2.2 and 2.2.
  - Collision Penalty: Imposes a significant penalty of -1000 if the robot collides with an obstacle.
  - Goal Reward: Rewards the robot with 1000 if it successfully finds the goal.
  - Map Uncertainty Reward: Provides a positive intrinsic reward based on the map uncertainty (D-Optimality). If D-Optimality is not provided, the reward is 0. Otherwise, it is calculated as the hyperbolic tangent of 0.01 divided by the D-Optimality value.
    - This ranges from 0 to 1.
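For reference, here is a minimal Python sketch of the reward composition described above. The function signature, argument names, and the max_linear_speed default are assumptions rather than the exact code in reward_function.py:

```python
import numpy as np


def calculate_reward(linear_vel, angular_vel, collided, found_goal,
                     d_optimality=None, max_linear_speed=0.2):
    reward = -0.4                                   # initial reward
    reward -= 3 * (max_linear_speed - linear_vel)   # linear velocity penalty
    reward -= 0.2 * angular_vel ** 2                # angular velocity penalty
    if d_optimality is not None:
        reward += np.tanh(0.01 / d_optimality)      # intrinsic map-uncertainty reward, in (0, 1)
    if collided:
        reward -= 1000                              # collision penalty
    if found_goal:
        reward += 1000                              # goal reward
    return reward
```

With these terms, a robot driving straight at full speed without rotating, colliding, or finding the goal receives roughly the -0.4 base reward plus the intrinsic mapping bonus.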
- srv: This folder contains the service types used by individual nodes to communicate with each other, an essential foundation of the ROS framework.
- msg: This folder contains the message types used by individual nodes to communicate with each other, an essential foundation of the ROS framework.
- learning_ddpg.py: Facilitates the main training loop of the Deep Deterministic Policy Gradient algorithm, including frame stacking, frame skipping, reward normalisation, initialising the model and the memory buffer, handling the scoring metrics per episode, and saving the training data for later evaluation.
- agent.py: Defines the main agent interacting with the environment. It encapsulates the logic for selecting actions, applying noise, learning, and updating the policy (a sketch of the soft target-network update appears after this list).
- replay_memory.py: This file implements the replay memory buffer, storing the agent's experiences during training. It allows for random sampling of experiences to stabilise training by breaking the correlation between consecutive experiences.
- networks.py: This file defines the neural network architectures used for the actor and critic models in the DDPG algorithm. These networks are responsible for approximating the policy and value functions.
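To illustrate the soft update governed by the TAU setting listed further down, here is a minimal PyTorch sketch of how DDPG target networks are typically blended toward the online networks. The standalone function and parameter names are illustrative, not the project's exact agent.py code:

```python
import torch


def soft_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.005):
    """Blend target weights toward the online network: theta_target <- tau*theta + (1-tau)*theta_target."""
    with torch.no_grad():
        for online_param, target_param in zip(online.parameters(), target.parameters()):
            target_param.data.mul_(1.0 - tau)
            target_param.data.add_(tau * online_param.data)
```

With TAU = 0.005, the target networks track the online networks slowly, which keeps the bootstrapped critic targets stable during training.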
- learning_ppo.py: Facilitates the main training loop for the Proximal Policy Optimization (PPO) algorithm, including frame stacking, frame skipping, reward clipping, model and memory buffer initialisation, episode scoring metrics, and saving the training data for later evaluation. It handles the overall flow of the PPO training process.
- agent.py: Defines the main agent interacting with the environment. It encapsulates the logic for selecting actions, learning, and updating the policy using actor and critic networks. The agent implements the PPO update rule, which involves clipping probability ratios to ensure stable policy updates and maintaining entropy for exploration (a sketch of this clipped objective appears after this list).
- replay_memory.py: Implements the PPO memory buffer, storing experiences and enabling random sampling to stabilise training. It manages states, actions, rewards, next states, done flags, and action probabilities, facilitating mini-batch updates during learning.
- networks.py: Defines the neural network architectures for the actor and critic models in PPO. The actor network uses Beta distributions for action sampling, ensuring a stochastic policy, while the critic network estimates state values to provide advantage estimates during training.
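As a reference for the clipped update rule described above, here is a minimal PyTorch sketch of the PPO actor loss. The standalone function and tensor names are illustrative assumptions, not the project's exact agent.py code; the default values mirror POLICY_CLIP and ENTROPY_COEFFICIENT from settings.py:

```python
import torch


def ppo_actor_loss(new_log_probs, old_log_probs, advantages,
                   policy_clip=0.2, entropy=None, entropy_coeff=0.01):
    """Clipped surrogate objective: take the pessimistic minimum of the clipped and unclipped terms."""
    ratios = torch.exp(new_log_probs - old_log_probs)                       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - policy_clip, 1 + policy_clip) * advantages
    loss = -torch.min(unclipped, clipped).mean()                            # maximise surrogate -> minimise negative
    if entropy is not None:
        loss -= entropy_coeff * entropy.mean()                              # entropy bonus encourages exploration
    return loss
```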
- utilities.py: Includes helper functions for communicating with the Learning Environment node, as well as plotting functions.
- settings.py: Contains all the configuration settings for the training process. For most use cases, this is the only file you will ever need to change (an example edit is sketched after the settings list below).
The following settings and options are exposed to you:
- MAX_LINEAR_SPEED: Maximum linear speed (0.22)
- MAX_ANGULAR_SPEED: Maximum angular speed (2.0)
- MAX_SCAN_DISTANCE: Maximum scan distance (3.5)
- NUMBER_OF_SCANS: Number of scans (90)
- COLLISION_DISTANCE: Collision distance (0.18)
- ENVIRONMENT_OBSERVATION_SPACE: Observation space for the environment (NUMBER_OF_SCANS + 2)
- ENVIRONMENT_ACTION_SPACE: Action space for the environment (2)
- EPISODE_LENGTH_SEC: Episode length in seconds (60)
- EPISODE_STEPS: Number of steps in an episode (1000)
- GOAL_PAD_RADIUS: Radius of the goal pad (0.7)
- REWARD_DEBUG: Debug mode for rewards (True)
- LOAD_MODEL: Whether to load a pre-trained model (False)
- MODEL_PATH: Path to the model ("training_data/models/single_robot_exploration")
- TRAINING_STEPS: Number of training steps (1_000_000)
- RANDOM_STEPS: Number of random steps (25000)
- MAX_MEMORY_SIZE: Maximum memory size (1_000_000)
- FRAME_BUFFER_DEPTH: Frame buffer depth (3)
- FRAME_BUFFER_SKIP: Frame buffer skip (10)
- TRAINING_EPISODES: Number of training episodes (2000)
- ALPHA_DDPG: Learning rate for the actor (0.0001)
- BETA_DDPG: Learning rate for the critic (0.0003)
- ACTOR_DDPG_FC1: Number of units in the first fully connected layer of the actor (400)
- ACTOR_DDPG_FC2: Number of units in the second fully connected layer of the actor (512)
- CRITIC_DDPG_FC1: Number of units in the first fully connected layer of the critic (512)
- CRITIC_DDPG_FC2: Number of units in the second fully connected layer of the critic (512)
- TAU: Soft update parameter (0.005)
- GAMMA_DDPG: Discount factor for future rewards (0.99)
- BATCH_SIZE_DDPG: Training batch size (100)
- ALPHA_MAPPO: Learning rate for the actor (0.0001)
- BETA_MAPPO: Learning rate for the critic (0.003)
- ACTOR_PPO_FC1: Number of units in the first fully connected layer of the actor (512)
- ACTOR_PPO_FC2: Number of units in the second fully connected layer of the actor (512)
- CRITIC_PPO_FC1: Number of units in the first fully connected layer of the critic (512)
- CRITIC_PPO_FC2: Number of units in the second fully connected layer of the critic (512)
- POLICY_CLIP: Clipping parameter for the policy (0.2)
- GAMMA_PPO: Discount factor for future rewards (0.99)
- TRAJECTORY: Number of steps per trajectory (2048)
- NUM_MINI_BATCHES: Number of mini-batches for training (64)
- N_EPOCHS: Number of epochs per update (15)
- GAE_LAMBDA: Generalized Advantage Estimation lambda (0.95)
- ENTROPY_COEFFICIENT: Coefficient for entropy regularization (0.01)
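For example, a typical experiment tweak is done entirely in settings.py by editing module-level constants. The excerpt below is illustrative of that pattern; the exact layout of the real file may differ:

```python
# settings.py (illustrative excerpt): plain constants read by the learning nodes.
NUMBER_OF_SCANS = 90
ENVIRONMENT_OBSERVATION_SPACE = NUMBER_OF_SCANS + 2   # laser scans plus two additional state values
ENVIRONMENT_ACTION_SPACE = 2                           # e.g. linear and angular velocity

# Resume training from a saved model instead of starting fresh.
LOAD_MODEL = True
MODEL_PATH = "training_data/models/single_robot_exploration"

# Make DDPG updates slightly more aggressive.
ALPHA_DDPG = 3e-4        # actor learning rate (default 1e-4)
BATCH_SIZE_DDPG = 128    # default 100
```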
Install Ubuntu 22.04 OS
Install ROS2 Humble
Download workspace
git clone https://github.com/i1Cps/reinforcement-learning-active-slam.git
cd reinforcement-learning-active-slam
Build the workspace (this could take a few minutes)
colcon build --symlink-install
Change the ROS2 DDS implementation to CycloneDDS
sudo apt install ros-humble-rmw-cyclonedds-cpp
echo 'export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp' >> ~/.bashrc
Install package dependencies
sudo rosdep init
rosdep update
rosdep install -i --from-path src --rosdistro humble -y
pip install setuptools==58.2.0
colcon build --symlink-install
This project is developed using ROS2 and Gazebo for simulation and coordination of robotic agents, and PyTorch for reinforcement learning.
Whether you use this project, have learned something from it, or just like it, please consider supporting it by buying me a coffee, so I can dedicate more time to open-source projects like this (҂⌣̀_⌣́)
Theo Moore-Calters
Licensed under MIT.