Current Actor-Critic Reinforcement Learning algorithms face a fundamental limitation: the critic has to implicitly learn about the current state of the policy embodied by the actor. This leads to a delayed learning reaction of the critic, as it always lags behind the actor. Policy-Conditioned Value Functions explicitly provide the critic with a representation of the policy that is to be evaluated. Earlier research has shown that these methods are effective in Reinforcement Learning. In this thesis we build upon the current state of Policy-Conditioned Value Functions by adding several improvements. We explore three different methods for representing policies, namely the flat embedding, the neuron embedding and Network Fingerprinting, and assess their capabilities in different settings. The neuron embedding is our novel approach for policy representations. Furthermore, we establish a base algorithm for Policy-Conditioned Value Functions and explore three algorithmic variants which enable the algorithm to (1) explore multiple policies in one rollout, (2) reframe Reinforcement Learning in continuous space as a binary classification problem and (3) train multiple actors at once with one critic. Our results show that the established policy representation "Network Fingerprinting" in combination with our algorithmic variants improves the overall performance of Policy-Conditioned Value Functions.
The code has been developed during my master's thesis "Improving Policy-Conditioned Value Functions". The full thesis can be found in thesis.pdf. The aim was to create a framework that makes it easy to implement vastly different Reinforcement Learning algorithms and to compare them. Most of the experiments contained in the thesis can be reproduced with the provided configurations and scripts; the details on how to do so are given below. The code contains many abstraction layers to make the implementation of new algorithms as easy as possible. In addition, a custom configuration format has been developed that can be used to set up different experimental settings with vastly different class structures. This configuration format has been published as a Python package. I plan to reorganize the code further and to build a standalone framework for exotic Reinforcement Learning algorithms.
MuJoCo has to be installed separately. Refer to the official repository for instructions on how to do so: https://github.com/openai/mujoco-py.
Using Anaconda, you can create the environment via
conda env create -f environment.yml
Alternatively, requirements.txt contains the exact versions that have been tested. These can be installed with Anaconda via
conda create --name pcvf --file requirements.txt
The requirements might have to be adjusted if the use of CUDA is desired. After installing the requirements, execute the following command in the directory of the repository.
pip install -e .
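To quickly verify that the installation succeeded, a minimal sanity check along the following lines can be used. This snippet is not part of the repository and only assumes the packages from the environment above; replacing Pendulum-v0 with HalfCheetah-v2 additionally exercises the MuJoCo bindings.

```python
# Hypothetical sanity check: confirm the core dependencies import
# and report whether CUDA is available for training.
import gym
import torch

env = gym.make("Pendulum-v0")   # swap in "HalfCheetah-v2" to also test mujoco-py
obs = env.reset()
print("Observation shape:", obs.shape)
print("CUDA available:", torch.cuda.is_available())
env.close()
```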
The experiments folder contains configurations for most experiments of the thesis in a custom format. The script execute_experiment.py can be used to run these configurations. For example,
python execute_experiment.py hc_comp_s_pcac_fp 50000
will run the experiment with the name hc_comp_s_pcac_fp for 50000 exploration steps. How many time steps in the environment are executed per exploration step depends on the specific algorithm. All other parameters are set in the corresponding config.ini file (here experiments/hc_comp_s_pcac_fp/config.ini). At the end of the run a checkpoint is created in the experiment folder with an incremental run ID. This checkpoint may be used to evaluate the policy or to continue training. Rerunning the command will result in a newly initialized run. Old runs can also be continued by including the --run_id option as follows:
python execute_experiment.py hc_comp_s_pcac_fp 50000 --run_id 1 --device "cpu"
The --device option specifies the device to run the experiment on. By default, the script checks whether CUDA is available and uses it if so; this option can be used to override that default.
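For reproducing several experiments in sequence, the script can also be driven from Python. The following is only an illustrative sketch built on the command-line interface described above; the experiment names and step count are examples, not a prescribed workflow.

```python
# Illustrative launcher: start a fresh training run for each listed experiment,
# then continue run 1 of the first experiment on the CPU.
import subprocess

experiments = ["hc_comp_s_pcac_fp", "hc_s_pcac_fp"]  # example experiment names
steps = "50000"

for name in experiments:
    # starts a newly initialized run with an incremental run ID
    subprocess.run(["python", "execute_experiment.py", name, steps], check=True)

# continue an existing run by passing its run ID and overriding the device
subprocess.run(
    ["python", "execute_experiment.py", "hc_comp_s_pcac_fp", steps,
     "--run_id", "1", "--device", "cpu"],
    check=True,
)
```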
After executing an experiment, the resulting policy can be viewed with the observe.py script as follows:
python observe.py hc_comp_s_pcac_fp 1000
The 1000 here specifies for how many time steps the demonstration runs. The environment is reset if an episode terminates. By default the latest run (highest run ID) of the experiment is shown; a specific run can again be selected with the --run_id option:
python observe.py "hc_comp_s_pcac_fp" 1000 --run_id 1 --no-render
The option --no-render disables rendering. At the end of the time steps, the mean and standard deviation of the return across all episodes are printed.
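To evaluate several runs of the same experiment without rendering, observe.py can likewise be called in a loop. This is only a sketch assuming that runs with IDs 1 to 3 exist; the return statistics are printed by the script itself.

```python
# Illustrative batch evaluation: print the return statistics for several runs
# of one experiment by invoking observe.py without rendering.
import subprocess

experiment = "hc_comp_s_pcac_fp"  # example experiment name
time_steps = "1000"

for run_id in (1, 2, 3):  # assumes these run IDs exist in the experiment folder
    print(f"--- run {run_id} ---")
    subprocess.run(
        ["python", "observe.py", experiment, time_steps,
         "--run_id", str(run_id), "--no-render"],
        check=True,
    )
```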
The following table shows which experiments are contained in this repository and in which figures they (or very similar configurations) have been used.
| Experiment identifier | Figures |
|---|---|
| hc_comp_s_pcac_fp | 14, 17 |
| hc_comp_ss_pcac_fp | 14 |
| hc_nstep_s_pcac_fp | 13, 17 |
| hc_psvf | 2, 17 |
| hc_s_pcac_fe_small | 8 |
| hc_s_pcac_fp | 13, 14, 17 |
| hc_s_pcac_fp_small | 8 |
| hc_s_pcac_ne_small | 8 |
| pen_comp_s_pcac_fp | 14 |
| pen_comp_ss_pcac_fp | 14, 15, 17 |
| pen_ma_comp_ss_pcac_fp | 15, 16 |
| pen_ma_nstep_s_pcac_fp | 15, 16 |
| pen_nstep_s_pcac_fp | 13, 15, 17 |
| pen_pavf | - |
| pen_pssvf | 2, 17 |
| pen_psvf | 2 |
| pen_s_pcac_fe | 8, 13, 14, 15, 16, 17 |
| pen_s_pcac_fp | 8, 11 |
| pen_s_pcac_ne | 8, 12 |
The names are structured as follows: environment_algorithm_embedding_other
Shorthands:
- hc - HalfCheetah-v2
- pen - Pendulum-v0
- s - state
- ss - start state
- comp - Comparing
- pcac - Policy-Conditioned Actor Critic
- ma - Multi Actor
- fp - Network Fingerprinting
- ne - Neuron embedding
- fe - flat embedding
- nstep - N-step
- pavf - Parameter-Based State-Action Value Function
- psvf - Parameter-Based State Value Function
- pssvf - Parameter-Based Start State Value Function
See thesis.pdf for details on individual methods.
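As an illustration of the naming scheme, an experiment identifier can be expanded into readable components with a small helper like the one below. This is not part of the repository; it simply maps the shorthands listed above and keeps unknown tokens (such as small) unchanged.

```python
# Illustrative helper: expand an experiment identifier using the shorthands above.
SHORTHANDS = {
    "hc": "HalfCheetah-v2", "pen": "Pendulum-v0",
    "s": "state", "ss": "start state",
    "comp": "Comparing", "pcac": "Policy-Conditioned Actor Critic",
    "ma": "Multi Actor", "fp": "Network Fingerprinting",
    "ne": "Neuron embedding", "fe": "flat embedding", "nstep": "N-step",
    "pavf": "Parameter-Based State-Action Value Function",
    "psvf": "Parameter-Based State Value Function",
    "pssvf": "Parameter-Based Start State Value Function",
}

def expand(identifier: str) -> str:
    # unknown tokens are returned unchanged
    return " / ".join(SHORTHANDS.get(token, token) for token in identifier.split("_"))

print(expand("hc_comp_s_pcac_fp"))
# HalfCheetah-v2 / Comparing / state / Policy-Conditioned Actor Critic / Network Fingerprinting
```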
The Reinforcement Learning parts of the code, and some other bits, are documented with docstrings in the code. Where an interface exists, the purpose of every function is documented within the interface. Specific subclasses are documented only with their general purpose or where they differ sufficiently from the interface.
A policy trained with the N-step PCAC algorithm in the Pendulum-v0 environment.
A policy trained with the Comparing State PCAC algorithm in the HalfCheetah-v2 environment.