Chuning Zhu, Xinqi Wang, Tyler Han, Simon Shaolei Du, Abhishek Gupta
University of Washington
This is a JAX implementation of Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs). DiSPO is an unsupervised reinforcement learning method that models the distribution of all possible outcomes, where an outcome is represented as a discounted sum of state-dependent cumulants. The outcome model is paired with a readout policy that produces an action to realize a particular outcome. Assuming rewards depend linearly on the cumulants, transferring to a downstream task reduces to performing linear regression to fit the reward weights and solving a simple optimization problem for the best achievable outcome.
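For concreteness, here is a minimal sketch of that transfer step in JAX. The names are illustrative, not the repository's actual API: phis are cumulants of dataset states labeled with the new rewards, and psis are candidate outcomes sampled from the learned outcome model.

import jax.numpy as jnp

def fit_reward_weights(phis, rewards, reg=1e-6):
    # Ridge regression solving r(s) ~= w^T phi(s) via the regularized normal equations.
    A = phis.T @ phis + reg * jnp.eye(phis.shape[-1])
    return jnp.linalg.solve(A, phis.T @ rewards)

def select_best_outcome(psis, w):
    # An outcome psi is a discounted cumulant sum, so its return under the fitted
    # reward is w^T psi; the best achievable outcome maximizes this score.
    return psis[jnp.argmax(psis @ w)]

The selected outcome is then passed to the readout policy, which produces actions that realize it.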
git clone https://github.com/WEIRDLabUW/distributional-sf
cd distributional-sf
pip install -r requirements.txt
To train DiSPOs on D4RL datasets and adapt to the default tasks, run the following commands:
# Antmaze
python train.py --config-name antmaze.yaml env_id=antmaze-umaze-v2 exp_id=benchmark seed=0
python train.py --config-name antmaze.yaml env_id=antmaze-umaze-diverse-v2 exp_id=benchmark seed=0
python train.py --config-name antmaze.yaml env_id=antmaze-medium-diverse-v2 exp_id=benchmark seed=0
python train.py --config-name antmaze.yaml env_id=antmaze-medium-play-v2 exp_id=benchmark seed=0
python train.py --config-name antmaze.yaml env_id=antmaze-large-diverse-v2 exp_id=benchmark seed=0
python train.py --config-name antmaze.yaml env_id=antmaze-large-play-v2 exp_id=benchmark seed=0
# Kitchen
python train.py --config-name kitchen.yaml env_id=kitchen-partial-v0 exp_id=benchmark seed=0
python train.py --config-name kitchen.yaml env_id=kitchen-mixed-v0 exp_id=benchmark seed=0
To adapt DiSPOs to a new downstream reward, relabel the subsampled transitions with the new reward function (e.g. by adding an env wrapper and modifying the dataset class) and run the following command (changing env_id accordingly):
python eval.py --config-name antmaze.yaml env_id=antmaze-medium-diverse-v2 exp_id=benchmark seed=0
This will load the pretrained outcome model and readout policy and perform linear regression to fit the new rewards.
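As a rough sketch of the relabeling step, assuming the offline dataset is stored as a dict of NumPy arrays and the new task reward depends only on the observation; offline_dataset, reward_fn, and the goal below are hypothetical placeholders, not the repository's actual interfaces.

import numpy as np

def relabel_rewards(dataset, reward_fn):
    # Overwrite the stored rewards with the new task reward before adaptation.
    dataset = dict(dataset)
    dataset["rewards"] = np.array([reward_fn(obs) for obs in dataset["observations"]])
    return dataset

# Hypothetical example: sparse reward for reaching a goal location in antmaze.
goal = np.array([20.0, 20.0])
relabeled = relabel_rewards(offline_dataset, lambda obs: float(np.linalg.norm(obs[:2] - goal) < 0.5))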
To run the preference antmaze experiments, install D4RL with the custom antmaze environment from this repository. Then download the accompanying dataset from this link and place it in data/d4rl under the project root directory. Run the following commands to train on each preference mode. Alternatively, train on only one mode and adapt to the other mode using the adaptation script.
# Go Up
python train.py --config-name antmaze.yaml env_id=multimodal-antmaze-0 exp_id=benchmark seed=0 planning.planner=random_shooting
# Go Right
python train.py --config-name antmaze.yaml env_id=multimodal-antmaze-1 exp_id=benchmark seed=0 planning.planner=random_shooting
To run the roboverse experiments, download the roboverse dataset from this link and place the files in data/roboverse under the project root directory. Use one of the following commands to train a DiSPO.
python train.py --config-name roboverse.yaml env_id=roboverse-pickplace-v0 exp_id=benchmark seed=0
python train.py --config-name roboverse.yaml env_id=roboverse-doubledraweropen-v0 exp_id=benchmark seed=0
python train.py --config-name roboverse.yaml env_id=roboverse-doubledrawercloseopen-v0 exp_id=benchmark seed=0
If you find this code useful, please cite:
@article{zhu2024dispo,
author = {Zhu, Chuning and Wang, Xinqi and Han, Tyler and Du, Simon Shaolei and Gupta, Abhishek},
title = {Distributional Successor Features Enable Zero-Shot Policy Optimization},
journal = {arXiv preprint},
year = {2024},
}