Documentation (#6)
* documentation wip

* fix error

* fix bibtex

* fix workflow

* docs assets added

* fix linter

* fix linter

* update readme

* sac-n documentation

* more comments for configs

* update readme

* merge main, update docs benchmark scores

* linter fix

* fix docs typos
Howuhh authored Aug 16, 2023
1 parent 2a30bc2 commit 23f8b2e
Showing 39 changed files with 1,320 additions and 112 deletions.
25 changes: 25 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,25 @@
name: ci
on:
  push:
    branches:
      - main
      - howuhh/docs-wip
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v3
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install mkdocs-material
      - run: mkdocs gh-deploy --force
2 changes: 1 addition & 1 deletion .gitignore
@@ -145,4 +145,4 @@ dmypy.json
.json
.yaml
wandb
assets/
#assets/
13 changes: 11 additions & 2 deletions README.md
@@ -10,16 +10,25 @@
🧵 CORL is an Offline Reinforcement Learning library that provides high-quality and easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. Heavily inspired by [cleanrl](https://github.com/vwxyzjn/cleanrl) for online RL; check them out too!<br/>

* 📜 Single-file implementation
* 📈 Benchmarked Implementation for N algorithms
* 📈 Benchmarked Implementation (11+ offline algorithms, 5+ offline-to-online algorithms, 30+ datasets with detailed logs)
* 🖼 [Weights and Biases](https://wandb.ai/site) integration

You can read more about CORL design and main results in our [technical paper](https://arxiv.org/abs/2210.07105).

----
* ⭐ If you're interested in __discrete control__, make sure to check out our new library — [Katakomba](https://github.com/corl-team/katakomba). It provides both discrete control algorithms augmented with recurrence and an offline RL benchmark for the NetHack Learning Environment.
----

> ⚠️ **NOTE**: CORL (similarly to CleanRL) is not a modular library and therefore it is not meant to be imported.
At the cost of duplicate code, we make all implementation details of an ORL algorithm variant easy
to understand. You should consider using CORL if you want to 1) understand and control all implementation details
of an algorithm or 2) rapidly prototype advanced features that other modular ORL libraries do not support.


## Getting started

Please refer to the [documentation](https://corl-team.github.io/CORL/get-started/install/) for more details. TLDR:

```bash
git clone https://github.com/corl-team/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt
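# example (a sketch, not part of this diff): each algorithm is a single file with a
# pyrallis-parsed config, so fields can be overridden from the CLI, e.g.
# python algorithms/offline/any_percent_bc.py --env=halfcheetah-medium-expert-v2 --seed=0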
@@ -213,7 +222,7 @@ If you use CORL in your work, please use the following bibtex
```bibtex
@inproceedings{
tarasov2022corl,
title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
title={CORL: Research-oriented Deep Offline Reinforcement Learning Library},
author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
year={2022},
48 changes: 31 additions & 17 deletions algorithms/offline/any_percent_bc.py
@@ -19,26 +19,40 @@

@dataclass
class TrainConfig:
    # Experiment
    device: str = "cuda"
    env: str = "halfcheetah-medium-expert-v2"  # OpenAI gym environment name
    seed: int = 0  # Sets Gym, PyTorch and Numpy seeds
    eval_freq: int = int(5e3)  # How often (time steps) we evaluate
    n_episodes: int = 10  # How many episodes run during evaluation
    max_timesteps: int = int(1e6)  # Max time steps to run environment
    checkpoints_path: Optional[str] = None  # Save path
    load_model: str = ""  # Model load file name, "" doesn't load
    batch_size: int = 256  # Batch size for all networks
    discount: float = 0.99  # Discount factor
    # BC
    buffer_size: int = 2_000_000  # Replay buffer size
    frac: float = 0.1  # Best data fraction to use
    max_traj_len: int = 1000  # Max trajectory length
    normalize: bool = True  # Normalize states
    # Wandb logging
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "BC-D4RL"
    # wandb run name
    name: str = "BC"
    # training dataset and evaluation environment
    env: str = "halfcheetah-medium-expert-v2"
    # total gradient updates during training
    max_timesteps: int = int(1e6)
    # training batch size
    batch_size: int = 256
    # maximum size of the replay buffer
    buffer_size: int = 2_000_000
    # what top fraction of the dataset (sorted by return) to use
    frac: float = 0.1
    # maximum possible trajectory length
    max_traj_len: int = 1000
    # whether to normalize states
    normalize: bool = True
    # discount factor
    discount: float = 0.99
    # evaluation frequency, will evaluate every eval_freq training steps
    eval_freq: int = int(5e3)
    # number of episodes to run during evaluation
    n_episodes: int = 10
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # file name for loading a model, optional
    load_model: str = ""
    # training random seed
    seed: int = 0
    # training device
    device: str = "cuda"

    def __post_init__(self):
        self.name = f"{self.name}-{self.env}-{str(uuid.uuid4())[:8]}"
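The commented `TrainConfig` above is consumed by a pyrallis-style entry point, so every field doubles as a CLI flag. Below is a minimal sketch of that pattern (the tiny field subset and the `train` function are illustrative assumptions, not code from this commit):

```python
# Minimal sketch (assumption, not from this commit): a pyrallis dataclass
# config where each field, together with its comment, becomes a CLI flag.
from dataclasses import dataclass
from typing import Optional

import pyrallis


@dataclass
class TrainConfig:
    # training dataset and evaluation environment
    env: str = "halfcheetah-medium-expert-v2"
    # training batch size
    batch_size: int = 256
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None


@pyrallis.wrap()
def train(config: TrainConfig):
    # e.g. `python train.py --env=hopper-medium-v2 --batch_size=512`
    print(config)


if __name__ == "__main__":
    train()
```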
48 changes: 34 additions & 14 deletions algorithms/offline/awac.py
@@ -20,29 +20,49 @@

@dataclass
class TrainConfig:
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "AWAC-D4RL"
    # wandb run name
    name: str = "AWAC"
    checkpoints_path: Optional[str] = None

    # training dataset and evaluation environment
    env_name: str = "halfcheetah-medium-expert-v2"
    seed: int = 42
    test_seed: int = 69
    deterministic_torch: bool = False
    device: str = "cuda"

    buffer_size: int = 2_000_000
    num_train_ops: int = 1_000_000
    batch_size: int = 256
    eval_frequency: int = 1000
    n_test_episodes: int = 10
    normalize_reward: bool = False

    # actor and critic hidden dim
    hidden_dim: int = 256
    # actor and critic learning rate
    learning_rate: float = 3e-4
    # discount factor
    gamma: float = 0.99
    # coefficient for the target critic Polyak's update
    tau: float = 5e-3
    # awac actor loss temperature, controlling balance
    # between behaviour cloning and Q-value maximization
    awac_lambda: float = 1.0
    # total number of gradient updates during training
    num_train_ops: int = 1_000_000
    # training batch size
    batch_size: int = 256
    # maximum size of the replay buffer
    buffer_size: int = 2_000_000
    # whether to normalize reward (like in IQL)
    normalize_reward: bool = False
    # evaluation frequency, will evaluate every eval_frequency
    # training steps
    eval_frequency: int = 1000
    # number of episodes to run during evaluation
    n_test_episodes: int = 10
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # configure PyTorch to use deterministic algorithms instead
    # of nondeterministic ones
    deterministic_torch: bool = False
    # training random seed
    seed: int = 42
    # evaluation random seed
    test_seed: int = 69
    # training device
    device: str = "cuda"

    def __post_init__(self):
        self.name = f"{self.name}-{self.env_name}-{str(uuid.uuid4())[:8]}"
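The `awac_lambda` comment above describes a temperature trading off behaviour cloning against Q-value maximization. A small sketch of the usual AWAC-style advantage-weighted actor loss, under that interpretation (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit) of an AWAC-style actor loss:
# dataset actions are reweighted by exp(advantage / awac_lambda).
import torch


def awac_actor_loss(
    log_prob: torch.Tensor,   # log pi(a|s) for dataset actions
    q_value: torch.Tensor,    # Q(s, a) from the critic
    value: torch.Tensor,      # V(s), e.g. Q(s, pi(s))
    awac_lambda: float = 1.0,
    exp_adv_max: float = 100.0,
) -> torch.Tensor:
    advantage = q_value - value
    # small awac_lambda -> sharp weighting towards high-advantage actions,
    # large awac_lambda -> closer to plain behaviour cloning
    weights = torch.clamp_max(torch.exp(advantage / awac_lambda), exp_adv_max)
    return (-log_prob * weights.detach()).mean()
```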
7 changes: 1 addition & 6 deletions algorithms/offline/cql.py
@@ -23,7 +23,6 @@

@dataclass
class TrainConfig:
    # Experiment
    device: str = "cuda"
    env: str = "halfcheetah-medium-expert-v2"  # OpenAI gym environment name
    seed: int = 0  # Sets Gym, PyTorch and Numpy seeds
@@ -32,8 +31,6 @@ class TrainConfig:
    max_timesteps: int = int(1e6)  # Max time steps to run environment
    checkpoints_path: Optional[str] = None  # Save path
    load_model: str = ""  # Model load file name, "" doesn't load

    # CQL
    buffer_size: int = 2_000_000  # Replay buffer size
    batch_size: int = 256  # Batch size for all networks
    discount: float = 0.99  # Discount factor
@@ -59,9 +56,7 @@ class TrainConfig:
    q_n_hidden_layers: int = 3  # Number of hidden layers in Q networks
    reward_scale: float = 1.0  # Reward scale for normalization
    reward_bias: float = 0.0  # Reward bias for normalization

    # AntMaze hacks
    bc_steps: int = int(0)  # Number of BC steps at start
    bc_steps: int = int(0)  # Number of BC steps at start (AntMaze hacks)
    reward_scale: float = 5.0
    reward_bias: float = -1.0
    policy_log_std_multiplier: float = 1.0
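The `reward_scale` and `reward_bias` fields above (with the AntMaze-specific 5.0 / -1.0 override) describe a simple affine transform of dataset rewards. A sketch of that transform (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit): affine reward shaping controlled
# by reward_scale and reward_bias, e.g. AntMaze's sparse {0, 1} rewards become
# {-1, 4} with reward_scale=5.0 and reward_bias=-1.0.
import numpy as np


def shape_rewards(
    rewards: np.ndarray, reward_scale: float = 1.0, reward_bias: float = 0.0
) -> np.ndarray:
    return rewards * reward_scale + reward_bias
```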
42 changes: 34 additions & 8 deletions algorithms/offline/dt.py
@@ -1,5 +1,5 @@
# inspiration:
# 1. https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/models/decision_transformer.py # noqa
# 1. https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/models/decision_transformer.py
# 2. https://github.com/karpathy/minGPT
import os
import random
@@ -17,44 +17,70 @@
import wandb
from torch.nn import functional as F
from torch.utils.data import DataLoader, IterableDataset
from tqdm.auto import tqdm, trange # noqa
from tqdm.auto import trange

@dataclass
class TrainConfig:
    # wandb params
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "DT-D4RL"
    # wandb run name
    name: str = "DT"
    # model params
    # transformer hidden dim
    embedding_dim: int = 128
    # depth of the transformer model
    num_layers: int = 3
    # number of heads in the attention
    num_heads: int = 1
    # maximum sequence length during training
    seq_len: int = 20
    # maximum rollout length, needed for the positional embeddings
    episode_len: int = 1000
    # attention dropout
    attention_dropout: float = 0.1
    # residual dropout
    residual_dropout: float = 0.1
    # embeddings dropout
    embedding_dropout: float = 0.1
    # maximum range for the symmetric actions, [-1, 1]
    max_action: float = 1.0
    # training params
    # training dataset and evaluation environment
    env_name: str = "halfcheetah-medium-v2"
    # AdamW optimizer learning rate
    learning_rate: float = 1e-4
    # AdamW optimizer betas
    betas: Tuple[float, float] = (0.9, 0.999)
    # AdamW weight decay
    weight_decay: float = 1e-4
    # maximum gradient norm during training, optional
    clip_grad: Optional[float] = 0.25
    # training batch size
    batch_size: int = 64
    # total training steps
    update_steps: int = 100_000
    # warmup steps for the learning rate scheduler
    warmup_steps: int = 10_000
    # reward scaling, to reduce the magnitude
    reward_scale: float = 0.001
    # number of workers for the pytorch dataloader
    num_workers: int = 4
    # evaluation params
    # target return-to-go for the prompting during evaluation
    target_returns: Tuple[float, ...] = (12000.0, 6000.0)
    # number of episodes to run during evaluation
    eval_episodes: int = 100
    # evaluation frequency, will evaluate every eval_every training steps
    eval_every: int = 10_000
    # general params
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # configure PyTorch to use deterministic algorithms instead
    # of nondeterministic ones
    deterministic_torch: bool = False
    # training random seed
    train_seed: int = 10
    # evaluation random seed
    eval_seed: int = 42
    # training device
    device: str = "cuda"

    def __post_init__(self):
@@ -180,7 +206,7 @@ def __prepare_sample(self, traj_idx, start_idx):

        states = (states - self.state_mean) / self.state_std
        returns = returns * self.reward_scale
        # pad up to seq_len if needed
        # pad up to seq_len if needed, padding is masked during training
        mask = np.hstack(
            [np.ones(states.shape[0]), np.zeros(self.seq_len - states.shape[0])]
        )
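The DT config above prompts the model with target returns-to-go scaled by `reward_scale`. A short sketch of that return-to-go computation (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit): returns-to-go for Decision
# Transformer prompting, scaled by reward_scale to reduce their magnitude.
import numpy as np


def returns_to_go(rewards: np.ndarray, reward_scale: float = 0.001) -> np.ndarray:
    # rtg[t] = sum of rewards from step t to the end of the trajectory
    rtg = np.cumsum(rewards[::-1])[::-1].copy()
    return rtg * reward_scale
```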
32 changes: 27 additions & 5 deletions algorithms/offline/edac.py
@@ -21,36 +21,58 @@

@dataclass
class TrainConfig:
    # wandb params
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "EDAC-D4RL"
    # wandb run name
    name: str = "EDAC"
    # model params
    # actor and critic hidden dim
    hidden_dim: int = 256
    # critic ensemble size
    num_critics: int = 10
    # discount factor
    gamma: float = 0.99
    # coefficient for the target critic Polyak's update
    tau: float = 5e-3
    # coefficient for the ensemble diversification loss
    eta: float = 1.0
    # actor learning rate
    actor_learning_rate: float = 3e-4
    # critic learning rate
    critic_learning_rate: float = 3e-4
    # alpha learning rate
    alpha_learning_rate: float = 3e-4
    # maximum range for the symmetric actions, [-1, 1]
    max_action: float = 1.0
    # training params
    # maximum size of the replay buffer
    buffer_size: int = 1_000_000
    # training dataset and evaluation environment
    env_name: str = "halfcheetah-medium-v2"
    # training batch size
    batch_size: int = 256
    # total number of training epochs
    num_epochs: int = 3000
    # number of gradient updates during one epoch
    num_updates_on_epoch: int = 1000
    # whether to normalize reward (like in IQL)
    normalize_reward: bool = False
    # evaluation params
    # number of episodes to run during evaluation
    eval_episodes: int = 10
    # evaluation frequency, will evaluate every eval_every training steps
    eval_every: int = 5
    # general params
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # configure PyTorch to use deterministic algorithms instead
    # of nondeterministic ones
    deterministic_torch: bool = False
    # training random seed
    train_seed: int = 10
    # evaluation random seed
    eval_seed: int = 42
    # frequency of metrics logging to wandb
    log_every: int = 100
    # training device
    device: str = "cpu"

    def __post_init__(self):
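The `tau` field above controls the Polyak (soft) update of the target critic. A minimal sketch of that update (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit): Polyak averaging of target
# network parameters, target <- (1 - tau) * target + tau * online.
import torch


@torch.no_grad()
def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 5e-3) -> None:
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(sp.data, alpha=tau)
```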