Documentation (#6)
* documentation wip

* fix error

* fix bibtex

* fix workflow

* docs assets added

* fix linter

* fix linter

* update readme

* sac-n documentation

* more comments for configs

* update readme

* merge main, update docs benchmark scores

* linter fix

* fix docs typos
Howuhh authored Aug 16, 2023
1 parent 2a30bc2 commit 23f8b2e
Showing 39 changed files with 1,320 additions and 112 deletions.
25 changes: 25 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,25 @@
name: ci
on:
  push:
    branches:
      - main
      - howuhh/docs-wip
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.x
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v3
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: pip install mkdocs-material
      - run: mkdocs gh-deploy --force
2 changes: 1 addition & 1 deletion .gitignore
@@ -145,4 +145,4 @@ dmypy.json
.json
.yaml
wandb
assets/
#assets/
13 changes: 11 additions & 2 deletions README.md
@@ -10,16 +10,25 @@
🧵 CORL is an Offline Reinforcement Learning library that provides high-quality and easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. Heavily inspired by [cleanrl](https://github.com/vwxyzjn/cleanrl) for online RL; check them out too!<br/>

* 📜 Single-file implementation
* 📈 Benchmarked Implementation for N algorithms
* 📈 Benchmarked Implementation (11+ offline algorithms, 5+ offline-to-online algorithms, 30+ datasets with detailed logs)
* 🖼 [Weights and Biases](https://wandb.ai/site) integration

You can read more about CORL design and main results in our [technical paper](https://arxiv.org/abs/2210.07105).

----
* ⭐ If you're interested in __discrete control__, make sure to check out our new library — [Katakomba](https://github.com/corl-team/katakomba). It provides both discrete control algorithms augmented with recurrence and an offline RL benchmark for the NetHack Learning Environment.
----

> ⚠️ **NOTE**: CORL (similarly to CleanRL) is not a modular library and therefore it is not meant to be imported.
At the cost of duplicate code, we make all implementation details of an ORL algorithm variant easy
to understand. You should consider using CORL if you want to 1) understand and control all implementation details
of an algorithm or 2) rapidly prototype advanced features that other modular ORL libraries do not support.


## Getting started

Please refer to the [documentation](https://corl-team.github.io/CORL/get-started/install/) for more details. TLDR:

```bash
git clone https://github.com/corl-team/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt
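# example (a sketch, not part of this diff): each algorithm is a single file with a
# pyrallis-parsed config, so fields can be overridden from the CLI, e.g.
# python algorithms/offline/any_percent_bc.py --env=halfcheetah-medium-expert-v2 --seed=0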
@@ -213,7 +222,7 @@ If you use CORL in your work, please use the following bibtex
```bibtex
@inproceedings{
tarasov2022corl,
title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
title={CORL: Research-oriented Deep Offline Reinforcement Learning Library},
author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
year={2022},
48 changes: 31 additions & 17 deletions algorithms/offline/any_percent_bc.py
@@ -19,26 +19,40 @@

@dataclass
class TrainConfig:
    # Experiment
    device: str = "cuda"
    env: str = "halfcheetah-medium-expert-v2"  # OpenAI gym environment name
    seed: int = 0  # Sets Gym, PyTorch and Numpy seeds
    eval_freq: int = int(5e3)  # How often (time steps) we evaluate
    n_episodes: int = 10  # How many episodes run during evaluation
    max_timesteps: int = int(1e6)  # Max time steps to run environment
    checkpoints_path: Optional[str] = None  # Save path
    load_model: str = ""  # Model load file name, "" doesn't load
    batch_size: int = 256  # Batch size for all networks
    discount: float = 0.99  # Discount factor
    # BC
    buffer_size: int = 2_000_000  # Replay buffer size
    frac: float = 0.1  # Best data fraction to use
    max_traj_len: int = 1000  # Max trajectory length
    normalize: bool = True  # Normalize states
    # Wandb logging
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "BC-D4RL"
    # wandb run name
    name: str = "BC"
    # training dataset and evaluation environment
    env: str = "halfcheetah-medium-expert-v2"
    # total gradient updates during training
    max_timesteps: int = int(1e6)
    # training batch size
    batch_size: int = 256
    # maximum size of the replay buffer
    buffer_size: int = 2_000_000
    # what top fraction of the dataset (sorted by return) to use
    frac: float = 0.1
    # maximum possible trajectory length
    max_traj_len: int = 1000
    # whether to normalize states
    normalize: bool = True
    # discount factor
    discount: float = 0.99
    # evaluation frequency, will evaluate every eval_freq training steps
    eval_freq: int = int(5e3)
    # number of episodes to run during evaluation
    n_episodes: int = 10
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # file name for loading a model, optional
    load_model: str = ""
    # training random seed
    seed: int = 0
    # training device
    device: str = "cuda"

    def __post_init__(self):
        self.name = f"{self.name}-{self.env}-{str(uuid.uuid4())[:8]}"
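The commented `TrainConfig` above is consumed by a pyrallis-style entry point, so every field doubles as a CLI flag. Below is a minimal sketch of that pattern (the tiny field subset and the `train` function are illustrative assumptions, not code from this commit):

```python
# Minimal sketch (assumption, not from this commit): a pyrallis dataclass
# config where each field, together with its comment, becomes a CLI flag.
from dataclasses import dataclass
from typing import Optional

import pyrallis


@dataclass
class TrainConfig:
    # training dataset and evaluation environment
    env: str = "halfcheetah-medium-expert-v2"
    # training batch size
    batch_size: int = 256
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None


@pyrallis.wrap()
def train(config: TrainConfig):
    # e.g. `python train.py --env=hopper-medium-v2 --batch_size=512`
    print(config)


if __name__ == "__main__":
    train()
```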
48 changes: 34 additions & 14 deletions algorithms/offline/awac.py
@@ -20,29 +20,49 @@

@dataclass
class TrainConfig:
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "AWAC-D4RL"
    # wandb run name
    name: str = "AWAC"
    checkpoints_path: Optional[str] = None

    # training dataset and evaluation environment
    env_name: str = "halfcheetah-medium-expert-v2"
    seed: int = 42
    test_seed: int = 69
    deterministic_torch: bool = False
    device: str = "cuda"

    buffer_size: int = 2_000_000
    num_train_ops: int = 1_000_000
    batch_size: int = 256
    eval_frequency: int = 1000
    n_test_episodes: int = 10
    normalize_reward: bool = False

    # actor and critic hidden dim
    hidden_dim: int = 256
    # actor and critic learning rate
    learning_rate: float = 3e-4
    # discount factor
    gamma: float = 0.99
    # coefficient for the target critic Polyak's update
    tau: float = 5e-3
    # awac actor loss temperature, controlling balance
    # between behaviour cloning and Q-value maximization
    awac_lambda: float = 1.0
    # total number of gradient updates during training
    num_train_ops: int = 1_000_000
    # training batch size
    batch_size: int = 256
    # maximum size of the replay buffer
    buffer_size: int = 2_000_000
    # whether to normalize reward (like in IQL)
    normalize_reward: bool = False
    # evaluation frequency, will evaluate every eval_frequency
    # training steps
    eval_frequency: int = 1000
    # number of episodes to run during evaluation
    n_test_episodes: int = 10
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # configure PyTorch to use deterministic algorithms instead
    # of nondeterministic ones
    deterministic_torch: bool = False
    # training random seed
    seed: int = 42
    # evaluation random seed
    test_seed: int = 69
    # training device
    device: str = "cuda"

    def __post_init__(self):
        self.name = f"{self.name}-{self.env_name}-{str(uuid.uuid4())[:8]}"
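The `awac_lambda` comment above describes a temperature trading off behaviour cloning against Q-value maximization. A small sketch of the usual AWAC-style advantage-weighted actor loss, under that interpretation (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit) of an AWAC-style actor loss:
# dataset actions are reweighted by exp(advantage / awac_lambda).
import torch


def awac_actor_loss(
    log_prob: torch.Tensor,   # log pi(a|s) for dataset actions
    q_value: torch.Tensor,    # Q(s, a) from the critic
    value: torch.Tensor,      # V(s), e.g. Q(s, pi(s))
    awac_lambda: float = 1.0,
    exp_adv_max: float = 100.0,
) -> torch.Tensor:
    advantage = q_value - value
    # small awac_lambda -> sharp weighting towards high-advantage actions,
    # large awac_lambda -> closer to plain behaviour cloning
    weights = torch.clamp_max(torch.exp(advantage / awac_lambda), exp_adv_max)
    return (-log_prob * weights.detach()).mean()
```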
7 changes: 1 addition & 6 deletions algorithms/offline/cql.py
@@ -23,7 +23,6 @@

@dataclass
class TrainConfig:
    # Experiment
    device: str = "cuda"
    env: str = "halfcheetah-medium-expert-v2"  # OpenAI gym environment name
    seed: int = 0  # Sets Gym, PyTorch and Numpy seeds
@@ -32,8 +31,6 @@ class TrainConfig:
    max_timesteps: int = int(1e6)  # Max time steps to run environment
    checkpoints_path: Optional[str] = None  # Save path
    load_model: str = ""  # Model load file name, "" doesn't load

    # CQL
    buffer_size: int = 2_000_000  # Replay buffer size
    batch_size: int = 256  # Batch size for all networks
    discount: float = 0.99  # Discount factor
@@ -59,9 +56,7 @@ class TrainConfig:
    q_n_hidden_layers: int = 3  # Number of hidden layers in Q networks
    reward_scale: float = 1.0  # Reward scale for normalization
    reward_bias: float = 0.0  # Reward bias for normalization

    # AntMaze hacks
    bc_steps: int = int(0)  # Number of BC steps at start
    bc_steps: int = int(0)  # Number of BC steps at start (AntMaze hacks)
    reward_scale: float = 5.0
    reward_bias: float = -1.0
    policy_log_std_multiplier: float = 1.0
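The `reward_scale` and `reward_bias` fields above (with the AntMaze-specific 5.0 / -1.0 override) describe a simple affine transform of dataset rewards. A sketch of that transform (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit): affine reward shaping controlled
# by reward_scale and reward_bias, e.g. AntMaze's sparse {0, 1} rewards become
# {-1, 4} with reward_scale=5.0 and reward_bias=-1.0.
import numpy as np


def shape_rewards(
    rewards: np.ndarray, reward_scale: float = 1.0, reward_bias: float = 0.0
) -> np.ndarray:
    return rewards * reward_scale + reward_bias
```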
42 changes: 34 additions & 8 deletions algorithms/offline/dt.py
@@ -1,5 +1,5 @@
# inspiration:
# 1. https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/models/decision_transformer.py # noqa
# 1. https://github.com/kzl/decision-transformer/blob/master/gym/decision_transformer/models/decision_transformer.py
# 2. https://github.com/karpathy/minGPT
import os
import random
@@ -17,44 +17,70 @@
import wandb
from torch.nn import functional as F
from torch.utils.data import DataLoader, IterableDataset
from tqdm.auto import tqdm, trange # noqa
from tqdm.auto import trange

@dataclass
class TrainConfig:
    # wandb params
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "DT-D4RL"
    # wandb run name
    name: str = "DT"
    # model params
    # transformer hidden dim
    embedding_dim: int = 128
    # depth of the transformer model
    num_layers: int = 3
    # number of heads in the attention
    num_heads: int = 1
    # maximum sequence length during training
    seq_len: int = 20
    # maximum rollout length, needed for the positional embeddings
    episode_len: int = 1000
    # attention dropout
    attention_dropout: float = 0.1
    # residual dropout
    residual_dropout: float = 0.1
    # embeddings dropout
    embedding_dropout: float = 0.1
    # maximum range for the symmetric actions, [-1, 1]
    max_action: float = 1.0
    # training params
    # training dataset and evaluation environment
    env_name: str = "halfcheetah-medium-v2"
    # AdamW optimizer learning rate
    learning_rate: float = 1e-4
    # AdamW optimizer betas
    betas: Tuple[float, float] = (0.9, 0.999)
    # AdamW weight decay
    weight_decay: float = 1e-4
    # maximum gradient norm during training, optional
    clip_grad: Optional[float] = 0.25
    # training batch size
    batch_size: int = 64
    # total training steps
    update_steps: int = 100_000
    # warmup steps for the learning rate scheduler
    warmup_steps: int = 10_000
    # reward scaling, to reduce the magnitude
    reward_scale: float = 0.001
    # number of workers for the pytorch dataloader
    num_workers: int = 4
    # evaluation params
    # target return-to-go for the prompting during evaluation
    target_returns: Tuple[float, ...] = (12000.0, 6000.0)
    # number of episodes to run during evaluation
    eval_episodes: int = 100
    # evaluation frequency, will evaluate every eval_every training steps
    eval_every: int = 10_000
    # general params
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # configure PyTorch to use deterministic algorithms instead
    # of nondeterministic ones
    deterministic_torch: bool = False
    # training random seed
    train_seed: int = 10
    # evaluation random seed
    eval_seed: int = 42
    # training device
    device: str = "cuda"

    def __post_init__(self):
@@ -180,7 +206,7 @@ def __prepare_sample(self, traj_idx, start_idx):

        states = (states - self.state_mean) / self.state_std
        returns = returns * self.reward_scale
        # pad up to seq_len if needed
        # pad up to seq_len if needed, padding is masked during training
        mask = np.hstack(
            [np.ones(states.shape[0]), np.zeros(self.seq_len - states.shape[0])]
        )
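The DT config above prompts the model with target returns-to-go scaled by `reward_scale`. A short sketch of that return-to-go computation (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit): returns-to-go for Decision
# Transformer prompting, scaled by reward_scale to reduce their magnitude.
import numpy as np


def returns_to_go(rewards: np.ndarray, reward_scale: float = 0.001) -> np.ndarray:
    # rtg[t] = sum of rewards from step t to the end of the trajectory
    rtg = np.cumsum(rewards[::-1])[::-1].copy()
    return rtg * reward_scale
```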
32 changes: 27 additions & 5 deletions algorithms/offline/edac.py
@@ -21,36 +21,58 @@

@dataclass
class TrainConfig:
    # wandb params
    # wandb project name
    project: str = "CORL"
    # wandb group name
    group: str = "EDAC-D4RL"
    # wandb run name
    name: str = "EDAC"
    # model params
    # actor and critic hidden dim
    hidden_dim: int = 256
    # critic ensemble size
    num_critics: int = 10
    # discount factor
    gamma: float = 0.99
    # coefficient for the target critic Polyak's update
    tau: float = 5e-3
    # coefficient for the ensemble diversification loss
    eta: float = 1.0
    # actor learning rate
    actor_learning_rate: float = 3e-4
    # critic learning rate
    critic_learning_rate: float = 3e-4
    # alpha learning rate
    alpha_learning_rate: float = 3e-4
    # maximum range for the symmetric actions, [-1, 1]
    max_action: float = 1.0
    # training params
    # maximum size of the replay buffer
    buffer_size: int = 1_000_000
    # training dataset and evaluation environment
    env_name: str = "halfcheetah-medium-v2"
    # training batch size
    batch_size: int = 256
    # total number of training epochs
    num_epochs: int = 3000
    # number of gradient updates during one epoch
    num_updates_on_epoch: int = 1000
    # whether to normalize reward (like in IQL)
    normalize_reward: bool = False
    # evaluation params
    # number of episodes to run during evaluation
    eval_episodes: int = 10
    # evaluation frequency, will evaluate every eval_every training steps
    eval_every: int = 5
    # general params
    # path for checkpoints saving, optional
    checkpoints_path: Optional[str] = None
    # configure PyTorch to use deterministic algorithms instead
    # of nondeterministic ones
    deterministic_torch: bool = False
    # training random seed
    train_seed: int = 10
    # evaluation random seed
    eval_seed: int = 42
    # frequency of metrics logging to wandb
    log_every: int = 100
    # training device
    device: str = "cpu"

    def __post_init__(self):
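The `tau` field above controls the Polyak (soft) update of the target critic. A minimal sketch of that update (an illustration, not code from this commit):

```python
# Sketch (assumption, not from this commit): Polyak averaging of target
# network parameters, target <- (1 - tau) * target + tau * online.
import torch


@torch.no_grad()
def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 5e-3) -> None:
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(sp.data, alpha=tau)
```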