Merge pull request #49 from Farama-Foundation/agent-wrappers
Single Agent wrappers (tested on item gathering for now)
rradules authored Mar 15, 2024
2 parents 8e2f1fb + 3d1b76c commit 59484d0
Showing 14 changed files with 534 additions and 234 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -10,6 +10,10 @@ __pycache__/

# Pycharm
/.idea
# Cluster scripts
/hpc
momaland/learning/wandb/
momaland/learning/weights/

# Distribution / packaging
.Python
55 changes: 47 additions & 8 deletions momaland/envs/beach/beach.py
@@ -53,18 +53,57 @@ def raw_env(**kwargs):


class MOBeachDomain(MOParallelEnv):
"""Environment for MO-BeachDomain problem.
The init method takes in environment arguments and should define the following attributes:
- possible_agents
- action_spaces
- observation_spaces
These attributes should not be changed after initialization.
"""A `Parallel` 2-objective environment of the Beach problem domain.
## Observation Space
The observation space is a continuous box of length `5` containing:
- agent type
- section id (where the agent is)
- section capacity
- section consumption
- percentage of agents of the agent's type in the section
Example:
`[a_type, section_id, section_capacity, section_consumption, %_of_a_of_current_type]`
## Action Space
The action space is a Discrete space, where:
- moving left is -1
- moving right is +1
- staying is 0
## Reward Space
The reward space is a 2D vector containing rewards, computed under one of two schemes ('local' or 'global'), for:
- the occupation level
- the mixture level
If the scheme is 'local', the reward is given for the currently occupied section.
If the scheme is 'global', the reward is summed over all sections.
## Starting State
The initial position is a uniform random distribution of agents over the sections. This can be changed via the
'position_distribution' argument. The agent types are also randomly distributed according to the
'type_distribution' argument. The default is a uniform distribution over all types.
## Episode Termination
The episode is terminated if num_timesteps is reached. The default value is 100.
Agents only receive the reward after the last timestep.
## Episode Truncation
Episodes are not truncated; the environment instead terminates once the maximum number of timesteps is reached.
## Arguments
- 'num_timesteps (int)': number of timesteps in the domain. Default: 100
- 'num_agents (int)': number of agents in the domain. Default: 100
- 'reward_scheme (str)': the reward scheme to use ('local', or 'global'). Default: local
- 'sections (int)': number of beach sections in the domain. Default: 6
- 'capacity (int)': capacity of each beach section. Default: 10
- 'type_distribution (tuple)': the distribution of agent types in the domain. Default: 2 types equally distributed (0.5, 0.5).
- 'position_distribution (tuple)': the initial distribution of agents in the domain. Default: uniform over all sections (None).
- 'render_mode (str)': render mode. Default: None
"""

metadata = {"render_modes": ["human"], "name": "mobeach_v0"}

# TODO does this environment require max_cycle?
def __init__(
self,
num_timesteps=10,
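As an aside for readers of this hunk: the new docstring is detailed enough to sketch a full interaction loop. The snippet below is a minimal, untested sketch; the import path and the use of `raw_env` as a constructor are assumptions based on the file layout and the `raw_env(**kwargs)` factory visible above, and the reset/step signatures assume the current PettingZoo Parallel API.

```python
# Minimal sketch (not part of this commit): random agents on MO-BeachDomain.
# Assumption: `raw_env` from momaland/envs/beach/beach.py builds the parallel env
# and follows the PettingZoo Parallel API (reset -> (obs, infos), dict-keyed step).
from momaland.envs.beach.beach import raw_env

env = raw_env(
    num_agents=50,          # docstring default: 100
    sections=6,             # docstring default: 6
    capacity=10,            # docstring default: 10
    reward_scheme="local",  # 'local' or 'global'
)

observations, infos = env.reset(seed=42)
while env.agents:
    # Discrete action space: stay, or move to the neighbouring section left/right.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    # Each value in `rewards` is a length-2 vector (occupation level, mixture level);
    # per the docstring it only becomes non-zero after the final timestep.
```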
99 changes: 46 additions & 53 deletions momaland/envs/breakthrough/breakthrough.py
@@ -10,58 +10,6 @@
| Observation Shape | (board_height=8, board_width=8, 2) |
| Observation Values | [0,1] |
| Reward Shape | (num_objectives=4,) |
MO-Breakthrough is a multi-objective variant of the two-player, single-objective turn-based board game Breakthrough.
In Breakthrough, players start with two rows of identical pieces in front of them, on an 8x8 board, and try to reach
the opponent's home row with any piece. The first player to move a piece on their opponent's home row wins. Players
move alternatingly, and each piece can move one square straight forward or diagonally forward. Opponent pieces can also
be captured, but only by moving diagonally forward, not straight.
MO-Breakthrough extends this game with up to three additional objectives: a second objective that incentivizes faster
wins, a third one for capturing opponent pieces, and a fourth one for avoiding the capture of the agent's own pieces.
Additionally, the board width can be modified from 3 to 20 squares, and the board height from 5 to 20 squares.
### Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's pieces, and each location in the grid represents the placement of the
corresponding agent's piece. 1 indicates that the agent has a piece placed in the given location, and 0 indicates they
do not have a piece in that location (meaning that either the cell is empty, or the other agent has a piece in that
location).
#### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
### Action Space
The action space is the set of integers from 0 to board_width*board_height*3 (exclusive). If a piece at coordinates
(x,y) is moved, this is encoded as the integer x * 3 * board_height + y * 3 + z where z == 0 for left diagonal, 1 for
straight, and 2 for right diagonal move.
### Rewards
Dimension 0: If an agent moves one of their pieces to the opponent's home row, they will be rewarded 1 point. At the
same time, the opponent agent will be awarded -1 point. There are no draws in Breakthrough.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/max_moves) to incentivize faster wins. The losing
opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2: (optional) The number of opponent pieces (divided by the max number of pieces) an agent has captured.
Dimension 3: (optional) The negative number of pieces (divided by the max number of pieces)
an agent has lost to the opponent.
### Version History
"""
from __future__ import annotations

@@ -107,7 +55,52 @@ def raw_env(**kwargs):


class MOBreakthrough(MOAECEnv):
"""Multi-objective Breakthrough."""
"""Multi-objective Breakthrough.
MO-Breakthrough is a multi-objective variant of the two-player, single-objective turn-based board game Breakthrough.
In Breakthrough, players start with two rows of identical pieces in front of them, on an 8x8 board, and try to reach
the opponent's home row with any piece. The first player to move a piece on their opponent's home row wins. Players
move alternatingly, and each piece can move one square straight forward or diagonally forward. Opponent pieces can also
be captured, but only by moving diagonally forward, not straight.
MO-Breakthrough extends this game with up to three additional objectives: a second objective that incentivizes faster
wins, a third one for capturing opponent pieces, and a fourth one for avoiding the capture of the agent's own pieces.
Additionally, the board width can be modified from 3 to 20 squares, and the board height from 5 to 20 squares.
## Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's pieces, and each location in the grid represents the placement of the
corresponding agent's piece. 1 indicates that the agent has a piece placed in the given location, and 0 indicates they
do not have a piece in that location (meaning that either the cell is empty, or the other agent has a piece in that
location).
### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
## Action Space
The action space is the set of integers from 0 to board_width*board_height*3 (exclusive). If a piece at coordinates
(x,y) is moved, this is encoded as the integer x * 3 * board_height + y * 3 + z where z == 0 for left diagonal, 1 for
straight, and 2 for right diagonal move.
## Rewards
Dimension 0: If an agent moves one of their pieces to the opponent's home row, they will be rewarded 1 point. At the
same time, the opponent agent will be awarded -1 point. There are no draws in Breakthrough.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/max_moves) to incentivize faster wins. The losing
opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2: (optional) The number of opponent pieces (divided by the max number of pieces) an agent has captured.
Dimension 3: (optional) The negative number of pieces (divided by the max number of pieces)
an agent has lost to the opponent.
## Version History
"""

metadata = {
"render_modes": ["ansi"],
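The move encoding described in the relocated docstring (`x * 3 * board_height + y * 3 + z`) and the action-mask handling are easy to get wrong, so here is a small sketch. The import path and the constructor call are assumptions inferred from the file layout and the `raw_env(**kwargs)` factory above; the interaction loop assumes the standard PettingZoo AEC API.

```python
# Sketch (not part of this commit): the documented move encoding, plus a
# random-legal-move loop driven by the action mask.
import numpy as np

from momaland.envs.breakthrough.breakthrough import raw_env  # assumed import path


def encode_move(x: int, y: int, z: int, board_height: int) -> int:
    """Encode moving the piece at (x, y); z = 0 left diagonal, 1 straight, 2 right diagonal."""
    return x * 3 * board_height + y * 3 + z


# On the default 8x8 board, moving the piece at (2, 5) straight forward:
assert encode_move(2, 5, 1, board_height=8) == 64

env = raw_env()  # defaults per the summary table above (8x8 board)
env.reset(seed=42)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None
    else:
        # Only indices set to 1 in the mask are legal for the agent whose turn it is.
        legal_moves = np.flatnonzero(observation["action_mask"])
        action = int(np.random.choice(legal_moves))
    env.step(action)
```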
96 changes: 44 additions & 52 deletions momaland/envs/connect4/connect4.py
@@ -10,57 +10,6 @@
| Observation Shape | (board_height=6, board_width=7, 2) |
| Observation Values | [0,1] |
| Reward Shape | (2,) or (2+board_width,) |
MO-Connect4 is a multi-objective variant of the two-player, single-objective turn-based board game Connect 4.
In Connect 4, players can win by connecting four of their tokens vertically, horizontally or diagonally. The players
drop their respective token in a column of a standing board (of width 7 and height 6 by default), where each token will
fall until it reaches the bottom of the column or lands on top of an existing token.
Players cannot place a token in a full column, and the game ends when either a player has made a sequence of 4 tokens,
or when all columns have been filled (draw).
MO-Connect4 extends this game with a second objective that incentivizes faster wins, and optionally the additional
(conflicting) objectives of having more tokens than the opponent in every column. Additionally, width and height of the
board can be set to values from 4 to 20.
### Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's tokens, and each location in the grid represents the placement of the
corresponding agent's token. 1 indicates that the agent has a token placed in the given location, and 0 indicates they
do not have a token in that location (meaning that either the cell is empty, or the other agent has a token in that
location).
#### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
### Action Space
The action space is the set of integers from 0 to board_width (exclusive), where the number represents which column
a token should be dropped in.
### Rewards
Dimension 0: If an agent successfully connects four of their tokens, they will be rewarded 1 point. At the same time,
the opponent agent will be awarded -1 point. If the game ends in a draw, both players are rewarded 0.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/board_size) to incentivize faster wins. The losing opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2 to board_width+1 (default 8): (optional) If at game end, an agent has more tokens than their opponent in
column X, they will be rewarded 1 point in reward dimension 2+X. The opponent agent will be rewarded -1 point. If the
column has an equal number of tokens from both players, both players are rewarded 0.
### Version History
"""
from __future__ import annotations

@@ -120,7 +69,50 @@ def raw_env(**kwargs):


class MOConnect4(MOAECEnv, EzPickle):
"""Multi-objective Connect Four."""
"""Multi-objective Connect Four.
MO-Connect4 is a multi-objective variant of the two-player, single-objective turn-based board game Connect 4.
In Connect 4, players can win by connecting four of their tokens vertically, horizontally or diagonally. The players
drop their respective token in a column of a standing board (of width 7 and height 6 by default), where each token will
fall until it reaches the bottom of the column or lands on top of an existing token.
Players cannot place a token in a full column, and the game ends when either a player has made a sequence of 4 tokens,
or when all columns have been filled (draw).
MO-Connect4 extends this game with a second objective that incentivizes faster wins, and optionally the additional
(conflicting) objectives of having more tokens than the opponent in every column. Additionally, width and height of the
board can be set to values from 4 to 20.
## Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's tokens, and each location in the grid represents the placement of the
corresponding agent's token. 1 indicates that the agent has a token placed in the given location, and 0 indicates they
do not have a token in that location (meaning that either the cell is empty, or the other agent has a token in that
location).
### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
## Action Space
The action space is the set of integers from 0 to board_width (exclusive), where the number represents which column
a token should be dropped in.
## Rewards
Dimension 0: If an agent successfully connects four of their tokens, they will be rewarded 1 point. At the same time,
the opponent agent will be awarded -1 point. If the game ends in a draw, both players are rewarded 0.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/board_size) to incentivize faster wins. The losing opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2 to board_width+1 (default 8): (optional) If at game end, an agent has more tokens than their opponent in
column X, they will be rewarded 1 point in reward dimension 2+X. The opponent agent will be rewarded -1 point. If the
column has an equal number of tokens from both players, both players are rewarded 0.
## Version History
"""

metadata = {
"render_modes": ["human", "rgb_array"],
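For completeness, the same pattern applied to MO-Connect4, where an action is simply a column index and the mask rules out full columns. As above, the import path is inferred from the file layout and the loop assumes the PettingZoo AEC API; treat it as an illustrative sketch rather than documented usage.

```python
# Sketch (not part of this commit): dropping tokens into random legal columns.
import numpy as np

from momaland.envs.connect4.connect4 import raw_env  # assumed import path

env = raw_env()  # defaults per the summary table above (board_height=6, board_width=7)
env.reset(seed=42)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        env.step(None)
        continue
    # Actions are column indices 0 .. board_width-1; full columns are masked out.
    legal_columns = np.flatnonzero(observation["action_mask"])
    env.step(int(np.random.choice(legal_columns)))
```

With the optional column objectives enabled, the reward vector grows from shape `(2,)` to `(2 + board_width,)`, one extra dimension per column, exactly as the relocated docstring describes.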