Merge pull request #49 from Farama-Foundation/agent-wrappers
Single Agent wrappers (tested on item gathering for now)
rradules authored Mar 15, 2024
2 parents 8e2f1fb + 3d1b76c commit 59484d0
Showing 14 changed files with 534 additions and 234 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -10,6 +10,10 @@ __pycache__/

# Pycharm
/.idea
# Cluster scripts
/hpc
momaland/learning/wandb/
momaland/learning/weights/

# Distribution / packaging
.Python
55 changes: 47 additions & 8 deletions momaland/envs/beach/beach.py
@@ -53,18 +53,57 @@ def raw_env(**kwargs):


class MOBeachDomain(MOParallelEnv):
"""Environment for MO-BeachDomain problem.
The init method takes in environment arguments and should define the following attributes:
- possible_agents
- action_spaces
- observation_spaces
These attributes should not be changed after initialization.
"""A `Parallel` 2-objective environment of the Beach problem domain.
## Observation Space
The observation space is a continuous box of length `5` containing:
- agent type
- section id (where the agent is)
- section capacity
- section consumption
- percentage of agents of the agent's type in the section
Example:
`[a_type, section_id, section_capacity, section_consumption, %_of_a_of_current_type]`
## Action Space
The action space is a Discrete space, where:
- moving left is -1
- moving right is +1
- staying is 0
## Reward Space
The reward space is a 2D vector containing rewards, computed under one of two schemes ('local' or 'global'), for:
- the occupation level
- the mixture level
If the scheme is 'local', the reward is given for the currently occupied section.
If the scheme is 'global', the reward is summed over all sections.
## Starting State
The initial position is a uniform random distribution of agents over the sections. This can be changed via the
'position_distribution' argument. The agent types are also randomly distributed according to the
'type_distribution' argument. The default is a uniform distribution over all types.
## Episode Termination
The episode is terminated if num_timesteps is reached. The default value is 100.
Agents only receive the reward after the last timestep.
## Episode Truncation
Episodes are not truncated; the environment instead terminates once the maximum number of timesteps is reached.
## Arguments
- 'num_timesteps (int)': number of timesteps in the domain. Default: 100
- 'num_agents (int)': number of agents in the domain. Default: 100
- 'reward_scheme (str)': the reward scheme to use ('local', or 'global'). Default: local
- 'sections (int)': number of beach sections in the domain. Default: 6
- 'capacity (int)': capacity of each beach section. Default: 10
- 'type_distribution (tuple)': the distribution of agent types in the domain. Default: 2 types equally distributed (0.5, 0.5).
- 'position_distribution (tuple)': the initial distribution of agents in the domain. Default: uniform over all sections (None).
- 'render_mode (str)': render mode. Default: None
"""

metadata = {"render_modes": ["human"], "name": "mobeach_v0"}

# TODO does this environment require max_cycle?
def __init__(
self,
num_timesteps=10,
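As an aside for readers of this hunk: the new docstring is detailed enough to sketch a full interaction loop. The snippet below is a minimal, untested sketch; the import path and the use of `raw_env` as a constructor are assumptions based on the file layout and the `raw_env(**kwargs)` factory visible above, and the reset/step signatures assume the current PettingZoo Parallel API.

```python
# Minimal sketch (not part of this commit): random agents on MO-BeachDomain.
# Assumption: `raw_env` from momaland/envs/beach/beach.py builds the parallel env
# and follows the PettingZoo Parallel API (reset -> (obs, infos), dict-keyed step).
from momaland.envs.beach.beach import raw_env

env = raw_env(
    num_agents=50,          # docstring default: 100
    sections=6,             # docstring default: 6
    capacity=10,            # docstring default: 10
    reward_scheme="local",  # 'local' or 'global'
)

observations, infos = env.reset(seed=42)
while env.agents:
    # Discrete action space: stay, or move to the neighbouring section left/right.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    # Each value in `rewards` is a length-2 vector (occupation level, mixture level);
    # per the docstring it only becomes non-zero after the final timestep.
```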
99 changes: 46 additions & 53 deletions momaland/envs/breakthrough/breakthrough.py
@@ -10,58 +10,6 @@
| Observation Shape | (board_height=8, board_width=8, 2) |
| Observation Values | [0,1] |
| Reward Shape | (num_objectives=4,) |
MO-Breakthrough is a multi-objective variant of the two-player, single-objective turn-based board game Breakthrough.
In Breakthrough, players start with two rows of identical pieces in front of them, on an 8x8 board, and try to reach
the opponent's home row with any piece. The first player to move a piece on their opponent's home row wins. Players
move alternatingly, and each piece can move one square straight forward or diagonally forward. Opponent pieces can also
be captured, but only by moving diagonally forward, not straight.
MO-Breakthrough extends this game with up to three additional objectives: a second objective that incentivizes faster
wins, a third one for capturing opponent pieces, and a fourth one for avoiding the capture of the agent's own pieces.
Additionally, the board width can be modified from 3 to 20 squares, and the board height from 5 to 20 squares.
### Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's pieces, and each location in the grid represents the placement of the
corresponding agent's piece. 1 indicates that the agent has a piece placed in the given location, and 0 indicates they
do not have a piece in that location (meaning that either the cell is empty, or the other agent has a piece in that
location).
#### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
### Action Space
The action space is the set of integers from 0 to board_width*board_height*3 (exclusive). If a piece at coordinates
(x,y) is moved, this is encoded as the integer x * 3 * board_height + y * 3 + z where z == 0 for left diagonal, 1 for
straight, and 2 for right diagonal move.
### Rewards
Dimension 0: If an agent moves one of their pieces to the opponent's home row, they will be rewarded 1 point. At the
same time, the opponent agent will be awarded -1 point. There are no draws in Breakthrough.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/max_moves) to incentivize faster wins. The losing
opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2: (optional) The number of opponent pieces (divided by the max number of pieces) an agent has captured.
Dimension 3: (optional) The negative number of pieces (divided by the max number of pieces)
an agent has lost to the opponent.
### Version History
"""
from __future__ import annotations

@@ -107,7 +55,52 @@ def raw_env(**kwargs):


class MOBreakthrough(MOAECEnv):
"""Multi-objective Breakthrough."""
"""Multi-objective Breakthrough.
MO-Breakthrough is a multi-objective variant of the two-player, single-objective turn-based board game Breakthrough.
In Breakthrough, players start with two rows of identical pieces in front of them, on an 8x8 board, and try to reach
the opponent's home row with any piece. The first player to move a piece on their opponent's home row wins. Players
move alternatingly, and each piece can move one square straight forward or diagonally forward. Opponent pieces can also
be captured, but only by moving diagonally forward, not straight.
MO-Breakthrough extends this game with up to three additional objectives: a second objective that incentivizes faster
wins, a third one for capturing opponent pieces, and a fourth one for avoiding the capture of the agent's own pieces.
Additionally, the board width can be modified from 3 to 20 squares, and the board height from 5 to 20 squares.
## Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's pieces, and each location in the grid represents the placement of the
corresponding agent's piece. 1 indicates that the agent has a piece placed in the given location, and 0 indicates they
do not have a piece in that location (meaning that either the cell is empty, or the other agent has a piece in that
location).
### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
## Action Space
The action space is the set of integers from 0 to board_width*board_height*3 (exclusive). If a piece at coordinates
(x,y) is moved, this is encoded as the integer x * 3 * board_height + y * 3 + z where z == 0 for left diagonal, 1 for
straight, and 2 for right diagonal move.
## Rewards
Dimension 0: If an agent moves one of their pieces to the opponent's home row, they will be rewarded 1 point. At the
same time, the opponent agent will be awarded -1 point. There are no draws in Breakthrough.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/max_moves) to incentivize faster wins. The losing
opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2: (optional) The number of opponent pieces (divided by the max number of pieces) an agent has captured.
Dimension 3: (optional) The negative number of pieces (divided by the max number of pieces)
an agent has lost to the opponent.
## Version History
"""

metadata = {
"render_modes": ["ansi"],
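The move encoding described in the relocated docstring (`x * 3 * board_height + y * 3 + z`) and the action-mask handling are easy to get wrong, so here is a small sketch. The import path and the constructor call are assumptions inferred from the file layout and the `raw_env(**kwargs)` factory above; the interaction loop assumes the standard PettingZoo AEC API.

```python
# Sketch (not part of this commit): the documented move encoding, plus a
# random-legal-move loop driven by the action mask.
import numpy as np

from momaland.envs.breakthrough.breakthrough import raw_env  # assumed import path


def encode_move(x: int, y: int, z: int, board_height: int) -> int:
    """Encode moving the piece at (x, y); z = 0 left diagonal, 1 straight, 2 right diagonal."""
    return x * 3 * board_height + y * 3 + z


# On the default 8x8 board, moving the piece at (2, 5) straight forward:
assert encode_move(2, 5, 1, board_height=8) == 64

env = raw_env()  # defaults per the summary table above (8x8 board)
env.reset(seed=42)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None
    else:
        # Only indices set to 1 in the mask are legal for the agent whose turn it is.
        legal_moves = np.flatnonzero(observation["action_mask"])
        action = int(np.random.choice(legal_moves))
    env.step(action)
```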
96 changes: 44 additions & 52 deletions momaland/envs/connect4/connect4.py
@@ -10,57 +10,6 @@
| Observation Shape | (board_height=6, board_width=7, 2) |
| Observation Values | [0,1] |
| Reward Shape | (2,) or (2+board_width,) |
MO-Connect4 is a multi-objective variant of the two-player, single-objective turn-based board game Connect 4.
In Connect 4, players can win by connecting four of their tokens vertically, horizontally or diagonally. The players
drop their respective token in a column of a standing board (of width 7 and height 6 by default), where each token will
fall until it reaches the bottom of the column or lands on top of an existing token.
Players cannot place a token in a full column, and the game ends when either a player has made a sequence of 4 tokens,
or when all columns have been filled (draw).
MO-Connect4 extends this game with a second objective that incentivizes faster wins, and optionally the additional
(conflicting) objectives of having more tokens than the opponent in every column. Additionally, width and height of the
board can be set to values from 4 to 20.
### Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's tokens, and each location in the grid represents the placement of the
corresponding agent's token. 1 indicates that the agent has a token placed in the given location, and 0 indicates they
do not have a token in that location (meaning that either the cell is empty, or the other agent has a token in that
location).
#### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
### Action Space
The action space is the set of integers from 0 to board_width (exclusive), where the number represents which column
a token should be dropped in.
### Rewards
Dimension 0: If an agent successfully connects four of their tokens, they will be rewarded 1 point. At the same time,
the opponent agent will be awarded -1 point. If the game ends in a draw, both players are rewarded 0.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/board_size) to incentivize faster wins. The losing opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2 to board_width+1 (default 8): (optional) If at game end, an agent has more tokens than their opponent in
column X, they will be rewarded 1 point in reward dimension 2+X. The opponent agent will be rewarded -1 point. If the
column has an equal number of tokens from both players, both players are rewarded 0.
### Version History
"""
from __future__ import annotations

@@ -120,7 +69,50 @@ def raw_env(**kwargs):


class MOConnect4(MOAECEnv, EzPickle):
"""Multi-objective Connect Four."""
"""Multi-objective Connect Four.
MO-Connect4 is a multi-objective variant of the two-player, single-objective turn-based board game Connect 4.
In Connect 4, players can win by connecting four of their tokens vertically, horizontally or diagonally. The players
drop their respective token in a column of a standing board (of width 7 and height 6 by default), where each token will
fall until it reaches the bottom of the column or lands on top of an existing token.
Players cannot place a token in a full column, and the game ends when either a player has made a sequence of 4 tokens,
or when all columns have been filled (draw).
MO-Connect4 extends this game with a second objective that incentivizes faster wins, and optionally the additional
(conflicting) objectives of having more tokens than the opponent in every column. Additionally, width and height of the
board can be set to values from 4 to 20.
## Observation Space
The observation is a dictionary which contains an `'observation'` element which is the usual RL observation described
below, and an `'action_mask'` which holds the legal moves, described in the Legal Actions Mask section below.
The main observation space is 2 planes of a board_height * board_width grid (a board_height * board_width * 2 tensor).
Each plane represents a specific agent's tokens, and each location in the grid represents the placement of the
corresponding agent's token. 1 indicates that the agent has a token placed in the given location, and 0 indicates they
do not have a token in that location (meaning that either the cell is empty, or the other agent has a token in that
location).
### Legal Actions Mask
The legal moves available to the current agent are found in the `action_mask` element of the dictionary observation.
The `action_mask` is a binary vector where each index of the vector represents whether the represented action is legal
or not; the action encoding is described in the Action Space section below.
The `action_mask` will be all zeros for any agent except the one whose turn it is. Taking an illegal action ends the
game with a reward of -1 for the illegally moving agent and a reward of 0 for all other agents. #TODO this isn't happening anymore because of missing TerminateIllegalWrapper
## Action Space
The action space is the set of integers from 0 to board_width (exclusive), where the number represents which column
a token should be dropped in.
## Rewards
Dimension 0: If an agent successfully connects four of their tokens, they will be rewarded 1 point. At the same time,
the opponent agent will be awarded -1 point. If the game ends in a draw, both players are rewarded 0.
Dimension 1: If an agent wins, they get a reward of 1-(move_count/board_size) to incentivize faster wins. The losing opponent gets the negated reward. In case of a draw, both agents get 0.
Dimension 2 to board_width+1 (default 8): (optional) If at game end, an agent has more tokens than their opponent in
column X, they will be rewarded 1 point in reward dimension 2+X. The opponent agent will be rewarded -1 point. If the
column has an equal number of tokens from both players, both players are rewarded 0.
## Version History
"""

metadata = {
"render_modes": ["human", "rgb_array"],
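For completeness, the same pattern applied to MO-Connect4, where an action is simply a column index and the mask rules out full columns. As above, the import path is inferred from the file layout and the loop assumes the PettingZoo AEC API; treat it as an illustrative sketch rather than documented usage.

```python
# Sketch (not part of this commit): dropping tokens into random legal columns.
import numpy as np

from momaland.envs.connect4.connect4 import raw_env  # assumed import path

env = raw_env()  # defaults per the summary table above (board_height=6, board_width=7)
env.reset(seed=42)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        env.step(None)
        continue
    # Actions are column indices 0 .. board_width-1; full columns are masked out.
    legal_columns = np.flatnonzero(observation["action_mask"])
    env.step(int(np.random.choice(legal_columns)))
```

With the optional column objectives enabled, the reward vector grows from shape `(2,)` to `(2 + board_width,)`, one extra dimension per column, exactly as the relocated docstring describes.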