Greedy and EpsilonGreedy strategies, using Multi-armed Bandits algorithms #1438
base: dev
Conversation
…perclass of EpsilonGreedy, and implemented option to weigh reward based on recency
…) so that it correctly calls update_rewards() through the parent method; all previous tests passed.
Thanks for the contribution, looks interesting! Most of the feedback is just on matching style and improving the comments.
axelrod/strategies/armed_bandits.py (outdated)
        "manipulates_state": False,
    }

UNIFORM = np.inf  # constant that replaces weight when rewards aren't weighted
Is there another conceivable value?
I've changed this to -1.0 and updated the other places in the code that refer to this constant for consistency. This does mean that if a user passes recency_weight=-1.0 at creation time, it will be treated as not recency weighted (instead of being clamped to 0.0 as an out-of-range value, as in the previous implementation).
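For illustration, a sentinel like -1.0 works here because a negative number can never be a valid recency weight. A minimal sketch of the pattern, with hypothetical names (`effective_step_size` is not from the PR):

```python
UNIFORM = -1.0  # sentinel: rewards are averaged uniformly, not recency weighted

def effective_step_size(recency_weight, n):
    """Step size for the n-th reward observation (n >= 1).

    With the UNIFORM sentinel, the estimate is a plain running average
    (step 1/n); otherwise the constant weight makes recent rewards
    count more.
    """
    if recency_weight == UNIFORM:
        return 1.0 / n       # sample-average update
    return recency_weight    # constant-step, recency-weighted update
```

With this convention, a user-supplied weight of exactly -1.0 silently selects the unweighted behaviour, which is the trade-off discussed above.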
axelrod/strategies/armed_bandits.py (outdated)

class EpsilonGreedy(Greedy):
    """
    Has a 1 - epsilon probability of behaving like Greedy(), and plays randomly otherwise.
Greedy() --> Greedy
Can you elaborate more on "plays randomly otherwise"?
Changed to "Has a 1 - epsilon probability of behaving like Greedy; otherwise, randomly choose to cooperate or defect."
Looks like we broke the test invocator with some recent commits, I'll try to fix it. You'll need to update one of the doc tests to indicate that two new strategies have been added.
…dated doc test, and changed value of UNIFORM constant to -1.
Thanks for the updates. The test that's failing is:
This is happening because there are some ensemble strategies and the behavior of one of them has changed with the addition of these new strategies. You can run these tests with something like
I think in this case you just need to update the expected output that has changed now.
Hi @bing-j, if you rebase onto the dev branch now the failing test should pass.
Hello! I wrote some strategies that use armed bandit algorithms. Originally, I only wanted to implement the epsilon-greedy strategy, but I now plan on extending this effort and implementing all the algorithms mentioned in the multi-armed bandit chapter of Sutton's Reinforcement Learning: An Introduction (I added the reference to the bibliography). So the branch name is no longer very representative; this branch adds both Greedy and EpsilonGreedy.
Greedy:
Always chooses the action that has the highest average/expected "reward" (score), calculated from its own previous turns. The reward function is updated incrementally and optionally recency weighted, and initial expected rewards of each action default to zero if not modified through a parameter.
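The incremental update described above is the standard bandit form Q_{n+1} = Q_n + step * (R_n - Q_n), where step is 1/n for a running average or a constant for recency weighting. A minimal sketch of that update (the `update_reward` helper is hypothetical, not the PR's actual code):

```python
def update_reward(q, reward, n, recency_weight=None):
    """Incrementally update an expected-reward estimate q.

    Q_{n+1} = Q_n + step * (R_n - Q_n): with step = 1/n this is the
    running average; with a constant step in (0, 1] recent rewards
    dominate older ones.
    """
    step = recency_weight if recency_weight is not None else 1.0 / n
    return q + step * (reward - q)

# Running average of the rewards 3, 1, 2 (initial estimate 0):
q = 0.0
for n, r in enumerate([3.0, 1.0, 2.0], start=1):
    q = update_reward(q, r, n)
# q is now 2.0, the plain mean
```

The attraction of this form is that only the current estimate and a counter need to be stored, never the full reward history.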
EpsilonGreedy:
Behaves like Greedy with probability 1 - epsilon, and acts randomly with probability epsilon.
These strategies are described in detail in the textbook mentioned above as well.
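For illustration, the epsilon-greedy action selection can be sketched as follows (the `epsilon_greedy_action` helper and its signature are hypothetical, not the PR's implementation):

```python
import random

def epsilon_greedy_action(expected, epsilon, rng=random):
    """Pick the highest-expected-reward action with probability
    1 - epsilon (exploit); otherwise pick uniformly at random
    among all actions (explore)."""
    actions = list(expected)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=expected.get)    # exploit

# With epsilon=0 the choice is always greedy:
# epsilon_greedy_action({"C": 2.5, "D": 1.0}, epsilon=0.0) returns "C"
```

Setting epsilon to 0 recovers plain Greedy, which is why implementing EpsilonGreedy as the parent class (or here, as a subclass of Greedy) keeps the two strategies consistent.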
As I've mentioned on Gitter, I was unable to find any existing strategies that implement these algorithms, although I did find some similar ones. For example, Adaptive() works similarly to Greedy() without weights, but it has a hard-coded initial sequence and uses the raw sum of scores, rather than the average score, to choose the optimal play. (Side note: the comments in Adaptive().strategy() indicate that it was intended to use the highest average; this may be a bug!) If similar strategies already exist, and/or there are any modifications I need to make to the code, please let me know!
Cheers :)