Greedy and EpsilonGreedy strategies, using Multi-armed Bandits algorithms #1438
base: dev
Conversation
…perclass of EpsilonGreedy, and implemented option to weigh reward based on recency
…) so that it correctly calls update_rewards() through the parent method; all previous tests passed.
Thanks for the contribution, looks interesting! Most of the feedback is just on matching style and improving the comments.
axelrod/strategies/armed_bandits.py (outdated)
        "manipulates_state": False,
    }

UNIFORM = np.inf  # constant that replaces weight when rewards aren't weighted
Is there another conceivable value?
I've changed this to -1.0 and updated the other places in the code that refer to this constant for consistency. This does mean that if a user passes recency_weight=-1.0 at creation time, it will be treated as not recency weighted (instead of being clamped to 0.0 as an out-of-range value, as in the previous implementation).
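For illustration, a sentinel like -1.0 works here because a negative number can never be a valid recency weight. A minimal sketch of the pattern, with hypothetical names (`effective_step_size` is not from the PR):

```python
UNIFORM = -1.0  # sentinel: rewards are averaged uniformly, not recency weighted

def effective_step_size(recency_weight, n):
    """Step size for the n-th reward observation (n >= 1).

    With the UNIFORM sentinel, the estimate is a plain running average
    (step 1/n); otherwise the constant weight makes recent rewards
    count more.
    """
    if recency_weight == UNIFORM:
        return 1.0 / n       # sample-average update
    return recency_weight    # constant-step, recency-weighted update
```

With this convention, a user-supplied weight of exactly -1.0 silently selects the unweighted behaviour, which is the trade-off discussed above.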
axelrod/strategies/armed_bandits.py (outdated)

class EpsilonGreedy(Greedy):
    """
    Has a 1 - epsilon probability of behaving like Greedy(), and plays randomly otherwise.
Greedy() --> Greedy
Can you elaborate more on "plays randomly otherwise"?
Changed to "Has a 1 - epsilon probability of behaving like Greedy; otherwise, randomly choose to cooperate or defect."
Looks like we broke the test invocator with some recent commits, I'll try to fix it. You'll need to update one of the doc tests to indicate that two new strategies have been added.
…dated doc test, and changed value of UNIFORM constant to -1.
Thanks for the updates. The test that's failing is:
This is happening because there are some ensemble strategies and the behavior of one of them has changed with the addition of these new strategies. You can run these tests with something like
I think in this case you just need to update the expected output that has changed now.
Hi @bing-j, if you rebase onto the dev branch now the failing test should pass.
Hello! I wrote some strategies that use armed bandit algorithms. Originally, I only wanted to implement the epsilon-greedy strategy, but I now plan on extending this effort and implementing all the algorithms mentioned in the multi-armed bandit chapter of Sutton's Reinforcement Learning: An Introduction (I added the reference to the bibliography). So the branch name is no longer very representative; this branch adds both Greedy and EpsilonGreedy.
Greedy:
Always chooses the action that has the highest average/expected "reward" (score), calculated from its own previous turns. The reward function is updated incrementally and optionally recency weighted, and initial expected rewards of each action default to zero if not modified through a parameter.
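The incremental update described above is the standard bandit form Q_{n+1} = Q_n + step * (R_n - Q_n), where step is 1/n for a running average or a constant for recency weighting. A minimal sketch of that update (the `update_reward` helper is hypothetical, not the PR's actual code):

```python
def update_reward(q, reward, n, recency_weight=None):
    """Incrementally update an expected-reward estimate q.

    Q_{n+1} = Q_n + step * (R_n - Q_n): with step = 1/n this is the
    running average; with a constant step in (0, 1] recent rewards
    dominate older ones.
    """
    step = recency_weight if recency_weight is not None else 1.0 / n
    return q + step * (reward - q)

# Running average of the rewards 3, 1, 2 (initial estimate 0):
q = 0.0
for n, r in enumerate([3.0, 1.0, 2.0], start=1):
    q = update_reward(q, r, n)
# q is now 2.0, the plain mean
```

The attraction of this form is that only the current estimate and a counter need to be stored, never the full reward history.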
EpsilonGreedy:
Behaves like Greedy with probability 1 - epsilon, and acts randomly with probability epsilon.
These strategies are described in detail in the textbook mentioned above as well.
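For illustration, the epsilon-greedy action selection can be sketched as follows (the `epsilon_greedy_action` helper and its signature are hypothetical, not the PR's implementation):

```python
import random

def epsilon_greedy_action(expected, epsilon, rng=random):
    """Pick the highest-expected-reward action with probability
    1 - epsilon (exploit); otherwise pick uniformly at random
    among all actions (explore)."""
    actions = list(expected)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=expected.get)    # exploit

# With epsilon=0 the choice is always greedy:
# epsilon_greedy_action({"C": 2.5, "D": 1.0}, epsilon=0.0) returns "C"
```

Setting epsilon to 0 recovers plain Greedy, which is why implementing EpsilonGreedy as the parent class (or here, as a subclass of Greedy) keeps the two strategies consistent.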
As I've mentioned on Gitter, I was unable to find any existing strategies that implement these algorithms, although I did find some similar ones. For example, Adaptive() works similarly to Greedy() without weights, but it has a hard-coded initial sequence and uses the raw sum of scores, rather than the average score, to choose the optimal play. (Side note: the comments in Adaptive().strategy() indicate that it was intended to use the highest average; this may be a bug!) If similar strategies already exist, and/or there are any modifications I need to make to the code, please let me know!
Cheers :)