developing agents for team dominoes #1218

Brunozml · 2024-04-28T17:18:26Z

Hi,

for the last couple of days I have been working on scripts for training and evaluating RL agents for multiplayer dominoes, but I have faced two main issues:

Scale: Any training seems to be beyond the capacities of my local machine, and I've been wondering how I could try to scale up so that I have more compute. For now, I have developed a rough simplification of the game with 2 tiles per player instead of 7 so that I can run experiments locally. Still, I would eventually like to try things on the full-scale game. How do you suggest I approach that?
experimentation: I'm finding it hard to map example scripts to my project. For example, I've been trying to adapt the tic_tac_toe_qlearner.py example to this four player game; however, I've been unable to resolve and issue regarding legal_actions() being called by the random agent's step function when its empty—no available actions—and should therefore not be called. Being the only python game with more than two players, it has been to understand if the bug is in my own implementation of the game, the observer, or in some other script.

Any resources and/or advice help!

The text was updated successfully, but these errors were encountered:

lanctot · 2024-05-01T21:06:42Z

Hi @Brunozml ,

Apologies for the lag in communication. I am currently travelling and will be away at a conference until mid May.

The tic-tac-toe QLearner example is probably using a tabular method. You might need an analogous one that uses neural networks. Take a look at python/examples/breakthrough_dqn.py which is similar but uses DQN.

For the second point, the game should not have any states where a player has no actions. You can do two things in this case: add a special pass action which is the only legal action fior the player when they have no actions, or change the apply_action function so that it sets the current player id to the next one with actions. (There could be another game specific that is more appropriate of course, I am just not familiar with the rules)

Brunozml · 2024-05-13T07:17:47Z

Thank you @lanctot!

I have made some progress since then, focusing firstly on the 2-person version (i.e. python_block_dominoes) and hoping to eventually train on the larger four person game.

Here are some problem's dilemmas I'm facing, and I would love it if you could offer some advice whenever you have the time.

training and using DeepCFR as an agent. My ultimate goal is apply the deep CFR algorithm to both versions of dominoes (separately) and be able to benchmark its performance against other agents, and also play with it. So far I'm facing two main issues:

(a) training locally using the parameters used in the original papers takes ~20min/iteration, and I don't even know if these are even applicable to this larger game! (e.g. memory buffer probably shouldn't be as small as that used for Kuhn/leduc poker). However, results from a veeeeery preliminary experiment of 30 training iterations showed that the algorithm was making progress against random agents. How do I scale this, and how do I got about selecting appropriate hyper-parameters here?

(b) from policy to agent what interface should I use for the DeepCFR solver? I'm struggling to understand the difference between a pyspiel.Bot and a rl_agent, and how they interact with a game.

Pytorch vs tensorflow implementations. I'm struggling to settle on which to use, which ultimately comes down to (a) how easy is it to save/reload a trained agent and which one is faster for training? also I've noticed some errors/bugs in some of the implementations, such as the recent bug fix I proposed and the following:
pytorch implementation

I0513 08:45:05.520646 8586558144 deep_cfr_pytorch.py:66] Computed player 0 value: 0.14 (expected: -0.06).
I0513 08:45:05.520677 8586558144 deep_cfr_pytorch.py:68] Computed player 1 value: -0.14 (expected: 0.06).

vs tf1 implementation

Computed player 0 value: -0.05390531876957483
Expected player 0 value: -0.05555555555555555
Computed player 1 value: 0.05390531876957483
Expected player 1 value: 0.05555555555555555

(same for JAX, and tf2 version doesn't seem to be working)

evaluation: A number of the example scripts have eval_against_random_bots, as in the (breakthrough dqn example)[https://github.com/google-deepmind/open_spiel/blob/1208f832568063c36d0e3076069e103d5b00cf5b/open_spiel/python/examples/breakthrough_dqn.py#L50]. This function plays a number of games, but doesn't seem to take the 'closing step that is necessary during training runs. why is a last step not necessary here, like in training?

      # Episode is over, step all agents with final info state.
      for agent in agents:
        agent.step(time_step)

I ask this because I believe the reason my 4-agent training scripts haven't succeeded is due to my attempt at adapting this evaluation methodology to four players... and I'm also generally curious about some of the design choices I've observed in open_spiel =)

Finally (and sorry for condensing so many comments into one reply), giogix2's pygame extension of open_spiel inspired me to extend it to dominoes, and it was easy enough to set up the interface (see below) link kere I think this could grow to be a great way to attract and engage more contributors to open_spiel! Due to the issues mentioned above, I have so far failed to support any bot/rl_agent in this new graphical interface, but it will be really exciting once it works!

.

lanctot · 2024-05-17T12:00:51Z

Wow, cool to see the nice graphics @Brunozml @giogix2 ! I will change the docs to advertise pyspiel_game more prominently, it looks great!

Ok, these are non-trivial questions.

1a. Deep CFR is based on external-sampling MCCFR. For balanced trees each iteration will take roughly square root the size of the game. This just won't scale without some form of abstraction to keep the abstract game small enough. A more scalable approach would be, for example, DREAM (Steinberger et al.) or ARMAC (Grusyls et al) -- both are based on single-trajectory sampling... but we don't have implementations of either in OpenSpiel. Contributions of them would be very welcome but they're non-trivial.

1b. A pyspiel.Bot is something you can call step(state) on to get an action. A policy is a dictionary of information state strings to mixed strategies (state policies). You can get a bot from a policy using https://github.com/google-deepmind/open_spiel/blob/master/open_spiel/python/bots/policy.py

This is largely personal bias. Pick the framework you enjoy working with best :) We're likely removing the TF2 version eventually as there are known problems with Keras 3 unlikely to be fixed.

If you detect any other bugs we should be aware of, please let us know. I am not sure how much use these Deep CFR implementations are getting and they were all external contributions. There may still be bugs of the kind you found (some of these could be introduced when upgrading versions of things too, as sometimes the APIs and behaviors change over time). However, we did ask the authors to reproduce the original paper's curves when they were contributed originally (~2 years ago). It may be worthwhile checking that's still true on Kuhn and Leduc poker.

Also, keep in mind the metrics could look very different when using different hyper-parameters.

For easy saving/loading I would avoid TF1 and use Pytorch or JAX. Is there a specific problem with the JAX implementation? It has still been passing tests in every recent sync to github.

Evaluation is quite tricky in imperfect information games. Exploitability (or some proxy thereof if the game is too large) is a good metric, and complementing it with performance against some fixed pool of opponents is useful.

If you haven't yet, I strongly recommend taking a look at the OpenSpiel tutorial. https://www.youtube.com/watch?v=8NCPqtPwlFQ . It explains the basic API components. I also explain why that last step is necessary in training but not evaluation. Here's an ASCII diagram. Example, suppose you have two DQN agents in self-play at the end of a game of tic-tac-toe:

s_0: player 0 takes cell 4 (center)
s_1: player 1 takes cell 6
s_2: player 0 takes cell 0
s_3: player 1 takes cell 5
s_4: player 0 takes cell 1
s_5: player 1 takes cell 8
s_6: player 0 takes cell 2
s_7: terminal state
Game ends: rewards (+1, -1)

Player 0's (s, a, r, s') transitions in this game are: (s_0, 4, 0, s_2), (s_2, 0, 0, s_4), (s_4, 1, 0, s_6), (s6, 2, +1, s7)
Player 1's are: (s_1, 6, 0, s_3), (s_3, 5, 0, s_5), (s_5, 8, -1, s_7).

If there was no agent step on the final state, then Player 1 would never receive the last transition (s_5, 8, -1, s_7) which DQN needs to put into its replay buffer be able to learn. Similarly for Player 0's final transition (s6, 2, +1, s7).

However, during evaluation you don't need to worry about the agents learning, you just need the (+1, -1) reward at the end of the episode.

Hope this helps!

lanctot · 2024-05-17T12:17:50Z

As a follow-up to 1, instead of DREAM or ARMAC I suggest trying R-NaD (in this repos), which is the Friction FoReL algorithm behind the Mastering Stratego paper... or MMD by @ssokota @ryan-dorazio et al. We do not have a copy of the RL form of the MMD in our repos but you can find it on Sam's github (https://github.com/ssokota/mmd)

Brunozml · 2024-05-18T13:26:03Z

Hi @lanctot,

Thanks for the reply!! I took another look at the tutorial, and it all makes more sense now that I have a better grasp of OpenSpiel as a whole!

In terms of your follow-up reply, I have a few more doubts:

for MMD, I've skimmed through all three repos as well as their paper; however, I can't really understand how this method scales to larger games... it looks to me like they all rely on tabular policies, but block_dominoes has a total of 154 player actions and is a much larger game that what is presented in their paper (from what I understand). This is the same problem I've faced with computing exploitability, which is why I'm currently looking at approximate exploitability instead.
R-NaD seems promising, thanks! However I'm a bit confused since its implementation is rather different than other algorithms, plus the 3-step process proposed by the paper doesn't seem that easy to adapt. From what I see, there is no example script that uses it right?

Finally, out of pragmatic curiosity, how do you suggest I set-up a systematic approach to training and testing different algorithms for block_dominoes? I see, for example, that @ssokota and @ryan-dorazio decided to create their own Learner class, as opposed to working with the rl_agent or pyspiel.Bot implementations.

Again, I thank you for your dedication and patience. Much much appreciated

ssokota · 2024-05-18T13:57:31Z

Hi @Brunozml

The deep RL form of MMD is basically just PPO with appropriate hyperparameters, so you could use any reasonable PPO implementation.

giogix2 · 2024-05-18T21:43:23Z

Sorry to jump in in the middle of an on-going discussion, but I wanted to say thank you for looking at pygame_spiel. I'm happy to hear it served as an inspiration.
I haven't worked on it recently, but if it's useful for others I wouldn't mind keep expanding on it.
I planned to add more games from open_spiel, and making it easier for people to add more algorithms, but I ended up de-prioritizing it (temporarily...).

Creating a UI for Dominoes is not trivial, so congrats on what you've done @Brunozml, it looks awesome. My initial goal was to implement Hive for both open_spiel (likely to be hosted in my fork, because of licenses related issues), and later in pygame_spiel. Not easy at all.

lanctot · 2024-05-22T01:26:43Z

@giogix2 Cool, yes I'd love to hear about more developments in pygame_spiel and would be happy to advertise its use on our development page (in fact I will do that in the next few weeks).

Funny that you should mention Hive in particular. Someone is currently working on an implementation and we are discussing a potential license / agreement with the publishers to have it in the main repos. See #1194 for details and feel free to contact me or @rhstephens via email if you want updates.

Brunozml · 2024-05-22T07:12:53Z

That would be amazing! @giogix2 I've just invited to as a collaborator on my pygame_spiel fork. I haven't made a PR yet since I couldn't figure out some important details without breaking your code (e.g. how to switch between starting as player 0 and 1, which is quite important in dominoes). If you're open to it, I'd be happy to have a chat and discuss how we could collaborate on the repo!

Brunozml · 2024-05-22T07:18:53Z

@ssokota I've been playing around with MMD. I might have misunderstood something about the OpenSpiel implementation, but its performance doesn't seem to reach that of the paper (see here)

Its worth noting that MMD with Annealing temperature settings as specified in https://github.com/ssokota/mmd does show a much steeper decline on exploitability.

@lanctot In terms of RNa-D, has anyone already used the OpenSpiel implementation to do something similar (albeit at a smaller case ofc) to the Mastering Strategy paper? that would be great.

lanctot · 2024-05-22T09:36:12Z

@lanctot In terms of RNa-D, has anyone already used the OpenSpiel implementation to do something similar (albeit at a smaller case ofc) to the Mastering Strategy paper? that would be great.

Well I'm not sure what you mean by something similar to the Mastering Stratego paper. If you mean super-human performance, there are some people working on it n Liar's poker and they've managed to recover good strategies at small scale (@ciamac). And there's this paper that used it as a baseline: https://arxiv.org/pdf/2312.15220v2. Note: several people have worked with it / are working with it have reported issues, some of which we have fixed but some remain open, sp take a look at the github issues before using it as there are some known open questions.

There's also NFSP, btw, which is always a good place to start. But I'd say it's worth still trying MMD with annealing temperatures and wait for Sam's response. He might have some suggested learning rates and decay schedules.

There's very recent work using search and transformers (https://arxiv.org/abs/2404.13150). I'd love to see an implementation of something like this in OpenSpiel, but this is a fair bit of effort.

ssokota · 2024-05-22T15:37:28Z

@Brunozml What results are you trying to reproduce? The implementation in OpenSpiel is for sequence form, whereas the other implementation you linked is for behavioral form. So, it's expected that they give different results.

giogix2 · 2024-05-22T20:21:25Z

Great to hear about the ongoing work with the Hive implementation @lanctot. Looking forward to test it out.

@Brunozml I'd be happy to have a chat about pygame_spiel. I have some idea about the phylosophy of the project, on how to make it more expandable. Making it easier to switch between player 0 and 1 is quite fundamental I agree.

lanctot mentioned this issue May 22, 2024

Spades Implementation #1214

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

developing agents for team dominoes #1218

developing agents for team dominoes #1218

Brunozml commented Apr 28, 2024

lanctot commented May 1, 2024

Brunozml commented May 13, 2024 •

edited

Loading

lanctot commented May 17, 2024 •

edited

Loading

lanctot commented May 17, 2024 •

edited

Loading

Brunozml commented May 18, 2024

ssokota commented May 18, 2024

giogix2 commented May 18, 2024

lanctot commented May 22, 2024

Brunozml commented May 22, 2024

Brunozml commented May 22, 2024

lanctot commented May 22, 2024 •

edited

Loading

ssokota commented May 22, 2024

giogix2 commented May 22, 2024

developing agents for team dominoes #1218

developing agents for team dominoes #1218

Comments

Brunozml commented Apr 28, 2024

lanctot commented May 1, 2024

Brunozml commented May 13, 2024 • edited Loading

lanctot commented May 17, 2024 • edited Loading

lanctot commented May 17, 2024 • edited Loading

Brunozml commented May 18, 2024

ssokota commented May 18, 2024

giogix2 commented May 18, 2024

lanctot commented May 22, 2024

Brunozml commented May 22, 2024

Brunozml commented May 22, 2024

lanctot commented May 22, 2024 • edited Loading

ssokota commented May 22, 2024

giogix2 commented May 22, 2024

Brunozml commented May 13, 2024 •

edited

Loading

lanctot commented May 17, 2024 •

edited

Loading

lanctot commented May 17, 2024 •

edited

Loading

lanctot commented May 22, 2024 •

edited

Loading