Alpha4 plays Connect4 in the style of AlphaGo. Specifically, it consists of policy and value neural networks embedded in Monte-Carlo Tree Search. The policy network is trained by reinforcement learning against some simple opponents and old versions of itself. Once the policy network is trained it is used to generate a large collection of positions which are used to train the value network.
Alpha4 is pretty strong. Connect4 is a solved game and it is known that the first player always wins. When Alpha4 plays first, it occasionally plays the perfect game to bring home a beautiful victory.
- Python - tested with v3.5
- TensorFlow - tested with v1.1.0
- Tkinter - normally included with Python
- Play the pretrained networks
python gui.py
- Show winning moves while playing
python gui.py --threats
- If you can't beat it then you could always cheat!
Train a policy network with python policy_training.py
The policy network is trained against increasingly strong sets of opponents. To start with it plays against some simple hand-coded algorithms (a sketch of one such opponent follows the list):
- RandomPlayer - plays randomly with a bias towards central disks
- RandomThreatPlayer - makes most moves like RandomPlayer but always plays winning moves or blocks the opponent's winning moves
- MaxThreatPlayer - plays like RandomThreatPlayer except it also tries to create winning moves
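For a feel of what these opponents do, here is a minimal, self-contained sketch in the spirit of RandomThreatPlayer. The board representation (a list of seven columns, filled bottom-up), the helper functions and the centre-bias weights are all assumptions for illustration, not the repository's actual classes.

```python
import random

ROWS, COLS = 6, 7
CENTRE_BIAS = [1, 2, 3, 4, 3, 2, 1]  # assumed weighting towards central columns


def legal_columns(board):
    # board[c] is a list of discs in column c, bottom-up, e.g. ['X', 'O']
    return [c for c in range(COLS) if len(board[c]) < ROWS]


def is_winning_drop(board, col, player):
    # Would dropping `player` into `col` complete four in a row?
    grid = {(c, r): board[c][r] for c in range(COLS) for r in range(len(board[c]))}
    c, r = col, len(board[col])
    grid[(c, r)] = player
    for dc, dr in ((1, 0), (0, 1), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):
            cc, rr = c + sign * dc, r + sign * dr
            while grid.get((cc, rr)) == player:
                count += 1
                cc, rr = cc + sign * dc, rr + sign * dr
        if count >= 4:
            return True
    return False


def random_threat_move(board, player, opponent):
    """Play a winning move if one exists, otherwise block the opponent,
    otherwise move randomly with a bias towards the centre."""
    legal = legal_columns(board)
    wins = [c for c in legal if is_winning_drop(board, c, player)]
    if wins:
        return random.choice(wins)
    blocks = [c for c in legal if is_winning_drop(board, c, opponent)]
    if blocks:
        return random.choice(blocks)
    weights = [CENTRE_BIAS[c] for c in legal]
    pick = random.uniform(0, sum(weights))
    for c, w in zip(legal, weights):
        pick -= w
        if pick <= 0:
            return c
    return legal[-1]
```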
Once the policy network can consistently beat them, a clone of the current policy is created and added as an additional opponent. More and more clones are added as the network becomes stronger. Check out TensorBoard to see the win rates against the various opponents created - tensorboard --logdir runs
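The cloning step might be gated on win rates roughly like this; the threshold and the make_clone callback are illustrative assumptions, not the repository's actual logic.

```python
def maybe_add_clone(win_rates, opponents, make_clone, threshold=0.65):
    """Freeze a copy of the current policy and add it to the opponent pool
    once the learner beats every existing opponent often enough.
    `win_rates` maps opponent -> recent win rate; 0.65 is an assumed threshold."""
    if win_rates and all(rate >= threshold for rate in win_rates.values()):
        opponents.append(make_clone())  # make_clone(): assumed snapshot of current weights
    return opponents
```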
REINFORCE is used to train the network, with the result (+1 for a win, 0 for a draw, -1 for a loss) used as the reward for each move played during a game.
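A rough sketch of the REINFORCE loss with the game result as the reward, written against TensorFlow 1.x since that is what the project targets. The input encoding, the tiny stand-in network and the learning rate are assumptions, not the repository's code.

```python
import tensorflow as tf

# Illustrative placeholders: board features, the column played at each step,
# and the game result (+1 win, 0 draw, -1 loss) repeated for every move of that game.
states = tf.placeholder(tf.float32, [None, 6, 7, 2])   # assumed board encoding
actions = tf.placeholder(tf.int32, [None])
rewards = tf.placeholder(tf.float32, [None])

# Minimal stand-in for the real policy network: one dense layer over the board.
flat = tf.reshape(states, [-1, 6 * 7 * 2])
logits = tf.layers.dense(flat, 7)

log_probs = tf.nn.log_softmax(logits)
chosen_log_prob = tf.reduce_sum(log_probs * tf.one_hot(actions, 7), axis=1)

# REINFORCE: maximise E[reward * log pi(a|s)], i.e. minimise the negative.
loss = -tf.reduce_mean(rewards * chosen_log_prob)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
```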
Many games of Connect4 can be very similar, so the games are initialised with some random moves to give the network a more diverse set of positions to train on. Entropy regularisation is also used to ensure the network doesn't collapse to a single fixed policy.
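Continuing the sketch above, entropy regularisation can be folded into the same loss as a bonus term; the coefficient is an assumption.

```python
# Entropy of the move distribution at each position. Subtracting it from the
# loss rewards policies that keep probability spread over several moves.
probs = tf.nn.softmax(logits)
entropy = -tf.reduce_sum(probs * log_probs, axis=1)

entropy_weight = 0.01  # assumed coefficient
regularised_loss = loss - entropy_weight * tf.reduce_mean(entropy)
train_op = tf.train.AdamOptimizer(1e-4).minimize(regularised_loss)
```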
Once the policy network is trained it can be used to generate a large quantity of positions
python generate_rollout_positions.py
These can then be used to supervise the training of a value network. The positions are split into training and validation sets. At the end of each epoch the validation score of the network is calculated, and the network is saved only if the validation score improves
python value_training.py
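The epoch loop described above might look roughly like this; the function signature, batch size and checkpoint path are illustrative, not the repository's API.

```python
import numpy as np

def train_value_network(sess, train_op, loss, states_ph, targets_ph, saver,
                        train_set, val_set, epochs=20, batch_size=256):
    """Train on minibatches and checkpoint only when the validation loss improves."""
    train_states, train_targets = train_set   # assumed numpy arrays
    val_states, val_targets = val_set
    best_val = float('inf')
    for epoch in range(epochs):
        order = np.random.permutation(len(train_states))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            sess.run(train_op, {states_ph: train_states[idx],
                                targets_ph: train_targets[idx]})
        val_loss = sess.run(loss, {states_ph: val_states,
                                   targets_ph: val_targets})
        if val_loss < best_val:
            best_val = val_loss
            saver.save(sess, 'checkpoints/value_network')  # illustrative path
```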
- Multithreaded
- MCTS threads add nodes to a prior queue, a rollout queue and a value queue for batch processing (a rough sketch of this pattern follows the list)
- Prior threads use a policy network to calculate the move priors for each position
- Rollout threads use another policy network to roll out positions until the end of the game
- Value threads use a value network to estimate the value of the position
- All updates from the above threads are optimistically applied lock-free
- Because Connect4 has a low branching factor, many combinations of moves lead to the same position; these are known as transpositions. Rollout results and value estimates are applied to all branches of the search tree that lead to the end node. This is called UCT3 and is fully described here
- Although the policy networks are trained with entropy regularisation, they don't have enough entropy in the priors or diversity in the rollouts. This is remedied by applying high temperatures to the final softmax layers in these networks (sketched after this list)
- The performance is limited by the throughput of the policy and value networks. By using threads we at least ensure that the networks are busy most of the time and are processing reasonably sized batches
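To make the queue-and-batch pattern above concrete, here is a rough sketch of a worker thread that drains one of the queues and evaluates positions in batches. The queue contents, the batch size, the node API and the network call are all assumptions.

```python
import queue

prior_queue = queue.Queue()  # MCTS threads put (node, encoded_position) items here
BATCH_SIZE = 16              # assumed batch size

def prior_worker(policy_network, stop_event):
    """Drain the prior queue, evaluate positions in one batched forward pass
    and write the move priors back onto the waiting search nodes."""
    while not stop_event.is_set():
        batch = [prior_queue.get()]                      # block for the first item
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(prior_queue.get_nowait())   # then take whatever is ready
            except queue.Empty:
                break
        nodes, positions = zip(*batch)
        priors = policy_network(list(positions))         # assumed batched forward pass
        for node, prior in zip(nodes, priors):
            node.set_priors(prior)                       # assumed node API
            prior_queue.task_done()
```

One such worker would run per queue (priors, rollouts, values), fed by many MCTS threads, so each network sees reasonably sized batches.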
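The high-temperature softmax mentioned above is simple enough to show directly; the temperature value is an assumption.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=2.0):
    """Flatten the move distribution by dividing the logits by a temperature
    greater than 1 before the softmax (2.0 is illustrative, not the repo's value)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()        # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()
```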